Stream: brlcad

Topic: shootrays_in_parallel


view this post on Zulip scorp08 (Apr 03 2020 at 13:31):

Hello all,
I was planning to complete the shootrays_in_parallel function to speed things up. I could not understand why magic numbers are necessary in shootrays_in_parallel, i.e. the loop "while (rays->ap[i].a_magic != RT_AP_MAGIC)". The struct application has already been initialized with RT_AP_MAGIC, so isn't the while condition already false?
It would be good to get some help.
Stay safe.

view this post on Zulip Erik (Apr 03 2020 at 15:12):

it's a BU_LIST iirc, which is circular and uses that to understand the "beginning"/"end" somehow
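
A minimal sketch of the general idea in the message above: a circular list whose head carries a distinct magic number, so a walk knows it has reached the end when it sees the head's magic again. This is not BU_LIST itself. The struct, the function names, and both magic values are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic values; BRL-CAD defines its own constants. */
#define NODE_MAGIC 0x6e6f6465u /* ASCII "node" */
#define HEAD_MAGIC 0x68656164u /* ASCII "head" */

struct node {
    uint32_t magic;     /* says what kind of record this is */
    struct node *next;  /* circular: the last item points back to the head */
    int value;
};

/* Make an empty circular list: the head points at itself. */
static void list_init(struct node *head)
{
    head->magic = HEAD_MAGIC;
    head->next = head;
    head->value = 0;
}

/* Insert an item right after the head. */
static void list_push(struct node *head, struct node *item, int value)
{
    item->magic = NODE_MAGIC;
    item->value = value;
    item->next = head->next;
    head->next = item;
}

/* Sum all items; returns -1 if any record fails its magic check. */
static long list_sum(const struct node *head)
{
    long sum = 0;
    const struct node *p;

    if (head->magic != HEAD_MAGIC)
        return -1;
    /* Walk until the head's magic shows up again: that marks the end. */
    for (p = head->next; p->magic != HEAD_MAGIC; p = p->next) {
        if (p->magic != NODE_MAGIC)
            return -1; /* uninitialized or corrupted record */
        sum += p->value;
    }
    return sum;
}
```

The same field does two jobs here: it marks the beginning/end of the ring, and it catches records that were never initialized.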

view this post on Zulip scorp08 (Apr 03 2020 at 15:22):

@Erik I still could not understand. What is iirc?

view this post on Zulip Erik (Apr 03 2020 at 16:30):

iirc == "if I recall correctly". I'm old. it's been a while. I could be wrong.

view this post on Zulip scorp08 (Apr 03 2020 at 16:41):

@Erik why is BU_BITSET necessary? To shift bits after the shoot?

view this post on Zulip scorp08 (Apr 08 2020 at 19:28):

Before the shoot, do I need to run rt_prep_parallel or rt_prep? I do not know the difference, actually...

view this post on Zulip scorp08 (Apr 12 2020 at 16:08):

Any help would be great :)

view this post on Zulip Erik (Apr 12 2020 at 19:17):

if I remember, just one or the other. they both do the same thing, one just does it in parallel (good for smp, not so much for single core). might wanna check the docs, though

view this post on Zulip Sean (Apr 18 2020 at 04:53):

@scorp08 it's not that BU_BITSET is necessary, but the code must define some way for multiple threads to coordinate, and there needs to be some bookkeeping.

BU_BITSET does both, conveniently.

view this post on Zulip Sean (Apr 18 2020 at 04:53):

it's atomic, or at least close enough to atomic for testing purposes, and it's a set of flags that can help keep track of what's been done.
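
A rough sketch of what "a set of flags that is atomic" can look like, written with C11 atomics rather than BU_BITSET. All names here (flagset, flagset_claim, and so on) are made up for the example, and it assumes unsigned int is at least 32 bits wide. Each bit stands for one ray; a thread that manages to set the bit owns that ray.

```c
#include <stdatomic.h>
#include <stddef.h>

#define WORDS 4            /* 4 words * 32 bits = 128 flags, an arbitrary size */
#define BITS_PER_WORD 32u  /* assumes unsigned int has at least 32 bits */

struct flagset {
    atomic_uint word[WORDS];
};

/* Reset every flag to "not done". Call before any thread uses the set. */
static void flagset_clear(struct flagset *fs)
{
    for (size_t i = 0; i < WORDS; i++)
        atomic_init(&fs->word[i], 0u);
}

/* Try to claim item i (i must be below WORDS * BITS_PER_WORD).
 * Returns 1 if this caller set the bit, 0 if someone already had. */
static int flagset_claim(struct flagset *fs, size_t i)
{
    unsigned mask = 1u << (i % BITS_PER_WORD);
    unsigned old = atomic_fetch_or(&fs->word[i / BITS_PER_WORD], mask);
    return (old & mask) == 0;
}

/* Count how many flags are set, i.e. how many items have been claimed. */
static size_t flagset_count(struct flagset *fs)
{
    size_t n = 0;
    for (size_t w = 0; w < WORDS; w++) {
        unsigned v = atomic_load(&fs->word[w]);
        for (; v != 0; v &= v - 1) /* clear the lowest set bit each pass */
            n++;
    }
    return n;
}
```

Because atomic_fetch_or returns the previous value, exactly one thread sees the bit as clear, which gives both the coordination and the bookkeeping in one operation.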

view this post on Zulip scorp08 (Jun 14 2020 at 16:04):

Hello,
I finally got time yesterday. I tried a couple of things and shot rays without error, but I'm not sure whether it runs in parallel. Could you please check and guide me? Here is the file: rt_shootrays_parallel.c

view this post on Zulip Sean (Jun 15 2020 at 15:22):

Hi @scorp08, at a glance this looks like it could work in parallel. I'm just not sure the bookkeeping is right without some in-depth testing. Your loop is centered around the nbits array becoming zero, so it may depend on how the caller initialized the threads and ray_num. It's definitely on the right track for a semaphore-protected approach. You do want to call rt_shootray outside of semaphore protection, otherwise the threads will all end up waiting on each other.
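
A minimal sketch of that last point, using POSIX threads and a mutex in place of BRL-CAD's semaphores. It is not the posted rt_shootrays_parallel.c: shoot_one is a stand-in for rt_shootray, and struct job, worker, and run_parallel are hypothetical names. Only the "which ray is next" bookkeeping sits inside the lock; the expensive call happens after the lock is released.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

struct job {
    pthread_mutex_t lock; /* protects `next` only */
    size_t next;          /* index of the next unclaimed ray */
    size_t total;         /* number of rays */
    double *result;       /* one slot per ray, written by whoever claims it */
};

/* Stand-in for the expensive per-ray call (rt_shootray in the discussion). */
static double shoot_one(size_t i)
{
    return (double)i * 2.0;
}

static void *worker(void *arg)
{
    struct job *job = arg;
    for (;;) {
        size_t i;
        /* Critical section: only the bookkeeping is protected. */
        pthread_mutex_lock(&job->lock);
        i = job->next;
        if (i < job->total)
            job->next++;
        pthread_mutex_unlock(&job->lock);
        if (i >= job->total)
            break;
        /* The expensive work runs outside the lock, so threads do not
         * serialize on each other while shooting. */
        job->result[i] = shoot_one(i);
    }
    return NULL;
}

/* Returns 0 on success, -1 on bad arguments or if a thread failed to start. */
static int run_parallel(struct job *job, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    int started = 0, rc = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    if (pthread_mutex_init(&job->lock, NULL) != 0)
        return -1;
    job->next = 0;
    for (; started < nthreads; started++) {
        if (pthread_create(&tid[started], NULL, worker, job) != 0) {
            rc = -1;
            break;
        }
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    pthread_mutex_destroy(&job->lock);
    return rc;
}
```

If shoot_one were moved between the lock and unlock calls, the program would still produce the same results, but only one thread could shoot at a time.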

view this post on Zulip scorp08 (Jun 18 2020 at 07:01):

I tested, but it seems my loop is not the proper way to do it. I could not figure it out, and I also tried different approaches.

view this post on Zulip scorp08 (Jun 20 2020 at 05:28):

It would be great to get help :)

view this post on Zulip Sean (Jun 25 2020 at 12:52):

@scorp08 Can you post the latest version of your code and any other changes you've made?

view this post on Zulip scorp08 (Aug 26 2020 at 10:39):

Sean said:

scorp08 Can you post the latest version of your code and any other changes you've made?

@Sean Hello, and sorry for my late reply (holiday and virus issues). I did not make many changes. I just added an nray variable to the struct and a while loop inside the function. I am not sure how to proceed :)

view this post on Zulip Sean (Aug 26 2020 at 15:40):

@scorp08 can you share your code here?

view this post on Zulip scorp08 (Oct 04 2020 at 12:45):

Sean said:

scorp08 can you share your code here?

@Sean Hello, I lost the drive, but I had not changed much compared to the posted code. Unfortunately I could not spend more time on it. :/

view this post on Zulip scorp08 (Oct 20 2020 at 10:40):

Any help is great for me :)))

view this post on Zulip Sean (Oct 21 2020 at 16:45):

@scorp08 Did you take a look at how the rt* tools shoot rays in parallel? take a look at src/rt/worker.c

view this post on Zulip Sean (Oct 21 2020 at 16:45):

understanding how they work is probably a good first step towards making a parallel function as they shoot in parallel

view this post on Zulip Sean (Oct 21 2020 at 16:45):

one possible approach is to simply do what they're doing in a function.

view this post on Zulip scorp08 (Oct 23 2020 at 10:39):

Sean said:

scorp08 Did you take a look at how the rt* tools shoot rays in parallel? take a look at src/rt/worker.c

Hello, thanks, I made some changes and it worked! I ran with 10 million rays (all hits) and got around an 8x speed-up on my old 12-core laptop (measured with chrono, with all log functions disabled in my bundle_hit function). However, at the end rt_clean(rtip) returned with an error, probably an issue with freeing resources. I could not figure it out. Can I upload my bundle.c file here?

view this post on Zulip Sean (Oct 23 2020 at 17:24):

@scorp08 that's amazing, congratulations! absolutely, can share files here any time without asking.

view this post on Zulip scorp08 (Oct 23 2020 at 18:16):

Sean said:

scorp08 that's amazing, congratulations! absolutely, can share files here any time without asking.

ok, here... bundle.c

view this post on Zulip scorp08 (Oct 25 2020 at 07:33):

Is there a good way to shoot 10^7 or more rays without hitting memory limits? I tried heap allocation, but it is still near the limits...

view this post on Zulip Sean (Oct 25 2020 at 14:42):

@scorp08 the typical way for shooting massive quantities of rays is to do them in batches. typically done with something like a streaming queue where rays are added as needed, and rays are bundled together into smaller sets (like 64x64) and evaluated until there are no more.
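
A sketch of the batching idea, assuming rays can be generated on demand instead of being stored up front. make_ray and shoot_batch are hypothetical stand-ins (the real code would fill in actual ray data and call into librt), and the batch size of 64x64 follows the number mentioned above. Memory use is bounded by one batch buffer regardless of how many rays are shot in total.

```c
#include <stddef.h>

#define BATCH_SIZE (64 * 64) /* rays per batch, as in the 64x64 suggestion */

struct ray {
    double origin[3];
    double dir[3];
};

/* Hypothetical generator: fills in ray number i on demand. */
static void make_ray(size_t i, struct ray *r)
{
    r->origin[0] = (double)i;
    r->origin[1] = 0.0;
    r->origin[2] = 0.0;
    r->dir[0] = 0.0;
    r->dir[1] = 0.0;
    r->dir[2] = -1.0;
}

/* Stand-in for shooting one batch; returns the number of rays handled. */
static size_t shoot_batch(const struct ray *rays, size_t n)
{
    (void)rays;
    return n;
}

/* Shoot `total` rays through one fixed-size buffer.
 * Returns the number of batches; *shot receives the number of rays shot.
 * The static buffer makes this function non-reentrant. */
static size_t shoot_in_batches(size_t total, size_t *shot)
{
    static struct ray buf[BATCH_SIZE];
    size_t batches = 0;

    *shot = 0;
    for (size_t start = 0; start < total; start += BATCH_SIZE) {
        size_t left = total - start;
        size_t n = left < BATCH_SIZE ? left : BATCH_SIZE;
        for (size_t k = 0; k < n; k++)
            make_ray(start + k, &buf[k]);
        *shot += shoot_batch(buf, n);
        batches++;
    }
    return batches;
}
```

With 10^7 rays this loop runs 2442 batches, and the buffer never holds more than 4096 rays at once.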

view this post on Zulip Sean (Oct 25 2020 at 14:44):

I looked over your bundle.c code .. impressive work from different directions!

so I see you have one version that is basically the entire guts to rt_shootrays and you have this shootrays_in_parallel version, right? which one are you working on ?

view this post on Zulip Sean (Oct 25 2020 at 14:45):

I ask only because it looks like shootrays_in_parallel is disabled.

view this post on Zulip Sean (Oct 25 2020 at 14:45):

and I don't see the other being used

view this post on Zulip scorp08 (Oct 25 2020 at 20:01):

Sean said:

the typical way for shooting massive quantities of rays is to do them in batches. typically done with something like a streaming queue where rays are added as needed, and rays are bundled together into smaller sets (like 64x64) and evaluated until there are no more.

Sean said:

I looked over your bundle.c code .. impressive work from different directions!

so I see you have one version that is basically the entire guts to rt_shootrays and you have this shootrays_in_parallel version, right? which one are you working on ?

@Sean Thanks. It should be active; I worked in shootrays_in_parallel (I put in a #define SHOOTRAYS_IN_PARALLEL). I could not imagine how to use a queue. Could you direct me to a tutorial or any other resource so I can understand this kind of streaming process?

view this post on Zulip scorp08 (Oct 27 2020 at 10:19):

Sean said:

scorp08 the typical way for shooting massive quantities of rays is to do them in batches. typically done with something like a streaming queue where rays are added as needed, and rays are bundled together into smaller sets (like 64x64) and evaluated until there are no more.

@Sean If the rays are divided into batches, could continuously respawning threads make the run time similar to serial (non-parallel) shooting?
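
One common way to sidestep the respawning cost is to start the threads once and let each of them keep claiming the next batch until none are left. The sketch below shows only the claiming step, using a C11 atomic counter. It is an illustration, not code from bundle.c or librt, and batch_source and claim_batch are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

struct batch_source {
    atomic_size_t next; /* index of the first unclaimed ray */
    size_t total;       /* total number of rays */
    size_t batch;       /* rays per batch, must be greater than zero */
};

static void batch_source_init(struct batch_source *s, size_t total, size_t batch)
{
    atomic_init(&s->next, 0);
    s->total = total;
    s->batch = batch;
}

/* Claim the next batch. On success returns 1 and sets the half-open range
 * [*start, *start + *count). Returns 0 once all rays have been claimed. */
static int claim_batch(struct batch_source *s, size_t *start, size_t *count)
{
    size_t begin = atomic_fetch_add(&s->next, s->batch);
    size_t left;

    if (begin >= s->total)
        return 0;
    left = s->total - begin;
    *start = begin;
    *count = left < s->batch ? left : s->batch;
    return 1;
}
```

Each long-lived worker would loop on claim_batch and shoot the rays in the returned range, so thread startup is paid once per run rather than once per batch.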

view this post on Zulip scorp08 (Nov 02 2020 at 07:47):

Sean said:

scorp08 that's amazing, congratulations! absolutely, can share files here any time without asking.

@Sean
I made some changes to both rt_shootrays and shootrays_in_parallel and measured the speedup with time.h. It is about 2x, and I do not know why it is so low :joy: :joy: . I compared with rtcheck, which is quite fast with the same number of rays. Is it because I build DLL files and link my exe against them (librt and deps), so everything is checked at run time? Here is the file: bundle.c

view this post on Zulip scorp08 (Nov 26 2020 at 05:36):

did anybody have a chance to check the file? :))

view this post on Zulip scorp08 (Dec 08 2020 at 03:15):

guess no :joy: :joy:

view this post on Zulip Sean (Dec 08 2020 at 03:16):

Sorry @scorp08 I haven't had a chance to set up the profiling to see why

view this post on Zulip Sean (Dec 08 2020 at 03:16):

It's not due to dll linkage

view this post on Zulip Sean (Dec 08 2020 at 03:16):

I can say that :)

view this post on Zulip scorp08 (Dec 08 2020 at 03:17):

Sean said:

Sorry scorp08 I haven't had a chance to set up the profiling to see why

not a technical issue I guess

view this post on Zulip Sean (Dec 08 2020 at 03:18):

A low rate means you have a bottleneck. Threads are waiting on something they're all trying to access simultaneously.
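
One way to put a number on such a bottleneck is Amdahl's law, S = 1 / (s + (1 - s) / n), where s is the fraction of the run that does not scale with the thread count n. The sketch below inverts it to estimate s from a measured speedup. It assumes the simple Amdahl model and that all 12 cores mentioned earlier were actually in use; the function names are made up for the example. Under those assumptions, the earlier 8x result corresponds to roughly 5% non-scaling time and the 2x result to roughly 45%.

```c
/* Amdahl's law: speedup for a given serial fraction s on n threads.
 * S = 1 / (s + (1 - s) / n) */
static double amdahl_speedup(double serial_fraction, int nthreads)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nthreads);
}

/* Inverse of the formula above: s = (n / S - 1) / (n - 1).
 * Valid for nthreads > 1 and speedup > 0. */
static double implied_serial_fraction(double speedup, int nthreads)
{
    return ((double)nthreads / speedup - 1.0) / ((double)nthreads - 1.0);
}
```

This does not say where the time goes; it only indicates how much of the run behaves as if it were serialized, which is what a profiler then has to locate.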

view this post on Zulip scorp08 (Dec 08 2020 at 03:20):

Sean said:

A low rate means you have a bottleneck. Threads are waiting on something they're all trying to access simultaneously.

I actually gained around a 3x speedup, measured before and after bu_parallel(), but freeing struct re and the others after bu_parallel takes too long. I guess that's one of the problems.

view this post on Zulip Sean (Dec 08 2020 at 03:21):

well, it's technical in the sense that it requires setting up profiling to see what's going on -- that is something fun you can set up if you're interested in learning more about performance. The Intel profiler is free and really quite good.

view this post on Zulip scorp08 (Dec 08 2020 at 03:27):

Sean said:

well, it's technical in the sense that it requires setting up profiling to see what's going on -- that is something fun you can set up if you're interested in learning more about performance. The Intel profiler is free and really quite good.

Oh, okay, I'll look at VTune. I don't know whether it requires prior knowledge of the architecture or other technical background. It seems very handy to use.

view this post on Zulip Sean (Dec 08 2020 at 04:15):

It doesn't require knowledge of arch

view this post on Zulip Sean (Dec 08 2020 at 04:16):

you basically compile and then run from inside vtune and it watches the program. then it shows you where in the code it spent the most time.

view this post on Zulip scorp08 (Dec 19 2020 at 13:29):

Sean said:

you basically compile and then run from inside vtune and it watches the program. then it shows you where in the code it spent the most time.

@Sean I did a check with VTune, and bu_parallel seems OK (all threads are active) and it is faster than serial shooting, but RtlFreeHeap on Windows is where the most time is spent, so the total execution time is longer than with serial shooting. I must have missed something while freeing all the memory in parallel. How should I free the resources, partition lists, etc. after bu_parallel?
By the way, VTune is very handy to use; it shows everything you need.

view this post on Zulip scorp08 (Dec 30 2020 at 09:10):

Any cleanup help ?:))

view this post on Zulip scorp08 (Jan 30 2021 at 13:34):

I feel my cleanup help in the box :joy: :joy:

view this post on Zulip Sean (Feb 03 2021 at 03:40):

@scorp08 Sorry, I was on vacation when you first wrote, and since then I've been focused on our GitHub migration. If VTune is showing most time in RtlFreeHeap, that sounds like a problem...

view this post on Zulip scorp08 (Feb 13 2021 at 03:46):

Sean said:

scorp08 Sorry, I was on vacation when you first wrote, and since then I've been focused on our GitHub migration. If VTune is showing most time in RtlFreeHeap, that sounds like a problem...

Is freeing the resource and partition_list the problem? (I guess so)

view this post on Zulip Sean (Feb 18 2021 at 07:10):

That would be a good guess. It should only be freeing the heap when it's all done. Perhaps you have it re-initializing and releasing/cleaning the rtip every ray or something.

view this post on Zulip Sean (Feb 18 2021 at 07:11):

try not releasing any memory, does performance improve dramatically? You don't want that for anything production, but it can help figure out where time is going in parallel.
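
To make the cost difference concrete, here is a small sketch that contrasts allocating and freeing scratch memory for every ray with allocating it once per worker and releasing it when all rays are done. It is a generic illustration, not the cleanup code from bundle.c: the function names and the counter are hypothetical, and the hits buffer merely stands in for per-ray bookkeeping such as segment or partition lists.

```c
#include <stddef.h>
#include <stdlib.h>

/* Instrumentation only: counts allocations. Not thread safe. */
static size_t alloc_calls = 0;

static void *counted_alloc(size_t n)
{
    alloc_calls++;
    return malloc(n);
}

/* Per-ray pattern: the heap is hit twice for every ray.
 * cap must be at least 1. Returns 0 on success, -1 on allocation failure. */
static int shoot_alloc_per_ray(size_t nrays, size_t cap)
{
    for (size_t i = 0; i < nrays; i++) {
        double *hits = counted_alloc(cap * sizeof *hits);
        if (!hits)
            return -1;
        hits[0] = (double)i; /* stand-in for recording hits */
        free(hits);
    }
    return 0;
}

/* Reuse pattern: one allocation per worker, released once at the end.
 * cap must be at least 1. Returns 0 on success, -1 on allocation failure. */
static int shoot_reuse_scratch(size_t nrays, size_t cap)
{
    double *hits = counted_alloc(cap * sizeof *hits);
    if (!hits)
        return -1;
    for (size_t i = 0; i < nrays; i++)
        hits[0] = (double)i; /* same stand-in, buffer is simply overwritten */
    free(hits);
    return 0;
}
```

With many threads the per-ray pattern is worse than it looks, because the allocator itself can become the shared thing every thread waits on.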

view this post on Zulip scorp08 (Mar 27 2021 at 12:28):

Sean said:

That would be a good guess. It should only be freeing the heap when it's all done. Perhaps you have it re-initializing and releasing/cleaning the rtip every ray or something.

After a long time :) If I call rt_clean_resource, do I also need to call RT_FREE_SEG_LIST, or does rt_clean_resource handle that too?

view this post on Zulip Sean (Mar 29 2021 at 04:04):

@scorp08 My point earlier was that you may be calling some expensive cleanup too frequently... like cleaning up resources. You shouldn't need to clean very often and RtlFreeHeap shouldn't be where most time is spent.

view this post on Zulip Sean (Mar 29 2021 at 04:10):

Still, to answer your question, rt_clean_resource is only going to reset the re_seg segments (by simply dropping the pointers), and free the rt_seg_blocks (pool of segments). If you're not dealing with those, then you will very likely need to call RT_FREE_SEG_LIST (e.g., on finished_segs).

view this post on Zulip scorp08 (Jul 03 2021 at 16:55):

Here is a small result from my experiment. I ran with rt and with parallel (12 CPUs) and 1 CPU. What I did: I generated the same nhit counts and adjusted nrays according to nhit, since I shoot from the same point in rt_shootrays while the origin changes in rt. So basically nray is lower in rt_shootrays than in rt, but in the end nhit is the same. results.dat


Last updated: Jan 09 2025 at 00:46 UTC