Hello All
I was planning to complete the shootrays_in_parallel function to speed things up. I could not understand why magic numbers are necessary in that function, as in "while (rays->ap[i].a_magic != RT_AP_MAGIC)". The struct application has already been initialized with RT_AP_MAGIC, so isn't the while condition already false?
It would be good to get some help
stay safe
it's a BU_LIST iirc, which is circular and uses that to understand the "beginning"/"end" somehow
@Erik still could not understand. what is iirc ?
iirc == "if I recall correctly". I'm old. it's been a while. I could be wrong.
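For context on the magic-number question: a magic number is commonly a fixed tag stored in a struct's first field, so code can check that a pointer refers to an initialized struct of the expected type. Below is a minimal sketch of that general idea. The names (demo_app, DEMO_AP_MAGIC, demo_count_valid) are hypothetical and this is not BRL-CAD's code; it only shows why a loop might test the magic field of each array entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag value and struct, for illustration only. */
#define DEMO_AP_MAGIC 0x4170706cu

struct demo_app {
    uint32_t magic;     /* first field: marks an initialized demo_app */
    double origin[3];
    double dir[3];
};

/* Zero the ray data and stamp the struct as initialized. */
static void demo_app_init(struct demo_app *ap)
{
    ap->magic = DEMO_AP_MAGIC;
    for (int i = 0; i < 3; i++) {
        ap->origin[i] = 0.0;
        ap->dir[i] = 0.0;
    }
}

/* Return 1 if ap looks like an initialized demo_app, 0 otherwise. */
static int demo_app_is_valid(const struct demo_app *ap)
{
    return ap != NULL && ap->magic == DEMO_AP_MAGIC;
}

/* Count leading entries that carry the magic tag, stopping at the
 * first entry without it (or at capacity). */
static size_t demo_count_valid(const struct demo_app *aps, size_t capacity)
{
    size_t n = 0;

    assert(aps != NULL);
    while (n < capacity && aps[n].magic == DEMO_AP_MAGIC)
        n++;
    return n;
}
```

With this layout, an array whose first two entries were initialized and whose remaining entries are zeroed reports two valid entries, so the tag doubles as a cheap sanity check and as an "is this slot filled" marker.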
@Erik why is BU_BITSET necessary? To shift bits after the shoot?
Before the shoot, do I need to run rt_prep_parallel or rt_prep? I do not know the difference, actually...
Any help would be great :)
if I remember, just one or the other. they both do the same thing, one just does it in parallel (good for smp, not so much for single core). might wanna check the docs, though
@scorp08 it's not that BU_BITSET is necessary, but the code must define some way for multiple threads to coordinate, and there needs to be some bookkeeping.
BU_BITSET does both, conveniently.
it's atomic, or at least close enough to atomic for testing purposes, and it's a set of flags that can help keep track of what's been done.
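To illustrate the "set of flags plus coordination" idea, here is a minimal sketch that uses C11 atomics as a stand-in. It is not BU_BITSET and makes no claim about how that macro is implemented; demo_claim and demo_all_done are hypothetical names, and a 32-bit unsigned int is assumed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Try to claim work item `bit` (0..31) in a shared flag word.
 * Returns 1 if this caller set the bit first, 0 if it was already set.
 * atomic_fetch_or makes the test-and-set a single indivisible step. */
static int demo_claim(atomic_uint *flags, unsigned bit)
{
    unsigned mask;
    unsigned before;

    assert(bit < 32);
    mask = 1u << bit;
    before = atomic_fetch_or(flags, mask);
    return (before & mask) == 0;
}

/* Return 1 when all of the lowest n bits (n <= 32) are set. */
static int demo_all_done(atomic_uint *flags, unsigned n)
{
    unsigned want = (n >= 32) ? ~0u : ((1u << n) - 1u);

    return (atomic_load(flags) & want) == want;
}
```

Each thread would call demo_claim for an item before working on it; only one caller gets 1 back for any given bit, and demo_all_done tells the caller when every item has been claimed.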
Hello
I finally got time yesterday. I tried a couple of things and shot rays without error, but I am not sure it is parallel. Could you please check and guide me? Here is the file rt_shootrays_parallel.c
Hi @scorp08 at a glance, this looks like it could work in parallel. I'm just not sure the bookkeeping is right without some in-depth testing. Your loop is centered around the nbits array becoming zero, so it may depend on how the caller initialized the initial threads and ray_num. It's definitely on the right track for a semaphore-protected approach. You do want to call rt_shootray outside of semaphore protection, otherwise the threads will all end up waiting on each other.
I tested, but it seems my loop is not the proper way. I could not figure it out, and I also tried different approaches.
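The point about shooting outside the protected region can be sketched as follows. This is a minimal illustration using a POSIX mutex in place of a BRL-CAD semaphore; demo_work, demo_shoot, and demo_worker are hypothetical names, and demo_shoot is only a placeholder for an expensive, thread-safe ray shot.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared work state: a counter guarded by a mutex. */
struct demo_work {
    pthread_mutex_t lock;
    size_t next;    /* index of the next unclaimed ray */
    size_t total;   /* number of rays to shoot */
    int *hit;       /* one result slot per ray */
};

/* Placeholder for an expensive, thread-safe ray shot. */
static int demo_shoot(size_t ray_index)
{
    return (ray_index % 2) == 0;
}

/* Thread entry point: hold the lock only long enough to claim an
 * index, then do the expensive work with the lock released. */
static void *demo_worker(void *arg)
{
    struct demo_work *w = arg;

    assert(w != NULL);
    for (;;) {
        size_t i;

        pthread_mutex_lock(&w->lock);
        i = w->next;
        if (i < w->total)
            w->next++;
        pthread_mutex_unlock(&w->lock);

        if (i >= w->total)
            break;

        /* Each index is claimed by exactly one thread, so writing
         * hit[i] here needs no lock. */
        w->hit[i] = demo_shoot(i);
    }
    return NULL;
}
```

Several threads can be started on the same demo_work with pthread_create(&tid, NULL, demo_worker, &w) and then joined. If demo_shoot were called while the lock was held, the threads would run one at a time.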
It would be great to get help :)
@scorp08 Can you post the latest version of your code and any other changes you've made?
@Sean Hello, and sorry for my late reply (holiday and virus issues). I did not change much. I just added an nray variable to the struct and a while loop inside the function. I am not sure how to proceed :)
@scorp08 can you share your code here?
@Sean Hello, I lost the drive, but I had not changed much beyond the posted code. Unfortunately I could not spend more time on it. :/
Any help is great for me :)))
@scorp08 Did you take a look at how the rt* tools shoot rays in parallel? take a look at src/rt/worker.c
understanding how they work is probably a good first step towards making a parallel function as they shoot in parallel
one possible approach is to simply do what they're doing in a function.
Hello, thanks. I made some changes and it worked! I ran with 10 million rays (all hits) and got around an 8x speed-up on my old 12-core laptop (measured with chrono, with all log functions disabled in my bundle_hit function). However, at the end rt_clean(rtip) returned with an error, probably an issue with freeing resources. I could not figure it out. Can I upload my bundle.c file here?
@scorp08 that's amazing, congratulations! absolutely, can share files here any time without asking.
ok, here... bundle.c
Is there a good way to shoot 10^7 or more rays without running into memory limits? I tried heap allocation, but it is still near the limits...
@scorp08 the typical way for shooting massive quantities of rays is to do them in batches. typically done with something like a streaming queue where rays are added as needed, and rays are bundled together into smaller sets (like 64x64) and evaluated until there are no more.
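The batching idea can be sketched like this. It is a minimal, single-threaded illustration under stated assumptions: rays are generated on demand by a hypothetical demo_make_ray, demo_shoot is a placeholder for the real ray shot, and one reusable buffer of 64x64 rays keeps memory use independent of the total ray count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_BATCH ((size_t)(64 * 64))   /* rays per batch */

struct demo_ray {
    size_t id;
    double origin[3];
    double dir[3];
};

/* Hypothetical generator: fill in ray number `index` on demand. */
static void demo_make_ray(size_t index, struct demo_ray *r)
{
    r->id = index;
    r->origin[0] = r->origin[1] = r->origin[2] = 0.0;
    r->dir[0] = r->dir[1] = 0.0;
    r->dir[2] = -1.0;
}

/* Placeholder for shooting one ray; returns 1 on a hit. */
static int demo_shoot(const struct demo_ray *r)
{
    return (r->id % 3) == 0;
}

/* Shoot `total` rays in batches, reusing a single buffer.
 * Returns the hit count, or (size_t)-1 if allocation fails. */
static size_t demo_shoot_batched(size_t total)
{
    struct demo_ray *batch = malloc(DEMO_BATCH * sizeof *batch);
    size_t hits = 0;

    if (batch == NULL)
        return (size_t)-1;

    for (size_t start = 0; start < total; start += DEMO_BATCH) {
        size_t left = total - start;
        size_t n = left < DEMO_BATCH ? left : DEMO_BATCH;

        for (size_t i = 0; i < n; i++)
            demo_make_ray(start + i, &batch[i]);

        /* In a parallel version, this batch is what would be
         * handed to the worker threads. */
        for (size_t i = 0; i < n; i++)
            hits += (size_t)demo_shoot(&batch[i]);
    }

    free(batch);
    return hits;
}
```

Only one batch of rays exists at a time, so 10^7 rays need the same buffer as 10^4. Per-ray results that must be kept would have to be reduced or written out per batch in the same way.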
I looked over your bundle.c code .. impressive work from different directions!
so I see you have one version that is basically the entire guts to rt_shootrays and you have this shootrays_in_parallel version, right? which one are you working on ?
I ask only because it looks like shootrays_in_parallel is disabled.
and I don't see the other being used
@Sean Thanks. It should be active; I worked in shootrays_in_parallel (I added a #define SHOOTRAYS_IN_PARALLEL). I cannot picture how to use a queue. Could you direct me to a tutorial or other resource so I can understand this kind of streaming process?
@Sean If the rays are divided into batches, could continuously respawning threads make the run time similar to serial (non-parallel) shooting?
@Sean
I made some changes to both rt_shootrays and shootrays_in_parallel and measured the speedup with time.h. It is about 2x, and I do not know why it is so low :joy: :joy: . I compared with rtcheck, which is quite fast with the same number of rays. Is it because I build DLL files (librt and its dependencies) and link my exe against them, so everything is resolved at run time? Here is the file: bundle.c
Has anybody had a chance to check the file? :))
guess no :joy: :joy:
Sorry @scorp08 I haven't had a chance to set up the profiling to see why
It's not due to dll linkage
I can say that :)
not a technical issue I guess
A low rate means you have a bottleneck. Threads are waiting on something they're all trying to access simultaneously.
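One common source of this kind of bottleneck (not confirmed as the cause here) is every thread updating the same counter or list under a single lock. A minimal sketch of the usual workaround gives each thread its own tally and merges the tallies once at the end. All names are hypothetical, and a 64-byte cache line is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_THREADS 64

/* One tally per thread, aligned to an assumed 64-byte cache line so
 * that neighbouring tallies do not share a line. */
struct demo_tally {
    _Alignas(64) size_t hits;
};

static struct demo_tally demo_tallies[DEMO_MAX_THREADS];

/* Called by worker `tid` for each hit. It touches only that worker's
 * own slot, so no lock is needed. */
static void demo_record_hit(int tid)
{
    assert(tid >= 0 && tid < DEMO_MAX_THREADS);
    demo_tallies[tid].hits++;
}

/* Called once, after all workers have finished. */
static size_t demo_total_hits(int nthreads)
{
    size_t total = 0;

    assert(nthreads >= 0 && nthreads <= DEMO_MAX_THREADS);
    for (int t = 0; t < nthreads; t++)
        total += demo_tallies[t].hits;
    return total;
}
```

The workers never wait on each other while recording hits; the only shared step is the final sum, which runs after the parallel section.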
I actually gained around a 3x speedup, measured immediately before and after bu_parallel(), but freeing struct re and the others after bu_parallel takes too long. That, I guess, is one of the problems.
well it's technical in the sense that it requires setting up a profiler to see what's going on -- that is something fun you can set up if you're interested in learning more about performance. The Intel profiler is free and really quite good.
Oh, okay, I'll look at VTune. I don't know whether it requires prior knowledge of the architecture or other technical background. It seems very handy to use.
It doesn't require knowledge of arch
you basically compile and then run from inside vtune and it watches the program. then it shows you where in the code it spent the most time.
@Sean I checked with VTune. bu_parallel seems OK (all threads are active) and it is faster than serial shooting, but RtlFreeHeap on Windows is where most of the time is spent, so the total execution time is longer than with serial shooting. I must have missed something when freeing all the memory in parallel. How should I free resources, partition lists, etc. after bu_parallel?
By the way, Vtune is very handy to use, shows everything you need.
Any cleanup help ?:))
I feel my cleanup help in the box :joy: :joy:
@scorp08 Sorry I was on vacation when you first wrote and then I've been focused on our github migration since. If vtune is showing most time in rtfreeheap, that sounds like a problem...
Is freeing the resource and the partition list the problem? (I guess so)
That would be a good guess. It should only be freeing the heap when it's all done. Perhaps you have it re-initializing and releasing/cleaning the rtip every ray or something.
try not releasing any memory, does performance improve dramatically? You don't want that for anything production, but it can help figure out where time is going in parallel.
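The "free only when it's all done" pattern can be sketched as a per-worker scratch pool that is allocated once, reset cheaply between rays, and released once at the end. This is a generic illustration with hypothetical names (demo_pool and its functions); it does not model librt's struct resource.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-worker scratch pool. */
struct demo_pool {
    double *segs;
    size_t cap;
    size_t used;
    size_t heap_calls;   /* counts malloc/free calls, for the demo only */
};

/* Allocate the pool once, before the first ray. Returns 1 on success. */
static int demo_pool_init(struct demo_pool *p, size_t cap)
{
    p->segs = malloc(cap * sizeof *p->segs);
    p->cap = p->segs ? cap : 0;
    p->used = 0;
    p->heap_calls = 1;
    return p->segs != NULL;
}

/* Per-ray reset: forget the contents, no heap traffic at all. */
static void demo_pool_reset(struct demo_pool *p)
{
    p->used = 0;
}

/* Take one slot from the pool; returns NULL when the pool is full. */
static double *demo_pool_take(struct demo_pool *p)
{
    return p->used < p->cap ? &p->segs[p->used++] : NULL;
}

/* Release the pool once, after the last ray. */
static void demo_pool_free(struct demo_pool *p)
{
    free(p->segs);
    p->segs = NULL;
    p->cap = 0;
    p->used = 0;
    p->heap_calls++;
}
```

A worker would call demo_pool_init before its loop, demo_pool_reset at the start of each ray, and demo_pool_free after the loop, so the heap is touched twice per worker regardless of how many rays it shoots.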
After a long time :) If I call rt_clean_resource, do I also need to call RT_FREE_SEG_LIST, or does rt_clean_resource handle that too?
@scorp08 My point earlier was that you may be calling some expensive cleanup too frequently... like cleaning up resources. You shouldn't need to clean very often and Rtfreeheap shouldn't be where most time is spent.
Still, to answer your question, rt_clean_resource is only going to reset the re_seg segments (by simply dropping the pointers), and free the rt_seg_blocks (pool of segments). If you're not dealing with those, then you will likely need to call RT_FREE_SEG_LIST (e.g., on finished_segs).
Here is a small result from my experiment. I ran rt, the parallel version (12 CPUs), and the 1-CPU version. I generated the same nhit counts and adjusted nrays to match nhit, since I shoot from the same point in rt_shootrays while the origin changes in rt. So nray is lower in rt_shootrays than in rt, but in the end nhit is the same. results.dat
Last updated: Jan 09 2025 at 00:46 UTC