| 00:19.53 | *** join/#brlcad infobot (~infobot@rikers.org) | |
| 00:19.53 | *** topic/#brlcad is GSoC students: if you have a question, ask and wait for an answer ... responses may take minutes or hours. Ask and WAIT. ;) | |
| 00:27.59 | *** join/#brlcad efjrugungwcmohmu (~armin@dslb-092-075-157-134.092.075.pools.vodafone-ip.de) | |
| 03:41.06 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 04:40.52 | Notify | BRL-CAD Wiki:Ravilogaiya * 0 /wiki/User:Ravilogaiya: | 
| 05:31.46 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 05:32.36 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 05:33.21 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 05:34.12 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 05:35.02 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 05:35.46 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 06:04.33 | *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:69b5:1646:8ec0:c796) | |
| 06:34.41 | *** join/#brlcad DaRock (~Thunderbi@mail.unitedinsong.com.au) | |
| 06:58.47 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 07:36.20 | *** join/#brlcad Caterpillar (~caterpill@unaffiliated/caterpillar) | |
| 07:43.44 | *** join/#brlcad merzo (~merzo@93.94.41.67) | |
| 07:49.32 | Notify | BRL-CAD:Amritpal singh * 10057 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */ | 
| 07:53.11 | Notify | BRL-CAD:Amritpal singh * 10058 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */ | 
| 08:12.20 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 08:34.21 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 08:51.53 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 10:38.01 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 11:43.38 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 11:46.04 | *** join/#brlcad d_rossberg (~rossberg@104.225.5.10) | |
| 12:15.57 | *** join/#brlcad DaRock (~Thunderbi@mail.unitedinsong.com.au) | |
| 12:19.55 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 12:58.55 | *** join/#brlcad kintel (~kintel@unaffiliated/kintel) | |
| 13:33.22 | *** join/#brlcad gabbar1947 (uid205515@gateway/web/irccloud.com/x-aihythsembbzujjz) | |
| 13:48.51 | d_rossberg | gabbar1947: i ran into a cmake error: include/rt/primitives/annot.h isn't there, and annot.h is inserted twice in the CMakeLists.txt | 
| 13:49.40 | gabbar1947 | I'll check, give me a second | 
| 13:51.40 | gabbar1947 | Rectified: I'm building on my system, just a moment | 
| 14:03.08 | gabbar1947 | Uploaded, this should pass | 
| 14:07.08 | d_rossberg | gabbar1947: what is your intention behind your changes to the root CMakeLists.txt | 
| 14:07.14 | d_rossberg | ? | 
| 14:08.31 | *** join/#brlcad yorik (~yorik@2804:431:f720:80d8:290:f5ff:fedc:3bb2) | |
| 14:08.55 | gabbar1947 | Actually I did not make any changes to the file, it's somehow reflected in the patch. | 
| 14:09.01 | gabbar1947 | I'll check once | 
| 14:10.12 | gabbar1947 | I'm unaware of any such change made by me, I have no idea why this reflects in the patch | 
| 14:12.21 | d_rossberg | i recommend opening the patch file in a text editor and checking whether it looks reasonable | 
| 14:13.18 | gabbar1947 | I'll revert the change to the CMakeLists.txt. Is there anything else you want me to look into? | 
| 14:13.36 | d_rossberg | in addition you have trailing spaces in your code | 
| 14:14.13 | d_rossberg | the "good" text editors provide a function to remove them all in one step | 
| 14:14.58 | gabbar1947 | I tried to remove as much as I could, anyways I'll go through the files once again. I'll use a text editor other than vim. | 
| 14:16.41 | d_rossberg | google says in vi it is :%s/\s\+$//e | 
| 14:18.22 | d_rossberg | see https://vi.stackexchange.com/questions/454/whats-the-simplest-way-to-strip-trailing-whitespace-from-all-lines-in-a-file/ | 
| 14:18.22 | gcibot | [ What's the simplest way to strip trailing whitespace from all lines in a file? - Vi and Vim Stack Exchange ] | 
| 14:18.35 | d_rossberg | and its reference to vim.wikia | 
| 14:18.43 | gabbar1947 | works! thank you | 
| 14:19.45 | gabbar1947 | On it! just a moment | 
| 14:22.20 | d_rossberg | i wrote my documents with vi for many years, you can write great literature with it - if you are tough enough for it | 
| 14:29.50 | gabbar1947 | can you give it a try now, and let me know if there are more errors ! | 
| 14:34.33 | d_rossberg | ok: typein.c has trailing spaces and a C++ comment (i.e. //) | 
| 14:35.41 | d_rossberg | maybe you should simply remove the line - and the number of segments in p_annot for the sake of simplicity (see Sean's mail) | 
| 14:35.52 | gabbar1947 | I'll remove it ! The C++ instinct ! | 
| 14:38.01 | gabbar1947 | Actually I wanted the "l" command to display the annotation container details as well, that was the reason for its inclusion. anyways i'm removing it! | 
| 14:38.13 | d_rossberg | run :%s/\s\+$//e on all files you've touched to make sure no trailing spaces are left | 
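For reference, the substitution d_rossberg quotes (`:%s/\s\+$//e`) deletes trailing whitespace (spaces and tabs) on every line of the buffer. A minimal C sketch of the same per-line operation; the function name `rstrip` is illustrative:

```c
#include <string.h>

/* Strip trailing spaces and tabs from a NUL-terminated line in place,
 * mirroring what vim's :%s/\s\+$//e does to each line of a file. */
void rstrip(char *line)
{
    size_t n = strlen(line);
    while (n > 0 && (line[n - 1] == ' ' || line[n - 1] == '\t'))
        n--;
    line[n] = '\0';
}
```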
| 14:38.38 | d_rossberg | isn't typein.c the in command? | 
| 14:42.31 | d_rossberg | sorry, i had an old patch, in the actual one all trailing spaces seem to be gone | 
| 14:42.40 | gabbar1947 | :) | 
| 14:43.41 | gabbar1947 | typein.c is the "in" command! but the describe() function for annotations displays the container params as well, so I just wanted to see the details on the screen, that's it ! | 
| 14:44.22 | d_rossberg | ok, it's your decision | 
| 14:46.58 | gabbar1947 | I'm submitting the patch, once build completes ! | 
| 14:55.45 | gabbar1947 | UPLOADED | 
| 14:59.34 | d_rossberg | however, i've to go now :( | 
| 14:59.41 | d_rossberg | i'll see ... | 
| 15:11.36 | gabbar1947 | :) | 
| 15:38.26 | *** join/#brlcad merzo (~merzo@93.94.41.67) | |
| 15:49.55 | *** join/#brlcad vasc (~vasc@bl14-42-31.dsl.telepac.pt) | |
| 15:50.01 | vasc | hey | 
| 15:50.04 | vasc | hello mdtwenty[m] | 
| 15:50.23 | mdtwenty[m] | Hi :) | 
| 15:51.02 | vasc | so... you said something about only supporting one partition? | 
| 15:52.03 | mdtwenty[m] | not only one partition.. one region | 
| 15:52.09 | vasc | right. | 
| 15:52.52 | vasc | but that isn't in the weaving part proper right? | 
| 15:52.58 | vasc | it's in the evaluation for rendering? | 
| 15:53.37 | mdtwenty[m] | yes it's in the evaluation part | 
| 15:53.51 | vasc | well. you haven't implemented rt_boolfinal yet. | 
| 15:54.35 | vasc | is there anything left to do in the boolean weaving? | 
| 15:55.36 | mdtwenty[m] | i don't think so.. i think the boolean weaving is already fine | 
| 15:56.10 | mdtwenty[m] | i was looking now into the rt_boolfinal | 
| 15:58.02 | mdtwenty[m] | *i also uploaded today the weave patch without using the pointers in the cl_partition structure | 
| 15:58.38 | vasc | yes. i've seen that. i'll have to review it more in depth later but it seems fine on a cursory glance. | 
| 15:59.28 | vasc | there's still the question of linked lists vs arrays, but without more complex test scenes there's no good way to benchmark it. | 
| 16:02.15 | vasc | the most complex test scenes on the standard database files are probably goliath and havoc. but i'm not sure if they use much csg | 
| 16:02.35 | vasc | they probably don't. | 
| 16:03.07 | vasc | have you tried rendering those to see what happens? | 
| 16:03.15 | vasc | and the operators scene. | 
| 16:03.22 | vasc | oh right. rendering issues. | 
| 16:03.34 | vasc | well. | 
| 16:03.38 | mdtwenty[m] | the havoc and the goliath? | 
| 16:03.47 | vasc | yes. havoc.g and goliath.g i think | 
| 16:03.53 | vasc | in share/db or something. | 
| 16:04.05 | mdtwenty[m] | yes i see it | 
| 16:04.08 | mdtwenty[m] | one sec | 
| 16:04.38 | vasc | if it doesn't crash that would be good enough i guess. | 
| 16:04.43 | vasc | for now. | 
| 16:05.27 | vasc | hopefully it's not horrendously slow either. | 
| 16:06.23 | mdtwenty[m] | hm the goliath scene fails the assertion of 32 segments per ray | 
| 16:06.30 | mdtwenty[m] | so i think 32 is not enough | 
| 16:06.30 | vasc | hah. | 
| 16:06.50 | vasc | see what's the max depth. | 
| 16:07.41 | vasc | it's still prolly under 64. | 
| 16:08.24 | vasc | i think i have some dynamic bitvector code in opencl or ANSI C in here somewhere you could use if it's bigger than that. | 
| 16:11.35 | mdtwenty[m] | is there a function to see what is the max depth or something? | 
| 16:12.22 | vasc | no. just in that for loop where you do the assert, keep track of the max segments per ray size, and then print it out when the loop ends. | 
| 16:13.18 | vasc | there might be something like that in one of the brlcad tools, but i'm not sure if it would work with the opencl backend. | 
| 16:20.00 | vasc | ok found it | 
| 16:20.01 | vasc | == host | 
| 16:20.02 | vasc | cl_uint ND = N/WORD_BITS + 1; | 
| 16:20.02 | vasc | mD = clCreateBuffer(gpuCtx, CL_MEM_READ_WRITE, sizeof(cl_uint) * ND, NULL, NULL); | 
| 16:20.02 | vasc | == device | 
| 16:20.02 | vasc | inline uint bindex(const uint b) { | 
| 16:20.03 | vasc | <PROTECTED> | 
| 16:20.05 | vasc | } | 
| 16:20.07 | vasc | inline uint bmask(const uint b) { | 
| 16:20.09 | vasc | <PROTECTED> | 
| 16:20.11 | vasc | } | 
| 16:20.13 | vasc | inline uint isset(__global uint *bitset, const uint b) { | 
| 16:20.15 | vasc | <PROTECTED> | 
| 16:20.17 | vasc | } | 
| 16:20.19 | vasc | inline uint clr(__global uint *bitset, const uint b) { | 
| 16:20.21 | vasc | <PROTECTED> | 
| 16:20.23 | vasc | } | 
| 16:20.25 | vasc | inline uint set(__global uint *bitset, const uint b) { | 
| 16:20.27 | vasc | <PROTECTED> | 
| 16:20.29 | vasc | } | 
| 16:20.31 | vasc | - | 
| 16:20.33 | vasc | this is my code, so i give you permission to use it for any purpose. | 
| 16:20.49 | vasc | where WORD_BITS is 32 since 'D' is an array of cl_uints | 
| 16:21.33 | vasc | and N is the amount of bits you want the bitvector to have. | 
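The helper bodies above are elided in the paste, but the description pins down the standard bit-array idiom: with WORD_BITS at 32, `bindex` shifts by 5 to pick the word and `bmask` selects one bit within it. A plain-C sketch along those lines, with `bit_`-prefixed names to keep them distinct from the OpenCL versions:

```c
#include <stdint.h>

#define WORD_BITS 32

/* Index of the 32-bit word holding bit b. */
static inline uint32_t bindex(uint32_t b) { return b >> 5; }

/* Mask selecting bit b inside its word. */
static inline uint32_t bmask(uint32_t b) { return 1u << (b & 31); }

static inline uint32_t bit_isset(const uint32_t *bitset, uint32_t b) {
    return bitset[bindex(b)] & bmask(b);
}

static inline void bit_clr(uint32_t *bitset, uint32_t b) {
    bitset[bindex(b)] &= ~bmask(b);
}

static inline void bit_set(uint32_t *bitset, uint32_t b) {
    bitset[bindex(b)] |= bmask(b);
}

/* Words needed for an N-bit vector, as in the host snippet above;
 * N/WORD_BITS + 1 over-allocates one word when N is a multiple of 32,
 * which is harmless. */
static inline uint32_t nwords(uint32_t n) { return n / WORD_BITS + 1; }
```

For the goliath depth of 493 discussed below, this sizing gives 16 words, matching the cl_uint[16] figure.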
| 16:21.42 | mdtwenty[m] | i got maxdepth of 957571 | 
| 16:21.48 | vasc | WHAT | 
| 16:21.52 | vasc | do the math properly dude. | 
| 16:21.58 | vasc | :-) | 
| 16:22.02 | vasc | that can't be true. | 
| 16:22.13 | vasc | max per segment, not the sum of everything. | 
| 16:22.55 | Stragus | Still not a fan of allocating chunks out of big buffers through atomics, eh :) | 
| 16:23.27 | vasc | we might do that eventually. but for now there's a lot of gfx card memory we don't use. | 
| 16:24.01 | Stragus | All right. It goes up quickly when buffering all hits for millions of rays | 
| 16:24.52 | vasc | well sure. we could find the warp size and only allocate a buffer of that size or something. | 
| 16:24.57 | vasc | it's too much work :-) | 
| 16:25.27 | vasc | those microoptimizations can be done later. | 
| 16:25.59 | vasc | i kinda doubt we need to do it this way anyway. | 
| 16:26.12 | vasc | i suspect we could do the csg processing in an iterative fashion with a modified algorithm. | 
| 16:26.48 | Stragus | So you want to allocate the "max depth" for every ray... and how do you determine that max depth? | 
| 16:27.22 | vasc | we count the amount of segments per before allocating and actually storing the segments. | 
| 16:27.33 | vasc | we count the amount of segments per ray before allocating and actually storing the segments. | 
| 16:27.37 | Stragus | Ideally, you would process the segments as they come rather than buffering the whole thing. That complicates the code though | 
| 16:27.51 | Stragus | Ah yes, the two passes thing, count then trace | 
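Stragus's "count then trace" scheme can be sketched host-side: pass one records how many segments each ray produced, an exclusive prefix sum over the counts yields each ray's offset into one shared buffer plus the total to allocate, and the identical trace then runs again writing at those offsets. A sketch under those assumptions; the function name `seg_offsets` is hypothetical:

```c
#include <stddef.h>

/* Exclusive prefix sum over per-ray segment counts.  Fills offsets[i]
 * with where ray i's segments start in the shared buffer and returns
 * the grand total, i.e. how many segment slots to allocate. */
size_t seg_offsets(const size_t *counts, size_t *offsets, size_t nrays)
{
    size_t total = 0;
    for (size_t i = 0; i < nrays; i++) {
        offsets[i] = total;   /* ray i writes at [offsets[i], offsets[i]+counts[i]) */
        total += counts[i];
    }
    return total;
}
```

On a GPU the same reduction would be done with a parallel scan kernel, but the host loop shows the arithmetic.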
| 16:28.04 | vasc | yeah i suspect that could be done across the whole pipeline, but it requires rethinking the algorithm. | 
| 16:28.14 | vasc | it's probably non-trivial. but yeah it's worthwhile in the long run. | 
| 16:28.48 | vasc | i'm just kinda hesitant to do it right away without properly understanding how the current algorithm works. | 
| 16:29.14 | vasc | not just mechanically but in terms of performance as well. | 
| 16:29.40 | Stragus | My raytracer used an inlined callback, the user could do whatever it wants with the hits. They can be processed on the fly (recommended) or buffered by a custom solution in the callback. And importantly, the inlined callback can terminate rays early | 
| 16:29.48 | vasc | especially considering the guys who originally wrote the code didn't do it, and they worked on it for decades. | 
| 16:30.10 | vasc | well, we kind of have something like that, | 
| 16:30.36 | vasc | there's no storing of segments in the single pass version of the renderer. | 
| 16:30.45 | Stragus | Cool. Is the callback truly inlined? You want to avoid any function call on GPU, especially function pointers | 
| 16:30.45 | vasc | but that doesn't do CSG. | 
| 16:31.09 | vasc | and it only returns the first hit, or an accumulation of the result of all the hits. | 
| 16:31.28 | vasc | you can't have function pointers on opencl. | 
| 16:31.40 | vasc | but yeah its some function. | 
| 16:31.45 | Stragus | Eheh. You can with CUDA, but it's a Very Bad Idea anyway | 
| 16:32.38 | mdtwenty[m] | i was doing it wrongly :D | 
| 16:32.58 | vasc | that version of the renderer is way faster than the current ANSI C one. | 
| 16:33.04 | mdtwenty[m] | i got 493 maxdepth for the goliath and 105 for the havoc | 
| 16:33.12 | vasc | but it doesn't do CSG so it isn't a proper comparison | 
| 16:33.16 | vasc | really? | 
| 16:33.21 | vasc | it's still way more than i expected. | 
| 16:33.32 | Stragus | vasc: I'm sure. My raytracer of triangles reached a billion rays per second... while a CPU core does 20M per second at most | 
| 16:33.44 | Stragus | (SSE optimized CPU code) | 
| 16:33.56 | vasc | mdtwenty[m], what's the amount of primitives in each scene? | 
| 16:34.33 | vasc | just for curiosity's sake. | 
| 16:35.06 | vasc | i think there's a 'list' command in mged or something | 
| 16:35.25 | mdtwenty[m] | goliath has 10499 primitives and havoc 2429 | 
| 16:35.46 | vasc | pfew, it's still smaller at least. | 
| 16:35.58 | vasc | but... | 
| 16:36.09 | Stragus | There could be some ray hitting a bunch of aligned screws? :) | 
| 16:36.10 | mdtwenty[m] | well the good news is that the boolean weaving doesn't crash | 
| 16:36.49 | vasc | well it might use an outrageous amount of memory. | 
| 16:37.10 | vasc | so perhaps Stragus will get his thing. | 
| 16:37.26 | vasc | :-) | 
| 16:37.48 | vasc | the havoc with 105 is ok i guess. | 
| 16:37.51 | vasc | but 493 | 
| 16:38.10 | vasc | that's 16 double words | 
| 16:38.25 | vasc | i.e. cl_uint [16] | 
| 16:38.41 | mdtwenty[m] | yeah i got 493 while rendering with the front view | 
| 16:39.09 | vasc | can you compute the amount of memory that would take with that size of bitvector? | 
| 16:39.17 | vasc | the whole segments array | 
| 16:39.39 | Stragus | Why use the max depth for a ray though? Why not compute the sum of all rays, through a reduction kernel, if you are going to perform an identical trace right away? | 
| 16:40.08 | vasc | we have a bitvector we use inside each ray's segment list | 
| 16:40.15 | Stragus | (I still prefer dynamically allocated memory, but your way would work fine, except for the tracing-twice thing) | 
| 16:41.00 | vasc | well | 
| 16:41.02 | vasc | it's like | 
| 16:41.06 | vasc | we have a list of segments | 
| 16:41.11 | vasc | which gets computed into a list of partitions | 
| 16:41.44 | vasc | and then that gets evaluated | 
| 16:42.05 | Stragus | So each ray computes how much storage it requires, it stores that number, and you reduce all these numbers to a grand total? | 
| 16:42.13 | vasc | each segment only belongs in one object right? but the partitions can belong to more than one. | 
| 16:42.30 | vasc | like the ray pierces one and exits the other. but it's the same partition of solid space. | 
| 16:42.52 | vasc | yeah its kinda like that. | 
| 16:43.22 | vasc | but that's only used to compute the amount of space we'll need. | 
| 16:43.30 | vasc | the actual algorithm isn't just a reduction. | 
| 16:44.17 | Stragus | Right... but what I'm saying is that it doesn't matter if the "max depth" is 40000 due to a bunch of aligned screws somewhere | 
| 16:44.25 | Stragus | You want the total for all rays | 
| 16:44.38 | vasc | mdtwenty[m], another thing we could do is dynamically allocate the bitvector, so rays with more segments would get larger bitvectors, but i wonder if that would complicate the code too much. | 
| 16:45.22 | vasc | hmm | 
| 16:47.59 | vasc | so it's max_partitions*sizeof(cl_partition_without the bitvector)+max_partitions*sizeof(cl_uint)*(493/32) | 
| 16:48.09 | vasc | how much is that in bytes? | 
| 16:48.17 | vasc | mdtwenty[m] | 
| 16:48.44 | vasc | and this max_partitions is the total amount of partitions for all the rays. | 
| 16:48.53 | vasc | sum | 
| 16:49.48 | mdtwenty[m] | one sec | 
| 17:03.16 | vasc | hmm | 
| 17:03.31 | vasc | perhaps this is not as big of a deal as i thought | 
| 17:04.22 | vasc | +struct cl_partition { | 
| 17:04.22 | vasc | + struct cl_seg inseg; | 
| 17:04.22 | vasc | + struct cl_hit inhit; | 
| 17:04.22 | vasc | + struct cl_seg outseg; | 
| 17:04.22 | vasc | + struct cl_hit outhit; | 
| 17:04.23 | vasc | + cl_uint segs; /* 32-bit vector to represent the segments in the partition */ | 
| 17:04.25 | vasc | +}; | 
| 17:04.30 | vasc | but | 
| 17:04.35 | vasc | struct cl_hit { | 
| 17:04.35 | vasc | <PROTECTED> | 
| 17:04.35 | vasc | <PROTECTED> | 
| 17:04.35 | vasc | <PROTECTED> | 
| 17:04.37 | vasc | <PROTECTED> | 
| 17:04.39 | vasc | <PROTECTED> | 
| 17:04.41 | vasc | }; | 
| 17:04.43 | vasc | and | 
| 17:04.46 | vasc | struct cl_seg { | 
| 17:04.47 | vasc | <PROTECTED> | 
| 17:04.49 | vasc | <PROTECTED> | 
| 17:04.51 | vasc | <PROTECTED> | 
| 17:04.53 | vasc | }; | 
| 17:04.55 | vasc | so | 
| 17:04.57 | vasc | who cares. | 
| 17:05.03 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 17:05.35 | vasc | it's like just a cl_hit has 8+8+8+2+1 words | 
| 17:05.39 | vasc | double words | 
| 17:05.57 | vasc | i.e. 27 | 
| 17:06.25 | Stragus | That cl_hit struct is kind of heavy, like 84 bytes | 
| 17:06.29 | vasc | and each partition has like 6 of those | 
| 17:07.10 | vasc | so using even 16 double words for the bitvector seems pathetic in comparison. | 
| 17:07.50 | vasc | still i'm kinda interested to know how much memory the whole thing uses right now. | 
| 17:08.29 | vasc | Stragus, it's much, much worse than that. | 
| 17:08.43 | vasc | coz cl_double3's are ACTUALLY cl_double4s. | 
| 17:09.15 | vasc | it's an opencl thing. | 
| 17:09.40 | vasc | and then there's struct alignment | 
| 17:09.49 | vasc | which reminds me | 
| 17:09.58 | vasc | mdtwenty[m], instead of this: | 
| 17:10.22 | vasc | +struct cl_partition { | 
| 17:10.22 | vasc | + struct cl_seg inseg; | 
| 17:10.22 | vasc | + struct cl_hit inhit; | 
| 17:10.22 | vasc | + struct cl_seg outseg; | 
| 17:10.22 | vasc | + struct cl_hit outhit; | 
| 17:10.23 | vasc | + cl_uint segs; /* 32-bit vector to represent the segments in the partition */ | 
| 17:10.24 | vasc | +}; | 
| 17:10.27 | vasc | try this: | 
| 17:10.38 | vasc | +struct cl_partition { | 
| 17:10.38 | vasc | + struct cl_seg inseg; | 
| 17:10.38 | vasc | + struct cl_seg outseg; | 
| 17:10.38 | vasc | + struct cl_hit inhit; | 
| 17:10.38 | vasc | + struct cl_hit outhit; | 
| 17:10.39 | vasc | + cl_uint segs; /* 32-bit vector to represent the segments in the partition */ | 
| 17:10.41 | vasc | +}; | 
| 17:10.46 | vasc | and see if its sizeof() is smaller. | 
| 17:11.01 | Stragus | That shouldn't make a difference, both cl_hit and cl_seg have the same alignment | 
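Stragus's point can be seen with a toy pair of structs: swapping two members that share the same alignment (the inseg/inhit case, since cl_seg and cl_hit both align to their cl_double4 members) cannot change the padding, whereas interleaving members of *different* alignments does. Illustrative structs only, not the actual cl_* ones:

```c
/* Interleaving a 1-byte member between 8-byte-aligned members forces
 * padding after each char; grouping same-alignment members together
 * packs the chars into one tail slot. */
struct interleaved { char a; double x; char b; double y; };
struct grouped     { double x; double y; char a; char b; };
```

On a typical LP64 ABI the first comes out at 32 bytes and the second at 24; reordering cl_seg and cl_hit members among themselves changes nothing because their alignment is identical.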
| 17:11.08 | vasc | i hope so. | 
| 17:11.55 | Stragus | If these cl_double3 waste memory, then perhaps it should be packed differently | 
| 17:11.58 | vasc | its the buildin cl_types that are an issue usually. | 
| 17:12.02 | vasc | builtin | 
| 17:12.45 | Stragus | Although frankly, this whole data storage scheme is very unfriendly to GPUs and SIMD | 
| 17:12.55 | vasc | hm | 
| 17:13.05 | vasc | only because it isn't z-ordered. | 
| 17:13.09 | vasc | oh i see. | 
| 17:13.34 | vasc | well | 
| 17:13.39 | Stragus | No! Because you will have scattered loads/stores all over | 
| 17:13.48 | Stragus | All memory transactions will be 8 times slower than necessary | 
| 17:14.10 | vasc | the thing is you probably don't need the whole thing across all stages of the algorithm | 
| 17:14.20 | vasc | so you could fraction this | 
| 17:14.33 | vasc | and increase memory locality. | 
| 17:14.39 | Stragus | For best performance, all threads of a warp/wavefront need to access consecutive memory addresses | 
| 17:14.44 | *** join/#brlcad merzo (~merzo@194.140.108.146) | |
| 17:15.01 | Stragus | So you need some struct where struct foo { float x[32]; float y[32]; etc. }; | 
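The layout Stragus sketches is the usual array-of-structs versus struct-of-arrays distinction: in SoA, thread i of a 32-wide warp reading field x touches address base + 4*i, so the warp's loads coalesce into a few wide transactions; in AoS the same reads stride by the full struct size. Toy structs, with illustrative names:

```c
#include <stddef.h>

#define WARP 32

/* Array-of-structs: thread i's x sits sizeof(struct aos_hit) bytes
 * from thread i+1's x, so a warp's 32 loads are scattered. */
struct aos_hit { float x, y, z; };

/* Struct-of-arrays, as in Stragus's sketch: the warp's 32 x values
 * are consecutive 4-byte addresses, which coalesce. */
struct soa_hit { float x[WARP]; float y[WARP]; float z[WARP]; };

/* Byte distance between neighboring threads accessing field x. */
size_t aos_stride(void) { return sizeof(struct aos_hit); }
size_t soa_stride(void) { return sizeof(float); }
```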
| 17:15.02 | vasc | it's like i said, you don't need the whole thing. | 
| 17:15.07 | vasc | even before that. | 
| 17:15.45 | vasc | yeah i know but if we minimize the size of the elements it ain't a big deal. | 
| 17:15.59 | vasc | the problem is the structs are too fat right now. | 
| 17:16.20 | vasc | still | 
| 17:16.35 | vasc | in comparison to the ANSI C code, it's incredibly memory coherent ya know? | 
| 17:16.40 | Stragus | Not a big deal? The stride between elements doesn't matter unless it's in the same cache lines | 
| 17:17.00 | Stragus | Well, these memory operations will be 8 times slower than if they were reorganized differently | 
| 17:17.14 | Stragus | That may or may not be a bottleneck, you'll decide that | 
| 17:17.31 | vasc | even with a cache? | 
| 17:17.34 | vasc | i'm not sure about that. | 
| 17:17.53 | vasc | i think the main issue is to have poor memory locality in accesses. | 
| 17:18.00 | vasc | rather than the access patterns themselves. | 
| 17:18.41 | Stragus | It's not about the cache, it's about memory transactions | 
| 17:18.52 | vasc | memory bank conflicts? | 
| 17:19.00 | Stragus | I am very sure about that, been doing CUDA for 8 years, and probably the biggest helper in #cuda... | 
| 17:19.16 | Stragus | Bank conflicts are for shared memory | 
| 17:19.58 | vasc | well the thing is | 
| 17:20.09 | vasc | if you're gonna need to access the rest of the struct in the same kernel | 
| 17:20.29 | vasc | it's all going to have to be loaded anyway. | 
| 17:20.30 | Stragus | Presumably, all threads of the same warp/wavefront will also access the rest of their structs, no? | 
| 17:21.15 | vasc | that's the thing i said, i think we don't need to store everything in that struct in all the stages of the algorithm. it's just that currently we're slavishly following the way the existing ANSI C code is structured. | 
| 17:21.37 | vasc | like | 
| 17:21.54 | Stragus | Okay! But it should still be designed so that consecutive threads access consecutive values in memory | 
| 17:22.02 | Stragus | You don't want a stride between threads | 
| 17:23.08 | vasc | i'll give you an example. i thought about doing that in the intersections code. | 
| 17:23.29 | Stragus | Consecutive addresses is _the_ solution that is fast on all GPUs from all vendors, for all generations. Beyond that, there are particularities if the accesses are shuffled, out of order, with gaps between chunks of 128 bytes, etc. | 
| 17:23.33 | vasc | well it turns out each kernel still has so many branches. a lot of threads will be idling and it's awfully low performance. | 
| 17:23.46 | vasc | the GPU isn't maxed out. | 
| 17:23.57 | Stragus | All right. But the threads that are active would still access a bunch of packed addresses | 
| 17:24.09 | vasc | no it's actually terrible. | 
| 17:24.33 | vasc | imagine one thread is doing a quadric like a sphere, and the other is doing a torus intersection. | 
| 17:24.55 | Stragus | Indeed, paths should be merged as much as possible. Coherent rays can help a lot with that | 
| 17:25.14 | vasc | well i thought about that. that actually kind of happens as it is. | 
| 17:25.28 | vasc | since i'm using a thread block. | 
| 17:25.30 | vasc | but | 
| 17:25.36 | Stragus | If there are some common operations between spheres and tori, like storing data (memory transactions), they should be merged together | 
| 17:25.46 | vasc | i think the best thing would be to reorder the intersection calculations. | 
| 17:25.47 | Stragus | As little code as possible should be specific to branches | 
| 17:25.57 | Stragus | That's possible, yes | 
| 17:26.21 | vasc | like group the ones that use the same kernel solver together. | 
| 17:27.18 | vasc | but the whole existing ANSI C code is built more to minimize the number of operations than to reduce memory consumption or maximize memory coherency | 
| 17:27.18 | Stragus | It's possible to have warp-wide votes to decide how many threads need to perform operation X, before deciding to do it with a bunch of threads | 
| 17:27.38 | vasc | or minimize branches | 
| 17:27.38 | Stragus | But these aren't as critical issues as properly organizing memory. Reshuffling memory implies rewriting a lot of code, so it must be done early | 
| 17:27.59 | Stragus | Right, it's very different to optimize for scalar execution and for wide parallelism | 
| 17:28.09 | vasc | its not just that | 
| 17:28.16 | vasc | it's optimized for 1980s machines | 
| 17:28.20 | Stragus | Oh I see, yes | 
| 17:28.32 | Stragus | Memory was fast, ALUs were slow. And now it's the other way around | 
| 17:28.35 | vasc | yep+ | 
| 17:31.01 | vasc | so mdtwenty[m] any luck with that? | 
| 17:33.19 | mdtwenty[m] | sent a long message: mdtwenty[m]_2017-06-23_17:33:18.txt <https://matrix.org/_matrix/media/v1/download/matrix.org/hVZtHbkIpDKOvbklzQAQwlpz> | 
| 17:33.39 | vasc | ok | 
| 17:33.47 | vasc | what about the memory size of the whole thing? | 
| 17:35.35 | vasc | <vasc> hmm | 
| 17:35.36 | vasc | <vasc> so it's max_partitions*sizeof(cl_partition_without the bitvector)+max_partitions*sizeof(cl_uint)*(493/32) | 
| 17:35.36 | vasc | <vasc> how much is that in bytes? | 
| 17:35.36 | vasc | <vasc> mdtwenty[m] | 
| 17:35.36 | vasc | <vasc> and this max_partitions is the total amount of partitions for all the rays. | 
| 17:35.38 | vasc | <vasc> sum | 
| 17:35.41 | mdtwenty[m] | i got 2337015024 | 
| 17:35.55 | mdtwenty[m] | for the goliath that has 493 depth | 
| 17:36.35 | vasc | 2 GB?! | 
| 17:37.04 | vasc | ok, how much is max_partitions*sizeof(cl_partition_without the bitvector) alone? | 
| 17:37.39 | mdtwenty[m] | compiling | 
| 17:39.06 | mdtwenty[m] | 2 179 756 800 | 
| 17:39.49 | vasc | also i wanna see the code for bool_Eval | 
| 17:39.59 | vasc | so the bitvectors aren't the real problem | 
| 17:40.16 | vasc | since they use a "mere" 200 MB or less. | 
| 17:40.31 | vasc | ok i think i got the idea | 
| 17:40.36 | vasc | +struct cl_partition { | 
| 17:40.36 | vasc | + struct cl_seg inseg; | 
| 17:40.36 | vasc | + struct cl_hit inhit; | 
| 17:40.36 | vasc | + struct cl_seg outseg; | 
| 17:40.37 | vasc | + struct cl_hit outhit; | 
| 17:40.37 | vasc | + cl_uint segs; /* 32-bit vector to represent the segments in the partition */ | 
| 17:40.39 | vasc | +}; | 
| 17:40.41 | vasc | + | 
| 17:40.53 | vasc | why not store indexes instead of copies of the cl_segs? | 
| 17:41.51 | mdtwenty[m] | yes i think that it would work | 
| 17:43.40 | vasc | ((8*4*3+8+4)*2+4)*2 vs 4*2 | 
| 17:43.45 | vasc | 440 vs 8 | 
| 17:43.56 | vasc | that should shrink things down | 
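The arithmetic behind "440 vs 8", spelled out. Reading vasc's figures as: a cl_hit carries three vector fields stored as 4 doubles each (cl_double3 occupies a cl_double4 in OpenCL) plus a double and a uint; a cl_seg bundles two hits plus a uint; and a partition stores two cl_segs versus two cl_uint indexes. The field breakdown is a guess from the figures, and padding is ignored:

```c
/* Rough per-partition byte counts for copies vs indexes. */
unsigned hit_bytes(void)   { return 8u * 4u * 3u + 8u + 4u; } /* 3 vec4s + dist + surfno */
unsigned seg_bytes(void)   { return hit_bytes() * 2u + 4u; }  /* in/out hit + id        */
unsigned copy_bytes(void)  { return seg_bytes() * 2u; }       /* inseg + outseg copies  */
unsigned index_bytes(void) { return 4u * 2u; }                /* two cl_uint indexes    */
```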
| 17:47.10 | vasc | so you have the code for bool_eval so i can look at it? i kind of want to understand which data in the partitions will get accessed in rt_boolfinal and rendering. | 
| 17:49.30 | mdtwenty[m] | posted a file: ocl_bool_eval.patch (51KB) <https://matrix.org/_matrix/media/v1/download/matrix.org/SFtMrGVveJAkLiBFABODKZMf> | 
| 17:49.40 | mdtwenty[m] | this should be it | 
| 17:51.27 | vasc | ok | 
| 17:51.40 | mdtwenty[m] | i think that only the segments in the partition are relevant for boolean evaluation and shading | 
| 17:51.48 | vasc | rt_boolfinal seems to hammer a partition's inhit/outhit .hit_hist's over and over. | 
| 17:52.01 | vasc | .hit_dist | 
| 17:52.11 | vasc | and then it computes segment regions | 
| 17:54.50 | vasc | in a later optimization we may want to simplify the partition structures. | 
| 17:55.03 | vasc | struct cl_hit { | 
| 17:55.03 | vasc | <PROTECTED> | 
| 17:55.03 | vasc | <PROTECTED> | 
| 17:55.03 | vasc | <PROTECTED> | 
| 17:55.03 | vasc | <PROTECTED> | 
| 17:55.04 | vasc | <PROTECTED> | 
| 17:55.06 | vasc | }; | 
| 17:55.15 | vasc | of all of this, it looks as if rt_boolfinal only accesses the hit_dist. | 
| 17:55.31 | vasc | i think the rest is only accessed in the final rendering stages. | 
| 17:56.14 | vasc | but for now, just not storing the whole cl_segs in the cl_partitions should be enough. | 
| 17:57.21 | Stragus | Eh, still keep in mind that repacking of hits in structs of arrays of 32 or 64 | 
| 17:58.09 | vasc | well the current code is horrible in several ways. | 
| 17:58.25 | vasc | just the amount of memory copies going on is kind of insane. | 
| 17:58.34 | vasc | i'm surprised it's as fast as it is. | 
| 17:58.44 | Stragus | How "fast" is fast? :) | 
| 17:59.11 | Stragus | The reference CPU code is also abysmally slow, so it's not a great point of comparison | 
| 17:59.44 | vasc | mdtwenty[m], after you use indexes instead of storing the whole cl_segs, test goliath and havoc again. tell me the time it takes and whether it crashes. | 
| 18:00.00 | vasc | Stragus, well it's what we have. | 
| 18:00.26 | vasc | mdtwenty[m], once you get that working, it's time to work on rt_boolfinal | 
| 18:00.54 | vasc | mdtwenty[m], oh and tell me how much memory it uses to store the partitions now vs what it used before. | 
| 18:01.05 | vasc | with the indexes. | 
| 18:01.10 | mdtwenty[m] | and what about the bitvector for now? | 
| 18:01.25 | vasc | yes, use the dynamic bitvector too | 
| 18:01.35 | vasc | i paste the code above? did you get it? | 
| 18:01.37 | vasc | pasted | 
| 18:01.59 | vasc | <vasc> ok found it | 
| 18:01.59 | vasc | <vasc> == host | 
| 18:01.59 | vasc | <vasc> cl_uint ND = N/WORD_BITS + 1; | 
| 18:01.59 | vasc | <vasc> mD = clCreateBuffer(gpuCtx, CL_MEM_READ_WRITE, sizeof(cl_uint) * ND, NULL, NULL); | 
| 18:01.59 | vasc | <vasc> == device | 
| 18:02.00 | vasc | <vasc> inline uint bindex(const uint b) { | 
| 18:02.02 | vasc | <vasc> return (b >> 5); | 
| 18:02.04 | vasc | <vasc> } | 
| 18:02.06 | vasc | <vasc> inline uint bmask(const uint b) { | 
| 18:02.08 | vasc | <vasc> return (1 << (b & 31)); | 
| 18:02.10 | vasc | <vasc> } | 
| 18:02.12 | vasc | <vasc> inline uint isset(__global uint *bitset, const uint b) { | 
| 18:02.14 | vasc | <vasc> return (bitset[bindex(b)] & bmask(b)); | 
| 18:02.16 | vasc | <vasc> } | 
| 18:02.18 | vasc | <vasc> inline uint clr(__global uint *bitset, const uint b) { | 
| 18:02.20 | vasc | <vasc> return (bitset[bindex(b)] &= ~bmask(b)); | 
| 18:02.22 | vasc | <vasc> } | 
| 18:02.24 | vasc | <vasc> inline uint set(__global uint *bitset, const uint b) { | 
| 18:02.26 | vasc | <vasc> return (bitset[bindex(b)] |= bmask(b)); | 
| 18:02.28 | vasc | <vasc> } | 
| 18:02.30 | vasc | <vasc> - | 
| 18:02.32 | vasc | <vasc> this is my code, so i give you permission to use it for any purpose. | 
| 18:02.34 | vasc | <vasc> where WORD_BITS is 32 since 'D' is an array of cl_uints | 
| 18:02.35 | vasc | <vasc> and N is the amount of bits you want the bitvector to have. | 
| 18:03.10 | vasc | basically you have a cl_uint array per bitvector | 
| 18:03.32 | vasc | and you can use the isset, clr, or set functions to twiddle the bits. | 
| 18:03.39 | vasc | or query them. | 
| 18:04.48 | vasc | for a start you can just use a cl_uint segs[16]; or whatever | 
| 18:04.59 | vasc | but eventually you want to dynamically determine the size of this | 
| 18:05.39 | vasc | the quick and dirty way to do it, is basically to pass it as a #define before the kernels are compiled. | 
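For what the "#define before the kernels are compiled" route would look like (which vasc advises against here): the size goes into the options string handed to clBuildProgram, e.g. clBuildProgram(prog, 1, &dev, opts, NULL, NULL). The macro names below are made up; the rounding turns a bit count into whole 32-bit words:

```c
#include <stdio.h>

/* Build the -D options string baking the bitvector size into the
 * OpenCL kernels at compile time.  MAX_SEGS/SEG_WORDS are illustrative
 * macro names, not ones the BRL-CAD kernels actually use. */
int make_build_opts(char *opts, size_t n, unsigned max_segs)
{
    unsigned words = (max_segs + 31u) / 32u;   /* ceil(max_segs / 32) */
    return snprintf(opts, n, "-D MAX_SEGS=%u -D SEG_WORDS=%u",
                    max_segs, words);
}
```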
| 18:06.07 | vasc | but don't do that. | 
| 18:06.28 | vasc | we'll probably need to optimize this some other way. but without more tests, it's hard to determine the appropriate solution. | 
| 18:07.05 | Stragus | That bitvector is to determine entry/exit status through solids? | 
| 18:07.15 | vasc | no | 
| 18:07.29 | vasc | it states which segments are within a partition | 
| 18:07.43 | vasc | it's per ray | 
| 18:08.01 | vasc | we could do this some other way though | 
| 18:08.01 | Stragus | Okay. I guess I'm not familiar enough with the terminology used by the BRL-CAD raytracer | 
| 18:08.27 | vasc | if the bitvector is too sparse, we would probably be better off with using a list, like the current code already does. | 
| 18:09.11 | Stragus | Without knowing what the bitvector was, that was my thought | 
| 18:10.31 | vasc | well. | 
| 18:10.47 | vasc | i thought there would be less depth complexity in the average scene than there actually is. | 
| 18:10.50 | vasc | my mistake. | 
| 18:11.35 | vasc | a typical game scene has like 3 depth complexity. | 
| 18:12.05 | vasc | in here we don't cull stuff. | 
| 18:12.28 | vasc | i thought 32 was enough. so much for that. | 
| 18:12.38 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 18:13.28 | vasc | i know there are hard limits in the amount of intersections per ray on triangle meshes in the current code for example. | 
| 18:15.59 | vasc | https://svn.code.sf.net/p/brlcad/code/brlcad/trunk/src/librt/primitives/bot/tie.c | 
| 18:16.06 | vasc | <PROTECTED> | 
| 18:17.32 | vasc | mdtwenty[m], if the bitvector is too slow on the goliath scene, we'll have to use lists again... | 
| 18:18.29 | mdtwenty[m] | ok, i will change the bool weave to use the indexes for segments intead of the segs and will test | 
| 18:18.35 | vasc | ok | 
| 18:22.20 | vasc | hm | 
| 18:23.00 | vasc | so this is what the ANSI C code does for bool_eval of solids. | 
| 18:23.02 | vasc | case OP_SOLID: | 
| 18:23.02 | vasc | <PROTECTED> | 
| 18:23.02 | vasc | register struct soltab *seek_stp = treep->tr_a.tu_stp; | 
| 18:23.02 | vasc | register struct seg **segpp; | 
| 18:23.02 | vasc | for (BU_PTBL_FOR(segpp, (struct seg **), &partp->pt_seglist)) { | 
| 18:23.03 | vasc | <PROTECTED> | 
| 18:23.05 | vasc | ret = 1; | 
| 18:23.07 | vasc | goto pop; | 
| 18:23.09 | vasc | <PROTECTED> | 
| 18:23.11 | vasc | } | 
| 18:23.13 | vasc | ret = 0; | 
| 18:23.15 | vasc | <PROTECTED> | 
| 18:23.17 | vasc | <PROTECTED> | 
| 18:27.39 | vasc | so | 
| 18:28.00 | vasc | you need to know if a partition has a solid in it | 
| 18:31.31 | mdtwenty[m] | a solid? | 
| 18:33.43 | vasc | a solid is basically a primitive object. | 
| 18:33.54 | vasc | like a sphere. | 
| 18:34.03 | vasc | in brlcad parlance. | 
| 18:35.08 | vasc | a solid object. | 
| 18:39.01 | vasc | i'll go jog for a while. should be back in 30-45 mins | 
| 18:39.56 | mdtwenty[m] | Ok :) i will also take a break to get dinner | 
| 19:04.28 | *** join/#brlcad DaRock (~Thunderbi@mail.unitedinsong.com.au) | |
| 20:17.32 | vasc | . | 
| 20:38.16 | mdtwenty[m] | so i changed the struct partition and it is working fine.. i am working on the dynamic bit vector right now | 
| 20:39.35 | vasc | good. make sure to make a backup. :-) | 
| 20:40.12 | vasc | you'll have to initialize the data though | 
| 20:40.30 | vasc | prolly the easiest way is to bzero the memory before using it. | 
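A sketch of that initialization on the host side, using memset (the portable spelling of bzero); for a device-side cl_mem the analogue would be clEnqueueFillBuffer with a zero pattern (OpenCL 1.2+). The N/32 + 1 word count matches the host sizing pasted earlier; `alloc_bitvec` is a made-up name:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate an nbits-wide bitvector with every bit cleared, so the
 * weaving code starts from a known-zero state. */
uint32_t *alloc_bitvec(size_t nbits)
{
    size_t nw = nbits / 32 + 1;            /* same sizing as cl_uint ND */
    uint32_t *v = malloc(nw * sizeof *v);
    if (v)
        memset(v, 0, nw * sizeof *v);      /* bzero equivalent */
    return v;
}
```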
| 21:03.27 | mdtwenty[m] | ok i will have to leave the house for a bit and probably will only get back to this tomorrow morning | 
| 21:05.17 | mdtwenty[m] | i will notify you once its done | 
| 21:07.27 | vasc | ok | 
| 21:08.42 | vasc | see you later then! |