| 00:17.03 | Notify | 03BRL-CAD:mdtwenty * 70033 (brlcad/branches/opencl/include/rt/shoot.h brlcad/branches/opencl/src/librt/primitives/bool.cl and 3 others): skipping unevaluated partitions |
| 00:18.06 | *** join/#brlcad https_GK1wmSU (~deep-book@169.55.27.131) | |
| 00:19.05 | *** part/#brlcad https_GK1wmSU (~deep-book@169.55.27.131) | |
| 00:38.53 | *** join/#brlcad sqrosuofcvzwzywq (~armin@dslb-094-216-164-255.094.216.pools.vodafone-ip.de) | |
| 00:41.13 | Notify | 03BRL-CAD:Marco-domingues * 10149 /wiki/User:Marco-domingues/GSoC17/Log: /* 31 July */ |
| 02:13.37 | *** join/#brlcad yukonbob (~bch@S0106b81619e90b96.ok.shawcable.net) | |
| 02:13.40 | yukonbob | hello #brlcad |
| 02:43.36 | *** join/#brlcad https_GK1wmSU (~deep-book@77.234.42.183) | |
| 02:44.37 | *** part/#brlcad https_GK1wmSU (~deep-book@77.234.42.183) | |
| 03:34.33 | *** join/#brlcad gabbar1947 (uid205515@gateway/web/irccloud.com/x-uthcarhicquzqbhn) | |
| 05:55.07 | *** join/#brlcad merzo (~merzo@20-21-132-95.pool.ukrtel.net) | |
| 06:45.42 | *** join/#brlcad Caterpillar (~caterpill@unaffiliated/caterpillar) | |
| 07:14.18 | *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:8dd6:e12e:6013:7834) | |
| 07:15.00 | *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:8dd6:e12e:6013:7834) | |
| 07:18.05 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 07:27.36 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 07:53.52 | *** join/#brlcad Stragus_ (~alexis@modemcable131.0-202-24.mc.videotron.ca) | |
| 08:14.30 | *** join/#brlcad merzo (~merzo@user-94-45-58-139.skif.com.ua) | |
| 08:25.19 | *** join/#brlcad Caterpillar (~caterpill@unaffiliated/caterpillar) | |
| 08:34.02 | *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) | |
| 08:39.31 | *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) | |
| 08:42.54 | *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) | |
| 08:51.44 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 09:11.44 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 09:15.52 | *** join/#brlcad KimK_ (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) | |
| 09:17.31 | *** join/#brlcad KimK_ (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) | |
| 09:21.01 | *** join/#brlcad KimK_ (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) | |
| 09:32.36 | *** join/#brlcad gabbar1947 (uid205515@gateway/web/irccloud.com/x-ltbxxbrtphejfhrl) | |
| 09:33.46 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 10:30.52 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 11:55.55 | Notify | 03BRL-CAD:Amritpal singh * 10150 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */ |
| 12:02.13 | Notify | 03BRL-CAD:Amritpal singh * 0 /wiki/File:Bentrebar.png: |
| 12:07.20 | Notify | 03BRL-CAD:Amritpal singh * 0 /wiki/File:PopimageUshapeRebar.png: |
| 12:08.05 | Notify | 03BRL-CAD:Amritpal singh * 10153 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */ |
| 12:09.25 | Notify | 03BRL-CAD:Amritpal singh * 10154 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */ |
| 12:20.59 | *** join/#brlcad teepee (~teepee@unaffiliated/teepee) | |
| 13:22.06 | *** join/#brlcad gabbar1947 (uid205515@gateway/web/irccloud.com/x-uzhthiaatgjhxdcu) | |
| 13:35.38 | Notify | 03BRL-CAD:mdtwenty * 70034 brlcad/branches/opencl/src/librt/primitives/rt.cl: Fix bug in partitions shading. Shading closest partition |
| 13:39.29 | *** join/#brlcad yorik (~yorik@2804:431:f720:4929:290:f5ff:fedc:3bb2) | |
| 13:39.45 | *** join/#brlcad vasc (~vasc@bl4-147-121.dsl.telepac.pt) | |
| 13:41.09 | Notify | 03BRL-CAD:Marco-domingues * 0 /wiki/File:Shade_closest_partition.png: |
| 13:48.37 | Notify | 03BRL-CAD:Marco-domingues * 10156 /wiki/User:Marco-domingues/GSoC17/Log: /* 1 August */ |
| 14:16.26 | *** join/#brlcad mdtwenty (bc519789@gateway/web/freenode/ip.188.81.151.137) | |
| 14:21.57 | vasc | hey mdtwenty |
| 14:22.07 | mdtwenty | hello! |
| 14:22.38 | *** join/#brlcad mdtwenty[m] (mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh) | |
| 14:23.09 | vasc | once you put the changes to skip unevaluated partitions and the bug fix for the closest partition on svn then tell me so i can look at the code |
| 14:23.15 | vasc | about bool_eval()... |
| 14:23.35 | vasc | did you see patch i posted on the patches tracker? |
| 14:23.51 | vasc | <PROTECTED> |
| 14:24.22 | mdtwenty[m] | I have commited the changes already in the opencl branch |
| 14:24.58 | vasc | https://sourceforge.net/p/brlcad/patches/473/ |
| 14:24.58 | i5o | [ BRL-CAD / Patches / #473 Eliminate gotos from bool_eval() ] |
| 14:25.34 | vasc | apply this patch to brlcad trunk/ and try running your tests on that to see what's the performance difference vs the old code. |
| 14:25.59 | vasc | if that's fast enough, then we can work on a simplified prototype in ANSI C to port to OpenCL later. |
| 14:26.22 | vasc | with a linearized binary tree in an array, instead of storing the expression tree in rpn format. |
| 14:26.32 | vasc | so the tree would be in infix notation. |
| 14:27.08 | vasc | it shouldn't even use more memory if we do it right. |
| 14:27.43 | mdtwenty[m] | hm okay I will apply the patch and test it |
| 14:28.03 | vasc | ok. i'll try to advance the pseudocode for the solution to speed this up. |
| 14:28.58 | mdtwenty[m] | yeah I think most of the bugs with the csg code are fixed now |
| 14:31.19 | mdtwenty[m] | the images match the results from the ANSI C code, although some scenes have missing geometry from the primitives that are not supported yet (goliath.g is one example of this, where some pipes are missing) |
| 14:31.33 | vasc | i'm gonna need to know the first material in a solid so i can shade the primitives properly in the one-hit shader though. |
| 14:31.45 | vasc | yeah pipe is used quite often. |
| 14:32.15 | vasc | but the code is kinda more complicated than that of regular primitive. that's why i didn't ask you to work on it. |
| 14:32.28 | vasc | pipe is a composite primitive. |
| 14:32.44 | vasc | its a list of segments of things. |
| 14:33.37 | vasc | it would distract from the purpose of getting the csg working. |
| 14:33.51 | vasc | let's see... |
| 14:34.16 | vasc | i need some data structure. |
| 14:34.17 | mdtwenty[m] | hm yes I remember you saying that porting that primitive would take some work |
| 14:35.21 | vasc | in case you're curious you can look at src/librt/primitives/pipe/pipe_shot.c or something like that. |
| 14:36.04 | vasc | https://svn.code.sf.net/p/brlcad/code/brlcad/trunk/src/librt/primitives/pipe/pipe.c |
| 14:36.53 | vasc | if you do a 'wc -l' to get a linecount on the pipe.c file vs, say, ehy.c you would see that's it's a lot more complex. |
| 14:37.10 | vasc | anyway... |
| 14:38.07 | vasc | implementing bool_eval() is easier than that. |
| 14:38.15 | vasc | seriously. |
| 14:40.36 | mdtwenty[m] | hm yes lets first optimize the code and then I can look into porting some more primitives into OpenCL :) |
| 14:42.58 | vasc | nah. |
| 14:43.05 | vasc | we want to optimize this to the max. |
| 14:43.12 | vasc | screw the primitives. |
| 14:43.22 | vasc | we only have one month. |
| 14:44.01 | mdtwenty[m] | ah yes, I meant eventually after the gsoc period |
| 14:44.16 | vasc | maybe next year. :-) |
| 14:45.01 | vasc | and mind you, implementing pipe would be a breeze compared to the other main ticket item left to do... the brep intersector. |
| 14:45.44 | vasc | gah. |
| 14:45.49 | vasc | that's another gsoc or two. |
| 14:46.36 | vasc | if we get all that working, i think this will have a more complete GPU intersector than ANYTHING ELSE in the market right now. |
| 14:47.01 | vasc | if it happens. |
| 14:47.16 | vasc | anyway... |
| 14:51.27 | vasc | we'll use something like this bin tree representation |
| 14:51.32 | vasc | https://github.com/mmp/pbrt-v2/blob/master/src/core/kdtree.h |
| 14:51.33 | i5o | [ pbrt-v2/kdtree.h at master · mmp/pbrt-v2 · GitHub ] |
| 14:52.13 | vasc | that's for a spatial kd-tree, but the data structure is more or less the same a bintree. in fact ours is simpler because we don't need to store as much data on the nodes. |
| 14:59.31 | mdtwenty[m] | okay, IIRC the idea is to store the jumps in the data structure, along with the boolean tree information? |
| 15:00.14 | vasc | yes. |
| 15:00.23 | vasc | this is the data structure the code uses now: |
| 15:00.37 | vasc | union tree { |
| 15:00.37 | vasc | <PROTECTED> |
| 15:00.37 | vasc | <PROTECTED> |
| 15:00.37 | vasc | <PROTECTED> |
| 15:00.37 | vasc | <PROTECTED> |
| 15:00.38 | vasc | <PROTECTED> |
| 15:00.40 | vasc | <PROTECTED> |
| 15:00.42 | vasc | <PROTECTED> |
| 15:00.44 | vasc | <PROTECTED> |
| 15:00.46 | vasc | <PROTECTED> |
| 15:00.48 | vasc | <PROTECTED> |
| 15:00.50 | vasc | <PROTECTED> |
| 15:00.52 | vasc | <PROTECTED> |
| 15:00.54 | vasc | <PROTECTED> |
| 15:00.56 | vasc | <PROTECTED> |
| 15:00.58 | vasc | <PROTECTED> |
| 15:01.00 | vasc | <PROTECTED> |
| 15:01.02 | vasc | <PROTECTED> |
| 15:01.04 | vasc | <PROTECTED> |
| 15:01.06 | vasc | <PROTECTED> |
| 15:01.08 | vasc | <PROTECTED> |
| 15:01.10 | vasc | <PROTECTED> |
| 15:01.12 | vasc | <PROTECTED> |
| 15:01.14 | vasc | <PROTECTED> |
| 15:01.16 | vasc | <PROTECTED> |
| 15:01.18 | vasc | <PROTECTED> |
| 15:01.21 | vasc | <PROTECTED> |
| 15:01.23 | vasc | <PROTECTED> |
| 15:01.25 | vasc | <PROTECTED> |
| 15:01.27 | vasc | <PROTECTED> |
| 15:01.29 | vasc | <PROTECTED> |
| 15:01.31 | vasc | <PROTECTED> |
| 15:01.33 | vasc | <PROTECTED> |
| 15:01.35 | vasc | <PROTECTED> |
| 15:01.37 | vasc | }; |
| 15:01.39 | vasc | and this is what we'll use: (pseudo-code) |
| 15:01.41 | vasc | / assume NOPs have been removed and all leafs are SOLIDs. |
| 15:01.43 | vasc | / pseudo-code. |
| 15:01.45 | vasc | struct tree { |
| 15:01.47 | vasc | <PROTECTED> |
| 15:01.49 | vasc | <PROTECTED> |
| 15:01.51 | vasc | <PROTECTED> |
| 15:01.53 | vasc | <PROTECTED> |
| 15:01.55 | vasc | <PROTECTED> |
| 15:01.57 | vasc | <PROTECTED> |
| 15:01.59 | vasc | <PROTECTED> |
| 15:02.01 | vasc | <PROTECTED> |
| 15:02.03 | vasc | <PROTECTED> |
| 15:02.05 | vasc | }; |
| 15:02.07 | vasc | ta-da. |
| 15:02.09 | vasc | a linearized array binary tree. |
| 15:02.24 | vasc | where nodes are either branches with binary ops (has_lchild = 1) or leafs with solids (has_lchild = 0) |
| 15:03.10 | vasc | the current code is a horrible, horrible waste of memory and has lotsa pointer chasing. |
| 15:03.19 | vasc | we're gonna fix that. |
| 15:04.06 | Stragus | Having the whole acceleration structure as a big linear blob of memory is good, and you can use relative offsets to jump around |
| 15:05.06 | Stragus | 16 or 32 bits with a << 4 is pretty good. 32 bits is easier, 16 bits involves a complicated shuffling/ordering solution |
| 15:05.08 | vasc | right. when it's a branch, left child is always at current_offset+1, and the right child is that current_offset+tree.rchild. or was that tree.rchild. i don't remember. |
| 15:05.24 | Stragus | Ah, or that |
| 15:06.09 | vasc | so each tree node should fit into 32-bits. |
| 15:06.18 | vasc | 4 bytes. |
| 15:06.44 | vasc | instead of... 4*2 + 8*3 bytes. |
| 15:06.56 | vasc | 32 |
| 15:07.12 | vasc | 8x smaller. |
| 15:07.22 | vasc | and not as much pointer chasing. |
| 15:07.29 | vasc | so |
| 15:07.42 | vasc | that's the whole idea. |
| 15:08.09 | vasc | now the small detail of getting it to work. |
| 15:08.29 | vasc | "small detail" |
| 15:08.52 | Stragus | *nods* My raytracers always used a linear blob of memory for each graph. Another wonderful detail is that you can save the blob to disk and use "as is", you can use on both CPU and GPU directly, it just works |
| 15:09.24 | vasc | yes. we'll use a simple recursive function the CPU to linearize this tree, and then we'll just copy that to the GPU. |
| 15:10.33 | Stragus | Linearizing on CPU would be a good idea too |
| 15:10.46 | Stragus | (and it makes things easier, same data structure everywhere) |
| 15:18.59 | vasc | meh... the code for the kd-tree in pbrt is too complicated. i think I also saw this in Wald's thesis someplace. |
| 15:21.47 | vasc | I think I first read about this in Gordon Stoll's SIGGRAPH presentation. |
| 15:22.48 | vasc | hmmm... |
| 15:23.09 | vasc | i probably has some of my own code someplace with this. |
| 15:23.11 | vasc | have |
| 15:24.00 | mdtwenty[m] | I've just tested your patch (https://sourceforge.net/p/brlcad/patches/473/) and it seems to match the current times with the version on the trunk |
| 15:24.01 | i5o | [ A 404 Error has Occurred ] |
| 15:24.13 | vasc | ok, but what about the times? |
| 15:24.18 | vasc | oh |
| 15:24.21 | vasc | great. |
| 15:24.34 | vasc | so we got ourselves an algorithm without gotos. |
| 15:24.36 | mdtwenty[m] | goliath is the same (0.33 vs 0.33) |
| 15:24.47 | vasc | which can't be used in OpenCL. |
| 15:25.21 | vasc | now we need an algorithm without gotos, wasted memory, needless pointer chasing, and no dynamic memory. |
| 15:25.32 | mdtwenty[m] | havoc (0.62 new version vs 0.63 in the trunk) |
| 15:25.33 | vasc | that takes care of the first problem. |
| 15:25.35 | mdtwenty[m] | so no big difference |
| 15:25.47 | vasc | yeah, that could just be measuring error. |
| 15:25.53 | vasc | should be anyway. |
| 15:26.33 | vasc | you wanna code the ANSI C prototype? |
| 15:27.09 | mdtwenty[m] | yeah I can give it a try! |
| 15:27.13 | vasc | we basically need to code a tree builder and a tree traverser. |
| 15:31.06 | vasc | i was trying to find some of my old code so you could use that as a base. |
| 15:33.26 | vasc | ah well. |
| 15:33.46 | vasc | dunno if it survived my HD crash 3 years ago. |
| 15:37.24 | vasc | meh. finding mostly code snippets of bin trees accessed with shifts. |
| 15:37.37 | vasc | but that means it has to be a balanced bin tree, or it wastes a lot of memory so... |
| 15:38.52 | vasc | anyway |
| 15:41.12 | vasc | https://pastebin.com/R4SZpDMD |
| 15:41.13 | i5o | [ [C] #define OP_UNION 0 /**< @brief Binary: L union R */ #define OP_INTERS - Pastebin.com ] |
| 15:41.15 | vasc | this should help |
| 15:41.42 | vasc | i don't want you to waste a lot of time reinventing things that have already been invented. |
| 15:41.53 | vasc | so we can use the remaining time to actually invent new stuff. |
| 15:41.57 | vasc | :-) |
| 15:43.49 | vasc | start by building the tree builder, that converts a trunk/ union tree into that array of uints with the layout i mentioned before, and a function to print it to examine the output. |
| 15:44.38 | vasc | you can use the KdTree() constructor and recursiveBuild() function I put in the pastebin from PBRT as a reference. |
| 15:45.04 | vasc | but you'll need to use shifts and things like that to store things in 32-bits. |
| 15:45.19 | vasc | because OpenCL does not have support for bitfields in structs either. |
| 15:45.23 | vasc | gotta love OpenCL. |
| 15:46.08 | Stragus | Managing the bits yourself sometimes allows producing better code than with bitfields anyway |
| 15:46.18 | vasc | https://stackoverflow.com/questions/9033947/why-are-bitfields-not-allowed-in-opencl |
| 15:46.19 | i5o | [ bit fields - Why are bitfields not allowed in OpenCL? - Stack Overflow ] |
| 15:47.14 | vasc | ah so it's because of endianess? bah. |
| 15:47.29 | vasc | then what about ints? those would need to be converted too... |
| 15:47.36 | vasc | scratches his head. |
| 15:48.01 | vasc | so it's just because of padding it seems. |
| 15:49.18 | vasc | anyway the idea is the first bit says if its a branch or a leaf. if it's a branch the next 3 bits define the binary OP, and the remaining bits where right child is. |
| 15:49.37 | vasc | if it's a leaf, then the other 30 bits have the solid id. |
| 15:49.39 | vasc | that's it. |
| 15:49.49 | vasc | 31 bits actually |
| 15:50.15 | vasc | i hope 2G solids are enough. |
| 15:51.22 | vasc | and like 256 M bool nodes. |
| 15:51.37 | vasc | if it's not enough we can use 64 bits. |
| 15:53.49 | vasc | mdtwenty[m], thx for conducting the tests. |
| 15:53.59 | vasc | i'll look at your code now. |
| 15:54.27 | mdtwenty[m] | no worries :) |
| 15:54.45 | mdtwenty[m] | i think I got the idea.. I will start working on this and see what I can get |
| 15:55.40 | vasc | sure. if your ANSI C prototype code turns out to be faster than the one in trunk/ then we can ask Sean to test it and maybe we'll use it in the main ANSI C path as well. |
| 15:56.10 | vasc | it might be because the nodes are going to be smaller and the memory accesses more coherent. |
| 15:56.33 | vasc | although i haven't looked at how the code allocates the nodes yet. |
| 16:00.17 | vasc | you can use the RPN bool patch I made way back when to see which functions you need to implement to completely replace the existing boolean trees in trunk/. |
| 16:02.04 | mdtwenty[m] | okay thanks! I'll keep that in mind |
| 16:02.44 | vasc | but like i said, start with the tree builder and the print function. |
| 16:02.56 | vasc | to debug. |
| 16:13.22 | vasc | duh... ok, i see. the existing code has memory pool for tree nodes if the nodes get deleted and reallocated. but fresh nodes are just allocated with posix_memalign() or calloc()... |
| 16:13.33 | vasc | yeah that should make them memory coherent. not. |
| 16:13.53 | vasc | well maybe if the malloc implementation is smart. which it usually isn't. |
| 16:15.19 | *** join/#brlcad Caterpillar2 (~caterpill@unaffiliated/caterpillar) | |
| 16:15.44 | vasc | yes... the results of changing this code should prove quite interesting. |
| 16:16.23 | vasc | notice how my code which always evaluates the entire expression actually was either the same speed or faster in the simpler expressions. |
| 16:17.00 | mdtwenty[m] | i hope so! |
| 16:17.55 | mdtwenty[m] | one thing that i noticed the other day was that the max rpn tree lenght in the havoc scene was 583 IIRC |
| 16:19.16 | vasc | yeah that doesn't seem so good. |
| 16:19.31 | vasc | well you know there are several possible strategies to optimize the tree evaluator. |
| 16:21.47 | vasc | say a solid is never intersected by any ray. so why evaluate that for every partition or whatever? |
| 16:22.10 | vasc | it's possible to optimize the tree in runtime or prior to render. |
| 16:22.49 | Stragus | A linear blob with small offsets rathre than pointers, all packed in memory, will definitely be faster on the CPU |
| 16:23.33 | Stragus | And you can use something like a 3D space filling curve to sort data so it's spatially close, better filling cache lines. I remember getting a few extra percents from that |
| 16:25.13 | vasc | yeah none of the buffers are in z-order. that likely has a negative impact. |
| 16:25.39 | vasc | i also need to reduce the amount of memory used in between stages. |
| 16:26.48 | Stragus | If you ever want some pretty fast space filling curve code: http://www.rayforce.net/ccspacecurve.c http://www.rayforce.net/ccspacecurve.h |
| 16:27.01 | vasc | about using SoA vs AoS... |
| 16:28.07 | vasc | once the code is cleaned up we might try that. |
| 16:28.29 | vasc | but right now i'm more concerned about thread divergence and poor usage of capacity. |
| 16:28.47 | Stragus | Yup, that's really something for the final touch |
| 16:28.57 | Stragus | (I gained 2-3% with that, woohoo) |
| 16:29.39 | vasc | i had like order of magnitude improvements once because full GPU units weren't being used because of divergence. |
| 16:30.00 | vasc | that one can really bite. |
| 16:30.42 | Stragus | Instead of testing 32 rays against a single sector or primitive, have you considered one warp testing one ray against 32 sectors or primitives? |
| 16:30.58 | Stragus | For highly incoherent rays, that seems like a better idea than trying to make coherent non-diverging bundles |
| 16:31.27 | Stragus | (replace the 32 with 4 or 8 on CPUs, 64 on AMD, etc.) |
| 16:32.00 | vasc | well right now the code only does primaries. well it's generic enough to be used with any rays, but those are the only rays the generator generates. |
| 16:32.39 | vasc | i've had a lot of luck with sorting rays to decrease divergence. |
| 16:33.19 | Stragus | Really?... Sorting rays either eat through your global memory banwidth, or you can't sort enough in on-chip shared memory |
| 16:33.37 | Stragus | For me, it was always faster to just trace incoherently than to bother sorting them |
| 16:33.39 | vasc | in my own engine i actually fudged the RNG for the ambient occlusion rays so that those are more coherent. but don't tell anyone that... :-) |
| 16:33.55 | Stragus | But that could change if the acceleration structure was designed to test one ray against 32 primitives, instead of the other way around |
| 16:34.06 | Stragus | Eh, I did the same thing |
| 16:34.12 | vasc | ahah |
| 16:35.02 | vasc | well the sorting is useful if you have a fast sort. |
| 16:35.09 | vasc | did you try using the CUB sort? |
| 16:35.20 | Stragus | The problem wasn't the fast sort, the problem was the global memory banwdith |
| 16:35.28 | vasc | oh that? |
| 16:35.33 | Stragus | I could trace incoherent rays for the same amount of bandwidth as sorting them |
| 16:35.34 | vasc | you use compress-sort-decompress. |
| 16:36.29 | vasc | have you read this paper? |
| 16:36.30 | vasc | http://www.keldysh.ru/pages/cgraph/articles/dep20/publ2010/GPU-RayTracing.pdf |
| 16:36.32 | Stragus | Assuming rays bouncing everywhere, how do you compress the origin, vector and other contextual information about that ray's purpose? |
| 16:36.51 | vasc | look at that paper. :-) |
| 16:37.29 | Stragus | The "compression" is for the sort, sure |
| 16:37.52 | Stragus | But the write,read,write,read for bundling rays takes a lot of bandwidth |
| 16:39.44 | Stragus | Considering the typical incoherent ray for me had to touch about ~17 cache lines of 32 bytes... |
| 16:39.51 | Stragus | (if I remember the metric correctly) |
| 16:40.17 | vasc | hm |
| 16:40.21 | vasc | well you know |
| 16:40.37 | vasc | for my own render pipeline i only use that for incoherent rays. |
| 16:41.19 | vasc | also i think it's not just a problem of memory coherence but of scheduling as well |
| 16:41.38 | vasc | like, adjacent rays typically have similar complexity. |
| 16:42.12 | Stragus | Right, less waste from threads having nothing to do |
| 16:42.43 | Stragus | Well, like I said, if I had to trace a ton of incoherent rays again, I would just use a different acceleration structure |
| 16:42.52 | Stragus | Test a single ray against batches of 32 triangles instead |
| 16:46.21 | vasc | i once read a poster by Yoon. he basically ran all the intersections against simplified geometry (like a convex hull of the geometry) to get an estimate for the hit points. |
| 16:46.34 | vasc | sorted those in z-order. and then used that to schedule the rays. |
| 16:46.52 | vasc | it was an out-of-core renderer. so it made even more sense there. |
| 16:47.24 | vasc | actually he used a simplified mesh, not the convex hull now that i think of it. |
| 16:48.04 | vasc | the ray against triangles... well. |
| 16:48.14 | vasc | the thing is... triangles are "fatter" than rays. |
| 16:49.08 | Stragus | But it's a lot easier to make coherent bundles of 32 triangles than 32 rays ;) |
| 16:49.25 | Stragus | (assuming global illumination-like incoherent tracing) |
| 16:49.45 | vasc | 3 vertexes with 3 coordinates of 4 bytes vs 2 vertexes with 3 coordinates. |
| 16:49.53 | vasc | and that's not even assuming the rays have same origin. |
| 16:50.32 | Stragus | Rays also need to track contextual information, so you can accumulate back to whatever bounce it came from |
| 16:50.55 | vasc | sure an id can take care of that. |
| 16:50.58 | Stragus | (assuming you short and shuffle them) |
| 16:51.02 | Stragus | sort* |
| 16:51.57 | Stragus | I don't know, there's a lot of potential for an acceleration structure designed to minimize the count of bundles of 32 triangles being tested (in the global cost function, 1 triangle in a sector has the same cost as 32) |
| 16:53.08 | vasc | i think i mentioned this in some slides for a class i gave sometime. |
| 16:53.43 | vasc | i was showing how you can reduce the amount of math ops in the intersections when testing with same ray origin vs same triangles and the like. |
| 16:54.08 | vasc | i remember that was also less nice for the many triangle approach. |
| 16:54.29 | vasc | thing is, when the many rays don't have the same origin though, then it changes. |
| 16:55.15 | Stragus | Right sure, highly coherent rays (including with same origin) is a different problem with a different optimal solution |
| 16:57.39 | vasc | https://drive.google.com/file/d/0B85Rkmt7rnCTWGVNUlhaZzZyQkU/view?usp=sharing |
| 16:57.40 | i5o | [ aula.pdf - Google Drive ] |
| 16:57.55 | vasc | page 6. |
| 16:58.02 | vasc | i gave that class like 5 years ago. |
| 16:58.59 | vasc | i remember i analyzed that as i thought it was interesting. and ran some tests. then i gave it up. :-) |
| 16:59.31 | vasc | oh... what was that other thing... |
| 16:59.38 | vasc | right. |
| 16:59.53 | vasc | ray-triangle fan and ray-triangle strip optimized intersection routines. |
| 16:59.59 | vasc | i was looking at those back then too. |
| 17:00.34 | vasc | i only found like one decent paper for each one. |
| 17:00.58 | vasc | it's another way to reduce the bandwidth required for the triangles right? |
| 17:03.10 | vasc | so i looked at this one: http://gamma.cs.unc.edu/RS/paper_rt07.pdf |
| 17:03.31 | vasc | but i dunno it seemed kinda lame to me, that you needed to build a BSP |
| 17:04.32 | vasc | actually an skd-tree but whatever. |
| 17:05.31 | vasc | it seemed lame to me. to use axis aligned separators. |
| 17:05.44 | vasc | you already have the planes defined by the triangles or whatever. |
| 17:06.00 | vasc | so i also ignored that idea. |
| 17:06.09 | *** join/#brlcad ``Erik (~erik@pool-100-16-14-17.bltmmd.fios.verizon.net) | |
| 17:11.07 | vasc | there's all sorts of tricks you can ues. |
| 17:11.09 | vasc | use. |
| 17:11.41 | vasc | like... you know how they say for primary rays since the origin is always the same, only store the direction? well i only stored the index used to generate the ray. |
| 17:11.58 | vasc | then i have generator functions for rays. primaries, shadows, ambient occlusion the works. |
| 17:12.03 | vasc | so the engine was all specialized. |
| 17:13.37 | vasc | ah whatever. i'm not gonna pick that up again. |
| 17:13.50 | vasc | i've decided if i ever rewrite it, i'm gonna do everything differently. |
| 17:14.43 | vasc | so Stragus, are many-triangle intersections in fashion again? |
| 17:17.43 | Stragus | I have no idea what's in fashion. I almost never read papers, I just have fun ;) |
| 17:18.11 | vasc | every once in a while something comes out that makes a difference. that's why i read papers. |
| 17:18.29 | vasc | its usually not a new idea, but, the twist makes the difference. |
| 17:18.39 | Stragus | And really, I would recommend using a graph rather than a tree |
| 17:18.49 | vasc | for the geometry right? |
| 17:18.52 | Stragus | A tree implies a stack, which implies per-thread stores and loads, and that's terrible |
| 17:19.00 | Stragus | For the space partitioning, yes |
| 17:19.50 | vasc | isn't that like a tree with ropes like they usually call it? well at least that's how Havran calls it. |
| 17:20.19 | Stragus | No idea if there's a fancy term for it. But what I use is a graph, sectors connecting to other sectors, there's no tree |
| 17:20.56 | vasc | then how's the partitioning done? does it still use split planes somehow? |
| 17:21.03 | Stragus | The idea being to avoid all per-thread memory stores, until you have your intersection point and you need to do something with it |
| 17:21.44 | Stragus | I use axis-aligned sectors, which can overlap in space, and each facet of a sector can either connect to another sector or a "node" with a splitting plane to decide where to go next |
| 17:22.13 | Stragus | There are other ways, that's not ideal for the one-ray-against-many-triangles approach |
| 17:22.25 | Stragus | But just avoid a per-thread stack, no matter how you do it |
| 17:22.39 | vasc | i've read something similar in papers from the 1980s but i can't say i remember other people working on that. it seemed like a cool idea when i read it. |
| 17:23.30 | vasc | back then a lot of people were trying to get ray tracing to scale in MPP architectures. |
| 17:24.13 | vasc | like Transputers or that Intel machine... the one with i860s. |
| 17:24.21 | Stragus | Tracking a stack in L1 on many-cores no-SIMD chips is no big deal |
| 17:24.35 | Stragus | But the situation is different with GPUs and modern wide SIMD |
| 17:24.45 | vasc | https://en.wikipedia.org/wiki/Intel_iPSC#iPSC.2F860 |
| 17:24.46 | i5o | [ Intel iPSC - Wikipedia ] |
| 17:25.13 | vasc | the Intel Paragon. |
| 17:25.45 | vasc | maan... they spent a lot of time trying to minimize inter thread communications and ops. |
| 17:26.24 | vasc | well i dunno if those processors actually had caches. |
| 17:26.26 | Stragus | Usually, for massively parallel software, you just write code to _avoid_ requiring low latency inter-thread/device communication |
| 17:27.14 | Stragus | Like for computational fluid dynamics, it's better to compute the same cells redundantely while you receive results from other devices from two time steps ago |
| 17:27.27 | vasc | ah right. the i860 had caches but early Transputers did not. |
| 17:28.12 | vasc | ah CFD. i always wondered how they implemented the data structures for the multi-grid solvers with AMR in those. |
| 17:29.00 | vasc | ever seen this? |
| 17:29.01 | vasc | https://www.youtube.com/watch?v=vYA0f6R5KAI |
| 17:29.03 | i5o | [ GPUs to Mars: Full-Scale Simulation of SpaceX's Mars Rocket Engine - YouTube ] |
| 17:29.03 | Stragus | AMR? |
| 17:29.13 | vasc | Adaptive Mesh Refinement |
| 17:29.34 | vasc | https://en.wikipedia.org/wiki/Adaptive_mesh_refinement |
| 17:29.34 | i5o | [ Adaptive mesh refinement - Wikipedia ] |
| 17:32.29 | vasc | https://youtu.be/vYA0f6R5KAI?t=1510 |
| 17:32.31 | i5o | [ GPUs to Mars: Full-Scale Simulation of SpaceX's Mars Rocket Engine - YouTube ] |
| 17:32.31 | vasc | animated |
| 17:33.14 | vasc | is fascinated with grids. |
| 17:38.53 | *** join/#brlcad KimK_ (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) | |
| 18:33.13 | vasc | mdtwenty[m], the code changes you made to branches/opencl seem to be fine at a glance. |
| 18:40.24 | Notify | 03BRL-CAD Wiki:Mariomeissner * 10157 /wiki/User:Mariomeissner/logs: |
| 18:40.34 | mdtwenty[m] | I noticed that some code from my previous patches still had indentation issues, so I fixed that and used it to make my first commit, like a test commit :) |
| 18:41.40 | mdtwenty[m] | the code to skip unevaluated partitions should be okay, I did some testing and it iterates only over the evaluated partitions, although that didn't make that much of a difference in performance |
| 18:43.36 | vasc | i remembered now that it might also be possible to reuse the forward_pp in the partitions, so next_evalpp might not be necessary. |
| 18:44.31 | vasc | in rt_boolfinal. |
| 18:45.11 | vasc | you just need to cache the partitions[current_index].forw_pp in a variable, which could be named next_index, before writing over the value. |
| 18:45.43 | vasc | and then in the loop you get the next index from that cached variable instead of from the partition |
| 18:47.39 | vasc | it should only make a difference in scenes with a lot of depth complexity. |
| 18:47.52 | vasc | maybe in golliath it makes a difference? |
| 18:49.35 | mdtwenty[m] | hm it has almost the same, but to be fair I only tested with the view az:35 el:25 |
| 18:50.02 | mdtwenty[m] | front view could have a more noticeable difference |
| 18:51.19 | mdtwenty[m] | and I think that it is possible to discard the use of next_evalpp now that you mention it |
| 18:52.08 | mdtwenty[m] | and maybe we dont even need to cache the forw_pp, since I could only update the forw_pp of the last partition evaluated |
| 18:52.13 | vasc | if we can reuse the foward_pp it would save memory in the partitions. |
| 18:53.25 | vasc | well you need to save it before you override it, because otherwise when the for (;;) loop iterates the value won't be correct. |
| 18:54.12 | vasc | uint next_index; |
| 18:54.26 | vasc | for (uint current_index = head; current_index != UINT_MAX; current_index = next_index) { |
| 18:54.35 | vasc | <PROTECTED> |
| 18:54.37 | vasc | <PROTECTED> |
| 18:55.15 | vasc | <PROTECTED> |
| 18:55.20 | vasc | oh i see what you mean. |
| 18:55.30 | vasc | yeah it should be no problem. |
| 18:56.13 | vasc | no need for the variable even. |
| 18:57.50 | mdtwenty[m] | yeah :) I will update it |
| 19:04.06 | vasc | i was also looking at if it was possible to remove the back_pp index by caching the previous index in a variable wherever that is used. |
| 19:04.28 | vasc | a lot of times algorithms with double-linked lists aren't actually necessary. |
| 19:04.49 | mdtwenty[m] | IIRC back_pp was only used in 1 condition in boolweave |
| 19:05.22 | vasc | its used in boolweave and rt_defoverlap |
| 19:07.14 | mdtwenty[m] | oh yes, right |
| 19:07.50 | mdtwenty[m] | but I think that could be cached in a variable |
| 19:11.16 | vasc | i'm not so sure about that.. |
| 19:11.24 | vasc | the boolweave case sure. |
| 19:11.30 | vasc | but i'm not certain about defoverlap. |
| 19:12.00 | vasc | because it doesn't iterate the list using the pointers |
| 19:12.03 | vasc | it uses the ids |
| 19:12.38 | vasc | hm |
| 19:12.43 | vasc | yes it might also be possible |
| 19:13.13 | vasc | by passing it the index to the previous partition via default_multioverlap in boolfinal |
| 19:14.08 | vasc | so that would shave off like 8 bytes in each partition with these two modifications. |
| 19:14.42 | vasc | i also have an idea to shave off like half the memory used in the struct hits in the partitions so. |
| 19:15.08 | vasc | over half actually |
| 19:15.22 | vasc | so the struct partition should become less than half the size. |
| 19:20.12 | vasc | it seems like soon enough we'll have a our first generation system. |
| 19:21.03 | vasc | the only thing which might make things slower is the bitvectors. or not. |
| 19:21.23 | vasc | it think it will depend on how sparse the bitvectors are. |
| 19:21.51 | vasc | i suspect they'll be faster on average than if we used lists. |
| 19:22.09 | vasc | it also makes memory management easier. |
| 19:22.41 | vasc | oh right, nearly forgot |
| 19:23.10 | vasc | we're gonna need to make an array with uints, one uint per solid, where the uint has the id of the first region. |
| 19:23.23 | vasc | so i can shade the materials in single-hit mode properly. |
| 19:23.40 | vasc | its only necessary in single-hit mode. |
| 19:24.01 | vasc | you prolly noticed its shading everything white/gray in single-hit mode right now. |
| 19:24.25 | vasc | and i can't remember anything else. |
| 19:24.38 | vasc | but there's plenty of optimizations we can do afterwards. |
| 19:24.50 | mdtwenty[m] | yeah I noticed it, when it wasn't white/gray was every solid it the same color |
| 19:24.52 | Notify | 03BRL-CAD:n_reed * 70035 brlcad/trunk/src/tclscripts/checker/CMakeLists.txt: Add initial check command test script.Currently runs the check Tcl-command in mged (mged -a nu -c) to checkbasic invocation behavior, and also subtraction behavior on two simpleoverlap cases.No args, but finds check.tcl by CHECKER_DIR environment variable.Doesn't run check.sh. Test databases and overlap directories are writtenout by the |
| 19:24.54 | Notify | test script itself. |
| 19:24.55 | vasc | oh, right, i also need to fix the lights. |
| 19:24.56 | Notify | ... |
| 19:28.23 | mdtwenty[m] | did you ever tried to render the havoc scene with your workstation GPU? I was curious about what the difference in time would be |
| 19:29.06 | vasc | it's bad... |
| 19:29.14 | vasc | there's too many roundtrips. |
| 19:29.32 | vasc | those can be fixed, but it requires a reduce algorithm. |
| 19:29.42 | vasc | i think its better not to focus too much on that right now. |
| 19:29.53 | vasc | we can just focus on the CPU performance. |
| 19:30.11 | vasc | i can later give you a piece of code with reduce for you to try out the difference. |
| 19:30.30 | vasc | but the license might not be compatible for wider distribution. |
| 19:30.41 | vasc | let's see... |
| 19:30.53 | vasc | yeah i got that here. |
| 19:32.10 | vasc | ah we don't use the reduce, prolly just scan. |
| 19:32.37 | vasc | the reduce is a sum, and the scan is prefix sums. |
| 19:32.47 | vasc | https://en.wikipedia.org/wiki/Prefix_sum |
| 19:32.48 | i5o | [ Prefix sum - Wikipedia ] |
| 19:34.03 | vasc | so we can compute the offsets quickly on the GPU side without doing memory transfers. |
| 19:36.00 | vasc | i.e. this code: |
| 19:36.01 | vasc | <PROTECTED> |
| 19:36.01 | vasc | <PROTECTED> |
| 19:36.01 | vasc | <PROTECTED> |
| 19:36.01 | vasc | BU_ASSERT((counts[i-1] % 2) == 0); |
| 19:36.01 | vasc | h[i] = h[i-1] + counts[i-1]/2;/* number of segs is half the number of hits */ |
| 19:36.03 | vasc | if (counts[i-1]/2 > max_depth) |
| 19:36.05 | vasc | <PROTECTED> |
| 19:36.07 | vasc | <PROTECTED> |
| 19:36.09 | vasc | in clt_Frame() |
| 19:36.46 | vasc | oh... except it does two things it computes the max and it does the prefix sums. |
| 19:36.58 | vasc | the max_depth is a sort of reduce i guess. |
| 19:40.05 | vasc | well enough of that |
| 19:40.42 | vasc | keep me notified of your progress on the ANSI C bool_eval() then mdtwenty[m]. i'll see if can work on the lights this week. |
| 19:41.23 | vasc | <PROTECTED> |
| 19:41.38 | mdtwenty[m] | Yeah I will update you as soon as I have some progress |
| 19:41.51 | vasc | ok see you later! |
| 19:42.08 | mdtwenty[m] | see you later! |
| 19:44.09 | *** part/#brlcad mdtwenty[m] (mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh) | |
| 19:59.45 | *** part/#brlcad mdtwenty (bc519789@gateway/web/freenode/ip.188.81.151.137) | |
| 20:05.54 | *** join/#brlcad mdtwenty[m] (mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh) | |
| 20:15.04 | *** join/#brlcad yukonbob (~bch@S0106c8fb2656f971.ok.shawcable.net) | |
| 20:15.09 | yukonbob | hello #brlcad |
| 22:10.05 | *** join/#brlcad yukonbob (~bch@S0106c8fb2656f971.ok.shawcable.net) | |