IRC log for #brlcad on 20170801

00:17.03 Notify 03BRL-CAD:mdtwenty * 70033 (brlcad/branches/opencl/include/rt/shoot.h brlcad/branches/opencl/src/librt/primitives/bool.cl and 3 others): skipping unevaluated partitions
00:18.06 *** join/#brlcad https_GK1wmSU (~deep-book@169.55.27.131)
00:19.05 *** part/#brlcad https_GK1wmSU (~deep-book@169.55.27.131)
00:38.53 *** join/#brlcad sqrosuofcvzwzywq (~armin@dslb-094-216-164-255.094.216.pools.vodafone-ip.de)
00:41.13 Notify 03BRL-CAD:Marco-domingues * 10149 /wiki/User:Marco-domingues/GSoC17/Log: /* 31 July */
02:13.37 *** join/#brlcad yukonbob (~bch@S0106b81619e90b96.ok.shawcable.net)
02:13.40 yukonbob hello #brlcad
02:43.36 *** join/#brlcad https_GK1wmSU (~deep-book@77.234.42.183)
02:44.37 *** part/#brlcad https_GK1wmSU (~deep-book@77.234.42.183)
03:34.33 *** join/#brlcad gabbar1947 (uid205515@gateway/web/irccloud.com/x-uthcarhicquzqbhn)
05:55.07 *** join/#brlcad merzo (~merzo@20-21-132-95.pool.ukrtel.net)
06:45.42 *** join/#brlcad Caterpillar (~caterpill@unaffiliated/caterpillar)
07:14.18 *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:8dd6:e12e:6013:7834)
07:15.00 *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:8dd6:e12e:6013:7834)
07:18.05 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
07:27.36 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
07:53.52 *** join/#brlcad Stragus_ (~alexis@modemcable131.0-202-24.mc.videotron.ca)
08:14.30 *** join/#brlcad merzo (~merzo@user-94-45-58-139.skif.com.ua)
08:25.19 *** join/#brlcad Caterpillar (~caterpill@unaffiliated/caterpillar)
08:34.02 *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675)
08:39.31 *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675)
08:42.54 *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675)
08:51.44 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
09:11.44 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
09:15.52 *** join/#brlcad KimK_ (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675)
09:17.31 *** join/#brlcad KimK_ (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675)
09:21.01 *** join/#brlcad KimK_ (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675)
09:32.36 *** join/#brlcad gabbar1947 (uid205515@gateway/web/irccloud.com/x-ltbxxbrtphejfhrl)
09:33.46 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
10:30.52 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
11:55.55 Notify 03BRL-CAD:Amritpal singh * 10150 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */
12:02.13 Notify 03BRL-CAD:Amritpal singh * 0 /wiki/File:Bentrebar.png:
12:07.20 Notify 03BRL-CAD:Amritpal singh * 0 /wiki/File:PopimageUshapeRebar.png:
12:08.05 Notify 03BRL-CAD:Amritpal singh * 10153 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */
12:09.25 Notify 03BRL-CAD:Amritpal singh * 10154 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */
12:20.59 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
13:22.06 *** join/#brlcad gabbar1947 (uid205515@gateway/web/irccloud.com/x-uzhthiaatgjhxdcu)
13:35.38 Notify 03BRL-CAD:mdtwenty * 70034 brlcad/branches/opencl/src/librt/primitives/rt.cl: Fix bug in partitions shading. Shading closest partition
13:39.29 *** join/#brlcad yorik (~yorik@2804:431:f720:4929:290:f5ff:fedc:3bb2)
13:39.45 *** join/#brlcad vasc (~vasc@bl4-147-121.dsl.telepac.pt)
13:41.09 Notify 03BRL-CAD:Marco-domingues * 0 /wiki/File:Shade_closest_partition.png:
13:48.37 Notify 03BRL-CAD:Marco-domingues * 10156 /wiki/User:Marco-domingues/GSoC17/Log: /* 1 August */
14:16.26 *** join/#brlcad mdtwenty (bc519789@gateway/web/freenode/ip.188.81.151.137)
14:21.57 vasc hey mdtwenty
14:22.07 mdtwenty hello!
14:22.38 *** join/#brlcad mdtwenty[m] (mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh)
14:23.09 vasc once you put the changes to skip unevaluated partitions and the bug fix for the closest partition on svn then tell me so i can look at the code
14:23.15 vasc about bool_eval()...
14:23.35 vasc did you see patch i posted on the patches tracker?
14:23.51 vasc <PROTECTED>
14:24.22 mdtwenty[m] I have commited the changes already in the opencl branch
14:24.58 vasc https://sourceforge.net/p/brlcad/patches/473/
14:24.58 i5o [ BRL-CAD / Patches / #473 Eliminate gotos from bool_eval() ]
14:25.34 vasc apply this patch to brlcad trunk/ and try running your tests on that to see what's the performance difference vs the old code.
14:25.59 vasc if that's fast enough, then we can work on a simplified prototype in ANSI C to port to OpenCL later.
14:26.22 vasc with a linearized binary tree in an array, instead of storing the expression tree in rpn format.
14:26.32 vasc so the tree would be in infix notation.
14:27.08 vasc it shouldn't even use more memory if we do it right.
14:27.43 mdtwenty[m] hm okay I will apply the patch and test it
14:28.03 vasc ok. i'll try to advance the pseudocode for the solution to speed this up.
14:28.58 mdtwenty[m] yeah I think most of the bugs with the csg code are fixed now
14:31.19 mdtwenty[m] the images match the results from the ANSI C code, although some scenes have missing geometry from the primitives that are not supported yet (goliath.g is one example of this, where some pipes are missing)
14:31.33 vasc i'm gonna need to know the first material in a solid so i can shade the primitives properly in the one-hit shader though.
14:31.45 vasc yeah pipe is used quite often.
14:32.15 vasc but the code is kinda more complicated than that of regular primitive. that's why i didn't ask you to work on it.
14:32.28 vasc pipe is a composite primitive.
14:32.44 vasc its a list of segments of things.
14:33.37 vasc it would distract from the purpose of getting the csg working.
14:33.51 vasc let's see...
14:34.16 vasc i need some data structure.
14:34.17 mdtwenty[m] hm yes I remember you saying that porting that primitive would take some work
14:35.21 vasc in case you're curious you can look at src/librt/primitives/pipe/pipe_shot.c or something like that.
14:36.04 vasc https://svn.code.sf.net/p/brlcad/code/brlcad/trunk/src/librt/primitives/pipe/pipe.c
14:36.53 vasc if you do a 'wc -l' to get a linecount on the pipe.c file vs, say, ehy.c you would see that's it's a lot more complex.
14:37.10 vasc anyway...
14:38.07 vasc implementing bool_eval() is easier than that.
14:38.15 vasc seriously.
14:40.36 mdtwenty[m] hm yes lets first optimize the code and then I can look into porting some more primitives into OpenCL :)
14:42.58 vasc nah.
14:43.05 vasc we want to optimize this to the max.
14:43.12 vasc screw the primitives.
14:43.22 vasc we only have one month.
14:44.01 mdtwenty[m] ah yes, I meant eventually after the gsoc period
14:44.16 vasc maybe next year. :-)
14:45.01 vasc and mind you, implementing pipe would be a breeze compared to the other main ticket item left to do... the brep intersector.
14:45.44 vasc gah.
14:45.49 vasc that's another gsoc or two.
14:46.36 vasc if we get all that working, i think this will have a more complete GPU intersector than ANYTHING ELSE in the market right now.
14:47.01 vasc if it happens.
14:47.16 vasc anyway...
14:51.27 vasc we'll use something like this bin tree representation
14:51.32 vasc https://github.com/mmp/pbrt-v2/blob/master/src/core/kdtree.h
14:51.33 i5o [ pbrt-v2/kdtree.h at master · mmp/pbrt-v2 · GitHub ]
14:52.13 vasc that's for a spatial kd-tree, but the data structure is more or less the same a bintree. in fact ours is simpler because we don't need to store as much data on the nodes.
14:59.31 mdtwenty[m] okay, IIRC the idea is to store the jumps in the data structure, along with the boolean tree information?
15:00.14 vasc yes.
15:00.23 vasc this is the data structure the code uses now:
15:00.37 vasc union tree {
15:00.37 vasc <PROTECTED>
15:00.37 vasc <PROTECTED>
15:00.37 vasc <PROTECTED>
15:00.37 vasc <PROTECTED>
15:00.38 vasc <PROTECTED>
15:00.40 vasc <PROTECTED>
15:00.42 vasc <PROTECTED>
15:00.44 vasc <PROTECTED>
15:00.46 vasc <PROTECTED>
15:00.48 vasc <PROTECTED>
15:00.50 vasc <PROTECTED>
15:00.52 vasc <PROTECTED>
15:00.54 vasc <PROTECTED>
15:00.56 vasc <PROTECTED>
15:00.58 vasc <PROTECTED>
15:01.00 vasc <PROTECTED>
15:01.02 vasc <PROTECTED>
15:01.04 vasc <PROTECTED>
15:01.06 vasc <PROTECTED>
15:01.08 vasc <PROTECTED>
15:01.10 vasc <PROTECTED>
15:01.12 vasc <PROTECTED>
15:01.14 vasc <PROTECTED>
15:01.16 vasc <PROTECTED>
15:01.18 vasc <PROTECTED>
15:01.21 vasc <PROTECTED>
15:01.23 vasc <PROTECTED>
15:01.25 vasc <PROTECTED>
15:01.27 vasc <PROTECTED>
15:01.29 vasc <PROTECTED>
15:01.31 vasc <PROTECTED>
15:01.33 vasc <PROTECTED>
15:01.35 vasc <PROTECTED>
15:01.37 vasc };
15:01.39 vasc and this is what we'll use: (pseudo-code)
15:01.41 vasc / assume NOPs have been removed and all leafs are SOLIDs.
15:01.43 vasc / pseudo-code.
15:01.45 vasc struct tree {
15:01.47 vasc <PROTECTED>
15:01.49 vasc <PROTECTED>
15:01.51 vasc <PROTECTED>
15:01.53 vasc <PROTECTED>
15:01.55 vasc <PROTECTED>
15:01.57 vasc <PROTECTED>
15:01.59 vasc <PROTECTED>
15:02.01 vasc <PROTECTED>
15:02.03 vasc <PROTECTED>
15:02.05 vasc };
15:02.07 vasc ta-da.
15:02.09 vasc a linearized array binary tree.
15:02.24 vasc where nodes are either branches with binary ops (has_lchild = 1) or leafs with solids (has_lchild = 0)
15:03.10 vasc the current code is a horrible, horrible waste of memory and has lotsa pointer chasing.
15:03.19 vasc we're gonna fix that.
15:04.06 Stragus Having the whole acceleration structure as a big linear blob of memory is good, and you can use relative offsets to jump around
15:05.06 Stragus 16 or 32 bits with a << 4 is pretty good. 32 bits is easier, 16 bits involves a complicated shuffling/ordering solution
15:05.08 vasc right. when it's a branch, left child is always at current_offset+1, and the right child is that current_offset+tree.rchild. or was that tree.rchild. i don't remember.
15:05.24 Stragus Ah, or that
15:06.09 vasc so each tree node should fit into 32-bits.
15:06.18 vasc 4 bytes.
15:06.44 vasc instead of... 4*2 + 8*3 bytes.
15:06.56 vasc 32
15:07.12 vasc 8x smaller.
15:07.22 vasc and not as much pointer chasing.
15:07.29 vasc so
15:07.42 vasc that's the whole idea.
15:08.09 vasc now the small detail of getting it to work.
15:08.29 vasc "small detail"
15:08.52 Stragus *nods* My raytracers always used a linear blob of memory for each graph. Another wonderful detail is that you can save the blob to disk and use "as is", you can use on both CPU and GPU directly, it just works
15:09.24 vasc yes. we'll use a simple recursive function the CPU to linearize this tree, and then we'll just copy that to the GPU.
15:10.33 Stragus Linearizing on CPU would be a good idea too
15:10.46 Stragus (and it makes things easier, same data structure everywhere)
15:18.59 vasc meh... the code for the kd-tree in pbrt is too complicated. i think I also saw this in Wald's thesis someplace.
15:21.47 vasc I think I first read about this in Gordon Stoll's SIGGRAPH presentation.
15:22.48 vasc hmmm...
15:23.09 vasc i probably has some of my own code someplace with this.
15:23.11 vasc have
15:24.00 mdtwenty[m] I've just tested your patch (https://sourceforge.net/p/brlcad/patches/473/) and it seems to match the current times with the version on the trunk
15:24.01 i5o [ A 404 Error has Occurred ]
15:24.13 vasc ok, but what about the times?
15:24.18 vasc oh
15:24.21 vasc great.
15:24.34 vasc so we got ourselves an algorithm without gotos.
15:24.36 mdtwenty[m] goliath is the same (0.33 vs 0.33)
15:24.47 vasc which can't be used in OpenCL.
15:25.21 vasc now we need an algorithm without gotos, wasted memory, needless pointer chasing, and no dynamic memory.
15:25.32 mdtwenty[m] havoc (0.62 new version vs 0.63 in the trunk)
15:25.33 vasc that takes care of the first problem.
15:25.35 mdtwenty[m] so no big difference
15:25.47 vasc yeah, that could just be measuring error.
15:25.53 vasc should be anyway.
15:26.33 vasc you wanna code the ANSI C prototype?
15:27.09 mdtwenty[m] yeah I can give it a try!
15:27.13 vasc we basically need to code a tree builder and a tree traverser.
15:31.06 vasc i was trying to find some of my old code so you could use that as a base.
15:33.26 vasc ah well.
15:33.46 vasc dunno if it survived my HD crash 3 years ago.
15:37.24 vasc meh. finding mostly code snippets of bin trees accessed with shifts.
15:37.37 vasc but that means it has to be a balanced bin tree, or it wastes a lot of memory so...
15:38.52 vasc anyway
15:41.12 vasc https://pastebin.com/R4SZpDMD
15:41.13 i5o [ [C] #define OP_UNION 0 /**< @brief Binary: L union R */ #define OP_INTERS - Pastebin.com ]
15:41.15 vasc this should help
15:41.42 vasc i don't want you to waste a lot of time reinventing things that have already been invented.
15:41.53 vasc so we can use the remaining time to actually invent new stuff.
15:41.57 vasc :-)
15:43.49 vasc start by building the tree builder, that converts a trunk/ union tree into that array of uints with the layout i mentioned before, and a function to print it to examine the output.
15:44.38 vasc you can use the KdTree() constructor and recursiveBuild() function I put in the pastebin from PBRT as a reference.
15:45.04 vasc but you'll need to use shifts and things like that to store things in 32-bits.
15:45.19 vasc because OpenCL does not have support for bitfields in structs either.
15:45.23 vasc gotta love OpenCL.
15:46.08 Stragus Managing the bits yourself sometimes allows producing better code than with bitfields anyway
15:46.18 vasc https://stackoverflow.com/questions/9033947/why-are-bitfields-not-allowed-in-opencl
15:46.19 i5o [ bit fields - Why are bitfields not allowed in OpenCL? - Stack Overflow ]
15:47.14 vasc ah so it's because of endianess? bah.
15:47.29 vasc then what about ints? those would need to be converted too...
15:47.36 vasc scratches his head.
15:48.01 vasc so it's just because of padding it seems.
15:49.18 vasc anyway the idea is the first bit says if its a branch or a leaf. if it's a branch the next 3 bits define the binary OP, and the remaining bits where right child is.
15:49.37 vasc if it's a leaf, then the other 30 bits have the solid id.
15:49.39 vasc that's it.
15:49.49 vasc 31 bits actually
15:50.15 vasc i hope 2G solids are enough.
15:51.22 vasc and like 256 M bool nodes.
15:51.37 vasc if it's not enough we can use 64 bits.
15:53.49 vasc mdtwenty[m], thx for conducting the tests.
15:53.59 vasc i'll look at your code now.
15:54.27 mdtwenty[m] no worries :)
15:54.45 mdtwenty[m] i think I got the idea.. I will start working on this and see what I can get
15:55.40 vasc sure. if your ANSI C prototype code turns out to be faster than the one in trunk/ then we can ask Sean to test it and maybe we'll use it in the main ANSI C path as well.
15:56.10 vasc it might be because the nodes are going to be smaller and the memory accesses more coherent.
15:56.33 vasc although i haven't looked at how the code allocates the nodes yet.
16:00.17 vasc you can use the RPN bool patch I made way back when to see which functions you need to implement to completely replace the existing boolean trees in trunk/.
16:02.04 mdtwenty[m] okay thanks! I'll keep that in mind
16:02.44 vasc but like i said, start with the tree builder and the print function.
16:02.56 vasc to debug.
16:13.22 vasc duh... ok, i see. the existing code has memory pool for tree nodes if the nodes get deleted and reallocated. but fresh nodes are just allocated with posix_memalign() or calloc()...
16:13.33 vasc yeah that should make them memory coherent. not.
16:13.53 vasc well maybe if the malloc implementation is smart. which it usually isn't.
16:15.19 *** join/#brlcad Caterpillar2 (~caterpill@unaffiliated/caterpillar)
16:15.44 vasc yes... the results of changing this code should prove quite interesting.
16:16.23 vasc notice how my code which always evaluates the entire expression actually was either the same speed or faster in the simpler expressions.
16:17.00 mdtwenty[m] i hope so!
16:17.55 mdtwenty[m] one thing that i noticed the other day was that the max rpn tree lenght in the havoc scene was 583 IIRC
16:19.16 vasc yeah that doesn't seem so good.
16:19.31 vasc well you know there are several possible strategies to optimize the tree evaluator.
16:21.47 vasc say a solid is never intersected by any ray. so why evaluate that for every partition or whatever?
16:22.10 vasc it's possible to optimize the tree in runtime or prior to render.
16:22.49 Stragus A linear blob with small offsets rathre than pointers, all packed in memory, will definitely be faster on the CPU
16:23.33 Stragus And you can use something like a 3D space filling curve to sort data so it's spatially close, better filling cache lines. I remember getting a few extra percents from that
16:25.13 vasc yeah none of the buffers are in z-order. that likely has a negative impact.
16:25.39 vasc i also need to reduce the amount of memory used in between stages.
16:26.48 Stragus If you ever want some pretty fast space filling curve code: http://www.rayforce.net/ccspacecurve.c http://www.rayforce.net/ccspacecurve.h
16:27.01 vasc about using SoA vs AoS...
16:28.07 vasc once the code is cleaned up we might try that.
16:28.29 vasc but right now i'm more concerned about thread divergence and poor usage of capacity.
16:28.47 Stragus Yup, that's really something for the final touch
16:28.57 Stragus (I gained 2-3% with that, woohoo)
16:29.39 vasc i had like order of magnitude improvements once because full GPU units weren't being used because of divergence.
16:30.00 vasc that one can really bite.
16:30.42 Stragus Instead of testing 32 rays against a single sector or primitive, have you considered one warp testing one ray against 32 sectors or primitives?
16:30.58 Stragus For highly incoherent rays, that seems like a better idea than trying to make coherent non-diverging bundles
16:31.27 Stragus (replace the 32 with 4 or 8 on CPUs, 64 on AMD, etc.)
16:32.00 vasc well right now the code only does primaries. well it's generic enough to be used with any rays, but those are the only rays the generator generates.
16:32.39 vasc i've had a lot of luck with sorting rays to decrease divergence.
16:33.19 Stragus Really?... Sorting rays either eat through your global memory banwidth, or you can't sort enough in on-chip shared memory
16:33.37 Stragus For me, it was always faster to just trace incoherently than to bother sorting them
16:33.39 vasc in my own engine i actually fudged the RNG for the ambient occlusion rays so that those are more coherent. but don't tell anyone that... :-)
16:33.55 Stragus But that could change if the acceleration structure was designed to test one ray against 32 primitives, instead of the other way around
16:34.06 Stragus Eh, I did the same thing
16:34.12 vasc ahah
16:35.02 vasc well the sorting is useful if you have a fast sort.
16:35.09 vasc did you try using the CUB sort?
16:35.20 Stragus The problem wasn't the fast sort, the problem was the global memory banwdith
16:35.28 vasc oh that?
16:35.33 Stragus I could trace incoherent rays for the same amount of bandwidth as sorting them
16:35.34 vasc you use compress-sort-decompress.
16:36.29 vasc have you read this paper?
16:36.30 vasc http://www.keldysh.ru/pages/cgraph/articles/dep20/publ2010/GPU-RayTracing.pdf
16:36.32 Stragus Assuming rays bouncing everywhere, how do you compress the origin, vector and other contextual information about that ray's purpose?
16:36.51 vasc look at that paper. :-)
16:37.29 Stragus The "compression" is for the sort, sure
16:37.52 Stragus But the write,read,write,read for bundling rays takes a lot of bandwidth
16:39.44 Stragus Considering the typical incoherent ray for me had to touch about ~17 cache lines of 32 bytes...
16:39.51 Stragus (if I remember the metric correctly)
16:40.17 vasc hm
16:40.21 vasc well you know
16:40.37 vasc for my own render pipeline i only use that for incoherent rays.
16:41.19 vasc also i think it's not just a problem of memory coherence but of scheduling as well
16:41.38 vasc like, adjacent rays typically have similar complexity.
16:42.12 Stragus Right, less waste from threads having nothing to do
16:42.43 Stragus Well, like I said, if I had to trace a ton of incoherent rays again, I would just use a different acceleration structure
16:42.52 Stragus Test a single ray against batches of 32 triangles instead
16:46.21 vasc i once read a poster by Yoon. he basically ran all the intersections against simplified geometry (like a convex hull of the geometry) to get an estimate for the hit points.
16:46.34 vasc sorted those in z-order. and then used that to schedule the rays.
16:46.52 vasc it was an out-of-core renderer. so it made even more sense there.
16:47.24 vasc actually he used a simplified mesh, not the convex hull now that i think of it.
16:48.04 vasc the ray against triangles... well.
16:48.14 vasc the thing is... triangles are "fatter" than rays.
16:49.08 Stragus But it's a lot easier to make coherent bundles of 32 triangles than 32 rays ;)
16:49.25 Stragus (assuming global illumination-like incoherent tracing)
16:49.45 vasc 3 vertexes with 3 coordinates of 4 bytes vs 2 vertexes with 3 coordinates.
16:49.53 vasc and that's not even assuming the rays have same origin.
16:50.32 Stragus Rays also need to track contextual information, so you can accumulate back to whatever bounce it came from
16:50.55 vasc sure an id can take care of that.
16:50.58 Stragus (assuming you short and shuffle them)
16:51.02 Stragus sort*
16:51.57 Stragus I don't know, there's a lot of potential for an acceleration structure designed to minimize the count of bundles of 32 triangles being tested (in the global cost function, 1 triangle in a sector has the same cost as 32)
16:53.08 vasc i think i mentioned this in some slides for a class i gave sometime.
16:53.43 vasc i was showing how you can reduce the amount of math ops in the intersections when testing with same ray origin vs same triangles and the like.
16:54.08 vasc i remember that was also less nice for the many triangle approach.
16:54.29 vasc thing is, when the many rays don't have the same origin though, then it changes.
16:55.15 Stragus Right sure, highly coherent rays (including with same origin) is a different problem with a different optimal solution
16:57.39 vasc https://drive.google.com/file/d/0B85Rkmt7rnCTWGVNUlhaZzZyQkU/view?usp=sharing
16:57.40 i5o [ aula.pdf - Google Drive ]
16:57.55 vasc page 6.
16:58.02 vasc i gave that class like 5 years ago.
16:58.59 vasc i remember i analyzed that as i thought it was interesting. and ran some tests. then i gave it up. :-)
16:59.31 vasc oh... what was that other thing...
16:59.38 vasc right.
16:59.53 vasc ray-triangle fan and ray-triangle strip optimized intersection routines.
16:59.59 vasc i was looking at those back then too.
17:00.34 vasc i only found like one decent paper for each one.
17:00.58 vasc it's another way to reduce the bandwidth required for the triangles right?
17:03.10 vasc so i looked at this one: http://gamma.cs.unc.edu/RS/paper_rt07.pdf
17:03.31 vasc but i dunno it seemed kinda lame to me, that you needed to build a BSP
17:04.32 vasc actually an skd-tree but whatever.
17:05.31 vasc it seemed lame to me. to use axis aligned separators.
17:05.44 vasc you already have the planes defined by the triangles or whatever.
17:06.00 vasc so i also ignored that idea.
17:06.09 *** join/#brlcad ``Erik (~erik@pool-100-16-14-17.bltmmd.fios.verizon.net)
17:11.07 vasc there's all sorts of tricks you can ues.
17:11.09 vasc use.
17:11.41 vasc like... you know how they say for primary rays since the origin is always the same, only store the direction? well i only stored the index used to generate the ray.
17:11.58 vasc then i have generator functions for rays. primaries, shadows, ambient occlusion the works.
17:12.03 vasc so the engine was all specialized.
17:13.37 vasc ah whatever. i'm not gonna pick that up again.
17:13.50 vasc i've decided if i ever rewrite it, i'm gonna do everything differently.
17:14.43 vasc so Stragus, are many-triangle intersections in fashion again?
17:17.43 Stragus I have no idea what's in fashion. I almost never read papers, I just have fun ;)
17:18.11 vasc every once in a while something comes out that makes a difference. that's why i read papers.
17:18.29 vasc its usually not a new idea, but, the twist makes the difference.
17:18.39 Stragus And really, I would recommend using a graph rather than a tree
17:18.49 vasc for the geometry right?
17:18.52 Stragus A tree implies a stack, which implies per-thread stores and loads, and that's terrible
17:19.00 Stragus For the space partitioning, yes
17:19.50 vasc isn't that like a tree with ropes like they usually call it? well at least that's how Havran calls it.
17:20.19 Stragus No idea if there's a fancy term for it. But what I use is a graph, sectors connecting to other sectors, there's no tree
17:20.56 vasc then how's the partitioning done? does it still use split planes somehow?
17:21.03 Stragus The idea being to avoid all per-thread memory stores, until you have your intersection point and you need to do something with it
17:21.44 Stragus I use axis-aligned sectors, which can overlap in space, and each facet of a sector can either connect to another sector or a "node" with a splitting plane to decide where to go next
17:22.13 Stragus There are other ways, that's not ideal for the one-ray-against-many-triangles approach
17:22.25 Stragus But just avoid a per-thread stack, no matter how you do it
17:22.39 vasc i've read something similar in papers from the 1980s but i can't say i remember other people working on that. it seemed like a cool idea when i read it.
17:23.30 vasc back then a lot of people were trying to get ray tracing to scale in MPP architectures.
17:24.13 vasc like Transputers or that Intel machine... the one with i860s.
17:24.21 Stragus Tracking a stack in L1 on many-cores no-SIMD chips is no big deal
17:24.35 Stragus But the situation is different with GPUs and modern wide SIMD
17:24.45 vasc https://en.wikipedia.org/wiki/Intel_iPSC#iPSC.2F860
17:24.46 i5o [ Intel iPSC - Wikipedia ]
17:25.13 vasc the Intel Paragon.
17:25.45 vasc maan... they spent a lot of time trying to minimize inter thread communications and ops.
17:26.24 vasc well i dunno if those processors actually had caches.
17:26.26 Stragus Usually, for massively parallel software, you just write code to _avoid_ requiring low latency inter-thread/device communication
17:27.14 Stragus Like for computational fluid dynamics, it's better to compute the same cells redundantely while you receive results from other devices from two time steps ago
17:27.27 vasc ah right. the i860 had caches but early Transputers did not.
17:28.12 vasc ah CFD. i always wondered how they implemented the data structures for the multi-grid solvers with AMR in those.
17:29.00 vasc ever seen this?
17:29.01 vasc https://www.youtube.com/watch?v=vYA0f6R5KAI
17:29.03 i5o [ GPUs to Mars: Full-Scale Simulation of SpaceX's Mars Rocket Engine - YouTube ]
17:29.03 Stragus AMR?
17:29.13 vasc Adaptive Mesh Refinement
17:29.34 vasc https://en.wikipedia.org/wiki/Adaptive_mesh_refinement
17:29.34 i5o [ Adaptive mesh refinement - Wikipedia ]
17:32.29 vasc https://youtu.be/vYA0f6R5KAI?t=1510
17:32.31 i5o [ GPUs to Mars: Full-Scale Simulation of SpaceX's Mars Rocket Engine - YouTube ]
17:32.31 vasc animated
17:33.14 vasc is fascinated with grids.
17:38.53 *** join/#brlcad KimK_ (~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675)
18:33.13 vasc mdtwenty[m], the code changes you made to branches/opencl seem to be fine at a glance.
18:40.24 Notify 03BRL-CAD Wiki:Mariomeissner * 10157 /wiki/User:Mariomeissner/logs:
18:40.34 mdtwenty[m] I noticed that some code from my previous patches still had indentation issues, so I fixed that and used it to make my first commit, like a test commit :)
18:41.40 mdtwenty[m] the code to skip unevaluated partitions should be okay, I did some testing and it iterates only over the evaluated partitions, although that didn't make that much of a difference in performance
18:43.36 vasc i remembered now that it might also be possible to reuse the forward_pp in the partitions, so next_evalpp might not be necessary.
18:44.31 vasc in rt_boolfinal.
18:45.11 vasc you just need to cache the partitions[current_index].forw_pp in a variable, which could be named next_index, before writing over the value.
18:45.43 vasc and then in the loop you get the next index from that cached variable instead of from the partition
18:47.39 vasc it should only make a difference in scenes with a lot of depth complexity.
18:47.52 vasc maybe in golliath it makes a difference?
18:49.35 mdtwenty[m] hm it has almost the same, but to be fair I only tested with the view az:35 el:25
18:50.02 mdtwenty[m] front view could have a more noticeable difference
18:51.19 mdtwenty[m] and I think that it is possible to discard the use of next_evalpp now that you mention it
18:52.08 mdtwenty[m] and maybe we dont even need to cache the forw_pp, since I could only update the forw_pp of the last partition evaluated
18:52.13 vasc if we can reuse the foward_pp it would save memory in the partitions.
18:53.25 vasc well you need to save it before you override it, because otherwise when the for (;;) loop iterates the value won't be correct.
18:54.12 vasc uint next_index;
18:54.26 vasc for (uint current_index = head; current_index != UINT_MAX; current_index = next_index) {
18:54.35 vasc <PROTECTED>
18:54.37 vasc <PROTECTED>
18:55.15 vasc <PROTECTED>
18:55.20 vasc oh i see what you mean.
18:55.30 vasc yeah it should be no problem.
18:56.13 vasc no need for the variable even.
18:57.50 mdtwenty[m] yeah :) I will update it
19:04.06 vasc i was also looking at if it was possible to remove the back_pp index by caching the previous index in a variable wherever that is used.
19:04.28 vasc a lot of times algorithms with double-linked lists aren't actually necessary.
19:04.49 mdtwenty[m] IIRC back_pp was only used in 1 condition in boolweave
19:05.22 vasc its used in boolweave and rt_defoverlap
19:07.14 mdtwenty[m] oh yes, right
19:07.50 mdtwenty[m] but I think that could be cached in a variable
19:11.16 vasc i'm not so sure about that..
19:11.24 vasc the boolweave case sure.
19:11.30 vasc but i'm not certain about defoverlap.
19:12.00 vasc because it doesn't iterate the list using the pointers
19:12.03 vasc it uses the ids
19:12.38 vasc hm
19:12.43 vasc yes it might also be possible
19:13.13 vasc by passing it the index to the previous partition via default_multioverlap in boolfinal
19:14.08 vasc so that would shave off like 8 bytes in each partition with these two modifications.
19:14.42 vasc i also have an idea to shave off like half the memory used in the struct hits in the partitions so.
19:15.08 vasc over half actually
19:15.22 vasc so the struct partition should become less than half the size.
19:20.12 vasc it seems like soon enough we'll have a our first generation system.
19:21.03 vasc the only thing which might make things slower is the bitvectors. or not.
19:21.23 vasc it think it will depend on how sparse the bitvectors are.
19:21.51 vasc i suspect they'll be faster on average than if we used lists.
19:22.09 vasc it also makes memory management easier.
19:22.41 vasc oh right, nearly forgot
19:23.10 vasc we're gonna need to make an array with uints, one uint per solid, where the uint has the id of the first region.
19:23.23 vasc so i can shade the materials in single-hit mode properly.
19:23.40 vasc its only necessary in single-hit mode.
19:24.01 vasc you prolly noticed its shading everything white/gray in single-hit mode right now.
19:24.25 vasc and i can't remember anything else.
19:24.38 vasc but there's plenty of optimizations we can do afterwards.
19:24.50 mdtwenty[m] yeah I noticed it, when it wasn't white/gray was every solid it the same color
19:24.52 Notify 03BRL-CAD:n_reed * 70035 brlcad/trunk/src/tclscripts/checker/CMakeLists.txt: Add initial check command test script.Currently runs the check Tcl-command in mged (mged -a nu -c) to checkbasic invocation behavior, and also subtraction behavior on two simpleoverlap cases.No args, but finds check.tcl by CHECKER_DIR environment variable.Doesn't run check.sh. Test databases and overlap directories are writtenout by the
19:24.54 Notify test script itself.
19:24.55 vasc oh, right, i also need to fix the lights.
19:24.56 Notify ...
19:28.23 mdtwenty[m] did you ever tried to render the havoc scene with your workstation GPU? I was curious about what the difference in time would be
19:29.06 vasc it's bad...
19:29.14 vasc there's too many roundtrips.
19:29.32 vasc those can be fixed, but it requires a reduce algorithm.
19:29.42 vasc i think its better not to focus too much on that right now.
19:29.53 vasc we can just focus on the CPU performance.
19:30.11 vasc i can later give you a piece of code with reduce for you to try out the difference.
19:30.30 vasc but the license might not be compatible for wider distribution.
19:30.41 vasc let's see...
19:30.53 vasc yeah i got that here.
19:32.10 vasc ah we don't use the reduce, prolly just scan.
19:32.37 vasc the reduce is a sum, and the scan is prefix sums.
19:32.47 vasc https://en.wikipedia.org/wiki/Prefix_sum
19:32.48 i5o [ Prefix sum - Wikipedia ]
19:34.03 vasc so we can compute the offsets quickly on the GPU side without doing memory transfers.
19:36.00 vasc i.e. this code:
19:36.01 vasc <PROTECTED>
19:36.01 vasc <PROTECTED>
19:36.01 vasc <PROTECTED>
19:36.01 vasc BU_ASSERT((counts[i-1] % 2) == 0);
19:36.01 vasc h[i] = h[i-1] + counts[i-1]/2;/* number of segs is half the number of hits */
19:36.03 vasc if (counts[i-1]/2 > max_depth)
19:36.05 vasc <PROTECTED>
19:36.07 vasc <PROTECTED>
19:36.09 vasc in clt_Frame()
19:36.46 vasc oh... except it does two things it computes the max and it does the prefix sums.
19:36.58 vasc the max_depth is a sort of reduce i guess.
19:40.05 vasc well enough of that
19:40.42 vasc keep me notified of your progress on the ANSI C bool_eval() then mdtwenty[m]. i'll see if can work on the lights this week.
19:41.23 vasc <PROTECTED>
19:41.38 mdtwenty[m] Yeah I will update you as soon as I have some progress
19:41.51 vasc ok see you later!
19:42.08 mdtwenty[m] see you later!
19:44.09 *** part/#brlcad mdtwenty[m] (mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh)
19:59.45 *** part/#brlcad mdtwenty (bc519789@gateway/web/freenode/ip.188.81.151.137)
20:05.54 *** join/#brlcad mdtwenty[m] (mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh)
20:15.04 *** join/#brlcad yukonbob (~bch@S0106c8fb2656f971.ok.shawcable.net)
20:15.09 yukonbob hello #brlcad
22:10.05 *** join/#brlcad yukonbob (~bch@S0106c8fb2656f971.ok.shawcable.net)

Generated by irclog2html.pl Modified by Tim Riker to work with infobot.