00:17.03 |
Notify |
03BRL-CAD:mdtwenty * 70033
(brlcad/branches/opencl/include/rt/shoot.h
brlcad/branches/opencl/src/librt/primitives/bool.cl and 3 others):
skipping unevaluated partitions |
00:18.06 |
*** join/#brlcad https_GK1wmSU
(~deep-book@169.55.27.131) |
00:19.05 |
*** part/#brlcad https_GK1wmSU
(~deep-book@169.55.27.131) |
00:38.53 |
*** join/#brlcad
sqrosuofcvzwzywq
(~armin@dslb-094-216-164-255.094.216.pools.vodafone-ip.de) |
00:41.13 |
Notify |
03BRL-CAD:Marco-domingues * 10149
/wiki/User:Marco-domingues/GSoC17/Log: /* 31 July */ |
02:13.37 |
*** join/#brlcad yukonbob
(~bch@S0106b81619e90b96.ok.shawcable.net) |
02:13.40 |
yukonbob |
hello #brlcad |
02:43.36 |
*** join/#brlcad https_GK1wmSU
(~deep-book@77.234.42.183) |
02:44.37 |
*** part/#brlcad https_GK1wmSU
(~deep-book@77.234.42.183) |
03:34.33 |
*** join/#brlcad gabbar1947
(uid205515@gateway/web/irccloud.com/x-uthcarhicquzqbhn) |
05:55.07 |
*** join/#brlcad merzo
(~merzo@20-21-132-95.pool.ukrtel.net) |
06:45.42 |
*** join/#brlcad Caterpillar
(~caterpill@unaffiliated/caterpillar) |
07:14.18 |
*** join/#brlcad KimK
(~Kim__@2600:8803:7a81:7400:8dd6:e12e:6013:7834) |
07:15.00 |
*** join/#brlcad KimK
(~Kim__@2600:8803:7a81:7400:8dd6:e12e:6013:7834) |
07:18.05 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
07:27.36 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
07:53.52 |
*** join/#brlcad Stragus_
(~alexis@modemcable131.0-202-24.mc.videotron.ca) |
08:14.30 |
*** join/#brlcad merzo
(~merzo@user-94-45-58-139.skif.com.ua) |
08:25.19 |
*** join/#brlcad Caterpillar
(~caterpill@unaffiliated/caterpillar) |
08:34.02 |
*** join/#brlcad KimK
(~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) |
08:39.31 |
*** join/#brlcad KimK
(~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) |
08:42.54 |
*** join/#brlcad KimK
(~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) |
08:51.44 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
09:11.44 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
09:15.52 |
*** join/#brlcad KimK_
(~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) |
09:17.31 |
*** join/#brlcad KimK_
(~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) |
09:21.01 |
*** join/#brlcad KimK_
(~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) |
09:32.36 |
*** join/#brlcad gabbar1947
(uid205515@gateway/web/irccloud.com/x-ltbxxbrtphejfhrl) |
09:33.46 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
10:30.52 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
11:55.55 |
Notify |
03BRL-CAD:Amritpal singh * 10150
/wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period
*/ |
12:02.13 |
Notify |
03BRL-CAD:Amritpal singh * 0
/wiki/File:Bentrebar.png: |
12:07.20 |
Notify |
03BRL-CAD:Amritpal singh * 0
/wiki/File:PopimageUshapeRebar.png: |
12:08.05 |
Notify |
03BRL-CAD:Amritpal singh * 10153
/wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period
*/ |
12:09.25 |
Notify |
03BRL-CAD:Amritpal singh * 10154
/wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period
*/ |
12:20.59 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
13:22.06 |
*** join/#brlcad gabbar1947
(uid205515@gateway/web/irccloud.com/x-uzhthiaatgjhxdcu) |
13:35.38 |
Notify |
03BRL-CAD:mdtwenty * 70034
brlcad/branches/opencl/src/librt/primitives/rt.cl: Fix bug in
partitions shading. Shading closest partition |
13:39.29 |
*** join/#brlcad yorik
(~yorik@2804:431:f720:4929:290:f5ff:fedc:3bb2) |
13:39.45 |
*** join/#brlcad vasc
(~vasc@bl4-147-121.dsl.telepac.pt) |
13:41.09 |
Notify |
03BRL-CAD:Marco-domingues * 0
/wiki/File:Shade_closest_partition.png: |
13:48.37 |
Notify |
03BRL-CAD:Marco-domingues * 10156
/wiki/User:Marco-domingues/GSoC17/Log: /* 1 August */ |
14:16.26 |
*** join/#brlcad mdtwenty
(bc519789@gateway/web/freenode/ip.188.81.151.137) |
14:21.57 |
vasc |
hey mdtwenty |
14:22.07 |
mdtwenty |
hello! |
14:22.38 |
*** join/#brlcad mdtwenty[m]
(mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh) |
14:23.09 |
vasc |
once you put the changes to skip unevaluated
partitions and the bug fix for the closest partition on svn then
tell me so i can look at the code |
14:23.15 |
vasc |
about bool_eval()... |
14:23.35 |
vasc |
did you see patch i posted on the patches
tracker? |
14:23.51 |
vasc |
<PROTECTED> |
14:24.22 |
mdtwenty[m] |
I have commited the changes already in the
opencl branch |
14:24.58 |
vasc |
https://sourceforge.net/p/brlcad/patches/473/ |
14:24.58 |
i5o |
[ BRL-CAD / Patches / #473 Eliminate gotos
from bool_eval() ] |
14:25.34 |
vasc |
apply this patch to brlcad trunk/ and try
running your tests on that to see what's the performance difference
vs the old code. |
14:25.59 |
vasc |
if that's fast enough, then we can work on a
simplified prototype in ANSI C to port to OpenCL later. |
14:26.22 |
vasc |
with a linearized binary tree in an array,
instead of storing the expression tree in rpn format. |
14:26.32 |
vasc |
so the tree would be in infix
notation. |
14:27.08 |
vasc |
it shouldn't even use more memory if we do it
right. |
14:27.43 |
mdtwenty[m] |
hm okay I will apply the patch and test
it |
14:28.03 |
vasc |
ok. i'll try to advance the pseudocode for the
solution to speed this up. |
14:28.58 |
mdtwenty[m] |
yeah I think most of the bugs with the csg
code are fixed now |
14:31.19 |
mdtwenty[m] |
the images match the results from the ANSI C
code, although some scenes have missing geometry from the
primitives that are not supported yet (goliath.g is one example of
this, where some pipes are missing) |
14:31.33 |
vasc |
i'm gonna need to know the first material in a
solid so i can shade the primitives properly in the one-hit shader
though. |
14:31.45 |
vasc |
yeah pipe is used quite often. |
14:32.15 |
vasc |
but the code is kinda more complicated than
that of regular primitive. that's why i didn't ask you to work on
it. |
14:32.28 |
vasc |
pipe is a composite primitive. |
14:32.44 |
vasc |
its a list of segments of things. |
14:33.37 |
vasc |
it would distract from the purpose of getting
the csg working. |
14:33.51 |
vasc |
let's see... |
14:34.16 |
vasc |
i need some data structure. |
14:34.17 |
mdtwenty[m] |
hm yes I remember you saying that porting that
primitive would take some work |
14:35.21 |
vasc |
in case you're curious you can look at
src/librt/primitives/pipe/pipe_shot.c or something like
that. |
14:36.04 |
vasc |
https://svn.code.sf.net/p/brlcad/code/brlcad/trunk/src/librt/primitives/pipe/pipe.c |
14:36.53 |
vasc |
if you do a 'wc -l' to get a linecount on the
pipe.c file vs, say, ehy.c you would see that's it's a lot more
complex. |
14:37.10 |
vasc |
anyway... |
14:38.07 |
vasc |
implementing bool_eval() is easier than
that. |
14:38.15 |
vasc |
seriously. |
14:40.36 |
mdtwenty[m] |
hm yes lets first optimize the code and then I
can look into porting some more primitives into OpenCL :) |
14:42.58 |
vasc |
nah. |
14:43.05 |
vasc |
we want to optimize this to the max. |
14:43.12 |
vasc |
screw the primitives. |
14:43.22 |
vasc |
we only have one month. |
14:44.01 |
mdtwenty[m] |
ah yes, I meant eventually after the gsoc
period |
14:44.16 |
vasc |
maybe next year. :-) |
14:45.01 |
vasc |
and mind you, implementing pipe would be a
breeze compared to the other main ticket item left to do... the
brep intersector. |
14:45.44 |
vasc |
gah. |
14:45.49 |
vasc |
that's another gsoc or two. |
14:46.36 |
vasc |
if we get all that working, i think this will
have a more complete GPU intersector than ANYTHING ELSE in the
market right now. |
14:47.01 |
vasc |
if it happens. |
14:47.16 |
vasc |
anyway... |
14:51.27 |
vasc |
we'll use something like this bin tree
representation |
14:51.32 |
vasc |
https://github.com/mmp/pbrt-v2/blob/master/src/core/kdtree.h |
14:51.33 |
i5o |
[ pbrt-v2/kdtree.h at master · mmp/pbrt-v2 ·
GitHub ] |
14:52.13 |
vasc |
that's for a spatial kd-tree, but the data
structure is more or less the same a bintree. in fact ours is
simpler because we don't need to store as much data on the
nodes. |
14:59.31 |
mdtwenty[m] |
okay, IIRC the idea is to store the jumps in
the data structure, along with the boolean tree
information? |
15:00.14 |
vasc |
yes. |
15:00.23 |
vasc |
this is the data structure the code uses
now: |
15:00.37 |
vasc |
union tree { |
15:00.37 |
vasc |
<PROTECTED> |
15:00.37 |
vasc |
<PROTECTED> |
15:00.37 |
vasc |
<PROTECTED> |
15:00.37 |
vasc |
<PROTECTED> |
15:00.38 |
vasc |
<PROTECTED> |
15:00.40 |
vasc |
<PROTECTED> |
15:00.42 |
vasc |
<PROTECTED> |
15:00.44 |
vasc |
<PROTECTED> |
15:00.46 |
vasc |
<PROTECTED> |
15:00.48 |
vasc |
<PROTECTED> |
15:00.50 |
vasc |
<PROTECTED> |
15:00.52 |
vasc |
<PROTECTED> |
15:00.54 |
vasc |
<PROTECTED> |
15:00.56 |
vasc |
<PROTECTED> |
15:00.58 |
vasc |
<PROTECTED> |
15:01.00 |
vasc |
<PROTECTED> |
15:01.02 |
vasc |
<PROTECTED> |
15:01.04 |
vasc |
<PROTECTED> |
15:01.06 |
vasc |
<PROTECTED> |
15:01.08 |
vasc |
<PROTECTED> |
15:01.10 |
vasc |
<PROTECTED> |
15:01.12 |
vasc |
<PROTECTED> |
15:01.14 |
vasc |
<PROTECTED> |
15:01.16 |
vasc |
<PROTECTED> |
15:01.18 |
vasc |
<PROTECTED> |
15:01.21 |
vasc |
<PROTECTED> |
15:01.23 |
vasc |
<PROTECTED> |
15:01.25 |
vasc |
<PROTECTED> |
15:01.27 |
vasc |
<PROTECTED> |
15:01.29 |
vasc |
<PROTECTED> |
15:01.31 |
vasc |
<PROTECTED> |
15:01.33 |
vasc |
<PROTECTED> |
15:01.35 |
vasc |
<PROTECTED> |
15:01.37 |
vasc |
}; |
15:01.39 |
vasc |
and this is what we'll use:
(pseudo-code) |
15:01.41 |
vasc |
/ assume NOPs have been removed and all leafs
are SOLIDs. |
15:01.43 |
vasc |
/ pseudo-code. |
15:01.45 |
vasc |
struct tree { |
15:01.47 |
vasc |
<PROTECTED> |
15:01.49 |
vasc |
<PROTECTED> |
15:01.51 |
vasc |
<PROTECTED> |
15:01.53 |
vasc |
<PROTECTED> |
15:01.55 |
vasc |
<PROTECTED> |
15:01.57 |
vasc |
<PROTECTED> |
15:01.59 |
vasc |
<PROTECTED> |
15:02.01 |
vasc |
<PROTECTED> |
15:02.03 |
vasc |
<PROTECTED> |
15:02.05 |
vasc |
}; |
15:02.07 |
vasc |
ta-da. |
15:02.09 |
vasc |
a linearized array binary tree. |
15:02.24 |
vasc |
where nodes are either branches with binary
ops (has_lchild = 1) or leafs with solids (has_lchild =
0) |
15:03.10 |
vasc |
the current code is a horrible, horrible waste
of memory and has lotsa pointer chasing. |
15:03.19 |
vasc |
we're gonna fix that. |
15:04.06 |
Stragus |
Having the whole acceleration structure as a
big linear blob of memory is good, and you can use relative offsets
to jump around |
15:05.06 |
Stragus |
16 or 32 bits with a << 4 is pretty
good. 32 bits is easier, 16 bits involves a complicated
shuffling/ordering solution |
15:05.08 |
vasc |
right. when it's a branch, left child is
always at current_offset+1, and the right child is that
current_offset+tree.rchild. or was that tree.rchild. i don't
remember. |
15:05.24 |
Stragus |
Ah, or that |
15:06.09 |
vasc |
so each tree node should fit into
32-bits. |
15:06.18 |
vasc |
4 bytes. |
15:06.44 |
vasc |
instead of... 4*2 + 8*3 bytes. |
15:06.56 |
vasc |
32 |
15:07.12 |
vasc |
8x smaller. |
15:07.22 |
vasc |
and not as much pointer chasing. |
15:07.29 |
vasc |
so |
15:07.42 |
vasc |
that's the whole idea. |
15:08.09 |
vasc |
now the small detail of getting it to
work. |
15:08.29 |
vasc |
"small detail" |
15:08.52 |
Stragus |
*nods* My raytracers always used a linear blob
of memory for each graph. Another wonderful detail is that you can
save the blob to disk and use "as is", you can use on both CPU and
GPU directly, it just works |
15:09.24 |
vasc |
yes. we'll use a simple recursive function the
CPU to linearize this tree, and then we'll just copy that to the
GPU. |
15:10.33 |
Stragus |
Linearizing on CPU would be a good idea
too |
15:10.46 |
Stragus |
(and it makes things easier, same data
structure everywhere) |
15:18.59 |
vasc |
meh... the code for the kd-tree in pbrt is too
complicated. i think I also saw this in Wald's thesis
someplace. |
15:21.47 |
vasc |
I think I first read about this in Gordon
Stoll's SIGGRAPH presentation. |
15:22.48 |
vasc |
hmmm... |
15:23.09 |
vasc |
i probably has some of my own code someplace
with this. |
15:23.11 |
vasc |
have |
15:24.00 |
mdtwenty[m] |
I've just tested your patch (https://sourceforge.net/p/brlcad/patches/473/)
and it seems to match the current times with the version on the
trunk |
15:24.01 |
i5o |
[ A 404 Error has Occurred ] |
15:24.13 |
vasc |
ok, but what about the times? |
15:24.18 |
vasc |
oh |
15:24.21 |
vasc |
great. |
15:24.34 |
vasc |
so we got ourselves an algorithm without
gotos. |
15:24.36 |
mdtwenty[m] |
goliath is the same (0.33 vs 0.33) |
15:24.47 |
vasc |
which can't be used in OpenCL. |
15:25.21 |
vasc |
now we need an algorithm without gotos, wasted
memory, needless pointer chasing, and no dynamic memory. |
15:25.32 |
mdtwenty[m] |
havoc (0.62 new version vs 0.63 in the
trunk) |
15:25.33 |
vasc |
that takes care of the first
problem. |
15:25.35 |
mdtwenty[m] |
so no big difference |
15:25.47 |
vasc |
yeah, that could just be measuring
error. |
15:25.53 |
vasc |
should be anyway. |
15:26.33 |
vasc |
you wanna code the ANSI C prototype? |
15:27.09 |
mdtwenty[m] |
yeah I can give it a try! |
15:27.13 |
vasc |
we basically need to code a tree builder and a
tree traverser. |
15:31.06 |
vasc |
i was trying to find some of my old code so
you could use that as a base. |
15:33.26 |
vasc |
ah well. |
15:33.46 |
vasc |
dunno if it survived my HD crash 3 years
ago. |
15:37.24 |
vasc |
meh. finding mostly code snippets of bin trees
accessed with shifts. |
15:37.37 |
vasc |
but that means it has to be a balanced bin
tree, or it wastes a lot of memory so... |
15:38.52 |
vasc |
anyway |
15:41.12 |
vasc |
https://pastebin.com/R4SZpDMD |
15:41.13 |
i5o |
[ [C] #define OP_UNION 0
/**< @brief Binary: L union R */ #define OP_INTERS -
Pastebin.com ] |
15:41.15 |
vasc |
this should help |
15:41.42 |
vasc |
i don't want you to waste a lot of time
reinventing things that have already been invented. |
15:41.53 |
vasc |
so we can use the remaining time to actually
invent new stuff. |
15:41.57 |
vasc |
:-) |
15:43.49 |
vasc |
start by building the tree builder, that
converts a trunk/ union tree into that array of uints with the
layout i mentioned before, and a function to print it to examine
the output. |
15:44.38 |
vasc |
you can use the KdTree() constructor and
recursiveBuild() function I put in the pastebin from PBRT as a
reference. |
15:45.04 |
vasc |
but you'll need to use shifts and things like
that to store things in 32-bits. |
15:45.19 |
vasc |
because OpenCL does not have support for
bitfields in structs either. |
15:45.23 |
vasc |
gotta love OpenCL. |
15:46.08 |
Stragus |
Managing the bits yourself sometimes allows
producing better code than with bitfields anyway |
15:46.18 |
vasc |
https://stackoverflow.com/questions/9033947/why-are-bitfields-not-allowed-in-opencl |
15:46.19 |
i5o |
[ bit fields - Why are bitfields not allowed
in OpenCL? - Stack Overflow ] |
15:47.14 |
vasc |
ah so it's because of endianess?
bah. |
15:47.29 |
vasc |
then what about ints? those would need to be
converted too... |
15:47.36 |
vasc |
scratches his
head. |
15:48.01 |
vasc |
so it's just because of padding it
seems. |
15:49.18 |
vasc |
anyway the idea is the first bit says if its a
branch or a leaf. if it's a branch the next 3 bits define the
binary OP, and the remaining bits where right child is. |
15:49.37 |
vasc |
if it's a leaf, then the other 30 bits have
the solid id. |
15:49.39 |
vasc |
that's it. |
15:49.49 |
vasc |
31 bits actually |
15:50.15 |
vasc |
i hope 2G solids are enough. |
15:51.22 |
vasc |
and like 256 M bool nodes. |
15:51.37 |
vasc |
if it's not enough we can use 64
bits. |
15:53.49 |
vasc |
mdtwenty[m], thx for conducting the
tests. |
15:53.59 |
vasc |
i'll look at your code now. |
15:54.27 |
mdtwenty[m] |
no worries :) |
15:54.45 |
mdtwenty[m] |
i think I got the idea.. I will start working
on this and see what I can get |
15:55.40 |
vasc |
sure. if your ANSI C prototype code turns out
to be faster than the one in trunk/ then we can ask Sean to test it
and maybe we'll use it in the main ANSI C path as well. |
15:56.10 |
vasc |
it might be because the nodes are going to be
smaller and the memory accesses more coherent. |
15:56.33 |
vasc |
although i haven't looked at how the code
allocates the nodes yet. |
16:00.17 |
vasc |
you can use the RPN bool patch I made way back
when to see which functions you need to implement to completely
replace the existing boolean trees in trunk/. |
16:02.04 |
mdtwenty[m] |
okay thanks! I'll keep that in mind |
16:02.44 |
vasc |
but like i said, start with the tree builder
and the print function. |
16:02.56 |
vasc |
to debug. |
16:13.22 |
vasc |
duh... ok, i see. the existing code has memory
pool for tree nodes if the nodes get deleted and reallocated. but
fresh nodes are just allocated with posix_memalign() or
calloc()... |
16:13.33 |
vasc |
yeah that should make them memory coherent.
not. |
16:13.53 |
vasc |
well maybe if the malloc implementation is
smart. which it usually isn't. |
16:15.19 |
*** join/#brlcad Caterpillar2
(~caterpill@unaffiliated/caterpillar) |
16:15.44 |
vasc |
yes... the results of changing this code
should prove quite interesting. |
16:16.23 |
vasc |
notice how my code which always evaluates the
entire expression actually was either the same speed or faster in
the simpler expressions. |
16:17.00 |
mdtwenty[m] |
i hope so! |
16:17.55 |
mdtwenty[m] |
one thing that i noticed the other day was
that the max rpn tree lenght in the havoc scene was 583
IIRC |
16:19.16 |
vasc |
yeah that doesn't seem so good. |
16:19.31 |
vasc |
well you know there are several possible
strategies to optimize the tree evaluator. |
16:21.47 |
vasc |
say a solid is never intersected by any ray.
so why evaluate that for every partition or whatever? |
16:22.10 |
vasc |
it's possible to optimize the tree in runtime
or prior to render. |
16:22.49 |
Stragus |
A linear blob with small offsets rathre than
pointers, all packed in memory, will definitely be faster on the
CPU |
16:23.33 |
Stragus |
And you can use something like a 3D space
filling curve to sort data so it's spatially close, better filling
cache lines. I remember getting a few extra percents from
that |
16:25.13 |
vasc |
yeah none of the buffers are in z-order. that
likely has a negative impact. |
16:25.39 |
vasc |
i also need to reduce the amount of memory
used in between stages. |
16:26.48 |
Stragus |
If you ever want some pretty fast space
filling curve code: http://www.rayforce.net/ccspacecurve.c
http://www.rayforce.net/ccspacecurve.h |
16:27.01 |
vasc |
about using SoA vs AoS... |
16:28.07 |
vasc |
once the code is cleaned up we might try
that. |
16:28.29 |
vasc |
but right now i'm more concerned about thread
divergence and poor usage of capacity. |
16:28.47 |
Stragus |
Yup, that's really something for the final
touch |
16:28.57 |
Stragus |
(I gained 2-3% with that, woohoo) |
16:29.39 |
vasc |
i had like order of magnitude improvements
once because full GPU units weren't being used because of
divergence. |
16:30.00 |
vasc |
that one can really bite. |
16:30.42 |
Stragus |
Instead of testing 32 rays against a single
sector or primitive, have you considered one warp testing one ray
against 32 sectors or primitives? |
16:30.58 |
Stragus |
For highly incoherent rays, that seems like a
better idea than trying to make coherent non-diverging
bundles |
16:31.27 |
Stragus |
(replace the 32 with 4 or 8 on CPUs, 64 on
AMD, etc.) |
16:32.00 |
vasc |
well right now the code only does primaries.
well it's generic enough to be used with any rays, but those are
the only rays the generator generates. |
16:32.39 |
vasc |
i've had a lot of luck with sorting rays to
decrease divergence. |
16:33.19 |
Stragus |
Really?... Sorting rays either eat through
your global memory banwidth, or you can't sort enough in on-chip
shared memory |
16:33.37 |
Stragus |
For me, it was always faster to just trace
incoherently than to bother sorting them |
16:33.39 |
vasc |
in my own engine i actually fudged the RNG for
the ambient occlusion rays so that those are more coherent. but
don't tell anyone that... :-) |
16:33.55 |
Stragus |
But that could change if the acceleration
structure was designed to test one ray against 32 primitives,
instead of the other way around |
16:34.06 |
Stragus |
Eh, I did the same thing |
16:34.12 |
vasc |
ahah |
16:35.02 |
vasc |
well the sorting is useful if you have a fast
sort. |
16:35.09 |
vasc |
did you try using the CUB sort? |
16:35.20 |
Stragus |
The problem wasn't the fast sort, the problem
was the global memory banwdith |
16:35.28 |
vasc |
oh that? |
16:35.33 |
Stragus |
I could trace incoherent rays for the same
amount of bandwidth as sorting them |
16:35.34 |
vasc |
you use compress-sort-decompress. |
16:36.29 |
vasc |
have you read this paper? |
16:36.30 |
vasc |
http://www.keldysh.ru/pages/cgraph/articles/dep20/publ2010/GPU-RayTracing.pdf |
16:36.32 |
Stragus |
Assuming rays bouncing everywhere, how do you
compress the origin, vector and other contextual information about
that ray's purpose? |
16:36.51 |
vasc |
look at that paper. :-) |
16:37.29 |
Stragus |
The "compression" is for the sort,
sure |
16:37.52 |
Stragus |
But the write,read,write,read for bundling
rays takes a lot of bandwidth |
16:39.44 |
Stragus |
Considering the typical incoherent ray for me
had to touch about ~17 cache lines of 32 bytes... |
16:39.51 |
Stragus |
(if I remember the metric correctly) |
16:40.17 |
vasc |
hm |
16:40.21 |
vasc |
well you know |
16:40.37 |
vasc |
for my own render pipeline i only use that for
incoherent rays. |
16:41.19 |
vasc |
also i think it's not just a problem of memory
coherence but of scheduling as well |
16:41.38 |
vasc |
like, adjacent rays typically have similar
complexity. |
16:42.12 |
Stragus |
Right, less waste from threads having nothing
to do |
16:42.43 |
Stragus |
Well, like I said, if I had to trace a ton of
incoherent rays again, I would just use a different acceleration
structure |
16:42.52 |
Stragus |
Test a single ray against batches of 32
triangles instead |
16:46.21 |
vasc |
i once read a poster by Yoon. he basically ran
all the intersections against simplified geometry (like a convex
hull of the geometry) to get an estimate for the hit
points. |
16:46.34 |
vasc |
sorted those in z-order. and then used that to
schedule the rays. |
16:46.52 |
vasc |
it was an out-of-core renderer. so it made
even more sense there. |
16:47.24 |
vasc |
actually he used a simplified mesh, not the
convex hull now that i think of it. |
16:48.04 |
vasc |
the ray against triangles... well. |
16:48.14 |
vasc |
the thing is... triangles are "fatter" than
rays. |
16:49.08 |
Stragus |
But it's a lot easier to make coherent bundles
of 32 triangles than 32 rays ;) |
16:49.25 |
Stragus |
(assuming global illumination-like incoherent
tracing) |
16:49.45 |
vasc |
3 vertexes with 3 coordinates of 4 bytes vs 2
vertexes with 3 coordinates. |
16:49.53 |
vasc |
and that's not even assuming the rays have
same origin. |
16:50.32 |
Stragus |
Rays also need to track contextual
information, so you can accumulate back to whatever bounce it came
from |
16:50.55 |
vasc |
sure an id can take care of that. |
16:50.58 |
Stragus |
(assuming you short and shuffle
them) |
16:51.02 |
Stragus |
sort* |
16:51.57 |
Stragus |
I don't know, there's a lot of potential for
an acceleration structure designed to minimize the count of bundles
of 32 triangles being tested (in the global cost function, 1
triangle in a sector has the same cost as 32) |
16:53.08 |
vasc |
i think i mentioned this in some slides for a
class i gave sometime. |
16:53.43 |
vasc |
i was showing how you can reduce the amount of
math ops in the intersections when testing with same ray origin vs
same triangles and the like. |
16:54.08 |
vasc |
i remember that was also less nice for the
many triangle approach. |
16:54.29 |
vasc |
thing is, when the many rays don't have the
same origin though, then it changes. |
16:55.15 |
Stragus |
Right sure, highly coherent rays (including
with same origin) is a different problem with a different optimal
solution |
16:57.39 |
vasc |
https://drive.google.com/file/d/0B85Rkmt7rnCTWGVNUlhaZzZyQkU/view?usp=sharing |
16:57.40 |
i5o |
[ aula.pdf - Google Drive ] |
16:57.55 |
vasc |
page 6. |
16:58.02 |
vasc |
i gave that class like 5 years ago. |
16:58.59 |
vasc |
i remember i analyzed that as i thought it was
interesting. and ran some tests. then i gave it up. :-) |
16:59.31 |
vasc |
oh... what was that other thing... |
16:59.38 |
vasc |
right. |
16:59.53 |
vasc |
ray-triangle fan and ray-triangle strip
optimized intersection routines. |
16:59.59 |
vasc |
i was looking at those back then
too. |
17:00.34 |
vasc |
i only found like one decent paper for each
one. |
17:00.58 |
vasc |
it's another way to reduce the bandwidth
required for the triangles right? |
17:03.10 |
vasc |
so i looked at this one: http://gamma.cs.unc.edu/RS/paper_rt07.pdf |
17:03.31 |
vasc |
but i dunno it seemed kinda lame to me, that
you needed to build a BSP |
17:04.32 |
vasc |
actually an skd-tree but whatever. |
17:05.31 |
vasc |
it seemed lame to me. to use axis aligned
separators. |
17:05.44 |
vasc |
you already have the planes defined by the
triangles or whatever. |
17:06.00 |
vasc |
so i also ignored that idea. |
17:06.09 |
*** join/#brlcad ``Erik
(~erik@pool-100-16-14-17.bltmmd.fios.verizon.net) |
17:11.07 |
vasc |
there's all sorts of tricks you can
ues. |
17:11.09 |
vasc |
use. |
17:11.41 |
vasc |
like... you know how they say for primary rays
since the origin is always the same, only store the direction? well
i only stored the index used to generate the ray. |
17:11.58 |
vasc |
then i have generator functions for rays.
primaries, shadows, ambient occlusion the works. |
17:12.03 |
vasc |
so the engine was all specialized. |
17:13.37 |
vasc |
ah whatever. i'm not gonna pick that up
again. |
17:13.50 |
vasc |
i've decided if i ever rewrite it, i'm gonna
do everything differently. |
17:14.43 |
vasc |
so Stragus, are many-triangle intersections in
fashion again? |
17:17.43 |
Stragus |
I have no idea what's in fashion. I almost
never read papers, I just have fun ;) |
17:18.11 |
vasc |
every once in a while something comes out that
makes a difference. that's why i read papers. |
17:18.29 |
vasc |
its usually not a new idea, but, the twist
makes the difference. |
17:18.39 |
Stragus |
And really, I would recommend using a graph
rather than a tree |
17:18.49 |
vasc |
for the geometry right? |
17:18.52 |
Stragus |
A tree implies a stack, which implies
per-thread stores and loads, and that's terrible |
17:19.00 |
Stragus |
For the space partitioning, yes |
17:19.50 |
vasc |
isn't that like a tree with ropes like they
usually call it? well at least that's how Havran calls
it. |
17:20.19 |
Stragus |
No idea if there's a fancy term for it. But
what I use is a graph, sectors connecting to other sectors, there's
no tree |
17:20.56 |
vasc |
then how's the partitioning done? does it
still use split planes somehow? |
17:21.03 |
Stragus |
The idea being to avoid all per-thread memory
stores, until you have your intersection point and you need to do
something with it |
17:21.44 |
Stragus |
I use axis-aligned sectors, which can overlap
in space, and each facet of a sector can either connect to another
sector or a "node" with a splitting plane to decide where to go
next |
17:22.13 |
Stragus |
There are other ways, that's not ideal for the
one-ray-against-many-triangles approach |
17:22.25 |
Stragus |
But just avoid a per-thread stack, no matter
how you do it |
17:22.39 |
vasc |
i've read something similar in papers from the
1980s but i can't say i remember other people working on that. it
seemed like a cool idea when i read it. |
17:23.30 |
vasc |
back then a lot of people were trying to get
ray tracing to scale in MPP architectures. |
17:24.13 |
vasc |
like Transputers or that Intel machine... the
one with i860s. |
17:24.21 |
Stragus |
Tracking a stack in L1 on many-cores no-SIMD
chips is no big deal |
17:24.35 |
Stragus |
But the situation is different with GPUs and
modern wide SIMD |
17:24.45 |
vasc |
https://en.wikipedia.org/wiki/Intel_iPSC#iPSC.2F860 |
17:24.46 |
i5o |
[ Intel iPSC - Wikipedia ] |
17:25.13 |
vasc |
the Intel Paragon. |
17:25.45 |
vasc |
maan... they spent a lot of time trying to
minimize inter thread communications and ops. |
17:26.24 |
vasc |
well i dunno if those processors actually had
caches. |
17:26.26 |
Stragus |
Usually, for massively parallel software, you
just write code to _avoid_ requiring low latency
inter-thread/device communication |
17:27.14 |
Stragus |
Like for computational fluid dynamics, it's
better to compute the same cells redundantely while you receive
results from other devices from two time steps ago |
17:27.27 |
vasc |
ah right. the i860 had caches but early
Transputers did not. |
17:28.12 |
vasc |
ah CFD. i always wondered how they implemented
the data structures for the multi-grid solvers with AMR in
those. |
17:29.00 |
vasc |
ever seen this? |
17:29.01 |
vasc |
https://www.youtube.com/watch?v=vYA0f6R5KAI |
17:29.03 |
i5o |
[ GPUs to Mars: Full-Scale Simulation of
SpaceX's Mars Rocket Engine - YouTube ] |
17:29.03 |
Stragus |
AMR? |
17:29.13 |
vasc |
Adaptive Mesh Refinement |
17:29.34 |
vasc |
https://en.wikipedia.org/wiki/Adaptive_mesh_refinement |
17:29.34 |
i5o |
[ Adaptive mesh refinement - Wikipedia
] |
17:32.29 |
vasc |
https://youtu.be/vYA0f6R5KAI?t=1510 |
17:32.31 |
i5o |
[ GPUs to Mars: Full-Scale Simulation of
SpaceX's Mars Rocket Engine - YouTube ] |
17:32.31 |
vasc |
animated |
17:33.14 |
vasc |
is fascinated with
grids. |
17:38.53 |
*** join/#brlcad KimK_
(~Kim__@2600:8803:7a81:7400:c482:d704:f9cf:4675) |
18:33.13 |
vasc |
mdtwenty[m], the code changes you made to
branches/opencl seem to be fine at a glance. |
18:40.24 |
Notify |
03BRL-CAD Wiki:Mariomeissner * 10157
/wiki/User:Mariomeissner/logs: |
18:40.34 |
mdtwenty[m] |
I noticed that some code from my previous
patches still had indentation issues, so I fixed that and used it
to make my first commit, like a test commit :) |
18:41.40 |
mdtwenty[m] |
the code to skip unevaluated partitions should
be okay, I did some testing and it iterates only over the evaluated
partitions, although that didn't make that much of a difference in
performance |
18:43.36 |
vasc |
i remembered now that it might also be
possible to reuse the forward_pp in the partitions, so next_evalpp
might not be necessary. |
18:44.31 |
vasc |
in rt_boolfinal. |
18:45.11 |
vasc |
you just need to cache the
partitions[current_index].forw_pp in a variable, which could be
named next_index, before writing over the value. |
18:45.43 |
vasc |
and then in the loop you get the next index
from that cached variable instead of from the partition |
18:47.39 |
vasc |
it should only make a difference in scenes
with a lot of depth complexity. |
18:47.52 |
vasc |
maybe in golliath it makes a
difference? |
18:49.35 |
mdtwenty[m] |
hm it has almost the same, but to be fair I
only tested with the view az:35 el:25 |
18:50.02 |
mdtwenty[m] |
front view could have a more noticeable
difference |
18:51.19 |
mdtwenty[m] |
and I think that it is possible to discard the
use of next_evalpp now that you mention it |
18:52.08 |
mdtwenty[m] |
and maybe we dont even need to cache the
forw_pp, since I could only update the forw_pp of the last
partition evaluated |
18:52.13 |
vasc |
if we can reuse the foward_pp it would save
memory in the partitions. |
18:53.25 |
vasc |
well you need to save it before you override
it, because otherwise when the for (;;) loop iterates the value
won't be correct. |
18:54.12 |
vasc |
uint next_index; |
18:54.26 |
vasc |
for (uint current_index = head; current_index
!= UINT_MAX; current_index = next_index) { |
18:54.35 |
vasc |
<PROTECTED> |
18:54.37 |
vasc |
<PROTECTED> |
18:55.15 |
vasc |
<PROTECTED> |
18:55.20 |
vasc |
oh i see what you mean. |
18:55.30 |
vasc |
yeah it should be no problem. |
18:56.13 |
vasc |
no need for the variable even. |
18:57.50 |
mdtwenty[m] |
yeah :) I will update it |
19:04.06 |
vasc |
i was also looking at if it was possible to
remove the back_pp index by caching the previous index in a
variable wherever that is used. |
19:04.28 |
vasc |
a lot of times algorithms with double-linked
lists aren't actually necessary. |
19:04.49 |
mdtwenty[m] |
IIRC back_pp was only used in 1 condition in
boolweave |
19:05.22 |
vasc |
its used in boolweave and
rt_defoverlap |
19:07.14 |
mdtwenty[m] |
oh yes, right |
19:07.50 |
mdtwenty[m] |
but I think that could be cached in a
variable |
19:11.16 |
vasc |
i'm not so sure about that.. |
19:11.24 |
vasc |
the boolweave case sure. |
19:11.30 |
vasc |
but i'm not certain about
defoverlap. |
19:12.00 |
vasc |
because it doesn't iterate the list using the
pointers |
19:12.03 |
vasc |
it uses the ids |
19:12.38 |
vasc |
hm |
19:12.43 |
vasc |
yes it might also be possible |
19:13.13 |
vasc |
by passing it the index to the previous
partition via default_multioverlap in boolfinal |
19:14.08 |
vasc |
so that would shave off like 8 bytes in each
partition with these two modifications. |
19:14.42 |
vasc |
i also have an idea to shave off like half the
memory used in the struct hits in the partitions so. |
19:15.08 |
vasc |
over half actually |
19:15.22 |
vasc |
so the struct partition should become less
than half the size. |
19:20.12 |
vasc |
it seems like soon enough we'll have a our
first generation system. |
19:21.03 |
vasc |
the only thing which might make things slower
is the bitvectors. or not. |
19:21.23 |
vasc |
it think it will depend on how sparse the
bitvectors are. |
19:21.51 |
vasc |
i suspect they'll be faster on average than if
we used lists. |
19:22.09 |
vasc |
it also makes memory management
easier. |
19:22.41 |
vasc |
oh right, nearly forgot |
19:23.10 |
vasc |
we're gonna need to make an array with uints,
one uint per solid, where the uint has the id of the first
region. |
19:23.23 |
vasc |
so i can shade the materials in single-hit
mode properly. |
19:23.40 |
vasc |
its only necessary in single-hit
mode. |
19:24.01 |
vasc |
you prolly noticed its shading everything
white/gray in single-hit mode right now. |
19:24.25 |
vasc |
and i can't remember anything else. |
19:24.38 |
vasc |
but there's plenty of optimizations we can do
afterwards. |
19:24.50 |
mdtwenty[m] |
yeah I noticed it, when it wasn't white/gray
was every solid it the same color |
19:24.52 |
Notify |
03BRL-CAD:n_reed * 70035
brlcad/trunk/src/tclscripts/checker/CMakeLists.txt: Add initial
check command test script.Currently runs the check Tcl-command in
mged (mged -a nu -c) to checkbasic invocation behavior, and also
subtraction behavior on two simpleoverlap cases.No args, but finds
check.tcl by CHECKER_DIR environment variable.Doesn't run check.sh.
Test databases and overlap directories are writtenout by
the |
19:24.54 |
Notify |
test script itself. |
19:24.55 |
vasc |
oh, right, i also need to fix the
lights. |
19:24.56 |
Notify |
... |
19:28.23 |
mdtwenty[m] |
did you ever tried to render the havoc scene
with your workstation GPU? I was curious about what the difference
in time would be |
19:29.06 |
vasc |
it's bad... |
19:29.14 |
vasc |
there's too many roundtrips. |
19:29.32 |
vasc |
those can be fixed, but it requires a reduce
algorithm. |
19:29.42 |
vasc |
i think its better not to focus too much on
that right now. |
19:29.53 |
vasc |
we can just focus on the CPU
performance. |
19:30.11 |
vasc |
i can later give you a piece of code with
reduce for you to try out the difference. |
19:30.30 |
vasc |
but the license might not be compatible for
wider distribution. |
19:30.41 |
vasc |
let's see... |
19:30.53 |
vasc |
yeah i got that here. |
19:32.10 |
vasc |
ah we don't use the reduce, prolly just
scan. |
19:32.37 |
vasc |
the reduce is a sum, and the scan is prefix
sums. |
19:32.47 |
vasc |
https://en.wikipedia.org/wiki/Prefix_sum |
19:32.48 |
i5o |
[ Prefix sum - Wikipedia ] |
19:34.03 |
vasc |
so we can compute the offsets quickly on the
GPU side without doing memory transfers. |
19:36.00 |
vasc |
i.e. this code: |
19:36.01 |
vasc |
<PROTECTED> |
19:36.01 |
vasc |
<PROTECTED> |
19:36.01 |
vasc |
<PROTECTED> |
19:36.01 |
vasc |
BU_ASSERT((counts[i-1] % 2) == 0); |
19:36.01 |
vasc |
h[i] = h[i-1] + counts[i-1]/2;/* number of
segs is half the number of hits */ |
19:36.03 |
vasc |
if (counts[i-1]/2 > max_depth) |
19:36.05 |
vasc |
<PROTECTED> |
19:36.07 |
vasc |
<PROTECTED> |
19:36.09 |
vasc |
in clt_Frame() |
19:36.46 |
vasc |
oh... except it does two things it computes
the max and it does the prefix sums. |
19:36.58 |
vasc |
the max_depth is a sort of reduce i
guess. |
19:40.05 |
vasc |
well enough of that |
19:40.42 |
vasc |
keep me notified of your progress on the ANSI
C bool_eval() then mdtwenty[m]. i'll see if can work on the lights
this week. |
19:41.23 |
vasc |
<PROTECTED> |
19:41.38 |
mdtwenty[m] |
Yeah I will update you as soon as I have some
progress |
19:41.51 |
vasc |
ok see you later! |
19:42.08 |
mdtwenty[m] |
see you later! |
19:44.09 |
*** part/#brlcad mdtwenty[m]
(mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh) |
19:59.45 |
*** part/#brlcad mdtwenty
(bc519789@gateway/web/freenode/ip.188.81.151.137) |
20:05.54 |
*** join/#brlcad mdtwenty[m]
(mdtwentyma@gateway/shell/matrix.org/x-zzjodrxmmmvwtbhh) |
20:15.04 |
*** join/#brlcad yukonbob
(~bch@S0106c8fb2656f971.ok.shawcable.net) |
20:15.09 |
yukonbob |
hello #brlcad |
22:10.05 |
*** join/#brlcad yukonbob
(~bch@S0106c8fb2656f971.ok.shawcable.net) |