00:19.53 |
*** join/#brlcad infobot
(~infobot@rikers.org) |
00:19.53 |
*** topic/#brlcad is GSoC
students: if you have a question, ask and wait for an answer ...
responses may take minutes or hours. Ask and WAIT.
;) |
00:27.59 |
*** join/#brlcad
efjrugungwcmohmu
(~armin@dslb-092-075-157-134.092.075.pools.vodafone-ip.de) |
03:41.06 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
04:40.52 |
Notify |
BRL-CAD Wiki:Ravilogaiya * 0
/wiki/User:Ravilogaiya: |
05:31.46 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
05:32.36 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
05:33.21 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
05:34.12 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
05:35.02 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
05:35.46 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
06:04.33 |
*** join/#brlcad KimK
(~Kim__@2600:8803:7a81:7400:69b5:1646:8ec0:c796) |
06:34.41 |
*** join/#brlcad DaRock
(~Thunderbi@mail.unitedinsong.com.au) |
06:58.47 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
07:36.20 |
*** join/#brlcad Caterpillar
(~caterpill@unaffiliated/caterpillar) |
07:43.44 |
*** join/#brlcad merzo
(~merzo@93.94.41.67) |
07:49.32 |
Notify |
BRL-CAD:Amritpal singh * 10057
/wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period
*/ |
07:53.11 |
Notify |
BRL-CAD:Amritpal singh * 10058
/wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period
*/ |
08:12.20 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
08:34.21 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
08:51.53 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
10:38.01 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
11:43.38 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
11:46.04 |
*** join/#brlcad d_rossberg
(~rossberg@104.225.5.10) |
12:15.57 |
*** join/#brlcad DaRock
(~Thunderbi@mail.unitedinsong.com.au) |
12:19.55 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
12:58.55 |
*** join/#brlcad kintel
(~kintel@unaffiliated/kintel) |
13:33.22 |
*** join/#brlcad gabbar1947
(uid205515@gateway/web/irccloud.com/x-aihythsembbzujjz) |
13:48.51 |
d_rossberg |
gabbar1947: i ran into a cmake error:
include/rt/primitives/annot.h isn't there, and annot.h is
inserted twice in the CMakeLists.txt |
13:49.40 |
gabbar1947 |
I'll check, give me a second |
13:51.40 |
gabbar1947 |
Rectified: I'm building on my system, just a
moment |
14:03.08 |
gabbar1947 |
Uploaded, this should pass |
14:07.08 |
d_rossberg |
gabbar1947: what is your intention behind your
changes to the root CMakeLists.txt? |
14:08.31 |
*** join/#brlcad yorik
(~yorik@2804:431:f720:80d8:290:f5ff:fedc:3bb2) |
14:08.55 |
gabbar1947 |
Actually I did not make any changes to the
file, it's somehow reflected in the patch. |
14:09.01 |
gabbar1947 |
I'll check once |
14:10.12 |
gabbar1947 |
I'm unaware of any such change made by me, I
have no idea why this reflects in the patch |
14:12.21 |
d_rossberg |
i recommend opening the patch file in a text
editor and checking that it looks reasonable |
14:13.18 |
gabbar1947 |
I'll revert the change to the CMakeLists.txt.
Is there anything else that you want me to look into? |
14:13.36 |
d_rossberg |
in addition you have trailing spaces in your
code |
14:14.13 |
d_rossberg |
the "good" text editors provide a function to
remove them all in one step |
14:14.58 |
gabbar1947 |
I tried to remove as much as I could; anyway,
I'll go through the files once again. I'll use a text editor other
than vim. |
14:16.41 |
d_rossberg |
google says in vi it is :%s/\s\+$//e |
14:18.22 |
d_rossberg |
see
https://vi.stackexchange.com/questions/454/whats-the-simplest-way-to-strip-trailing-whitespace-from-all-lines-in-a-file/ |
14:18.22 |
gcibot |
[ What's the simplest way to strip trailing
whitespace from all lines in a file? - Vi and Vim Stack Exchange
] |
14:18.35 |
d_rossberg |
and its reference to vim.wikia |
14:18.43 |
gabbar1947 |
works! thank you |
14:19.45 |
gabbar1947 |
On it! just a moment |
14:22.20 |
d_rossberg |
i wrote my documents with vi for many years,
you can write great literature with it - if you are tough enough
for it |
14:29.50 |
gabbar1947 |
can you give it a try now, and let me know if
there are more errors ! |
14:34.33 |
d_rossberg |
ok: typein.c has trailing spaces and a C++
comment (i.e. //) |
14:35.41 |
d_rossberg |
maybe you should simply remove the line - and
the number of segments in p_annot for the sake of simplicity (see
Sean's mail) |
14:35.52 |
gabbar1947 |
I'll remove it ! The C++ instinct ! |
14:38.01 |
gabbar1947 |
Actually I wanted the "l" command to display
the annotation container details as well, that was the reason for
its inclusion. anyways i'm removing it! |
14:38.13 |
d_rossberg |
run the :%s/\s\+$//e on all files you've
touched to make sure that no trailing spaces are left |
14:38.38 |
d_rossberg |
isn't typein.c the in command? |
14:42.31 |
d_rossberg |
sorry, i had an old patch, in the actual one
all trailing spaces seem to be gone |
14:42.40 |
gabbar1947 |
:) |
14:43.41 |
gabbar1947 |
typein.c is the "in" command! but the
describe() function for annotation displays the container params as
well, so I just wanted to see the details on the screen, that's
it! |
14:44.22 |
d_rossberg |
ok, it's your decision |
14:46.58 |
gabbar1947 |
I'm submitting the patch, once build completes
! |
14:55.45 |
gabbar1947 |
UPLOADED |
14:59.34 |
d_rossberg |
however, i've to go now :( |
14:59.41 |
d_rossberg |
i'll see ... |
15:11.36 |
gabbar1947 |
:) |
15:38.26 |
*** join/#brlcad merzo
(~merzo@93.94.41.67) |
15:49.55 |
*** join/#brlcad vasc
(~vasc@bl14-42-31.dsl.telepac.pt) |
15:50.01 |
vasc |
hey |
15:50.04 |
vasc |
hello mdtwenty[m] |
15:50.23 |
mdtwenty[m] |
Hi :) |
15:51.02 |
vasc |
so... you said something about only supporting
one partition? |
15:52.03 |
mdtwenty[m] |
not only one partition.. one region |
15:52.09 |
vasc |
right. |
15:52.52 |
vasc |
but that isn't in the weaving part proper
right? |
15:52.58 |
vasc |
it's in the evaluation for
rendering? |
15:53.37 |
mdtwenty[m] |
yes it's in the evaluation part |
15:53.51 |
vasc |
well. you haven't implemented rt_boolfinal
yet. |
15:54.35 |
vasc |
is there anything left to do in the boolean
weaving? |
15:55.36 |
mdtwenty[m] |
i don't think so.. i think the boolean weaving
is already fine |
15:56.10 |
mdtwenty[m] |
i was looking now into the
rt_boolfinal |
15:58.02 |
mdtwenty[m] |
*i also uploaded today the weave patch without
using the pointers in the cl_partition structure |
15:58.38 |
vasc |
yes. i've seen that. i'll have to review it
more in depth later but it seems fine on a cursory
glance. |
15:59.28 |
vasc |
there's still the question of linked lists vs
arrays, but without more complex test scenes there's no good way to
benchmark it. |
16:02.15 |
vasc |
the most complex test scenes on the standard
database files are probably goliath and havoc. but i'm not sure if
they use much csg |
16:02.35 |
vasc |
they probably don't. |
16:03.07 |
vasc |
have you tried rendering those to see what
happens? |
16:03.15 |
vasc |
and the operators scene. |
16:03.22 |
vasc |
oh right. rendering issues. |
16:03.34 |
vasc |
well. |
16:03.38 |
mdtwenty[m] |
the havoc and the goliath? |
16:03.47 |
vasc |
yes. havoc.g and goliath.g i think |
16:03.53 |
vasc |
in share/db or something. |
16:04.05 |
mdtwenty[m] |
yes i see it |
16:04.08 |
mdtwenty[m] |
one sec |
16:04.38 |
vasc |
if it doesn't crash that would be good enough
i guess. |
16:04.43 |
vasc |
for now. |
16:05.27 |
vasc |
hopefully it's not horrendously slow
either. |
16:06.23 |
mdtwenty[m] |
hm the goliath scene fails the assertion of 32
segments per ray |
16:06.30 |
mdtwenty[m] |
so i think 32 is not enough |
16:06.30 |
vasc |
hah. |
16:06.50 |
vasc |
see what's the max depth. |
16:07.41 |
vasc |
it's still prolly under 64. |
16:08.24 |
vasc |
i think i have some dynamic bitvector code in
opencl or ANSI C in here somewhere you could use if it's bigger
than that. |
16:11.35 |
mdtwenty[m] |
is there a function to see what is the max
depth or something? |
16:12.22 |
vasc |
no. just in that for loop where you do the
assert, keep track of the max segments per ray size, and then print
it out when the loop ends. |
16:13.18 |
vasc |
there might be something like that in one of
the brlcad tools, but i'm not sure if it would work with the opencl
backend. |
16:20.00 |
vasc |
ok found it |
16:20.01 |
vasc |
== host |
16:20.02 |
vasc |
cl_uint ND = N/WORD_BITS + 1; |
16:20.02 |
vasc |
mD = clCreateBuffer(gpuCtx, CL_MEM_READ_WRITE,
sizeof(cl_uint) * ND, NULL, NULL); |
16:20.02 |
vasc |
== device |
16:20.02 |
vasc |
inline uint bindex(const uint b) { |
16:20.03 |
vasc |
return (b >> 5); |
16:20.05 |
vasc |
} |
16:20.07 |
vasc |
inline uint bmask(const uint b) { |
16:20.09 |
vasc |
return (1 << (b & 31)); |
16:20.11 |
vasc |
} |
16:20.13 |
vasc |
inline uint isset(__global uint *bitset, const
uint b) { |
16:20.15 |
vasc |
return (bitset[bindex(b)] & bmask(b)); |
16:20.17 |
vasc |
} |
16:20.19 |
vasc |
inline uint clr(__global uint *bitset, const
uint b) { |
16:20.21 |
vasc |
return (bitset[bindex(b)] &= ~bmask(b)); |
16:20.23 |
vasc |
} |
16:20.25 |
vasc |
inline uint set(__global uint *bitset, const
uint b) { |
16:20.27 |
vasc |
return (bitset[bindex(b)] |= bmask(b)); |
16:20.29 |
vasc |
} |
16:20.31 |
vasc |
- |
16:20.33 |
vasc |
this is my code, so i give you permission to
use it for any purpose. |
16:20.49 |
vasc |
where WORD_BITS is 32 since 'D' is an array of
cl_uints |
16:21.33 |
vasc |
and N is the amount of bits you want the
bitvector to have. |
16:21.42 |
mdtwenty[m] |
i got maxdepth of 957571 |
16:21.48 |
vasc |
WHAT |
16:21.52 |
vasc |
do the math properly dude. |
16:21.58 |
vasc |
:-) |
16:22.02 |
vasc |
that can't be true. |
16:22.13 |
vasc |
max per segment, not the sum of
everything. |
16:22.55 |
Stragus |
Still not a fan of allocating chunks out of
big buffers through atomics, eh :) |
16:23.27 |
vasc |
we might do that eventually. but for now
there's a lot of gfx card memory we don't use. |
16:24.01 |
Stragus |
All right. It goes up quickly when buffering
all hits for millions of rays |
16:24.52 |
vasc |
well sure. we could find the warp size and
only allocate a buffer of that size or something. |
16:24.57 |
vasc |
it's too much work :-) |
16:25.27 |
vasc |
those microoptimizations can be done
later. |
16:25.59 |
vasc |
i kinda doubt we need to do it this way
anyway. |
16:26.12 |
vasc |
i suspect we could do the csg processing in an
iterative fashion with a modified algorithm. |
16:26.48 |
Stragus |
So you want to allocate the "max depth" for
every ray... and how do you determine that max depth? |
16:27.22 |
vasc |
we count the amount of segments per ray before
allocating and actually storing the segments. |
16:27.37 |
Stragus |
Ideally, you would process the segments as
they come rather than buffering the whole thing. That complicates
the code though |
16:27.51 |
Stragus |
Ah yes, the two passes thing, count then
trace |
16:28.04 |
vasc |
yeah i suspect that could be done across the
whole pipeline, but it requires rethinking the algorithm. |
16:28.14 |
vasc |
it's probably non-trivial. but yeah it's
worthwhile in the long run. |
16:28.48 |
vasc |
i'm just kinda reticent about doing it first
hand without understanding how the current algorithm works
properly. |
16:29.14 |
vasc |
not just mechanically but in terms of
performance as well. |
16:29.40 |
Stragus |
My raytracer used an inlined callback, the
user could do whatever it wants with the hits. They can be
processed on the fly (recommended) or buffered by a custom solution
in the callback. And importantly, the inlined callback can
terminate rays early |
16:29.48 |
vasc |
especially considering the guys who originally
wrote the code didn't do it, and they worked on it for
decades. |
16:30.10 |
vasc |
well, we kind of have something like
that, |
16:30.36 |
vasc |
there's no storing of segments in the single
pass version of the renderer. |
16:30.45 |
Stragus |
Cool. Is the callback truly inlined? You want
to avoid any function call on GPU, especially function
pointers |
16:30.45 |
vasc |
but that doesn't do CSG. |
16:31.09 |
vasc |
and it only returns the first hit, or an
accumulation of the result of all the hits. |
16:31.28 |
vasc |
you can't have function pointers on
opencl. |
16:31.40 |
vasc |
but yeah its some function. |
16:31.45 |
Stragus |
Eheh. You can with CUDA, but it's a Very Bad
Idea anyway |
16:32.38 |
mdtwenty[m] |
i was doing it wrongly :D |
16:32.58 |
vasc |
that version of the renderer is way faster
than the current ANSI C one. |
16:33.04 |
mdtwenty[m] |
i got 493 maxdepth for the goliath and 105 for
the havoc |
16:33.12 |
vasc |
but it doesn't do CSG so it isn't a proper
comparison |
16:33.16 |
vasc |
really? |
16:33.21 |
vasc |
it's still way more than i expected. |
16:33.32 |
Stragus |
vasc: I'm sure. My raytracer of triangles
reached a billion rays per second... while a CPU core does 20M per
second at most |
16:33.44 |
Stragus |
(SSE optimized CPU code) |
16:33.56 |
vasc |
mdtwenty[m], what's the amount of primitives
in each scene? |
16:34.33 |
vasc |
just for curiosity's sake. |
16:35.06 |
vasc |
i think there's a 'list' command in mged or
something |
16:35.25 |
mdtwenty[m] |
goliath has 10499 primitives and havoc
2429 |
16:35.46 |
vasc |
pfew, it's still smaller at least. |
16:35.58 |
vasc |
but... |
16:36.09 |
Stragus |
There could be some ray hitting a bunch of
aligned screws? :) |
16:36.10 |
mdtwenty[m] |
well the good news is that the boolean weaving
doesn't crash |
16:36.49 |
vasc |
well it might use an outrageous amount of
memory. |
16:37.10 |
vasc |
so perhaps Stragus will get his
thing. |
16:37.26 |
vasc |
:-) |
16:37.48 |
vasc |
the havoc with 105 is ok i guess. |
16:37.51 |
vasc |
but 493 |
16:38.10 |
vasc |
that's 16 double words |
16:38.25 |
vasc |
i.e. cl_uint [16] |
16:38.41 |
mdtwenty[m] |
yeah i got 493 while rendering with the front
view |
16:39.09 |
vasc |
can you compute the amount of memory that
would take with that size of bitvector? |
16:39.17 |
vasc |
the whole segments array |
16:39.39 |
Stragus |
Why use the max depth for a ray though? Why
not compute the sum of all rays, through a reduction kernel, if you
are going to perform an identical trace right away? |
16:40.08 |
vasc |
we have a bitvector we use inside each ray's
segment list |
16:40.15 |
Stragus |
(I still prefer dynamically allocated memory,
but your way would work fine, except for the tracing-twice
thing) |
16:41.00 |
vasc |
well |
16:41.02 |
vasc |
it's like |
16:41.06 |
vasc |
we have a list of segments |
16:41.11 |
vasc |
which gets computed into a list of
partitions |
16:41.44 |
vasc |
and then that gets evaluated |
16:42.05 |
Stragus |
So each ray computes how much storage it
requires, it stores that number, and you reduce all these numbers
to a grand total? |
16:42.13 |
vasc |
each segment only belongs in one object right?
but the partitions can belong to more than one. |
16:42.30 |
vasc |
like the ray pierces one and exits the other.
but it's the same partition solid space. |
16:42.52 |
vasc |
yeah its kinda like that. |
16:43.22 |
vasc |
but that's only used to compute the amount of
space we'll need. |
16:43.30 |
vasc |
the actual algorithm isn't just a
reduction. |
16:44.17 |
Stragus |
Right... but what I'm saying is that it
doesn't matter if the "max depth" is 40000 due to a bunch of
aligned screws somewhere |
16:44.25 |
Stragus |
You want the total for all rays |
16:44.38 |
vasc |
mdtwenty[m], another thing we could do is
dynamically allocate the bitvector, so rays with more segments
would get larger bitvectors, but i wonder if that would complicate
the code too much. |
16:45.22 |
vasc |
hmm |
16:47.59 |
vasc |
so it's
max_partitions*sizeof(cl_partition_without the
bitvector)+max_partitions*sizeof(cl_uint)*(493/32) |
16:48.09 |
vasc |
how much is that in bytes? |
16:48.17 |
vasc |
mdtwenty[m] |
16:48.44 |
vasc |
and this max_partitions is the total amount of
partitions for all the rays. |
16:48.53 |
vasc |
sum |
16:49.48 |
mdtwenty[m] |
one sec |
17:03.16 |
vasc |
hmm |
17:03.31 |
vasc |
perhaps this is not as a big of a deal as i
thought |
17:04.22 |
vasc |
+struct cl_partition { |
17:04.22 |
vasc |
+ struct cl_seg inseg; |
17:04.22 |
vasc |
+ struct cl_hit inhit; |
17:04.22 |
vasc |
+ struct cl_seg outseg; |
17:04.22 |
vasc |
+ struct cl_hit outhit; |
17:04.23 |
vasc |
+ cl_uint segs; /* 32-bit
vector to represent the segments in the partition */ |
17:04.25 |
vasc |
+}; |
17:04.30 |
vasc |
but |
17:04.35 |
vasc |
struct cl_hit { |
17:04.35 |
vasc |
<PROTECTED> |
17:04.35 |
vasc |
<PROTECTED> |
17:04.35 |
vasc |
<PROTECTED> |
17:04.37 |
vasc |
<PROTECTED> |
17:04.39 |
vasc |
<PROTECTED> |
17:04.41 |
vasc |
}; |
17:04.43 |
vasc |
and |
17:04.46 |
vasc |
struct cl_seg { |
17:04.47 |
vasc |
<PROTECTED> |
17:04.49 |
vasc |
<PROTECTED> |
17:04.51 |
vasc |
<PROTECTED> |
17:04.53 |
vasc |
}; |
17:04.55 |
vasc |
so |
17:04.57 |
vasc |
who cares. |
17:05.03 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
17:05.35 |
vasc |
it's like just a cl_hit has 8+8+8+2+1
words |
17:05.39 |
vasc |
double words |
17:05.57 |
vasc |
i.e. 27 |
17:06.25 |
Stragus |
That cl_hit struct is kind of heavy, like 84
bytes |
17:06.29 |
vasc |
and each partition has like 6 of
those |
17:07.10 |
vasc |
so using even 16 double words for the
bitvector seems pathetic in comparison. |
17:07.50 |
vasc |
still i'm kinda interested to know how much
memory the whole thing uses right now. |
17:08.29 |
vasc |
Stragus, it's much, much worse than
that. |
17:08.43 |
vasc |
coz cl_double3's are ACTUALLY
cl_double4s. |
17:09.15 |
vasc |
it's an opencl thing. |
17:09.40 |
vasc |
and then there's struct alignment |
17:09.49 |
vasc |
which reminds me |
17:09.58 |
vasc |
mdtwenty[m], instead of this: |
17:10.22 |
vasc |
+struct cl_partition { |
17:10.22 |
vasc |
+ struct cl_seg inseg; |
17:10.22 |
vasc |
+ struct cl_hit inhit; |
17:10.22 |
vasc |
+ struct cl_seg outseg; |
17:10.22 |
vasc |
+ struct cl_hit outhit; |
17:10.23 |
vasc |
+ cl_uint segs; /* 32-bit
vector to represent the segments in the partition */ |
17:10.24 |
vasc |
+}; |
17:10.27 |
vasc |
try this: |
17:10.38 |
vasc |
+struct cl_partition { |
17:10.38 |
vasc |
+ struct cl_seg inseg; |
17:10.38 |
vasc |
+ struct cl_seg outseg; |
17:10.38 |
vasc |
+ struct cl_hit inhit; |
17:10.38 |
vasc |
+ struct cl_hit outhit; |
17:10.39 |
vasc |
+ cl_uint segs; /* 32-bit
vector to represent the segments in the partition */ |
17:10.41 |
vasc |
+}; |
17:10.46 |
vasc |
and see if it's sizeof() is smaller. |
17:11.01 |
Stragus |
That shouldn't make a difference, both cl_hit
and cl_seg have the same alignment |
17:11.08 |
vasc |
i hope so. |
17:11.55 |
Stragus |
If these cl_double3 waste memory, then perhaps
it should be packed differently |
17:11.58 |
vasc |
it's the builtin cl_types that are an issue
usually. |
17:12.45 |
Stragus |
Although frankly, this whole data storage
scheme is very unfriendly to GPUs and SIMD |
17:12.55 |
vasc |
hm |
17:13.05 |
vasc |
only because it isn't z-ordered. |
17:13.09 |
vasc |
oh i see. |
17:13.34 |
vasc |
well |
17:13.39 |
Stragus |
No! Because you will have scattered
loads/stores all over |
17:13.48 |
Stragus |
All memory transactions will be 8 times slower
than necessary |
17:14.10 |
vasc |
the thing is you probably don't need the whole
thing across all stages of the algorithm |
17:14.20 |
vasc |
so you could fraction this |
17:14.33 |
vasc |
and increase memory locality. |
17:14.39 |
Stragus |
For best performance, all threads of a
warp/wavefront need to access consecutive memory
addresses |
17:14.44 |
*** join/#brlcad merzo
(~merzo@194.140.108.146) |
17:15.01 |
Stragus |
So you need some struct where struct foo {
float x[32]; float y[32]; etc. }; |
17:15.02 |
vasc |
it's like i said, you don't need the whole
thing. |
17:15.07 |
vasc |
even before that. |
17:15.45 |
vasc |
yeah i know but if we minimize the size of the
elements it ain't a big deal. |
17:15.59 |
vasc |
the problem is the structs are too fat right
now. |
17:16.20 |
vasc |
still |
17:16.35 |
vasc |
in comparison to the ANSI C code, it's
incredibly memory coherent ya know? |
17:16.40 |
Stragus |
Not a big deal? The stride between elements
doesn't matter unless it's in the same cache lines |
17:17.00 |
Stragus |
Well, these memory operations will be 8 times
slower than if they were reorganized differently |
17:17.14 |
Stragus |
That may or may not be a bottleneck, you'll
decide that |
17:17.31 |
vasc |
even with a cache? |
17:17.34 |
vasc |
i'm not sure about that. |
17:17.53 |
vasc |
i think the main issue is to have poor memory
locality in accesses. |
17:18.00 |
vasc |
rather than the access patterns
themselves. |
17:18.41 |
Stragus |
It's not about the cache, it's about memory
transactions |
17:18.52 |
vasc |
memory bank conflicts? |
17:19.00 |
Stragus |
I am very sure about that, been doing CUDA for
8 years, and probably the biggest helper in #cuda... |
17:19.16 |
Stragus |
Bank conflicts are for shared memory |
17:19.58 |
vasc |
well the thing is |
17:20.09 |
vasc |
if you're gonna need to access the rest of the
struct in the same kernel |
17:20.29 |
vasc |
it's all going to have to be loaded
anyway. |
17:20.30 |
Stragus |
Presumably, all threads of the same
warp/wavefront will also access the rest of their structs,
no? |
17:21.15 |
vasc |
that's the thing i said, i think we don't need
to store everything in that struct in all the stages of the
algorithm. it's just that currently we're slavishly following the
way the existing ANSI C code is structured. |
17:21.37 |
vasc |
like |
17:21.54 |
Stragus |
Okay! But it should still be designed so that
consecutive threads access consecutive values in memory |
17:22.02 |
Stragus |
You don't want a stride between
threads |
17:23.08 |
vasc |
i'll give you an example. i thought about
doing that in the intersections code. |
17:23.29 |
Stragus |
Consecutive addresses is _the_ solution that
is fast on all GPUs from all vendors, for all generations. Beyond
that, there are particularities if the accesses are shuffled, out
of order, with gaps between chunks of 128 bytes, etc. |
17:23.33 |
vasc |
well it turns out each kernel still has so
many branches. a lot of threads will be idling and it's awfully low
performance. |
17:23.46 |
vasc |
the GPU isn't maxed out. |
17:23.57 |
Stragus |
All right. But the threads that are active
would still access a bunch of packed addresses |
17:24.09 |
vasc |
no it's actually terrible. |
17:24.33 |
vasc |
imagine one thread is doing a quadric like a
sphere, and the other is doing a torus intersection. |
17:24.55 |
Stragus |
Indeed, paths should be merged as much as
possible. Coherent rays can help a lot with that |
17:25.14 |
vasc |
well i thought about that. that actually kind
of happens as it is. |
17:25.28 |
vasc |
since i'm using a thread block. |
17:25.30 |
vasc |
but |
17:25.36 |
Stragus |
If there are some common operations between
spheres and toruses, like storing data (memory transactions), they
should be merged together |
17:25.46 |
vasc |
i think the best thing would be to reorder the
intersection calculations. |
17:25.47 |
Stragus |
As little code as possible should be specific
to branches |
17:25.57 |
Stragus |
That's possible, yes |
17:26.21 |
vasc |
like group the ones that use the same kernel
solver together. |
17:27.18 |
vasc |
but the whole existing ANSI C code is more
built to minimize the amount of operations than either memory
consumption, or maximize memory coherency |
17:27.18 |
Stragus |
It's possible to have warp-wide votes to
decide how many threads need to perform operation X, before
deciding to do it with a bunch of threads |
17:27.38 |
vasc |
or minimize branches |
17:27.38 |
Stragus |
But these aren't as critical issues as
properly organizing memory. Reshuffling memory implies rewriting a
lot of code, so it must be done early |
17:27.59 |
Stragus |
Right, it's very different to optimize for
scalar execution and for wide parallelism |
17:28.09 |
vasc |
its not just that |
17:28.16 |
vasc |
it's optimized for 1980s machines |
17:28.20 |
Stragus |
Oh I see, yes |
17:28.32 |
Stragus |
Memory was fast, ALUs were slow. And now it's
the other way around |
17:28.35 |
vasc |
yep |
17:31.01 |
vasc |
so mdtwenty[m] any luck with that? |
17:33.19 |
mdtwenty[m] |
sent a long message:
mdtwenty[m]_2017-06-23_17:33:18.txt <https://matrix.org/_matrix/media/v1/download/matrix.org/hVZtHbkIpDKOvbklzQAQwlpz> |
17:33.39 |
vasc |
ok |
17:33.47 |
vasc |
what about the memory size of the whole
thing? |
17:35.35 |
vasc |
<vasc> hmm |
17:35.36 |
vasc |
<vasc> so it's
max_partitions*sizeof(cl_partition_without the
bitvector)+max_partitions*sizeof(cl_uint)*(493/32) |
17:35.36 |
vasc |
<vasc> how much is that in
bytes? |
17:35.36 |
vasc |
<vasc> mdtwenty[m] |
17:35.36 |
vasc |
<vasc> and this max_partitions is the
total amount of partitions for all the rays. |
17:35.38 |
vasc |
<vasc> sum |
17:35.41 |
mdtwenty[m] |
i got 2337015024 |
17:35.55 |
mdtwenty[m] |
for the goliath that has 493 depth |
17:36.35 |
vasc |
2 GB?! |
17:37.04 |
vasc |
ok, how much is
max_partitions*sizeof(cl_partition_without the bitvector)
alone? |
17:37.39 |
mdtwenty[m] |
compiling |
17:39.06 |
mdtwenty[m] |
2 179 756 800 |
17:39.49 |
vasc |
also i wanna see the code for
bool_Eval |
17:39.59 |
vasc |
so the bitvectors aren't the real
problem |
17:40.16 |
vasc |
since they use a "mere" 200 MB or
less. |
17:40.31 |
vasc |
ok i think i got the idea |
17:40.36 |
vasc |
+struct cl_partition { |
17:40.36 |
vasc |
+ struct cl_seg inseg; |
17:40.36 |
vasc |
+ struct cl_hit inhit; |
17:40.36 |
vasc |
+ struct cl_seg outseg; |
17:40.37 |
vasc |
+ struct cl_hit outhit; |
17:40.37 |
vasc |
+ cl_uint segs; /* 32-bit
vector to represent the segments in the partition */ |
17:40.39 |
vasc |
+}; |
17:40.41 |
vasc |
+ |
17:40.53 |
vasc |
instead of storing copies of the cl_segs, why
not use indexes instead? |
17:41.51 |
mdtwenty[m] |
yes i think that it would work |
17:43.40 |
vasc |
((8*4*3+8+4)*2+4)*2 vs 4*2 |
17:43.45 |
vasc |
440 vs 8 |
17:43.56 |
vasc |
that should shrink things down |
17:47.10 |
vasc |
so you have the code for bool_eval so i can
look at it? i kind of want to understand which data in the
partitions will get accessed in rt_boolfinal and
rendering. |
17:49.30 |
mdtwenty[m] |
posted a file:
ocl_bool_eval.patch (51KB) <https://matrix.org/_matrix/media/v1/download/matrix.org/SFtMrGVveJAkLiBFABODKZMf> |
17:49.40 |
mdtwenty[m] |
this should be it |
17:51.27 |
vasc |
ok |
17:51.40 |
mdtwenty[m] |
i think that only the segments in the
partition are relevant for boolean evaluation and shading |
17:51.48 |
vasc |
rt_boolfinal seems to hammer a partition's
inhit/outhit .hit_dist's over and over. |
17:52.11 |
vasc |
and then it computes segment regions |
17:54.50 |
vasc |
in a later optimization we may want to
simplify the partition structures. |
17:55.03 |
vasc |
struct cl_hit { |
17:55.03 |
vasc |
<PROTECTED> |
17:55.03 |
vasc |
<PROTECTED> |
17:55.03 |
vasc |
<PROTECTED> |
17:55.03 |
vasc |
<PROTECTED> |
17:55.04 |
vasc |
<PROTECTED> |
17:55.06 |
vasc |
}; |
17:55.15 |
vasc |
of all of this, it looks as if rt_boolfinal
only accesses the hit_dist. |
17:55.31 |
vasc |
i think the rest is only accessed in the final
rendering stages. |
17:56.14 |
vasc |
but for now, just not storing the whole
cl_segs in the cl_partitions should be enough. |
17:57.21 |
Stragus |
Eh, still keep in mind that repacking of hits
in structs of arrays of 32 or 64 |
17:58.09 |
vasc |
well the current code is horrible in several
ways. |
17:58.25 |
vasc |
just the amount of memory copies going on is
kind of insane. |
17:58.34 |
vasc |
i'm surprised it's as fast as it is. |
17:58.44 |
Stragus |
How "fast" is fast? :) |
17:59.11 |
Stragus |
The reference CPU code is also abysmally slow,
so it's not a great point of comparison |
17:59.44 |
vasc |
mdtwenty[m], after you use indexes instead of
storing the whole cl_segs, test goliath and havoc again. tell me
the time it takes and if it crashes or not. |
18:00.00 |
vasc |
Stragus, well it's what we have. |
18:00.26 |
vasc |
mdtwenty[m], once you get that working, it's
time to work on rt_boolfinal |
18:00.54 |
vasc |
mdtwenty[m], oh and tell me how much memory it
uses to store the partitions now vs what it used before. |
18:01.05 |
vasc |
with the indexes. |
18:01.10 |
mdtwenty[m] |
and what about the bitvector for
now? |
18:01.25 |
vasc |
yes, use the dynamic bitvector too |
18:01.35 |
vasc |
i pasted the code above? did you get
it? |
18:01.59 |
vasc |
<vasc> ok found it |
18:01.59 |
vasc |
<vasc> == host |
18:01.59 |
vasc |
<vasc> cl_uint ND = N/WORD_BITS +
1; |
18:01.59 |
vasc |
<vasc> mD = clCreateBuffer(gpuCtx,
CL_MEM_READ_WRITE, sizeof(cl_uint) * ND, NULL, NULL); |
18:01.59 |
vasc |
<vasc> == device |
18:02.00 |
vasc |
<vasc> inline uint bindex(const uint b)
{ |
18:02.02 |
vasc |
<vasc> return (b >>
5); |
18:02.04 |
vasc |
<vasc> } |
18:02.06 |
vasc |
<vasc> inline uint bmask(const uint b)
{ |
18:02.08 |
vasc |
<vasc> return (1 << (b &
31)); |
18:02.10 |
vasc |
<vasc> } |
18:02.12 |
vasc |
<vasc> inline uint isset(__global uint
*bitset, const uint b) { |
18:02.14 |
vasc |
<vasc> return (bitset[bindex(b)] &
bmask(b)); |
18:02.16 |
vasc |
<vasc> } |
18:02.18 |
vasc |
<vasc> inline uint clr(__global uint
*bitset, const uint b) { |
18:02.20 |
vasc |
<vasc> return (bitset[bindex(b)]
&= ~bmask(b)); |
18:02.22 |
vasc |
<vasc> } |
18:02.24 |
vasc |
<vasc> inline uint set(__global uint
*bitset, const uint b) { |
18:02.26 |
vasc |
<vasc> return (bitset[bindex(b)] |=
bmask(b)); |
18:02.28 |
vasc |
<vasc> } |
18:02.30 |
vasc |
<vasc> - |
18:02.32 |
vasc |
<vasc> this is my code, so i give you
permission to use it for any purpose. |
18:02.34 |
vasc |
<vasc> where WORD_BITS is 32 since 'D'
is an array of cl_uints |
18:02.35 |
vasc |
<vasc> and N is the amount of bits you
want the bitvector to have. |
18:03.10 |
vasc |
basically you have a cl_uint array per
bitvector |
18:03.32 |
vasc |
and you can use the isset, clr, or set
functions to twiddle the bits. |
18:03.39 |
vasc |
or query them. |
18:04.48 |
vasc |
for a start you can just use a cl_uint
segs[16]; or whatever |
18:04.59 |
vasc |
but eventually you want to dynamically
determine the size of this |
18:05.39 |
vasc |
the quick and dirty way to do it, is basically
to pass it as a #define before the kernels are compiled. |
18:06.07 |
vasc |
but don't do that. |
18:06.28 |
vasc |
we'll probably need to optimize this some
other way. but without more tests, it's hard to determine the
appropriate solution. |
18:07.05 |
Stragus |
That bitvector is to determine entry/exit
status through solids? |
18:07.15 |
vasc |
no |
18:07.29 |
vasc |
it states which segments are within a
partition |
18:07.43 |
vasc |
it's per ray |
18:08.01 |
vasc |
we could do this some other way
though |
18:08.01 |
Stragus |
Okay. I guess I'm not familiar enough with the
terminology used by the BRL-CAD raytracer |
18:08.27 |
vasc |
if the bitvector is too sparse, we would
probably be better off with using a list, like the current code
already does. |
18:09.11 |
Stragus |
Without knowing what the bitvector was, that
was my thought |
18:10.31 |
vasc |
well. |
18:10.47 |
vasc |
i thought there would be less depth complexity
in the average scene than there actually is. |
18:10.50 |
vasc |
my mistake. |
18:11.35 |
vasc |
a typical game scene has like 3 depth
complexity. |
18:12.05 |
vasc |
in here we don't cull stuff. |
18:12.28 |
vasc |
i thought 32 was enough. so much for
that. |
18:12.38 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
18:13.28 |
vasc |
i know there are hard limits in the amount of
intersections per ray on triangle meshes in the current code for
example. |
18:15.59 |
vasc |
https://svn.code.sf.net/p/brlcad/code/brlcad/trunk/src/librt/primitives/bot/tie.c |
18:16.06 |
vasc |
<PROTECTED> |
18:17.32 |
vasc |
mdtwenty[m], if the bitvector is too slow on
the goliath scene, we'll have to use lists again... |
18:18.29 |
mdtwenty[m] |
ok, i will change the bool weave to use the
indexes for segments instead of the segs and will test |
18:18.35 |
vasc |
ok |
18:22.20 |
vasc |
hm |
18:23.00 |
vasc |
so this is what the ANSI C code does for
bool_eval of solids. |
18:23.02 |
vasc |
case OP_SOLID: |
18:23.02 |
vasc |
<PROTECTED> |
18:23.02 |
vasc |
register struct soltab *seek_stp =
treep->tr_a.tu_stp; |
18:23.02 |
vasc |
register struct seg **segpp; |
18:23.02 |
vasc |
for (BU_PTBL_FOR(segpp, (struct seg **),
&partp->pt_seglist)) { |
18:23.03 |
vasc |
<PROTECTED> |
18:23.05 |
vasc |
ret = 1; |
18:23.07 |
vasc |
goto pop; |
18:23.09 |
vasc |
<PROTECTED> |
18:23.11 |
vasc |
} |
18:23.13 |
vasc |
ret = 0; |
18:23.15 |
vasc |
<PROTECTED> |
18:23.17 |
vasc |
<PROTECTED> |
18:27.39 |
vasc |
so |
18:28.00 |
vasc |
you need to know if a partition has a solid in
it |
18:31.31 |
mdtwenty[m] |
a solid? |
18:33.43 |
vasc |
a solid is basically a primitive
object. |
18:33.54 |
vasc |
like a sphere. |
18:34.03 |
vasc |
in brlcad parlance. |
18:35.08 |
vasc |
a solid object. |
18:39.01 |
vasc |
i'll go jog for a while. should be back in
30-45 mins |
18:39.56 |
mdtwenty[m] |
Ok :) i will also take a break to get
dinner |
19:04.28 |
*** join/#brlcad DaRock
(~Thunderbi@mail.unitedinsong.com.au) |
20:17.32 |
vasc |
. |
20:38.16 |
mdtwenty[m] |
so i changed the struct partition and it is
working fine.. i am working on the dynamic bit vector right
now |
20:39.35 |
vasc |
good. make sure to make a backup.
:-) |
20:40.12 |
vasc |
you'll have to initialize the data
though |
20:40.30 |
vasc |
prolly the easiest way is to bzero the memory
before using it. |
21:03.27 |
mdtwenty[m] |
ok i will have to leave the house for a bit
and probably will only get back to this tomorrow morning |
21:05.17 |
mdtwenty[m] |
i will notify you once its done |
21:07.27 |
vasc |
ok |
21:08.42 |
vasc |
see you later then! |