IRC log for #brlcad on 20170623

00:19.53 *** join/#brlcad infobot (~infobot@rikers.org)
00:19.53 *** topic/#brlcad is GSoC students: if you have a question, ask and wait for an answer ... responses may take minutes or hours. Ask and WAIT. ;)
00:27.59 *** join/#brlcad efjrugungwcmohmu (~armin@dslb-092-075-157-134.092.075.pools.vodafone-ip.de)
03:41.06 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
04:40.52 Notify 03BRL-CAD Wiki:Ravilogaiya * 0 /wiki/User:Ravilogaiya:
05:31.46 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
05:32.36 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
05:33.21 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
05:34.12 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
05:35.02 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
05:35.46 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
06:04.33 *** join/#brlcad KimK (~Kim__@2600:8803:7a81:7400:69b5:1646:8ec0:c796)
06:34.41 *** join/#brlcad DaRock (~Thunderbi@mail.unitedinsong.com.au)
06:58.47 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
07:36.20 *** join/#brlcad Caterpillar (~caterpill@unaffiliated/caterpillar)
07:43.44 *** join/#brlcad merzo (~merzo@93.94.41.67)
07:49.32 Notify 03BRL-CAD:Amritpal singh * 10057 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */
07:53.11 Notify 03BRL-CAD:Amritpal singh * 10058 /wiki/User:Amritpal_singh/GSoC17/logs: /* Coding Period */
08:12.20 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
08:34.21 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
08:51.53 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
10:38.01 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
11:43.38 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
11:46.04 *** join/#brlcad d_rossberg (~rossberg@104.225.5.10)
12:15.57 *** join/#brlcad DaRock (~Thunderbi@mail.unitedinsong.com.au)
12:19.55 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
12:58.55 *** join/#brlcad kintel (~kintel@unaffiliated/kintel)
13:33.22 *** join/#brlcad gabbar1947 (uid205515@gateway/web/irccloud.com/x-aihythsembbzujjz)
13:48.51 d_rossberg gabbar1947: i ran into a cmake error: include/rt/primitives/annot.h isn't there, and annot.h is inserted twice in the CMakeLists.txt
13:49.40 gabbar1947 I'll check, give me a second
13:51.40 gabbar1947 Rectified: I'm building on my system, just a moment
14:03.08 gabbar1947 Uploaded, this should pass
14:07.08 d_rossberg gabbar1947: what is your intention behind your changes to the root CMakeLists.txt?
14:08.31 *** join/#brlcad yorik (~yorik@2804:431:f720:80d8:290:f5ff:fedc:3bb2)
14:08.55 gabbar1947 Actually I did not make any changes to the file, it's somehow reflected in the patch.
14:09.01 gabbar1947 I'll check once
14:10.12 gabbar1947 I'm unaware of any such change made by me, I have no idea why this reflects in the patch
14:12.21 d_rossberg i recommend opening the patch file in a text editor and checking that it looks reasonable
14:13.18 gabbar1947 I'll revert the change to the CMakeLists.txt. Is there anything else that you want me to look into?
14:13.36 d_rossberg in addition you have trailing spaces in your code
14:14.13 d_rossberg the "good" text editors provide a function to remove them all in one step
14:14.58 gabbar1947 I tried to remove as much as I could, anyways I'll go through the files once again. I'll use a text editor other than vim.
14:16.41 d_rossberg google says in vi it is :%s/\s\+$//e
14:18.22 d_rossberg see https://vi.stackexchange.com/questions/454/whats-the-simplest-way-to-strip-trailing-whitespace-from-all-lines-in-a-file/
14:18.22 gcibot [ What's the simplest way to strip trailing whitespace from all lines in a file? - Vi and Vim Stack Exchange ]
14:18.35 d_rossberg and its reference to vim.wikia
14:18.43 gabbar1947 works! thank you
14:19.45 gabbar1947 On it! just a moment
14:22.20 d_rossberg i wrote my documents with vi for many years, you can write great literature with it - if you are tough enough for it
14:29.50 gabbar1947 can you give it a try now, and let me know if there are more errors !
14:34.33 d_rossberg ok: typein.c has trailing spaces and a C++ comment (i.e. //)
14:35.41 d_rossberg maybe you should simply remove the line - and the number of segments in p_annot for the sake of simplicity (see Sean's mail)
14:35.52 gabbar1947 I'll remove it ! The C++ instinct !
14:38.01 gabbar1947 Actually I wanted the "l" command to display the annotation container details as well, that was the reason for its inclusion. anyways i'm removing it!
14:38.13 d_rossberg run the :%s/\s\+$//e on all files you've touched to make sure that no trailing spaces are left
14:38.38 d_rossberg isn't typein.c the in command?
14:42.31 d_rossberg sorry, i had an old patch, in the actual one all trailing spaces seem to be gone
14:42.40 gabbar1947 :)
14:43.41 gabbar1947 typein.c is the "in" command! but the describe() function for annotations displays the container params as well, so I just wanted to see the details on the screen, that's it!
14:44.22 d_rossberg ok, it's your decision
14:46.58 gabbar1947 I'm submitting the patch, once build completes !
14:55.45 gabbar1947 UPLOADED
14:59.34 d_rossberg however, i've to go now :(
14:59.41 d_rossberg i'll see ...
15:11.36 gabbar1947 :)
15:38.26 *** join/#brlcad merzo (~merzo@93.94.41.67)
15:49.55 *** join/#brlcad vasc (~vasc@bl14-42-31.dsl.telepac.pt)
15:50.01 vasc hey
15:50.04 vasc hello mdtwenty[m]
15:50.23 mdtwenty[m] Hi :)
15:51.02 vasc so... you said something about only supporting one partition?
15:52.03 mdtwenty[m] not only one partition.. one region
15:52.09 vasc right.
15:52.52 vasc but that isn't in the weaving part proper right?
15:52.58 vasc it's in the evaluation for rendering?
15:53.37 mdtwenty[m] yes it's in the evaluation part
15:53.51 vasc well. you haven't implemented rt_boolfinal yet.
15:54.35 vasc is there anything left to do in the boolean weaving?
15:55.36 mdtwenty[m] i don't think so.. i think the boolean weaving is already fine
15:56.10 mdtwenty[m] i was looking now into the rt_boolfinal
15:58.02 mdtwenty[m] *i also uploaded today the weave patch without using the pointers in the cl_partition structure
15:58.38 vasc yes. i've seen that. i'll have to review it more in depth later but it seems fine on a cursory glance.
15:59.28 vasc there's still the question of linked lists vs arrays, but without more complex test scenes there's no good way to benchmark it.
16:02.15 vasc the most complex test scenes on the standard database files are probably goliath and havoc. but i'm not sure if they use much csg
16:02.35 vasc they probably don't.
16:03.07 vasc have you tried rendering those to see what happens?
16:03.15 vasc and the operators scene.
16:03.22 vasc oh right. rendering issues.
16:03.34 vasc well.
16:03.38 mdtwenty[m] the havoc and the goliath?
16:03.47 vasc yes. havoc.g and goliath.g i think
16:03.53 vasc in share/db or something.
16:04.05 mdtwenty[m] yes i see it
16:04.08 mdtwenty[m] one sec
16:04.38 vasc if it doesn't crash that would be good enough i guess.
16:04.43 vasc for now.
16:05.27 vasc hopefully it's not horrendously slow either.
16:06.23 mdtwenty[m] hm the goliath scene fails the assertion of 32 segments per ray
16:06.30 mdtwenty[m] so i think 32 is not enough
16:06.30 vasc hah.
16:06.50 vasc see what's the max depth.
16:07.41 vasc it's still prolly under 64.
16:08.24 vasc i think i have some dynamic bitvector code in opencl or ANSI C in here somewhere you could use if it's bigger than that.
16:11.35 mdtwenty[m] is there a function to see what is the max depth or something?
16:12.22 vasc no. just in that for loop where you do the assert, keep track of the max segments per ray size, and then print it out when the loop ends.
16:13.18 vasc there might be something like that in one of the brlcad tools, but i'm not sure if it would work with the opencl backend.
16:20.00 vasc ok found it
16:20.01 vasc == host
16:20.02 vasc cl_uint ND = N/WORD_BITS + 1;
16:20.02 vasc mD = clCreateBuffer(gpuCtx, CL_MEM_READ_WRITE, sizeof(cl_uint) * ND, NULL, NULL);
16:20.02 vasc == device
16:20.02 vasc inline uint bindex(const uint b) {
16:20.03 vasc return (b >> 5);
16:20.05 vasc }
16:20.07 vasc inline uint bmask(const uint b) {
16:20.09 vasc return (1 << (b & 31));
16:20.11 vasc }
16:20.13 vasc inline uint isset(__global uint *bitset, const uint b) {
16:20.15 vasc return (bitset[bindex(b)] & bmask(b));
16:20.17 vasc }
16:20.19 vasc inline uint clr(__global uint *bitset, const uint b) {
16:20.21 vasc return (bitset[bindex(b)] &= ~bmask(b));
16:20.23 vasc }
16:20.25 vasc inline uint set(__global uint *bitset, const uint b) {
16:20.27 vasc return (bitset[bindex(b)] |= bmask(b));
16:20.29 vasc }
16:20.31 vasc -
16:20.33 vasc this is my code, so i give you permission to use it for any purpose.
16:20.49 vasc where WORD_BITS is 32 since 'D' is an array of cl_uints
16:21.33 vasc and N is the amount of bits you want the bitvector to have.
16:21.42 mdtwenty[m] i got maxdepth of 957571
16:21.48 vasc WHAT
16:21.52 vasc do the math properly dude.
16:21.58 vasc :-)
16:22.02 vasc that can't be true.
16:22.13 vasc max per ray, not the sum of everything.
16:22.55 Stragus Still not a fan of allocating chunks out of big buffers through atomics, eh :)
16:23.27 vasc we might do that eventually. but for now there's a lot of gfx card memory we don't use.
16:24.01 Stragus All right. It goes up quickly when buffering all hits for millions of rays
16:24.52 vasc well sure. we could find the warp size and only allocate a buffer of that size or something.
16:24.57 vasc it's too much work :-)
16:25.27 vasc those microoptimizations can be done later.
16:25.59 vasc i kinda doubt we need to do it this way anyway.
16:26.12 vasc i suspect we could do the csg processing in an iterative fashion with a modified algorithm.
16:26.48 Stragus So you want to allocate the "max depth" for every ray... and how do you determine that max depth?
16:27.22 vasc we count the amount of segments per ray before allocating and actually storing the segments.
16:27.37 Stragus Ideally, you would process the segments as they come rather than buffering the whole thing. That complicates the code though
16:27.51 Stragus Ah yes, the two passes thing, count then trace
16:28.04 vasc yeah i suspect that could be done across the whole pipeline, but it requires rethinking the algorithm.
16:28.14 vasc it's probably non-trivial. but yeah it's worthwhile in the long run.
16:28.48 vasc i'm just kinda reticent about doing it first hand without understanding how the current algorithm works properly.
16:29.14 vasc not just mechanically but in terms of performance as well.
16:29.40 Stragus My raytracer used an inlined callback, the user could do whatever it wants with the hits. They can be processed on the fly (recommended) or buffered by a custom solution in the callback. And importantly, the inlined callback can terminate rays early
16:29.48 vasc especially considering the guys who originally wrote the code didn't do it, and they worked on it for decades.
16:30.10 vasc well, we kind of have something like that,
16:30.36 vasc there's no storing of segments in the single pass version of the renderer.
16:30.45 Stragus Cool. Is the callback truly inlined? You want to avoid any function call on GPU, especially function pointers
16:30.45 vasc but that doesn't do CSG.
16:31.09 vasc and it only returns the first hit, or an accumulation of the result of all the hits.
16:31.28 vasc you can't have function pointers on opencl.
16:31.40 vasc but yeah its some function.
16:31.45 Stragus Eheh. You can with CUDA, but it's a Very Bad Idea anyway
16:32.38 mdtwenty[m] i was doing it wrongly :D
16:32.58 vasc that version of the renderer is way faster than the current ANSI C one.
16:33.04 mdtwenty[m] i got 493 maxdepth for the goliath and 105 for the havoc
16:33.12 vasc but it doesn't do CSG so it isn't a proper comparison
16:33.16 vasc really?
16:33.21 vasc it's still way more than i expected.
16:33.32 Stragus vasc: I'm sure. My raytracer of triangles reached a billion rays per second... while a CPU core does 20M per second at most
16:33.44 Stragus (SSE optimized CPU code)
16:33.56 vasc mdtwenty[m], what's the amount of primitives in each scene?
16:34.33 vasc just for curiosity's sake.
16:35.06 vasc i think there's a 'list' command in mged or something
16:35.25 mdtwenty[m] goliath has 10499 primitives and havoc 2429
16:35.46 vasc pfew, it's still smaller at least.
16:35.58 vasc but...
16:36.09 Stragus There could be some ray hitting a bunch of aligned screws? :)
16:36.10 mdtwenty[m] well the good news is that the boolean weaving doesn't crash
16:36.49 vasc well it might use an outrageous amount of memory.
16:37.10 vasc so perhaps Stragus will get his thing.
16:37.26 vasc :-)
16:37.48 vasc the havoc with 105 is ok i guess.
16:37.51 vasc but 493
16:38.10 vasc that's 16 double words
16:38.25 vasc i.e. cl_uint [16]
16:38.41 mdtwenty[m] yeah i got 493 while rendering with the front view
16:39.09 vasc can you compute the amount of memory that would take with that size of bitvector?
16:39.17 vasc the whole segments array
16:39.39 Stragus Why use the max depth for a ray though? Why not compute the sum of all rays, through a reduction kernel, if you are going to perform an identical trace right away?
16:40.08 vasc we have a bitvector we use inside each ray's segment list
16:40.15 Stragus (I still prefer dynamically allocated memory, but your way would work fine, except for the tracing-twice thing)
16:41.00 vasc well
16:41.02 vasc it's like
16:41.06 vasc we have a list of segments
16:41.11 vasc which gets computed into a list of partitions
16:41.44 vasc and then that gets evaluated
16:42.05 Stragus So each ray computes how much storage it requires, it stores that number, and you reduce all these numbers to a grand total?
16:42.13 vasc each segment only belongs in one object right? but the partitions can belong to more than one.
16:42.30 vasc like the ray pierces one and exits the other. but it's the same partition solid space.
16:42.52 vasc yeah its kinda like that.
16:43.22 vasc but that's only used to compute the amount of space we'll need.
16:43.30 vasc the actual algorithm isn't just a reduction.
16:44.17 Stragus Right... but what I'm saying is that it doesn't matter if the "max depth" is 40000 due to a bunch of aligned screws somewhere
16:44.25 Stragus You want the total for all rays
16:44.38 vasc mdtwenty[m], another thing we could do is dynamically allocate the bitvector, so rays with more segments would get larger bitvectors, but i wonder if that would complicate the code too much.
16:45.22 vasc hmm
16:47.59 vasc so it's max_partitions*sizeof(cl_partition_without the bitvector)+max_partitions*sizeof(cl_uint)*(493/32)
16:48.09 vasc how much is that in bytes?
16:48.17 vasc mdtwenty[m]
16:48.44 vasc and this max_partitions is the total amount of partitions for all the rays.
16:48.53 vasc sum
16:49.48 mdtwenty[m] one sec
17:03.16 vasc hmm
17:03.31 vasc perhaps this is not as big of a deal as i thought
17:04.22 vasc +struct cl_partition {
17:04.22 vasc + struct cl_seg inseg;
17:04.22 vasc + struct cl_hit inhit;
17:04.22 vasc + struct cl_seg outseg;
17:04.22 vasc + struct cl_hit outhit;
17:04.23 vasc + cl_uint segs; /* 32-bit vector to represent the segments in the partition */
17:04.25 vasc +};
17:04.30 vasc but
17:04.35 vasc struct cl_hit {
17:04.35 vasc <PROTECTED>
17:04.35 vasc <PROTECTED>
17:04.35 vasc <PROTECTED>
17:04.37 vasc <PROTECTED>
17:04.39 vasc <PROTECTED>
17:04.41 vasc };
17:04.43 vasc and
17:04.46 vasc struct cl_seg {
17:04.47 vasc <PROTECTED>
17:04.49 vasc <PROTECTED>
17:04.51 vasc <PROTECTED>
17:04.53 vasc };
17:04.55 vasc so
17:04.57 vasc who cares.
17:05.03 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
17:05.35 vasc it's like just a cl_hit has 8+8+8+2+1 words
17:05.39 vasc double words
17:05.57 vasc i.e. 27
17:06.25 Stragus That cl_hit struct is kind of heavy, like 84 bytes
17:06.29 vasc and each partition has like 6 of those
17:07.10 vasc so using even 16 double words for the bitvector seems pathetic in comparison.
17:07.50 vasc still i'm kinda interested to know how much memory the whole thing uses right now.
17:08.29 vasc Stragus, it's much, much worse than that.
17:08.43 vasc coz cl_double3's are ACTUALLY cl_double4s.
17:09.15 vasc it's an opencl thing.
17:09.40 vasc and then there's struct alignment
17:09.49 vasc which reminds me
17:09.58 vasc mdtwenty[m], instead of this:
17:10.22 vasc +struct cl_partition {
17:10.22 vasc + struct cl_seg inseg;
17:10.22 vasc + struct cl_hit inhit;
17:10.22 vasc + struct cl_seg outseg;
17:10.22 vasc + struct cl_hit outhit;
17:10.23 vasc + cl_uint segs; /* 32-bit vector to represent the segments in the partition */
17:10.24 vasc +};
17:10.27 vasc try this:
17:10.38 vasc +struct cl_partition {
17:10.38 vasc + struct cl_seg inseg;
17:10.38 vasc + struct cl_seg outseg;
17:10.38 vasc + struct cl_hit inhit;
17:10.38 vasc + struct cl_hit outhit;
17:10.39 vasc + cl_uint segs; /* 32-bit vector to represent the segments in the partition */
17:10.41 vasc +};
17:10.46 vasc and see if it's sizeof() is smaller.
17:11.01 Stragus That shouldn't make a difference, both cl_hit and cl_seg have the same alignment
17:11.08 vasc i hope so.
17:11.55 Stragus If these cl_double3 waste memory, then perhaps it should be packed differently
17:11.58 vasc it's the builtin cl_types that are an issue usually.
17:12.45 Stragus Although frankly, this whole data storage scheme is very unfriendly to GPUs and SIMD
17:12.55 vasc hm
17:13.05 vasc only because it isn't z-ordered.
17:13.09 vasc oh i see.
17:13.34 vasc well
17:13.39 Stragus No! Because you will have scattered loads/stores all over
17:13.48 Stragus All memory transactions will be 8 times slower than necessary
17:14.10 vasc the thing is you probably don't need the whole thing across all stages of the algorithm
17:14.20 vasc so you could fraction this
17:14.33 vasc and increase memory locality.
17:14.39 Stragus For best performance, all threads of a warp/wavefront need to access consecutive memory addresses
17:14.44 *** join/#brlcad merzo (~merzo@194.140.108.146)
17:15.01 Stragus So you need some struct where struct foo { float x[32]; float y[32]; etc. };
17:15.02 vasc it's like i said, you don't need the whole thing.
17:15.07 vasc even before that.
17:15.45 vasc yeah i know but if we minimize the size of the elements it ain't a big deal.
17:15.59 vasc the problem is the structs are too fat right now.
17:16.20 vasc still
17:16.35 vasc in comparison to the ANSI C code, it's incredibly memory coherent ya know?
17:16.40 Stragus Not a big deal? The stride between elements doesn't matter unless it's in the same cache lines
17:17.00 Stragus Well, these memory operations will be 8 times slower than if they were reorganized differently
17:17.14 Stragus That may or may not be a bottleneck, you'll decide that
17:17.31 vasc even with a cache?
17:17.34 vasc i'm not sure about that.
17:17.53 vasc i think the main issue is to have poor memory locality in accesses.
17:18.00 vasc rather than the access patterns themselves.
17:18.41 Stragus It's not about the cache, it's about memory transactions
17:18.52 vasc memory bank conflicts?
17:19.00 Stragus I am very sure about that, been doing CUDA for 8 years, and probably the biggest helper in #cuda...
17:19.16 Stragus Bank conflicts are for shared memory
17:19.58 vasc well the thing is
17:20.09 vasc if you're gonna need to access the rest of the struct in the same kernel
17:20.29 vasc it's all going to have to be loaded anyway.
17:20.30 Stragus Presumably, all threads of the same warp/wavefront will also access the rest of their structs, no?
17:21.15 vasc that's the thing i said, i think we don't need to store everything in that struct in all the stages of the algorithm. it's just that currently we're slavishly following the way the existing ANSI C code is structured.
17:21.37 vasc like
17:21.54 Stragus Okay! But it should still be designed so that consecutive threads access consecutive values in memory
17:22.02 Stragus You don't want a stride between threads
17:23.08 vasc i'll give you an example. i thought about doing that in the intersections code.
17:23.29 Stragus Consecutive addresses is _the_ solution that is fast on all GPUs from all vendors, for all generations. Beyond that, there are particularities if the accesses are shuffled, out of order, with gaps between chunks of 128 bytes, etc.
17:23.33 vasc well it turns out each kernel still has so many branches. a lot of threads will be idling and it's awfully low performance.
17:23.46 vasc the GPU isn't maxed out.
17:23.57 Stragus All right. But the threads that are active would still access a bunch of packed addresses
17:24.09 vasc no it's actually terrible.
17:24.33 vasc imagine one thread is doing a quadric like a sphere, and the other is doing a torus intersection.
17:24.55 Stragus Indeed, paths should be merged as much as possible. Coherent rays can help a lot with that
17:25.14 vasc well i thought about that. that actually kind of happens as it is.
17:25.28 vasc since i'm using a thread block.
17:25.30 vasc but
17:25.36 Stragus If there are some common operations between spheres and toruses, like storing data (memory transactions), they should be merged together
17:25.46 vasc i think the best thing would be to reorder the intersection calculations.
17:25.47 Stragus As little code as possible should be specific to branches
17:25.57 Stragus That's possible, yes
17:26.21 vasc like group the ones that use the same kernel solver together.
17:27.18 vasc but the whole existing ANSI C code is built more to minimize the amount of operations than to reduce memory consumption or maximize memory coherency
17:27.18 Stragus It's possible to have warp-wide votes to decide how many threads need to perform operation X, before deciding to do it with a bunch of threads
17:27.38 vasc or minimize branches
17:27.38 Stragus But these aren't as critical issues as properly organizing memory. Reshuffling memory implies rewriting a lot of code, so it must be done early
17:27.59 Stragus Right, it's very different to optimize for scalar execution and for wide parallelism
17:28.09 vasc its not just that
17:28.16 vasc it's optimized for 1980s machines
17:28.20 Stragus Oh I see, yes
17:28.32 Stragus Memory was fast, ALUs were slow. And now it's the other way around
17:28.35 vasc yep
17:31.01 vasc so mdtwenty[m] any luck with that?
17:33.19 mdtwenty[m] sent a long message: mdtwenty[m]_2017-06-23_17:33:18.txt <https://matrix.org/_matrix/media/v1/download/matrix.org/hVZtHbkIpDKOvbklzQAQwlpz>
17:33.39 vasc ok
17:33.47 vasc what about the memory size of the whole thing?
17:35.35 vasc <vasc> hmm
17:35.36 vasc <vasc> so it's max_partitions*sizeof(cl_partition_without the bitvector)+max_partitions*sizeof(cl_uint)*(493/32)
17:35.36 vasc <vasc> how much is that in bytes?
17:35.36 vasc <vasc> mdtwenty[m]
17:35.36 vasc <vasc> and this max_partitions is the total amount of partitions for all the rays.
17:35.38 vasc <vasc> sum
17:35.41 mdtwenty[m] i got 2337015024
17:35.55 mdtwenty[m] for the goliath that has 493 depth
17:36.35 vasc 2 GB?!
17:37.04 vasc ok, how much is max_partitions*sizeof(cl_partition_without the bitvector) alone?
17:37.39 mdtwenty[m] compiling
17:39.06 mdtwenty[m] 2 179 756 800
17:39.49 vasc also i wanna see the code for bool_eval
17:39.59 vasc so the bitvectors aren't the real problem
17:40.16 vasc since they use a "mere" 200 MB or less.
17:40.31 vasc ok i think i got the idea
17:40.36 vasc +struct cl_partition {
17:40.36 vasc + struct cl_seg inseg;
17:40.36 vasc + struct cl_hit inhit;
17:40.36 vasc + struct cl_seg outseg;
17:40.37 vasc + struct cl_hit outhit;
17:40.37 vasc + cl_uint segs; /* 32-bit vector to represent the segments in the partition */
17:40.39 vasc +};
17:40.41 vasc +
17:40.53 vasc instead of storing copies of the cl_segs, why not use indexes instead?
17:41.51 mdtwenty[m] yes i think that it would work
17:43.40 vasc ((8*4*3+8+4)*2+4)*2 vs 4*2
17:43.45 vasc 440 vs 8
17:43.56 vasc that should shrink things down
17:47.10 vasc do you have the code for bool_eval so i can look at it? i kind of want to understand which data in the partitions will get accessed in rt_boolfinal and rendering.
17:49.30 mdtwenty[m] posted a file: ocl_bool_eval.patch (51KB) <https://matrix.org/_matrix/media/v1/download/matrix.org/SFtMrGVveJAkLiBFABODKZMf>
17:49.40 mdtwenty[m] this should be it
17:51.27 vasc ok
17:51.40 mdtwenty[m] i think that only the segments in the partition are relevant for boolean evaluation and shading
17:51.48 vasc rt_boolfinal seems to hammer a partition's inhit/outhit .hit_dist's over and over.
17:52.11 vasc and then it computes segment regions
17:54.50 vasc in a later optimization we may want to simplify the partition structures.
17:55.03 vasc struct cl_hit {
17:55.03 vasc <PROTECTED>
17:55.03 vasc <PROTECTED>
17:55.03 vasc <PROTECTED>
17:55.03 vasc <PROTECTED>
17:55.04 vasc <PROTECTED>
17:55.06 vasc };
17:55.15 vasc of all of this, it looks as if rt_boolfinal only accesses the hit_dist.
17:55.31 vasc i think the rest is only accessed in the final rendering stages.
17:56.14 vasc but for now, just not storing the whole cl_segs in the cl_partitions should be enough.
17:57.21 Stragus Eh, still keep in mind that repacking of hits in structs of arrays of 32 or 64
17:58.09 vasc well the current code is horrible in several ways.
17:58.25 vasc just the amount of memory copies going on is kind of insane.
17:58.34 vasc i'm surprised it's as fast as it is.
17:58.44 Stragus How "fast" is fast? :)
17:59.11 Stragus The reference CPU code is also abysmally slow, so it's not a great point of comparison
17:59.44 vasc mdtwenty[m], after you use indexes instead of storing the whole cl_segs, test goliath and havoc again. tell me the time it takes and if it crashes or not.
18:00.00 vasc Stragus, well it's what we have.
18:00.26 vasc mdtwenty[m], once you get that working, it's time to work on rt_boolfinal
18:00.54 vasc mdtwenty[m], oh and tell me how much memory it uses to store the partitions now vs what it used before.
18:01.05 vasc with the indexes.
18:01.10 mdtwenty[m] and what about the bitvector for now?
18:01.25 vasc yes, use the dynamic bitvector too
18:01.35 vasc i pasted the code above, did you get it?
18:01.59 vasc <vasc> ok found it
18:01.59 vasc <vasc> == host
18:01.59 vasc <vasc> cl_uint ND = N/WORD_BITS + 1;
18:01.59 vasc <vasc> mD = clCreateBuffer(gpuCtx, CL_MEM_READ_WRITE, sizeof(cl_uint) * ND, NULL, NULL);
18:01.59 vasc <vasc> == device
18:02.00 vasc <vasc> inline uint bindex(const uint b) {
18:02.02 vasc <vasc> return (b >> 5);
18:02.04 vasc <vasc> }
18:02.06 vasc <vasc> inline uint bmask(const uint b) {
18:02.08 vasc <vasc> return (1 << (b & 31));
18:02.10 vasc <vasc> }
18:02.12 vasc <vasc> inline uint isset(__global uint *bitset, const uint b) {
18:02.14 vasc <vasc> return (bitset[bindex(b)] & bmask(b));
18:02.16 vasc <vasc> }
18:02.18 vasc <vasc> inline uint clr(__global uint *bitset, const uint b) {
18:02.20 vasc <vasc> return (bitset[bindex(b)] &= ~bmask(b));
18:02.22 vasc <vasc> }
18:02.24 vasc <vasc> inline uint set(__global uint *bitset, const uint b) {
18:02.26 vasc <vasc> return (bitset[bindex(b)] |= bmask(b));
18:02.28 vasc <vasc> }
18:02.30 vasc <vasc> -
18:02.32 vasc <vasc> this is my code, so i give you permission to use it for any purpose.
18:02.34 vasc <vasc> where WORD_BITS is 32 since 'D' is an array of cl_uints
18:02.35 vasc <vasc> and N is the amount of bits you want the bitvector to have.
18:03.10 vasc basically you have a cl_uint array per bitvector
18:03.32 vasc and you can use the isset, clr, or set functions to twiddle the bits.
18:03.39 vasc or query them.
18:04.48 vasc for a start you can just use a cl_uint segs[16]; or whatever
18:04.59 vasc but eventually you want to dynamically determine the size of this
18:05.39 vasc the quick and dirty way to do it is basically to pass it as a #define before the kernels are compiled.
18:06.07 vasc but don't do that.
18:06.28 vasc we'll probably need to optimize this some other way. but without more tests, it's hard to determine the appropriate solution.
18:07.05 Stragus That bitvector is to determine entry/exit status through solids?
18:07.15 vasc no
18:07.29 vasc it states which segments are within a partition
18:07.43 vasc it's per ray
18:08.01 vasc we could do this some other way though
18:08.01 Stragus Okay. I guess I'm not familiar enough with the terminology used by the BRL-CAD raytracer
18:08.27 vasc if the bitvector is too sparse, we would probably be better off with using a list, like the current code already does.
18:09.11 Stragus Without knowing what the bitvector was, that was my thought
18:10.31 vasc well.
18:10.47 vasc i thought there would be less depth complexity in the average scene than there actually is.
18:10.50 vasc my mistake.
18:11.35 vasc a typical game scene has like 3 depth complexity.
18:12.05 vasc in here we don't cull stuff.
18:12.28 vasc i thought 32 was enough. so much for that.
18:12.38 *** join/#brlcad teepee (~teepee@unaffiliated/teepee)
18:13.28 vasc i know there are hard limits in the amount of intersections per ray on triangle meshes in the current code for example.
18:15.59 vasc https://svn.code.sf.net/p/brlcad/code/brlcad/trunk/src/librt/primitives/bot/tie.c
18:16.06 vasc <PROTECTED>
18:17.32 vasc mdtwenty[m], if the bitvector is too slow on the goliath scene, we'll have to use lists again...
18:18.29 mdtwenty[m] ok, i will change the bool weave to use the indexes for segments instead of the segs and will test
18:18.35 vasc ok
18:22.20 vasc hm
18:23.00 vasc so this is what the ANSI C code does for bool_eval of solids.
18:23.02 vasc case OP_SOLID:
18:23.02 vasc <PROTECTED>
18:23.02 vasc register struct soltab *seek_stp = treep->tr_a.tu_stp;
18:23.02 vasc register struct seg **segpp;
18:23.02 vasc for (BU_PTBL_FOR(segpp, (struct seg **), &partp->pt_seglist)) {
18:23.03 vasc <PROTECTED>
18:23.05 vasc ret = 1;
18:23.07 vasc goto pop;
18:23.09 vasc <PROTECTED>
18:23.11 vasc }
18:23.13 vasc ret = 0;
18:23.15 vasc <PROTECTED>
18:23.17 vasc <PROTECTED>
18:27.39 vasc so
18:28.00 vasc you need to know if a partition has a solid in it
18:31.31 mdtwenty[m] a solid?
18:33.43 vasc a solid is basically a primitive object.
18:33.54 vasc like a sphere.
18:34.03 vasc in brlcad parlance.
18:35.08 vasc a solid object.
18:39.01 vasc i'll go jog for a while. should be back in 30-45 mins
18:39.56 mdtwenty[m] Ok :) i will also take a break to get dinner
19:04.28 *** join/#brlcad DaRock (~Thunderbi@mail.unitedinsong.com.au)
20:17.32 vasc .
20:38.16 mdtwenty[m] so i changed the struct partition and it is working fine.. i am working on the dynamic bit vector right now
20:39.35 vasc good. make sure to make a backup. :-)
20:40.12 vasc you'll have to initialize the data though
20:40.30 vasc prolly the easiest way is to bzero the memory before using it.
21:03.27 mdtwenty[m] ok i will have to leave the house for a bit and probably will only get back to this tomorrow morning
21:05.17 mdtwenty[m] i will notify you once its done
21:07.27 vasc ok
21:08.42 vasc see you later then!

Generated by irclog2html.pl Modified by Tim Riker to work with infobot.