Stream: brlcad

Topic: performance


view this post on Zulip Sean (Jan 30 2018 at 02:46):

@Cezar the overarching goal is to improve ray tracing performance, which has lots of potential sub-projects including conversion to OpenCL, optimization of specific routines, and developing a new interactive GUI based on ray tracing

view this post on Zulip Sean (Jan 30 2018 at 02:46):

there's also the performance of libged commands, which run the gamut, and the performance of our lowest-level utility containers in libbu (which will likely involve profiling and replacement with C++ containers)

specific performance hot spots (priorities) include A) the raytrace pipeline needing to dispatch sets of rays (64x64 postage stamps that subdivide into packets of 8 rays), B) boolean evaluation conversion to SPMD/SIMD, C) optimization/elimination of libbu pointer tables (bu_ptbl_* calls) and bit vectors, and D) elimination of libbu container aliasing (primarily elimination of struct bu_list iteration)

view this post on Zulip Sean (Feb 12 2018 at 03:33):

@Cezar there are definitely faster ways to search for a pointer (or we could try to eliminate the need to search for a pointer in the first place) -- even a std::unordered_map may be faster. what was meant about the bu_list elimination in D was a more efficient implementation, a different way that doesn't involve pointer aliasing. for B, not so much parallelism (though that would be a fine project for some specific commands) as commands that quickly fall apart (e.g., a database with 1M items). postage stamps are simply rendering subsections of an image, e.g., in 64x64 pixel tiles, instead of a pixel or line at a time.

view this post on Zulip Rahul Saxena (Feb 13 2018 at 13:06):

Hi, I am a GSoC aspirant interested in CAD design. Please guide me on how to contribute to this project.

view this post on Zulip Sean (Feb 13 2018 at 14:08):

@Rahul Saxena welcome, your best guide is yourself, demonstrating independent productivity and interest ;)

view this post on Zulip Cezar (Feb 13 2018 at 15:01):

regarding B, i was wondering what you meant by conversion to SIMD/SPMD, if not parallelism. also, i'm wondering why it's called boolean evaluation, to me it looks more like set theory, at least the operations. i've also tried to find the code doing the evaluation, it seems to be in libnmg/nmg_bool/bool.c:nmg_bool, and i think the expression tree is constructed in libged/comb.c. do i get this right?

view this post on Zulip Cezar (Feb 13 2018 at 15:03):

also, for large dbs, is evaluation too slow because of an inefficient algorithm, or does it just not happen?

view this post on Zulip Cezar (Feb 13 2018 at 15:04):

i think i could try creating a large db to test that, unless one already exists

view this post on Zulip Sean (Feb 14 2018 at 05:38):

@Cezar conversion to simd/spmd is using vector units to do some things 4 at a time for the price of 1
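
As a rough sketch of what "4 at a time for the price of 1" means: a plain element-wise loop like the one below can be auto-vectorized by the compiler into SSE/AVX instructions that process several doubles per instruction (4 with 256-bit AVX). This is an illustrative example, not BRL-CAD code.

```cpp
#include <cstddef>

// Hypothetical example (not BRL-CAD API): adding two arrays of doubles.
// A scalar loop handles one element per iteration; with vector units the
// compiler can auto-vectorize this so each instruction adds 2-4 doubles.
void add_arrays(const double *a, const double *b, double *out, std::size_t n)
{
    // Telling the compiler the pointers don't alias (e.g. restrict) would
    // further help it vectorize, but is not required for correctness.
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```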

view this post on Zulip Sean (Feb 14 2018 at 05:40):

boolean evaluation is not libnmg (that's a specific subset of polygonal geometry) -- boolean evaluation is in librt/bool.c (more specifically rt_boolweave() and rt_boolfinal()).

view this post on Zulip Sean (Feb 14 2018 at 05:46):

for large db's, it's not that evaluation itself is slow -- it's that some algorithms are O(N^2) or worse, meaning they get quadratically (or more) slower as there are more items. so for example if some command (e.g., facetize) takes 20 seconds with 20 booleans, how long will it take with 40 booleans? you'd hope it'd take 40 seconds or better, but a quadratic algorithm would take 80 seconds, and worse complexities can turn 20 seconds into 400. so a good project would be identifying which commands slow down superlinearly and then working to fix them.
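
The scaling Sean describes can be made concrete: in an O(N^2) all-pairs algorithm, doubling the item count quadruples the work. A hypothetical sketch (not a BRL-CAD command):

```cpp
#include <cstdint>

// Illustration (not BRL-CAD code): an O(N^2) all-pairs pass. Doubling N
// quadruples the number of inner-loop operations, so runtime grows far
// faster than linearly with model size.
std::uint64_t pairwise_ops(std::uint64_t n)
{
    std::uint64_t ops = 0;
    for (std::uint64_t i = 0; i < n; ++i)
        for (std::uint64_t j = 0; j < n; ++j)
            ++ops;          // stand-in for "compare item i against item j"
    return ops;
}
```

With 20 items this does 400 operations; with 40 items, 1600 -- four times the work for twice the input.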

view this post on Zulip Cezar (Mar 22 2018 at 10:59):

i've run rt -B -s512 -H127 -J0 -o h.p havoc.g havoc under a profiler, it took 6.12 min with most of that spent in bool_eval (1.76 min, and 1/5 of that iterating through a bu_list) and bu_ptbl_cat_uniq (17.68 sec)

view this post on Zulip Cezar (Mar 22 2018 at 10:59):

i was wondering if there is anything else i should benchmark?

view this post on Zulip Sean (Mar 22 2018 at 14:15):

@Cezar you found three priority performance areas right there (bu_ptbl, bu_list, and boolean evaluation) -- perhaps some of the earlier comments will make more sense to you now. just addressing all three of those is a lot to tackle for gsoc, but definitely all priority. thinking about optimization should generally go top-down: can some of the boolean evals be eliminated? for those that can't, can they be made more efficient, like eliminating bu_list traversals? for those that can't, maybe the container can be changed (e.g., from a list to an array), or maybe changed so we don't need to store pointers at all, and finally the way pointers are stored can be changed (e.g., using a C++ hash or map container)
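
The last option -- swapping a scanned pointer table for a hashed container -- might look something like this sketch. The `Solid` type and both function names are made up for illustration; this is not libbu's API:

```cpp
#include <unordered_set>
#include <vector>

struct Solid { int id; };   // stand-in type, not BRL-CAD's struct soltab

// What a bu_ptbl-style membership test does today: a linear scan, O(N).
bool contains_linear(const std::vector<const Solid *> &tbl, const Solid *s)
{
    for (const Solid *p : tbl)
        if (p == s)
            return true;
    return false;
}

// The hashed alternative: average O(1) membership test.
bool contains_hashed(const std::unordered_set<const Solid *> &tbl, const Solid *s)
{
    return tbl.count(s) != 0;
}
```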

view this post on Zulip Cezar (Mar 22 2018 at 17:27):

is it fine to use C++? HACKING says that it's "not prohibited", but also that the project aims to be C89-compliant?

view this post on Zulip Cezar (Mar 22 2018 at 17:28):

also, i was looking at set data structures (for bool eval), and union-find looks like a good choice. has it been considered and discarded before?

view this post on Zulip Cezar (Mar 22 2018 at 17:31):

i think i've asked about this before, but "bool eval" seems like a misnomer, but maybe i'm missing something? it looks more like sets to me than boolean anything

view this post on Zulip Sean (Mar 22 2018 at 18:33):

C++ is fine except for in public API of C libraries

view this post on Zulip Sean (Mar 22 2018 at 18:33):

It's looking like we'll be doing our last c89-compliant release here soon (in the next couple weeks hopefully) after which we will be requiring c++11, so that can be part of a gsoc plan

view this post on Zulip Sean (Mar 22 2018 at 18:35):

as for union-find, the results of boolean evaluation are not necessarily disjoint sets

view this post on Zulip Sean (Mar 22 2018 at 18:37):

i think i've asked about this before, but "bool eval" seems like a misnomer, but maybe i'm missing something? it looks more like sets to me than boolean anything

CSG combines elements of set theory and boolean algebra, read https://en.wikipedia.org/wiki/Constructive_solid_geometry

view this post on Zulip Cezar (Mar 23 2018 at 12:14):

thanks for the answers. i have another question :D previously, you mentioned pointer aliasing in bu_list, but looking through src/libbu/list.c, i can't find any function where pointer aliasing is involved. i was wondering where i could find an example of it

view this post on Zulip Sean (Mar 24 2018 at 02:34):

this is a somewhat advanced topic. you should read up more on what exactly pointer aliasing is. it's not something you'd find in a list.c function -- you find it in all the places the struct is used

view this post on Zulip Cezar (Mar 24 2018 at 08:29):

i did read up on it, but i misunderstood and thought that it could only occur with function arguments (i see why that's not true now)

view this post on Zulip Cezar (May 09 2018 at 08:42):

i created two plots of calls to bool_eval during the raytracing of havoc.g. the first one has the number of segments in the partition given as argument (which is iterated when looking for seek_stp) and the second one is the number of times the OP_SOLID case is taken in the switch

view this post on Zulip Cezar (May 09 2018 at 08:44):

it seems that iterating through the list of segments isn't that expensive since there are usually 2 or 4 segments in the list, but the search happens a lot of times so it adds up

view this post on Zulip Cezar (May 09 2018 at 08:46):

it also tells me that a replacement data structure should have a low constant factor, or else it will likely perform worse despite having better asymptotic complexity (N is very small in this case)

view this post on Zulip Cezar (May 09 2018 at 08:49):

it might also be that havoc isn't really representative of cases where the list of segments is huge :-? i used it because it's the largest example geometry, but if other examples are better, i'll look at those as well

view this post on Zulip Sean (May 09 2018 at 12:08):

i created two plots of calls to bool_eval during the raytracing of havoc.g. the first one has the number of segments in the partition given as argument (which is iterated when looking for seek_stp) and the second one is the number of times the OP_SOLID case is taken in the switch

Very cool graphs! @Cezar if you keep these images somewhere, you could probably publish a formal report later in the project. Glad to help you submit it somewhere if that's of interest to you.

Another insight from usually having 2 or 4 segments in the list is that a structure could probably be devised that treats them in pairs and does better (or is potentially more vectorizable). It's also significant that there are "relatively few" in each partition, though we do have to be wary that the number of segments could be unlimited in the worst case.

view this post on Zulip Sean (May 09 2018 at 12:12):

havoc is fairly representative of a traditionally constructed csg model. other common types are imported geometries which will typically be mostly BoT meshes or NURBS, often with just a few (expensive) csg operations sprinkled throughout (e.g., a handful of subtractions to fix overlaps).

view this post on Zulip Sean (May 09 2018 at 12:52):

With the number of elements being as low as it is, one of the things I've thought would probably help is setting up a sorting network for the < N case

view this post on Zulip Cezar (May 09 2018 at 17:59):

didn't think about vectorization. i haven't written any vectorized code before, but i'll write some this week, since it seems to pop up rather often around this project

view this post on Zulip Cezar (May 09 2018 at 18:00):

regarding "setting up a sorting network for the < N case", i suppose you mean determining a certain number N, and if there are less than N segments, set up a sorting network?

view this post on Zulip Sean (May 09 2018 at 18:06):

you shouldn't need to worry about vectorization too much directly as there is some compiler support (for automatic vectorization) and several good abstractions available. we're more interested in OpenCL and TBB right now

view this post on Zulip Sean (May 09 2018 at 18:07):

regarding "setting up a sorting network for the < N case", i suppose you mean determining a certain number N, and if there are less than N segments, set up a sorting network?

yes, using a sorting network when the number of items to be sorted is small, falling back to something more general when it's larger
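
A sorting network in this sense is a fixed sequence of compare-exchange steps with no data-dependent control flow. As a sketch, here is the classic 5-comparator network for 4 elements (illustrative code, not from the codebase):

```cpp
#include <algorithm>
#include <array>

// One compare-exchange: after the call, a <= b.
inline void cswap(double &a, double &b)
{
    if (b < a) std::swap(a, b);
}

// Optimal 5-comparator sorting network for 4 elements. The comparator
// sequence is fixed, so the same instructions run regardless of input --
// the small-N special case, with a general sort as fallback for larger N.
void sort4(std::array<double, 4> &v)
{
    cswap(v[0], v[1]); cswap(v[2], v[3]);
    cswap(v[0], v[2]); cswap(v[1], v[3]);
    cswap(v[1], v[2]);
}
```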

view this post on Zulip Cezar (May 09 2018 at 18:10):

and having a sorting network in place, i would be able to find seek_stp in O(1) using the comparators?

view this post on Zulip Cezar (May 09 2018 at 18:13):

actually i think at least O(log N)

view this post on Zulip Cezar (May 09 2018 at 18:14):

and i suppose it would not work for (very) large N because of the space complexity?

view this post on Zulip Sean (May 09 2018 at 18:23):

you could definitely get to O(1) on average just by using a std::unordered_map for the ptbl

view this post on Zulip Sean (May 09 2018 at 18:24):

I'm not sure what exactly the bool_eval() logic is doing for OP_SOLID, would have to study the code more to understand why it's scanning for that tree solid in partition list

view this post on Zulip Cezar (May 09 2018 at 19:11):

i'm not familiar with sorting networks. i read about them now, and i don't get where they fit in, if not for the logN search :-?

view this post on Zulip Cezar (May 10 2018 at 11:12):

i was looking at a profile and noticed that bu_bitv_clear took a lot of time (Debug build). BU_BITV_ZEROALL is a macro that zeroes the bytes in the bit vector using a while loop. i replaced the loop with a call to memset and one call to the macro in rt_shootray went from 5.21 s to 248 ms. i was wondering what the reason for the loop was initially
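
For reference, the difference Cezar describes is roughly the following; the function names are illustrative, not libbu's actual macro:

```cpp
#include <cstdint>
#include <cstring>

// Roughly what a BU_BITV_ZEROALL-style while loop does: clear the storage
// one byte/word at a time.
void zero_bits_loop(std::uint8_t *bits, std::size_t nbytes)
{
    std::size_t i = nbytes;
    while (i--)
        bits[i] = 0;
}

// The replacement: one library call, typically compiled to wide stores
// (and recognized by optimizers even when written as a loop in Release).
void zero_bits_memset(std::uint8_t *bits, std::size_t nbytes)
{
    std::memset(bits, 0, nbytes);
}
```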

view this post on Zulip Cezar (May 10 2018 at 11:16):

although i expect that the improvements would be smaller on a release build, i think they would still exist

view this post on Zulip Cezar (May 10 2018 at 12:34):

looks like on release, the compiler makes this change itself, so it's useless

view this post on Zulip Cezar (May 10 2018 at 12:38):

regarding what you said before, that bitv functions shouldn't show up in profiles, i think it's fine. the only one i'm seeing is bitv_clear, and for 1 M bits, you still have to write 125,000 zeroes to memory

view this post on Zulip Cezar (May 10 2018 at 12:40):

assuming you can't remove the bitv altogether

view this post on Zulip Cezar (May 10 2018 at 12:49):

i'm thinking of pre-alloc'ing a large number of bytes for rt_shootray itself, and whenever a bitv is needed, just return a position within this buffer with enough bytes to fulfil the request. only zero it when a request can't be satisfied
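
The pre-allocation idea could be sketched as a small bump allocator. This is a hypothetical illustration (BitvArena is not a BRL-CAD type), and it assumes no single request exceeds the arena's capacity:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// One big zeroed buffer; each "bit vector" request just bumps an offset.
// Zeroing only happens when the buffer is exhausted and gets reused.
class BitvArena {
    std::vector<unsigned char> buf;
    std::size_t used = 0;
public:
    explicit BitvArena(std::size_t bytes) : buf(bytes, 0) {}

    // Returns a zeroed region of nbytes. When the request can't be
    // satisfied, re-zero the whole arena and start over from the front.
    unsigned char *acquire(std::size_t nbytes)
    {
        if (used + nbytes > buf.size()) {
            std::memset(buf.data(), 0, buf.size());
            used = 0;
        }
        unsigned char *p = buf.data() + used;
        used += nbytes;
        return p;
    }
};
```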

view this post on Zulip Cezar (May 10 2018 at 12:53):

although bitv_clear amounts to very little, i'm not sure it's worth doing this

view this post on Zulip Cezar (May 10 2018 at 22:44):

although for large number of bitv_clear's with large number of elements, it would most likely matter

view this post on Zulip Cezar (May 10 2018 at 22:44):

i did some comparisons with std::unordered_set, it's way faster

view this post on Zulip Cezar (May 10 2018 at 22:49):

if i start with replacing bu_ptbl, i still think testing existence in O(1) using bit vectors or stl is where i should begin

view this post on Zulip Sean (May 16 2018 at 13:22):

if i start with replacing bu_ptbl, i still think testing existence in O(1) using bit vectors or stl is where i should begin

sounds reasonable to me too!

view this post on Zulip Cezar (May 18 2018 at 06:38):

as i've written in my dev log, i replaced the bu_ptbl iteration inside bool_eval with bu_bitv. now rt_boolfinal is slower because of bit vector initialisation, but bool_final went from 14 seconds to 12 seconds. a lot of time is still spent checking if the solid is inside the partition (as expected from previous graphs), but each check is faster.

view this post on Zulip Cezar (May 18 2018 at 07:04):

i guess the next step would be to replace bu_bitv with std::bitset in this specific instance and see what happens then

view this post on Zulip Cezar (May 18 2018 at 07:17):

but i don't think it's worth it. the main problem with bool_eval is that there are lots of nodes in the tree, and each operation is already o(1), except the bu_ptbl iteration, in which there are few elements (most of the time just 1, in the worst case ~60; but even then, i would trade iterating over 60 elements for initialising a huge bit vector a lot of the time; even if i use std::bitset, i don't see it being worth it)

view this post on Zulip Cezar (May 18 2018 at 09:20):

i tried with std::bitset. it seems to be consistently 2 seconds faster on Debug, which is what i would expect :-?

view this post on Zulip Cezar (May 18 2018 at 09:30):

if you think it's enough for a patch, i'll submit one

view this post on Zulip Cezar (May 18 2018 at 09:56):

on Release, it's more like 0.5–1 seconds faster. it's consistently faster, but the impact is smaller

view this post on Zulip starseeker (May 18 2018 at 11:22):

@Cezar Go ahead and submit the std::bitset patch, since you've done the work - if it's consistently faster I don't see any particular reason not to use it, and even if we don't end up doing so it will be available to others for future testing.

view this post on Zulip Cezar (May 18 2018 at 12:58):

i submitted the patch https://sourceforge.net/p/brlcad/patches/486/. i used vector<bool> instead of bitset because bitset requires specifying the size at compile time.

view this post on Zulip Sean (May 19 2018 at 03:39):

as i've written in my dev log, i replaced the bu_ptbl iteration inside bool_eval with bu_bitv. now rt_boolfinal is slower because of bit vector initialisation, but bool_final went from 14 seconds to 12 seconds. a lot of time is still spent checking if the solid is inside the partition (as expected from previous graphs), but each check is faster.

how much slower? how did you measure?

view this post on Zulip Sean (May 19 2018 at 03:41):

i tried with std::bitset. it seems to be consistently 2 seconds faster on Debug, which is what i would expect :-?

What's it look like in Release? Debug is fairly insignificant for performance testing... lots of things we have intentionally run slower when compiled as Debug so that they will be accessible within a debugger.

view this post on Zulip Sean (May 19 2018 at 03:53):

I'd make sure it is consistently faster in Release mode before applying the change -- @Cezar your intuition about initializing repeatedly being more expensive than short list iterations could be true when optimized. it's worth being more certain across models on both ends of the spectrum.

view this post on Zulip Cezar (May 20 2018 at 11:10):

[...] it's worth being more certain across models on both ends of the spectrum.

i'm currently running bench/run.sh with my changes, on release. after that i'll run it with trunk and post the results

view this post on Zulip Cezar (May 20 2018 at 17:13):

i ran the benchmarks and they actually display a slowdown across all models. here is trunk and here are my changes. i ran a debug build inside a profiler, and the destructor for vector shows up in there, which isn't encouraging

view this post on Zulip Cezar (May 20 2018 at 17:19):

one mistake i think i'm making is that i'm initialising (and then destroying, too) an ap->a_rt_i->nsolids-long bit vector for each partition, and i could do that once outside the loop, and then just set the bits at each iteration.

view this post on Zulip Cezar (May 20 2018 at 17:36):

actually, i don't think there's any way around it. basically what i'm doing is introducing nsolids operations for each partition, which previously did not exist.

view this post on Zulip Cezar (May 20 2018 at 17:38):

i think a smarter approach is to store the results of the search in OP_SOLID in bool_eval. first time through the list, it would be just as expensive, but subsequent searches would become o(1)
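
The caching idea might be sketched like this (`Seg` and `SegLookup` are illustrative stand-ins, not librt types): the first lookup walks the list as before, but the result is remembered so repeating the same query is O(1):

```cpp
#include <vector>

struct Seg { const void *stp; };   // stand-in for struct seg / seg_stp

class SegLookup {
    const std::vector<Seg> &segs;
    const void *last_query = nullptr;
    const Seg *last_hit = nullptr;
public:
    explicit SegLookup(const std::vector<Seg> &s) : segs(s) {}

    const Seg *find(const void *stp)
    {
        if (stp == last_query)          // repeat query: answered in O(1)
            return last_hit;
        last_query = stp;
        last_hit = nullptr;
        for (const Seg &s : segs)       // first query: linear scan as before
            if (s.stp == stp) { last_hit = &s; break; }
        return last_hit;
    }
};
```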

view this post on Zulip Cezar (May 21 2018 at 21:40):

i've tried things with std::map and std::unordered_map these days, they seem to slow down raytracing significantly. i think this is because the containers need to be constructed/destructed; there are few elements in the bu_ptbl; accessing an element in a map is more expensive than in an array; and locality (maps are trees, unordered_maps use linked lists for buckets, elements in bu_ptbls are contiguous)

view this post on Zulip Cezar (May 21 2018 at 21:43):

another thing i found out is that there are a lot of partitions (> 100k). previously, i imagined there were ~100 partitions with huge trees

view this post on Zulip starseeker (May 22 2018 at 03:36):

@Cezar Is there any way you can re-use the map/unordered_map rather than creating/destroying it? (no idea if that makes sense or is practical in this context, just a thought...)

view this post on Zulip starseeker (May 22 2018 at 03:37):

That won't help the locality problem, of course...

view this post on Zulip Cezar (May 22 2018 at 11:19):

i ran it under a profiler and it looked like most of the time was spent accessing elements (inside operator[])

view this post on Zulip Cezar (May 22 2018 at 11:20):

in the case of map, i also saw the recursive calls made when walking the tree

view this post on Zulip Cezar (May 22 2018 at 15:42):

i tried using hashes when the number of segments in a partition is bigger than a threshold. performance degraded now as well, probably because of all the conditionals in addition to the previous logic

view this post on Zulip Cezar (May 22 2018 at 15:48):

i'm not sure how the "space" is split into partitions (if that's what those partitions mean), but it seems like the bu_ptbl iteration was written to work well with it. in ~30 M calls to bool_eval when tracing havoc, there are < 10 segments in a partition. so it seems to have lots of partitions with few segments, and in this case, i don't think whatever data structure i use will pay off, since at the very least, i'll have to initialise it for each partition (for example zeroing the array)

view this post on Zulip Cezar (May 22 2018 at 15:51):

i tried a simple array hash, with a dumb/fast function (address of soltab mod 67), and it was still 3 seconds slower
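
For reference, an "array hash" of the kind described (pointer value mod 67, with linear probing added here so collisions still resolve) might look like this illustrative sketch; it is not the actual patch code:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t NSLOTS = 67;

// Fixed-size open-addressing table keyed by the pointer value mod 67.
// Assumes fewer than NSLOTS entries are ever inserted.
struct PtrTable {
    std::array<const void *, NSLOTS> slot{};   // value-initialized to nullptr

    static std::size_t hash(const void *p)
    {
        return reinterpret_cast<std::uintptr_t>(p) % NSLOTS;
    }

    void insert(const void *p)
    {
        std::size_t i = hash(p);
        while (slot[i] && slot[i] != p)
            i = (i + 1) % NSLOTS;      // linear probing on collision
        slot[i] = p;
    }

    bool contains(const void *p) const
    {
        std::size_t i = hash(p);
        while (slot[i]) {
            if (slot[i] == p)
                return true;
            i = (i + 1) % NSLOTS;
        }
        return false;
    }
};
```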

view this post on Zulip Cezar (May 22 2018 at 15:54):

i tested using rt -B -H31 -P1 -o havoc.png havoc.g havoc, and got ~18 sec on trunk and ~21 sec with my changes

view this post on Zulip Cezar (May 23 2018 at 08:24):

so i think the next step would be to replace the bu_ptbl of struct soltabs altogether, instead of augmenting it. there is little work done when calling bool_eval once, and i think adding work there for bookkeeping will only make it slower no matter what

view this post on Zulip starseeker (May 23 2018 at 09:55):

Sounds reasonable

view this post on Zulip Cezar (May 23 2018 at 11:13):

by the way, i guess i should have asked this sooner, but do you have a (measurable) performance goal in mind? i'm currently running bench/run.sh and comparing rays/sec between runs to determine if it's improving, but i was wondering if there is a goal like "at least 5.5 M rays/sec when benchmarking moss"

view this post on Zulip Sean (May 27 2018 at 18:48):

we haven't quantified the goals, and a naive attempt would likely be uselessly arbitrary (e.g., 2x faster).

view this post on Zulip Sean (May 27 2018 at 18:52):

it would be good if we could figure out an objective metric, like having render targets for implicit, triangle, and nurbs/brep versions of an object that have sufficient quality that they converge to the same image (within some specified pixel deviation tolerance).

view this post on Zulip Sean (May 27 2018 at 18:53):

then we could at least talk about relative costs in a more meaningful manner. it might be the start of defining an OCH (similar to the SAR used by kd-trees, but not misnomered)

view this post on Zulip Sean (May 27 2018 at 18:54):

realistically, we know there's a solid order of magnitude possible through data coherency alone, simd should give a similar order of magnitude boost -- so something that is currently 1M rays/s might attain 100M rays/s

view this post on Zulip Cezar (May 27 2018 at 19:43):

so i've finished fixing my build errors and ran benchmarks using std::vector. they're better on all models, but not by much. however, i am now able to more easily test other stl containers. what i did was 1) change the struct bu_ptbl pt_seglist member of struct partition to std::vector<struct seg *> pt_seglist (and implemented the operations used in bool.cpp with the new structure); 2) split rt/ray_partition.h from raytrace.h and included it separately where it's needed; and 3) converted the files where the header is needed to C++ (with C linkage). what this means though is that you can no longer iterate the segments list from C, and i've also had to change a lot of files. the performance improvements so far are maybe a bit underwhelming, so i'm wondering if the trade-offs are worth it

view this post on Zulip Cezar (May 27 2018 at 19:43):

gsoc/bench % cat summary-trunk-*
Abs  rmbp 4041636.94    1714565.15  1734933.19  1348127.28  1830646.00  2287836.14  2159624.11  Thu May 24 01:01:31 EEST 2018
*vgr rmbp 29498.84  25567.62    30942.27    25264.75    25896.81    154.35  22887.44
Abs  rmbp 4168024.50    1761264.50  1715300.76  1403866.97  1816805.40  2308539.52  2195633.60  Thu May 24 01:17:52 EEST 2018
*vgr rmbp 30421.31  26264.00    30592.13    26309.35    25701.02    155.75  23240.59
Abs  rmbp 4191027.14    1735223.93  1774402.02  1391472.34  1786512.82  2302579.43  2196869.61  Thu May 24 01:37:09 EEST 2018
*vgr rmbp 30589.20  25875.69    31646.19    26077.06    25272.49    155.35  23269.33
Abs  rmbp 4194157.90    1708366.28  1768637.50  1422026.54  1912957.40  2312326.89  2219745.41  Thu May 24 01:56:24 EEST 2018
*vgr rmbp 30612.05  25475.19    31543.38    26649.67    27061.21    156.01  23582.91
Abs  rmbp 4252770.44    1786723.23  1836428.96  1419499.54  1919978.36  2304439.92  2253306.74  Thu May 24 02:14:48 EEST 2018
*vgr rmbp 31039.85  26643.65    32752.43    26602.31    27160.53    155.47  24059.04
gsoc/bench % cat summary-vector-*
Abs  rmbp 4684937.42    2031225.42  2025885.21  1557917.44  2090030.01  2569718.32  2493285.63  Sun May 27 21:15:55 EEST 2018
*vgr rmbp 34194.12  30289.67    36131.35    29196.35    29566.13    173.37  26591.83
Abs  rmbp 4638519.22    1932765.21  1975258.83  1554311.46  2033613.46  2507226.39  2440282.42  Sun May 27 21:33:41 EEST 2018
*vgr rmbp 33855.33  28821.43    35228.44    29128.77    28768.05    169.16  25995.19
Abs  rmbp 4605819.18    1986518.42  2022192.86  1557950.64  2083347.79  2540601.93  2466071.80  Sun May 27 21:49:56 EEST 2018
*vgr rmbp 33616.66  29623.00    36065.50    29196.97    29471.60    171.41  26357.52
Abs  rmbp 4681758.44    2003582.74  2061324.01  1560023.33  2034590.59  2532082.09  2478893.53  Sun May 27 22:05:59 EEST 2018
*vgr rmbp 34170.92  29877.46    36763.40    29235.81    28781.87    170.83  26500.04
Abs  rmbp 4581822.58    2019161.52  2035468.56  1575925.77  2070459.65  2589455.33  2478715.56  Sun May 27 22:22:35 EEST 2018
*vgr rmbp 33441.51  30109.77    36302.27    29533.84    29289.28    174.70  26475.22

view this post on Zulip Cezar (May 27 2018 at 19:43):

here are the numbers, hopefully the formatting is fine

view this post on Zulip Cezar (May 27 2018 at 19:48):

[...] like having a render targets for an implicit object, triangle, and nurbs/brep version that have sufficient quality that they converge to the same image (within some specified pixel deviation tolerance).

hmm... i'm not familiar with what a render target and an implicit object are

view this post on Zulip Sean (May 27 2018 at 19:49):

nice... that's about a 10% improvement...

view this post on Zulip Cezar (May 27 2018 at 19:50):

[...] it might be the start of defining on OCH (similar to SAR used by kd-tree, but not misnomered)

i'm not familiar with those terms either :D i found this (http://pl887.pairlitesite.com/papers/rt06-fast-kd/fast-kd-construction-RT06.pdf) searching for "SAR kd-tree", but i'll have to read up

view this post on Zulip Sean (May 27 2018 at 19:50):

considering ptbl's were only amounting to about 20-30% of the overall time, you cut it way down

view this post on Zulip Sean (May 27 2018 at 19:50):

sar is the surface area heuristic (more commonly written SAH)

view this post on Zulip Sean (May 27 2018 at 19:51):

it's a way to split up geometry (whether with a kd-tree or bsp or other spatial partition method) where you want to try and balance the amount of work on both sides of a split so they're roughly even

view this post on Zulip Sean (May 27 2018 at 19:52):

if everything is a triangle, then surface area tends to be a reasonable metric for that ... the amount of geometry surface over here equals the amount over there, so this is a good split point

view this post on Zulip Sean (May 27 2018 at 19:52):

when you don't have triangles, SAH is just stupid and the wrong term

view this post on Zulip Sean (May 27 2018 at 19:53):

the generalization is some form of an object complexity or object cost metric / heuristic

view this post on Zulip Cezar (May 27 2018 at 20:02):

ok, i think i understand what that means. i'm also curious what a render target is. something like "a target number of operations", or something else? and regarding data coherency, is that cache coherency, or something else/more?

view this post on Zulip Sean (May 27 2018 at 20:56):

in what context? a render target is typically an image being made or a particular type of image

view this post on Zulip Cezar (May 27 2018 at 20:57):

oh, you said previously "it would be good if we could figure out an objective metric, like having a render targets for an implicit object"

view this post on Zulip Sean (May 27 2018 at 20:58):

data coherence does refer to cache coherency or more generally to coherency in memory and across memory boundaries -- for example, if something is on disk, you obviously want to only read from disk once and do all your processing on it (as opposed to reading it from disk repeatedly)

view this post on Zulip Sean (May 27 2018 at 20:59):

same thing applies for things read into main memory that end up on the cpu or gpu memory, like if on the cpu, keep things in l3 as long as possible, in l2 as long as possible, etc

view this post on Zulip Cezar (May 27 2018 at 21:01):

oh, i see. is reading repeatedly from disk a problem that exists right now in the codebase, or was it a hypothetical example?

view this post on Zulip Sean (May 27 2018 at 21:01):

yeah, previous comment about render target was to have a desired image (e.g., a 1024x1024 rendering of a sphere) that you would get when you run (for example) rt on a sph object, or on a sph.bot object or a sph.brep object -- the "target" aka desired image would be the same for all three

view this post on Zulip Sean (May 27 2018 at 21:01):

hypothetical example

view this post on Zulip Sean (May 27 2018 at 21:03):

typically, coherency is disk < net < main memory < l3 cache < l2 cache < l1 cache || gpu cache

view this post on Zulip Cezar (May 27 2018 at 21:05):

i'm aware, although i refer to it as memory hierarchy

view this post on Zulip Sean (May 27 2018 at 21:11):

yes! it is also memory hierarchy, coherency is simply a data-centric view (as opposed to a memory architecture perspective) where you keep related memory items together, so you can do what you need on the data without reaching down the hierarchy for more data (thus incurring a stall)

view this post on Zulip Sean (May 27 2018 at 21:23):

here's an interesting gist on someone rewriting a simple algorithm with l1/l2 in mind: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0

view this post on Zulip starseeker (May 28 2018 at 14:50):

@Cezar was the winner std::vector then?

view this post on Zulip Cezar (May 28 2018 at 15:59):

i didn't get to test the others. i'll start today

view this post on Zulip Cezar (May 28 2018 at 17:22):

so while modifying the code to use std::set, i noticed that i can't actually remove the iteration in bool_eval because in there, it's searching for a soltab, while the container holds segs, so i have to access seg_stp. i also can't change it to hold soltabs because somewhere else in the code, it requires a seg's other members and there's no pointer from a soltab to a seg

view this post on Zulip Cezar (May 28 2018 at 20:04):

i tried unordered_set, but stopped quickly. after a few seconds, it was still not done with the first item in the benchmark (which takes 0.06 sec with bu_ptbl). i profiled it and most (all?) of the time is spent in the container's rehash method, so i don't think there's much to do here

view this post on Zulip Cezar (May 28 2018 at 20:28):

also related to my previous message, i can't use bit vectors either. and since maps are similar to sets (buckets and rehashing), i don't think they'll yield good results

view this post on Zulip starseeker (May 29 2018 at 00:18):

@Cezar Out of curiosity, what do you think about maybe trying a C implementation of red black trees? We've got one in libbu, and if it has problems there are a few others out there we could replace it with.

view this post on Zulip Cezar (May 29 2018 at 08:05):

i think std::set and std::map use rb trees. i didn't get to test them yesterday because i kept getting seg faults and postponed fixing them, but i'll get to it today. after that, i'll try the libbu implementation as well. also, i think i can test for membership in O(log N) in bool_eval with std::map

view this post on Zulip Cezar (May 29 2018 at 10:13):

i got set to work, but it's worse than trunk (moss, frame 8, 3.3 M rays/sec vs 4.5 and 5.2). i'll try map next, and then the rb tree in libbu

view this post on Zulip Cezar (May 29 2018 at 12:22):

tried map, it's still slow. profiled it, spends 9 seconds (out of 1.25 minutes on Debug) in bool_eval finding the soltab with .count(). i think it's because of locality + expensive operations. it's log N, but when searching for < 5 elements, the constant factors matter

view this post on Zulip Cezar (May 29 2018 at 12:28):

i'll try libbu's implementation as well. as for other implementations, i remember the linux kernel having one. maybe there are a few tricks there. there's also the matter of licensing :-?

view this post on Zulip Cezar (May 29 2018 at 14:43):

although if i’m trying other implementations, i can try different packages for hashes, too

view this post on Zulip Sean (May 30 2018 at 05:21):

Shouldn’t need to worry about licensing. Typically the only problems are with gpl and academic codes that have a discriminatory clause like cc-by-nc making them not Open Source.

view this post on Zulip Sean (May 30 2018 at 05:27):

I think you are seeing a clear pattern that init constant is dominant on the O(logN) methods so either need a diff init or change the algorithm...

view this post on Zulip Sean (May 30 2018 at 05:28):

Can you briefly summarize everything you’ve tried and the timings into a table? If the data is handy....

view this post on Zulip Cezar (May 30 2018 at 08:29):

Can you briefly summarize everything you’ve tried and the timings into a table? If the data is handy....

That was my initial plan, but after seeing the poor performance of the other containers, I stopped the benchmarking suite before it completed. For example, using map, world went from ~2.2 M to ~1.2 M, so I figured it wasn't worth running further.

view this post on Zulip Cezar (May 30 2018 at 08:32):

But I'm curious why vector is faster than bu_ptbl. I want to try adjusting the initial capacity and see if this improves performance.

view this post on Zulip Cezar (May 30 2018 at 11:14):

i set bu_ptbl's initial capacity to 16 (down from 64) and the numbers are actually close to vector. it still has a slight edge, but the upside is that there's no need to switch to C++

view this post on Zulip Cezar (May 30 2018 at 11:14):

gsoc/bench % cat summary-trunk-*
Abs  rmbp 4041636.94    1714565.15  1734933.19  1348127.28  1830646.00  2287836.14  2159624.11  Thu May 24 01:01:31 EEST 2018
*vgr rmbp 29498.84  25567.62    30942.27    25264.75    25896.81    154.35  22887.44
Abs  rmbp 4168024.50    1761264.50  1715300.76  1403866.97  1816805.40  2308539.52  2195633.60  Thu May 24 01:17:52 EEST 2018
*vgr rmbp 30421.31  26264.00    30592.13    26309.35    25701.02    155.75  23240.59
Abs  rmbp 4191027.14    1735223.93  1774402.02  1391472.34  1786512.82  2302579.43  2196869.61  Thu May 24 01:37:09 EEST 2018
*vgr rmbp 30589.20  25875.69    31646.19    26077.06    25272.49    155.35  23269.33
Abs  rmbp 4194157.90    1708366.28  1768637.50  1422026.54  1912957.40  2312326.89  2219745.41  Thu May 24 01:56:24 EEST 2018
*vgr rmbp 30612.05  25475.19    31543.38    26649.67    27061.21    156.01  23582.91
Abs  rmbp 4252770.44    1786723.23  1836428.96  1419499.54  1919978.36  2304439.92  2253306.74  Thu May 24 02:14:48 EEST 2018
*vgr rmbp 31039.85  26643.65    32752.43    26602.31    27160.53    155.47  24059.04
gsoc/bench % cat summary-ptbl16-*
Abs  rmbp 4680155.95    1927090.59  1999106.16  1550430.99  2071242.81  2599398.72  2471237.53  Wed May 30 12:53:31 EEST 2018
*vgr rmbp 34159.22  28736.81    35653.75    29056.05    29300.36    175.38  26180.26
Abs  rmbp 4662057.95    1932794.89  1932672.45  1539780.22  2089337.31  2568054.50  2454116.22  Wed May 30 13:11:15 EEST 2018
*vgr rmbp 34027.13  28821.87    34468.92    28856.45    29556.33    173.26  25983.99
Abs  rmbp 4483025.56    1959477.52  1968239.40  1534994.64  2065281.19  2589953.44  2433495.29  Wed May 30 13:29:25 EEST 2018
*vgr rmbp 32720.42  29219.76    35103.25    28766.76    29216.03    174.74  25866.82
Abs  rmbp 4582552.38    1887762.28  2002007.33  1554848.10  2082253.88  2560400.25  2444970.70  Wed May 30 13:46:10 EEST 2018
*vgr rmbp 33446.84  28150.34    35705.49    29138.83    29456.13    172.74  26011.72
Abs  rmbp 4602999.06    1911004.39  1992763.36  1525482.72  2079252.77  2585297.14  2449466.57  Wed May 30 14:02:18 EEST 2018
*vgr rmbp 33596.08  28496.93    35540.63    28588.50    29413.67    174.42  25968.37
gsoc/bench % cat summary-vector-*
Abs  rmbp 4684937.42    2031225.42  2025885.21  1557917.44  2090030.01  2569718.32  2493285.63  Sun May 27 21:15:55 EEST 2018
*vgr rmbp 34194.12  30289.67    36131.35    29196.35    29566.13    173.37  26591.83
Abs  rmbp 4638519.22    1932765.21  1975258.83  1554311.46  2033613.46  2507226.39  2440282.42  Sun May 27 21:33:41 EEST 2018
*vgr rmbp 33855.33  28821.43    35228.44    29128.77    28768.05    169.16  25995.19
Abs  rmbp 4605819.18    1986518.42  2022192.86  1557950.64  2083347.79  2540601.93  2466071.80  Sun May 27 21:49:56 EEST 2018
*vgr rmbp 33616.66  29623.00    36065.50    29196.97    29471.60    171.41  26357.52
Abs  rmbp 4681758.44    2003582.74  2061324.01  1560023.33  2034590.59  2532082.09  2478893.53  Sun May 27 22:05:59 EEST 2018
*vgr rmbp 34170.92  29877.46    36763.40    29235.81    28781.87    170.83  26500.04
Abs  rmbp 4581822.58    2019161.52  2035468.56  1575925.77  2070459.65  2589455.33  2478715.56  Sun May 27 22:22:35 EEST 2018
*vgr rmbp 33441.51  30109.77    36302.27    29533.84    29289.28    174.70  26475.22

view this post on Zulip Cezar (May 30 2018 at 12:14):

is there any way to run the benchmark on windows?

view this post on Zulip Cezar (May 30 2018 at 21:15):

i plotted the data here and dumped the results here

view this post on Zulip Cezar (Jun 01 2018 at 18:53):

i tried running with the rb tree implementation in libbu. it's super slow (tens of seconds where trunk takes 0.06) and that's because of acquiring and releasing semaphores. running with -P 1 makes it faster. it could be an implementation problem, maybe locking isn't required. i still think it has worse locality and initialisation, which should matter for small lists. but if you want me to try other implementations, i will do so. if you want to see my code (maybe i'm doing it wrong), i'll submit a patch

view this post on Zulip Cezar (Jun 01 2018 at 19:16):

i've spammed some patches here

view this post on Zulip Cezar (Jun 01 2018 at 22:21):

libbu semaphores are used because of MALLOC_NOT_MP_SAFE, which i think is not needed nowadays (?). i undefined it and performance increased significantly, but it's still ~3x slower than bu_ptbl. insertions (due to allocating new nodes) and the OP_SOLID case in bool_eval are the hotspots. i could try to keep a pool of nodes and allocate from that

view this post on Zulip starseeker (Jun 02 2018 at 01:11):

@Cezar I'd be curious about https://github.com/jstimpfle/sil/tree/master/rb3ptr but I have no idea how much work that would be, or even if it makes sense for this case...

view this post on Zulip starseeker (Jun 02 2018 at 01:13):

the "memory allocation done by client" bit intrigues me, since your testing is indicating initialization costs are an issue... makes me wonder if we could pre-allocate...

view this post on Zulip starseeker (Jun 02 2018 at 01:15):

I'm not surprised our redblack tree in libbu has issues - to the best of my knowledge its not been stressed much

view this post on Zulip starseeker (Jun 02 2018 at 01:28):

@Cezar might also try a few other containers like a splay tree and see if they do anything interesting...

view this post on Zulip starseeker (Jun 02 2018 at 01:30):

@Cezar any sense of whether something like a memory pool might be helpful?

view this post on Zulip Cezar (Jun 02 2018 at 06:36):

@Cezar I'd be curious about https://github.com/jstimpfle/sil/tree/master/rb3ptr but I have no idea how much work that would be, or even if it makes sense for this case...

i could try it, but looking at the readme, i see that it's "Not thread safe". this seems like a big problem :-?

view this post on Zulip Cezar (Jun 02 2018 at 06:39):

@Cezar any sense of whether something like a memory pool might be helpful?

since most of the time is spent allocating nodes during insertions, i think a memory pool would be helpful. i was considering it, but my thinking is that i would still need a list to manage the pool, i'd still iterate over it, and it would still be bigger than the current lists (of segments). so it might harm performance in the end

view this post on Zulip starseeker (Jun 02 2018 at 20:55):

Hmm. Is bu_ptbl insertion currently thread safe?

view this post on Zulip Cezar (Jun 04 2018 at 08:15):

i think so, since it doesn't hold any global state, and the way it's used is contained to a single thread. so if thread A creates a bu_ptbl, only thread A will read/modify it

view this post on Zulip Cezar (Jun 04 2018 at 08:18):

but rb3ptr looks to be the same, i hadn't thought about it before pasting the "not thread safe" comment

view this post on Zulip Sean (Jun 05 2018 at 04:05):

i set bu_ptbl's initial capacity to 16 (down from 64) and the numbers are actually close to vector. it still has a slight edge, but the upside is that there's no need to switch to C++

On the surface, that was a shocking discovery but in retrospect it makes complete sense if static init is dominating the timing. Did you test any other sizes like 8 or 0 or 24?

view this post on Zulip Sean (Jun 05 2018 at 04:14):

I think you've just about killed all the ptbl replacement options. you found a way to make it considerably faster, great -- I suggest moving on to the next profile hot point. if ptbl is still high on the profile for specific cases, it might make sense to use a hybrid strategy (like plain array for first 8 entries, then some logN method for additional).

view this post on Zulip Sean (Jun 05 2018 at 04:16):

otherwise, diminishing returns ... lots of other room for improvement ;)

view this post on Zulip Cezar (Jun 05 2018 at 07:08):

On the surface, that was a shocking discovery but in retrospect it makes complete sense if static init is dominating the timing. Did you test any other sizes like 8 or 0 or 24?

I tried 8 and that's better than trunk as well, but worse than 16. I think 16 is a sweet spot because there are usually between 8 and 16 segments in a partition, so no resizing is needed and no time is wasted zeroing unused elements

view this post on Zulip Cezar (Jun 05 2018 at 07:12):

I think you've just about killed all the ptbl replacement options. you found a way to make it considerably faster, great -- I suggest moving on to the next profile hot point. if ptbl is still high on the profile for specific cases, it might make sense to use a hybrid strategy (like plain array for first 8 entries, then some logN method for additional).

you previously mentioned data coherency, bu_list pointer aliasing and simd offering big performance improvements. i would like to look at those next, but i'm not sure where to begin

view this post on Zulip Cezar (Jun 05 2018 at 07:13):

also, i asked if there is a way to run the benchmark suite on windows (i don't see any). if there isn't, would it be worth it if i ported the run.sh script to python?

view this post on Zulip Cezar (Jun 06 2018 at 19:41):

i was looking at vectorization, and what i want to try next is to compile with clang's vectorization reports, see what couldn't be vectorized and why, and try to rewrite the code. does this sound good?

view this post on Zulip Cezar (Jun 06 2018 at 19:47):

also, while going through cmakelists, i noticed some SSE flags commented out due to invalid instructions. is there more context around it?

view this post on Zulip Sean (Jun 07 2018 at 04:07):

also, i asked if there is a way to run the benchmark suite on windows (i don't see any). if there isn't, would it be worth it if i ported the run.sh script to python?

there's not, but not worth porting to python at this time -- there are other plans for it already in the works

view this post on Zulip Sean (Jun 07 2018 at 04:17):

you previously mentioned data coherency, bu_list pointer aliasing and simd offering big performance improvements. i would like to look at those next, but i'm not sure where to begin

I suggest finding something relatively simple and isolated. see what it takes to change it. the most egregious cases are functions that take/pass bu_list as a function parameter. looking quickly at public headers, wdb.h stands out as just having a couple and it's a fairly isolated library -- maybe start there as a test.

view this post on Zulip Sean (Jun 07 2018 at 05:22):

also, while going through cmakelists, i noticed some SSE flags commented out due to invalid instructions. is there more context around it?

It's hard to test for SSE without causing a runtime crash. We need both build-time and (if not more importantly) runtime sse detection that doesn't crash when you don't have sse.

view this post on Zulip Cezar (Jun 07 2018 at 06:14):

there's not, but not worth porting to python at this time -- there are other plans for it already in the works

oh, ok :D is there anything i can read on the plans? the reason i wanted to test on windows is that it is the most downloaded version on sf, and i figured it is important to benchmark there as well

view this post on Zulip Cezar (Jun 07 2018 at 06:16):

It's hard to test for SSE without causing a runtime crash. We need both build-time and (if not more importantly) runtime sse detection that doesn't crash when you don't have sse.

i think cpuid would be enough here, at least on intel/amd

view this post on Zulip Cezar (Jun 07 2018 at 06:16):

I suggest finding something relatively simple and isolated. see what it takes to change it. the most egregious cases are functions that take/pass bu_list as a function parameter. looking quickly at public headers, wdb.h stands out as just having a couple and it's a fairly isolated library -- maybe start there as a test.

ok, i'll do this today

view this post on Zulip Cezar (Jun 07 2018 at 12:06):

i want to see if i get this right. in int mk_pipe(struct rt_wdb *fp, const char *name, struct bu_list *headp), is the reason for pointer aliasing that struct rt_wdb's first member is a struct bu_list, and the compiler can't be sure that fp and headp point to disjoint memory?

view this post on Zulip Cezar (Jun 07 2018 at 12:07):

i'm using this as the definition of pointer aliasing

view this post on Zulip Cezar (Jun 07 2018 at 12:08):

i thought that if the parameters have different types, they're assumed to not be aliases of one another, so i guess that's why i don't understand why aliasing takes place in the example i gave

view this post on Zulip Sean (Jun 07 2018 at 13:07):

oh, ok :D is there anything i can read on the plans? the reason i wanted to test on windows is that it is the most downloaded version on sf, and figured it is important to benchmark there as well

There's nothing to read up on, nothing fancy. It's just being reworked in another way and being used as a testing framework for another performance code (i.e., being slightly generalized). I can let you know if there's bits you can tackle since it is performance related, but don't want to spread you out too thin!

view this post on Zulip Sean (Jun 07 2018 at 13:12):

i think cpuid would be enough here, at least on intel/amd

it should be and that's essentially what we have now (see src/libbu/simd.c) but then there are outliers doing different things (e.g., include/bn/dvec.h)

view this post on Zulip Sean (Jun 07 2018 at 13:33):

i want to see if i get this right. in int mk_pipe(struct rt_wdb *fp, const char *name, struct bu_list *headp), is the reason for pointer aliasing that struct rt_wdb's first member is a struct bu_list, and the compiler can't be sure that fp and headp point to disjoint memory?

So this is a little complicated to explain because the aliasing typically happens in non-obvious and unlocalized places. There's no problem with fp in that function signature. There's technically no problem (yet) with headp either, except that it's a generic bu_list -- meaning "it's a list of something".

If you look in src/libwdb/pipe.c, you'll find that function and sure enough mk_pipe() doesn't really do anything other than add that headp bu_list to another bu_list (i.e., pipe_segs_head). But if we look at the places pipe_segs_head is used, we start to find aliasing in action that screws with performance. Consider the destructor function for example: mk_pipe_free()

In there, it's iterating over a generic "list of something" headp, but casting each list element to a struct wdb_pipept, unlinking it from the list (i.e., dequeue wp->l), and releasing the memory. That only worked because wdb_pipept has a bu_list as the first byte in its structure.

With strict pointer aliasing, the compiler could know that the next element in the bu_list is += sizeof(struct bu_list) bytes ahead... but it's not in our case, because it's really += sizeof(struct wdb_pipept) bytes ahead. And if the compiler assumes wrongly, trying to optimize aggressively, it will end up prefetching the wrong bytes, and it'll typically crash eventually. When we turn strict aliasing off, it tells the compiler not to do any lookahead optimization and to use the manual cast sizes, which switch from one thing to another depending on which function we're in.

Make sense?? :)

view this post on Zulip Sean (Jun 07 2018 at 13:38):

So the issue really is bu_list's entire existence because it was implemented specifically to take advantage of non-strict aliasing, for code simplicity, brevity. The "fix" requires re-evaluating the container being used, and figuring out a different container -- like A) passing a struct wdb_pipept with traditional forw/next pointers in it (so that structure is a list), B) passing a plain NULL-terminated array of wdb_pipept structures, C) using some other generic C container (akin to using a std::vector<struct wdb_pipept>), D) ...

view this post on Zulip Cezar (Jun 07 2018 at 13:42):

yep, it makes sense now. what i was missing was the part about prefetching the wrong bytes

view this post on Zulip Sean (Jun 07 2018 at 13:55):

Great!

view this post on Zulip Sean (Jun 07 2018 at 14:03):

For what it’s worth, the answer should be to convert a bu_list to either an array of structs or a struct with array(s) in it. If you read up and see terms like AoS or SoA or AoSoA, that’s what the typical performance pattern is. Linked lists should basically never be used any more except underpinning more complex containers like std::map

view this post on Zulip Cezar (Jun 07 2018 at 14:09):

because of locality, right?

view this post on Zulip Cezar (Jun 07 2018 at 14:33):

oh, i just looked up SoA and AoS, i've done similar things when implementing graphs, but didn't know they had a name, or their performance implications :D

view this post on Zulip Sean (Jun 07 2018 at 14:49):

Yes, memory locality dominates. When a page of memory is loaded and you access a structure, it’s very likely that the next structure you’ll need is on that same page of memory, so you avoid having to refetch if it’s an array. With linked list pointers, it’s possible for every structure to live on different pages and always or frequently incur a page fault.

view this post on Zulip Cezar (Jun 08 2018 at 17:24):

i got two questions :D

1) i profiled rt(1) again, and i can't find hotspots involving bu_list. i understand what the problem is conceptually, but is there anything i can do to convince myself that it is a problem in practice? am i missing something in the rt profile, or do i have to profile something else?

2) if i do have to replace bu_list with an array-based structure, i'll have to implement (or find something already done — i see that bu/list.h mentions some candidates for replacement) something similar to bu_ptbl, but which holds elements of arbitrary size, and change the entire code base to use this new structure. or are there specific libraries i should focus on? (but if there are, i suppose this ties into my first question — how do i identify them?)

view this post on Zulip starseeker (Jun 09 2018 at 02:11):

@Cezar did you run rt with -F/dev/null to avoid overhead generated by image writing?

view this post on Zulip starseeker (Jun 09 2018 at 02:15):

@Cezar I don't know if you'll have to completely replace bu_list everywhere, but any substantial change to bu_list's use in librt I would expect to touch a lot of code. My advice would be to set up "parallel" implementation files and API calls which replicate functionality using the new replacement for bu_list. That way, you can instrument things up to do an apples to apples comparison on output accuracy in the old code vs. the rewritten code. And if (as I suspect) we would have to leave the old code in place while deprecating it in favor of a de-bu_list-ified API, that would position the code well to handle/manage that process.

view this post on Zulip starseeker (Jun 09 2018 at 02:19):

@Cezar an unrelated point, but something I am curious about - are there any places in the code where our use of bu_ptbl_ins_uniq show up as any sort of performance bottleneck? The uniqueness guarantee of bu_ptbl_ins_uniq I would expect to get expensive quickly as table sizes grow... Or, if it doesn't immediately show up in our code, could you try setting up some sort of unit test to get a sense of when bu_ptbl_ins_uniq starts to get problematic performance wise for both random and worst-case uniqueness searching?

view this post on Zulip starseeker (Jun 09 2018 at 02:22):

It may not matter greatly for raytracing (based on your work to date) but since bu_ptbl seems to perform well enough to keep around I'd like to at least have some sense of what we might do (and when we might need/want to do it) to avoid very high performance penalties for uniq insertion into large bu_ptbl arrays, if it's a test you can do easily and quickly...

view this post on Zulip starseeker (Jun 09 2018 at 02:24):

e.g. could we "switch" the backend container to a red-black tree implementation or something similar when the bu_ptbl size gets past some size threshold, and keep the current fast behavior at smaller array sizes when the uniq test is cheap enough to not be a concern?

view this post on Zulip Cezar (Jun 09 2018 at 07:28):

@Cezar did you run rt with -F/dev/null to avoid overhead generated by image writing?

no, i was doing -o file.png. i was looking for something like -F when profiling, but i missed it when skimming through the man pages

view this post on Zulip Cezar (Jun 09 2018 at 07:29):

@Cezar I don't know if you'll have to completely replace bu_list everywhere, but any substantial change to bu_list's use in librt I would expect to touch a lot of code. My advice would be to set up "parallel" implementation files and API calls which replicate functionality using the new replacement for bu_list. That way, you can instrument things up to do an apples to apples comparison on output accuracy in the old code vs. the rewritten code. And if (as I suspect) we would have to leave the old code in place while deprecating it in favor of a de-bu_list-ified API, that would position the code well to handle/manage that process.

sounds good

view this post on Zulip Cezar (Jun 09 2018 at 07:35):

@Cezar an unrelated point, but something I am curious about - are there any places in the code where our use of bu_ptbl_ins_uniq show up as any sort of performance bottleneck? The uniqueness guarantee of bu_ptbl_ins_uniq I would expect to get expensive quickly as table sizes grow... Or, if it doesn't immediately show up in our code, could you try setting up some sort of unit test to get a sense of when bu_ptbl_ins_uniq starts to get problematic performance wise for both random and worst-case uniqueness searching?

bu_ptbl_ins/cat_uniq do show up when profiling, but i can't tell if it's a bottleneck. they show up in rt_boolfinal when iterating over the ~150k partitions, so i think it's because they're called a lot of times, not because the calls themselves are expensive. i drew some graphs a few weeks back showing that the number of segments in the list is <= 12

view this post on Zulip Cezar (Jun 09 2018 at 07:42):

It may not matter greatly for raytracing (based on your work to date) but since bu_ptbl seems to perform well enough to keep around I'd like to at least have some sense of what we might do (and when we might need/want to do it) to avoid very high performance penalties for uniq insertion into large bu_ptbl arrays, if it's a test you can do easily and quickly...

that should be easy to test, but i'm wondering if it's not "premature optimisation", since large bu_ptbls don't seem to show up anywhere. theoretically they can get huge, and i think even for ~10k elements a rb tree would be faster in the worst case, but if that's never exhibited, what's the point? also, i'm not familiar with the particulars of the raytracer, but i think it's splitting the space into lots (~150k) of small (~16 segments) partitions, which works well with the bu_ptbl implementation. or maybe it's a coincidence i've noticed

view this post on Zulip starseeker (Jun 09 2018 at 12:29):

@Cezar fair enough. libged's search command uses bu_ptbl, and that was actually the use case I had in mind where bu_ptbl_ins_uniq might end up expensive, but you're correct that that's not the focus for this project.

view this post on Zulip starseeker (Jun 09 2018 at 12:31):

I can always throw it together later if needed, and if you're going to tackle the bu_list question that's going to demand full attention ;-)

view this post on Zulip Cezar (Jun 13 2018 at 14:07):

i started working on moving struct region to SoA. i should look at the code and see how the structure is used and see if AoSoA is worth trying as well (for example, i've seen loops where only the reg_name member is used, but there are also places where two or more members are used)

view this post on Zulip Cezar (Jun 13 2018 at 14:08):

i'm also thinking of using xxhash when inserting/updating an element for fast equality check (is this region equal to this other region?). but this should make working with these structs a bit more complicated

view this post on Zulip Cezar (Jun 13 2018 at 14:10):

a problem i'm currently having is that when inserting/updating, i have to update a lot of arrays (15 in this case). this means at least 30 lines of code (15 if the array needs to be realloc'd and 15 for updating the data), and i'm looking for ways of making this shorter. i was thinking of making the struct a void * array of arrays, where each outer row is a member. then i need to #define the names of the members, like #define reg_name el[0] (like tree.h does, i think). but then i think that when using the structure, the members will have to be cast to their actual types. this seems cumbersome. if i can't find a proper solution, i'll just write the code to update each element (maybe generate it with python, outside trunk)

view this post on Zulip Cezar (Jun 13 2018 at 14:15):

i also still don't know how to measure the impact the data organisation has on the cache. using PMCs, i've determined there are ~1.7x as many "no execute" cycles (waiting for memory) as execute cycles. but i don't know how to measure the impact of each line of code (i think it should be possible). i think this is important because there is a lot of code to change when moving to SoA, and i would prefer to have a way to measure this at the onset

view this post on Zulip Cezar (Jun 26 2018 at 03:35):

@starseeker @Sean any thoughts on the above? :D i also ran cachegrind on rt inside a debian VM and created a table with the results, sorted by L1d misses

view this post on Zulip Cezar (Jun 26 2018 at 03:37):

the tree structure is responsible for quite a lot of those misses, so i think it's worth looking into? i remember looking at it a few weeks back and it seemed like the addresses of the nodes were close together in memory

view this post on Zulip Sean (Jun 26 2018 at 22:52):

a problem i'm currently having is that when inserting/updating, i have to update a lot of arrays (15 in this case). this means at least 30 lines of code (15 if the array needs to be realloc'd and 15 for updating the data), and i'm looking for ways of making this shorter.

Why? Lines of code isn't really a relevant measure.

i was thinking of making the struct a void * array of arrays

If you're type punning, automatic vectorization will not be possible. Anything getting converted to arrays needs to be non-void, non-pointer arrays.

then i need to #define the names of the members like #define reg_name el[0] (like tree.h does, i think).

Existing #defines are indirections that are no longer best practice and should be avoided.

In general, I think you jumped on one that is WAY too complicated with struct region... there are tons of considerations, problems, and really hard complexities with most of the core structures. that's why the original suggestion was to simply pick a bu_list somewhere/anywhere, and convert it to an array (or some other container). experience with doing a few of those will greatly help solve the harder problems down the road (and much will be learned along the way). Have you looked at any of the lists?

view this post on Zulip Sean (Jun 26 2018 at 22:57):

@starseeker @Sean any thoughts on the above? :D i also ran cachegrind on rt inside a debian VM and created a table with the results, sorted by L1d misses

These results are amplified by incoherent pointer chasing. Even doing something as simple as shooting rays in bundles should have a major impact on L1/L2 hits.

Note from an optimization strategy perspective, you want to first look at main memory access stalls, then L3 stalls, then L2, then L1. If you jump straight to L1, you'll end up injecting complexity that will likely get thrown away (completely) when you then look at other stalls (that are orders of magnitude bigger).

view this post on Zulip Cezar (Jun 28 2018 at 11:41):

ok, i see what you mean. i've found struct temp_file_list inside libbu/temp.c, that looks self-contained. i'll start working on that one

view this post on Zulip Erik (Feb 29 2020 at 20:18):

is this any good?

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
Benchmark results indicate an approximate VGR performance metric of 134454
Logarithmic VGR metric is 5.13  (natural logarithm is 11.81)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

view this post on Zulip Sean (Feb 29 2020 at 23:42):

woah. yeah, that's good.

view this post on Zulip Sean (Feb 29 2020 at 23:42):

24 core machine?

view this post on Zulip Erik (Feb 29 2020 at 23:50):

40 xeon 4114's @ 2.2, 92g ram, just a ti2080 right now... office workstation...

view this post on Zulip Sean (Feb 29 2020 at 23:51):

40? nice! might be able to get a higher number then, but that's still pretty outstanding.

view this post on Zulip Erik (Feb 29 2020 at 23:53):

I'm sure, I only did =Release, probably lots of other fun flags and minor tweaks to do... mebbe I'll hit it with 'vtune' to learn vtune :D

view this post on Zulip Sean (Mar 01 2020 at 07:33):

used vtune for the first time a couple years ago, it's nice and intuitive, bit of a learning curve.

view this post on Zulip Sean (Mar 01 2020 at 07:34):

perf works great in a pinch though, super easy and just as good metrics (just without the nice gui)

view this post on Zulip Erik (May 16 2020 at 12:05):

Abs **** 3402463.46 1380366.82 1359886.60 1030002.51 1221374.29 1546660.32 1656792.33 Fri May 15 22:28:48 EDT 2020

brand new ryzen 3 running minimal ubuntu.
ryzen 3 3200G 4core 3.6ghz
16gb ddr4 2400
'asus tuff b450-plus' motherboard and a slow assed wd spinning platter
(odd duck, ainnit? it's an anti-gaming desktop that can be turned into a gaming rig, for my oldest. Homework first.)


Last updated: Oct 09 2024 at 00:44 UTC