performance · brlcad · Zulip Chat Archive

@Cezar the overarching goal is to speed up ray tracing performance which has lots of potential sub-projects including conversion to opencl, optimization of specific routines, and developing a new interactive GUI based on ray tracing

Sean (Jan 30 2018 at 02:46):

there's also performance of libged commands, which completely span the gamut, and performance of our lowest utility containers in libbu (which will likely involve profiling and replacement with C++ containers)

specific performance hot spots (priorities) include A) the raytrace pipeline needing to dispatch sets of rays (64x64 postage stamps that subdivide into packets of 8 rays), B) boolean evaluation conversion to spmd/simd, C) optimization/elimination of libbu pointer tables (bu_pbtl_* calls) and bit vectors, and D) elimination of libbu container aliasing (primarily elimination of struct bu_list iteration)

Sean (Feb 12 2018 at 03:33):

@Cezar there's definitely faster ways to search for a pointer (or we could try and eliminate the need to search for a pointer in the first place) -- even a std::ordered_hash may be faster. what was meant about the bu_list elimination in D was a more efficient implementation, a different way that doesn't involve pointer aliasing. for B, not so much parallelism (though that would be a fine project for some specific commands) as commands that quickly fall apart (e.g., a database with 1M items). postage stamps are simply rendering subsections of an image, e.g., in 64x64 pixel tiles, instead of pixel or line at a time.

Rahul Saxena (Feb 13 2018 at 13:06):

Hi,I am a gsoc aspirant interested in CAD design.Please guide me how to contribute to this project.

Sean (Feb 13 2018 at 14:08):

@Rahul Saxena welcome, your best guide is yourself, demonstrating independent productivity and interest ;)

Cezar (Feb 13 2018 at 15:01):

regarding B, i was wondering what you meant by conversion to SIMD/SPMD, if not parallelism. also, i'm wondering why it's called boolean evaluation, to me it looks more like set theory, at least the operations. i've also tried to find the code doing the evaluation, it seems to be in libnmg/nmg_bool/bool.c:nmg_bool, and i think the expression tree is constructed in libged/comb.c. do i get this right?

Cezar (Feb 13 2018 at 15:03):

also, for large dbs, is evaluation too slow because of an inefficient algorithm, or does it just not happen?

Cezar (Feb 13 2018 at 15:04):

Sean (Feb 14 2018 at 05:38):

@Cezar conversion to simd/spmd is using vector units to do some things 4 at a time for the price of 1

Sean (Feb 14 2018 at 05:40):

boolean evaluation is not libnmg (that's a specific subset of polygonal geometry) -- boolean evaluation is in librt/bool.c (more specifically rt_boolweave() and rt_boolfinal()).

Sean (Feb 14 2018 at 05:46):

for large db's, it's not that evaluation is slow -- it's that some algorithms are O(N^2), that is they get exponentially slower as there are more items. so for example if some command (e.g., facetize) takes 20 seconds with 20 booleans, how long will it take with 40 booleans? you'd hope it'd take 40 seconds or better, but some algorithms don't work that way and 20 seconds might become 400 seconds. so a good project would be identifying which commands slow down exponentially and then working to fix them.

Naseef (Feb 20 2018 at 20:11):

Cezar (Mar 22 2018 at 10:59):

i've run rt -B -s512 -H127 -J0 -o h.p havoc.g havoc under a profiler, it took 6.12 min with most of that spent in bool_eval (1.76 min, and 1/5 of that iterating through a bu_list) and bu_ptbl_cat_uniq (17.68 sec)

Cezar (Mar 22 2018 at 10:59):

Sean (Mar 22 2018 at 14:15):

@Cezar you found three priority performance areas right there (bu_pbtl, bu_list, and boolean evaluation) -- perhaps some of the earlier comments will make more sense to you now. just addressing all three of those is a lot to tackle for gsoc, but definitely all priority. things to think about when optimizing should generally go top-down like can some of the boolean evals be eliminated, for those that can't can they be made more efficient like eliminating bu_list traversals, for those that can't maybe the container can be changed (e.g., from a list to an array), or maybe changing it so we don't need to store pointers, and finally changing the way pointers are being stored (e.g., using a c++ hash or map container)

Cezar (Mar 22 2018 at 17:27):

is it fine using c++? HACKING says that it's "no prohibited", but also that the projects aims to be C89-compliant?

Cezar (Mar 22 2018 at 17:28):

also, i was looking at set data structures (for bool eval), and union-find looks like a good choice. has it been considered and discarded before?

Cezar (Mar 22 2018 at 17:31):

i think i've asked about this before, but "bool eval" seems like a misnomer, but maybe i'm missing something? it looks more like sets to me than boolean anything

Sean (Mar 22 2018 at 18:33):

It's looking like we'll be doing our last c89-compliant release here soon (in the next couple weeks hopefully) after which we will be requiring c++11, so that can be part of a gsoc plan

Sean (Mar 22 2018 at 18:35):

as for union-find, the results of boolean evaluation are not necessarily disjoint sets

Sean (Mar 22 2018 at 18:37):

Cezar (Mar 23 2018 at 12:14):

thanks for the answers. i have another question :D previously, you mentioned pointer aliasing in bu_list, but looking through src/libbu/list.c, i can't find any function where pointer aliasing is involved. i was wondering where i could find an example of PA

Sean (Mar 24 2018 at 02:34):

this is a somewhat advanced topic. you should read up more on what exactly pointer aliasing is. it's not something you'd find in a list.c function -- you find it in all the places the struct is used

Cezar (Mar 24 2018 at 08:29):

i did read up on it, but i misunderstood and thought that it could only occur with function arguments (i see why that's not true now)

Cezar (May 09 2018 at 08:42):

i created two plots of calls to bool_eval during the raytracing of havoc.g. the first one has the number of segments in the partition given as argument (which is iterated when looking for seek_stp) and the second one is the number of times the OP_SOLID case is taken in the switch

Cezar (May 09 2018 at 08:44):

it seems that iterating through the list of segments isn't that expensive since there are usually 2 or 4 segments in the list, but the search happens a lot of times so it adds up

Cezar (May 09 2018 at 08:46):

it also tells me that a replacement data structure should have low constant factor, else it will likely perform worse despite having better asymptotic complexity (N is very small in this case)

Cezar (May 09 2018 at 08:49):

it might also be that havoc isn't really representative of cases where the list of segments is huge :-? i used it because it's the largest example geometry, but if other examples are better, i'll look at those as well

Sean (May 09 2018 at 12:08):

Very cool graphs! @Cezar if you keep these images somewhere, you could probably publish a formal report later into the project. Glad to help you submit it somewhere if that's something interesting to you.

Another insight from usually having 2 or 4 segments in the list is that there's probably a structure that could be devised that treats them in pairs that would do better (or be more vectorizable potentially). It's also significant that there are "relatively few" in each partition, though we do have to be wary that the number of segments could be unlimited in the worst case.

Sean (May 09 2018 at 12:12):

havoc is fairly representative of a traditionally constructed csg model. other common types are imported geometries which will typically be mostly BoT meshes or NURBS, often with just a few (expensive) csg operations sprinkled throughout (e.g., a handful of subtractions to fix overlaps).

Sean (May 09 2018 at 12:52):

With the list of elements being as low as they are, one of the things I've thought would probably help is setting up a sorting network for the < N case

Cezar (May 09 2018 at 17:59):

didn't think about vectorization. i haven't written any vectorized code before, but i'll write some this week, since it seems to pop up rather often around this project

Cezar (May 09 2018 at 18:00):

regarding "setting up a sorting network for the < N case", i suppose you mean determining a certain number N, and if there are less than N segments, set up a sorting network?

Sean (May 09 2018 at 18:06):

you shouldn't need to worry about vectorization too much directly as there is some compiler support (for automatic vectorization) and several good abstractions available. we're more interested in OpenCL and TBB right now

Sean (May 09 2018 at 18:07):

yes, using a sorting network when the number of items to be sorted is small, falling back to something more general when it's larger

Cezar (May 09 2018 at 18:10):

and having a sorting network in place, i would be able to find seek_stp in O(1) using the comparators?

Cezar (May 09 2018 at 18:13):

Cezar (May 09 2018 at 18:14):

and i suppose it would not work for (very) large N because of the space complexity?

Sean (May 09 2018 at 18:23):

Sean (May 09 2018 at 18:24):

I'm not sure what exactly the bool_eval() logic is doing for OP_SOLID, would have to study the code more to understand why it's scanning for that tree solid in partition list

Cezar (May 09 2018 at 19:11):

i'm not familiar with sorting networks. i read about them now, and i don't get where they fit in, if not for the logN search :-?

Cezar (May 10 2018 at 11:12):

i was looking at a profile and noticed that bu_bitv_clear took a lot of time (Debug build). BU_BITV_ZEROALL is a macro that zeroes the bytes in the bit vector using a while loop. i replaced the loop with a call to memset and one call to the macro in rt_shootray went from 5.21 s to 248 ms. i was wondering what the reason for the loop was initially

Cezar (May 10 2018 at 11:16):

although i expect that the improvements would be smaller on a release build, i think they would still exist

Cezar (May 10 2018 at 12:34):

Cezar (May 10 2018 at 12:38):

regarding what you said before, that bitv functions shouldn't show up in profiles, i think it's fine. the only one i'm seeing is bitv_clear, and for 1 M bits, you still have to write 125,000 zeroes to memory

Cezar (May 10 2018 at 12:40):

Cezar (May 10 2018 at 12:49):

i'm thinking of pre-alloc'ing a large number of bytes for rt_shootray itself, and whenever a bitv is needed, just return a position within this buffer with enough bytes to fulfil the request. only zero it when a request can't be satisfied

Cezar (May 10 2018 at 12:53):

Cezar (May 10 2018 at 22:44):

although for large number of bitv_clear's with large number of elements, it would most likely matter

Cezar (May 10 2018 at 22:44):

Cezar (May 10 2018 at 22:49):

if i start with replacing bu_ptbl, i still think testing existence in O(1) using bit vectors or stl is where i should begin

Sean (May 16 2018 at 13:22):

Cezar (May 18 2018 at 06:38):

as i've written in my dev log, i replaced the bu_ptbl iteration inside bool_eval with bu_bitv. now rt_boolfinal is slower because of bit vector initialisation, but bool_final went from 14 seconds to 12 seconds. a lot of time is still spent checking if the solid is inside the partition (as expected from previous graphs), but each check is faster.

Cezar (May 18 2018 at 07:04):

i guess the next step would be to replace bu_bitv with std::bitset in this specific instance and see what happens then

Cezar (May 18 2018 at 07:17):

but i don't think it's worth it. the main problem with bool_eval is that there are lots of nodes in the tree, and each operation is already o(1), except the bu_ptbl iteration, in which there are few elements (most of the time just 1, in the worst case ~60; but even then, i would trade iterating over 60 elements for initialising a huge bit vector a lot of the time; even if i use std::bitset, i don't see it being worth it)

Cezar (May 18 2018 at 08:54):

Cezar (May 18 2018 at 09:20):

i tried with std::bitset. it seems to be consistently 2 seconds faster on Debug, which is what i would expect :-?

Cezar (May 18 2018 at 09:30):

Cezar (May 18 2018 at 09:56):

on Release, it's more like 0.5–1 seconds faster. it's consistently faster, but the impact is smaller

starseeker (May 18 2018 at 11:22):

@Cezar Go ahead and submit the std::bitset patch, since you've done the work - if it's consistently faster I don't see any particular reason not to use it, and even if we don't end up doing so it will be available to others for future testing.

Cezar (May 18 2018 at 12:58):

Sean (May 19 2018 at 03:39):

Sean (May 19 2018 at 03:41):

What's it look like in Release? Debug is fairly insignificant for performance testing... lots of things we have intentionally run slower when compiled as Debug so that they will be accessible within a debugger.

Sean (May 19 2018 at 03:53):

I'd make sure it is consistently faster in Release mode before applying the change -- @Cezar your intuition about initializing repeatedly being more expensive than short list iterations could be true when optimized. it's worth being more certain across models on both ends of the spectrum.

Cezar (May 20 2018 at 11:10):

i'm currently running bench/run.sh with my changes, on release. after that i'll run it with trunk and post the results

Cezar (May 20 2018 at 17:13):

i ran the benchmarks and they actually display a slowdown across all models. here is trunk and here are my changes. i ran a debug build inside a profiler, and the destructor for vector shows up in there, which isn't salutary

Cezar (May 20 2018 at 17:19):

one mistake i think i'm making is that i'm initialising (and then destroying, too) an ap->a_rt_i->nsolids-long bit vector for each partition, and i could do that once outside the loop, and then just set the bits at each iteration.

Cezar (May 20 2018 at 17:36):

actually, i don't think there's any way around it. basically what i'm doing is introducing nsolids operations for each partition, which previously did not exist.

Cezar (May 20 2018 at 17:38):

i think a smarter approach is to store the results of the search in OP_SOLID in bool_eval. first time through the list, it would be just as expensive, but subsequent searches would become o(1)

Cezar (May 21 2018 at 21:40):

i've tried things with std::map and std::unordered_map these days, they seem to slow down raytracing significantly. i think this is because the containers need to be constructed/destructed; there are few elements in the bu_ptbl; accessing an element in a map is more expensive than in an array; and locality (maps are trees, unordered_maps use linked lists for buckets, elements in bu_ptbls are contiguous)

Cezar (May 21 2018 at 21:43):

another thing i found out is that there are a lot of partitions (> 100k). previously, i imagined there were ~100 partitions with huge trees

starseeker (May 22 2018 at 03:36):

@Cezar Is there any way you can re-use the map/unordered_map rather than creating/destroying it? (no idea if that makes sense or is practical in this context, just a thought...)

starseeker (May 22 2018 at 03:37):

Cezar (May 22 2018 at 11:19):

i ran it under a profiler and it looked like most of the time was spent accessing elements (inside operator[])

Cezar (May 22 2018 at 11:20):

Cezar (May 22 2018 at 15:42):

i tried using hashes when the number of segments in a partition is bigger than a threshold. performance degraded now as well, probably because of all the conditionals in addition to the previous logic

Cezar (May 22 2018 at 15:48):

i'm not sure how the "space" is split into partitions (if that's what those partitions mean), but it seems like the bu_ptbl iteration was written to work well with it. in ~30 M calls to bool_eval when tracing havoc, there are < 10 segments in a partition. so it seems to have lots of partitions with few segments, and in this case, i don't think whatever data structure i use will pay off, since at the very least, i'll have to initialise it for each partition (for example zeroing the array)

Cezar (May 22 2018 at 15:51):

i tried a simple array hash, with a dumb/fast function (address of soltab mod 67), and it was still 3 seconds slower

Cezar (May 22 2018 at 15:54):

i tested using rt -B -H31 -P1 -o havoc.png havoc.g havoc, and got ~18 sec on trunk and ~21 sec with my changes

Cezar (May 23 2018 at 08:24):

so i think the next step would be to replace the bu_ptbl of struct soltabs altogether, instead of augmenting it. there is little work done when calling bool_eval once, and i think adding work there for bookkeeping will only make it slower no matter what

starseeker (May 23 2018 at 09:55):

Cezar (May 23 2018 at 11:13):

by the way, i guess i should have asked this sooner, but do you have a (measurable) performance goal in mind? i'm currently running bench/run.sh and comparing rays/sec between runs to determine if it's improving, but i was wondering if there is a goal like "at least 5.5 M rays/sec when benchmarking moss"

Sean (May 27 2018 at 18:48):

we haven't quantified the goals, and a naive attempt would likely be unusefully arbitrary (e.g., 2x faster).

Sean (May 27 2018 at 18:52):

it would be good if we could figure out an objective metric, like having a render targets for an implicit object, triangle, and nurbs/brep version that have sufficient quality that they converge to the same image (within some specified pixel deviation tolerance).

Sean (May 27 2018 at 18:53):

then we could at least talk about relative costs in a more meaningful manner. it might be the start of defining on OCH (similar to SAR used by kd-tree, but not misnomered)

Sean (May 27 2018 at 18:54):

realistically, we know there's a solid order of magnitude possible through data coherency alone, simd should give a similar order of magnitude boost -- so something that is currently 1M rays/s might attain 100M rays/s

Cezar (May 27 2018 at 19:43):

so i've finished fixing my build errors and ran benchmarks using std::vector. they're better on all models, but not by much. however, i am now able to more easily test other stl containers. what i did was 1) change the struct bu_ptbl pt_seglist member of struct partition to std::vector<struct seg *> pt_seglist (and implemented the operations used in bool.cpp with the new structure); 2) split rt/ray_partition.h from raytrace.h and included it separately where it's needed; and 3) converted the files where the header is needed to C++ (with C linkage). what this means though is that you can no longer iterate the segments list from C, and i've also had to change a lot of files. the performance improvements so far are a bit underwhelming maybe, so i'm wondering if the trade offs are worth it

Cezar (May 27 2018 at 19:43):

gsoc/bench % cat summary-trunk-*
Abs  rmbp 4041636.94    1714565.15  1734933.19  1348127.28  1830646.00  2287836.14  2159624.11  Thu May 24 01:01:31 EEST 2018
*vgr rmbp 29498.84  25567.62    30942.27    25264.75    25896.81    154.35  22887.44
Abs  rmbp 4168024.50    1761264.50  1715300.76  1403866.97  1816805.40  2308539.52  2195633.60  Thu May 24 01:17:52 EEST 2018
*vgr rmbp 30421.31  26264.00    30592.13    26309.35    25701.02    155.75  23240.59
Abs  rmbp 4191027.14    1735223.93  1774402.02  1391472.34  1786512.82  2302579.43  2196869.61  Thu May 24 01:37:09 EEST 2018
*vgr rmbp 30589.20  25875.69    31646.19    26077.06    25272.49    155.35  23269.33
Abs  rmbp 4194157.90    1708366.28  1768637.50  1422026.54  1912957.40  2312326.89  2219745.41  Thu May 24 01:56:24 EEST 2018
*vgr rmbp 30612.05  25475.19    31543.38    26649.67    27061.21    156.01  23582.91
Abs  rmbp 4252770.44    1786723.23  1836428.96  1419499.54  1919978.36  2304439.92  2253306.74  Thu May 24 02:14:48 EEST 2018
*vgr rmbp 31039.85  26643.65    32752.43    26602.31    27160.53    155.47  24059.04
gsoc/bench % cat summary-vector-*
Abs  rmbp 4684937.42    2031225.42  2025885.21  1557917.44  2090030.01  2569718.32  2493285.63  Sun May 27 21:15:55 EEST 2018
*vgr rmbp 34194.12  30289.67    36131.35    29196.35    29566.13    173.37  26591.83
Abs  rmbp 4638519.22    1932765.21  1975258.83  1554311.46  2033613.46  2507226.39  2440282.42  Sun May 27 21:33:41 EEST 2018
*vgr rmbp 33855.33  28821.43    35228.44    29128.77    28768.05    169.16  25995.19
Abs  rmbp 4605819.18    1986518.42  2022192.86  1557950.64  2083347.79  2540601.93  2466071.80  Sun May 27 21:49:56 EEST 2018
*vgr rmbp 33616.66  29623.00    36065.50    29196.97    29471.60    171.41  26357.52
Abs  rmbp 4681758.44    2003582.74  2061324.01  1560023.33  2034590.59  2532082.09  2478893.53  Sun May 27 22:05:59 EEST 2018
*vgr rmbp 34170.92  29877.46    36763.40    29235.81    28781.87    170.83  26500.04
Abs  rmbp 4581822.58    2019161.52  2035468.56  1575925.77  2070459.65  2589455.33  2478715.56  Sun May 27 22:22:35 EEST 2018
*vgr rmbp 33441.51  30109.77    36302.27    29533.84    29289.28    174.70  26475.22

Cezar (May 27 2018 at 19:43):

Cezar (May 27 2018 at 19:48):

Sean (May 27 2018 at 19:49):

Cezar (May 27 2018 at 19:50):

Sean (May 27 2018 at 19:50):

considering ptbl's were only amounting to about 20-30% of the overall time, you cut it way down

Sean (May 27 2018 at 19:50):

Sean (May 27 2018 at 19:51):

it's a way to split up geometry (whether with a kd-tree or bsp or other spatial partition method) where you want to try and balance the amount of work on both sides of a split so they're roughly even

Sean (May 27 2018 at 19:52):

if everything is a triangle, then surface area tends to be a reasonable metric for that ... the amount of geometry surface over here equals the amount over there, so this is a good split point

Sean (May 27 2018 at 19:52):

Sean (May 27 2018 at 19:53):

the generalization is some form of an object complexity or object cost metric / heuristic

Cezar (May 27 2018 at 20:02):

ok, i think i understand what that means. i'm also curious what a render target is. something like "a target number of operations", or something else? and regarding data coherency, is that cache coherency, or something else/more?

Sean (May 27 2018 at 20:56):

in what context? a render target is typically an image being made or a particular type of image

Cezar (May 27 2018 at 20:57):

oh, you said previously "it would be good if we could figure out an objective metric, like having a render targets for an implicit object"

Sean (May 27 2018 at 20:58):

data coherence does refer to cache coherency or more generally to coherency in memory and across memory boundaries -- for example, if something is on disk, you obviously want to only read from disk once and do all your processing on it (as opposed to reading it from disk repeatedly)

Sean (May 27 2018 at 20:59):

same thing applies for things read into main memory that end up on the cpu or gpu memory, like if on the cpu, keep things in l3 as long as possible, in l2 as long as possible, etc

Cezar (May 27 2018 at 21:01):

oh, i see. is reading repeatedly from disk a problem that exists right now in the codebase, or was it a hypothetical example?

Sean (May 27 2018 at 21:01):

yeah, previous comment about render target was to have a desired image (e.g., a 1024x1024 rendering of a sphere) that you would get when you run (for example) rt on a sph object, or on a sph.bot object or a sph.brep object -- the "target" aka desired image would be the same for all three

Sean (May 27 2018 at 21:01):

Sean (May 27 2018 at 21:03):

typically, coherency is disk < net < main memory < l3 cache < l2 cache < l1 cache || gpu cache

Cezar (May 27 2018 at 21:05):

Sean (May 27 2018 at 21:11):

yes! it is also memory hierarchy, coherency is simply a data-centric view (as opposed to a memory architecture perspective) where you keep related memory items together, so you can do what you need on the data without reaching down the hierarchy for more data (thus incurring a stall)

Sean (May 27 2018 at 21:23):

starseeker (May 28 2018 at 14:50):

Cezar (May 28 2018 at 15:59):

Cezar (May 28 2018 at 17:22):

so while modifying the code to use std::set, i noticed that i can't actually remove the iteration in bool_eval because in there, it's searching for a soltab, while the container holds segs, so i have to access seg_stp. i also can't change it to hold soltabs because somewhere else in the code, it requires a seg's other members and there's no pointer from a soltab to a seg

Cezar (May 28 2018 at 20:04):

i tried unordered_set, but stopped quickly. after a few seconds, it was still not done with the first item in the benchmark (which takes 0.06 sec with bu_ptbl). i profiled it and most (all?) of the time is spent in the container's rehash method, so i don't think there's much to do here

Cezar (May 28 2018 at 20:28):

also related to my previous message, i can't use bit vectors either. and since maps are similar to sets (buckets and rehashing), i don't think they'll yield good results

starseeker (May 29 2018 at 00:18):

@Cezar Out of curiosity, what do you think about maybe trying a C implementation of red black trees? We've got one in libbu, and if it has problems there are a few others out there we could replace it with.

Cezar (May 29 2018 at 08:05):

i think std::set and std::map use rb trees. i didn't get to test them yesterday because i kept getting seg faults and postponed fixing them, but i'll get to it today. after that, i'll try the libbu implementation as well. also, i think i can test for membership in O(log N) in bool_eval with std::map

Cezar (May 29 2018 at 10:13):

i got set to work, but it's worse than trunk (moss, frame 8, 3.3 M rays/sec vs 4.5 and 5.2). i'll try map next, and then the rb tree in libbu

Cezar (May 29 2018 at 12:22):

tried map, it's still slow. profiled it, spends 9 seconds (out of 1.25 minutes on Debug) in bool_eval finding the soltab with .count(). i think it's because of locality + expensive operations. it's log N, but when searching for < 5 elements, the constant factors matter

Cezar (May 29 2018 at 12:28):

i'll try libbu's implementation as well. as for other implementations, i remember the linux kernel having one. maybe there are a few tricks there. there's also the matter of licensing :-?

Cezar (May 29 2018 at 14:43):

although if i’m trying other implementations, i can try different packages for hashes, too

Sean (May 30 2018 at 05:21):

Shouldn’t need to worry about licensing. Typically only problems are with gpl and academic codes that have a discriminatory clause like cc-by-nc making it not Open Source.

Sean (May 30 2018 at 05:27):

I think you are seeing a clear pattern that init constant is dominant on the O(logN) methods so either need a diff init or change the algorithm...

Sean (May 30 2018 at 05:28):

Can you briefly summarize everything you’ve tried and the timings into a table? If the data is handy....

Cezar (May 30 2018 at 08:29):

That was my initial plan, but after seeing the poor performance of other containers, I stopped the benchmarking suite before completing. For example, using map, it went from ~2.2 M for world to ~1.2 M, so I figured it's not worth running further.

Cezar (May 30 2018 at 08:32):

But I'm curious why vector is faster than bu_ptbl. I want to try adjusting the initial capacity and see if this improves performance.

Cezar (May 30 2018 at 11:14):

i set bu_ptbl's initial capacity to 16 (down from 64) and the numbers are actually close to vector. it still has a slight edge, but the upside is that there's no need to switch to C++

Cezar (May 30 2018 at 11:14):

gsoc/bench % cat summary-trunk-*
Abs  rmbp 4041636.94    1714565.15  1734933.19  1348127.28  1830646.00  2287836.14  2159624.11  Thu May 24 01:01:31 EEST 2018
*vgr rmbp 29498.84  25567.62    30942.27    25264.75    25896.81    154.35  22887.44
Abs  rmbp 4168024.50    1761264.50  1715300.76  1403866.97  1816805.40  2308539.52  2195633.60  Thu May 24 01:17:52 EEST 2018
*vgr rmbp 30421.31  26264.00    30592.13    26309.35    25701.02    155.75  23240.59
Abs  rmbp 4191027.14    1735223.93  1774402.02  1391472.34  1786512.82  2302579.43  2196869.61  Thu May 24 01:37:09 EEST 2018
*vgr rmbp 30589.20  25875.69    31646.19    26077.06    25272.49    155.35  23269.33
Abs  rmbp 4194157.90    1708366.28  1768637.50  1422026.54  1912957.40  2312326.89  2219745.41  Thu May 24 01:56:24 EEST 2018
*vgr rmbp 30612.05  25475.19    31543.38    26649.67    27061.21    156.01  23582.91
Abs  rmbp 4252770.44    1786723.23  1836428.96  1419499.54  1919978.36  2304439.92  2253306.74  Thu May 24 02:14:48 EEST 2018
*vgr rmbp 31039.85  26643.65    32752.43    26602.31    27160.53    155.47  24059.04
gsoc/bench % cat summary-ptbl16-*
Abs  rmbp 4680155.95    1927090.59  1999106.16  1550430.99  2071242.81  2599398.72  2471237.53  Wed May 30 12:53:31 EEST 2018
*vgr rmbp 34159.22  28736.81    35653.75    29056.05    29300.36    175.38  26180.26
Abs  rmbp 4662057.95    1932794.89  1932672.45  1539780.22  2089337.31  2568054.50  2454116.22  Wed May 30 13:11:15 EEST 2018
*vgr rmbp 34027.13  28821.87    34468.92    28856.45    29556.33    173.26  25983.99
Abs  rmbp 4483025.56    1959477.52  1968239.40  1534994.64  2065281.19  2589953.44  2433495.29  Wed May 30 13:29:25 EEST 2018
*vgr rmbp 32720.42  29219.76    35103.25    28766.76    29216.03    174.74  25866.82
Abs  rmbp 4582552.38    1887762.28  2002007.33  1554848.10  2082253.88  2560400.25  2444970.70  Wed May 30 13:46:10 EEST 2018
*vgr rmbp 33446.84  28150.34    35705.49    29138.83    29456.13    172.74  26011.72
Abs  rmbp 4602999.06    1911004.39  1992763.36  1525482.72  2079252.77  2585297.14  2449466.57  Wed May 30 14:02:18 EEST 2018
*vgr rmbp 33596.08  28496.93    35540.63    28588.50    29413.67    174.42  25968.37
gsoc/bench % cat summary-vector-*
Abs  rmbp 4684937.42    2031225.42  2025885.21  1557917.44  2090030.01  2569718.32  2493285.63  Sun May 27 21:15:55 EEST 2018
*vgr rmbp 34194.12  30289.67    36131.35    29196.35    29566.13    173.37  26591.83
Abs  rmbp 4638519.22    1932765.21  1975258.83  1554311.46  2033613.46  2507226.39  2440282.42  Sun May 27 21:33:41 EEST 2018
*vgr rmbp 33855.33  28821.43    35228.44    29128.77    28768.05    169.16  25995.19
Abs  rmbp 4605819.18    1986518.42  2022192.86  1557950.64  2083347.79  2540601.93  2466071.80  Sun May 27 21:49:56 EEST 2018
*vgr rmbp 33616.66  29623.00    36065.50    29196.97    29471.60    171.41  26357.52
Abs  rmbp 4681758.44    2003582.74  2061324.01  1560023.33  2034590.59  2532082.09  2478893.53  Sun May 27 22:05:59 EEST 2018
*vgr rmbp 34170.92  29877.46    36763.40    29235.81    28781.87    170.83  26500.04
Abs  rmbp 4581822.58    2019161.52  2035468.56  1575925.77  2070459.65  2589455.33  2478715.56  Sun May 27 22:22:35 EEST 2018
*vgr rmbp 33441.51  30109.77    36302.27    29533.84    29289.28    174.70  26475.22

Cezar (May 30 2018 at 12:14):

Cezar (May 30 2018 at 21:15):

Cezar (Jun 01 2018 at 18:53):

i tried running with the rb tree implementation in libbu. it's super slow (tens of seconds where trunk takes 0.06) and that's because of acquiring and releasing semaphores. running with -P 1 makes it faster. it could be an implementation problem, maybe locking isn't required. i still think it has worse locality and initialisation, which should matter for small lists. but if you want me to try other implementations, i will do so. if you want to see my code (maybe i'm doing it wrong), i'll submit a patch

Cezar (Jun 01 2018 at 19:16):

Cezar (Jun 01 2018 at 22:21):

libbu semaphores are used because of MALLOC_NOT_MP_SAFE, which i think is not needed nowadays (?). i undefined it and performance increased significantly, but it's still ~3x slower than bu_ptbl. insertions (due to allocating new nodes) and the OP_SOLID case in bool_eval are the hotspots. i could try to keep a pool of nodes and allocate from that

starseeker (Jun 02 2018 at 01:11):

starseeker (Jun 02 2018 at 01:13):

the "memory allocation done by client" bit intrigues me, since your testing is indicating initialization costs are an issue... makes me wonder if we could pre-allocate...

starseeker (Jun 02 2018 at 01:15):

I'm not surprised our redblack tree in libbu has issues - to the best of my knowledge its not been stressed much

starseeker (Jun 02 2018 at 01:28):

@Cezar might also try a few other containers like a splay tree and see if they do anything interesting...

starseeker (Jun 02 2018 at 01:30):

Cezar (Jun 02 2018 at 06:36):

i could try it, but looking at the readme, i see that it's "Not thread safe". this seems like a big problem :-?

Cezar (Jun 02 2018 at 06:39):

since most of the time is spent allocating nodes during insertions, i think a memory pool would be helpful. i was considering it, but my thinking is that i would still need a list to manage the pool, i'd still iterate over it, and it would be bigger still than the current lists (of segments). so it might harm performance in the end

starseeker (Jun 02 2018 at 20:55):

Cezar (Jun 04 2018 at 08:15):

i think so, since it doesn't hold any global state, and the way it's used is contained to a single thread. so if thread A creates a bu_ptbl, only thread A will read/modify it

Cezar (Jun 04 2018 at 08:18):

but rb3ptr looks to be the same, i hadn't thought about it before pasting the "not thread safe" comment

Sean (Jun 05 2018 at 04:05):

On the surface, that was a shocking discovery but in retrospect it makes complete sense if static init is dominating the timing. Did you test any other sizes like 8 or 0 or 24?

Sean (Jun 05 2018 at 04:14):

I think you've just about killed all the ptbl replacement options. you found a way to make it considerably faster, great -- I suggest moving on to the next profile hot point. if ptbl is still high on the profile for specific cases, it might make sense to use a hybrid strategy (like plain array for first 8 entries, then some logN method for additional).

Sean (Jun 05 2018 at 04:16):

Cezar (Jun 05 2018 at 07:08):

I tried 8 and that's better than trunk as well, but worse than 16. I think 16 is a sweet spot because there are usually between 8 and 16 segments in a partition, so no resizing is needed and no time is wasted zeroing unused elements

Cezar (Jun 05 2018 at 07:12):

you previously mentioned data coherency, bu_list pointer aliasing and simd offering big performance improvements. i would like to look at those next, but i'm not sure where to begin

Cezar (Jun 05 2018 at 07:13):

also, i asked if there is a way to run the benchmark suite on windows (i don't see any). if there isn't, would it be worth it if i ported the run.sh script to python?

Cezar (Jun 06 2018 at 19:41):

i was looking at vectorization, and what i want to try next is to compile with clang's vectorization reports, see what couldn't be vectorized and why, and try to rewrite the code. does this sound good?

Cezar (Jun 06 2018 at 19:47):

also, while going through cmakelists, i noticed some SSE flags commented out due to invalid instructions. is there more context around it?

Sean (Jun 07 2018 at 04:07):

there's not, but not worth porting to python at this time -- there are other plans for it already in the works

Sean (Jun 07 2018 at 04:17):

I suggest finding something relatively simple and isolated. see what it takes to change it. the most egregious cases are functions that take/pass bu_list as a function parameter. looking quickly at public headers, wdb.h stands out as just having a couple and it's a fairly isolated library -- maybe start there as a test.

Sean (Jun 07 2018 at 05:22):

It's hard to test for SSE without causing a runtime crash. We need both build-time and (if not more importantly) runtime sse detection that doesn't crash when you don't have sse.

Cezar (Jun 07 2018 at 06:14):

oh, ok :D is there anything i can read on the plans? the reason i wanted to test on windows is that it is the most downloaded version on sf, and figured it is important to benchmark there as well

Cezar (Jun 07 2018 at 06:16):

Cezar (Jun 07 2018 at 12:06):

i want to see if i get this right. in int mk_pipe(struct rt_wdb *fp, const char *name, struct bu_list *headp), is the reason for pointer aliasing that struct rt_wdb's first member is a struct bu_list, and the compiler can't be sure that fp and headp point to disjoint memory?

Cezar (Jun 07 2018 at 12:07):

Cezar (Jun 07 2018 at 12:08):

i thought that if the parameters have different types, they're assumed to not be aliases of one another, so i guess that's why i don't understand why aliasing takes place in the example i gave

Sean (Jun 07 2018 at 13:07):

There's nothing to read up on, nothing fancy. It's just being reworked in another way and being used as a testing framework for another performance code (i.e., being slightly generalized). I can let you know if there's bits you can tackle since it is performance related, but don't want to spread you out too thin!

Sean (Jun 07 2018 at 13:12):

it should be and that's essentially what we have now (see src/libbu/simd.c) but then there are outliers doing different things (e.g., include/bn/dvec.h)

Sean (Jun 07 2018 at 13:33):

So this is a little complicated to explain because the aliasing typically happens in non-obvious and in unlocalized places. There's no problem with fp in that function signature. There's technically (not yet) a problem with headp either except that it's a generic bu_list -- meaning "it's a list of something".

If you look in src/libwdb/pipe.c, you'll find that function and sure enough mk_pipe() doesn't really do anything other than add that headp bu_list to another bu_list (i.e., pipe_segs_head). But if we look at the places pipe_segs_head is used, we start to find aliasing in action that screws with performance. Consider the destructor function for example: mk_pipe_free()

In there, it's iterating over a generic "list of something" headp, but casting each list element to a struct wdb_pipept, unlinking it from the list (i.e., dequeue wp->l), and releasing the memory. That only worked because wdb_pipept has a bu_list as the first byte in its structure.

With strict pointer aliasing, the compiler could know that the next element in the bu_list is += sizeof(struct bu_list) bytes ahead... but it's not in our case because it's really += sizeof(struct wdb_pipept) bytes ahead. And if the compiler assumes wrongly, trying to optimize aggressively, it will end up prefetching the wrong bytes, and it'll typically crash eventually. When we turn strict aliasing off, it tells the compiler to not do any lookahead optimization, use the manual cast sizes that switch from one thing to another depending on which function we are in.

Sean (Jun 07 2018 at 13:38):

So the issue really is bu_list's entire existence because it was implemented specifically to take advantage of non-strict aliasing, for code simplicity, brevity. The "fix" requires re-evaluating the container being used, and figuring out a different container -- like A) passing a struct wdb_pipept with traditional forw/next pointers in it (so that structure is a list), B) passing a plain NULL-terminated array of wdb_pipept structures, C) using some other generic C container (akin to using a std::vector<struct wdb_pipept>), D) ...

Cezar (Jun 07 2018 at 13:42):

yep, it makes sense now. what i was missing was the part about prefetching the wrong bytes

Sean (Jun 07 2018 at 13:55):

Sean (Jun 07 2018 at 14:03):

For what it’s worth, the answer should be to convert a bu_list to either an array of structs or a struct with array(s) in it. If you read up and see terms like AoS or SoA or AoSoA, that’s what the typical performance pattern is. Linked lists should basically never be used any more except underpinning more complex containers like std::map

Cezar (Jun 07 2018 at 14:09):

Cezar (Jun 07 2018 at 14:33):

oh, i just looked up SoA and AoS, i've done similar things when implementing graphs, but didn't know they had a name, or their performance implications :D

Sean (Jun 07 2018 at 14:49):

Yes, memory locality dominates. When a page of memory is loaded and you access a structure, it’s very likely that the next structure you’ll need is on that same page of memory, so you avoid having to refetch if it’s an array. With linked list pointers, it’s possible for every structure to live on different pages and always or frequently incur a page fault.

Cezar (Jun 08 2018 at 17:24):

i got two questions :D 1) i profiled rt(1) again, and i can't find hotspots involving bu_list. i understand what the problem is conceptually, but is there anything i can do to convince myself that it is a problem in practice? am i missing something in the rt profile, or do i have to profile something else?; 2) if i do have to replace bu_list with an array-based structure, i'll have to implement (or find something already done — i see that bu/list.h mentions some candidates for replacement) something similar to bu_ptbl, but which holds elements of arbitrary size, and change the entire code base to use this new structure. or are there specific libraries which i should focus on (but if there are, i suppose this ties into my first question — how do i identify them?)?

starseeker (Jun 09 2018 at 02:11):

@Cezar did you run rt with -F/dev/null to avoid overhead generated by image writing?

starseeker (Jun 09 2018 at 02:15):

@Cezar I don't know if you'll have to completely replace bu_list everywhere, but any substantial change to bu_list's use in librt I would expect to touch a lot of code. My advice would be to set up "parallel" implementation files and API calls which replicate functionality using the new replacement for bu_list. That way, you can instrument things up to do an apples to apples comparison on output accuracy in the old code vs. the rewritten code. And if (as I suspect) we would have to leave the old code in place while deprecating it in favor of a de-bu_list-ified API, that would position the code well to handle/manage that process.

starseeker (Jun 09 2018 at 02:19):

@Cezar an unrelated point, but something I am curious about - are there any places in the code where our use of bu_ptbl_ins_uniq show up as any sort of performance bottleneck? The uniqueness guarantee of bu_ptbl_ins_uniq I would expect to get expensive quickly as table sizes grow... Or, if it doesn't immediately show up in our code, could you try setting up some sort of unit test to get a sense of when bu_ptbl_ins_uniq starts to get problematic performance wise for both random and worst-case uniqueness searching?

starseeker (Jun 09 2018 at 02:22):

It may not matter greatly for raytracing (based on your work to date) but since bu_ptbl seems to perform well enough to keep around I'd like to at least have some sense of what we might do (and when we might need/want to do it) to avoid very high performance penalties for uniq insertion into large bu_ptbl arrays, if it's a test you can do easily and quickly...

starseeker (Jun 09 2018 at 02:24):

e.g. could we "switch" the backend container to a red-black tree implementation or something similar when the bu_ptbl size gets past some size threshold, and keep the current fast behavior at smaller array sizes when the uniq test is cheap enough to not be a concern?

Cezar (Jun 09 2018 at 07:28):

no, i was doing -o file.png. i was looking for something like -F when profiling, but i missed it when skimming through the man pages

Cezar (Jun 09 2018 at 07:29):

Cezar (Jun 09 2018 at 07:35):

bu_ptbl_ins/cat_uniq do show up when profiling, but i can't tell if it's a bottleneck. they show up in rt_boolfinal when iterating over the ~150k partitions, so i think it's because they're called a lot of times, not because the calls themselves are expensive. i drew some graphs a few weeks back showing that the number of segments in the list is <= 12

Cezar (Jun 09 2018 at 07:42):

that should be easy to test, but i'm wondering if it's not "premature optimisation", since large bu_ptbls don't seem to show up anywhere. theoretically they can get huge, and i think even for ~10k elements a rb tree would be faster in the worst case, but if that's never exhibited, what's the point? also, i'm not familiar with the particulars of the raytracer, but i think it's splitting the space into lots (~150k) of small (~16 segments) partitions, which works well with the bu_ptbl implementation. or maybe it's a coincidence i've noticed

starseeker (Jun 09 2018 at 12:29):

@Cezar fair enough. libged's search command uses bu_ptbl, and that was actually the use case I had in mind where bu_ptbl_ins_uniq might end up expensive, but you're correct that that's not the focus for this project.

starseeker (Jun 09 2018 at 12:31):

I can always throw it together later if needed, and if you're going to tackle the bu_list question that's going to demand full attention ;-)

Cezar (Jun 13 2018 at 14:07):

i started working on moving struct region to SoA. i should look at the code and see how the structure is used and see if AoSoA is worth trying as well (for example, i've seen loops where only the reg_namemember is used, but there are also places where two or more members are used)

Cezar (Jun 13 2018 at 14:08):

i'm also thinking of using xxhash when inserting/updating an element for fast equality check (is this region equal to this other region?). but this should make working with these structs a bit more complicated

Cezar (Jun 13 2018 at 14:10):

a problem i'm currently having is that when inserting/updating, i have to update a lot of arrays (15 in this case). this means at least 30 lines of code (15 if the array needs to be realloc'd and 15 for updating the data), and i'm looking for ways of making this shorter. i was thinking of making the struct a void * array of arrays, where each outer line is a member. then i need to #define the names of the members like #define reg_name el[0] (like tree.h does, i think). but then i think that when using the structure, the members will have to be casted to their actual type. this seems cumbersome. if i can't find a proper solution, i'll just write the code to update each element (maybe generate it with python, outside trunk)

Cezar (Jun 13 2018 at 14:15):

i also still don't know how to measure the impact the data organisation has on the cache. using PMCs, i've determined there are ~1.7x as many "no execute" cycles (waiting for memory) as execute cycles. but i don't know how to measure the impact of each line of code (i think it should be possible). i think this is important because there is a lot of code to change when moving to SoA, and i would prefer to have a way to measure this at the onset

Cezar (Jun 26 2018 at 03:35):

@starseeker @Sean any thoughts on the above? :D i also ran cachegrind on rt inside a debian VM and created a table with the results, sorted by L1d misses

Cezar (Jun 26 2018 at 03:37):

the tree structure is responsible for quite a lot of those misses, so i think it's worth looking into? i remember looking at it a few weeks back and it seemed like the addresses of the nodes were close together in memory

Sean (Jun 26 2018 at 22:52):

If you're type punning, automatic vectorization will not be possible. Anything getting converted to arrays needs to be non-void, non-pointer arrays.

Existing #defines are indirections that are no longer best practice and should be avoided.

In general, I think you jumped on one that is WAY too complicated with struct region... there are tons of considerations, problems, and really hard complexities with most of the core structures. that's why the original suggestion was to simply pick a bu_list somewhere/anywhere, and convert it to an array (or some other container). experience with doing a few of those will greatly help solve the harder problems down the road (and much will be learned along the way). Have you looked at any of the lists?

Sean (Jun 26 2018 at 22:57):

These results are amplified by incoherent pointer chasing. Even doing something as simple as shooting rays in bundles should have a major impact on L1/L2 hits.

Note from an optimization strategy perspective, you want to first look at main memory access stalls, then L3 stalls, then L2, then L1. If you jump straight to L1, you'll end up injecting complexity that will likely get thrown away (completely) when you then look at other stalls (that are orders of magnitude bigger).

Cezar (Jun 28 2018 at 11:41):

ok, i see what you mean. i've found struct temp_file_list inside libbu/temp.c, that looks self-contained. i'll start working on that one

Erik (Feb 29 2020 at 20:18):

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
Benchmark results indicate an approximate VGR performance metric of 134454
Logarithmic VGR metric is 5.13  (natural logarithm is 11.81)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

Sean (Feb 29 2020 at 23:42):

Erik (Feb 29 2020 at 23:50):

Sean (Feb 29 2020 at 23:51):

40? nice! might be able to get a higher number then, but that's still pretty outstanding.

Erik (Feb 29 2020 at 23:53):

I'm sure, I only did =Release, probably lots of other fun flags and minor tweaks to do... mebbe I'll hit it with 'vtune' to learn vtune :D

Sean (Mar 01 2020 at 07:33):

used vtune for the first time a couple years ago, it's nice and intuitive, bit of a learning curve.

Sean (Mar 01 2020 at 07:34):

perf works great in a pinch though, super easy and just as good metrics (just without the nice gui)

Erik (May 16 2020 at 12:05):

Abs **** 3402463.46 1380366.82 1359886.60 1030002.51 1221374.29 1546660.32 1656792.33 Fri May 15 22:28:48 EDT 2020

brand new ryzen 3 running minimal ubuntu.
ryzen 3 3200G 4core 3.6ghz
16gb ddr4 2400
'asus tuff b450-plus' motherboard and a slow assed wd spinning platter
(odd duck, ainnit? it's an anti-gaming desktop that can be turned into a gaming rig, for my oldest. Homework first.)

Stream: brlcad

Topic: performance

Sean (Jan 30 2018 at 02:46):

Sean (Jan 30 2018 at 02:46):

Sean (Feb 12 2018 at 03:33):

Rahul Saxena (Feb 13 2018 at 13:06):

Sean (Feb 13 2018 at 14:08):

Cezar (Feb 13 2018 at 15:01):

Cezar (Feb 13 2018 at 15:03):

Cezar (Feb 13 2018 at 15:04):

Sean (Feb 14 2018 at 05:38):

Sean (Feb 14 2018 at 05:40):

Sean (Feb 14 2018 at 05:46):

Naseef (Feb 20 2018 at 20:11):

Cezar (Mar 22 2018 at 10:59):

Cezar (Mar 22 2018 at 10:59):

Sean (Mar 22 2018 at 14:15):

Cezar (Mar 22 2018 at 17:27):

Cezar (Mar 22 2018 at 17:28):

Cezar (Mar 22 2018 at 17:31):

Sean (Mar 22 2018 at 18:33):

Sean (Mar 22 2018 at 18:33):

Sean (Mar 22 2018 at 18:35):

Sean (Mar 22 2018 at 18:37):

Cezar (Mar 23 2018 at 12:14):

Sean (Mar 24 2018 at 02:34):

Cezar (Mar 24 2018 at 08:29):

Cezar (May 09 2018 at 08:42):

Cezar (May 09 2018 at 08:44):

Cezar (May 09 2018 at 08:46):

Cezar (May 09 2018 at 08:49):

Sean (May 09 2018 at 12:08):

Sean (May 09 2018 at 12:12):

Sean (May 09 2018 at 12:52):

Cezar (May 09 2018 at 17:59):

Cezar (May 09 2018 at 18:00):

Sean (May 09 2018 at 18:06):

Sean (May 09 2018 at 18:07):

Cezar (May 09 2018 at 18:10):

Cezar (May 09 2018 at 18:13):

Cezar (May 09 2018 at 18:14):

Sean (May 09 2018 at 18:23):

Sean (May 09 2018 at 18:24):

Cezar (May 09 2018 at 19:11):

Cezar (May 10 2018 at 11:12):

Cezar (May 10 2018 at 11:16):

Cezar (May 10 2018 at 12:34):

Cezar (May 10 2018 at 12:38):

Cezar (May 10 2018 at 12:40):

Cezar (May 10 2018 at 12:49):

Cezar (May 10 2018 at 12:53):

Cezar (May 10 2018 at 22:44):

Cezar (May 10 2018 at 22:44):

Cezar (May 10 2018 at 22:49):

Sean (May 16 2018 at 13:22):

Cezar (May 18 2018 at 06:38):

Cezar (May 18 2018 at 07:04):

Cezar (May 18 2018 at 07:17):

Cezar (May 18 2018 at 08:54):

Cezar (May 18 2018 at 09:20):

Cezar (May 18 2018 at 09:30):

Cezar (May 18 2018 at 09:56):

starseeker (May 18 2018 at 11:22):

Cezar (May 18 2018 at 12:58):

Sean (May 19 2018 at 03:39):

Sean (May 19 2018 at 03:41):

Sean (May 19 2018 at 03:53):

Cezar (May 20 2018 at 11:10):

Cezar (May 20 2018 at 17:13):

Cezar (May 20 2018 at 17:19):

Cezar (May 20 2018 at 17:36):

Cezar (May 20 2018 at 17:38):

Cezar (May 21 2018 at 21:40):

Cezar (May 21 2018 at 21:43):

starseeker (May 22 2018 at 03:36):

starseeker (May 22 2018 at 03:37):

Cezar (May 22 2018 at 11:19):

Cezar (May 22 2018 at 11:20):

Cezar (May 22 2018 at 15:42):

Cezar (May 22 2018 at 15:48):