| 03:27.41 | Maloeran | That's strange, what kind of threading problem was it? |
| 04:11.14 | Twingy | play yoot's tower? |
| 04:30.33 | Maloeran | Not yet, some AD&D game consumed most of the day, but I will soon |
| 04:30.57 | Twingy | heh |
| 04:31.29 | Twingy | yoot's seems to be a bit more challenging than the previous one, the locality of offices seems to be more sensitive to traffic and noise |
| 04:32.24 | Maloeran | Excellent, Sim Tower was way too easy |
| 04:32.45 | Twingy | yah |
| 04:32.52 | Twingy | easy and loooong |
| 04:33.01 | Twingy | took forever to finish when I was playing it on my P133 |
| 04:33.06 | Twingy | back in '95 |
| 04:33.25 | Maloeran | Ahah. It didn't take too long on the dual-mp, time was flying way too fast even in non-fast mode |
| 04:33.31 | Maloeran | Faster than I could spend anyway |
| 04:33.45 | Twingy | heh, that certainly makes it quick |
| 05:42.29 | *** join/#brlcad Erroneous (n=DTRemena@DHCP-170-143.caltech.edu) | |
| 05:44.24 | *** join/#brlcad Erroneous (n=DTRemena@DHCP-170-143.caltech.edu) | |
| 05:44.28 | Twingy | o.O |
| 14:21.25 | *** join/#brlcad docelic (n=docelic@ri01-150.dialin.iskon.hr) | |
| 16:16.08 | silvap_ | yoot's tower came b4 sim tower, no? |
| 18:30.25 | Twingy | after |
| 18:40.14 | Twingy | wow, I've got a little test app here that shows how sucky performance is when multiple threads read from the same buffer |
| 18:40.53 | Maloeran | Of course, the processors keep trashing each other's cache |
| 18:40.59 | Twingy | yah :| |
| 18:41.10 | Twingy | but this is a 64 byte array |
| 18:41.17 | Twingy | it should always live in cache o.O |
| 18:42.30 | Twingy | that explains why my ray-tracer with 4 threads performs at half the speed of 4 separate instances |
| 18:42.49 | Twingy | I'm getting same results on linux,bsd (xeon,opteron) |
| 18:42.51 | Maloeran | If it were up to the programmer to deal with cache coherency, we really could scale motherboards to large numbers of processors |
| 18:43.02 | Maloeran | Ah I see |
| 18:43.03 | Twingy | that's not the problem |
| 18:43.09 | Twingy | it's the memory |
| 18:43.20 | Twingy | my stryker model sucks up 1.3GB of memory when fully loaded |
| 18:43.40 | Twingy | and unless I have 2GB of ram per cpu on the mobo a bajillion processors doesn't make it any better |
| 18:44.05 | Twingy | which is why I got this opteron machine with 8GB of ram per box |
| 18:44.12 | Twingy | so I could have (4) 2GB instances going |
| 18:44.34 | Twingy | which is why I'm seeing performance soar |
| 18:45.00 | Twingy | I mean this one opteron machine with 4 instances is outperforming the entire cluster cause each node only has 2GB |
| 18:45.16 | Maloeran | Ahah |
| 18:45.40 | Twingy | now if each node had 4GB of ram (prohibitively expensive at the time, each node has 6 ram slots) |
| 18:45.50 | Twingy | then I'd be seeing twice the performance out of each one |
| 18:46.11 | Twingy | based on these findings I'm half tempted to rip all of the threading code out |
| 18:46.22 | Maloeran | I don't see the problem with multiple threads|processors sharing the same banks, although Opterons do have per-processor banks with faster access |
| 18:46.26 | Twingy | cause my application is omfg fast and cache-coherent |
| 18:46.33 | Twingy | it's hardly benefiting from threads at all |
| 18:46.49 | Twingy | are you getting linear speed-ups with threading? |
| 18:46.55 | Maloeran | You have to guarantee processors won't be stepping on each other's feet, regarding reads and writes to their caches |
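The usual write-side hazard being alluded to here is false sharing: logically separate per-thread data that happens to land in the same 64-byte cache line, so every write by one core invalidates that line in the other cores. A minimal pthreads sketch of the effect, with hypothetical names, not taken from either ray-tracer:

    /* False-sharing sketch: four threads each bump their own counter.
     * In the "packed" layout all counters share one cache line, so every
     * write invalidates the line in the other cores and scaling collapses;
     * padding each slot to its own 64-byte line restores near-linear scaling. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    100000000UL

    /* packed: all four counters live in the same cache line */
    static unsigned long packed[NTHREADS];

    /* padded: one counter per 64-byte cache line */
    static struct { unsigned long v; char pad[64 - sizeof(unsigned long)]; } padded[NTHREADS];

    static void *work_packed(void *arg)
    {
        long id = (long)arg;
        for (unsigned long i = 0; i < ITERS; i++)
            packed[id]++;              /* write traffic on a shared line */
        return NULL;
    }

    static void *work_padded(void *arg)
    {
        long id = (long)arg;
        for (unsigned long i = 0; i < ITERS; i++)
            padded[id].v++;            /* each thread owns its own line */
        return NULL;
    }

    int main(int argc, char **argv)
    {
        (void)argv;
        /* run the packed variant by default, the padded one if any argument is given */
        void *(*fn)(void *) = (argc > 1) ? work_padded : work_packed;
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, fn, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("%lu\n", packed[0] + padded[0].v);
        return 0;
    }

Timing the two variants (e.g. with time(1)) shows the padded run scaling with thread count while the packed run barely moves, which is the kind of collapse being described above.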
| 18:47.18 | Maloeran | I... have no idea, I haven't even tested on the dual-mp |
| 18:47.22 | Twingy | ah |
| 18:47.36 | Twingy | you got a demo with dynamic number of threads I can specify with no graphics output handy? |
| 18:47.41 | Twingy | just the text? |
| 18:47.57 | Maloeran | Hum, give me 2 minutes |
| 18:47.59 | Twingy | btw, when you ship another demo |
| 18:48.10 | Twingy | make the fps output a: |
| 18:48.18 | Twingy | printf("fps: %d\r", fps); |
| 18:48.20 | Twingy | fflush(stdout); |
| 18:48.24 | Maloeran | I'm deep into rewriting large chunks of the scene prep, it might take a bit |
| 18:48.32 | Twingy | that's what cvs is for :) |
| 18:48.40 | Twingy | so you can checkout an older stable version |
| 18:48.40 | Maloeran | So noted, fbsd doesn't naturally flush on \n |
| 18:49.05 | Twingy | use cvs damnit :) |
| 18:49.42 | Twingy | ugh |
| 18:49.43 | Maloeran | Yes, maybe I should :) |
| 18:50.05 | Maloeran | It's for historical purposes, back to the castle days and our 300k rays per second |
| 18:50.20 | Twingy | you should shove those into cvs in sequential order |
| 18:51.58 | Twingy | I'm boggled it doesn't shove this 64 byte array in cache |
| 18:52.05 | Twingy | It's got a friggen 1MB of cache |
| 18:55.18 | Twingy | heh |
| 18:55.30 | Twingy | I'm causing system time to spike and user/wall time just plop along |
| 18:56.39 | Maloeran | I quickly modified the latest backup to use the first argument as number of threads |
| 18:56.45 | Twingy | k |
| 18:56.56 | Twingy | I'll benchmark on my opteron |
| 18:57.06 | Twingy | if I have linux abi going |
| 18:57.12 | Maloeran | Running on 64 bits of course? |
| 18:57.17 | Twingy | let me see |
| 18:57.28 | Twingy | yah |
| 18:57.35 | Twingy | *should* work |
| 18:57.57 | Maloeran | All right, that's an ELF 64-bit LSB executable, AMD x86-64 |
| 18:59.05 | Twingy | ELF binary type "0" not known. |
| 18:59.20 | Twingy | *shrug* |
| 18:59.29 | Twingy | guess it's not bsd friendly |
| 18:59.53 | Twingy | for me here's how I do on threading |
| 19:00.11 | Twingy | compared to 100% performance on four separate instances |
| 19:00.30 | Twingy | 1: 25%, 2: 15%, 3: 10%, 4: 5% |
| 19:00.34 | Twingy | in terms of performance gains |
| 19:00.52 | Twingy | so the fully threaded version is 50 - 55% of 4 separate instances |
| 19:00.58 | Twingy | and the 4th thread buys you almost nothing |
| 19:01.12 | Twingy | so you really need 4 cpus |
| 19:01.30 | Twingy | cause you'll probly see a nearly 2x performance gain with the 2-threaded version |
| 19:01.38 | Twingy | so it won't tell you much |
| 19:01.40 | Maloeran | That's for dual-xeons? |
| 19:01.43 | Twingy | opteron |
| 19:01.49 | Twingy | 2core dual cpu |
| 19:01.53 | Maloeran | Ah yes, dual core |
| 19:01.58 | Twingy | and xeon is slightly worse |
| 19:02.04 | Twingy | but more or less the same |
| 19:02.26 | Maloeran | Do each of your threads work on fairly distinct chunks of memory? |
| 19:02.43 | Twingy | they get a *chunk* of pixels to work on that are coherent |
| 19:02.44 | Maloeran | Especially the write operations, they must not trash the other processors' caches |
| 19:02.51 | Twingy | the writes are to distinct memory |
| 19:03.04 | Twingy | the reads are probly from common memory most of the time, (the triangle data) |
| 19:03.05 | Maloeran | No problem there, I suppose the chunks are big enough |
| 19:03.08 | Twingy | cause the rays are coherent |
| 19:03.24 | Twingy | so for 128x128 pixels |
| 19:03.33 | Twingy | 4 threads will take on 128 pixel scan lines for now |
| 19:03.39 | Twingy | 128x128 goes across the network |
| 19:03.48 | Twingy | and 4 threads gobble up a 128 pixel scanline at a time |
| 19:03.54 | Twingy | (all configurable in the environment file for the project) |
| 19:03.59 | Maloeran | Could it be a problem of synchronisation, mutexes and network I/O ? |
| 19:04.02 | Twingy | this is an example, nothing is hard-wired |
| 19:04.22 | Twingy | I don't think so, cause I got a demo here that doesn't do any network i/o |
| 19:04.36 | Maloeran | I suppose the count of mutexes is very low too |
| 19:04.39 | Twingy | and the only thing a mutex is doing is locking to increment a variable (i++) then unlocking |
| 19:04.52 | Twingy | and after each mutex it does millions of computations |
| 19:05.00 | Twingy | so mutex overhead should be negligible |
| 19:05.05 | Twingy | yes |
| 19:05.11 | Twingy | mutex locking is in the 10's or 100's |
| 19:05.16 | Twingy | computations in the loop are in the millions |
| 19:05.18 | Maloeran | Is this variable shared by all threads? How often does it increment? |
| 19:05.26 | Maloeran | Ah. |
| 19:05.27 | Twingy | 10 to 100 times |
| 19:05.34 | Twingy | it represents work units |
| 19:05.40 | Twingy | work unit index rather |
| 19:05.45 | Twingy | I am working on work unit 'i' |
| 19:06.27 | Twingy | and the 'work' is a float[16] array, where it does a random read from it (as happens in triangle ray-tracing from a big triangle array) and does math on the value at float[...] |
| 19:06.29 | Twingy | now |
| 19:06.44 | Twingy | if I put a = array[rand....] in the loop |
| 19:06.55 | Twingy | where it gets done millions of times, performance is horrible nasty omfg slow |
| 19:07.04 | Twingy | if I put it before the loop so it only gets called a few hundred times |
| 19:07.11 | Twingy | then performance is parallel |
| 19:07.16 | Twingy | more threads = more cpu |
| 19:07.20 | Twingy | which is why for ray-tracing |
| 19:07.24 | Twingy | it's really a mix of both |
| 19:07.38 | Twingy | which is why we get 'some' performance, but it ramps down rather quickly |
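A rough reconstruction of the experiment just described, to make its shape concrete; the names, sizes, and arithmetic are placeholders rather than the actual demo code:

    /* Threads pull a work-unit index from a mutex-guarded counter (only tens
     * to hundreds of lock/unlock pairs in total) and then do millions of float
     * operations per unit. Moving the random array read from once per unit
     * (variant A) to inside the hot loop (variant B) is what reportedly kills
     * the scaling. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define UNITS    100
    #define INNER    1000000

    static float data[16] = { 1.0f };            /* stand-in for shared triangle data */
    static int next_unit = 0;
    static pthread_mutex_t unit_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        unsigned seed = (unsigned)(long)arg;
        float acc = 0.0f;

        for (;;) {
            pthread_mutex_lock(&unit_lock);
            int i = next_unit++;                 /* i is the work-unit index */
            pthread_mutex_unlock(&unit_lock);
            if (i >= UNITS)
                break;

            float a = data[rand_r(&seed) % 16];  /* variant A: one read per unit */
            for (int n = 0; n < INNER; n++) {
                /* variant B: a = data[rand_r(&seed) % 16]; here instead */
                acc = acc * 1.000001f + a;
            }
        }

        printf("thread done, acc=%f\n", acc);    /* keep acc live so it isn't optimized out */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }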
| 19:08.10 | Maloeran | Yes I get the picture |
| 19:08.26 | Twingy | so I'm curious if it's just me |
| 19:08.36 | Twingy | or if my conclusions are similar in your case |
| 19:08.51 | Maloeran | Would you be able to run a 32-bit linux binary? |
| 19:08.55 | Twingy | I dunno |
| 19:08.56 | Twingy | we could try |
| 19:09.01 | Twingy | I don't think the 'speed' matters |
| 19:09.06 | Twingy | just the threaded relationship |
| 19:09.25 | Twingy | in regards to performance |
| 19:09.58 | Maloeran | Right |
| 19:10.27 | Maloeran | I have been glancing over your code, I don't really see a potential problem, yet |
| 19:10.45 | Twingy | ah, in the engine? |
| 19:10.51 | Twingy | the camera in libutil/camera.c |
| 19:10.54 | Twingy | is the threading |
| 19:11.02 | Twingy | the engine tie.c is the intersection code |
| 19:11.12 | Maloeran | I know ;) |
| 19:11.14 | Twingy | k :) |
| 19:11.20 | Twingy | when is the last time you updated? |
| 19:11.25 | Twingy | I have been making changes left and right |
| 19:11.34 | Maloeran | It might be up to a week old |
| 19:11.37 | Twingy | k |
| 19:12.07 | Maloeran | Did you try the 32 bits binary? |
| 19:13.04 | Twingy | any way to remove sdl quickly? |
| 19:13.14 | Twingy | I don't think I have sdl installed for linux on freebsd |
| 19:13.17 | Maloeran | Oops. |
| 19:13.24 | Maloeran | A moment. |
| 19:13.28 | Twingy | k |
| 19:13.39 | Twingy | looks like it should otherwise run though |
| 19:17.45 | Maloeran | All right, download from same address |
| 19:19.07 | Twingy | your printf still pours out pages and pages of numbers |
| 19:19.08 | Maloeran | and I forgot the fflush() :) |
| 19:19.12 | Twingy | heh |
| 19:19.43 | Maloeran | You can set the number of threads by the first argument |
| 19:19.49 | Twingy | yep |
| 19:20.20 | Twingy | can you do the fflush thing? |
| 19:20.35 | Twingy | don't think my terminal is keeping up to speed |
| 19:21.07 | Maloeran | Okay... Give me a moment to re-do all the SDL removal :} |
| 19:21.16 | Twingy | you just delete it? |
| 19:21.39 | Twingy | this is the part where you would've tagged a branch in cvs for me ;) |
| 19:24.14 | Maloeran | You know... There is already a fflush( stdout ); in there |
| 19:24.31 | Twingy | using '\r' as the terminator? |
| 19:24.44 | Twingy | '\n' can't be there |
| 19:24.53 | Maloeran | Oh? |
| 19:25.05 | Twingy | '\n' means new line |
| 19:25.08 | Twingy | we don't want a new line |
| 19:26.30 | Maloeran | Fine, download again |
| 19:26.35 | Twingy | k :) |
| 19:27.10 | Twingy | much better |
| 19:27.32 | Twingy | 1: 14 fps |
| 19:28.00 | Twingy | 2: 27 fps |
| 19:28.29 | Twingy | 3: 37 fps |
| 19:28.54 | Twingy | 4: 43 fps |
| 19:29.26 | Maloeran | Four processors to run 3 times faster than one, the loss is reasonable |
| 19:29.38 | Twingy | yah, it's pretty good I'd say |
| 19:29.46 | Twingy | how are you threading? |
| 19:29.48 | Twingy | per scanline? |
| 19:29.57 | Maloeran | Per block of 32x32 pixels in what I sent you |
| 19:30.09 | Twingy | each thread works on a block? |
| 19:30.19 | Maloeran | Right, and each thread fetches the next pending block from the queue once it's done |
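Sketched out, the block scheme amounts to a shared tile counter that each worker advances; hypothetical names, not the actual engine code:

    /* The 128x128 image is cut into 32x32 tiles and each thread keeps pulling
     * the next pending tile index from a shared atomic counter until none
     * remain. Per-thread writes land in disjoint tiles of the framebuffer, so
     * the threads rarely touch the same cache lines. */
    #include <pthread.h>
    #include <stdatomic.h>

    #define TILE       32
    #define IMG_W      128
    #define IMG_H      128
    #define TILES_X    (IMG_W / TILE)
    #define TILES_Y    (IMG_H / TILE)
    #define TILE_COUNT (TILES_X * TILES_Y)

    static unsigned char fb[IMG_H][IMG_W][3];    /* 128x128x3 output buffer */
    static atomic_int next_tile = 0;

    /* Stand-in for the real per-pixel ray trace and shade. */
    static void render_pixel(int x, int y)
    {
        fb[y][x][0] = fb[y][x][1] = fb[y][x][2] = (unsigned char)(x ^ y);
    }

    static void *render_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            int t = atomic_fetch_add(&next_tile, 1);   /* grab the next pending block */
            if (t >= TILE_COUNT)
                break;
            int x0 = (t % TILES_X) * TILE;
            int y0 = (t / TILES_X) * TILE;
            for (int y = y0; y < y0 + TILE; y++)
                for (int x = x0; x < x0 + TILE; x++)
                    render_pixel(x, y);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, render_thread, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }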
| 19:30.29 | Twingy | k, that was my next step |
| 19:30.39 | Twingy | for threading |
| 19:30.47 | Twingy | I think you want as little coherency as possible between threads |
| 19:31.04 | Twingy | otherwise you have threads doing reads from the same memory segment it seems |
| 19:31.08 | Twingy | but that *should* be in cache |
| 19:31.14 | Twingy | but |
| 19:31.21 | Maloeran | Reads from the same segments aren't generally a problem, not a big one anyway |
| 19:31.22 | Twingy | a thread has no idea if that data is current |
| 19:31.27 | Twingy | so I think in general it has to do a fresh read |
| 19:31.35 | Maloeran | It doesn't trash the cache, it might just be a bit of a bottleneck on a memory bank |
| 19:31.45 | Twingy | right, but do you agree with my statement? |
| 19:32.02 | Maloeran | I don't see what you mean by "data being current" |
| 19:32.14 | Twingy | thread 1 pulls from array[1234] at time interval X |
| 19:32.21 | Twingy | thread 2 pulls from array[1234] at time interval Y |
| 19:32.36 | Twingy | how does thread 2 know that array[1234] at time interval Y read the same data as interval X |
| 19:32.48 | Twingy | It doesn't... so I think it cannot rely on cache |
| 19:33.03 | Maloeran | If no other processor has modified the data, it will remain in the processor's cache |
| 19:33.26 | Twingy | okie |
| 19:33.45 | Twingy | welp, let me try postage stamping the threaded work units like I'm already doing for the distributed compute nodes |
| 19:34.08 | Twingy | (I've done this before in previous experiments) |
| 19:34.25 | Twingy | but I never tested whether it improves the linearity of thread scaling |
| 19:34.43 | Maloeran | Mmhm, good luck. Verify your writes especially, I suspect something is trashing the caches somewhere |
| 19:35.04 | Twingy | possibly |
| 19:35.26 | Twingy | I generate a ray, I fire the ray, the ray gets intersected against triangles, the result goes to the render method, render method shades a pixel, pixel gets shoved into a 128x128x3 buffer |
| 19:35.36 | Maloeran | Where do threads pull their ray vectors from? |
| 19:35.47 | Twingy | each thread has an instance of a ray object |
| 19:38.13 | Twingy | according to your single threaded performance, using your SSE stuff and optical bundle tricks you seem to be 7x faster than mine |
| 19:39.13 | Twingy | for non-coherent path tracing stuff I think you're 1.5 - 2.0x faster |
| 19:39.26 | Maloeran | :) Not too bad, but the new scene analysis/preparation is still in the workshop |
| 19:39.30 | Twingy | but you don't have rays all the way through or double-sided normals on |
| 19:39.34 | Twingy | k |
| 19:39.58 | Twingy | I've got a potential 15% performance improvement for optical rendering only, which I might drop in tonight |
| 19:40.24 | Twingy | that should inch me towards 2 mil/sec |
| 19:40.40 | Twingy | for certain views, but on average about 1.5 mil/sec I think |
| 19:40.53 | Maloeran | I went to implement rays going through geometry with callbacks a few days ago, but it looked really messy to implement in the SSE path working in bundles |
| 19:41.14 | Twingy | yes |
| 19:41.21 | Twingy | now you see why I haven't done it yet |
| 19:41.30 | Twingy | or why no one else has |
| 19:41.38 | Maloeran | I may well just support that for single rays, which is trivial, for now anyway |
| 19:41.42 | Twingy | yah |
| 19:41.47 | Twingy | btw |
| 19:41.52 | Twingy | I'm thinking about removing callbacks |
| 19:42.02 | Twingy | and just putting the intersection stack into the ray itself |
| 19:42.13 | Twingy | so if you fire a ray again with a used stack |
| 19:42.18 | Twingy | it'll just resume using the stack |
| 19:42.30 | Twingy | removes an extra function call for optical rendering |
| 19:42.34 | Maloeran | Will this ray be able to change direction? |
| 19:42.47 | Twingy | huh? of course you fire a new ray |
| 19:43.07 | Twingy | a changed direction means generating a new ray for me |
| 19:43.27 | Maloeran | So you could re-use the intersection point of a ray, as a source for new rays, exploiting the locality of rays |
| 19:43.44 | Twingy | not with a tree |
| 19:43.59 | Twingy | with a graph, sure |
| 19:44.22 | Twingy | when I feel I've completely run out of tricks |
| 19:44.27 | Twingy | I might try the graph approach |
| 19:44.39 | Twingy | brb, gonna make a quick lunch |
| 19:44.44 | Maloeran | I think a shortcut would be possible, but in any case... yes, I believe a graph is much more appropriate :) |
| 19:44.53 | Twingy | for path tracing, definitely |
| 19:45.17 | Twingy | but *ahem* lee wanted me to implement wald's paper *ahem* |
| 19:45.27 | Twingy | so I have to roll with that for now |
| 19:45.35 | Maloeran | For any ray-tracing needs really, since a graph can both exploit coherency and locality |
| 19:45.47 | Twingy | jup |
| 19:45.49 | Maloeran | *nods* So I understood... Lee isn't here, is he? :p |
| 19:45.55 | Twingy | he's polyspin |
| 19:46.08 | Twingy | brb, gonna make some lunch |
| 20:04.26 | Twingy | back |
| 20:05.58 | Maloeran | If you are going to try a graph based technique, I still think a tetrahedron graph could be promising |
| 20:06.13 | Maloeran | No sectors, no nodes, no triangles, a single primitive: tetrahedrons |
| 20:06.30 | Twingy | maybe we can work on this together... |
| 20:07.23 | Maloeran | *nods* I still have a fairly long to-do list on my current design |
| 20:07.45 | Twingy | I'm sure this will be some ongoing research we'll be involved with for years to come |
| 20:08.07 | Twingy | we'll probly be 30 or so before this winds down |
| 20:09.23 | Maloeran | Eheh, I wouldn't mind :), I haven't found a field that leads to such interesting, mind-boggling problems just waiting to be solved |
| 20:09.46 | Maloeran | another field, that is |
| 20:10.46 | Twingy | the only other thing I found nearly this interesting was when I was working on my coil gun |
| 20:11.18 | Twingy | it was easy to see how years of research could be spent optimizing the efficiency of the energy transfer |
| 20:12.15 | Maloeran | Eh I suppose, how did this project work out? |
| 20:12.42 | Twingy | I was able to get my single stage coil gun up to an efficiency of about 2% using off-the-shelf power supply capacitors and hand-wound litz wire |
| 20:12.54 | Twingy | max efficiency I've ever seen someone get out of a single stage is 10% |
| 20:13.03 | Twingy | in research labs etc |
| 20:13.03 | Maloeran | Hum :), I see |
| 20:13.14 | Maloeran | I have been thinking sporadically about AI since my early genetic experiments too; that's something I want to get back into, and I have my ideas for a lower-level approach to the problem |
| 20:13.32 | Maloeran | Much closer to the fundamental definition of intelligence, nothing like faking it at high-level with neural networks |
| 20:16.59 | Maloeran | Any more clues on the threading performance issue? |
| 20:17.09 | Twingy | I'm still playing with my demo |
| 20:17.15 | Twingy | recreating a base-model of what's going on |
| 20:17.28 | Twingy | and I'm seeing reasonable performance from it |
| 20:17.30 | Twingy | near-linear |
| 20:17.44 | Twingy | 16384 random accesses to a 16MB array distributed among 4 threads |
| 20:17.52 | Twingy | doing a billion iteration computation loop |
| 20:18.13 | Twingy | each work unit is a random access |
| 20:18.16 | Maloeran | Read operations can't be the source of such a performance bottleneck |
| 20:18.23 | Twingy | indeed |
| 20:23.38 | Twingy | one thing you have to worry about with opterons is the HyperTransport too |
| 20:23.59 | Twingy | it's a big bottleneck |
| 20:27.11 | silvap_ | i swear i saw a reference to yoot saito's "yoot's tower" in the sim tower splash screen |
| 20:30.59 | silvap_ | Maloeran, what sorta AI stuff have u done |
| 20:32.14 | Maloeran | Long ago, genetic algorithms to build up an AI for a primitive strategy game of some sort |
| 20:32.40 | Maloeran | It played for months against itself on a P3, results were disappointing (the learning pace was below my expectations) |
| 20:33.29 | silvap_ | hehe cool |
| 20:34.13 | silvap_ | was it coupled with an ANN or was the state space too large? |
| 20:35.00 | Maloeran | It wasn't, it really was a straight genetic algorithm |
| 20:35.21 | silvap_ | ah |
| 20:35.35 | Twingy | i.e brute force :) |
| 20:35.47 | Maloeran | On a Pentium 3! :) |
| 20:36.01 | Twingy | meh, that's not that long ago |
| 20:36.09 | Twingy | now if it were a P133 ... |
| 20:37.10 | Maloeran | Pentium 3 733mhz, running every night on my father's computer, it was high end back then |
| 20:52.48 | Twingy | getting closer to the problem |
| 20:53.09 | Twingy | just replaced my ray intersection with a loop that does 1024 floating multiplies |
| 20:53.26 | Twingy | 100% 75% 25% 19% efficiency |
| 20:53.35 | Twingy | so it's definitely not the engine |
| 20:53.46 | Twingy | so I think I know what to fix now |
| 20:56.01 | Maloeran | Your engine's code did seem quite fine in this respect. Could it be the callbacks? |
| 21:06.02 | Twingy | no, I think it's the semaphore usage in my camera |
| 21:06.17 | Twingy | I'm going running now, so I'll investigate in an hour |
| 21:07.19 | Twingy | and indeed it was |
| 21:07.50 | Twingy | I must've been pinching a nerve or something on the soft one |
| 21:10.47 | Maloeran | Eh, good |
| 21:58.54 | *** join/#brlcad ``Erik (i=erik@pcp0011474399pcs.chrchv01.md.comcast.net) | |