irclog2html for #brlcad on 20050911

03:27.41 Maloeran That's strange, what kind of threading problem was it?
04:11.14 Twingy play yoot's tower?
04:30.33 Maloeran Not yet still, some AD&D game consumed most of the day, I soon will
04:30.57 Twingy heh
04:31.29 Twingy yoot's seems to be a bit more challenging than the previous one, the locality of offices seems to be more sensitive to traffic and noise
04:32.24 Maloeran Excellent, Sim Tower was way too easy
04:32.45 Twingy yah
04:32.52 Twingy easy and loooong
04:33.01 Twingy took forever to finish when I was playing it on my P133
04:33.06 Twingy back in '95
04:33.25 Maloeran Ahah. It didn't take too long on the dual-mp, time was flying way too fast even in non-fast mode
04:33.31 Maloeran Faster than I could spend anyway
04:33.45 Twingy heh, that certainly makes it quick
05:42.29 *** join/#brlcad Erroneous (n=DTRemena@DHCP-170-143.caltech.edu)
05:44.24 *** join/#brlcad Erroneous (n=DTRemena@DHCP-170-143.caltech.edu)
05:44.28 Twingy o.O
14:21.25 *** join/#brlcad docelic (n=docelic@ri01-150.dialin.iskon.hr)
16:16.08 silvap_ yoot's tower came b4 sim tower, no?
18:30.25 Twingy after
18:40.14 Twingy wow, I've got a little test app here that shows how sucky performance is when multiple threads are reading from the same buffer
18:40.53 Maloeran Of course, the processors keep trashing each other's cache
18:40.59 Twingy yah :|
18:41.10 Twingy but this is a 64 byte array
18:41.17 Twingy it should always live in cache o.O
18:42.30 Twingy that explains why my ray-tracer with 4 threads performs at half the speed of 4 separate instances
18:42.49 Twingy I'm getting same results on linux,bsd (xeon,opteron)
18:42.51 Maloeran If it was up to the programmer to deal with coherency of the caches, we really could scale motherboards to large numbers of processors
18:43.02 Maloeran Ah I see
18:43.03 Twingy that's not the problem
18:43.09 Twingy it's the memory
18:43.20 Twingy my stryker model sucks up 1.3GB of memory when fully loaded
18:43.40 Twingy and unless I have 2GB of ram per cpu on the mobo a bajillion processors doesn't make it any better
18:44.05 Twingy which is why I got this opteron machine with 8GB of ram per box
18:44.12 Twingy so I could have (4) 2GB instances going
18:44.34 Twingy which is why I'm seeing performance soar
18:45.00 Twingy I mean this one opteron machine with 4 instances is out performing the entire cluster cause each node only has 2GB
18:45.16 Maloeran Ahah
18:45.40 Twingy now if each node had 4GB of ram (prohibitively expensive at the time, each node has 6 ram slots)
18:45.50 Twingy then I'd be seeing twice the performance out of each one
18:46.11 Twingy based on these findings I'm half tempted to rip all of the threading code out
18:46.22 Maloeran I don't see the problem with multiple threads|processors sharing the same banks, although Opterons do have per-processor banks with faster access
18:46.26 Twingy cause my application is omfg fast and cache-coherent
18:46.33 Twingy it's not benefiting from threads hardly at all
18:46.49 Twingy are you getting linear speed-ups with threading?
18:46.55 Maloeran You have to guarantee processors won't be stepping on each other's feet, regarding reads and writes to their cache
18:47.18 Maloeran I... have no idea, I haven't even tested on the dual-mp
18:47.22 Twingy ah
18:47.36 Twingy you got a demo with dynamic number of threads I can specify with no graphics output handy?
18:47.41 Twingy just the text?
18:47.57 Maloeran Hum, give me 2 minutes
18:47.59 Twingy btw, when you ship another demo
18:48.10 Twingy make the fps output a:
18:48.18 Twingy printf("fps: %d\r", fps);
18:48.20 Twingy fflush(stdout);
18:48.24 Maloeran I'm deep into rewriting large chunks of the scene prep, it might take a bit
18:48.32 Twingy that's what cvs is for :)
18:48.40 Twingy so you can checkout an older stable version
18:48.40 Maloeran So noted, fbsd doesn't naturally flush on \n
18:49.05 Twingy use cvs damnit :)
18:49.42 Twingy ugh
18:49.43 Maloeran Yes, maybe I should :)
18:50.05 Maloeran It's for historical purposes, back to the castle days and our 300k rays per second
18:50.20 Twingy you should shove those into cvs in sequential order
18:51.58 Twingy I'm boggled it doesn't shove this 64 byte array in cache
18:52.05 Twingy It's got a friggen 1MB of cache
18:55.18 Twingy heh
18:55.30 Twingy I'm causing system time to spike and user/wall time just plop along
18:56.39 Maloeran I quickly modified the latest backup to use the first argument as number of threads
18:56.45 Twingy k
18:56.56 Twingy I'll benchmark on my opteron
18:57.06 Twingy if I have linux abi going
18:57.12 Maloeran Running on 64 bits of course?
18:57.17 Twingy let me see
18:57.28 Twingy yah
18:57.35 Twingy *should* work
18:57.57 Maloeran All right, that's an ELF 64-bit LSB executable, AMD x86-64
18:59.05 Twingy ELF binary type "0" not known.
18:59.20 Twingy *shrug*
18:59.29 Twingy guess it's not bsd friendly
18:59.53 Twingy for me here's how I do on threading
19:00.11 Twingy compared to 100% performance on four separate instances
19:00.30 Twingy 1: 25%, 2: 15%, 3: 10%, 4: 5%
19:00.34 Twingy in terms of performance gains
19:00.52 Twingy so the 4 threaded version is 50 - 55% of 4 separate instances
19:00.58 Twingy and the 4th thread buys you almost nothing
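
One reading of the figures above is that each percentage is the gain added by that thread, measured against four separate instances as 100%. Under that assumption the gains are additive: three threads reach 50% and the fourth brings the total to 55%, which matches the quoted 50 - 55% range. A minimal sketch of that arithmetic (the function name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Sum per-thread gains, each given as a percentage of the throughput of
 * four separate instances. Treating the gains as additive is an
 * assumption about how the numbers in the log were meant. */
int cumulative_gain(const int *gains, size_t count)
{
    int total = 0;

    assert(gains != NULL);
    for (size_t i = 0; i < count; i++)
        total += gains[i];
    return total;
}
```

Called with the gains {25, 15, 10, 5}, it returns 50 for the first three threads and 55 for all four.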
19:01.12 Twingy so you really need 4 cpus
19:01.30 Twingy cause you'll probly see a nearly 2x performance gain with 2 threaded version
19:01.38 Twingy so it won't tell you much
19:01.40 Maloeran That's for dual-xeons?
19:01.43 Twingy opteron
19:01.49 Twingy 2core dual cpu
19:01.53 Maloeran Ah yes, dual core
19:01.58 Twingy and xeon is slightly worse
19:02.04 Twingy but more or less the same
19:02.26 Maloeran Do each of your threads work on fairly distinct chunks of memory?
19:02.43 Twingy they get a *chunk* of pixels to work on that are coherent
19:02.44 Maloeran Especially the write operations, they must not trash the other processors' caches
19:02.51 Twingy the writes are to distinct memory
19:03.04 Twingy the reads are probly from common memory most of the time, (the triangle data)
19:03.05 Maloeran No problem there, I suppose the chunks are big enough
19:03.08 Twingy cause the rays are coherent
19:03.24 Twingy so for 128x128 pixels
19:03.33 Twingy 4 threads will take on 128 pixel scan lines for now
19:03.39 Twingy 128x128 goes across the network
19:03.48 Twingy and 4 threads gobble up a 128 pixel scanline at a time
19:03.54 Twingy (all configurable in the environment file for the project)
19:03.59 Maloeran Could it be a problem of synchronisation, mutexes and network I/O ?
19:04.02 Twingy this is an example, nothing is hard-wired
19:04.22 Twingy I don't think so, cause I got a demo here that doesn't do any network i/o
19:04.36 Maloeran I suppose the count of mutexes is very low too
19:04.39 Twingy and the only thing a mutex is doing is locking to increment a variable (i++) then unlocking
19:04.52 Twingy and after each mutex it does millions of computations
19:05.00 Twingy so mutex should be negligible
19:05.05 Twingy yes
19:05.11 Twingy mutex locking is in the 10's or 100's
19:05.16 Twingy computations in the loop are in the millions
19:05.18 Maloeran Is this variable shared by all threads? How often does it increment?
19:05.26 Maloeran Ah.
19:05.27 Twingy 10 to 100 times
19:05.34 Twingy it represents work units
19:05.40 Twingy work unit index rather
19:05.45 Twingy I am working on work unit 'i'
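
A minimal sketch of the scheme described above: a shared work unit index that each thread locks, increments, and unlocks before doing a long computation. It uses POSIX threads; all names (`work_queue`, `claim_unit`, `run_workers`) are hypothetical and the computation itself is left as a placeholder.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16

/* Shared state: the index of the next unclaimed work unit. */
struct work_queue {
    pthread_mutex_t lock;
    int next;   /* next work unit index 'i' to hand out */
    int total;  /* number of work units */
};

struct worker {
    struct work_queue *queue;
    int done;   /* units processed by this thread */
};

/* Lock, take and increment the index, unlock. Returns -1 when no units remain. */
static int claim_unit(struct work_queue *q)
{
    int i;

    pthread_mutex_lock(&q->lock);
    i = (q->next < q->total) ? q->next++ : -1;
    pthread_mutex_unlock(&q->lock);
    return i;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    while (claim_unit(w->queue) >= 0) {
        /* the long-running computation for this unit would go here */
        w->done++;
    }
    return NULL;
}

/* Run nthreads workers over nunits work units; returns the units processed. */
int run_workers(int nthreads, int nunits)
{
    struct work_queue q;
    struct worker workers[MAX_THREADS];
    pthread_t tids[MAX_THREADS];
    int processed = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    pthread_mutex_init(&q.lock, NULL);
    q.next = 0;
    q.total = nunits;

    for (int t = 0; t < nthreads; t++) {
        workers[t].queue = &q;
        workers[t].done = 0;
        int rc = pthread_create(&tids[t], NULL, worker_main, &workers[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tids[t], NULL);
        processed += workers[t].done;
    }
    pthread_mutex_destroy(&q.lock);
    return processed;
}
```

With only tens or hundreds of claims per run, the lock is taken rarely compared with the work done per unit, which is the reasoning given above for the mutex cost being small.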
19:06.27 Twingy and the 'work' is a float[16] array, where it does a random read from it (as happens in triangle ray-tracing from big triangle array) and does math on the value at float[...]
19:06.29 Twingy now
19:06.44 Twingy if I put a = array[rand....] in the loop
19:06.55 Twingy where it gets done millions of times, performance is horrible nasty omfg slow
19:07.04 Twingy if I put it before the loop so it only gets called a few hundred times
19:07.11 Twingy then performance is parallel
19:07.16 Twingy more threads = more cpu
19:07.20 Twingy which is why for ray-tracing
19:07.24 Twingy it's really a mix of both
19:07.38 Twingy which is why we get 'some' performance, but it ramps down rather quickly
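
The two variants of the test loop described above can be sketched as follows. The log does not show the test program or say which random function it calls, so everything here is an assumption. The sketch keeps the generator state in a caller-owned variable, because the C library's `rand()` keeps hidden global state and is not required to be thread-safe, which makes it one thing worth ruling out when a random read inside a hot loop scales badly across threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORK_LEN 16  /* the float[16] array from the log (64 bytes) */

/* xorshift32: a small PRNG whose whole state is one caller-owned word,
 * so each thread can keep its own state. The state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Variant A: a random read on every iteration of the hot loop. */
float sum_read_inside(const float *work, uint32_t seed, size_t iters)
{
    uint32_t state = seed;
    float acc = 0.0f;

    assert(work != NULL && seed != 0);
    for (size_t i = 0; i < iters; i++)
        acc += work[xorshift32(&state) % WORK_LEN] * 0.5f;
    return acc;
}

/* Variant B: one random read hoisted out of the loop. */
float sum_read_hoisted(const float *work, uint32_t seed, size_t iters)
{
    uint32_t state = seed;
    float acc = 0.0f;
    float a;

    assert(work != NULL && seed != 0);
    a = work[xorshift32(&state) % WORK_LEN];
    for (size_t i = 0; i < iters; i++)
        acc += a * 0.5f;
    return acc;
}
```

Timing both variants with one thread and with several, each thread using its own seed, would show whether the slowdown follows the read itself or the way the index is generated.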
19:08.10 Maloeran Yes I get the picture
19:08.26 Twingy so I'm curious if it's just me
19:08.36 Twingy or if my conclusions are similar in your case
19:08.51 Maloeran Would you be able to run a 32 bits linux binary?
19:08.55 Twingy I dunno
19:08.56 Twingy we could try
19:09.01 Twingy I don't think the 'speed' matters
19:09.06 Twingy just the threaded relationship
19:09.25 Twingy in regards to performance
19:09.58 Maloeran Right
19:10.27 Maloeran I have been glancing over your code, I don't really see a potential problem, yet
19:10.45 Twingy ah, in the engine?
19:10.51 Twingy the camera in libutil/camera.c
19:10.54 Twingy is the threading
19:11.02 Twingy the engine tie.c is the intersection code
19:11.12 Maloeran I know ;)
19:11.14 Twingy k :)
19:11.20 Twingy when is the last time you updated?
19:11.25 Twingy I have been making changes left and right
19:11.34 Maloeran It might be up to a week old
19:11.37 Twingy k
19:12.07 Maloeran Did you try the 32 bits binary?
19:13.04 Twingy anyway to remove sdl quickly?
19:13.14 Twingy I don't think I have sdl installed for linux on freebsd
19:13.17 Maloeran Oops.
19:13.24 Maloeran A moment.
19:13.28 Twingy k
19:13.39 Twingy looks like it should otherwise run though
19:17.45 Maloeran All right, download from same address
19:19.07 Twingy your printf still pours out pages and pages of numbers
19:19.08 Maloeran and I forgot the fflush() :)
19:19.12 Twingy heh
19:19.43 Maloeran You can set the number of threads by the first argument
19:19.49 Twingy yep
19:20.20 Twingy can you do the fflush thing?
19:20.35 Twingy don't think my terminal is keeping up to speed
19:21.07 Maloeran Okay... Give me a moment to re-do all the SDL removal :}
19:21.16 Twingy you just delete it?
19:21.39 Twingy this is the part where you would've tagged a branch in cvs for me ;)
19:24.14 Maloeran You know... There is already a fflush( stdout ); in there
19:24.31 Twingy using '\r' as the terminator?
19:24.44 Twingy '\n' can't be there
19:24.53 Maloeran Oh?
19:25.05 Twingy '\n' means new line
19:25.08 Twingy we don't want a new line
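
A small sketch of the status line discussed above, assuming output goes to a terminal: ending the line with '\r' moves the cursor back to the start of the same line, and the explicit `fflush(stdout)` is needed because no newline is written to trigger a line-buffered flush. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Format the status line into buf. '\r' returns the cursor to column 0
 * without advancing to a new line. Returns the length snprintf reports. */
int format_fps_line(char *buf, size_t size, int fps)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "fps: %d\r", fps);
}

/* Overwrite the current terminal line with the latest frame rate. */
void print_fps(int fps)
{
    char line[32];

    format_fps_line(line, sizeof line, fps);
    fputs(line, stdout);
    fflush(stdout);  /* no '\n' is written, so flush explicitly */
}
```

If the value can shrink in width (for example from 100 to 99), a fixed-width format such as "fps: %4d\r" avoids leaving a stale digit at the end of the line.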
19:26.30 Maloeran Fine, download again
19:26.35 Twingy k :)
19:27.10 Twingy much better
19:27.32 Twingy 1: 14 fps
19:28.00 Twingy 2: 27 fps
19:28.29 Twingy 3: 37 fps
19:28.54 Twingy 4: 43 fps
19:29.26 Maloeran Four processors to run 3 times faster than one, the loss is reasonable
19:29.38 Twingy yah, it's pretty good I'd say
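
The scaling in the numbers above can be made explicit with two ratios: speedup relative to the single-thread run, and parallel efficiency (speedup divided by thread count). A minimal sketch with hypothetical function names:

```c
#include <assert.h>

/* Speedup of an n-thread run relative to the single-thread run. */
double speedup(double fps_n, double fps_1)
{
    assert(fps_1 > 0.0);
    return fps_n / fps_1;
}

/* Parallel efficiency: speedup divided by thread count (1.0 means linear). */
double efficiency(double fps_n, double fps_1, int threads)
{
    assert(threads > 0);
    return speedup(fps_n, fps_1) / threads;
}
```

For the measurements in the log (14, 27, 37 and 43 fps), the speedups come to roughly 1.93, 2.64 and 3.07 for two, three and four threads, and the efficiency at four threads is roughly 0.77.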
19:29.46 Twingy how are you threading?
19:29.48 Twingy per scanline?
19:29.57 Maloeran Per block of 32x32 pixels in what I sent you
19:30.09 Twingy each thread works on a block?
19:30.19 Maloeran Right, and fetch the next block pending in queue once it's done
19:30.29 Twingy k, that was my next step
19:30.39 Twingy for threading
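
A sketch of the block scheme described above, where each thread renders one 32x32 block and then fetches the next pending one. How the queue is implemented in the demo is not shown in the log; this version assumes a single shared counter advanced with a C11 atomic, and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TILE 32  /* block edge in pixels, as quoted in the log */

struct tile {
    int x, y;  /* top-left pixel of the block */
    int w, h;  /* block size, clipped at the image edge */
};

/* Number of blocks needed to cover a width x height image. */
int tile_count(int width, int height)
{
    int cols = (width + TILE - 1) / TILE;
    int rows = (height + TILE - 1) / TILE;

    return cols * rows;
}

/* Atomically claim the next pending block. Returns 1 and fills *out,
 * or returns 0 once every block has been handed out. */
int next_tile(atomic_int *cursor, int width, int height, struct tile *out)
{
    int cols = (width + TILE - 1) / TILE;
    int index;

    assert(cursor != NULL && out != NULL);
    index = atomic_fetch_add(cursor, 1);
    if (index >= tile_count(width, height))
        return 0;

    out->x = (index % cols) * TILE;
    out->y = (index / cols) * TILE;
    out->w = (out->x + TILE <= width) ? TILE : width - out->x;
    out->h = (out->y + TILE <= height) ? TILE : height - out->y;
    return 1;
}
```

Each worker thread would loop on `next_tile` until it returns 0, so faster threads simply take more blocks and no thread waits on a fixed assignment.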
19:30.47 Twingy I think you want as little coherency as possible between threads
19:31.04 Twingy otherwise you have threads doing reads from the same memory segment it seems
19:31.08 Twingy but that *should* be in cache
19:31.14 Twingy but
19:31.21 Maloeran Reads from the same segments aren't generally a problem, not a big one anyway
19:31.22 Twingy a thread has no idea if that data is current
19:31.27 Twingy so I think in general it has to do a fresh read
19:31.35 Maloeran It doesn't trash the cache, it might just be a bit of a bottleneck on a memory bank
19:31.45 Twingy right, but do you agree with my statement?
19:32.02 Maloeran I don't see what you mean by "data being current"
19:32.14 Twingy thread 1 pulls from array[1234] at time interval X
19:32.21 Twingy thread 2 pulls from array[1234] at time interval Y
19:32.36 Twingy how does thread 2 know that array[1234] at time interval Y read the same data as interval X
19:32.48 Twingy It doesn't... so I think it cannot rely on cache
19:33.03 Maloeran If no other processor has modified the data, it will remain in the processor's cache
19:33.26 Twingy okie
19:33.45 Twingy welp, let me try postage stamping the threaded work units like I'm already doing for the distributed compute nodes
19:34.08 Twingy (I've done this before in previous experiments)
19:34.25 Twingy but I never tested improvement of linearity of threads
19:34.43 Maloeran Mmhm, good luck. Verify your writes especially, I suspect something is trashing the caches somewhere
19:35.04 Twingy possibly
19:35.26 Twingy I generate a ray, I fire the ray, the ray gets intersected against triangles, the result goes to the render method, render method shades a pixel, pixel gets shoved into a 128x128x3 buffer
19:35.36 Maloeran Where do threads pull their ray vectors from?
19:35.47 Twingy each thread has an instance of a ray object
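
One way to check the advice about writes is to make sure data written by different threads never shares a cache line, since a line written by one processor is invalidated in the other processors' caches. The sketch below pads a per-thread structure to a full line. The 64-byte line size and the field names are assumptions, not taken from the code under discussion.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; check the target CPU */

/* Per-thread counters, aligned so that each element of an array of
 * slots starts on its own cache line. */
struct thread_slot {
    alignas(CACHE_LINE) unsigned long rays_fired;
    unsigned long pixels_written;
};

/* The compiler pads the struct up to its alignment, so consecutive
 * array elements can never share a line. */
static_assert(sizeof(struct thread_slot) % CACHE_LINE == 0,
              "thread_slot must occupy whole cache lines");

/* Returns 1 if two slots in an array cannot share a cache line. */
int slots_are_isolated(void)
{
    return sizeof(struct thread_slot) % CACHE_LINE == 0
        && alignof(struct thread_slot) == CACHE_LINE;
}
```

The same check applies to output buffers: if two threads write pixels that fall in the same line, that line bounces between caches even though the bytes themselves are distinct.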
19:38.13 Twingy according to your single threaded performance, using your SSE stuff and optical bundle tricks you seem to be 7x faster than mine
19:39.13 Twingy for non-coherent path tracing stuff I think you're 1.5 - 2.0x faster
19:39.26 Maloeran :) Not too bad, but the new scene analysis/preparation is still in the workshop
19:39.30 Twingy but you don't have rays all the way through or double-sided normals on
19:39.34 Twingy k
19:39.58 Twingy I've got a potential 15% performance improvement for optical rendering only I might drop in tonight
19:40.24 Twingy that should inch me towards 2 mil/sec
19:40.40 Twingy for certain views, but on average about 1.5 mil/sec I think
19:40.53 Maloeran I went to implement rays going through geometry with callbacks a few days ago, but it looked really messy to implement in the SSE path working in bundles
19:41.14 Twingy yes
19:41.21 Twingy now you see why I haven't done it yet
19:41.30 Twingy or anyone has
19:41.38 Maloeran I may well just support that for single rays, which is trivial, for now anyway
19:41.42 Twingy yah
19:41.47 Twingy btw
19:41.52 Twingy I'm thinking about removing callbacks
19:42.02 Twingy and just putting the intersection stack into the ray itself
19:42.13 Twingy so if you fire a ray again with a used stack
19:42.18 Twingy it'll just resume using the stack
19:42.30 Twingy removes an extra function call for optical rendering
19:42.34 Maloeran Will this ray be able to change direction?
19:42.47 Twingy huh? of course you fire a new ray
19:43.07 Twingy a changed direction means generating a new ray for me
19:43.27 Maloeran So you could re-use the intersection point of a ray, as a source for new rays, exploiting the locality of rays
19:43.44 Twingy not with a tree
19:43.59 Twingy with a graph, sure
19:44.22 Twingy when I feel I've completely run out of tricks
19:44.27 Twingy I might try the graph approach
19:44.39 Twingy brb, gonna make a quick lunch
19:44.44 Maloeran I think a shortcut would be possible, but in any case... yes, I believe a graph is much more appropriate :)
19:44.53 Twingy for path tracing, definitely
19:45.17 Twingy but *ahem* lee wanted me to implement wald's paper *ahem*
19:45.27 Twingy so I have to roll with that for now
19:45.35 Maloeran For any ray-tracing needs really, since a graph can both exploit coherency and locality
19:45.47 Twingy jup
19:45.49 Maloeran *nods* So I understood... Lee isn't here, is he? :p
19:45.55 Twingy he's polyspin
19:46.08 Twingy brb, gonna make some lunch
20:04.26 Twingy back
20:05.58 Maloeran If you are going to try a graph based technique, I still think a tetrahedron graph could be promising
20:06.13 Maloeran No sectors, no nodes, no triangle, a single primitive ; tetrahedrons
20:06.30 Twingy maybe we can work on this together...
20:07.23 Maloeran *nods* I still have a fairly long to-do list on my current design
20:07.45 Twingy I'm sure this will be some ongoing research we'll be involved with for years to come
20:08.07 Twingy we'll probly be 30 or so before this winds down
20:09.23 Maloeran Eheh, I wouldn't mind :), I haven't found a field that led to so interesting, mind boggling problems just waiting to be solved
20:09.46 Maloeran another field, that is
20:10.46 Twingy the only other thing I found nearly this interesting was when I was working on my coil gun
20:11.18 Twingy it was easy to see how years of research could be spent optimizing the efficiency of the energy transfer
20:12.15 Maloeran Eh I suppose, how did this project work out?
20:12.42 Twingy I was able to get my single stage coil gun up to an efficiency of about 2% using off the shelf power supply capacitors and hand-wound litz wire
20:12.54 Twingy max efficiency I've ever seen someone get out of single stage is 10%
20:13.03 Twingy in research labs etc
20:13.03 Maloeran Hum :), I see
20:13.14 Maloeran I have been thinking sporadically about AI since my early genetic experiments too, that's something I want to get back into, I have my ideas for a lower level approach to the problem
20:13.32 Maloeran Much closer to the fundamental definition of intelligence, nothing like faking it at high-level with neural networks
20:16.59 Maloeran Any more clues on the threading performance issue?
20:17.09 Twingy I'm still playing with my demo
20:17.15 Twingy recreating a base-model of what's going on
20:17.28 Twingy and I'm seeing reasonable performance from it
20:17.30 Twingy near-linear
20:17.44 Twingy 16384 random accesses to a 16MB array distributed among 4 threads
20:17.52 Twingy doing a billion iteration computation loop
20:18.13 Twingy each work unit is a random access
20:18.16 Maloeran Read operations can't be the source of such a performance bottleneck
20:18.23 Twingy indeed
20:23.38 Twingy one thing you have to worry about with opterons is the hyper transport too
20:23.59 Twingy it's a big bottleneck
20:27.11 silvap_ i swear i saw a reference to yoot saito's "yoot's tower" in the sim tower splash screen
20:30.59 silvap_ Maloeran, what sorta AI stuff have u done
20:32.14 Maloeran Long ago, genetic algorithms to build up an AI for a primitive strategy game of some sort
20:32.40 Maloeran It played for months against itself on a P3, results were disappointing ( the learning pace was below my expectations )
20:33.29 silvap_ hehe cool
20:34.13 silvap_ was it coupled with an ANN or was the state space too large?
20:35.00 Maloeran It wasn't, it really was a straight genetic algorithm
20:35.21 silvap_ ah
20:35.35 Twingy i.e brute force :)
20:35.47 Maloeran On a Pentium 3! :)
20:36.01 Twingy meh, that's not that long ago
20:36.09 Twingy now if it were a P133 ...
20:37.10 Maloeran Pentium 3 733mhz, running all nights on my father's computer, it was high end back then
20:52.48 Twingy getting closer to the problem
20:53.09 Twingy just replaced my ray intersection with a loop that does 1024 floating multiplies
20:53.26 Twingy 100% 75% 25% 19% efficiency
20:53.35 Twingy so it's definitely not the engine
20:53.46 Twingy so I think I know what to fix now
20:56.01 Maloeran Your engine's code did seem quite fine on this aspect. Would it be the callbacks?
21:06.02 Twingy no, I think it's the semaphore usage in my camera
21:06.17 Twingy I'm going running now, so I'll investigate in an hour
21:07.19 Twingy and indeed it was
21:07.50 Twingy I must've been pinching a nerve or something on the soft one
21:10.47 Maloeran Eh, good
21:58.54 *** join/#brlcad ``Erik (i=erik@pcp0011474399pcs.chrchv01.md.comcast.net)
