03:27.41 |
Maloeran |
That's strange, what kind of threading problem
was it? |
04:11.14 |
Twingy |
play yoot's tower? |
04:30.33 |
Maloeran |
Not yet still, some AD&D game consumed
most of the day, I soon will |
04:30.57 |
Twingy |
heh |
04:31.29 |
Twingy |
yoot's seems to be a bit more challenging than
the previous one, the locality of offices seems to be more
sensitive to traffic and noise |
04:32.24 |
Maloeran |
Excellent, Sim Tower was way too
easy |
04:32.45 |
Twingy |
yah |
04:32.52 |
Twingy |
easy and loooong |
04:33.01 |
Twingy |
took forever to finish when I was playing it
on my P133 |
04:33.06 |
Twingy |
back in '95 |
04:33.25 |
Maloeran |
Ahah. It didn't take too long on the dual-mp,
time was flying way too fast even in non-fast mode |
04:33.31 |
Maloeran |
Faster than I could spend anyway |
04:33.45 |
Twingy |
heh, that certainly makes it quick |
05:42.29 |
*** join/#brlcad Erroneous
(n=DTRemena@DHCP-170-143.caltech.edu) |
05:44.24 |
*** join/#brlcad Erroneous
(n=DTRemena@DHCP-170-143.caltech.edu) |
05:44.28 |
Twingy |
o.O |
14:21.25 |
*** join/#brlcad docelic
(n=docelic@ri01-150.dialin.iskon.hr) |
16:16.08 |
silvap_ |
yoot's tower came b4 sim tower, no? |
18:30.25 |
Twingy |
after |
18:40.14 |
Twingy |
wow, I've got a little test app here that
shows how sucky performance is when multiple threads are reading from
the same buffer |
18:40.53 |
Maloeran |
Of course, the processors keep trashing each
other's cache |
18:40.59 |
Twingy |
yah :| |
18:41.10 |
Twingy |
but this is a 64 byte array |
18:41.17 |
Twingy |
it should always live in cache o.O |
18:42.30 |
Twingy |
that explains why my ray-tracer with 4 threads
performs at half the speed of 4 separate instances |
18:42.49 |
Twingy |
I'm getting same results on linux,bsd
(xeon,opteron) |
18:42.51 |
Maloeran |
If it was up to the programmer to deal with
coherency of the caches, we really could scale motherboards to
a large number of processors |
18:43.02 |
Maloeran |
Ah I see |
18:43.03 |
Twingy |
that's not the problem |
18:43.09 |
Twingy |
it's the memory |
18:43.20 |
Twingy |
my stryker model sucks up 1.3GB of memory when
fully loaded |
18:43.40 |
Twingy |
and unless I have 2GB of ram per cpu on the
mobo a bajillion processors doesn't make it any better |
18:44.05 |
Twingy |
which is why I got this opteron machine with
8GB of ram per box |
18:44.12 |
Twingy |
so I could have (4) 2GB instances
going |
18:44.34 |
Twingy |
which is why I'm seeing performance
soar |
18:45.00 |
Twingy |
I mean this one opteron machine with 4
instances is out performing the entire cluster cause each node only
has 2GB |
18:45.16 |
Maloeran |
Ahah |
18:45.40 |
Twingy |
now if each node had 4GB of ram (prohibitively
expensive at the time, each node has 6 ram slots) |
18:45.50 |
Twingy |
then I'd be seeing twice the performance out of
each one |
18:46.11 |
Twingy |
based on these findings I'm half tempted to
rip all of the threading code out |
18:46.22 |
Maloeran |
I don't see the problem with multiple
threads|processors sharing the same banks, although Opterons do
have per-processor banks with faster access |
18:46.26 |
Twingy |
cause my application is omfg fast and
cache-coherent |
18:46.33 |
Twingy |
it's not benefiting from threads hardly at
all |
18:46.49 |
Twingy |
are you getting linear speed-ups with
threading? |
18:46.55 |
Maloeran |
You have to guarantee processors won't be
stepping on each other's feet, regarding reads and writes to their
cache |
18:47.18 |
Maloeran |
I... have no idea, I haven't even tested on
the dual-mp |
18:47.22 |
Twingy |
ah |
18:47.36 |
Twingy |
you got a demo with dynamic number of threads
I can specify with no graphics output handy? |
18:47.41 |
Twingy |
just the text? |
18:47.57 |
Maloeran |
Hum, give me 2 minutes |
18:47.59 |
Twingy |
btw, when you ship another demo |
18:48.10 |
Twingy |
make the fps output a: |
18:48.18 |
Twingy |
printf("fps: %d\r", fps); |
18:48.20 |
Twingy |
fflush(stdout); |
18:48.24 |
Maloeran |
I'm deep into rewriting large chunks of the
scene prep, it might take a bit |
18:48.32 |
Twingy |
that's what cvs is for :) |
18:48.40 |
Twingy |
so you can checkout an older stable
version |
18:48.40 |
Maloeran |
So noted, fbsd doesn't naturally flush on
\n |
18:49.05 |
Twingy |
use cvs damnit :) |
18:49.42 |
Twingy |
ugh |
18:49.43 |
Maloeran |
Yes, maybe I should :) |
18:50.05 |
Maloeran |
It's for historical purposes, back to the
castle days and our 300k rays per second |
18:50.20 |
Twingy |
you should shove those into cvs in sequential
order |
18:51.58 |
Twingy |
I'm boggled it doesn't shove this 64 byte
array in cache |
18:52.05 |
Twingy |
It's got a friggen 1MB of cache |
18:55.18 |
Twingy |
heh |
18:55.30 |
Twingy |
I'm causing system time to spike and user/wall
time just plop along |
18:56.39 |
Maloeran |
I quickly modified the latest backup to use
the first argument as number of threads |
18:56.45 |
Twingy |
k |
18:56.56 |
Twingy |
I'll benchmark on my opteron |
18:57.06 |
Twingy |
if I have linux abi going |
18:57.12 |
Maloeran |
Running on 64 bits of course? |
18:57.17 |
Twingy |
let me see |
18:57.28 |
Twingy |
yah |
18:57.35 |
Twingy |
*should* work |
18:57.57 |
Maloeran |
All right, that's an ELF 64-bit LSB
executable, AMD x86-64 |
18:59.05 |
Twingy |
ELF binary type "0" not known. |
18:59.20 |
Twingy |
*shrug* |
18:59.29 |
Twingy |
guess it's not bsd friendly |
18:59.53 |
Twingy |
for me here's how I do on threading |
19:00.11 |
Twingy |
compared to 100% performance on four separate
threads |
19:00.30 |
Twingy |
1: 25%, 2: 15%, 3: 10%, 4: 5% |
19:00.34 |
Twingy |
in terms of performance gains |
19:00.52 |
Twingy |
so the 4 threaded version is 50 - 55% of 4
separate threads |
19:00.58 |
Twingy |
and the 4th thread buys you almost
nothing |
19:01.12 |
Twingy |
so you really need 4 cpus |
19:01.30 |
Twingy |
cause you'll probly see a nearly 2x
performance gain with 2 threaded version |
19:01.38 |
Twingy |
so it won't tell you much |
19:01.40 |
Maloeran |
That's for dual-xeons? |
19:01.43 |
Twingy |
opteron |
19:01.49 |
Twingy |
2core dual cpu |
19:01.53 |
Maloeran |
Ah yes, dual core |
19:01.58 |
Twingy |
and xeon is slightly worse |
19:02.04 |
Twingy |
but more or less the same |
19:02.26 |
Maloeran |
Do each of your threads work on fairly
distinct chunks of memory? |
19:02.43 |
Twingy |
they get a *chunk* of pixels to work on that
are coherent |
19:02.44 |
Maloeran |
Especially the write operations, they must not
trash the other processors cache |
19:02.51 |
Twingy |
the writes are to distinct memory |
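The concern about writes trashing another processor's cache is usually handled by giving each thread its own cache line to write into. A minimal sketch in C11, assuming a 64-byte cache line (typical for x86, but worth checking for the actual CPU); `padded_counter`, `count_hit` and `total_hits` are hypothetical names, not taken from either ray-tracer:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE  64   /* assumed cache-line size in bytes */
#define MAX_THREADS 4    /* assumed thread count for this sketch */

/* Aligning the member to CACHE_LINE pads the struct to a full line,
 * so neighbouring array elements never share a cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE) unsigned long value;
};

static struct padded_counter counters[MAX_THREADS];

/* Each thread writes only to its own slot, so its writes do not
 * invalidate lines that other threads are writing to. */
void count_hit(int thread_id)
{
    assert(thread_id >= 0 && thread_id < MAX_THREADS);
    counters[thread_id].value++;
}

/* Sum the per-thread slots; meant to be called after the workers finish. */
unsigned long total_hits(void)
{
    unsigned long sum = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        sum += counters[i].value;
    return sum;
}
```

Without the alignment, four `unsigned long` counters would sit in a single cache line, and every increment by one thread would force the other processors to reload that line.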
19:03.04 |
Twingy |
the reads are probly from common memory most
of the time, (the triangle data) |
19:03.05 |
Maloeran |
No problem there, I suppose the chunks are big
enough |
19:03.08 |
Twingy |
cause the rays are coherent |
19:03.24 |
Twingy |
so for 128x128 pixels |
19:03.33 |
Twingy |
4 threads will take on 128 pixel scan lines
for now |
19:03.39 |
Twingy |
128x128 goes across the network |
19:03.48 |
Twingy |
and 4 threads gobble up a 128 pixel scanline
at a time |
19:03.54 |
Twingy |
(all configurable in the environment file for
the project) |
19:03.59 |
Maloeran |
Could it be a problem of synchronisation,
mutexes and network I/O ? |
19:04.02 |
Twingy |
this is an example, nothing is
hard-wired |
19:04.22 |
Twingy |
I don't think so, cause I got a demo here that
doesn't do any network i/o |
19:04.36 |
Maloeran |
I suppose the count of mutexes is very low
too |
19:04.39 |
Twingy |
and the only thing a mutex is doing is locking
to increment a variable (i++) then unlocking |
19:04.52 |
Twingy |
and after each mutex it does millions of
computations |
19:05.00 |
Twingy |
so mutex should be negligible |
19:05.05 |
Twingy |
yes |
19:05.11 |
Twingy |
mutex locking is in the 10's or
100's |
19:05.16 |
Twingy |
computations in the loop are in the
millions |
19:05.18 |
Maloeran |
Is this variable shared by all threads? How
often does it increment? |
19:05.26 |
Maloeran |
Ah. |
19:05.27 |
Twingy |
10 to 100 times |
19:05.34 |
Twingy |
it represents work units |
19:05.40 |
Twingy |
work unit index rather |
19:05.45 |
Twingy |
I am working on work unit 'i' |
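The shared work-unit index described above (lock, `i++`, unlock, then a long stretch of computation) can be sketched with POSIX threads roughly as follows. This is an illustration under assumptions, not Twingy's actual camera code: `work_queue`, `claim_work_unit` and `worker` are hypothetical names, and the multiply loop only stands in for the real per-unit work.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state: one counter hands out work-unit indices. */
struct work_queue {
    pthread_mutex_t lock;
    int next;    /* next unclaimed work unit */
    int total;   /* number of work units */
    int done;    /* units completed, updated under the lock */
};

void work_queue_init(struct work_queue *q, int total)
{
    assert(q != NULL && total >= 0);
    pthread_mutex_init(&q->lock, NULL);
    q->next = 0;
    q->total = total;
    q->done = 0;
}

/* Claim the next unit: lock, i++, unlock. Returns -1 when none remain. */
int claim_work_unit(struct work_queue *q)
{
    int i;
    pthread_mutex_lock(&q->lock);
    i = (q->next < q->total) ? q->next++ : -1;
    pthread_mutex_unlock(&q->lock);
    return i;
}

/* Thread body with the signature pthread_create() expects. */
void *worker(void *arg)
{
    struct work_queue *q = arg;
    while (claim_work_unit(q) >= 0) {
        volatile float acc = 1.0f;       /* stand-in for the real work */
        for (int k = 0; k < 1000000; k++)
            acc = acc * 1.0000001f;
        pthread_mutex_lock(&q->lock);
        q->done++;
        pthread_mutex_unlock(&q->lock);
    }
    return NULL;
}
```

Each thread would be started with `pthread_create(&tid, NULL, worker, &q)` and joined with `pthread_join`. With only tens to hundreds of lock operations against millions of loop iterations per unit, the mutex itself should account for a tiny share of the run time.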
19:06.27 |
Twingy |
and the 'work' is a float[16] array, where it
does a random read from it (as happens in triangle ray-tracing from
big triangle array) and does math on the value at
float[...] |
19:06.29 |
Twingy |
now |
19:06.44 |
Twingy |
if I put a = array[rand....] in the
loop |
19:06.55 |
Twingy |
where it gets done millions of times,
performance is horrible nasty omfg slow |
19:07.04 |
Twingy |
if I put it before the loop so it only gets
called a few hundred times |
19:07.11 |
Twingy |
then performance is parallel |
19:07.16 |
Twingy |
more threads = more cpu |
19:07.20 |
Twingy |
which is why for ray-tracing |
19:07.24 |
Twingy |
it's really a mix of both |
19:07.38 |
Twingy |
which is why we get 'some' performance, but it
ramps down rather quickly |
19:08.10 |
Maloeran |
Yes I get the picture |
19:08.26 |
Twingy |
so I'm curious if it's just me |
19:08.36 |
Twingy |
or if my conclusions are similar in your
case |
19:08.51 |
Maloeran |
Would you be able to run a 32 bits linux
binary? |
19:08.55 |
Twingy |
I dunno |
19:08.56 |
Twingy |
we could try |
19:09.01 |
Twingy |
I don't think the 'speed' matters |
19:09.06 |
Twingy |
just the threaded relationship |
19:09.25 |
Twingy |
in regards to performance |
19:09.58 |
Maloeran |
Right |
19:10.27 |
Maloeran |
I have been glancing over your code, I don't
really see a potential problem, yet |
19:10.45 |
Twingy |
ah, in the engine? |
19:10.51 |
Twingy |
the camera in libutil/camera.c |
19:10.54 |
Twingy |
is the threading |
19:11.02 |
Twingy |
the engine tie.c is the intersection
code |
19:11.12 |
Maloeran |
I know ;) |
19:11.14 |
Twingy |
k :) |
19:11.20 |
Twingy |
when is the last time you updated? |
19:11.25 |
Twingy |
I have been making changes left and
right |
19:11.34 |
Maloeran |
It might be up to a week old |
19:11.37 |
Twingy |
k |
19:12.07 |
Maloeran |
Did you try the 32 bits binary? |
19:13.04 |
Twingy |
anyway to remove sdl quickly? |
19:13.14 |
Twingy |
I don't think I have sdl installed for linux
on freebsd |
19:13.17 |
Maloeran |
Oops. |
19:13.24 |
Maloeran |
A moment. |
19:13.28 |
Twingy |
k |
19:13.39 |
Twingy |
looks like it should otherwise run
though |
19:17.45 |
Maloeran |
All right, download from same
address |
19:19.07 |
Twingy |
your printf still pours out pages and pages of
numbers |
19:19.08 |
Maloeran |
and I forgot the fflush() :) |
19:19.12 |
Twingy |
heh |
19:19.43 |
Maloeran |
You can set the number of threads by the first
argument |
19:19.49 |
Twingy |
yep |
19:20.20 |
Twingy |
can you do the fflush thing? |
19:20.35 |
Twingy |
don't think my terminal is keeping up to
speed |
19:21.07 |
Maloeran |
Okay... Give me a moment to re-do all the SDL
removal :} |
19:21.16 |
Twingy |
you just delete it? |
19:21.39 |
Twingy |
this is the part where you would've tagged a
branch in cvs for me ;) |
19:24.14 |
Maloeran |
You know... There is already a fflush( stdout
); in there |
19:24.31 |
Twingy |
using '\r' as the terminator? |
19:24.44 |
Twingy |
'\n' can't be there |
19:24.53 |
Maloeran |
Oh? |
19:25.05 |
Twingy |
'\n' means new line |
19:25.08 |
Twingy |
we don't want a new line |
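The point about `'\r'` versus `'\n'` can be shown in a few lines of C. A minimal sketch, with `format_fps` and `print_fps` as hypothetical helper names: the carriage return moves the cursor back to the start of the line so each update overwrites the previous one, and because no newline is ever written, the explicit `fflush(stdout)` is what pushes the text to the terminal.

```c
#include <assert.h>
#include <stdio.h>

/* Build the status text. The trailing '\r' returns the cursor to
 * column 0, so the next update overwrites this one instead of
 * scrolling the terminal. */
int format_fps(char *buf, size_t len, int fps)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "fps: %d\r", fps);
}

/* Print the status line and flush, since a line-buffered stdout
 * would otherwise hold the text until a '\n' arrives. */
void print_fps(int fps)
{
    char line[32];
    format_fps(line, sizeof line, fps);
    fputs(line, stdout);
    fflush(stdout);
}
```

Calling `print_fps()` once per frame keeps the counter on a single terminal line. If a later value has fewer digits than an earlier one, stale characters can remain at the end of the line; padding the format (for example `"fps: %-6d\r"`) is one way around that.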
19:26.30 |
Maloeran |
Fine, download again |
19:26.35 |
Twingy |
k :) |
19:27.10 |
Twingy |
much better |
19:27.32 |
Twingy |
1: 14 fps |
19:28.00 |
Twingy |
2: 27 fps |
19:28.29 |
Twingy |
3: 37 fps |
19:28.54 |
Twingy |
4: 43 fps |
19:29.26 |
Maloeran |
Four processors to run 3 times faster than
one, the loss is reasonable |
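The scaling figures quoted above (14, 27, 37 and 43 fps for 1 to 4 threads) work out to speedups of roughly 1.93x, 2.64x and 3.07x, or parallel efficiencies of about 96%, 88% and 77%. A small sketch of that arithmetic, with `speedup` and `efficiency` as hypothetical helper names:

```c
#include <assert.h>

/* Speedup of an n-thread run relative to the single-thread run. */
double speedup(double fps_n, double fps_1)
{
    assert(fps_1 > 0.0);
    return fps_n / fps_1;
}

/* Parallel efficiency: speedup divided by the number of threads.
 * 1.0 means perfectly linear scaling. */
double efficiency(double fps_n, double fps_1, int threads)
{
    assert(threads > 0);
    return speedup(fps_n, fps_1) / threads;
}
```

For example, `efficiency(43.0, 14.0, 4)` is about 0.77, which matches the "3 times faster on four processors" reading.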
19:29.38 |
Twingy |
yah, it's pretty good I'd say |
19:29.46 |
Twingy |
how are you threading? |
19:29.48 |
Twingy |
per scanline? |
19:29.57 |
Maloeran |
Per block of 32x32 pixels in what I sent
you |
19:30.09 |
Twingy |
each thread works on a block? |
19:30.19 |
Maloeran |
Right, and fetch the next block pending in
queue once it's done |
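The block-per-thread scheme can be reduced to mapping a queue index to a pixel rectangle. A minimal sketch, assuming row-major block order and clipping at the image edges; `tile`, `tile_count` and `tile_rect` are hypothetical names, and Maloeran's actual queue may be organised differently:

```c
#include <assert.h>

struct tile { int x, y, w, h; };

/* Number of tiles needed to cover the image, rounding up at the edges. */
int tile_count(int width, int height, int size)
{
    assert(width > 0 && height > 0 && size > 0);
    int cols = (width + size - 1) / size;
    int rows = (height + size - 1) / size;
    return cols * rows;
}

/* Map a row-major tile index to its pixel rectangle, clipping the
 * last column and row when the image is not a multiple of the size. */
struct tile tile_rect(int index, int width, int height, int size)
{
    assert(index >= 0 && index < tile_count(width, height, size));
    int cols = (width + size - 1) / size;
    struct tile t;
    t.x = (index % cols) * size;
    t.y = (index / cols) * size;
    t.w = (t.x + size > width)  ? width  - t.x : size;
    t.h = (t.y + size > height) ? height - t.y : size;
    return t;
}
```

A 128x128 image with 32x32 blocks gives 16 tiles. Each thread would claim the next index from a shared counter, render `tile_rect(index, ...)`, and come back for another until the indices run out.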
19:30.29 |
Twingy |
k, that was my next step |
19:30.39 |
Twingy |
for threading |
19:30.47 |
Twingy |
I think you want as little coherency as
possible between threads |
19:31.04 |
Twingy |
otherwise you have threads doing reads from
the same memory segment it seems |
19:31.08 |
Twingy |
but that *should* be in cache |
19:31.14 |
Twingy |
but |
19:31.21 |
Maloeran |
Reads from the same segments aren't generally
a problem, not a big one anyway |
19:31.22 |
Twingy |
a thread has no idea if that data is
current |
19:31.27 |
Twingy |
so I think in general it has to do a fresh
read |
19:31.35 |
Maloeran |
It doesn't trash the cache, it might just be a
bit of a bottleneck on a memory bank |
19:31.45 |
Twingy |
right, but do you agree with my
statement? |
19:32.02 |
Maloeran |
I don't see what you mean by "data being
current" |
19:32.14 |
Twingy |
thread 1 pulls from array[1234] at time
interval X |
19:32.21 |
Twingy |
thread 2 pulls from array[1234] at time
interval Y |
19:32.36 |
Twingy |
how does thread 2 know that array[1234] at
time interval Y read the same data as interval X |
19:32.48 |
Twingy |
It doesn't... so I think it cannot rely on
cache |
19:33.03 |
Maloeran |
If no other processor has modified the data,
it will remain in the processor's cache |
19:33.26 |
Twingy |
okie |
19:33.45 |
Twingy |
welp, let me try postage stamping the threaded
work units like I'm already doing for the distributed compute
nodes |
19:34.08 |
Twingy |
(I've done this before in previous
experiments) |
19:34.25 |
Twingy |
but I never tested improvement of linearity of
threads |
19:34.43 |
Maloeran |
Mmhm, good luck. Verify your writes
especially, I suspect something is trashing the caches
somewhere |
19:35.04 |
Twingy |
possibly |
19:35.26 |
Twingy |
I generate a ray, I fire the ray, the ray gets
intersected against triangles, the result goes to the render
method, render method shades a pixel, pixel gets shoved into a
128x128x3 buffer |
19:35.36 |
Maloeran |
Where do threads pull their ray vectors
from? |
19:35.47 |
Twingy |
each thread has an instance of a ray
object |
19:38.13 |
Twingy |
according to your single threaded performance,
using your SSE stuff and optical bundle tricks you seem to be 7x
faster than mine |
19:39.13 |
Twingy |
for non-coherent path tracing stuff I think
you're 1.5 - 2.0x faster |
19:39.26 |
Maloeran |
:) Not too bad, but the new scene
analysis/preparation is still in the workshop |
19:39.30 |
Twingy |
but you don't have rays all the way through or
double-sided normals on |
19:39.34 |
Twingy |
k |
19:39.58 |
Twingy |
I've got a potential 15% performance
improvement for optical rendering only I might drop in
tonight |
19:40.24 |
Twingy |
that should inch me towards 2
mil/sec |
19:40.40 |
Twingy |
for certain views, but on average about 1.5
mil/sec I think |
19:40.53 |
Maloeran |
I went to implement rays going through
geometry with callbacks a few days ago, but it looked really messy
to implement in the SSE path working in bundles |
19:41.14 |
Twingy |
yes |
19:41.21 |
Twingy |
now you see why I haven't done it
yet |
19:41.30 |
Twingy |
or anyone has |
19:41.38 |
Maloeran |
I may well just support that for single rays,
which is trivial, for now anyway |
19:41.42 |
Twingy |
yah |
19:41.47 |
Twingy |
btw |
19:41.52 |
Twingy |
I'm thinking about removing
callbacks |
19:42.02 |
Twingy |
and just putting the intersection stack into
the ray itself |
19:42.13 |
Twingy |
so if you fire a ray again with a used
stack |
19:42.18 |
Twingy |
it'll just resume using the stack |
19:42.30 |
Twingy |
removes an extra function call for optical
rendering |
19:42.34 |
Maloeran |
Will this ray be able to change
direction? |
19:42.47 |
Twingy |
huh? of course you fire a new ray |
19:43.07 |
Twingy |
a changed direction means generating a new ray
for me |
19:43.27 |
Maloeran |
So you could re-use the intersection point of
a ray, as a source for new rays, exploiting the locality of
rays |
19:43.44 |
Twingy |
not with a tree |
19:43.59 |
Twingy |
with a graph, sure |
19:44.22 |
Twingy |
when I feel I've completely run out of
tricks |
19:44.27 |
Twingy |
I might try the graph approach |
19:44.39 |
Twingy |
brb, gonna make a quick lunch |
19:44.44 |
Maloeran |
I think a shortcut would be possible, but in
any case... yes, I believe a graph is much more appropriate
:) |
19:44.53 |
Twingy |
for path tracing, definitely |
19:45.17 |
Twingy |
but *ahem* lee wanted me to implement wald's
paper *ahem* |
19:45.27 |
Twingy |
so I have to roll with that for now |
19:45.35 |
Maloeran |
For any ray-tracing needs really, since a
graph can both exploit coherency and locality |
19:45.47 |
Twingy |
jup |
19:45.49 |
Maloeran |
*nods* So I understood... Lee isn't here, is
he? :p |
19:45.55 |
Twingy |
he's polyspin |
19:46.08 |
Twingy |
brb, gonna make some lunch |
20:04.26 |
Twingy |
back |
20:05.58 |
Maloeran |
If you are going to try a graph based
technique, I still think a tetrahedron graph could be
promising |
20:06.13 |
Maloeran |
No sectors, no nodes, no triangle, a single
primitive ; tetrahedrons |
20:06.30 |
Twingy |
maybe we can work on this
together... |
20:07.23 |
Maloeran |
*nods* I still have a fairly long to-do list
on my current design |
20:07.45 |
Twingy |
I'm sure this will be some ongoing research
we'll be involved with for years to come |
20:08.07 |
Twingy |
we'll probly be 30 or so before this winds
down |
20:09.23 |
Maloeran |
Eheh, I wouldn't mind :), I haven't found a
field that led to such interesting, mind boggling problems just
waiting to be solved |
20:09.46 |
Maloeran |
another field, that is |
20:10.46 |
Twingy |
the only other thing I found nearly this
interesting was when I was working on my coil gun |
20:11.18 |
Twingy |
it was easy to see how years of research could
be spent optimizing the efficiency of the energy
transfer |
20:12.15 |
Maloeran |
Eh I suppose, how did this project work
out? |
20:12.42 |
Twingy |
I was able to get my single stage coil gun up
to an efficiency of about 2% using off the shelf power supply
capacitors and hand-wound litz wire |
20:12.54 |
Twingy |
max efficiency I've ever seen someone get out
of single stage is 10% |
20:13.03 |
Twingy |
in research labs etc |
20:13.03 |
Maloeran |
Hum :), I see |
20:13.14 |
Maloeran |
I have been thinking sporadically about AI
since my early genetic experiments too, that's something I want to
get back into, I have my ideas for a lower level approach to the
problem |
20:13.32 |
Maloeran |
Much closer to the fundamental definition of
intelligence, nothing like faking it at high-level with neural
networks |
20:16.59 |
Maloeran |
Any more clues on the threading performance
issue? |
20:17.09 |
Twingy |
I'm still playing with my demo |
20:17.15 |
Twingy |
recreating a base-model of what's going
on |
20:17.28 |
Twingy |
and I'm seeing reasonable performance from
it |
20:17.30 |
Twingy |
near-linear |
20:17.44 |
Twingy |
16384 random accesses to a 16MB array
distributed among 4 threads |
20:17.52 |
Twingy |
doing a billion iteration computation
loop |
20:18.13 |
Twingy |
each work unit is a random access |
20:18.16 |
Maloeran |
Read operations can't be the source of such a
performance bottleneck |
20:18.23 |
Twingy |
indeed |
20:23.38 |
Twingy |
one thing you have to worry about with
opterons is the hyper transport too |
20:23.59 |
Twingy |
it's a big bottleneck |
20:27.11 |
silvap_ |
i swear i saw a reference to yoot saito's
"yoot's tower" in the sim tower splash screen |
20:30.59 |
silvap_ |
Maloeran, what sorta AI stuff have u
done |
20:32.14 |
Maloeran |
Long ago, genetic algorithms to build up an AI
for a primitive strategy game of some sort |
20:32.40 |
Maloeran |
It played for months against itself on a P3,
results were disappointing ( the learning pace was below my
expectations ) |
20:33.29 |
silvap_ |
hehe cool |
20:34.13 |
silvap_ |
was it coupled with an ANN or was the state
space too large? |
20:35.00 |
Maloeran |
It wasn't, it really was a straight genetic
algorithm |
20:35.21 |
silvap_ |
ah |
20:35.35 |
Twingy |
i.e brute force :) |
20:35.47 |
Maloeran |
On a Pentium 3! :) |
20:36.01 |
Twingy |
meh, that's not that long ago |
20:36.09 |
Twingy |
now if it were a P133 ... |
20:37.10 |
Maloeran |
Pentium 3 733mhz, running all nights on my
father's computer, it was high end back then |
20:52.48 |
Twingy |
getting closer to the problem |
20:53.09 |
Twingy |
just replaced my ray intersection with a loop
that does 1024 floating-point multiplies |
20:53.26 |
Twingy |
100% 75% 25% 19% efficiency |
20:53.35 |
Twingy |
so it's definitely not the engine |
20:53.46 |
Twingy |
so I think I know what to fix now |
20:56.01 |
Maloeran |
Your engine's code did seem quite fine on this
aspect. Would it be the callbacks? |
21:06.02 |
Twingy |
no, I think it's the semaphore usage in my
camera |
21:06.17 |
Twingy |
I'm going running now, so I'll investigate in
an hour |
21:07.19 |
Twingy |
and indeed it was |
21:07.50 |
Twingy |
I must've been pinching a nerve or something
on the soft one |
21:10.47 |
Maloeran |
Eh, good |
21:58.54 |
*** join/#brlcad ``Erik
(i=erik@pcp0011474399pcs.chrchv01.md.comcast.net) |