00:02.51 |
*** join/#brlcad merzo
(~merzo@43-61-133-95.pool.ukrtel.net) |
00:42.43 |
*** join/#brlcad merzo
(~merzo@43-61-133-95.pool.ukrtel.net) |
00:57.09 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
01:04.00 |
*** join/#brlcad LordOfBikes
(~armin@dslb-088-065-188-154.088.065.pools.vodafone-ip.de) |
02:25.27 |
*** join/#brlcad kintel
(~textual@unaffiliated/kintel) |
07:10.14 |
brlcad |
getting closer, but going to have to continue
this debugging in a few hours with fresh eyes |
07:28.16 |
*** join/#brlcad KimK
(~Kim__@2001:579:d00c:600:4a5b:39ff:fe0b:57d2) |
08:44.51 |
*** join/#brlcad hightower2
(~hightower@141-210.dsl.iskon.hr) |
08:45.25 |
*** join/#brlcad hightower2
(~hightower@unaffiliated/hightower2) |
08:57.55 |
*** join/#brlcad merzo
(~merzo@43-61-133-95.pool.ukrtel.net) |
09:48.37 |
*** join/#brlcad teepee_
(~teepee@unaffiliated/teepee) |
10:39.26 |
*** join/#brlcad merzo
(~merzo@195.20.130.10) |
11:01.19 |
*** join/#brlcad teepee-
(bc5c2133@gateway/web/freenode/ip.188.92.33.51) |
13:38.21 |
*** join/#brlcad kintel
(~textual@unaffiliated/kintel) |
14:14.09 |
*** join/#brlcad merzo
(~merzo@195.20.130.10) |
15:11.33 |
*** join/#brlcad kintel
(~textual@unaffiliated/kintel) |
15:34.26 |
starseeker |
brlcad: Don't know if they'll be any use for
debugging, but I'm trying to get cache tests set up that will
ensure things stay working once we hammer out the last
issue(s) |
15:36.35 |
starseeker |
is there any way to make an attempt to bu_free
a null pointer fatal? A quick look in the code suggests there
isn't... Could we set up an environment variable or something we
could set for testing purposes to make bu_free exit on attempt to
free null? |
15:37.12 |
brlcad |
I noticed the test, sounds like a good
plan |
15:37.59 |
brlcad |
the usual way to catch that would be a sanity
macro before the free call |
15:38.51 |
starseeker |
if we knew where to put it... |
15:39.44 |
starseeker |
what I'm after is that spewage of bu_free
message we got turning into a fatal error, because in a lot of
cases things will "keep working" after that happens |
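A minimal sketch of the idea discussed above: a free wrapper that reports a NULL pointer and, when an environment variable is set, aborts instead of carrying on. This is not libbu code. The names `checked_free` and `DEMO_STRICT_FREE` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical environment variable name; not an existing BRL-CAD setting. */
#define STRICT_FREE_ENV "DEMO_STRICT_FREE"

/*
 * Free ptr, treating NULL as an error worth reporting.
 * Returns 1 if memory was freed, 0 if a NULL pointer was reported and ignored.
 * When STRICT_FREE_ENV is set in the environment, a NULL pointer aborts the
 * process so a test run fails loudly instead of "continuing to work".
 */
static int
checked_free(void *ptr, const char *label)
{
    assert(label != NULL);

    if (ptr == NULL) {
        fprintf(stderr, "checked_free: NULL pointer (%s)\n", label);
        if (getenv(STRICT_FREE_ENV) != NULL)
            abort();
        return 0;
    }

    free(ptr);
    return 1;
}
```

A test harness could set the variable before launching the program under test, so that the same binary stays tolerant in normal use and strict under test.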
15:39.45 |
brlcad |
fwiw, i'm debugging a db_close() failure.
there's some memory management issue when two processes attempt to
cache at the same time |
15:40.06 |
brlcad |
i think that's the same bug |
15:40.27 |
starseeker |
brlcad: cool. Hope I didn't stomp on anything
- I figured I'd work on the tests this morning after I saw you were
digging into the main code |
15:40.45 |
brlcad |
easily reproduced because it only happens the
first time an object is created and another process tries to create
it too |
15:40.53 |
brlcad |
working on tests is golden |
15:41.13 |
brlcad |
I'm close, it feels like something
simple |
15:41.46 |
starseeker |
so you're seeing an actual multi-process issue
(e.g. two different programs at the same time), or just multiple
threads? |
15:42.32 |
brlcad |
in theory, it can happen with two threads or
processes -- it's whenever two try to cache at the same
time. |
15:42.42 |
starseeker |
nods |
15:42.52 |
brlcad |
whoever gets there second ends up with bad
dbip book-keeping reliably |
15:43.27 |
Stragus |
Multiple processes playing in the same files,
without file locking? Neat, a bit tricky too |
15:43.29 |
brlcad |
I'm probably adding the wrong dbip to the hash
or something stupid |
15:43.54 |
starseeker |
Stragus: an efficient way to go bald, so
far |
15:44.17 |
starseeker |
Stragus: locking is fine, if we need to -
we're just missing a guard somewhere |
15:44.24 |
brlcad |
Stragus: yeah, that part actually works
alright -- it's when they go to clean up their memory ... one of
them stashed a bad pointer |
15:45.00 |
brlcad |
I don't think this is a guard issue, there's
no indication -- it seems like a straight up book-keeping
bug |
15:45.14 |
starseeker |
ah, k |
15:45.24 |
brlcad |
could be, of course, but so far it's straight
up repeatable, not raced |
15:46.09 |
starseeker |
considers whether a raced bug
might be hiding behind the current straight up failure, shudders,
and goes back to test writing |
15:47.38 |
Stragus |
So it's just memory and unrelated to file
locking/access... If you get desperate and want to try something
new, I wrote a LD_PRELOAD memory debugger: putting each allocation
between pages that core dump on access, tracking all memory
allocations and their full backtraces (handy to trace memory leaks,
etc.), and so on |
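A rough sketch of the guard-page technique described above, assuming a POSIX system with `mmap` and `mprotect`. It is not Stragus's debugger: it has no LD_PRELOAD hook and no backtrace tracking, and the names `guarded_alloc` and `guarded_free` are hypothetical. Each block is placed between two inaccessible pages and right-aligned, so the first byte past the end of the block faults immediately.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate size bytes between two PROT_NONE guard pages.
 * The block is right-aligned against the trailing guard page, so an overrun
 * of even one byte faults at once. Trade-offs: the returned pointer is only
 * as aligned as size allows, and small underruns land in the slack before
 * the block rather than in the leading guard page.
 */
static void *
guarded_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;
    size_t total = (data_pages + 2) * page;
    unsigned char *base;

    if (size == 0)
        return NULL;

    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Make the first and last pages inaccessible. */
    if (mprotect(base, page, PROT_NONE) != 0 ||
        mprotect(base + (data_pages + 1) * page, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }

    return base + (data_pages + 1) * page - size;
}

/* Release a block from guarded_alloc; the caller must pass the same size. */
static void
guarded_free(void *ptr, size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;
    unsigned char *base;

    assert(ptr != NULL && size > 0);
    base = (unsigned char *)ptr + size - (data_pages + 1) * page;
    munmap(base, (data_pages + 2) * page);
}
```

With 4096-byte pages, a small allocation in this sketch occupies three pages (guard, data, guard), which is why this style of allocator is expensive for programs that make many small allocations.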
15:47.57 |
Stragus |
I actually use it all the time, it beats
Valgrind for me |
15:48.16 |
starseeker |
I probably shouldn't be doing two rtips at
once in the same program... I don't know that that is actually
intended to work... |
15:48.19 |
starseeker |
Stragus: cool |
15:48.24 |
starseeker |
is that up on your site? |
15:49.54 |
Stragus |
I never really shared it anywhere. I wrote
that once out of desperation to track a bug |
15:51.16 |
starseeker |
Stragus: you should send it to the Valgrind
people and have them build it in. --totally-desperate or some such
option ;-) |
15:52.07 |
Stragus |
Eheh. Technically Valgrind is fancier, but #1
Valgrind is too slow for many uses #2 I want to core dump
INSTANTLY when I access where I shouldn't, not some time later
"when I use the value" as decided by Valgrind |
15:52.39 |
starseeker |
nods |
15:52.40 |
Stragus |
Ah, and #3 I like getting a detailed list of
all memory allocations at any time while the code is
running |
15:53.01 |
Stragus |
Like this: http://www.rayforce.net/mmdebug.log |
15:54.51 |
starseeker |
nice |
16:04.04 |
brlcad |
think I just found the bug |
16:04.18 |
Stragus |
cheers for bug-crushing
brlcad |
16:05.11 |
brlcad |
just a stray db_close() where we shouldn't
be closing anything |
16:06.15 |
Stragus |
Sounds like a typical double-free. I thought
you would a debugging mode #define for bu_alloc and bu_free (hence
the wrappers), catching that stuff and more |
16:06.30 |
Stragus |
you would +have |
16:06.33 |
brlcad |
starseeker: simultaneous rtips should work
just fine... |
16:10.37 |
brlcad |
Stragus: a bit more complicated. this isn't
allocation-related, it's a handle to reference counted memory
mapped files that are stored in a hash -- code pulled the handle
from the hash to see if it was there... closed it (mind you it's
still in the hash), then went to use it again. that was all good
and fine until it came time to clean up and shut down, and a bit of
naive code simply iterated over the |
16:10.42 |
brlcad |
hash and closed everything (because it should
only be there if it's open) |
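The failure pattern described above can be sketched in a few lines of C. This is a simplified illustration, not the BRL-CAD code: a fixed array stands in for the hash, a flag stands in for the reference-counted mapped file, and every name here (`struct handle`, `lookup_buggy`, `lookup_fixed`, `table_close_all`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 8

/* Stand-in for a handle to a reference-counted memory-mapped file. */
struct handle {
    int open;   /* 1 while the underlying resource is open */
};

/* Stand-in for the hash: a fixed array of handle pointers. */
struct table {
    struct handle *slots[MAX_HANDLES];
};

static void
handle_close(struct handle *h)
{
    h->open = 0;
}

/* Buggy lookup: checks whether the entry exists, then closes it while
 * leaving it in the table. */
static struct handle *
lookup_buggy(struct table *t, size_t i)
{
    struct handle *h;

    assert(i < MAX_HANDLES);
    h = t->slots[i];
    if (h != NULL)
        handle_close(h);   /* the stray close */
    return h;
}

/* Fixed lookup: only looks, never closes. */
static struct handle *
lookup_fixed(struct table *t, size_t i)
{
    assert(i < MAX_HANDLES);
    return t->slots[i];
}

/* Shutdown: close every entry, assuming "in the table" implies "open".
 * Returns how many entries violated that assumption; in real code each
 * one would be a second close of the same handle. */
static int
table_close_all(struct table *t)
{
    int stale = 0;
    size_t i;

    for (i = 0; i < MAX_HANDLES; i++) {
        struct handle *h = t->slots[i];
        if (h == NULL)
            continue;
        if (!h->open)
            stale++;
        else
            handle_close(h);
        t->slots[i] = NULL;
    }
    return stale;
}
```

The stale count exists only to make the symptom observable in the sketch. The fix is in the lookup: once nothing closes a handle that is still in the table, the shutdown loop's assumption holds again.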
16:11.14 |
brlcad |
could've caught that it was already closed,
but that would have just masked the mistake. just took a bit to
find where it was getting closed prematurely |
16:12.07 |
Stragus |
nods |
16:20.45 |
brlcad |
Stragus: sounds like you implemented
_FORTIFY_SOURCE=2 |
16:21.10 |
brlcad |
mind you, with some fancy stack printing
instead of just detect and abort |
16:24.47 |
Stragus |
I thought _FORTIFY_SOURCE was only for glibc
calls? |
16:25.09 |
brlcad |
forget which OS, but there was one out there
(maybe openbsd, solaris, I forget) for a while whose libc had all
allocations set up to intentionally incur a fault. some of that
carried over to Mac with their libc as well. |
16:25.18 |
Stragus |
I want a core dump the moment I step over a
byte I shouldn't, anywhere |
16:25.26 |
Stragus |
Cool |
16:27.21 |
Stragus |
(technically, my mmdebug has mechanisms to
exclude some allocations from the mmap stuff, 12288 bytes of
overhead per malloc() can cause trouble) |
16:28.17 |
brlcad |
yeah, _FORTIFY_SOURCE tries to do the least
expensive thing and only checks when access is via some call (iirc, they
could have gotten more advanced), but the OS-level one was
definitely any access. read one byte past a char[12] - boom,
segfault. |
16:29.13 |
Stragus |
Neat. Yes, I'm not surprised others have done
it before, it's very handy |
16:29.18 |
brlcad |
there was a lot of commotion back at the time
because so many apps wouldn't run when they turned it on |
16:29.26 |
Stragus |
Ahah |
16:31.01 |
Stragus |
I had to put some exclusion mechanisms for
various reasons, for example the NVIDIA GL drivers malloc'ed 3
bytes then later read 4 bytes from that address |
16:36.53 |
starseeker |
probably openbsd, that sounds like their
style |
16:43.55 |
brlcad |
starseeker: just slammed through a bunch of
tests including deleting cache mid-processing, read-only, dozens
of simultaneous collisions ... so far looking good. |
19:49.06 |
starseeker |
brlcad: sweet |
19:49.33 |
starseeker |
I've got at least some of the tests in place
(not actually shooting the rays yet, but the cache bit is
there) |
19:49.59 |
*** join/#brlcad merzo
(~merzo@48-10-132-95.pool.ukrtel.net) |
19:50.02 |
starseeker |
I haven't figured out what I'm doing wrong
with the rtip yet |
19:53.00 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |
20:09.05 |
*** join/#brlcad kintel
(~textual@unaffiliated/kintel) |
20:36.35 |
starseeker |
brlcad: are you able to run the cache test
rt_cache 5 10 successfully? |
20:51.56 |
*** join/#brlcad merzo
(~merzo@48-10-132-95.pool.ukrtel.net) |
20:52.16 |
starseeker |
hang on, I'm doing something wrong with
mappedfile.c... |
21:23.01 |
*** join/#brlcad merzo
(~merzo@48-10-132-95.pool.ukrtel.net) |
21:39.27 |
starseeker |
brlcad: OK, r72749 and r72751 may do it - need
to try some real tests and have a go on Windows |
21:44.47 |
starseeker |
brlcad: should the rt_cache tests follow
through and actually do a shot, or do we not need that level of
validation? (I'm going to add some basic sanity checks on number
of files present, but setting up the all-up shot validation is a
bit more infrastructure...) |
23:23.38 |
*** join/#brlcad teepee
(~teepee@unaffiliated/teepee) |