Not scanning at all, actually - that's probably where I was getting the cmakefiles.cmake reference. I use that as my initial list, so there's no scanning at all.
I mean scanning file contents, i.e., reading from disk
hierarchy traversal is nearly instantaneous these days on all platforms
Oh. I'm resetting the offset in the stream once per pass type - getting finer than that is going to get pretty tricky.
traversal is measured in hundreds of thousands of inodes per second, thousands of dirs per second
so I open the ifstream, do a bio.h check, reset seeking to 0, do the bnetwork.h check, etc.
hm
that definitely is going to involve disk caching
In principle I might be able to reuse each getline to do all the processing, but I'm leery of trying - that would involve managing multiple scanning states simultaneously, and just to save 5 or 6 seconds I doubt it's worth it.
to eliminate the effect, you'd need to parse the ifstream into a stringstream or other buffer type and do all your work on that
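for instance, something roughly like this (just a sketch -- "path" and the per-pass bodies are placeholders):
#include <fstream>
#include <sstream>
#include <string>

std::ifstream ifs(path, std::ios::binary);
std::stringstream buf;
buf << ifs.rdbuf();               // one read from disk (or cache) into memory
ifs.close();

std::string sline;
while (std::getline(buf, sline)) { /* bio.h pass */ }
buf.clear();                      // clear the eof state
buf.seekg(0);                     // rewind in memory for the next pass type
while (std::getline(buf, sline)) { /* bnetwork.h pass */ }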
Hmm. More ram usage, but none of our files are big enough to be an issue anyway on any sane system...
not even close to being an issue
the only downside would be if you don't actually have enough work, i.e., if the cost of a malloc+memcpy is more than your regex (which seems unlikely)
the memcpy can be avoided, but you'd need to use bu_mapped_file or mmap or similar and just scan the buffer instead of using getline. it's an order of magnitude faster for read-only work.
relies on the virtual memory manager to do the best thing, and it's really hard to do better oneself
in a quick test stringstream didn't do much
for what it's worth, it's very likely possible to get the repo check down to a sub-second time compiled. not necessary of course, and any improvement will be gravy.
you're already disk-cached -- you'd have to invalidate the cache to see the effect
ah
I'd still expect it to be a smidgen gain, but it's mostly saving seek time. if it's cached, though, it's all in memory already and there is no actual seeking going on
I'm probably not using it correctly - I literally dropped in the stringstream instead of the ifstream, and I doubt that's right
If you want to watch it blow up on OSX, you can try it with cmake -DREPOCHECK_TEST=ON to have "make regress-repository" use the new version.
mapped file really will be the best for this.
for myself I'd rather figure out how to get it to check all of our files defining main to make sure they call bu_setprogname
it's as simple as
struct bu_mapped_file *mf = bu_open_mapped_file(myfile, "whatever");
and just start searching (char*)mf->buf for content.
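filled out a little (still a sketch -- "filename", "needle", and the "repocheck" appl string are placeholder names):
#include <algorithm>
#include <cstring>
#include "bu/mapped_file.h"

struct bu_mapped_file *mf = bu_open_mapped_file(filename, "repocheck");
if (mf) {
    const char *start = (const char *)mf->buf;
    const char *end = start + mf->buflen;    // buflen is the file's byte count
    // bounded search, so nothing depends on the buffer being NUL-terminated
    const char *hit = std::search(start, end, needle, needle + strlen(needle));
    if (hit != end) {
        /* found "needle" at byte offset (hit - start) */
    }
    bu_close_mapped_file(mf);
}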
Don't know if I've added a new program in the last 5 years without forgetting to do that and then having to debug why the relative lookup acted weird...
bu_getline will work on that? (been a while since I used a mapped file)
need lines for the line number reporting...
you don't need bu_getline, you just use it almost like it was getline's result
just, it's all lines
instantaneously. it's a beautiful thing.
Well, if you think it's worth it I can take a look tomorrow. Need to try it on some more machines, but for myself if it's reliably <10sec (or even <30 sec) across the board that's probably good enough for now...
Yeah, I agree with you on the bu_setprogname -- that's bitten us repeatedly. ideally we shouldn't need to call it at all, but we still don't have all platforms yet.
I don't think we can get around it for OpenBSD - I couldn't find any other solution that worked reliably
It's actually pretty cool that the iwd/setprogname pairing does work - I think that might be an unusual capability
starseeker said:
Well, if you think it's worth it I can take a look tomorrow. Need to try it on some more machines, but for myself if it's reliably <10sec (or even <30 sec) across the board that's probably good enough for now...
if one of your goals/reasons for transcoding it was to make it faster, then it's definitely worth it. I think it would be worthwhile simply because the code definitely has a 10x cyclomatic complexity increase, so we should at least take advantage of the decreased iterative testing times it'll give.
if it can be made fast enough, then we might even be able to break up the tests themselves so it's not a compound test
Well, kinda - I wanted to get to a place where we don't see those crazy outliers that sometimes break the CI results with a "timed out" fail
(and have it run on Windows...)
breaking it up will certainly do that too :)
and it could run in parallel
I could probably do that now, actually (running the individual tests bit) - didn't bother since it's no longer a drop-in replacement for repository.sh when I do that, but shouldn't be too hard.
the only reason all those tests were munged together was because of shell scripting I/O performance limitations. I would have needed to use shared memory or ports/socket communication to break it up effectively, which is a pain in shell scripts.
@Sean Most of the complexity is to support threading - if you want to go simpler and single threaded it can probably be about the same or maybe a little simpler than the shell script
threading o.O
Launches the test suite against multiple files in parallel
what complexity are you referring to?
I see no threading in repocheck.cpp
look around 748 (you may need to svn update if you haven't in the last few hours)
alright, I gotta call it a night
ah, I see what you mean now
yeah, I think it should be single threaded..
It did make a difference, at least in earlier testing - maybe less so now.
this is already turning 1-line shell script commands into 100 lines of C++ logic, which has its obvious benefits, but I don't think it's healthy to add to that complexity unnecessarily. it won't be as maintainable.
repository.sh is 382 lines, repocheck.cpp is (currently) 885 - more like 3x overall...
repository.sh has an unrealistic amount of error checking for typical scripting
but fair enough since even the error checking and logging is transcoded
the script could be distilled down to something like 10 lines, though, one grep for each test
<snort> if we don't factor in readability of the lines, sure you can win that way ;-)
it's not a competition
Well, maybe in the sense of competing solutions - the C++ code has to be good enough to warrant replacement of an existing, working solution
Without, as you say, insane maintenance burdens being taken on as a result.
readability of both is dominated by reader experience. assuming an experienced programmer, the difference here is going to predominantly be lines of code and potential for error, long-term maintainability
cyclomatic complexity of c++ is nowhere near as low as any higher level language including shell script. it's way higher.
When I'm doing the coding you can add a 2x multiplier for error potential to shell scripting just due to spaces and special characters in path names.
it's a win in the other benefits, portability and performance foremost
Anyway, I'll crank it back to the single threaded version tomorrow and see if mapped files do anything helpful for performance.
any single developer's experience or preference, mine and yours included, shouldn't be the dominant calculus in my opinion. that leads to things like sticking with tcl/tk.
heh. In fairness, Tcl/Tk was the rational choice in 1990. They just decided they liked it in 1990 and stayed put too long.
alright, uncle. need sleep
certainly, and shell scripts can still be written in a fraction of the time it takes in c++; they have their place
cya
fyi, jenkins is barking a failure: make[3]: don't know how to make embedded_licenses.txt. Stop
just saw c75789 .. that's technically wrong too. the correct form of a URI reference for a relative reference is simply the relative path (i.e., no file: prefix). see https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
Was considering the "root" of the reference system to be the root of the BRL-CAD src tree (not the filesystem) so in that sense they're absolute. However, if you prefer something like the reuse.software format we can do that - I just want to pick one answer, set it up, and not worry about it anymore.
stale build tree caught me - missed a variable. r75796 should do it.
Well, that gets to < 2s on Linux, but still ~15 on FreeBSD.
@Sean is there any equivalent to perf on FreeBSD?
starseeker said:
Was considering the "root" of the reference system to be the root of the BRL-CAD src tree (not the filesystem) so in that sense they're absolute. However, if you prefer something like the reuse.software format we can do that - I just want to pick one answer, set it up, and not worry about it anymore.
I know what you were considering; it was clear. Just letting you know that it's technically wrong per the URI spec if that was the reason for using them. There's little reason to prefer file:/path/to/file over file:path/to/file and vice versa. The actual correct form is simply path/to/file or ./path/to/file as a URI reference, which is combined with a root path to create a valid URL, like file:/usr/home/brlcad/path/to/file or file:`pwd`/$filepath or file:${PWD}/${filepath} etc.
Note that of the two relative forms, lots of software does support the latter (file:rel/path/to/file) despite it being invalid (e.g., I believe some terminals will handle it)
starseeker said:
Sean is there any equivalent to perf on FreeBSD?
yes, it's called the pmc interface on freebsd. I think the main comparable tool is hwpmc but I have not tried it
https://people.freebsd.org/~jkoshy/projects/perf-measurement/
looks like latest docs are https://wiki.freebsd.org/PmcTools
Looks like pmcstat is the user-facing side, but I can't get it to kick off...
pmcstat: ERROR: Initialization of the pmc(3) library failed: No such file or directory
Not sure if that's my fault or if there's something missing on the system...
@starseeker just looking at 75799, you're likely not going to see a performance improvement with that approach. you're making three copies of every file. without copy-on-write, that could even be slower than what you had before.
@Sean I didn't see a way with mapped file to go line-by-line - it's just a big char buffer. That won't work for this...
the benefit of mapped files is when you use the mmap->buf directly. you do that to completely avoid user memory allocation, and it avoids unnecessary disk seeking.
sure it can
Short of manually storing offsets along the buffer...
especially with regexes, it's geared for that use case.
not following? why would you store offsets?
(we're in the wrong topic - one sec)
For reporting I need line numbers, which means I'm processing per-line, not per file string
there, fixed
file stream rather
it seems to be literally the bytes in a buffer, which isn't enough structure by itself
just because you need line numbers doesn't mean you have to process per-line
?
not following
C/C++ literally has no concept of lines, it's a construct by convention. a line only exists because we say it exists as denoted by special character values.
there is std api that implements that convention, but it's not rote and not the only way
the performance gain is had from just keeping track of where things are in the ->buf
you can certainly find a match with a regex regardless of newlines or with newlines being taken into account
for example
to find a line number, you just count newlines backwards from a match point (of course, probably would make sense to wrap that 2-liner into a whatsthislinenumber(off_t position) function)
but that happens entirely in memory, without a memory copy, so it's instantaneous compared to the alternatives
there's probably a c++ helper, but the C way would be to iteratively call strchr() from a match point
or literally for (i=pos; i>0; i--) if (buf[i] == newline) linenumber++;
just with strchr, you might get lucky and get a vectorized implementation
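roughly like this, for instance (a sketch -- the helper name, "start"/"end", and the regex "expr" are just illustrative placeholders):
#include <regex>
#include <cstddef>

// count newlines before the match offset to get a 1-based line number
static size_t
pos_to_line(const char *buf, size_t pos)
{
    size_t line = 1;
    for (size_t i = 0; i < pos; i++)
        if (buf[i] == '\n')
            line++;
    return line;
}

// usage against the mapped buffer, no per-line copies:
std::cmatch m;
if (std::regex_search(start, end, m, expr)) {    // start/end bound the (char *)buf
    size_t lineno = pos_to_line(start, (size_t)m.position(0));
    /* report the hit on line 'lineno' */
}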
groan... that's a lot of really finicky overhaul for a few seconds, assuming I can get it to work - that would involve ripping out virtually all the std::string usage.
<shrug> would be a REALLY good real use case to learn about performance, but certainly up to you
this is applicable everywhere in the code, and particularly relevant given how much you like to use std::string and the stl containers.
also it doesn't get any simpler than this.
but you're also right that it's certainly not necessary for this unit test, just rare to have learning opportunities isolated like this.
for what it's worth:
std::string fbuff((char *)ifile->buf); // A) this potentially reads entire file from disk, mallocs memory, and copies the entire memory span
std::istringstream fs(fbuff); // B) this potentially mallocs more memory and copies memory, albeit typically in blocksizes but could result in a full allocation duplicate of fbuff
...
while (std::getline(fs, sline)) ... // C) this necessarily causes an allocation, albeit typically only one blocksize worth
<nod> - there's also some risk that the regex_match calls may malloc under the hood if I feed them const char * inputs - not sure about that
gcc and llvm have optimizations that try to defer the allocations in A and B, but they're not guaranteed or portably reliable
could be wrong, but I don't think so
regfree() releases things allocated by regcomp()
The C library probably doesn't - was thinking about the C++11 implementations.
I'd be surprised if they do too... that's old behavior, there's no need for it to allocate unless you do something like request the match in a new object
which you definitely don't need for this
I mean the implementation certainly could under the hood if they wanted, but that's true of the C version and nearly any API that doesn't explicitly say they don't make syscalls.
@Sean Is r75802 headed in the direction you were suggesting?
yeah, that's definitely in the right direction! awesome. fast study!
you could make pos_to_line_num a bit faster as you don't need to calculate offset. can just check NULL result.
the fastest version of this probably involves carrying the scan position forward start-to-end so you can accumulate instead of parsing all the way back to the start repeatedly. still should be really fast, but O(n) vs O(n^2) in the worst case
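something along these lines, maybe (a sketch -- the names are made up):
// carry the scan position and line count forward between lookups so each
// byte of the buffer is examined once; call with offsets in increasing order
struct line_counter {
    const char *buf;   /* start of the file buffer */
    size_t pos;        /* bytes already counted */
    size_t line;       /* 1-based line number at 'pos' */
};

static size_t
line_at(struct line_counter *lc, size_t match_pos)
{
    for (; lc->pos < match_pos; lc->pos++)
        if (lc->buf[lc->pos] == '\n')
            lc->line++;
    return lc->line;
}

/* per file: struct line_counter lc = { (const char *)mf->buf, 0, 1 }; */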
you also might not gain much from the bio.h precheck, might even be slowing it down, but that's easy enough to test and observe.
@Sean here's the weird part - compared to the prior commit, this version is slower on FreeBSD
the only additional work is the repeated backscanning, so pos_to_line_num() could be a bigger issue than I gave it credit for. not that surprising -- it's going to be more sensitive to cpu. I can try a profile if you want me to dig deeper, or install the profiling tools if they're not there if you want to give it a go.
I'll fiddle with it a bit first, if you think that's the likely source. I don't trust FreeBSD's regex performance, but my code is the likelier issue.
Would it be worth making a libbu function in the bu/mapped_files.h API that wraps the "correct" position to line number translation, once that's done? Would be a waste to have to re-invent it down the road if we're going to use mapped file more.
I'm not quite following what you're proposing about not calculating the offset. Don't I have to know that to know when to stop accumulating line counts?
perf on linux puts most of the time in the bu_open_mapped_file command, but it's also much faster than FreeBSD so that's probably not a helpful indication.
don't think it belongs with mapped files, but could exist as some sort of helper function elsewhere, bu_str_ maybe. I'd wait until there's a second use case. More importantly, though -- the faster version of this is probably going to require a different function signature, different args.
hm, thinking about it some more, there's potential for a really big performance boost by only loading the top of the file. that works for the header checks but I guess wouldn't be so great for some of the other checks.
I'm only checking the first 500 lines for some of the checks...
Even with the C++-isms I'm sub-second on my ubuntu box, so I'm going to go ahead and wire it in - if nothing else we should be able to get rid of that 600 second timeout special case...
cool.
Hah - taught it to look for bu_setprogname in files with main()
Not totally debugged yet - it's not catching them all - but progress
Ah, right - int may be on a different line. That's more like it - 303 that don't call it
<rolls up sleeves>
/me peels eyeballs off the monitor and heads for bed
@starseeker it looks like you changed some of the repository regular expressions, was there a problem?
Not really - adjusted slightly what I was checking at which times, mostly
Had to un-quote some stuff as well
just noticing some of them will match differently now
which ones?
not sure if it was intentional or fixing a mismatch
looks like you fixed one of the ones changed in 75812
I think in one or two cases I simplified while testing, I may have forgotten to put back one or two of the more specific regex strings
I know that was one of those cases, there may be others I missed...
matching space chars is rather different from matching [[:space:]] ... :)
that's not a simplification
Heh - I mean simplifying the regex expression while I was debugging setting them up in C++
When translating the longer ones, a single-char misstep could wipe out badly - if I couldn't see what the problem was, the first step was to take it back down to something I could readily parse by eye to make sure I hadn't made some more basic mistake.
I introduced each problem deliberately into my source tree to detect them, so there is some positive evidence they'll catch things
(runs on Windows in <10s, btw)
in doing so, I think some matching behavior has changed. could be wrong, but the regular expressions for the function matches and platform checks look different. some less aggressive, some more aggressive.
great that it's running fast
The one behavior I know has changed is that platform symbols are counted per-symbol, not per line - so some of the lines with WIN32 and CYGWIN on one line, for example, are now registering 2 counts instead of 1
you're also catching mixed case now, which it wasn't doing before (not saying that's good or bad, just different)
you mean the platform symbols? I thought that was still matching just all upper and all lower...
can I make a request on the function tests -- can you make it so the things to match are only on one line?
maybe I misread
/me checks
Oh, you mean the lists of specific functions and platforms? sure
list of platforms is fine, they're one per line
functions are on two lines
worried that one will get updated and not the other over time
@Sean where are you looking? not following...
api_func_strs
oh, wait I see what you're doing now
Right - the static array. Are you referring to the api_exemptions?
no, I thought strings were being combined, but I get it now