#94 closed defect (released)
remotetest netstats1 fails
| Reported by: | Richard Boulton | Owned by: | Richard Boulton |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | Backend-Remote | Version: | SVN trunk |
| Severity: | normal | Keywords: | |
| Cc: | Olly Betts | Blocked By: | |
| Blocking: | | Operating System: | Linux |
Description
Running the testsuite on my development box (Ubuntu Dapper, uname -a = "Linux scary 2.6.15-27-k7 #1 SMP PREEMPT Sat Sep 16 02:35:20 UTC 2006 i686 GNU/Linux"), I get 1 failure with remotetest for the netstats1 test case. Rerunning with -v, I get:
```
Running test 'remotetest -v netstats1' under valgrind
Running test: netstats1...
/home/richard/private/Working/xapian/xapian-core/tests/remotetest.cc:261:
((mset) == (mset_alllocal))
Expected `mset' and `mset_alllocal' to be equal: were
Xapian::MSet(Xapian::MSet::Internal(firstitem=0, matches_lower_bound=7,
matches_estimated=7, matches_upper_bound=7, max_possible=2.2339228546726124236,
max_attained=1.445962071042388164, Xapian::MSetItem(7, 1.445962071042388164, ),
Xapian::MSetItem(3, 1.4140112748017070743, ), Xapian::MSetItem(1,
1.3747698831232337824, ), Xapian::MSetItem(5, 1.1654938419498412916, ),
Xapian::MSetItem(9, 1.1654938419498412916, ), Xapian::MSetItem(4,
1.1543806706320836053, ), Xapian::MSetItem(2, 0.12268031290495592933, ))) and
Xapian::MSet(Xapian::MSet::Internal(firstitem=0, matches_lower_bound=7,
matches_estimated=7, matches_upper_bound=7, max_possible=2.2339228546726124236,
max_attained=1.445962071042388164, Xapian::MSetItem(7, 1.445962071042388164, ),
Xapian::MSetItem(3, 1.4140112748017070743, ), Xapian::MSetItem(1,
1.3747698831232337824, ), Xapian::MSetItem(5, 1.1654938419498412916, ),
Xapian::MSetItem(9, 1.1654938419498412916, ), Xapian::MSetItem(4,
1.1543806706320836053, ), Xapian::MSetItem(2, 0.12268031290495594321, )))
```
As far as I can tell, the only difference between the expected and actual output is the last 4 digits of the last MSetItem in the MSets: 2933 for the remote case, and 4321 for the local case.
I'm guessing that this is a serialise-double issue, but I'll try and investigate more later. Just wanted to log it first.
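Since the two weights differ only in their last few decimal digits, the two doubles are presumably only a bit or so apart. A quick standalone check (a hypothetical sketch, not part of the test suite; the literals are just the two values quoted above) would be to compare their bit patterns directly:
```cpp
// Hypothetical sketch: compare the "remote" and "local" weights bit-for-bit.
// For two positive doubles, the difference between their bit patterns is the
// distance in units in the last place (ulps).
#include <cstdio>
#include <cstring>
#include <stdint.h>

int main() {
    double remote = 0.12268031290495592933;
    double local  = 0.12268031290495594321;
    uint64_t r, l;
    std::memcpy(&r, &remote, sizeof(r));
    std::memcpy(&l, &local, sizeof(l));
    std::printf("remote bits: %016llx\nlocal  bits: %016llx\ndistance: %llu ulp(s)\n",
                (unsigned long long)r, (unsigned long long)l,
                (unsigned long long)(l > r ? l - r : r - l));
    return 0;
}
```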
Change History (15)
comment:1 by , 18 years ago
Status: new → assigned
comment:2 by , 18 years ago
Owner: changed
Status: assigned → new
comment:3 by , 18 years ago
Cc: added
comment:4 by , 18 years ago
Putting the numbers into internaltest doesn't show the problem for me either. Odd... Maybe it's nothing to do with the serialising. I'll investigate further.
comment:5 by , 18 years ago
Did you check to see if the printed numbers have the full precision? I think they're just printed using iostream's default precision, so that may be rounding off the lowest digits.
Incidentally, I just checked with vimdiff and it is indeed only those 4 digits which differ.
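For reference, forcing the output precision explicitly is straightforward; here's a minimal standalone sketch (an assumption about how one might check, not the harness's own printing code):
```cpp
// Minimal sketch: print a double with enough significant digits that the
// output round-trips, rather than relying on iostream's default of 6.
#include <iostream>
#include <iomanip>
#include <limits>

int main() {
    double w = 0.12268031290495594321;  // the "local" weight quoted in the report
    std::cout << std::setprecision(std::numeric_limits<double>::digits10 + 2)  // 17 digits
              << w << std::endl;
    return 0;
}
```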
comment:6 by , 18 years ago
Status: new → assigned
comment:7 by , 18 years ago
Interestingly, this failure only happens under valgrind. If I disable valgrind (by doing "VALGRIND= ./runtest ./remotetest netstats1") the test passes, whereas "./runtest ./remotetest netstats1" fails. Valgrind doesn't report any errors though, so maybe it's a bug in valgrind itself.
I'm using "valgrind-3.1.0-Debian"; I'll have a go with a newer release soon.
comment:8 by , 18 years ago
Valgrind's FP emulation (or simulation or whatever the best word is) isn't exact:
http://article.gmane.org/gmane.comp.debugging.valgrind/3108
Essentially it does everything in 64 bit precision, whereas x86 naturally gives you excess precision (80 bits) in FP registers. So if this is the cause, it suggests the problem could also manifest on non-x86, or if SSE maths is used.
Or it could be a valgrind bug I guess, but that's probably less likely.
Does compiling with "-ffloat-store" enable you to reproduce the bug without valgrind?
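To make the 80-bit vs 64-bit point concrete, here's an illustrative standalone sketch (an assumption about the mechanism, not the weight calculation Xapian actually performs); on 32-bit x86/Linux, long double maps onto the x87 80-bit format:
```cpp
// Illustrative only: the same product carried at x87 register precision
// (80 bits) versus rounded to a 64-bit double, which is effectively what
// valgrind's FP emulation (or -ffloat-store) does on every operation.
#include <iostream>
#include <iomanip>

int main() {
    double a = 1.0 / 3.0;
    double b = 0.12268031290495594321;            // the "local" weight from the report
    long double in_register = (long double)a * b; // extended-precision intermediate
    double in_memory = a * b;                     // rounded to 64 bits on store
    std::cout << std::setprecision(25)
              << "80-bit intermediate: " << in_register << '\n'
              << "64-bit intermediate: " << in_memory << std::endl;
    return 0;
}
```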
comment:9 by , 18 years ago
(A new version of valgrind makes no difference.)
Unravelling runtest further: if I disable valgrind it just calls "./remotetest netstats1", but if I enable valgrind it calls "/bin/sh ./../libtool --mode=execute /usr/bin/valgrind --log-fd=255 ./remotetest netstats1 -v".
I'm not quite sure why libtool is needed in one case, but not in the other. Running "/usr/bin/valgrind ./remotetest netstats1 -v" passes, with valgrind reporting no errors, and "../libtool --mode=execute ./remotetest netstats1 -v" also passes. However, "../libtool --mode=execute /usr/bin/valgrind ./remotetest netstats1 -v" fails the test, and valgrind reports possible memory leaks:
```
==22328== 355 bytes in 11 blocks are possibly lost in loss record 3 of 3
==22328==    at 0x401B7F3: operator new(unsigned) (vg_replace_malloc.c:164)
==22328==    by 0x4380489: std::string::_Rep::_S_create(unsigned, unsigned, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.7)
==22328==    by 0x438094F: (within /usr/lib/libstdc++.so.6.0.7)
==22328==    by 0x4380A5A: std::string::string(char const*, unsigned, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.7)
==22328==    by 0x40F8E2D: static_initialization_and_destruction_0(int, int) (flint_postlist.cc:33)
==22328==    by 0x42580E8: (within /home/richard/private/Working/xapian/build-debug/xapian-core/.libs/libxapian.so.11.1.0)
==22328==    by 0x4052B44: (within /home/richard/private/Working/xapian/build-debug/xapian-core/.libs/libxapian.so.11.1.0)
==22328==    by 0x400B2AA: (within /lib/ld-2.3.6.so)
==22328==    by 0x400B35C: (within /lib/ld-2.3.6.so)
==22328==    by 0x40007CE: (within /lib/ld-2.3.6.so)
```
Will running under libtool cause valgrind to do something different with sub-processes, or something like that?
comment:10 by , 18 years ago
Using libtool --mode=execute means that we run valgrind on the actual binary (i.e. .libs/remotetest or something like that). If you run "valgrind ./remotetest" you just valgrind /bin/sh since remotetest is a shell script wrapper, and valgrind doesn't trace child processes by default.
The shell script wrapper does much the same magic as libtool --mode=execute so running the latter on the former isn't likely to make a difference to the results.
"Possibly lost" (IIRC) means there's a pointer into the block, but none to its start. There's an environment variable you can set to disable the GNU C++ STL's memory pooling (GLIBCXX_FORCE_NEW - see HACKING for details) which runtest in SVN HEAD will now set for you automatically. Maybe that's why you are seeing these reports when you run valgrind "by hand"?
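For what it's worth, a minimal illustration of the "possibly lost" category (a hypothetical example, nothing to do with the Xapian code in the report) is a heap block that is only reachable through an interior pointer at exit:
```cpp
// Hypothetical example: valgrind's leak checker reports a block as
// "possibly lost" when the only remaining pointer points into the middle
// of the block rather than at its start.
#include <cstdlib>

char *inner;                 // global, so still live when the program exits

int main() {
    char *block = static_cast<char *>(std::malloc(64));
    inner = block + 8;       // keep only an interior pointer...
    block = 0;               // ...and drop the pointer to the start
    return 0;                // valgrind --leak-check=yes: "64 bytes ... possibly lost"
}
```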
comment:11 by , 18 years ago
Okay, compiling with "-ffloat-store" stops the test failing whether valgrind is used or not.
My current theory is that the test was failing, as Olly suggested, because valgrind's floating point arithmetic differs from my processor's. For the remote test, the remote end of the connection isn't run under valgrind, so the differences cause the test to fail. Therefore, if I can make the remote server run under valgrind too, the test should pass again; and as a side benefit, we'll check for memory bugs in the remote server. So, I'll try to get that to happen, and see if that fixes things.
comment:12 by , 18 years ago
That explanation seems fairly plausible.
IIRC, the remote server is run by code in testsuite/backendmanager.cc, but simply changing runtest to run valgrind with --trace-children=yes will probably do the trick for less effort.
comment:13 by , 18 years ago
Indeed, changing runtest to pass --trace-children=yes to valgrind makes the test pass. However, it also makes the test take _much_ longer, so I'll modify backendmanager to invoke valgrind for just the server process.
comment:15 by , 18 years ago
Operating System: → Linux
Resolution: fixed → released
I don't see this on ixion unfortunately - if I copy the "local" double value and plug it into test_serialisedouble1 in tests/internaltest.cc, then it's converted correctly. Perhaps you could try doing the same - just stick it into the array of double constants, recompile, and re-run. Might be worth printing out the local value with an explicitly large precision in case it has been rounded on output.
I saw similar issues during testing which seemed to be due to x86 having more precision on FP registers than is stored in a double in memory. Perhaps the compiler in dapper generates different code to sarge and this causes a problem somehow. Unfortunately my dapper box is x86_64 (hmm, I should see if there's a 32 bit compiler package...)
The other possibility that comes to mind is that we don't actually generate quite enough bytes in the base 256 mantissa, which is limited either by having no remainder left or by having generated N bytes - I calculated that N should be enough for the standard IEEE double most platforms use, but maybe I was wrong. However, if this is the case, why don't I see it on x86?
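As a sanity check on that arithmetic, here's a hedged sketch (illustrative only; not the code in serialise-double.cc) of how many base-256 digits an IEEE double's mantissa needs:
```cpp
// Back-of-the-envelope check: an IEEE 754 double has 53 significant bits
// (including the implicit leading bit), so ceil(53 / 8) = 7 base-256 digits
// are enough to hold the full mantissa.
#include <iostream>
#include <limits>

int main() {
    const int mant_bits = std::numeric_limits<double>::digits;   // 53 for IEEE double
    const int bytes_needed = (mant_bits + 7) / 8;                 // 7
    std::cout << mant_bits << " mantissa bits -> "
              << bytes_needed << " base-256 digit(s) needed" << std::endl;
    return 0;
}
```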
Feel free to reassign this bug to yourself if you're likely to be looking at it...