Opened 18 years ago

Closed 18 years ago

Last modified 18 years ago

#94 closed defect (released)

remotetest netstats1 fails

Reported by: Richard Boulton
Owned by: Richard Boulton
Priority: normal
Milestone:
Component: Backend-Remote
Version: SVN trunk
Severity: normal
Keywords:
Cc: Olly Betts
Blocked By:
Blocking:
Operating System: Linux

Description

Running the testsuite on my development box (Ubuntu Dapper, uname -a = "Linux scary 2.6.15-27-k7 #1 SMP PREEMPT Sat Sep 16 02:35:20 UTC 2006 i686 GNU/Linux"), I get 1 failure with remotetest for the netstats1 test case. Rerunning with -v, I get:

Running test 'remotetest -v netstats1' under valgrind
Running test: netstats1...
/home/richard/private/Working/xapian/xapian-core/tests/remotetest.cc:261: ((mset) == (mset_alllocal))
Expected `mset' and `mset_alllocal' to be equal: were
Xapian::MSet(Xapian::MSet::Internal(firstitem=0, matches_lower_bound=7, matches_estimated=7, matches_upper_bound=7, max_possible=2.2339228546726124236, max_attained=1.445962071042388164,
  Xapian::MSetItem(7, 1.445962071042388164, ),
  Xapian::MSetItem(3, 1.4140112748017070743, ),
  Xapian::MSetItem(1, 1.3747698831232337824, ),
  Xapian::MSetItem(5, 1.1654938419498412916, ),
  Xapian::MSetItem(9, 1.1654938419498412916, ),
  Xapian::MSetItem(4, 1.1543806706320836053, ),
  Xapian::MSetItem(2, 0.12268031290495592933, )))
and
Xapian::MSet(Xapian::MSet::Internal(firstitem=0, matches_lower_bound=7, matches_estimated=7, matches_upper_bound=7, max_possible=2.2339228546726124236, max_attained=1.445962071042388164,
  Xapian::MSetItem(7, 1.445962071042388164, ),
  Xapian::MSetItem(3, 1.4140112748017070743, ),
  Xapian::MSetItem(1, 1.3747698831232337824, ),
  Xapian::MSetItem(5, 1.1654938419498412916, ),
  Xapian::MSetItem(9, 1.1654938419498412916, ),
  Xapian::MSetItem(4, 1.1543806706320836053, ),
  Xapian::MSetItem(2, 0.12268031290495594321, )))

As far as I can tell, the only difference between the expected and actual output is the last 4 digits of the last MSetItem in the MSets: 2933 for the remote case, and 4321 for the local case.

I'm guessing that this is a serialise-double issue, but I'll try and investigate more later. Just wanted to log it first.

Change History (15)

comment:1 by Olly Betts, 18 years ago

Status: new → assigned

I don't see this on ixion unfortunately - if I copy the "local" double value and plug it into test_serialisedouble1 in tests/internaltest.cc, then it's converted correctly. Perhaps you could try doing the same - just stick it into the array of double constants, recompile, and re-run. Might be worth printing out the local value with an explicitly large precision in case it has been rounded on output.
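
Something along these lines, say (a sketch only; the array in test_serialisedouble1 may be named differently, and "test_values" here is just a stand-in):

    // In tests/internaltest.cc, append the suspect value to the array of
    // constants which test_serialisedouble1() round-trips through the
    // serialise/unserialise code ("test_values" is an illustrative name).
    static const double test_values[] = {
        // ... existing constants ...
        0.12268031290495594321, // the "local" value from this failure
    };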

I saw similar issues during testing which seemed to be due to x86 having more precision on FP registers than is stored in a double in memory. Perhaps the compiler in dapper generates different code to sarge and this causes a problem somehow. Unfortunately my dapper box is x86_64 (hmm, I should see if there's a 32 bit compiler package...)

The other possibility that comes to mind is that we don't actually generate quite enough bytes in the base 256 mantissa, which is limited by either having no remainder or by having generated N bytes - I calculated N should be enough for the standard IEEE double most platforms use, but maybe I was wrong. However if this is the case, why don't I see it on x86?
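
For reference, the generation loop is conceptually something like this (a simplified sketch, not the actual Xapian code; the function name and byte limit are illustrative):

    // frexp() splits v into m * 2^e with 0.5 <= |m| < 1, then base 256
    // digits are emitted until the remainder is zero or max_bytes is hit
    // (sign handling omitted).  An IEEE 754 double carries a 53 bit
    // mantissa, which needs ceil(53 / 8) = 7 bytes, so an undersized
    // max_bytes would silently drop the low-order bits - the failure
    // mode speculated about above.
    #include <cmath>
    #include <string>

    std::string encode_mantissa(double v, int& exponent, int max_bytes = 7) {
        std::string out;
        double m = std::frexp(std::fabs(v), &exponent);
        while (m != 0.0 && int(out.size()) < max_bytes) {
            m *= 256.0;
            double digit = std::floor(m);
            out += char(static_cast<unsigned char>(digit));
            m -= digit;
        }
        return out;
    }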

Feel free to reassign this bug to yourself if you're likely to be looking at it...

comment:2 by Richard Boulton, 18 years ago

Owner: changed from Olly Betts to Richard Boulton
Status: assigned → new

comment:3 by Olly Betts, 18 years ago

Cc: olly@… added

comment:4 by Richard Boulton, 18 years ago

Putting the numbers into internaltest doesn't show the problem for me either. Odd... Maybe it's nothing to do with the serialising. I'll investigate further.

comment:5 by Olly Betts, 18 years ago

Did you check to see if the printed numbers have the full precision? I think they're just printed using iostream's default precision, so that may be rounding off the lowest digits.
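
For instance, a standalone check like this would show whether digits are being hidden (a sketch; 25 digits is comfortably more than the ~17 significant decimal digits a 64 bit double can hold):

    #include <iostream>
    #include <iomanip>

    int main() {
        double local = 0.12268031290495594321; // "local" value from the report
        std::cout << local << '\n';                          // default precision (6)
        std::cout << std::setprecision(25) << local << '\n'; // full precision
        return 0;
    }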

Incidentally, I just checked with vimdiff and it is indeed only those 4 digits which differ.

comment:6 by Richard Boulton, 18 years ago

Status: new → assigned

comment:7 by Richard Boulton, 18 years ago

Interestingly, this failure only happens under valgrind. If I disable valgrind (by doing "VALGRIND= ./runtest ./remotetest netstats1") the test passes, whereas "./runtest ./remotetest netstats1" fails. Valgrind doesn't report any errors though, so maybe it's a bug in valgrind itself.

I'm using "valgrind-3.1.0-Debian"; I'll have a go with a newer release soon.

comment:8 by Olly Betts, 18 years ago

Valgrind's FP emulation (or simulation or whatever the best word is) isn't exact:

http://article.gmane.org/gmane.comp.debugging.valgrind/3108

Essentially it does everything in 64 bit precision, whereas x86 naturally gives you excess precision (80 bits) in FP registers. So if this is the cause, it suggests the problem could also manifest on non-x86, or if SSE maths is used.

Or it could be a valgrind bug I guess, but that's probably less likely.

Does compiling with "-ffloat-store" enable you to reproduce the bug without valgrind?
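
Something like this can make the excess precision visible (illustrative only - whether the two lines actually differ depends on the compiler, optimisation flags, and whether x87 or SSE maths is in use):

    // On 32 bit x86 with x87 maths, "b" may be computed entirely in an
    // 80 bit register, while the volatile store forces "t" to be rounded
    // to a 64 bit double first.  With -ffloat-store (or under valgrind)
    // every intermediate is rounded, so both paths should agree.
    #include <cstdio>

    int main() {
        double a = 1.0 / 3.0;
        double b = a * 3.0 - 1.0;    // may be evaluated with excess precision
        volatile double t = a * 3.0; // forced 64 bit store
        double c = t - 1.0;
        std::printf("register path: %.20g\nstored path:   %.20g\n", b, c);
        return 0;
    }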

comment:9 by Richard Boulton, 18 years ago

(A new version of valgrind makes no difference.)

Unravelling runtest further: if I disable valgrind it just calls "./remotetest netstats1", but if I enable valgrind it calls "/bin/sh ./../libtool --mode=execute /usr/bin/valgrind --log-fd=255 ./remotetest netstats1 -v".

I'm not quite sure why libtool is needed in one case, but not in the other. Running "/usr/bin/valgrind ./remotetest netstats1 -v" passes, with valgrind reporting no errors, and "../libtool --mode=execute ./remotetest netstats1 -v" also passes. However, "../libtool --mode=execute /usr/bin/valgrind ./remotetest netstats1 -v" fails the test, and valgrind reports possible memory leaks:

==22328== 355 bytes in 11 blocks are possibly lost in loss record 3 of 3
==22328==    at 0x401B7F3: operator new(unsigned) (vg_replace_malloc.c:164)
==22328==    by 0x4380489: std::string::_Rep::_S_create(unsigned, unsigned, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.7)
==22328==    by 0x438094F: (within /usr/lib/libstdc++.so.6.0.7)
==22328==    by 0x4380A5A: std::string::string(char const*, unsigned, std::allocator<char> const&) (in /usr/lib/libstdc++.so.6.0.7)
==22328==    by 0x40F8E2D: static_initialization_and_destruction_0(int, int) (flint_postlist.cc:33)
==22328==    by 0x42580E8: (within /home/richard/private/Working/xapian/build-debug/xapian-core/.libs/libxapian.so.11.1.0)
==22328==    by 0x4052B44: (within /home/richard/private/Working/xapian/build-debug/xapian-core/.libs/libxapian.so.11.1.0)
==22328==    by 0x400B2AA: (within /lib/ld-2.3.6.so)
==22328==    by 0x400B35C: (within /lib/ld-2.3.6.so)
==22328==    by 0x40007CE: (within /lib/ld-2.3.6.so)

Will running under libtool cause valgrind to do something different with sub-processes, or something like that?

comment:10 by Olly Betts, 18 years ago

Using libtool --mode=execute means that we run valgrind on the actual binary (i.e. .libs/remotetest or something like that). If you run "valgrind ./remotetest" you just valgrind /bin/sh since remotetest is a shell script wrapper, and valgrind doesn't trace child processes by default.

The shell script wrapper does much the same magic as libtool --mode=execute so running the latter on the former isn't likely to make a difference to the results.

Possibly lost (IIRC) means there's a pointer to inside the block, but none to the start. There's an environment variable you can set to disable the GNU C++ STL's memory pooling (GLIBCXX_FORCE_NEW - see HACKING for details) which runtest in SVN HEAD will now set for you automatically. Maybe that's why you are seeing these reports when you run valgrind "by hand"?
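
For example (assuming a Bourne-style shell), running "GLIBCXX_FORCE_NEW=1 ../libtool --mode=execute /usr/bin/valgrind ./remotetest netstats1 -v" should make those "possibly lost" blocks disappear if the pooling allocator is what's being reported.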

comment:11 by Richard Boulton, 18 years ago

Okay, compiling with "-ffloat-store" stops the test failing whether valgrind is used or not.

My current theory is that the test was failing, as Olly suggested, because valgrind's floating point arithmetic differs from my processor's. For the remote test, the remote end of the connection isn't run under valgrind, so the two ends compute slightly different weights and the test fails. Therefore, if I can make the remote server run under valgrind too, the test should pass again; and as a side benefit, we'll check for memory bugs in the remote server. So I'll try to get that to happen and see if it fixes things.

comment:12 by Olly Betts, 18 years ago

That explanation seems fairly plausible.

IIRC, the remote server is run by code in testsuite/backendmanager.cc, but simply changing runtest to run valgrind with --trace-children=yes will probably do the trick for less effort.
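
i.e. something like "/usr/bin/valgrind --log-fd=255 --trace-children=yes ./remotetest netstats1 -v" in place of the current invocation (the exact option list runtest uses may differ).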

comment:13 by Richard Boulton, 18 years ago

Indeed, changing runtest to pass --trace-children=yes to valgrind makes the test pass. However, it also makes the test take _much_ longer, so I'll modify backendmanager to invoke valgrind for just the server process.

comment:14 by Richard Boulton, 18 years ago

Resolution: fixed
Status: assigned → closed

Fixed by revision 7291

comment:15 by Olly Betts, 18 years ago

Operating System: Linux
Resolution: fixed → released