Opened 13 years ago
Closed 9 years ago
#553 closed defect (fixed)
Failed test bigoaddvalue1 on Solaris 9 i386
Reported by: | Dagobert Michelsen | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.4 |
Component: | Test Suite | Version: | 1.2.6 |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | Solaris |
Description
I have a failed test on Solaris 9 i386 with Sun Studio 12:
Running test: bigoaddvalue1... FAILED
Test with 5000 repetitions took 0.13 secs
Test with 50000 repetitions took 41.4 secs
harness/scalability.cc:46: (time10) < (time1 * threshold)
Evaluates to: 41.4 < 5.811
On Solaris 9 Sparc the testsuite runs cleanly on both 32 and 64 bit.
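For readers unfamiliar with the failure message, the comparison it prints can be sketched as below. This is only an illustration of the inequality shown in the output, not the code in harness/scalability.cc; the function name and the idea of passing the threshold as a parameter are assumptions.

```cpp
// Hypothetical helper mirroring the inequality printed by the harness:
//   (time10) < (time1 * threshold)
// time1  = CPU seconds for the smaller run (5000 repetitions above)
// time10 = CPU seconds for the run that is ten times larger
// The real threshold value is not stated in this ticket.
bool scales_acceptably(double time1, double time10, double threshold) {
    return time10 < time1 * threshold;
}
```

With the figures from the report, time1 * threshold evaluated to 5.811 while time10 was 41.4, so the check fails. Dividing 5.811 by the printed 0.13 suggests a threshold of roughly 44.7, but the printed times are rounded, so that figure is only approximate.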
Attachments (1)
Change History (18)
comment:1 by , 13 years ago
comment:2 by , 13 years ago
Could you attach the file config.h which was generated by configure when you built xapian-core?
Also, is this a repeatable failure? You can rerun just that testcase with:
./runtest ./apitest -bbrass bigoaddvalue1
by , 13 years ago
comment:3 by , 13 years ago
I had one other run:
Running test: bigoaddvalue1... FAILED
Test with 5000 repetitions took 0.01 secs
Test with 50000 repetitions took 0.55 secs
I cannot reproduce it at the moment running it 4 times in a row. The OS is running on a vSphere farm without reservation. Probably there was a load spike when I ran the previous two tests causing the error. I guess this is not an error of Xapian...
Thanks! -- Dago
comment:4 by , 13 years ago
Failed again on a full "gmake check":
Running test: bigoaddvalue1... FAILED
Test with 5000 repetitions took 0.09 secs
Test with 50000 repetitions took 4.26 secs
Two consecutive runs with the test-specific run you gave above worked. This is pretty strange.
comment:5 by , 13 years ago
OK, config.h has:
#define HAVE_GETRUSAGE 1
So we should be using getrusage() to get the CPU time used for this test, which should make it much less sensitive to load spikes from other processes (on some platforms, the testsuite falls back to just measuring elapsed time, which is more problematic).
It's really unlikely to be a bug in the library code, or else we'd probably see it elsewhere. But it's arguably a bug in the testsuite harness that it can fail due to unrelated activity on the machine.
You could try excluding system time from the measurement by modifying tests/harness/cputimer.cc and removing r.ru_stime.tv_sec from line 58, and making the similar change on the next line.
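As a rough illustration of the measurement being discussed, the sketch below reads the process CPU time with getrusage() and optionally leaves out system time. It is not the contents of tests/harness/cputimer.cc; the function name and the include_system flag are made up for this example.

```cpp
#include <sys/resource.h>
#include <sys/time.h>

// Hypothetical helper, not the harness code: returns the CPU seconds used
// by this process so far, or -1.0 if getrusage() fails.
double cpu_seconds(bool include_system) {
    struct rusage r;
    if (getrusage(RUSAGE_SELF, &r) != 0) return -1.0;
    // User time: CPU spent executing the process's own code.
    double t = r.ru_utime.tv_sec + r.ru_utime.tv_usec * 1e-6;
    if (include_system) {
        // System time: CPU spent in the kernel on behalf of the process.
        t += r.ru_stime.tv_sec + r.ru_stime.tv_usec * 1e-6;
    }
    return t;
}
```

Calling cpu_seconds(false) before and after a test run and subtracting gives the user-only CPU time, which corresponds to the suggested experiment of dropping the ru_stime terms.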
comment:6 by , 13 years ago
Taking out r.ru_stime.tv_sec from tests/harness/cputimer.cc unfortunately does not work reliably. I had two failures in about 50 runs:
Running tests with backend "brass"...
Running test: bigoaddvalue1... FAILED
Test with 5000 repetitions took 0.02 secs
Test with 50000 repetitions took 1.02 secs
...
Running tests with backend "brass"...
Running test: bigoaddvalue1... FAILED
Test with 5000 repetitions took 0.02 secs
Test with 50000 repetitions took 0.95 secs
Unfortunately I was unable to track down the error; after enabling more system statistics it hasn't happened again (yet).
Now that I know it can fail sporadically, I would like to temporarily disable this specific test during check. Is there some variable I can set to skip specific tests?
comment:7 by , 13 years ago
Not sure what's going on. I guess it might be some cache effect where the 5000 run fits in some cache but the 50000 doesn't.
The easiest way to disable a single testcase is just to add this at the start of its code:
SKIP_TEST("disabled");
The string is purely informational, so put what you like there.
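To show where that line would go, here is a self-contained sketch. The SKIP_TEST macro, the TestSkipped type and the function below are stand-ins defined only for this example; the real macro comes from the Xapian test harness and may be implemented differently, and the real testcase is declared with the harness's own macros.

```cpp
#include <string>

// Stand-in for the harness's skip mechanism (an assumption for this sketch).
struct TestSkipped {
    std::string reason;
};
#define SKIP_TEST(MSG) throw TestSkipped{MSG}

// Hypothetical testcase body, not the real bigoaddvalue1.
bool test_bigoaddvalue1() {
    SKIP_TEST("disabled");  // first statement, so nothing below it runs
    // ... the timing code of the testcase would follow here ...
    return true;
}
```

In this sketch the skip is reported by throwing, so the caller can tell a skipped test apart from a passed or failed one.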
comment:8 by , 13 years ago
Just a quick update: the error still occurs in exactly the same way in 1.2.7.
comment:9 by , 13 years ago
Hmm, I just tried to build on the opencsw "current9x" machine. I configured like so:
PATH=/usr/ccs/bin:$PATH ./configure CXX=CC
gmake
gmake check
And CC -V says: CC: Sun C++ 5.9 SunOS_i386 Patch 124864-27 2011/08/09
But I have to patch tests/soaktest/soaktest.cc to get it to compile (<cstdlib> -> <stdlib.h>). Also various tests fail, for example:
Running test: stubdb2... NetworkError: Received EOF (context: remote:prog(../bin/xapian-progsrv .brass/db=apitest_simpledata)
Running test: uuid1... SIGSEGV at d2ef4a
So testing this is kind of hard right now, and it seems we have worse issues (or else I picked a bad compiler version).
I've applied a patch to trunk for the <stdlib.h> issue, but could you let me know where and how you built?
comment:10 by , 13 years ago
You can find the build recipe here:
I needed to add -lCrun to the linker flags due to 0002 (see below). To build I applied three patches:
Here, 0001 is roughly your stdlib.h patch, 0002 is a packaging issue in libtool which needs to be applied for lots of builds and should not be included in your distribution, 0003 makes finding the libtool .la files optional as OpenCSW does not ship them due to general relocation problems when building with DESTDIR.
The total number of build options is quite big as lots of defaults are inherited from the build system.
comment:11 by , 13 years ago
Status: | new → assigned |
---|
Thanks for the recipe pointer - I'll give it a try when it's less late at night.
The 0003 patch is probably better done by adding solaris* to the case statement in configure where it checks if it's OK to force link_all_deplibs_CXX=no:
# Checked: freebsd8.0 openbsd4.6
case $host_os in
  linux* | k*bsd*-gnu | freebsd* | openbsd*)
    dnl Vanilla libtool sets this to "unknown" which it then handles as "yes".
    link_all_deplibs_CXX=no
    ;;
esac
If just patching out xapian-config as in 0003 works, then Solaris presumably must load the dependencies of a library automatically, so the configure change should work, and that will make xapian-config avoid trying to use the .la file there. If you get a chance to try that and it works, let me know and I'll fix it in the next release.
comment:12 by , 13 years ago
I compile cleanly with
./configure CC=/opt/SUNWspro/bin/cc CXX=/opt/SUNWspro/bin/CC CPPFLAGS=-I/opt/csw/include CFLAGS="-xO3 -m32 -xarch=386" CXXFLAGS="-xO3 -m32 -xarch=386" LDFLAGS="-m32 -xarch=386 -norunpath -lCrun -L/opt/csw/lib -R/opt/csw/lib"
A subsequent gmake check may then fail the above test, but not always. I suspect the virtualized environment.
comment:13 by , 13 years ago
I checked and (as I suspected in comment:11) Solaris indeed does link library dependencies automatically, so I've made that change on trunk in r16737 and will backport for 1.2.11. So you shouldn't need the 0003 patch for xapian-core >= 1.2.11.
The 0002 patch is probably worth pushing to libtool upstream. Meanwhile, I think if you use -Wl,-norunpath instead of -norunpath then libtool will pass -norunpath to the linker, and you won't need the patch.
comment:14 by , 10 years ago
comment:15 by , 10 years ago
Milestone: | → 1.3.5 |
---|
Setting a milestone for this, so it doesn't languish forever.
comment:16 by , 9 years ago
Returning to the original report, perhaps we should split out tests like this, which time operations and so might fail under uneven load, etc., into a separate make target, and not run them under make check by default. For auto-builders, tests which occasionally fail are very annoying, and it doesn't add much to the test coverage to be running these everywhere: they're checking that the algorithm used scales in a desirable way, and that algorithm is common to all platforms.
comment:17 by , 9 years ago
Milestone: | 1.3.5 → 1.3.4 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
[8be35f5e1b1753cf83ce3794daf1e4558c94451f] skips timed tests if AUTOMATED_TESTING is set in the environment, so automated builds should just set that. That crudely but effectively deals with this issue, so closing.
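The kind of check this describes might look roughly like the sketch below. It is not the code from that commit; the function name is hypothetical, and treating "set" as "present in the environment, even if empty" is an assumption.

```cpp
#include <stdlib.h>

// Hypothetical helper: true when timed tests should be skipped because
// AUTOMATED_TESTING is present in the environment (any value counts here).
bool timed_tests_disabled() {
    return getenv("AUTOMATED_TESTING") != NULL;
}
```

An automated builder would then export the variable before running the testsuite, for example by invoking gmake check with AUTOMATED_TESTING=1 in the environment.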
One more addition: this is for the "brass" backend:
For other backends the test runs fine.