Opened 13 years ago

Closed 8 years ago

#553 closed defect (fixed)

Failed test bigoaddvalue1 on Solaris 9 i386

Reported by: Dagobert Michelsen
Owned by: Olly Betts
Priority: normal
Milestone: 1.3.4
Component: Test Suite
Version: 1.2.6
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: Solaris

Description

I have a failed test on Solaris 9 i386 with Sun Studio 12:

Running test: bigoaddvalue1... FAILED
Test with 5000 repetitions took 0.13 secs
Test with 50000 repetitions took 41.4 secs
harness/scalability.cc:46: (time10) < (time1 * threshold)
Evaluates to: 41.4 < 5.811

On Solaris 9 Sparc the testsuite runs cleanly on both 32 and 64 bit.
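
For readers unfamiliar with the failing assertion, here is a minimal sketch of the comparison it reports. The function names are hypothetical and this is not the code in harness/scalability.cc; only the expression `(time10) < (time1 * threshold)` and the figures come from the output above. Dividing the printed right-hand side by the first timing suggests a threshold of roughly 5.811 / 0.13 ≈ 44.7, which is approximate because the printed times are rounded.

```cpp
// Hypothetical helper mirroring the reported check:
//   (time10) < (time1 * threshold)
// time1  = CPU seconds for the smaller run (5000 repetitions)
// time10 = CPU seconds for the run with 10x the repetitions (50000)
bool scales_acceptably(double time1, double time10, double threshold) {
    return time10 < time1 * threshold;
}

// Observed slowdown factor when the repetition count grows 10x.
double slowdown(double time1, double time10) {
    return time10 / time1;
}
```

With the reported figures, `slowdown(0.13, 41.4)` is a little over 318, far above the implied threshold, so `scales_acceptably(0.13, 41.4, 44.7)` is false and the test fails.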

Attachments (1)

config.h (7.0 KB) - added by Dagobert Michelsen 13 years ago.

Change History (18)

comment:1 by Dagobert Michelsen, 13 years ago

One more addition: this is for the "brass" backend:

./apitest backend brass: 276 tests passed, 1 failed, 2 skipped.

For other backends the test runs fine.

comment:2 by Olly Betts, 13 years ago

Could you attach the file config.h which was generated by configure when you built xapian-core?

Also, is this a repeatable failure? You can rerun just that testcase with:

./runtest ./apitest -bbrass bigoaddvalue1

by Dagobert Michelsen, 13 years ago

Attachment: config.h added

comment:3 by Dagobert Michelsen, 13 years ago

I had one other run:

Running test: bigoaddvalue1... FAILED                                          
Test with 5000 repetitions took 0.01 secs
Test with 50000 repetitions took 0.55 secs

I cannot reproduce it at the moment, even after running it 4 times in a row. The OS is running on a vSphere farm without reservation. Probably there was a load spike when I ran the previous two tests, causing the error. I guess this is not an error in Xapian...

Thanks! -- Dago

comment:4 by Dagobert Michelsen, 13 years ago

Failed again on a full "gmake check":

Running test: bigoaddvalue1... FAILED                                          
Test with 5000 repetitions took 0.09 secs
Test with 50000 repetitions took 4.26 secs

Two consecutive runs with the test-specific command you gave above passed. This is pretty strange.

comment:5 by Olly Betts, 13 years ago

OK, config.h has:

#define HAVE_GETRUSAGE 1

So we should be using getrusage() to get the CPU time used for this test, which should make it much less sensitive to load spikes from other processes (on some platforms, the testsuite falls back to just measuring elapsed time, which is more problematic).

It's really unlikely to be a bug in the library code, or else we'd probably see it elsewhere. But it's arguably a bug in the testsuite harness that it can fail due to unrelated activity on the machine.

You could try excluding system time from the measurement by modifying tests/harness/cputimer.cc and removing r.ru_stime.tv_sec from line 58, and the similar change on the next line.
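
To make the suggestion concrete, here is a minimal sketch of measuring CPU time with getrusage(), with system time optionally excluded. It is not the code in tests/harness/cputimer.cc; the function name `cpu_seconds` and its `include_system` flag are assumptions made for illustration. Only getrusage(), `ru_utime` and `ru_stime` are standard POSIX.

```cpp
#include <sys/resource.h>

// Returns the CPU time consumed by this process in seconds,
// or -1.0 if getrusage() fails.
// include_system = false counts user time only, which is the
// experiment suggested above (dropping the ru_stime terms).
double cpu_seconds(bool include_system) {
    struct rusage r;
    if (getrusage(RUSAGE_SELF, &r) != 0) return -1.0;
    double t = r.ru_utime.tv_sec + r.ru_utime.tv_usec * 1e-6;
    if (include_system) {
        t += r.ru_stime.tv_sec + r.ru_stime.tv_usec * 1e-6;
    }
    return t;
}
```

A timed section would call `cpu_seconds(false)` before and after the work and subtract the two values. Time the kernel spends on behalf of the process is then ignored.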

comment:6 by Dagobert Michelsen, 13 years ago

Taking out r.ru_stime.tv_sec from tests/harness/cputimer.cc unfortunately does not work reliably. I had two failures in about 50 runs:

Running tests with backend "brass"...
Running test: bigoaddvalue1... FAILED
Test with 5000 repetitions took 0.02 secs
Test with 50000 repetitions took 1.02 secs
...
Running tests with backend "brass"...
Running test: bigoaddvalue1... FAILED
Test with 5000 repetitions took 0.02 secs
Test with 50000 repetitions took 0.95 secs

Unfortunately I was unable to track down the error; after enabling more system statistics it didn't happen again (yet).

As I now know it can fail sporadically, I would like to temporarily disable this specific test during check. Is there some variable I can set to skip specific tests?

comment:7 by Olly Betts, 13 years ago

Not sure what's going on. I guess it might be some cache effect where the 5000 run fits in some cache but the 50000 doesn't.

The easiest way to disable a single testcase is just to add this at the start of its code:

    SKIP_TEST("disabled");

The string is purely informational, so put what you like there.
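
As an illustration of how a skip placed at the start of a testcase behaves, here is a self-contained sketch. Everything in it is a stand-in: the `SKIP_TEST` macro below is NOT Xapian's real definition, and `TestSkip`, `Result`, `run` and the testcase function are hypothetical names. It assumes only that a skip aborts the testcase early and is reported separately from pass and fail.

```cpp
#include <iostream>
#include <stdexcept>

// Stand-in for the harness's skip mechanism (assumption, not Xapian's code).
struct TestSkip : std::runtime_error {
    using std::runtime_error::runtime_error;
};
#define SKIP_TEST(MSG) throw TestSkip(MSG)

enum class Result { Pass, Fail, Skip };

// Hypothetical testcase body with the skip added as its first statement.
bool test_bigoaddvalue1() {
    SKIP_TEST("disabled");
    return true;  // the timing checks would follow here; never reached
}

// Hypothetical runner: a skip is counted separately from pass and fail.
Result run(bool (*test)()) {
    try {
        return test() ? Result::Pass : Result::Fail;
    } catch (const TestSkip& e) {
        std::cout << "SKIPPED: " << e.what() << '\n';
        return Result::Skip;
    }
}
```

Calling `run(test_bigoaddvalue1)` prints the informational string and yields `Result::Skip`, so the testcase lands in the "skipped" count rather than the "failed" one.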

comment:8 by Dagobert Michelsen, 12 years ago

Just a quick update: the error still occurs exactly the same way in 1.2.7

comment:9 by Olly Betts, 12 years ago

Hmm, I just tried to build on the OpenCSW "current9x" machine. I configured like so:

PATH=/usr/ccs/bin:$PATH
./configure CXX=CC
gmake
gmake check

And CC -V says: CC: Sun C++ 5.9 SunOS_i386 Patch 124864-27 2011/08/09

But I have to patch tests/soaktest/soaktest.cc to get it to compile (<cstdlib> -> <stdlib.h>). Also various tests fail, for example:

Running test: stubdb2... NetworkError: Received EOF (context: remote:prog(../bin/xapian-progsrv .brass/db=apitest_simpledata)
Running test: uuid1... SIGSEGV at d2ef4a

So testing this is kind of hard right now, and it seems we have worse issues (or else I picked a bad compiler version).

I've applied a patch to trunk for the <stdlib.h> issue, but could you let me know where and how you built?

comment:10 by Dagobert Michelsen, 12 years ago

You can find the build recipe here:

I needed to add -lCrun to the linker flags due to 0002 (see below). To build I applied three patches:

Here, 0001 is roughly your stdlib.h patch, 0002 is a packaging issue in libtool which needs to be applied for lots of builds and should not be included in your distribution, 0003 makes finding the libtool .la files optional as OpenCSW does not ship them due to general relocation problems when building with DESTDIR.

The total number of build options is quite large, as many defaults are inherited from the build system.

comment:11 by Olly Betts, 12 years ago

Status: new → assigned

Thanks for the recipe pointer - I'll give it a try when it's less late at night.

The 0003 patch is probably better done by adding solaris* to the case statement in configure where it checks if it's OK to force link_all_deplibs_CXX=no:

# Checked: freebsd8.0 openbsd4.6
case $host_os in
  linux* | k*bsd*-gnu | freebsd* | openbsd*)
    dnl Vanilla libtool sets this to "unknown" which it then handles as "yes".
    link_all_deplibs_CXX=no
    ;;
esac

If just patching out xapian-config as in 0003 works, then Solaris presumably must load the dependencies of a library automatically, so the configure change should work, and that will make xapian-config avoid trying to use the .la file there. If you get a chance to try that and it works, let me know and I'll fix it in the next release.

comment:12 by Dagobert Michelsen, 12 years ago

I compile cleanly with

./configure CC=/opt/SUNWspro/bin/cc CXX=/opt/SUNWspro/bin/CC CPPFLAGS=-I/opt/csw/include CFLAGS="-xO3 -m32 -xarch=386" CXXFLAGS="-xO3 -m32 -xarch=386" LDFLAGS="-m32 -xarch=386 -norunpath -lCrun -L/opt/csw/lib -R/opt/csw/lib"

The following

gmake check

then may fail the above test, but not always. I suspect the virtualized environment.

comment:13 by Olly Betts, 12 years ago

I checked and (as I suspected in comment#11) Solaris does indeed link library dependencies automatically, so I've made that change on trunk in r16737 and will backport it for 1.2.11. So you shouldn't need the 0003 patch for xapian-core >= 1.2.11.

The 0002 patch is probably worth pushing to libtool upstream. Meanwhile, I think if you use -Wl,-norunpath instead of -norunpath then libtool will pass -norunpath to the linker, and you won't need the patch.

comment:15 by Olly Betts, 9 years ago

Milestone: 1.3.5

Setting a milestone for this, so it doesn't languish forever.

comment:16 by Olly Betts, 8 years ago

Returning to the original report: perhaps we should split out tests like this, which time operations and so might fail under uneven load etc., into a separate make target, and not run them under make check by default. For auto-builders, tests which occasionally fail are very annoying, and running these everywhere doesn't add much to the test coverage: they check that the algorithm used scales in a desirable way, and that algorithm is common to all platforms.

comment:17 by Olly Betts, 8 years ago

Milestone: 1.3.5 → 1.3.4
Resolution: → fixed
Status: assigned → closed

[8be35f5e1b1753cf83ce3794daf1e4558c94451f] skips timed tests if AUTOMATED_TESTING is set in the environment, so automated builds should just set that. That crudely but effectively deals with this issue, so closing.
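
For illustration, a check of this kind can be sketched as below. This is not the code from the commit; the function names are hypothetical, and the sketch assumes only what is stated above, namely that the mere presence of AUTOMATED_TESTING in the environment causes timed tests to be skipped.

```cpp
#include <cstdlib>

// True if AUTOMATED_TESTING is set in the environment.
// Assumption: any value counts, including an empty string.
bool automated_testing() {
    return std::getenv("AUTOMATED_TESTING") != nullptr;
}

// Hypothetical gate a timed testcase could consult before running.
bool should_run_timed_tests() {
    return !automated_testing();
}
```

An auto-builder would then export the variable before invoking the test suite, while a developer running the tests by hand still gets the timed ones.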
