Opened 18 years ago

Closed 18 years ago

Last modified 17 years ago

#86 closed defect (released)

crash in multimatch.cc when collapsing and sorting a particular mset

Reported by: Mike Trinkala Owned by: Olly Betts
Priority: normal Milestone: 1.0.0
Component: Library API Version: 0.9.9
Severity: normal Keywords:
Cc: Richard Boulton Blocked By:
Blocking: #118 Operating System: Linux

Description (last modified by Richard Boulton)

* glibc detected * /var/www/html/index.cgi: double free or corruption (fasttop): 0x08ef1cc0 *

If I disable sorting OR collapsing the query works fine.

The sort value (0) is a zero padded string representation of the unix time stamp. The collapse value (1) is the md5 digest of the document (the 16 byte binary representation is stored in the value string).

I was able the reduce the match set to 265 documents but have not been able to track down the cause of the problem. If necessary I can extract that document set and send you a small flint db. Sorry I cannot provide more details. I will attach the full debug log, let me know what else you need.

Additional Info: OS: Fedora Core 5 DB: Flint Version: 0.9.6 with the multimatch "Faster sort by value" patch applied

Attachments (8)

debug.tgz (281.1 KB ) - added by Mike Trinkala 18 years ago.
Apache error_log containing the Xapian debug output
valgrind.txt (43.3 KB ) - added by Mike Trinkala 18 years ago.
valgrind output
valgrind.2.txt (31.7 KB ) - added by Mike Trinkala 18 years ago.
Second valgrind run (recompiled multimatch)
bug86.tgz (1.9 KB ) - added by Mike Trinkala 18 years ago.
Flint DBs to reproduce the crash
matcher-min-weight.patch (522 bytes ) - added by Olly Betts 18 years ago.
Patch to fix incorrect part of fix for bug#110
bug86-fix.patch (8.1 KB ) - added by Olly Betts 18 years ago.
Fix for this bug (I hope!)
matcher-min-weight.2.patch (922 bytes ) - added by Olly Betts 18 years ago.
Patch to fix incorrect part of fix for bug#110 (v2)
multimatch-abort.patch (425 bytes ) - added by Olly Betts 18 years ago.
Patch to abort if we don't find the item we're replacing

Download all attachments as: .zip

Change History (34)

by Mike Trinkala, 18 years ago

Attachment: debug.tgz added

Apache error_log containing the Xapian debug output

comment:1 by Olly Betts, 18 years ago

Status: newassigned

The debug log narrows things down - it looks like the problem is when C++ tries to release the internal memory buffer associated with a string. But I can't see why it would try to do that twice for the same string.

Can you try running this under valgrind?

by Mike Trinkala, 18 years ago

Attachment: valgrind.txt added

valgrind output

comment:2 by Mike Trinkala, 18 years ago

I also tried saving the entire match set from a failed query to another DB before the collapse and sort and then ran the collapse and sort version of the query against that DB. Unfortunately that worked fine. Since I cannot send you a small test case here is what I get stepping through in gdb.

GNU gdb Red Hat Linux (6.3.0.0-1.122rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) break multimatch.cc:521 No source file named multimatch.cc. Make breakpoint pending on future shared library load? (y or [n]) y Breakpoint 1 (multimatch.cc:521) pending. (gdb) run Starting program: /var/www/html/find.cgi Reading symbols from shared object read from target memory...done. Loaded system supplied DSO at 0x92f000 Breakpoint 2 at 0xc06d0c: file multimatch.cc, line 521. Pending breakpoint "multimatch.cc:521" resolved omega - omega 0.9.6 (compiled Jul 28 2006 20:28:36) Enter NAME=VALUE lines, end with blank line P=day:20060801 audio versioning -site:listme -site:bugme mime:text/html SORT=0

Breakpoint 2, MultiMatch::get_mset (this=0xbfe32cc8, first=0, maxitems=10, check_at_least=11, mset=@0xbfe333b0, mdecider=0x0) at multimatch.cc:522 522 Xapian::docid olddid = old_item.did; Current language: auto; currently c++ (gdb) s 650 normal_iterator(const _Iterator& i) : _M_current(i) { } (gdb) 522 Xapian::docid olddid = old_item.did; (gdb) 526 for (i = items.begin(); i->did != olddid; ++i) { (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 540 oldkey->second.first.wt); (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 85 : first(a), second(b) { } (gdb) 486 { return this->assign(str); } (gdb) 85 : first(a), second(b) { } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) * glibc detected * /var/www/html/find.cgi: double free or corruption (fasttop): 0x0840ffa0 * ======= Backtrace: ========= /lib/libc.so.6[0x9b0f18] /lib/libc.so.6(libc_free+0x78)[0x9b43ef] /usr/lib/libstdc++.so.6(_ZdlPv+0x21)[0x1c66c1] /usr/lib/libstdc++.so.6(_ZNSs4_Rep10_M_destroyERKSaIcE+0x1d)[0x1a238d] /usr/lib/libstdc++.so.6(_ZNSs6assignERKSs+0xa6)[0x1a47c6] /usr/local/lib/libxapian.so.11(_ZN10MultiMatch8get_msetEjjjRN6Xapian4MSetEPKNS0_12MatchDeciderE+0x1aa0)[0xc06ea0] /usr/local/lib/libxapian.so.11(_ZNK6Xapian7Enquire8Internal8get_msetEjjjPKNS_4RSetEPKNS_12MatchDeciderE+0x17c)[0xc24acc] /usr/local/lib/libxapian.so.11(_ZNK6Xapian7Enquire8get_msetEjjjPKNS_4RSetEPKNS_12MatchDeciderE+0x51)[0xc24de1] /var/www/html/find.cgi[0x8059b62] /var/www/html/find.cgi[0x80582f3] /var/www/html/find.cgi[0x80582f3] /var/www/html/find.cgi[0x8064294] /var/www/html/find.cgi[0x806489b] /var/www/html/find.cgi[0x804f319] /lib/libc.so.6(libc_start_main+0xdc)[0x962724] /var/www/html/find.cgi(gxx_personality_v0+0xb5)[0x804d321] ======= Memory map: ======== 00111000-001f3000 r-xp 00000000 fd:00 39430295 /usr/lib/libstdc++.so.6.0.8 001f3000-001f7000 r-xp 000e1000 fd:00 39430295 /usr/lib/libstdc++.so.6.0.8 001f7000-001f8000 rwxp 000e5000 fd:00 39430295 /usr/lib/libstdc++.so.6.0.8 001f8000-001fe000 rwxp 001f8000 00:00 0 001fe000-00209000 r-xp 00000000 fd:00 20317033 /lib/libgcc_s-4.1.1-20060525.so.1 00209000-0020a000 rwxp 0000a000 fd:00 20317033 /lib/libgcc_s-4.1.1-20060525.so.1 0092f000-00930000 r-xp 0092f000 00:00 0 [vdso] 00930000-00949000 r-xp 00000000 fd:00 20316215 /lib/ld-2.4.so 00949000-0094a000 r-xp 00018000 fd:00 20316215 /lib/ld-2.4.so 0094a000-0094b000 rwxp 00019000 fd:00 20316215 /lib/ld-2.4.so 0094d000-00a7a000 r-xp 00000000 fd:00 20316293 /lib/libc-2.4.so 00a7a000-00a7c000 r-xp 0012d000 fd:00 20316293 /lib/libc-2.4.so 00a7c000-00a7d000 rwxp 0012f000 fd:00 20316293 /lib/libc-2.4.so 00a7d000-00a80000 rwxp 00a7d000 00:00 0 00a82000-00aa5000 r-xp 00000000 fd:00 20317028 /lib/libm-2.4.so 00aa5000-00aa6000 r-xp 00022000 fd:00 20317028 /lib/libm-2.4.so 00aa6000-00aa7000 rwxp 00023000 fd:00 20317028 /lib/libm-2.4.so 00b40000-00c9e000 r-xp 00000000 fd:00 39423125 /usr/local/lib/libxapian.so.11.1.0 00c9e000-00cac000 rwxp 0015e000 fd:00 39423125 /usr/local/lib/libxapian.so.11.1.0 08048000-08074000 r-xp 00000000 fd:00 42795102 /var/www/html/find.cgi 08074000-08076000 rw-p 0002b000 fd:00 42795102 /var/www/html/find.cgi 083ab000-08483000 rw-p 083ab000 00:00 0 [heap] b7e00000-b7e21000 rw-p b7e00000 00:00 0 b7e21000-b7f00000 ---p b7e21000 00:00 0 b7f32000-b7fc4000 rw-p b7f32000 00:00 0 b7fc9000-b7fcc000 rw-p b7fc9000 00:00 0 bfe20000-bfe35000 rw-p bfe20000 00:00 0 [stack]

Program received signal SIGABRT, Aborted. 0x0092f402 in kernel_vsyscall () (gdb)

comment:3 by Olly Betts, 18 years ago

The valgrind output shows roughly what's going on, but don't show the line number from multimatch.cc where this is happening (because of inlining I think), and I'm struggling to see exactly where this is happening.

If you could recompile multimatch.cc without optimisation and rerun with valgrind that would be very helpful. You should be able to recompile just the one file like so:

cd matcher rm multimatch.lo multimatch.o make CXXFLAGS=-g cd .. make make install

by Mike Trinkala, 18 years ago

Attachment: valgrind.2.txt added

Second valgrind run (recompiled multimatch)

comment:4 by Olly Betts, 18 years ago

The gdb output actually shows the problem at line 540 of multimatch.cc I think, but that's just where we discover the problem, not where it originated... I was hoping the valgrind output would show that, which it seems to, but I still don't understand exactly what's happening here. It seems that the internal representation of a std::string gets deleted and then we access it. I think this must be due to reference counting, since it isn't the same string object that gets deleted as that which gets accessed (they're members of different MSetItem objects).

Perhaps it would be simplest if you could send the database and anything else needed (the OmegaScript query template at least) and then I can prod things from the debugger. The database should be pretty small with just 265 documents.

comment:5 by Olly Betts, 18 years ago

Hi Mike - currently this bug is in limbo. Is it possible to get a copy of the database which demonstrates this?

I suspect this won't be hard to fix if I could only manage to reproduce it!

comment:6 by Mike Trinkala, 18 years ago

Anytime I try to create a smaller DB from the uncollapsed/unsorted result set I cannot reproduce the problem and the main DB is several gig with confidential information. It looks like it may be up to me to solve this one, I just haven't had the time. I will try to take another look at it this weekend.

comment:7 by Olly Betts, 18 years ago

Sorry, I'd misread 265 documents as the database size, but that was the MSet size.

If you're using GCC 3.4 or newer, you can turn on debugging iterators, etc in the GNU C++ STL by defining _GLIBCXX_DEBUG when building xapian-core:

./configure CPPFLAGS=-D_GLIBCXX_DEBUG

For documentation of this option, see: http://gcc.gnu.org/onlinedocs/libstdc++/debug.html

Note that all C++ code involved needs to be built with (or without) with option, which means you'll need to rebuilt Omega too.

It's possible that this will catch whatever is happening sooner after it happens

  • it seems to be related to std::string somehow.

by Mike Trinkala, 18 years ago

Attachment: bug86.tgz added

Flint DBs to reproduce the crash

comment:8 by Olly Betts, 18 years ago

Thanks for the test case.

If I bodge the databases so that SVN HEAD can read them (by copying in "iamflint" files from a newer database, which is fine provided there's no positional information), the example query works fine in Omega.

But while that might mean this is fixed, it could just be that this testcase happens to work now.

So I'd like to find out what was happening to see if it was really fixed, so I'll see if it fails for 0.9.10 and if not work back. The bug currently says "0.9.6" for the version - what's the latest version you've tried this on?

comment:9 by Mike Trinkala, 18 years ago

Yes, that was against 0.9.6. I just tried the test data in 0.9.9 (it works). However, I ran my production data against 0.9.9 and I am still seeing the same type of failure. So it doesn't look like the problem is fixed it is just displaced.

comment:10 by Olly Betts, 18 years ago

Version: 0.9.60.9.9

OK, I can reproduce this under x86-64 Linux with unpatched 0.9.6 (from SVN), and also 0.9.7, 0.9.8, and 0.9.9. However, 0.9.10 works for me for your testcase.

It would be interesting to know if 0.9.10 fixes this for your full system...

comment:11 by Olly Betts, 18 years ago

attachments.mimetype: application/octet-streamtext/plain

comment:12 by Olly Betts, 18 years ago

attachments.mimetype: text/plainapplication/octet-stream

comment:13 by Olly Betts, 18 years ago

I've investigated and found what goes wrong - where we look for the entry in the existing MSet, the entry isn't found (which means we're already in an inconsistent state), and we try to update via an iterator which is at the "end" of the vector. And it's downhill from there...

Looking at how this could happen, I believe this is actually the same as bug#110, which was fixed in SVN HEAD (and the fix backported to 0.9.10). Applying the fix for that bug cures the problem for me. Can you try it for your larger testcases:

http://oligarchy.co.uk/xapian-fix-inconsistent-paging-with-sort-by-relevance-then-value.patch

comment:14 by Mike Trinkala, 18 years ago

I take it I have to rebuild my indexes for 9.9.10 or the patch to work (since all queries against the old indexes crash). I didn't see anything in the release notes though.

comment:15 by Olly Betts, 18 years ago

0.9.10 should be library ABI and file format compatible with 0.9.9.

The library ABI last changed between xapian-core 0.9.6 and 0.9.7. Flint format last changed between xapian 0.9.1 and 0.9.2.

Current SVN HEAD is flint format (and ABI) incompatible with 0.9.x, but you should get an error message for the flint format at least...

Certainly just applying the patch shouldn't be a problem as it only touches the matcher internals.

comment:16 by Mike Trinkala, 18 years ago

Yeah it is definitely the changes in multimatch.cc. This is a patched 9.9.9 on Solaris 10. I will try to come up with something more useful but here is what I am seeing.

Program received signal SIGSEGV, Segmentation fault. 0xfed17a3d in MultiMatch::get_mset (this=0x8046c60, first=0, maxitems=8, check_at_least=9, mset=@0x8046dc0, mdecider=0x0)

at stl_iterator.h:618

618 { return _M_current; } (gdb) bt #0 0xfed17a3d in MultiMatch::get_mset (this=0x8046c60, first=0, maxitems=8, check_at_least=9, mset=@0x8046dc0, mdecider=0x0)

at stl_iterator.h:618

#1 0xfed2d8fc in Xapian::Enquire::Internal::get_mset (this=0x8096ab0, first=0, maxitems=8, check_at_least=9, rset=0x8046c10,

mdecider=0x0) at omenquire.cc:687

#2 0xfed2dd06 in Xapian::Enquire::get_mset (this=0x8046de0, first=0, maxitems=8, check_at_least=9, rset=0x0, mdecider=0x0)

at base.h:152

#3 0x08057cc5 in execute_query (lls_conf=@0x8047350, params=@0x80473a0, data=@0x8047330, ql=@0x8047210) at find.cc:303 #4 0x080595a3 in main (argc=1, argv=0x80474d0) at ext/new_allocator.h:62

comment:17 by Mike Trinkala, 18 years ago

Yes

Program received signal SIGABRT, Aborted. 0xfe9509c7 in _lwp_kill () from /lib/libc.so.1 (gdb) bt #0 0xfe9509c7 in _lwp_kill () from /lib/libc.so.1 #1 0xfe94e18a in thr_kill () from /lib/libc.so.1 #2 0xfe8fcf93 in raise () from /lib/libc.so.1 #3 0xfe8e0a29 in abort () from /lib/libc.so.1 #4 0xfed17da2 in MultiMatch::get_mset (this=0x8046c90, first=0, maxitems=8, check_at_least=9, mset=@0x8046df0, mdecider=0x0)

at multimatch.cc:519

#5 0xfed2d9dc in Xapian::Enquire::Internal::get_mset (this=0x8096aa8, first=0, maxitems=8, check_at_least=9, rset=0x8046c40,

mdecider=0x0) at omenquire.cc:687

#6 0xfed2dde6 in Xapian::Enquire::get_mset (this=0x8046e10, first=0, maxitems=8, check_at_least=9, rset=0x0, mdecider=0x0)

at base.h:152

#7 0x08057cc5 in execute_query (lls_conf=@0x8047380, params=@0x80473d0, data=@0x8047360, ql=@0x8047240) at find.cc:303 #8 0x080595a3 in main (argc=1, argv=0x80474f8) at ext/new_allocator.h:62

by Olly Betts, 18 years ago

Attachment: matcher-min-weight.patch added

Patch to fix incorrect part of fix for bug#110

comment:18 by Olly Betts, 18 years ago

attachments.isobsolete: 01

comment:19 by Mike Trinkala, 18 years ago

The patch didn't apply cleanly to 0.9.10 or a prepatched 0.9.9. I manually applied it to a fresh 0.9.10. Now it is back to seg faulting on my production data with the initial problem query. I put the abort back in to confirm it is the same place and it is.

Program received signal SIGABRT, Aborted. 0xfe9509c7 in _lwp_kill () from /lib/libc.so.1 (gdb) bt #0 0xfe9509c7 in _lwp_kill () from /lib/libc.so.1 #1 0xfe94e18a in thr_kill () from /lib/libc.so.1 #2 0xfe8fcf93 in raise () from /lib/libc.so.1 #3 0xfe8e0a29 in abort () from /lib/libc.so.1 #4 0xfed183e3 in MultiMatch::get_mset (this=0x8046c10, first=117, maxitems=13, check_at_least=14, mset=@0x8046d70, mdecider=0x0)

at multimatch.cc:515

#5 0xfed2dfcc in Xapian::Enquire::Internal::get_mset (this=0x8083070, first=117, maxitems=13, check_at_least=14, rset=0x8046bc0,

mdecider=0x0) at omenquire.cc:687

#6 0xfed2e3d6 in Xapian::Enquire::get_mset (this=0x8046d90, first=117, maxitems=13, check_at_least=14, rset=0x0, mdecider=0x0)

at base.h:152

#7 0x08057cc5 in execute_query (lls_conf=@0x8047300, params=@0x8047350, data=@0x80472e0, ql=@0x80471c0) at find.cc:303 #8 0x080595a3 in main (argc=1, argv=0x804747c) at ext/new_allocator.h:62

comment:20 by Richard Boulton, 18 years ago

Blocking: 118 added

Not clear to me from the comments whether this affects HEAD or just the 0.9 branch. Please remove dependency (that I've just added) on bug #118 if it doesn't affect HEAD.

comment:21 by Olly Betts, 18 years ago

Both are affected, and this definitely wants fixing for 1.0. I believe I now understand what's going on here - I just need to put together a clean and coherent fix and get Mike to test it.

comment:22 by Richard Boulton, 18 years ago

Cc: richard@… added

by Olly Betts, 18 years ago

Attachment: multimatch-abort.patch added

Patch to fix incorrect part of fix for bug#110 (v2)

comment:23 by Olly Betts, 18 years ago

attachments.isobsolete: 01

comment:24 by Olly Betts, 18 years ago

attachments.isobsolete: 01

comment:25 by Mike Trinkala, 18 years ago

Resolution: fixed
Status: assignedclosed

It is looking good. I tested some problem queries before and after the patch, all are now working correctly.

Thanks, Trink

comment:26 by Olly Betts, 18 years ago

Resolution: fixedverified

Excellent!

comment:27 by Olly Betts, 18 years ago

Operating System: Linux
Resolution: verifiedreleased

Fixed in 1.0.0 release.

comment:29 by Richard Boulton, 17 years ago

Description: modified (diff)
Milestone: 1.0.0
Note: See TracTickets for help on using tickets.