#86 closed defect (released)
crash in multimatch.cc when collapsing and sorting a particular mset
Reported by: | Mike Trinkala | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.0.0 |
Component: | Library API | Version: | 0.9.9 |
Severity: | normal | Keywords: | |
Cc: | Richard Boulton | Blocked By: | |
Blocking: | #118 | Operating System: | Linux |
Description (last modified by )
* glibc detected * /var/www/html/index.cgi: double free or corruption (fasttop): 0x08ef1cc0 *
If I disable sorting OR collapsing the query works fine.
The sort value (0) is a zero padded string representation of the unix time stamp. The collapse value (1) is the md5 digest of the document (the 16 byte binary representation is stored in the value string).
I was able the reduce the match set to 265 documents but have not been able to track down the cause of the problem. If necessary I can extract that document set and send you a small flint db. Sorry I cannot provide more details. I will attach the full debug log, let me know what else you need.
Additional Info: OS: Fedora Core 5 DB: Flint Version: 0.9.6 with the multimatch "Faster sort by value" patch applied
Attachments (8)
Change History (34)
by , 18 years ago
comment:1 by , 18 years ago
Status: | new → assigned |
---|
The debug log narrows things down - it looks like the problem is when C++ tries to release the internal memory buffer associated with a string. But I can't see why it would try to do that twice for the same string.
Can you try running this under valgrind?
comment:2 by , 18 years ago
I also tried saving the entire match set from a failed query to another DB before the collapse and sort and then ran the collapse and sort version of the query against that DB. Unfortunately that worked fine. Since I cannot send you a small test case here is what I get stepping through in gdb.
GNU gdb Red Hat Linux (6.3.0.0-1.122rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) break multimatch.cc:521 No source file named multimatch.cc. Make breakpoint pending on future shared library load? (y or [n]) y Breakpoint 1 (multimatch.cc:521) pending. (gdb) run Starting program: /var/www/html/find.cgi Reading symbols from shared object read from target memory...done. Loaded system supplied DSO at 0x92f000 Breakpoint 2 at 0xc06d0c: file multimatch.cc, line 521. Pending breakpoint "multimatch.cc:521" resolved omega - omega 0.9.6 (compiled Jul 28 2006 20:28:36) Enter NAME=VALUE lines, end with blank line P=day:20060801 audio versioning -site:listme -site:bugme mime:text/html SORT=0
Breakpoint 2, MultiMatch::get_mset (this=0xbfe32cc8, first=0, maxitems=10, check_at_least=11, mset=@0xbfe333b0, mdecider=0x0) at multimatch.cc:522 522 Xapian::docid olddid = old_item.did; Current language: auto; currently c++ (gdb) s 650 normal_iterator(const _Iterator& i) : _M_current(i) { } (gdb) 522 Xapian::docid olddid = old_item.did; (gdb) 526 for (i = items.begin(); i->did != olddid; ++i) { (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 540 oldkey->second.first.wt); (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 85 : first(a), second(b) { } (gdb) 486 { return this->assign(str); } (gdb) 85 : first(a), second(b) { } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) 70 class MSetItem { (gdb) 486 { return this->assign(str); } (gdb) * glibc detected * /var/www/html/find.cgi: double free or corruption (fasttop): 0x0840ffa0 * ======= Backtrace: ========= /lib/libc.so.6[0x9b0f18] /lib/libc.so.6(libc_free+0x78)[0x9b43ef] /usr/lib/libstdc++.so.6(_ZdlPv+0x21)[0x1c66c1] /usr/lib/libstdc++.so.6(_ZNSs4_Rep10_M_destroyERKSaIcE+0x1d)[0x1a238d] /usr/lib/libstdc++.so.6(_ZNSs6assignERKSs+0xa6)[0x1a47c6] /usr/local/lib/libxapian.so.11(_ZN10MultiMatch8get_msetEjjjRN6Xapian4MSetEPKNS0_12MatchDeciderE+0x1aa0)[0xc06ea0] /usr/local/lib/libxapian.so.11(_ZNK6Xapian7Enquire8Internal8get_msetEjjjPKNS_4RSetEPKNS_12MatchDeciderE+0x17c)[0xc24acc] /usr/local/lib/libxapian.so.11(_ZNK6Xapian7Enquire8get_msetEjjjPKNS_4RSetEPKNS_12MatchDeciderE+0x51)[0xc24de1] /var/www/html/find.cgi[0x8059b62] /var/www/html/find.cgi[0x80582f3] /var/www/html/find.cgi[0x80582f3] /var/www/html/find.cgi[0x8064294] /var/www/html/find.cgi[0x806489b] /var/www/html/find.cgi[0x804f319] /lib/libc.so.6(libc_start_main+0xdc)[0x962724] /var/www/html/find.cgi(gxx_personality_v0+0xb5)[0x804d321] ======= Memory map: ======== 00111000-001f3000 r-xp 00000000 fd:00 39430295 /usr/lib/libstdc++.so.6.0.8 001f3000-001f7000 r-xp 000e1000 fd:00 39430295 /usr/lib/libstdc++.so.6.0.8 001f7000-001f8000 rwxp 000e5000 fd:00 39430295 /usr/lib/libstdc++.so.6.0.8 001f8000-001fe000 rwxp 001f8000 00:00 0 001fe000-00209000 r-xp 00000000 fd:00 20317033 /lib/libgcc_s-4.1.1-20060525.so.1 00209000-0020a000 rwxp 0000a000 fd:00 20317033 /lib/libgcc_s-4.1.1-20060525.so.1 0092f000-00930000 r-xp 0092f000 00:00 0 [vdso] 00930000-00949000 r-xp 00000000 fd:00 20316215 /lib/ld-2.4.so 00949000-0094a000 r-xp 00018000 fd:00 20316215 /lib/ld-2.4.so 0094a000-0094b000 rwxp 00019000 fd:00 20316215 /lib/ld-2.4.so 0094d000-00a7a000 r-xp 00000000 fd:00 20316293 /lib/libc-2.4.so 00a7a000-00a7c000 r-xp 0012d000 fd:00 20316293 /lib/libc-2.4.so 00a7c000-00a7d000 rwxp 0012f000 fd:00 20316293 /lib/libc-2.4.so 00a7d000-00a80000 rwxp 00a7d000 00:00 0 00a82000-00aa5000 r-xp 00000000 fd:00 20317028 /lib/libm-2.4.so 00aa5000-00aa6000 r-xp 00022000 fd:00 20317028 /lib/libm-2.4.so 00aa6000-00aa7000 rwxp 00023000 fd:00 20317028 /lib/libm-2.4.so 00b40000-00c9e000 r-xp 00000000 fd:00 39423125 /usr/local/lib/libxapian.so.11.1.0 00c9e000-00cac000 rwxp 0015e000 fd:00 39423125 /usr/local/lib/libxapian.so.11.1.0 08048000-08074000 r-xp 00000000 fd:00 42795102 /var/www/html/find.cgi 08074000-08076000 rw-p 0002b000 fd:00 42795102 /var/www/html/find.cgi 083ab000-08483000 rw-p 083ab000 00:00 0 [heap] b7e00000-b7e21000 rw-p b7e00000 00:00 0 b7e21000-b7f00000 ---p b7e21000 00:00 0 b7f32000-b7fc4000 rw-p b7f32000 00:00 0 b7fc9000-b7fcc000 rw-p b7fc9000 00:00 0 bfe20000-bfe35000 rw-p bfe20000 00:00 0 [stack]
Program received signal SIGABRT, Aborted. 0x0092f402 in kernel_vsyscall () (gdb)
comment:3 by , 18 years ago
The valgrind output shows roughly what's going on, but don't show the line number from multimatch.cc where this is happening (because of inlining I think), and I'm struggling to see exactly where this is happening.
If you could recompile multimatch.cc without optimisation and rerun with valgrind that would be very helpful. You should be able to recompile just the one file like so:
cd matcher rm multimatch.lo multimatch.o make CXXFLAGS=-g cd .. make make install
comment:4 by , 18 years ago
The gdb output actually shows the problem at line 540 of multimatch.cc I think, but that's just where we discover the problem, not where it originated... I was hoping the valgrind output would show that, which it seems to, but I still don't understand exactly what's happening here. It seems that the internal representation of a std::string gets deleted and then we access it. I think this must be due to reference counting, since it isn't the same string object that gets deleted as that which gets accessed (they're members of different MSetItem objects).
Perhaps it would be simplest if you could send the database and anything else needed (the OmegaScript query template at least) and then I can prod things from the debugger. The database should be pretty small with just 265 documents.
comment:5 by , 18 years ago
Hi Mike - currently this bug is in limbo. Is it possible to get a copy of the database which demonstrates this?
I suspect this won't be hard to fix if I could only manage to reproduce it!
comment:6 by , 18 years ago
Anytime I try to create a smaller DB from the uncollapsed/unsorted result set I cannot reproduce the problem and the main DB is several gig with confidential information. It looks like it may be up to me to solve this one, I just haven't had the time. I will try to take another look at it this weekend.
comment:7 by , 18 years ago
Sorry, I'd misread 265 documents as the database size, but that was the MSet size.
If you're using GCC 3.4 or newer, you can turn on debugging iterators, etc in the GNU C++ STL by defining _GLIBCXX_DEBUG when building xapian-core:
./configure CPPFLAGS=-D_GLIBCXX_DEBUG
For documentation of this option, see: http://gcc.gnu.org/onlinedocs/libstdc++/debug.html
Note that all C++ code involved needs to be built with (or without) with option, which means you'll need to rebuilt Omega too.
It's possible that this will catch whatever is happening sooner after it happens
- it seems to be related to std::string somehow.
comment:8 by , 18 years ago
Thanks for the test case.
If I bodge the databases so that SVN HEAD can read them (by copying in "iamflint" files from a newer database, which is fine provided there's no positional information), the example query works fine in Omega.
But while that might mean this is fixed, it could just be that this testcase happens to work now.
So I'd like to find out what was happening to see if it was really fixed, so I'll see if it fails for 0.9.10 and if not work back. The bug currently says "0.9.6" for the version - what's the latest version you've tried this on?
comment:9 by , 18 years ago
Yes, that was against 0.9.6. I just tried the test data in 0.9.9 (it works). However, I ran my production data against 0.9.9 and I am still seeing the same type of failure. So it doesn't look like the problem is fixed it is just displaced.
comment:10 by , 18 years ago
Version: | 0.9.6 → 0.9.9 |
---|
OK, I can reproduce this under x86-64 Linux with unpatched 0.9.6 (from SVN), and also 0.9.7, 0.9.8, and 0.9.9. However, 0.9.10 works for me for your testcase.
It would be interesting to know if 0.9.10 fixes this for your full system...
comment:11 by , 18 years ago
attachments.mimetype: | application/octet-stream → text/plain |
---|
comment:12 by , 18 years ago
attachments.mimetype: | text/plain → application/octet-stream |
---|
comment:13 by , 18 years ago
I've investigated and found what goes wrong - where we look for the entry in the existing MSet, the entry isn't found (which means we're already in an inconsistent state), and we try to update via an iterator which is at the "end" of the vector. And it's downhill from there...
Looking at how this could happen, I believe this is actually the same as bug#110, which was fixed in SVN HEAD (and the fix backported to 0.9.10). Applying the fix for that bug cures the problem for me. Can you try it for your larger testcases:
http://oligarchy.co.uk/xapian-fix-inconsistent-paging-with-sort-by-relevance-then-value.patch
comment:14 by , 18 years ago
I take it I have to rebuild my indexes for 9.9.10 or the patch to work (since all queries against the old indexes crash). I didn't see anything in the release notes though.
comment:15 by , 18 years ago
0.9.10 should be library ABI and file format compatible with 0.9.9.
The library ABI last changed between xapian-core 0.9.6 and 0.9.7. Flint format last changed between xapian 0.9.1 and 0.9.2.
Current SVN HEAD is flint format (and ABI) incompatible with 0.9.x, but you should get an error message for the flint format at least...
Certainly just applying the patch shouldn't be a problem as it only touches the matcher internals.
comment:16 by , 18 years ago
Yeah it is definitely the changes in multimatch.cc. This is a patched 9.9.9 on Solaris 10. I will try to come up with something more useful but here is what I am seeing.
Program received signal SIGSEGV, Segmentation fault. 0xfed17a3d in MultiMatch::get_mset (this=0x8046c60, first=0, maxitems=8, check_at_least=9, mset=@0x8046dc0, mdecider=0x0)
at stl_iterator.h:618
618 { return _M_current; } (gdb) bt #0 0xfed17a3d in MultiMatch::get_mset (this=0x8046c60, first=0, maxitems=8, check_at_least=9, mset=@0x8046dc0, mdecider=0x0)
at stl_iterator.h:618
#1 0xfed2d8fc in Xapian::Enquire::Internal::get_mset (this=0x8096ab0, first=0, maxitems=8, check_at_least=9, rset=0x8046c10,
mdecider=0x0) at omenquire.cc:687
#2 0xfed2dd06 in Xapian::Enquire::get_mset (this=0x8046de0, first=0, maxitems=8, check_at_least=9, rset=0x0, mdecider=0x0)
at base.h:152
#3 0x08057cc5 in execute_query (lls_conf=@0x8047350, params=@0x80473a0, data=@0x8047330, ql=@0x8047210) at find.cc:303 #4 0x080595a3 in main (argc=1, argv=0x80474d0) at ext/new_allocator.h:62
comment:17 by , 18 years ago
Yes
Program received signal SIGABRT, Aborted. 0xfe9509c7 in _lwp_kill () from /lib/libc.so.1 (gdb) bt #0 0xfe9509c7 in _lwp_kill () from /lib/libc.so.1 #1 0xfe94e18a in thr_kill () from /lib/libc.so.1 #2 0xfe8fcf93 in raise () from /lib/libc.so.1 #3 0xfe8e0a29 in abort () from /lib/libc.so.1 #4 0xfed17da2 in MultiMatch::get_mset (this=0x8046c90, first=0, maxitems=8, check_at_least=9, mset=@0x8046df0, mdecider=0x0)
at multimatch.cc:519
#5 0xfed2d9dc in Xapian::Enquire::Internal::get_mset (this=0x8096aa8, first=0, maxitems=8, check_at_least=9, rset=0x8046c40,
mdecider=0x0) at omenquire.cc:687
#6 0xfed2dde6 in Xapian::Enquire::get_mset (this=0x8046e10, first=0, maxitems=8, check_at_least=9, rset=0x0, mdecider=0x0)
at base.h:152
#7 0x08057cc5 in execute_query (lls_conf=@0x8047380, params=@0x80473d0, data=@0x8047360, ql=@0x8047240) at find.cc:303 #8 0x080595a3 in main (argc=1, argv=0x80474f8) at ext/new_allocator.h:62
by , 18 years ago
Attachment: | matcher-min-weight.patch added |
---|
Patch to fix incorrect part of fix for bug#110
comment:18 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:19 by , 18 years ago
The patch didn't apply cleanly to 0.9.10 or a prepatched 0.9.9. I manually applied it to a fresh 0.9.10. Now it is back to seg faulting on my production data with the initial problem query. I put the abort back in to confirm it is the same place and it is.
Program received signal SIGABRT, Aborted. 0xfe9509c7 in _lwp_kill () from /lib/libc.so.1 (gdb) bt #0 0xfe9509c7 in _lwp_kill () from /lib/libc.so.1 #1 0xfe94e18a in thr_kill () from /lib/libc.so.1 #2 0xfe8fcf93 in raise () from /lib/libc.so.1 #3 0xfe8e0a29 in abort () from /lib/libc.so.1 #4 0xfed183e3 in MultiMatch::get_mset (this=0x8046c10, first=117, maxitems=13, check_at_least=14, mset=@0x8046d70, mdecider=0x0)
at multimatch.cc:515
#5 0xfed2dfcc in Xapian::Enquire::Internal::get_mset (this=0x8083070, first=117, maxitems=13, check_at_least=14, rset=0x8046bc0,
mdecider=0x0) at omenquire.cc:687
#6 0xfed2e3d6 in Xapian::Enquire::get_mset (this=0x8046d90, first=117, maxitems=13, check_at_least=14, rset=0x0, mdecider=0x0)
at base.h:152
#7 0x08057cc5 in execute_query (lls_conf=@0x8047300, params=@0x8047350, data=@0x80472e0, ql=@0x80471c0) at find.cc:303 #8 0x080595a3 in main (argc=1, argv=0x804747c) at ext/new_allocator.h:62
comment:20 by , 18 years ago
Blocking: | 118 added |
---|
Not clear to me from the comments whether this affects HEAD or just the 0.9 branch. Please remove dependency (that I've just added) on bug #118 if it doesn't affect HEAD.
comment:21 by , 18 years ago
Both are affected, and this definitely wants fixing for 1.0. I believe I now understand what's going on here - I just need to put together a clean and coherent fix and get Mike to test it.
comment:22 by , 18 years ago
Cc: | added |
---|
by , 18 years ago
Attachment: | multimatch-abort.patch added |
---|
Patch to fix incorrect part of fix for bug#110 (v2)
comment:23 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:24 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:25 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
It is looking good. I tested some problem queries before and after the patch, all are now working correctly.
Thanks, Trink
comment:27 by , 18 years ago
Operating System: | → Linux |
---|---|
Resolution: | verified → released |
Fixed in 1.0.0 release.
comment:29 by , 17 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.0.0 |
Apache error_log containing the Xapian debug output