Opened 10 years ago
Closed 9 years ago
#679 closed defect (fixed)
Memory and speed issues in wildcard searches
Reported by: | Dmitry Karasik | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.3 |
Component: | QueryParser | Version: | 1.3.2 |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
Hello,
I have a problem with some searches when wildcarding is involved, which is xapian eating lots of memory and performing slowly. The problem manifests itself when expansion of a query with a trailing * in it returns too many matches, and thus the expanded query contains lots of individual words.
The simple example attached easily eats 1.6G on my machine, and never gives them back. This is a problem in long-running fastcgi processes, that grind the server down when users fire lots of these searches. I wonder if there can be done something about it.
What's more interesting, that most of the time I don't even need all of the expanded words to match, only some 100 first documents. However setting a cap on the number of wildcard expansion doesn't help, xapian casts an exception "Wildcard expands to more than X terms". Surely there's a reason behind this (I just don't know what is it), but probably there could be added a flag to QueryParser that forces capping of the expansion? This would help the speed of the search also.
Please find attached code examples and the output of memory use.
Sincerely, Dmitry Karasik IT System Developer Novozymes A/S Krogshoejvej 36 2880 Bagsvaerd Denmark
PS: tested on the bleeding edge shapshot
Attachments (3)
Change History (15)
by , 10 years ago
comment:1 by , 10 years ago
Description: | modified (diff) |
---|
comment:2 by , 10 years ago
Component: | Other → QueryParser |
---|
comment:3 by , 10 years ago
Version: | → 1.3.2 |
---|
comment:4 by , 10 years ago
Description: | modified (diff) |
---|
comment:5 by , 10 years ago
In git master (since 1.3.2) there are 3 modes for limiting wildcard expansion:
enum { /** Throw an error if OP_WILDCARD exceeds its expansion limit. * * Xapian::WildcardError will be thrown when the query is actually * run. */ WILDCARD_LIMIT_ERROR, /** Stop expanding when OP_WILDCARD reaches its expansion limit. * * This makes the wildcard expand to only the first N terms (sorted * by byte order). */ WILDCARD_LIMIT_FIRST, /** Limit OP_WILDCARD expansion to the most frequent terms. * * If OP_WILDCARD would expand to more than its expansion limit, the * most frequent terms are taken. This approach works well for cases * such as expanding a partial term at the end of a query string which * the user hasn't finished typing yet - as well as being less expense * to evaluate than the full expansion, using only the most frequent * terms tends to give better results too. */ WILDCARD_LIMIT_MOST_FREQUENT };
And you can tell QueryParser
which mode(s) to use for wildcards and for partial terms.
Using the "glass" backend for the database should make a significant difference to memory usage for cases like this, as glass reference counts cursor blocks, whereas chert just allocates fresh cursor blocks for all 100,000 terms.
I changed create
to create the database like so:
my $db = Xapian::WritableDatabase->new( "index", DB_CREATE_OR_OPEN|Xapian::DB_BACKEND_GLASS );
And with current git master I get 111MB used by the search (taking off the memory allocated before we expand the wildcard, that's 713 bytes per term in the expanded wildcard):
time vsz ( diff) rss ( diff) shared ( diff) code ( diff) data ( diff) 0 43544 ( 43544) 11688 ( 11688) 6672 ( 6672) 8 ( 8) 5200 ( 5200) 18 627 113160 ( 69616) 82316 ( 70628) 7676 ( 1004) 8 ( 0) 74816 ( 69616) 20 627 113160 ( 0) 82316 ( 0) 7676 ( 0) 8 ( 0) 74816 ( 0) 22
The search took ages to run (I didn't time it, but it felt like several minutes). But my development tree is built with all possible assertions enabled, so I'm not sure how long it would take with a normal build anyway.
comment:6 by , 10 years ago
Thank you for the quick response!
For some reason I couldn't import WILDCARD_LIMIT_* constants in my perl script, but I managed to re-run it with $qp->set_max_wildcard_expansion(10) and $qp->set_max_expansion(0). Unfortunately we can't use glass yet, it's not stable enough, but I could reproduce your result, it is indeed much less a memory hog, and it wasn't ages to run on a normal build. I hope 1.40 comes soon :)
The problem at hand though didn't go away even with these settings on chert - it's still eating 1.6G and not giving it back.
comment:7 by , 10 years ago
If you're using GCC, you could try telling its STL not to pool memory by setting this environment variable:
export GLIBCXX_FORCE_NEW=1
However, it pools by default for performance reasons, so this might make things slower.
Perhaps we should have a custom allocator for cursor blocks - those used for running a match should be released once that match has completed, so it shouldn't be very complex to track which blocks are used.
The Perl bindings have a hard-coded list of constants to export currently - you should be able to just use the full name though - e.g. Xapian::Query::WILDCARD_LIMIT_FIRST
.
comment:8 by , 10 years ago
I tried GLIBCXX_FORCE_NEW=1, that unfortunately didn't change anything.
As for $Xapian::Query:: yes that worked, thank you! But rather strangely: WILDCARD_LIMIT_ERROR is the only setting that doesn't throw an exception, while the other two do when I set $qp->set_max_wildcard_expansion(10). I actually expected otherwise; was that the meaning to be that way?
comment:10 by , 10 years ago
I think you're just getting 0 instead of the correct constant values (WILDCARD_LIMIT_ERROR
has value 0).
The constants should be wrapped without the $
(which is more conventional in Perl, and for consistency with other constants in the bindings). I've fixed that and added feature tests, and it all seems to work as expected.
comment:11 by , 10 years ago
Yes, I also think so, but this is a much lesser issue and I trust that you have it under control. It's just that that I'm using default build, and it seems that xapian-bindings/perl doesn't use the build path with "perl Makefile.PL && make", that's probably why there are issues.
Just for reference, here's what I've got:
dmka@solsort xapian-bindings/perl> perl -Mstrict -I. -Iblib/arch -le 'use Xapian qw(:standard); print "", $Xapian::VERSION' 1.3.2.0 dmka@solsort xapian-bindings/perl> perl -Mstrict -I. -Iblib/arch -le 'use Xapian qw(:standard); print "", WILDCARD_LIMIT_MOST_FREQUENT' Bareword "WILDCARD_LIMIT_MOST_FREQUENT" not allowed while "strict subs" in use at -e line 1. Execution of -e aborted due to compilation errors. [1] 17365 exit 255 perl -Mstrict -I. -Iblib/arch -le dmka@solsort xapian-bindings/perl> perl -Mstrict -I. -Iblib/arch -le 'use Xapian qw(:standard); print "", Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT' Bareword "Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT" not allowed while "strict subs" in use at -e line 1. Execution of -e aborted due to compilation errors. [1] 17366 exit 255 perl -Mstrict -I. -Iblib/arch -le dmka@solsort xapian-bindings/perl> perl -Mstrict -I. -Iblib/arch -le 'use Xapian qw(:standard); print "", $Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT' 2
comment:12 by , 9 years ago
Milestone: | → 1.3.3 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
I don't think we can realistically improve things further for chert - to do so, we'd have to backport the cursor block referencing counting changes that have gone into glass. That's theoretically possible, since it doesn't change the database format or the library ABI, but it's more sensible to focus efforts on getting 1.4.0 out (and we're making decent progress on that - down to 12 open tickets now, and most of those have a patch in progress).
So I'm going to close this as fixed in 1.3.3, which is the development release which included the new wildcard modes.
create test database