Opened 9 years ago

Closed 8 years ago

#679 closed defect (fixed)

Memory and speed issues in wildcard searches

Reported by: Dmitry Karasik Owned by: Olly Betts
Priority: normal Milestone: 1.3.3
Component: QueryParser Version: 1.3.2
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Dmitry Karasik)

Hello,

I have a problem with some searches when wildcarding is involved, which is xapian eating lots of memory and performing slowly. The problem manifests itself when expansion of a query with a trailing * in it returns too many matches, and thus the expanded query contains lots of individual words.

The simple example attached easily eats 1.6G on my machine, and never gives them back. This is a problem in long-running fastcgi processes, that grind the server down when users fire lots of these searches. I wonder if there can be done something about it.

What's more interesting, that most of the time I don't even need all of the expanded words to match, only some 100 first documents. However setting a cap on the number of wildcard expansion doesn't help, xapian casts an exception "Wildcard expands to more than X terms". Surely there's a reason behind this (I just don't know what is it), but probably there could be added a flag to QueryParser that forces capping of the expansion? This would help the speed of the search also.

Please find attached code examples and the output of memory use.

Sincerely, Dmitry Karasik IT System Developer Novozymes A/S Krogshoejvej 36 2880 Bagsvaerd Denmark

PS: tested on the bleeding edge shapshot

Attachments (3)

create (345 bytes ) - added by Dmitry Karasik 9 years ago.
create test database
search (482 bytes ) - added by Dmitry Karasik 9 years ago.
demonstrates the problem
out (375 bytes ) - added by Dmitry Karasik 9 years ago.
output on my machine

Download all attachments as: .zip

Change History (15)

by Dmitry Karasik, 9 years ago

Attachment: create added

create test database

by Dmitry Karasik, 9 years ago

Attachment: search added

demonstrates the problem

by Dmitry Karasik, 9 years ago

Attachment: out added

output on my machine

comment:1 by Dmitry Karasik, 9 years ago

Description: modified (diff)

comment:2 by Dmitry Karasik, 9 years ago

Component: OtherQueryParser

comment:3 by Dmitry Karasik, 9 years ago

Version: 1.3.2

comment:4 by Dmitry Karasik, 9 years ago

Description: modified (diff)

comment:5 by Olly Betts, 9 years ago

In git master (since 1.3.2) there are 3 modes for limiting wildcard expansion:

    enum {
        /** Throw an error if OP_WILDCARD exceeds its expansion limit.
         *
         *  Xapian::WildcardError will be thrown when the query is actually
         *  run.
         */
        WILDCARD_LIMIT_ERROR,
        /** Stop expanding when OP_WILDCARD reaches its expansion limit.
         *
         *  This makes the wildcard expand to only the first N terms (sorted
         *  by byte order).
         */
        WILDCARD_LIMIT_FIRST,
        /** Limit OP_WILDCARD expansion to the most frequent terms.
         *
         *  If OP_WILDCARD would expand to more than its expansion limit, the
         *  most frequent terms are taken.  This approach works well for cases
         *  such as expanding a partial term at the end of a query string which
         *  the user hasn't finished typing yet - as well as being less expense
         *  to evaluate than the full expansion, using only the most frequent
         *  terms tends to give better results too.
         */
        WILDCARD_LIMIT_MOST_FREQUENT
    };

And you can tell QueryParser which mode(s) to use for wildcards and for partial terms.

Using the "glass" backend for the database should make a significant difference to memory usage for cases like this, as glass reference counts cursor blocks, whereas chert just allocates fresh cursor blocks for all 100,000 terms.

I changed create to create the database like so:

my $db = Xapian::WritableDatabase->new( "index", DB_CREATE_OR_OPEN|Xapian::DB_BACKEND_GLASS );

And with current git master I get 111MB used by the search (taking off the memory allocated before we expand the wildcard, that's 713 bytes per term in the expanded wildcard):

  time    vsz (  diff)    rss (  diff) shared (  diff)   code (  diff)   data (  diff)
     0  43544 ( 43544)  11688 ( 11688)   6672 (  6672)      8 (     8)   5200 (  5200) 18
   627  113160 ( 69616)  82316 ( 70628)   7676 (  1004)      8 (     0)  74816 ( 69616) 20
   627  113160 (     0)  82316 (     0)   7676 (     0)      8 (     0)  74816 (     0) 22

The search took ages to run (I didn't time it, but it felt like several minutes). But my development tree is built with all possible assertions enabled, so I'm not sure how long it would take with a normal build anyway.

Last edited 9 years ago by Olly Betts (previous) (diff)

comment:6 by Dmitry Karasik, 9 years ago

Thank you for the quick response!

For some reason I couldn't import WILDCARD_LIMIT_* constants in my perl script, but I managed to re-run it with $qp->set_max_wildcard_expansion(10) and $qp->set_max_expansion(0). Unfortunately we can't use glass yet, it's not stable enough, but I could reproduce your result, it is indeed much less a memory hog, and it wasn't ages to run on a normal build. I hope 1.40 comes soon :)

The problem at hand though didn't go away even with these settings on chert - it's still eating 1.6G and not giving it back.

comment:7 by Olly Betts, 9 years ago

If you're using GCC, you could try telling its STL not to pool memory by setting this environment variable:

export GLIBCXX_FORCE_NEW=1

However, it pools by default for performance reasons, so this might make things slower.

Perhaps we should have a custom allocator for cursor blocks - those used for running a match should be released once that match has completed, so it shouldn't be very complex to track which blocks are used.

The Perl bindings have a hard-coded list of constants to export currently - you should be able to just use the full name though - e.g. Xapian::Query::WILDCARD_LIMIT_FIRST.

comment:8 by Dmitry Karasik, 9 years ago

I tried GLIBCXX_FORCE_NEW=1, that unfortunately didn't change anything.

As for $Xapian::Query:: yes that worked, thank you! But rather strangely: WILDCARD_LIMIT_ERROR is the only setting that doesn't throw an exception, while the other two do when I set $qp->set_max_wildcard_expansion(10). I actually expected otherwise; was that the meaning to be that way?

comment:9 by Dmitry Karasik, 9 years ago

(upd: yes, I used g++ 4.9 for GLIBCXX_FORCE_NEW)

comment:10 by Olly Betts, 9 years ago

I think you're just getting 0 instead of the correct constant values (WILDCARD_LIMIT_ERROR has value 0).

The constants should be wrapped without the $ (which is more conventional in Perl, and for consistency with other constants in the bindings). I've fixed that and added feature tests, and it all seems to work as expected.

comment:11 by Dmitry Karasik, 9 years ago

Yes, I also think so, but this is a much lesser issue and I trust that you have it under control. It's just that that I'm using default build, and it seems that xapian-bindings/perl doesn't use the build path with "perl Makefile.PL && make", that's probably why there are issues.

Just for reference, here's what I've got:

dmka@solsort xapian-bindings/perl> perl -Mstrict -I. -Iblib/arch -le 'use Xapian qw(:standard); print "", $Xapian::VERSION'
1.3.2.0
dmka@solsort xapian-bindings/perl> perl -Mstrict -I. -Iblib/arch -le 'use Xapian qw(:standard); print "", WILDCARD_LIMIT_MOST_FREQUENT'               
Bareword "WILDCARD_LIMIT_MOST_FREQUENT" not allowed while "strict subs" in use at -e line 1.
Execution of -e aborted due to compilation errors.
[1]    17365 exit 255   perl -Mstrict -I. -Iblib/arch -le 
dmka@solsort xapian-bindings/perl> perl -Mstrict -I. -Iblib/arch -le 'use Xapian qw(:standard); print "", Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT' 
Bareword "Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT" not allowed while "strict subs" in use at -e line 1.
Execution of -e aborted due to compilation errors.
[1]    17366 exit 255   perl -Mstrict -I. -Iblib/arch -le 
dmka@solsort xapian-bindings/perl> perl -Mstrict -I. -Iblib/arch -le 'use Xapian qw(:standard); print "", $Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT'
2
Last edited 9 years ago by Dmitry Karasik (previous) (diff)

comment:12 by Olly Betts, 8 years ago

Milestone: 1.3.3
Resolution: fixed
Status: newclosed

I don't think we can realistically improve things further for chert - to do so, we'd have to backport the cursor block referencing counting changes that have gone into glass. That's theoretically possible, since it doesn't change the database format or the library ABI, but it's more sensible to focus efforts on getting 1.4.0 out (and we're making decent progress on that - down to 12 open tickets now, and most of those have a patch in progress).

So I'm going to close this as fixed in 1.3.3, which is the development release which included the new wildcard modes.

Note: See TracTickets for help on using tickets.