Opened 9 years ago

Closed 9 years ago

#604 closed defect (fixed)

Document::get_data() causes process size to grow

Reported by: Graham Jones Owned by: Olly Betts
Priority: normal Milestone: 1.2.14
Component: Library API Version: 1.2.12
Severity: major Keywords:
Cc: Blocked By:
Blocking: Operating System: Linux

Description

Simple sample code to recreate the problem attached, with output. Switching between calling get_data and not doing so causes the program to leak or not.

I added guppy/heapy memory profile (not shown in the sample) and Python objects are not growing, so I suspect the leak is in the library or the bindings to it.

Attachments (4)

get_data_test.py (1.2 KB ) - added by Graham Jones 9 years ago.
get_data_results.txt (3.2 KB ) - added by Graham Jones 9 years ago.
get_data_test.cc (1.1 KB ) - added by Olly Betts 9 years ago.
C++ testcase
mset-only-cache-fetch.patch (1.1 KB ) - added by Olly Betts 9 years ago.
proposed fix

Download all attachments as: .zip

Change History (13)

by Graham Jones, 9 years ago

Attachment: get_data_test.py added

by Graham Jones, 9 years ago

Attachment: get_data_results.txt added

comment:1 by Olly Betts, 9 years ago

I don't seem to have a totally suitable test database to hand as I just get a single line of output for any of the ones I've tried, but if I modify the code to report every 100 documents I see the memory usage grow a little.

But valgrind seems to show we aren't actually leaking memory so I suspect we're just holding on to more and it'll get released eventually. The apparently linear growth is bad though.

I notice if I remove the .get_data() part, leaving just match.document, then I still see the memory growth, so I guess this is down to the !MSetItem object in the bindings.

Adding Python gc.collect() calls doesn't seem to help.

Richard: Any ideas?

comment:2 by Graham Jones, 9 years ago

But valgrind seems to show we aren't actually leaking memory so I suspect we're just holding on to more and it'll get released eventually. The apparently linear growth is bad though.

We'd always found that Xapian had been fairly leaky for us, but we would regularly recycle our processes to return memory to the OS, so we were living with it. We came to find it more of chronic problem now that one of our key customers is regularly retrieving 200-500 million records from one of our databases (we are storing around 200 billion documents) and iterating through those 200 million grew the process to over 200GB of memory for that operation.

Adding Python gc.collect() calls doesn't seem to help.

For sure, as heapy doesn't think there are any extra Python objects anyway.

comment:3 by Olly Betts, 9 years ago

I rewrote the testcase in C++ and it shows the same behaviour. However, running under valgrind like so:

GLIBCXX_FORCE_NEW=1 valgrind --leak-check=yes ./get_data_test tmp.db

reports:

==12366== All heap blocks were freed -- no leaks are possible

So it's not losing references to this memory. I wonder if it's due to heap fragmentation.

by Olly Betts, 9 years ago

Attachment: get_data_test.cc added

C++ testcase

comment:4 by Olly Betts, 9 years ago

Component: Xapian-bindings (Python)Library API
Owner: changed from Richard Boulton to Olly Betts
Summary: Document.get_data() seems to leak (from Python)Document::get_data() causes process size to grow

BTW, I did the C++ test with Xapian 1.2.12.

comment:5 by Olly Betts, 9 years ago

Milestone: 1.3.1

And it all becomes clear - the MSet is caching the documents we've looked at, thanks to the MSet::fetch() machinery.

While the idea of announcing up front which docs you'll want so a remote database can stream them all is reasonable (though ISTR it didn't deliver big benefits), the usual access pattern doesn't involve revisiting documents, and for a non-remote database it is doesn't make sense to do so.

by Olly Betts, 9 years ago

Attachment: mset-only-cache-fetch.patch added

proposed fix

comment:6 by Olly Betts, 9 years ago

Status: newassigned

The attached patch passes the testsuite and seems to fix the memory usage growth for me - the C++ version (with the logging frequency decreased to every 10000) now gives:

version: 1.3.0
 3400  21512
matches: 36287  read: 10000   3400  21512
read: 20000   3400  21512
read: 30000   3400  21512

Can you try this with your full setup?

comment:7 by Olly Betts, 9 years ago

Any feedback on the patch?

comment:8 by Olly Betts, 9 years ago

Milestone: 1.3.11.2.14

No feedback, but patch fixes the situation as I reproduced it and seems an improvement overall, so applied to trunk r17033.

Marking to consider for 1.2.14.

comment:9 by Olly Betts, 9 years ago

Resolution: fixed
Status: assignedclosed

Backported in r17076.

Note: See TracTickets for help on using tickets.