Opened 12 years ago
Closed 12 years ago
#604 closed defect (fixed)
Document::get_data() causes process size to grow
| Reported by: | Graham Jones | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.2.14 |
| Component: | Library API | Version: | 1.2.12 |
| Severity: | major | Keywords: | |
| Cc: | | Blocked By: | |
| Blocking: | | Operating System: | Linux |
Description
Simple sample code to recreate the problem is attached, along with its output. Calling get_data makes the process size grow steadily; skipping the call avoids the growth.
I added guppy/heapy memory profiling (not shown in the sample) and the set of Python objects is not growing, so I suspect the leak is in the library or in the bindings to it.
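The attached testcase itself isn't reproduced in the ticket; a minimal sketch of the same kind of measurement loop is below. The xapian calls are shown only as comments, and `rss_kb` and `iterate_and_report` are names invented here for illustration, not from the attachment.

```python
# Sketch of the testcase's bookkeeping: walk a result set and log the
# process's peak resident set size at regular intervals. In the real
# testcase each step iterates a xapian MSet and calls
# match.document.get_data(); here a plain range stands in for the MSet.
import resource

def rss_kb():
    # Peak resident set size of this process; on Linux ru_maxrss is in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def iterate_and_report(items, every=10000):
    """Walk `items`, recording (count, peak RSS) after every `every` items."""
    readings = []
    for i, item in enumerate(items, 1):
        _ = item  # real testcase: data = match.document.get_data()
        if i % every == 0:
            readings.append((i, rss_kb()))
    return readings

print(iterate_and_report(range(30000)))
```

With the real bindings, steadily increasing RSS readings across a single pass are what the reporter observed.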
Attachments (4)
Change History (13)
by , 12 years ago
Attachment: get_data_test.py added
by , 12 years ago
Attachment: get_data_results.txt added
comment:1 by , 12 years ago
I don't seem to have a totally suitable test database to hand, as I just get a single line of output for any of the ones I've tried, but if I modify the code to report every 100 documents I see the memory usage grow a little.
But valgrind seems to show we aren't actually leaking memory, so I suspect we're just holding on to more and it'll get released eventually. The apparently linear growth is bad, though.
I notice that if I remove the .get_data() part, leaving just match.document, I still see the memory growth, so I guess this is down to the MSetItem object in the bindings.
Adding Python gc.collect() calls doesn't seem to help.
Richard: Any ideas?
comment:2 by , 12 years ago
> But valgrind seems to show we aren't actually leaking memory so I suspect we're just holding on to more and it'll get released eventually. The apparently linear growth is bad though.
We'd always found Xapian fairly leaky for us, but we would regularly recycle our processes to return memory to the OS, so we were living with it. It became more of a chronic problem once one of our key customers started regularly retrieving 200-500 million records from one of our databases (we store around 200 billion documents): iterating through those 200 million grew the process to over 200GB of memory for that operation.
> Adding Python gc.collect() calls doesn't seem to help.
For sure, as heapy doesn't think there are any extra Python objects anyway.
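The heapy check isn't shown in the ticket; a stdlib sketch of the same idea, using tracemalloc instead of guppy/heapy, is below. The `work` function is a stand-in invented here for the get_data() loop: if Python-level allocations stay flat while the process RSS grows, the growth must come from the C/C++ side.

```python
# Verify that repeated work does not accumulate Python-level allocations.
# tracemalloc only sees allocations made through Python's allocator, which
# is exactly the distinction heapy was used for in the ticket.
import tracemalloc

def work(n):
    # Stand-in for iterating matches and calling get_data(): creates and
    # discards Python objects, so nothing should accumulate.
    for _ in range(n):
        s = "x" * 100
        del s

tracemalloc.start()
work(10000)
first, _ = tracemalloc.get_traced_memory()
work(10000)
second, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Flat Python-level usage across runs points the finger below Python.
print(first, second)
```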
comment:3 by , 12 years ago
I rewrote the testcase in C++ and it shows the same behaviour. However, running under valgrind like so:
GLIBCXX_FORCE_NEW=1 valgrind --leak-check=yes ./get_data_test tmp.db
reports:
==12366== All heap blocks were freed -- no leaks are possible
So it's not losing references to this memory. I wonder if it's due to heap fragmentation.
comment:4 by , 12 years ago
Component: Xapian-bindings (Python) → Library API
Owner: changed
Summary: Document.get_data() seems to leak (from Python) → Document::get_data() causes process size to grow
BTW, I did the C++ test with Xapian 1.2.12.
comment:5 by , 12 years ago
Milestone: → 1.3.1
And it all becomes clear - the MSet is caching the documents we've looked at, thanks to the MSet::fetch() machinery.
While the idea of announcing up front which docs you'll want, so that a remote database can stream them all, is reasonable (though ISTR it didn't deliver big benefits), the usual access pattern doesn't involve revisiting documents, and for a non-remote database it doesn't make sense to do so.
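The behaviour can be sketched abstractly with toy classes (names invented for illustration; this is not Xapian's actual implementation): a result-set object that caches every document it hands out keeps them all alive for its own lifetime, so a single pass over N results holds N documents in memory.

```python
# Toy illustration of why a fetch()-style cache on the result-set object
# makes one pass over N results retain N documents.
class Document:
    def __init__(self, docid):
        self.docid = docid
        self.data = b"payload" * 100  # stand-in for get_data() contents

class CachingMSet:
    """Retains every document it has handed out, like the fetch() cache."""
    def __init__(self, docids):
        self.docids = docids
        self._cache = {}

    def get_document(self, docid):
        if docid not in self._cache:
            self._cache[docid] = Document(docid)
        return self._cache[docid]

class NonCachingMSet:
    """Hands out a fresh document each time; nothing is retained."""
    def __init__(self, docids):
        self.docids = docids

    def get_document(self, docid):
        return Document(docid)

caching = CachingMSet(range(1000))
for docid in caching.docids:
    caching.get_document(docid)
print(len(caching._cache))  # 1000: every document stays alive with the MSet
```

With the non-caching variant, each document becomes garbage as soon as the caller drops it, which is the behaviour the fix restores for the usual iterate-once access pattern.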
comment:6 by , 12 years ago
Status: new → assigned
The attached patch passes the testsuite and seems to fix the memory usage growth for me - the C++ version (with the logging frequency decreased to every 10000) now gives:
version: 1.3.0
3400 21512
matches: 36287
read: 10000 3400 21512
read: 20000 3400 21512
read: 30000 3400 21512
Can you try this with your full setup?
comment:8 by , 12 years ago
Milestone: 1.3.1 → 1.2.14
No feedback, but the patch fixes the situation as I reproduced it and seems an improvement overall, so applied to trunk as r17033.
Marking to consider for 1.2.14.