Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#637 closed defect (worksforme)

Potential memory leak when assigning MSetItem values

Reported by: Jeff Rand Owned by: Richard Boulton
Priority: normal Milestone:
Component: Xapian-bindings (Python) Version: 1.2.15
Severity: normal Keywords: Memory leak
Cc: Blocked By:
Blocking: Operating System: Linux

Description (last modified by Olly Betts)

I've traced a memory leak to a statement that assigns the values from an MSetItem to a dictionary, which is then appended to a list in Python. We're running Python 2.7.3, xapian-core 1.2.15 and xapian-bindings 1.2.15. I've provided an example below which reproduces the behavior. The example prints the PID and has a few statements waiting for input, to make observing the behavior easier.

Run the following code and monitor the PID's memory usage in top or a similar program. I've observed the resident memory for this example grow from 18 MB to 52 MB, even after deleting the objects and running garbage collection.

I think the MSetItems are preserved in memory and are not being garbage collected correctly, possibly because of a lingering reference to the MSet or MSetIterator.

import os                                                             
import simplejson as json                                             
import xapian as x                                                    
import shutil                                                         
import gc                                                             
                                                                      
def make_db(path, num_docs=100000):                                   
    try:                                                              
        shutil.rmtree(path)                                           
    except OSError, e:                                                
        if e.errno != 2:                                              
            raise                                                     
                                                                      
    db = x.WritableDatabase(path, x.DB_CREATE)                        
    for i in xrange(1, num_docs):                                     
        doc = x.Document()                                            
        doc.set_data(json.dumps({ 'id': i, 'enabled': True }))        
        doc.add_term('XTYPA')                                         
        db.add_document(doc)                                          
    return db                                                         
                                                                      
def run_query(db, num_docs=100000):                                   
    e = x.Enquire(db)                                                 
    e.set_query(x.Query('XTYPA'))                                     
    m = e.get_mset(0, num_docs, True, None)                           
                                                                      
    # Store the MSetItem's data, which causes a memory leak            
    data = []                                                         
    for i in m:                                                       
        data.append({ 'data': i.document.get_data(), 'id': i.docid, })
                                                                      
    # Make sure I'm not crazy                                         
    del num_docs, db, i, e, m, data                                   
    gc.collect()                                                      
                                                                      
def main():                                                           
    # print the PID to monitor                                        
    print 'PID to monitor: {}'.format(os.getpid())                    
                                                                      
    db = make_db('/tmp/test.db')                                      
    raw_input("database is done, ready?")                             
                                                                      
    run_query(db, 100000)                                             
    raw_input('done?')                                                
                                                                      
if __name__ == '__main__':                                            
    main()     
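Since the ticket is filed against Linux, the resident memory can also be read programmatically instead of watching top by hand. A minimal sketch (the `rss_kb` helper is illustrative and not part of the reporter's script; it relies on the Linux-only `/proc/self/status` file):

```python
def rss_kb():
    # Parse VmRSS (resident set size) out of /proc/self/status.
    # Linux-only; the kernel reports the value in kB.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    raise RuntimeError('VmRSS not found in /proc/self/status')

print('RSS: {} kB'.format(rss_kb()))
```

Calling this before and after `run_query()` would give the same numbers one sees in top, without the manual `raw_input()` pauses.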

Attachments (1)

ticket637.py (3.4 KB ) - added by Olly Betts 10 years ago.
modified test script


Change History (4)

comment:1 by Olly Betts, 10 years ago

Description: modified (diff)

If you ask the Python gc module how many objects are allocated, the count doesn't increase. The attached, slightly modified version of your script shows this (note that calling gc.collect() more than once sometimes seems to be necessary to actually collect all objects; I'm not sure why).

On trunk:

$ ./run-python-test ticket637.py
PID to monitor: 4107
database is done, ready?
num objects before =  7519
num objects after =  7519
done?
$

And HEAD of 1.2 branch:

$ PYTHONPATH=. python ticket637.py 
PID to monitor: 972
database is done, ready?
num objects before =  7115
num objects after =  7115
done?

So I don't see how this can be Python hanging on to objects.

I think this is just due to C++'s allocator hanging on to memory. As I said in my reply to the mailing list, this memory should just get reused by later operations (like the next query you run).
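The gc-level check described above doesn't need Xapian at all; the same pattern shows up with any workload that builds and discards many Python objects. A sketch of the object-count comparison (the names and sizes here are illustrative, not taken from the attached script):

```python
import gc

def tracked_object_count():
    # Collect first so the count isn't inflated by collectable garbage.
    gc.collect()
    return len(gc.get_objects())

before = tracked_object_count()

# Build and discard many dicts, analogous to storing one dict per MSetItem.
data = [{'id': i, 'data': 'x' * 100} for i in range(100000)]
del data

# As noted above, collecting more than once is sometimes needed.
gc.collect()
after = tracked_object_count()
```

If `after` stays close to `before`, Python isn't retaining the objects, even though the process's resident memory (as allocator pools) may still look high from the outside.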

by Olly Betts, 10 years ago

Attachment: ticket637.py added

modified test script

comment:2 by Olly Betts, 10 years ago

Resolution: worksforme
Status: new → closed

No further info for 6 weeks, so closing as "worksforme".

If anyone can show evidence that there's actually a leak here (rather than just memory pooling by C++), please reopen.

If you're using GCC >= 3.4, you can export GLIBCXX_FORCE_NEW=1 before running your code to stop the C++ runtime from pooling freed memory, which might help to determine whether this is the cause of what you're seeing.
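For reference, a sketch of that invocation; the `python3 -c` command is a stand-in, so substitute your real reproduction script:

```shell
# GLIBCXX_FORCE_NEW=1 makes libstdc++ (GCC >= 3.4) bypass its pooling
# allocator, so freed C++ memory is released via operator delete right away.
export GLIBCXX_FORCE_NEW=1
# Confirm the variable is visible to child processes; replace this
# placeholder command with the script that reproduces the issue.
python3 -c "import os; print(os.environ['GLIBCXX_FORCE_NEW'])"
```

If the resident memory growth disappears under this setting, the "leak" was just the libstdc++ allocator caching memory for reuse.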

comment:3 by Olly Betts, 10 years ago

Milestone: 1.2.x