Opened 7 years ago

Closed 5 months ago

#743 closed defect (fixed)

omindex: delay libmagic checks

Reported by: Olly Betts Owned by: Bruno Baruffaldi
Priority: normal Milestone: 1.4.12
Component: Omega Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

Currently omindex's logic is:

  • map the extension to a mime type
    • if "ignore" or "skip" move on to next file
  • check file size (requires stat() call, which we have avoided so far if the file system returns d_type from readdir())
    • if 0 or > max_size then move on to next file
  • if extension mapping didn't give a mime type, call libmagic to get a mime type
    • if libmagic doesn't recognise the file, move on to next file
  • create Document object and set up a little
  • check timestamps from stat() and the DB for an existing entry and move on to next file if this has been indexed and hasn't changed
  • check for failed entry in DB and move on if we already tried and failed (needs file size and last mod from stat())

The ordering here isn't ideal - in particular:

  • The probing done by libmagic is potentially fairly expensive since it has to open and read the start of the file, so we should avoid calling libmagic if another cheap check which doesn't need the mime type could reject the file (e.g. possibly timestamps if we can uncouple those checks from the check for the existing DB entry). If we have a mapping checking the mimetype for "ignore" or "skip" is still a cheap early check.
  • We create and setup the Document object a bit early (though this shouldn't be very expensive).

Change History (12)

comment:1 by Olly Betts, 7 years ago

Description: modified (diff)

In d32e13545e699e6920e0e5a3ebfe8a94206de32f in git master, caiyulun has moved the libmagic check after the file size checks (description updated to match). That should be backported to 1.4.5 I think.

comment:2 by Olly Betts, 6 years ago

That was backported for 1.4.5 in 33a6a91792548197c27a8ea7323314263d7cc232.

There's still potential for further improvement here though.

comment:3 by Bruno Baruffaldi, 5 years ago

I would like to try to solve this problem. I'll take a look at it.

comment:4 by Olly Betts, 5 years ago

Owner: changed from Olly Betts to Bruno Baruffaldi

comment:5 by Olly Betts, 5 years ago

Description: modified (diff)

comment:6 by Olly Betts, 5 years ago

I would like to try to solve this problem. I'll take a look at it.

Assigned to you.

Further incremental improvements to the situation are definitely useful here (there probably isn't a totally perfect answer anyway).

comment:7 by Bruno Baruffaldi, 5 years ago

I have some questions.

If the ctime of a file has changed but the mtime is unchanged, we can update the existing Document (and avoid re-extracting the text). However, I am not sure of which is the cheapest(fastest) way to update it.

Also, I don't know how to test the code properly. At the moment, I am using 'make check' but there are others options in the makefile.

To sum up, I see a 'FIXME' coment (FIXME: except permissions. at line 553) and I cannot understand it. It said ' We only store and check the mtime (last modified) - a change to the metadata won't generally cause a previous failure to now work. '

Last edited 5 years ago by Bruno Baruffaldi (previous) (diff)

comment:8 by Olly Betts, 5 years ago

If the ctime of a file has changed but the mtime is unchanged, we can update the existing Document (and avoid re-extracting the text). However, I am not sure of which is the cheapest(fastest) way to update it.

We can, but only when use_ctime is set which it isn't the default so this is of more limited benefit. It's also not what this ticket is actually about - see #632 for that.

Also, I don't know how to test the code properly. At the moment, I am using 'make check' but there are others options in the makefile.

make check runs all the tests there are.

But unfortunately there aren't any tests for this area of omindex currently. There are unit tests for some functionality it uses, but nothing for the handling of files.

FIXME: except permissions

Consider the case where we failed to index a file because the user the indexer runs as didn't have permission to read it. Then the permissions get fixed, which changes the ctime but not the mtime, but we only check the mtime when deciding to retry so won't retry it. It's really a corner case (and probably falls under #632). We could fix it by just changing to checking the ctime to decide whether to retry, but then we'd retry files in cases where nothing relevant has actually changed.

The whole ctime vs mtime thing would be good to address, but it's tricky to solve without causing a lot of extra files to be reindexed needlessly - probably not something you want to try to tackle as your first bug.

comment:9 by Bruno Baruffaldi, 5 years ago

make check runs all the tests there are.

But unfortunately there aren't any tests for this area of omindex currently. There are unit tests for some functionality it uses, but nothing for the handling of files.

I have made some changes in the code and make check seems to pass all the tests. However, prepare some tests for measuring the time of improvement will take me a while. In the meantime, I am going on with more modifications.

It would be great if you can take a look at the changes (honestly I am not sure modifying the code properly). Is it okay if I make a pull request? Or do you prefer to see it from my brach ?

Last edited 5 years ago by Bruno Baruffaldi (previous) (diff)

comment:10 by Olly Betts, 5 years ago

A pull request better supports the review process.

comment:11 by Bruno Baruffaldi, 5 years ago

I have just created a pull request (This is my first contribution to Xapian).

Last edited 5 years ago by Bruno Baruffaldi (previous) (diff)

comment:12 by Olly Betts, 5 months ago

Milestone: 1.4.x1.4.12
Resolution: fixed
Status: newclosed

Bruno's patch in https://github.com/xapian/xapian/pull/227 was merged ~4.5 years ago: d260b7d94c2f3e6e87e713d29038ef03a026b127. I looks like that actually addressed this issue so closing (there are some ctime-vs-mtime-related points above but #632 covers those).

Note: See TracTickets for help on using tickets.