Opened 8 years ago
Closed 11 months ago
#743 closed defect (fixed)
omindex: delay libmagic checks
Reported by: | Olly Betts | Owned by: | Bruno Baruffaldi |
---|---|---|---|
Priority: | normal | Milestone: | 1.4.12 |
Component: | Omega | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
Currently omindex's logic is:
- map the extension to a mime type
- if "ignore" or "skip" move on to next file
- check file size (requires
stat()
call, which we have avoided so far if the file system returnsd_type
fromreaddir()
)- if 0 or > max_size then move on to next file
- if extension mapping didn't give a mime type, call libmagic to get a mime type
- if libmagic doesn't recognise the file, move on to next file
- create
Document
object and set up a little - check timestamps from
stat()
and the DB for an existing entry and move on to next file if this has been indexed and hasn't changed - check for failed entry in DB and move on if we already tried and failed (needs file size and last mod from
stat()
)
The ordering here isn't ideal - in particular:
- The probing done by libmagic is potentially fairly expensive since it has to open and read the start of the file, so we should avoid calling libmagic if another cheap check which doesn't need the mime type could reject the file (e.g. possibly timestamps if we can uncouple those checks from the check for the existing DB entry). If we have a mapping checking the mimetype for "ignore" or "skip" is still a cheap early check.
- We create and setup the
Document
object a bit early (though this shouldn't be very expensive).
Change History (12)
comment:1 by , 8 years ago
Description: | modified (diff) |
---|
comment:2 by , 6 years ago
That was backported for 1.4.5 in 33a6a91792548197c27a8ea7323314263d7cc232.
There's still potential for further improvement here though.
comment:4 by , 6 years ago
Owner: | changed from | to
---|
comment:5 by , 6 years ago
Description: | modified (diff) |
---|
comment:6 by , 6 years ago
I would like to try to solve this problem. I'll take a look at it.
Assigned to you.
Further incremental improvements to the situation are definitely useful here (there probably isn't a totally perfect answer anyway).
comment:7 by , 6 years ago
I have some questions.
If the ctime of a file has changed but the mtime is unchanged, we can update the existing Document (and avoid re-extracting the text). However, I am not sure of which is the cheapest(fastest) way to update it.
Also, I don't know how to test the code properly. At the moment, I am using 'make check' but there are others options in the makefile.
To sum up, I see a 'FIXME' coment (FIXME: except permissions. at line 553) and I cannot understand it. It said ' We only store and check the mtime (last modified) - a change to the metadata won't generally cause a previous failure to now work. '
comment:8 by , 6 years ago
If the ctime of a file has changed but the mtime is unchanged, we can update the existing Document (and avoid re-extracting the text). However, I am not sure of which is the cheapest(fastest) way to update it.
We can, but only when use_ctime
is set which it isn't the default so this is of more limited benefit. It's also not what this ticket is actually about - see #632 for that.
Also, I don't know how to test the code properly. At the moment, I am using 'make check' but there are others options in the makefile.
make check
runs all the tests there are.
But unfortunately there aren't any tests for this area of omindex currently. There are unit tests for some functionality it uses, but nothing for the handling of files.
FIXME: except permissions
Consider the case where we failed to index a file because the user the indexer runs as didn't have permission to read it. Then the permissions get fixed, which changes the ctime but not the mtime, but we only check the mtime when deciding to retry so won't retry it. It's really a corner case (and probably falls under #632). We could fix it by just changing to checking the ctime to decide whether to retry, but then we'd retry files in cases where nothing relevant has actually changed.
The whole ctime vs mtime thing would be good to address, but it's tricky to solve without causing a lot of extra files to be reindexed needlessly - probably not something you want to try to tackle as your first bug.
comment:9 by , 6 years ago
make check
runs all the tests there are.
But unfortunately there aren't any tests for this area of omindex currently. There are unit tests for some functionality it uses, but nothing for the handling of files.
I have made some changes in the code and make check
seems to pass all the tests. However, prepare some tests for measuring the time of improvement will take me a while. In the meantime, I am going on with more modifications.
It would be great if you can take a look at the changes (honestly I am not sure modifying the code properly). Is it okay if I make a pull request? Or do you prefer to see it from my brach ?
comment:11 by , 6 years ago
I have just created a pull request (This is my first contribution to Xapian).
comment:12 by , 11 months ago
Milestone: | 1.4.x → 1.4.12 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Bruno's patch in https://github.com/xapian/xapian/pull/227 was merged ~4.5 years ago: d260b7d94c2f3e6e87e713d29038ef03a026b127. I looks like that actually addressed this issue so closing (there are some ctime-vs-mtime-related points above but #632 covers those).
In d32e13545e699e6920e0e5a3ebfe8a94206de32f in git master, caiyulun has moved the libmagic check after the file size checks (description updated to match). That should be backported to 1.4.5 I think.