Opened 16 years ago
Last modified 20 months ago
#282 assigned enhancement
Assorted enhancements to omindex
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | Omega | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
This is a patch from Reini Urban at AVL which was pasted into the wiki a while back; a ticket is really a more appropriate way to track it. We should look at folding some of these improvements in, though there are others we probably don't want to include, at least in the form they take in this patch.
I've updated the patch to compile with the latest Omega SVN HEAD, dropping parts which Omega now supports anyway, and splitting out some features into separate tickets. I've not run-tested it at all.
The remaining features in this patch are:
- Unpacking "container file types" (e.g. archives like .zip, email folders like .mbox, email messages with attachments) so we can index the sub-parts
- Logging stderr from filters to a file
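As an illustration of the second feature, here is a minimal sketch of capturing a filter's stdout while appending its stderr to a log file. It is not the code from the patch: `run_filter()` and `shell_quote()` are hypothetical names, and it assumes a POSIX system where `popen()` runs the command through `sh`.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Quote a string for POSIX sh: wrap it in single quotes and escape any
// embedded single quote as '\''.
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Hypothetical helper: run a filter command through the shell, return what
// it wrote to stdout, and append whatever it wrote to stderr to log_path.
std::string run_filter(const std::string& cmd, const std::string& log_path) {
    // Group the command so the redirection applies to all of it.
    std::string full = "{ " + cmd + " ; } 2>>" + shell_quote(log_path);
    FILE* p = popen(full.c_str(), "r");
    if (!p) throw std::runtime_error("popen failed");
    std::string out;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, p)) > 0) out.append(buf, n);
    pclose(p);
    return out;
}
```

Appending rather than truncating means one log can collect the output of every filter run during an indexing pass.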
Attachments (10)
Change History (28)
by , 16 years ago
Attachment: | omindex-assorted-enhancements.patch added |
---|
comment:1 by , 16 years ago
Milestone: | 1.1.0 → 1.1.1 |
---|
Bumping milestone to 1.1.1 as this isn't fit to apply as is.
by , 16 years ago
Attachment: | xapian-omega-1.0.7a-from-ticket-285-and-cleaned-up.patch added |
---|
Patch based on the same (and other?) changes from ticket #285
comment:3 by , 16 years ago
Now attached an updated version from Reini which was in ticket #285. I removed all the changes to generated files, as they just confuse things.
comment:4 by , 15 years ago
Milestone: | 1.1.7 → 1.2.0 |
---|
Outlook support is now ticket #334.
The older patch is indeed subsumed by the newer.
And I've crudely updated the newer patch by hacking out features which omindex now has (either from this patch or a precursor, or independently), changes which revert changes made in SVN while the patch was being worked on, and also features we don't want (notably configure-time probing for filter programs - we want the user to be able to just install a new filter program and use it without having to rebuild Omega).
Bumping the rest to 1.2.x.
by , 15 years ago
Attachment: | xapian-omega-1.0.7a-from-ticket-285-and-cleaned-up-updated.patch added |
---|
Updated version of newer patch
comment:5 by , 14 years ago
Trying to prise the various unrelated changes in this patch apart:
Adding a way to allow FLAG_WILDCARD to be specified in omega is #418.
I've created #512 for changing the default operator for Omega, and #513 for adding a testsuite for omindex, scriptindex, and the omega CGI.
I've split out the textcat support into a separate patch (since those are the bulk of the remaining changes) and created #514 for that issue.
The stripping of lines starting '### ' from unrtf output isn't needed now we get it to generate HTML (comments there use HTML comments).
Reporting Xapian exceptions in scriptindex.cc improved in r15111 (just use get_description() instead of get_msg() as that gives all the info which this patch does).
Indexing the file size as a value is done in trunk r15112 (using sortable_serialise() rather than a raw 32-bit value, as that allows a range search via a NumberVRP and is likely to be smaller for most files).
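To show why the encoding matters for range searches, here is a small sketch. It assumes values are compared as byte strings and that the raw value would be stored in little-endian order; `raw_le32()` and `be64()` are hypothetical helpers, and the fixed-width big-endian encoding is only a stand-in for an order-preserving encoding, not what sortable_serialise() actually produces (in particular it does not show the size advantage).

```cpp
#include <cstdint>
#include <string>

// Raw 32-bit value in little-endian byte order (assumed layout of the
// "raw" approach on common hardware). Cannot represent sizes >= 4 GiB.
std::string raw_le32(std::uint32_t v) {
    std::string s(4, '\0');
    for (int i = 0; i < 4; ++i)
        s[i] = static_cast<char>((v >> (8 * i)) & 0xff);
    return s;
}

// Fixed-width big-endian encoding: bytewise string order matches numeric
// order, which is the property a value range search relies on.
std::string be64(std::uint64_t v) {
    std::string s(8, '\0');
    for (int i = 0; i < 8; ++i)
        s[7 - i] = static_cast<char>((v >> (8 * i)) & 0xff);
    return s;
}
```

With these helpers, `raw_le32(256)` compares less than `raw_le32(1)` as a string, so a range over raw values would return the wrong documents, whereas `be64(1) < be64(256)` holds as expected.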
I've also cleaned up lingering references to stuff previously stripped from the patch, and renamed the added rmdir() function to rm_rf() since it removes the contents as well as the directory, so it's surprising to name it the same as the standard function to remove just a directory.
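For reference, the behaviour the new name describes can be sketched as follows. This is not the patch's implementation; it assumes C++17's std::filesystem is available, and only the name rm_rf() comes from the comment above.

```cpp
#include <filesystem>
#include <system_error>

// Sketch of rm_rf(): remove a directory together with everything inside it,
// unlike rmdir(), which only removes an empty directory.
// Returns true on success (including when dir does not exist).
bool rm_rf(const std::filesystem::path& dir) {
    std::error_code ec;
    // remove_all() deletes the contents recursively and then dir itself;
    // with this overload, failure is reported through ec instead of a throw.
    std::filesystem::remove_all(dir, ec);
    return !ec;
}
```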
by , 14 years ago
Updated version of patch
comment:6 by , 14 years ago
I'll attach a further updated patch in a moment, which:
- removes more changes which are no longer relevant
- drops partial attempt to add support for U CGI parameter for wildcarded support of URLs - if someone wanted to add this, they'd be better off starting from scratch than the code here
- untangles the diff where a lot of code has been reindented and also modified
- uses a subdirectory of tmpdir instead of having machinery to specify cache_dir (the user can set TMPDIR if /tmp isn't appropriate)
- splits some hunks which contained disparate changes, and trims context on other hunks where the wider context has changed but the narrower context should still match
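The tmpdir point above can be sketched roughly like this. It is an illustration rather than the patch's code: `tmp_base()` and `make_cache_dir()` are hypothetical names, "omindex-cache" is a made-up subdirectory name, and C++17 is assumed.

```cpp
#include <cstdlib>
#include <filesystem>
#include <string>

// Base temporary directory: $TMPDIR if set and non-empty, otherwise /tmp.
std::filesystem::path tmp_base() {
    const char* env = std::getenv("TMPDIR");
    if (env && *env) return std::filesystem::path(env);
    return std::filesystem::path("/tmp");
}

// Create (if necessary) and return a cache subdirectory under the base.
// The default leaf name is an assumption made for this sketch.
std::filesystem::path make_cache_dir(const std::string& leaf = "omindex-cache") {
    std::filesystem::path dir = tmp_base() / leaf;
    std::filesystem::create_directories(dir);
    return dir;
}
```

This keeps the user-facing knob to the one environment variable they may already have set, rather than adding a separate cache_dir option.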
So the different conceptual changes still in the patch are:
- Add -no-undefined to AM_LDFLAGS
- Unpacking "container file types" (e.g. archives like .zip, email folders like .mbox, email messages with attachments) so we can index the sub-parts
- Logging stderr from filters to a file
- The seemingly arbitrary addition of more words all starting with "a" to the stopword list - stopping some of these seems a bit aggressive to me
- Defaulting to adding the size and lastmod time of the dump file in scriptindex. In general, the size of the dump file seems misleading (though if you put one document per dump, less so). The lastmod isn't particularly helpful in many cases either
- Some tweaks to installing docs in the .spec file, which I don't know the reasons for
by , 14 years ago
further updated patch
comment:7 by , 14 years ago
Trunk now adds -no-undefined on platforms which need it to dynamically link, as of r15178.
by , 14 years ago
Patch against trunk shortly after 1.2.5
comment:9 by , 14 years ago
Description: | modified (diff) |
---|
Update description for changes in latest patch too (dropped the random extra stopwords and the doc-related changes to the spec file).
Latest patch builds, but functionality untested and probably isn't right.
by , 13 years ago
Attachment: | xapian-omega-trunk-r16058-from-ticket-285-and-cleaned-up-updated-2011-12-06.patch added |
---|
Patch against trunk r16062
comment:10 by , 13 years ago
Description: | modified (diff) |
---|
I've updated the patch to current trunk. Not tested building this time.
I also dropped the code added to scriptindex to add the size and lastmod time of the dump file to every document created from it. I don't see this making sense in most cases. Perhaps if you feed one document per dump file it does. But anyway, I think it's better to be explicit and put this data in the records in the dump file if you want it.
I also dropped the excel2text script as we already have XL handling, and this script doesn't add anything beyond stripping out all numbers, which I can see the motivation for, but isn't consistent with how we handle numbers in other formats, and isn't helpful for users wanting to search for a number.
comment:11 by , 13 years ago
Nice functionality, I especially like the recursion capability on archives and multipart email messages. However, there seem to be some problems with the handling of cache directories.
I'm testing on Ubuntu 10.04 (64bit). I had the following message:
$ ls -l /home/user/omegatrial/test/
total 1452
-rw-r--r-- 1 user user 1451909 2012-02-21 09:13 2797172.msg
I tried to index it with the following command, but kept getting errors.
$ omindex -M 'msg:application/vnd.ms-outlook' --db /home/user/omegadb/ --url / /home/user/omegatrial/test/
Can't open directory "/home/user/omegatrial/test/2797172.msg/.msg/2797172.msg" (Not a directory) - skipping directory "l/test/2797172.msg/.msg/2797172.msg"
I used the ms-outlook indexer because of the possibility of handling multipart attachments nicely. Even though the message was correctly exploded into the cache dir by mimeexplode, omindex seemed to look for it in a subdirectory of the original dir.
Also, the outlook2text script did not work for me at all. I'm attaching patches which seemed to fix the functionality for me.
by , 13 years ago
Attachment: | fix-outlook2text.patch added |
---|
by , 13 years ago
Attachment: | fix_omindex.patch added |
---|
comment:12 by , 12 years ago
Milestone: | 1.2.x → 1.3.x |
---|
Not suitable for 1.2.x at this point in the development cycle.
comment:13 by , 12 years ago
Note that the patch here isn't really expected to work as is - it's just "all the remaining changes from the original monster patch which haven't been implemented yet". It's been updated for changes in omega, and at most has been tested to compile.
outlook2text is intended to be generated by substituting into outlook2text.in - if you look there you will see:
@MSGCONVERT@ "$1" | @MIMEEXPLODE@ -d "$2"
So adding "cat" as fix-outlook2text.patch does is incorrect - it's supposed to run msgconvert.pl on "$1".
I've folded the changes from the other patch in, fixed it not to hardcode the other extensions into the cache path, and updated for current trunk. This compiles, but has not been tested beyond that.
by , 12 years ago
Attachment: | xapian-omega-trunk-r16879-from-ticket-285-and-cleaned-up-updated-2012-11-13.patch added |
---|
updated patch against trunk
comment:14 by , 12 years ago
I've fixed the incorrect substitution of @MSGCONVERT@ which was being done in Makefile.am, which should address the issue the first patch was aimed at.
comment:15 by , 10 years ago
Just to note that I'm working on adding support for indexing zip files, but using libarchive rather than command line unzip which allows for more efficient handling (in particular, we can avoid decompressing files for which the extension is marked as "ignore" in mime_map, and potentially not have to create temporary files on disk at all for some filetypes).
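The "avoid decompressing ignored files" idea can be sketched without libarchive itself: decide from the member's name alone whether it is worth extracting. The helpers below are hypothetical, the map stands in for mime_map (its real type may differ), and lowercasing the extension before lookup is an assumption of this sketch.

```cpp
#include <cctype>
#include <map>
#include <string>

// Lowercased extension of the leaf name, or "" if the leaf has no dot.
std::string extension_of(const std::string& name) {
    const auto slash = name.find_last_of('/');
    const auto dot = name.find_last_of('.');
    if (dot == std::string::npos) return "";
    // A dot in a directory component does not count as an extension.
    if (slash != std::string::npos && dot < slash) return "";
    std::string ext = name.substr(dot + 1);
    for (char& c : ext)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return ext;
}

// True if the archive member's extension is mapped to "ignore", in which
// case the caller can skip the entry without decompressing its data.
bool is_ignored(const std::string& member_name,
                const std::map<std::string, std::string>& mime_map) {
    const auto it = mime_map.find(extension_of(member_name));
    return it != mime_map.end() && it->second == "ignore";
}
```

The check only needs the entry header, so it can run before any of the member's data is read.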
comment:17 by , 6 years ago
e1b717c75a2460f27d20057b2c8a8561fa05930f adds a new --mime-type-match option, so the effect of this hunk can now instead be specified via the command-line option --mime-type-match='[Mm][Bb][Oo][Xx]:message/rfc822':

    if (strcasecmp(d.leafname(), "mbox") == 0) {
        // Special filename.
        mimetype = "message/rfc822";
        goto got_mimetype;
    }
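A quick way to convince yourself that the glob pattern is equivalent to the strcasecmp() test for this leaf name is sketched below. It uses POSIX fnmatch() as a stand-in for the matcher; whether omindex uses fnmatch() for --mime-type-match is an assumption here, and both function names are hypothetical.

```cpp
#include <fnmatch.h>
#include <strings.h>
#include <string>

// The check from the removed hunk: case-insensitive comparison with "mbox".
bool matches_by_strcasecmp(const std::string& leaf) {
    return strcasecmp(leaf.c_str(), "mbox") == 0;
}

// The same check expressed as the glob pattern given on the command line,
// evaluated with POSIX fnmatch() (assumed to behave like omindex's matcher).
bool matches_by_glob(const std::string& leaf) {
    return fnmatch("[Mm][Bb][Oo][Xx]", leaf.c_str(), 0) == 0;
}
```

Both accept any ASCII case variant of "mbox" and reject names that merely contain it, such as "mbox.txt".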
comment:18 by , 20 months ago
Milestone: | 1.4.x → 2.0.0 |
---|---|
Version: | SVN trunk → git master |
Patch updated for SVN HEAD