Opened 16 years ago

Last modified 20 months ago

#282 assigned enhancement

Assorted enhancements to omindex

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: Omega
Version: git master
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

A patch from Reini Urban at AVL which was pasted into the wiki a while back, but a ticket is really a more appropriate way to track it. We should look at folding some of these improvements in, though some others we probably don't want to include, at least in the form in this patch.

I've updated the patch to compile with latest Omega SVN HEAD, dropping parts which Omega now supports anyway, and splitting out some features into separate tickets. I've not run-tested it at all.

The remaining features in this patch are:

  • Unpacking "container file types" (e.g. archives like .zip, email folders like .mbox, email messages with attachments) so we can index the sub-parts
  • Logging stderr from filters to a file

Attachments (10)

omindex-assorted-enhancements.patch (14.0 KB ) - added by Olly Betts 16 years ago.
Patch updated for SVN HEAD
xapian-omega-1.0.7a-from-ticket-285-and-cleaned-up.patch (154.5 KB ) - added by Olly Betts 16 years ago.
Patch based on the same (and other?) changes from ticket #285
xapian-omega-1.0.7a-from-ticket-285-and-cleaned-up-updated.patch (133.7 KB ) - added by Olly Betts 15 years ago.
Updated version of newer patch
xapian-omega-1.0.7a-from-ticket-285-and-cleaned-up-updated-2010-10-27.patch (60.9 KB ) - added by Olly Betts 14 years ago.
Updated version of patch
xapian-omega-1.0.7a-from-ticket-285-and-cleaned-up-updated-2010-10-29.patch (48.2 KB ) - added by Olly Betts 14 years ago.
further updated patch
xapian-omega-1.2.5-from-ticket-285-and-cleaned-up-updated-2011-05-13.patch (49.1 KB ) - added by Olly Betts 14 years ago.
Patch against trunk shortly after 1.2.5
xapian-omega-trunk-r16058-from-ticket-285-and-cleaned-up-updated-2011-12-06.patch (47.9 KB ) - added by Olly Betts 13 years ago.
Patch against trunk r16062
fix-outlook2text.patch (717 bytes ) - added by exec 13 years ago.
fix_omindex.patch (1.3 KB ) - added by exec 13 years ago.
xapian-omega-trunk-r16879-from-ticket-285-and-cleaned-up-updated-2012-11-13.patch (46.8 KB ) - added by Olly Betts 12 years ago.
updated patch against trunk


Change History (28)

by Olly Betts, 16 years ago

Patch updated for SVN HEAD

comment:1 by Olly Betts, 16 years ago

Milestone: 1.1.0 → 1.1.1

Bumping milestone to 1.1.1 as this isn't fit to apply as is.

comment:2 by Olly Betts, 16 years ago

Milestone: 1.1.1 → 1.1.7

Triaging milestone:1.1.1 bugs.

by Olly Betts, 16 years ago

Patch based on the same (and other?) changes from ticket #285

comment:3 by Olly Betts, 16 years ago

Now attached an updated version from Reini which was in ticket #285. I removed all the changes to generated files, as they just confuse things.

comment:4 by Olly Betts, 15 years ago

Milestone: 1.1.7 → 1.2.0

Outlook support is now ticket #334.

The older patch is indeed subsumed by the newer.

And I've crudely updated the newer patch by hacking out features which omindex now has (either from this patch or a precursor, or independently), changes which revert changes made in SVN while the patch was being worked on, and also features we don't want (notably configure-time probing for filter programs - we want the user to be able to just install a new filter program and use it without having to rebuild Omega).

Bumping the rest to 1.2.x.

by Olly Betts, 15 years ago

Updated version of newer patch

comment:5 by Olly Betts, 14 years ago

Trying to prise the various unrelated changes in this patch apart:

Adding a way to allow FLAG_WILDCARD to be specified in omega is #418.

I've created #512 for changing the default operator for Omega, and #513 for adding a testsuite for omindex, scriptindex, and the omega CGI.

I've split out the textcat support into a separate patch (since those are the bulk of the remaining changes) and created #514 for that issue.

The stripping of lines starting '### ' from unrtf output isn't needed now we get it to generate HTML (comments there use HTML comments).

Reporting Xapian exceptions in scriptindex.cc improved in r15111 (just use get_description() instead of get_msg() as that gives all the info which this patch does).

Indexing the file size as a value is done in trunk r15112 (using sortable_serialise() rather than a raw 32 bit value, as that allows a range search via a NumberVRP and is likely to be smaller for most files).

I've also cleaned up lingering references to stuff previously stripped from the patch, and renamed the added rmdir() function to rm_rf() since it removes the contents as well as the directory, so it's surprising to name it the same as the standard function to remove just a directory.
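To illustrate the naming point, here is a minimal sketch of what a function with rm_rf() semantics does, assuming C++17's std::filesystem is available. It is not the code from the patch (which isn't reproduced in this ticket); only the name rm_rf() is taken from the comment above, and the signature and return value are assumptions.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical stand-in for the patch's rm_rf(): remove `path` and everything
// beneath it, like "rm -rf".  Contrast with the standard rmdir(), which only
// removes an empty directory.
//
// Returns true on success, including when `path` did not exist to begin with.
bool rm_rf(const std::string& path) {
    std::error_code ec;
    // remove_all() deletes the contents recursively and then the directory
    // itself; on failure it sets `ec` instead of throwing.
    std::filesystem::remove_all(path, ec);
    return !ec;
}
```

Calling rm_rf() on a directory that still contains files or subdirectories succeeds, which is exactly why reusing the name rmdir() for it would be surprising.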

by Olly Betts, 14 years ago

Updated version of patch

comment:6 by Olly Betts, 14 years ago

I'll attach a further updated patch in a moment, which:

  • removes more changes which are no longer relevant
  • drops partial attempt to add support for U CGI parameter for wildcarded support of URLs - if someone wanted to add this, they'd be better off starting from scratch than the code here
  • untangles the diff where a lot of code has been reindented and also modified
  • uses a subdirectory of tmpdir instead of having machinery to specify cache_dir (the user can set TMPDIR if /tmp isn't appropriate)
  • splits some hunks with disparate changes in, and trims context on other hunks where the wider context has changed but the narrower context should still match

So the different conceptual changes still in the patch are:

  • Add -no-undefined to AM_LDFLAGS
  • Unpacking "container file types" (e.g. archives like .zip, email folders like .mbox, email messages with attachments) so we can index the sub-parts
  • Logging stderr from filters to a file
  • The seemingly arbitrary addition of more words all starting with "a" to the stopword list - stopping some of these seems a bit aggressive to me
  • Defaulting to adding the size and lastmod time of the dump file in scriptindex. In general, the size of the dump file seems misleading (though if you put one document per dump, less so). The lastmod isn't particularly helpful in many cases either
  • Some tweaks to installing docs in the .spec file, which I don't know the reasons for
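For the "logging stderr from filters to a file" item in the list above, the idea can be sketched roughly as follows. This is not the patch's implementation: run_filter() and its parameters are hypothetical names, and the sketch assumes a POSIX system where popen() runs the command via sh, and that log_path contains no single quotes.

```cpp
#include <stdio.h>
#include <string>

// Hypothetical helper: run a filter command, return what it wrote to stdout,
// and append whatever it wrote to stderr to the file at `log_path`.
// Assumes POSIX popen() and that `log_path` contains no single quotes.
std::string run_filter(const std::string& cmd, const std::string& log_path) {
    // Group the command so the redirection applies to all of it.
    std::string full = "{ " + cmd + "; } 2>>'" + log_path + "'";
    FILE* pipe = popen(full.c_str(), "r");
    if (!pipe) return std::string();

    std::string out;
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, pipe)) > 0) {
        out.append(buf, n);
    }
    pclose(pipe);
    return out;
}
```

Appending (2>>) rather than truncating means messages from every filter run during one indexing pass end up in the same log.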

comment:7 by Olly Betts, 14 years ago

Trunk now adds -no-undefined on platforms which need it to dynamically link, as of r15178.

by Olly Betts, 14 years ago

Patch against trunk shortly after 1.2.5

comment:8 by Olly Betts, 14 years ago

Description: modified (diff)
Status: new → assigned

Update description.

comment:9 by Olly Betts, 14 years ago

Description: modified (diff)

Update description for changes in latest patch too (dropped the random extra stopwords and the doc-related changes to the spec file).

Latest patch builds, but functionality untested and probably isn't right.

comment:10 by Olly Betts, 13 years ago

Description: modified (diff)

I've updated the patch to current trunk. Not tested building this time.

I also dropped the code added to scriptindex to add the size and lastmod time of the dump file to every document created from it. I don't see this making sense in most cases. Perhaps if you feed one document per dump file it does. But anyway, I think it's better to be explicit and put this data in the records in the dump file if you want it.

I also dropped the excel2text script as we already have XL handling, and this script doesn't add anything beyond stripping out all numbers, which I can see the motivation for, but isn't consistent with how we handle numbers in other formats, and isn't helpful for users wanting to search for a number.

comment:11 by exec, 13 years ago

Nice functionality, I especially like the recursion capability on archives and multipart email messages. However, there seem to be some problems with the handling of cache directories.

I'm testing on Ubuntu 10.04 (64bit). I had the following message:

$ ls -l /home/user/omegatrial/test/
total 1452
-rw-r--r-- 1 user user 1451909 2012-02-21 09:13 2797172.msg

I tried to index it with the following command, but kept getting errors.

$ omindex -M 'msg:application/vnd.ms-outlook' --db /home/user/omegadb/ --url / /home/user/omegatrial/test/ 
Can't open directory "/home/user/omegatrial/test/2797172.msg/.msg/2797172.msg" (Not a directory) - skipping directory "l/test/2797172.msg/.msg/2797172.msg"

I used the ms-outlook indexer because of the possibility of handling multipart attachments nicely. Even though the message was correctly exploded into the cache dir by mimeexplode, omindex seemed to look for it in a subdirectory of the original dir.

Also, the outlook2text script did not work for me at all. I'm attaching patches which seemed to fix the functionality for me.

by exec, 13 years ago

Attachment: fix-outlook2text.patch added

by exec, 13 years ago

Attachment: fix_omindex.patch added

comment:12 by Olly Betts, 12 years ago

Milestone: 1.2.x → 1.3.x

Not suitable for 1.2.x at this point in the development cycle.

comment:13 by Olly Betts, 12 years ago

Note that the patch here isn't really expected to work as is - it's just "all the remaining changes from the original monster patch which haven't been implemented yet". It's been updated for changes in omega, and at most has been tested to compile.

outlook2text is intended to be generated by substituting into outlook2text.in - if you look there you will see:

@MSGCONVERT@ "$1" | @MIMEEXPLODE@ -d "$2"

So adding "cat" as fix-outlook2text.patch does is incorrect - it's supposed to run msgconvert.pl on "$1".

I've folded the changes from the other patch in, fixed it not to hardcode the other extensions into the cache path, and updated for current trunk. This compiles, but has not been tested beyond that.

by Olly Betts, 12 years ago

updated patch against trunk

comment:14 by Olly Betts, 12 years ago

I've fixed the incorrect substitution of @MSGCONVERT@ which was being done in Makefile.am, which should address the issue the first patch was aimed at.

comment:15 by Olly Betts, 10 years ago

Just to note that I'm working on adding support for indexing zip files, but using libarchive rather than command line unzip which allows for more efficient handling (in particular, we can avoid decompressing files for which the extension is marked as "ignore" in mime_map, and potentially not have to create temporary files on disk at all for some filetypes).
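The "avoid decompressing ignored files" part can be sketched without libarchive at all: the decision only needs the entry's name, which an archive reader can supply before any data is decompressed. The helper below is hypothetical, and a plain std::map stands in for omindex's mime_map; how omindex really looks up extensions (for example case folding) is not covered here.

```cpp
#include <map>
#include <string>

// Hypothetical helper: decide from an archive entry's pathname alone whether
// it is worth decompressing.  `mime_map` maps an extension (without the dot)
// to a MIME type, or to the special value "ignore".
bool should_extract(const std::string& entry_name,
                    const std::map<std::string, std::string>& mime_map) {
    // Only look at the leafname, so a dot in a directory name doesn't count.
    std::string::size_type slash = entry_name.rfind('/');
    std::string leaf = (slash == std::string::npos)
                           ? entry_name
                           : entry_name.substr(slash + 1);

    std::string::size_type dot = leaf.rfind('.');
    if (dot == std::string::npos) return true;  // No extension to check.

    std::map<std::string, std::string>::const_iterator it =
        mime_map.find(leaf.substr(dot + 1));
    // Skip only entries explicitly marked "ignore"; anything else is left
    // for the caller to handle as it would a file on disk.
    return it == mime_map.end() || it->second != "ignore";
}
```

With a streaming reader, an entry for which this returns false can simply be skipped, so its compressed data is never inflated or written to a temporary file.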

comment:16 by Olly Betts, 10 years ago

Milestone: 1.3.x → 1.4.x

This isn't worth holding up 1.4.0 for.

comment:17 by Olly Betts, 6 years ago

e1b717c75a2460f27d20057b2c8a8561fa05930f adds a new --mime-type-match option, so the effect of this hunk can now instead be specified via command-line option --mime-type-match='[Mm][Bb][Oo][Xx]:message/rfc822':

        if (strcasecmp(d.leafname(), "mbox") == 0) {
            // Special filename.
            mimetype = "message/rfc822";
            goto got_mimetype;
        }
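As a rough check that the glob in that option covers the same filenames as the strcasecmp() test, here is a small sketch using POSIX fnmatch(). The helper name and the way the PATTERN:MIMETYPE rule is split are assumptions for illustration; this ticket doesn't show how omindex itself parses or matches the option.

```cpp
#include <fnmatch.h>
#include <string>

// Hypothetical helper: given a rule of the form "PATTERN:MIMETYPE" and a
// leafname, return MIMETYPE if the glob PATTERN matches the whole leafname,
// or an empty string otherwise.  Uses POSIX fnmatch() for the glob matching.
std::string match_mime_type(const std::string& rule,
                            const std::string& leafname) {
    std::string::size_type colon = rule.find(':');
    if (colon == std::string::npos) return std::string();

    std::string pattern = rule.substr(0, colon);
    if (fnmatch(pattern.c_str(), leafname.c_str(), 0) == 0) {
        return rule.substr(colon + 1);
    }
    return std::string();
}
```

Each bracket expression such as [Mm] accepts either case of one letter, so the pattern matches "mbox", "MBOX", "Mbox" and so on, but not longer names such as "mbox.txt".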

comment:18 by Olly Betts, 20 months ago

Milestone: 1.4.x → 2.0.0
Version: SVN trunk → git master