Opened 6 years ago

Last modified 12 months ago

#771 new enhancement

omindex: Handle "directory documents"

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: Omega
Version:
Severity: normal
Keywords: GoodFirstBug
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Documents from Apple iWork (Keynote, Pages, Numbers) can be either a single file or a directory of files. (I think the single-file variant is actually a zip container holding the same files as the directory variant.)

As of 1.4.8, omindex can handle the single-file variant, but it currently doesn't consider that a directory could be a document.

We'd need a way to indicate that a file-extension-to-MIME-type mapping is for a directory, and then check directories whose leafname has an extension against the list of MIME-type mappings that are valid for directories. If we get a MIME type, we should handle that path as a document to index rather than recursing into it.
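
A minimal sketch of that check (mime_map_for_dirs, index_as_document and recurse_into are hypothetical names for illustration, not omindex's actual internals):

{{{#!c++
// Illustrative sketch only: these names are hypothetical, not
// omindex's actual internals.
#include <map>
#include <string>

void index_as_document(const std::string& path, const std::string& mimetype);
void recurse_into(const std::string& path);

// Extension-to-MIME-type mappings which are valid for directories.
static const std::map<std::string, std::string> mime_map_for_dirs = {
    { "key", "application/vnd.apple.keynote" },
    { "pages", "application/vnd.apple.pages" },
    { "numbers", "application/vnd.apple.numbers" },
};

// Called when the directory walker reaches a subdirectory.
void handle_directory(const std::string& path, const std::string& leafname) {
    std::string::size_type dot = leafname.rfind('.');
    if (dot != std::string::npos) {
        auto it = mime_map_for_dirs.find(leafname.substr(dot + 1));
        if (it != mime_map_for_dirs.end()) {
            // The extension maps to a MIME type valid for directories,
            // so index the whole directory as a single document.
            index_as_document(path, it->second);
            return;
        }
    }
    // No directory MIME type matched, so recurse as usual.
    recurse_into(path);
}
}}}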

Change History (7)

comment:1 by Vaibhav Kansagara, 6 years ago

Cc: vaibhavkansagara249@… added
Owner: changed from Olly Betts to Vaibhav Kansagara
Status: new → assigned

comment:2 by Vaibhav Kansagara, 6 years ago

Cc: vaibhavkansagara249@… removed
Owner: changed from Vaibhav Kansagara to Olly Betts
Status: assigned → new

comment:3 by tstomar, 6 years ago

Owner: changed from Olly Betts to tstomar
Status: new → assigned

comment:4 by Olly Betts, 20 months ago

Owner: changed from tstomar to Olly Betts
Status: assigned → new

No activity for four years, and it'd be good to sort this out, so I'm taking on this ticket.

comment:5 by Olly Betts, 20 months ago

I have something that basically works.

Not sure what's best to do about the checksum we store to support collapsing duplicates - it seems we'd have to iterate the directory recursively in a sorted order (which is more awkward to do) and checksum across all the files, or something like that. For now I think I'll leave the checksum blank for directory documents, which means duplicates won't get collapsed.
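
A rough sketch of that sorted-order approach, assuming C++17 std::filesystem (omindex actually uses MD5; a simple FNV-1a hash stands in here so the example is self-contained):

{{{#!c++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <vector>

namespace fs = std::filesystem;

static void fnv1a(std::uint64_t& h, const char* p, std::size_t len) {
    while (len--) {
        h ^= static_cast<unsigned char>(*p++);
        h *= 1099511628211ULL;
    }
}

std::uint64_t checksum_directory(const fs::path& dir) {
    // Collect the regular files under dir, then sort so the result
    // doesn't depend on the order the filesystem returns entries in.
    std::vector<fs::path> files;
    for (const auto& entry : fs::recursive_directory_iterator(dir)) {
        if (entry.is_regular_file()) files.push_back(entry.path());
    }
    std::sort(files.begin(), files.end());

    std::uint64_t h = 14695981039346656037ULL;
    char buf[65536];
    for (const auto& f : files) {
        std::ifstream in(f, std::ios::binary);
        // The || in.gcount() handles the final partial read.
        while (in.read(buf, sizeof buf) || in.gcount())
            fnv1a(h, buf, static_cast<std::size_t>(in.gcount()));
    }
    return h;
}
}}}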

Other issues:

  • The size we currently store is what stat() reports for the directory - that's somewhat filesystem-dependent, and tends to vary with the number of entries in the directory, which isn't really a useful number for our purposes. If we iterated the contents we could sum the sizes of the files to get a better number.
  • The mtime and ctime we store are for the directory itself, which means that modifications to a directory document may not always be correctly detected. It depends how programs which save them do it - if they always create a new directory with a temporary name and then, once saved, delete the old one and rename, we'll be fine. If we iterated the contents we could find the newest mtime and newest ctime from among the files inside (see the sketch below). For ctime, we should also include the directory itself if we take the user and group from it (as we currently do, and as seems reasonable).
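
A rough sketch of deriving all three values from the contents in one pass, assuming POSIX stat() (std::filesystem has no portable ctime); the names are illustrative, not omindex's internals:

{{{#!c++
#include <sys/stat.h>

#include <algorithm>
#include <filesystem>

namespace fs = std::filesystem;

struct DirDocStats {
    off_t total_size = 0;
    time_t newest_mtime = 0;
    time_t newest_ctime = 0;
};

DirDocStats stat_directory_document(const fs::path& dir) {
    DirDocStats s;
    struct stat st;
    // Include the directory itself for ctime, since the user and group
    // stored for the document are taken from it.
    if (stat(dir.c_str(), &st) == 0) s.newest_ctime = st.st_ctime;
    for (const auto& entry : fs::recursive_directory_iterator(dir)) {
        if (!entry.is_regular_file()) continue;
        if (stat(entry.path().c_str(), &st) != 0) continue;
        s.total_size += st.st_size;  // sum file sizes, not directory size
        s.newest_mtime = std::max(s.newest_mtime, st.st_mtime);
        s.newest_ctime = std::max(s.newest_ctime, st.st_ctime);
    }
    return s;
}
}}}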

comment:6 by Olly Betts, 20 months ago

I thought of a trick to avoid needing to sort the directory entries when computing the hash: compute the hash of each file separately and combine the hashes with an associative, commutative operator such as modular addition, which gives a final answer that doesn't depend on the order of processing. If we go this route we should probably consult someone who knows more about cryptography to check this isn't a terrible idea. The security of these hashes isn't a huge concern (or we'd have stopped using MD5 here long ago), but we want to avoid weakening the hash in a way which makes chance collisions more likely.
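
A minimal sketch of that combining idea, using addition modulo 2^64 (unsigned overflow wraps); hash_file() is a hypothetical stand-in for the per-file digest:

{{{#!c++
#include <cstdint>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical per-file digest (e.g. a truncated MD5).
std::uint64_t hash_file(const fs::path& file);

std::uint64_t checksum_directory_unordered(const fs::path& dir) {
    std::uint64_t combined = 0;
    for (const auto& entry : fs::recursive_directory_iterator(dir)) {
        // Addition is associative and commutative, so the result is
        // independent of the order entries are visited in.
        if (entry.is_regular_file())
            combined += hash_file(entry.path());  // wraps mod 2^64
    }
    return combined;
}
}}}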

comment:7 by Olly Betts, 12 months ago

Milestone: 1.4.x → 2.0.0

Postponing. Once done this can probably be backported to the current stable series.
