Opened 6 years ago
Last modified 12 months ago
#771 new enhancement
omindex: Handle "directory documents"
Reported by: | Olly Betts | Owned by: | Olly Betts
---|---|---|---
Priority: | normal | Milestone: | 2.0.0
Component: | Omega | Version: |
Severity: | normal | Keywords: | GoodFirstBug
Cc: | | Blocked By: |
Blocking: | | Operating System: | All
Description
Documents from Apple iWork (Keynote, Pages, Numbers) can be either a single file or a directory of files. (I think the single-file variant is actually a zip container holding the same files as the directory variant.)
As of 1.4.8, omindex can handle the file variant, but it currently doesn't consider that a directory could be a document.
We'd need a way to indicate that a file-extension-to-mimetype mapping is for a directory, and then check directories whose leafname has an extension against the list of mimetype mappings that are valid for directories. If we get a mimetype, then we should handle that path as a document to index rather than recursing into it.
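A rough sketch of what that check could look like, assuming a hypothetical dir_mime_map table and helper function (in the real code this would more likely be a flag on the existing mime_map entries, and the iWork MIME types shown are the commonly used ones rather than necessarily what Omega's mimemap uses):

```c++
#include <map>
#include <string>

// Hypothetical table of extension -> MIME type mappings which are only
// valid for directories (Apple iWork bundles, for example).
static const std::map<std::string, std::string> dir_mime_map = {
    { "key",     "application/vnd.apple.keynote" },
    { "pages",   "application/vnd.apple.pages" },
    { "numbers", "application/vnd.apple.numbers" },
};

// If `leafname` (a directory) has an extension with a directory-valid
// MIME type mapping, return that MIME type; otherwise return an empty
// string, in which case the caller recurses into the directory as before.
static std::string
directory_document_mimetype(const std::string& leafname)
{
    std::string::size_type dot = leafname.rfind('.');
    if (dot == std::string::npos || dot == 0) return std::string();
    std::string ext = leafname.substr(dot + 1);
    auto it = dir_mime_map.find(ext);
    return it == dir_mime_map.end() ? std::string() : it->second;
}
```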
Change History (7)
comment:1 by , 6 years ago
Cc: added
Owner: changed
Status: new → assigned
comment:2 by , 6 years ago
Cc: removed
Owner: changed
Status: assigned → new
comment:3 by , 6 years ago
Owner: changed
Status: new → assigned
comment:4 by , 20 months ago
Owner: changed
Status: assigned → new
comment:5 by , 20 months ago
I have something that basically works.
I'm not sure what's best to do about the checksum we store to support collapsing duplicates - it seems we'd have to iterate the directory recursively in a sorted order (which is more awkward to do) and checksum across all the files, or something like that. For now I think I'm going to leave the checksum blank for directory documents, which means duplicates won't get collapsed.
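For reference, the sorted-order approach would look something like the sketch below. FNV-1a here is just a stand-in for the MD5 we actually store, and the function name is made up; the point is only that sorting the paths makes the result independent of readdir() order.

```c++
#include <algorithm>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <vector>

// Sketch of the "sorted recursive iteration" approach to checksumming a
// directory document.
static std::uint64_t
checksum_directory_document(const std::filesystem::path& dir)
{
    std::vector<std::filesystem::path> files;
    for (const auto& entry :
	 std::filesystem::recursive_directory_iterator(dir)) {
	if (entry.is_regular_file()) files.push_back(entry.path());
    }
    // Sort by path so the checksum doesn't depend on readdir() order.
    std::sort(files.begin(), files.end());

    std::uint64_t hash = 14695981039346656037ull; // FNV-1a offset basis
    auto mix = [&hash](unsigned char byte) {
	hash ^= byte;
	hash *= 1099511628211ull; // FNV-1a prime
    };
    for (const auto& file : files) {
	// Include the relative path so renaming a member file changes
	// the checksum too.
	for (unsigned char c : file.lexically_relative(dir).generic_string())
	    mix(c);
	std::ifstream in(file, std::ios::binary);
	char c;
	while (in.get(c)) mix(static_cast<unsigned char>(c));
    }
    return hash;
}
```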
Other issues:
- The size we currently store is what stat() reports for the directory - that's somewhat FS dependent, but tends to vary with the number of entries in the directory, which isn't really a useful number for our purposes. If we iterated the contents we could sum the sizes of the files to get a better number.
- The mtime and ctime we store are for the directory, which means that modifications to a directory document may not always be correctly detected. It depends on how the programs which save them do it - if they always create a new directory under a temporary name and then, once it's saved, delete the old one and rename, we'll be fine. If we iterated the contents we could find the newest mtime and newest ctime from among the files inside (see the sketch after this list). For ctime we should also include the directory itself if we take the user and group from it (as we currently do, and as seems reasonable).
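A possible shape for gathering both of those by iterating the contents - this is only a sketch with a made-up DirDocStats struct, POSIX-only and with error handling omitted; omindex would presumably reuse its existing directory traversal and stat handling rather than std::filesystem:

```c++
#include <sys/stat.h>

#include <filesystem>

struct DirDocStats {
    off_t total_size = 0;
    time_t newest_mtime = 0;
    time_t newest_ctime = 0;
};

// Sum the sizes of the files inside a directory document and find the
// newest mtime/ctime, instead of stat()ing just the directory itself.
static DirDocStats
stat_directory_document(const std::filesystem::path& dir)
{
    DirDocStats stats;
    struct stat st;
    // Include the directory itself for ctime, since the user and group
    // are taken from it.
    if (stat(dir.c_str(), &st) == 0) stats.newest_ctime = st.st_ctime;
    for (const auto& entry :
	 std::filesystem::recursive_directory_iterator(dir)) {
	if (!entry.is_regular_file()) continue;
	if (stat(entry.path().c_str(), &st) != 0) continue;
	stats.total_size += st.st_size;
	if (st.st_mtime > stats.newest_mtime) stats.newest_mtime = st.st_mtime;
	if (st.st_ctime > stats.newest_ctime) stats.newest_ctime = st.st_ctime;
    }
    return stats;
}
```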
comment:6 by , 20 months ago
I thought of a trick to avoid needing to sort the directory entries when hashing - compute the hash of each file separately and combine the hashes with an operator which is associative and commutative, such as addition modulo a fixed power of two, so the final answer doesn't depend on the order of processing. If we go this route we should probably consult someone who knows more about cryptography to check this isn't a terrible idea. The security of these hashes isn't a huge concern (or else we'd have stopped using MD5 here long ago), but we want to avoid weakening the hash in a way which makes chance collisions more likely.
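For example, the combining step could treat each 16-byte MD5 digest as a little-endian integer and add them modulo 2^128. This is only a sketch of the combining, with the per-file digests assumed to come from whatever hash we already use, and combine_digests is a made-up name:

```c++
#include <array>
#include <cstddef>
#include <vector>

using Digest = std::array<unsigned char, 16>; // e.g. an MD5 digest

// Combine per-file digests with addition modulo 2^128.  Addition is
// associative and commutative, so the result is the same whatever order
// the files were processed in - no need to sort the directory entries.
static Digest
combine_digests(const std::vector<Digest>& per_file)
{
    Digest total = {};
    for (const Digest& d : per_file) {
	unsigned carry = 0;
	for (std::size_t i = 0; i < total.size(); ++i) {
	    unsigned sum = total[i] + d[i] + carry;
	    total[i] = sum & 0xff;
	    carry = sum >> 8;
	}
	// Any carry out of the top byte is discarded (i.e. mod 2^128).
    }
    return total;
}
```

One refinement would be to include each file's relative name in its per-file hash, so that renaming files or moving content between them still changes the combined checksum.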
comment:7 by , 12 months ago
Milestone: 1.4.x → 2.0.0
Postponing. Once done this can probably be backported to the current stable series.
No activity for four years, and it'd be good to sort this out, so I'm taking on this ticket.