Opened 8 years ago
Last modified 10 days ago
#771 assigned enhancement
omindex: Handle "directory documents"
| Reported by: | Olly Betts | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 2.x |
| Component: | Omega | Version: | |
| Severity: | normal | Keywords: | |
| Cc: | Blocked By: | ||
| Blocking: | Operating System: | All |
Description
Documents from Apple iWork (Keynote, Pages, Numbers) can be either a single file or a directory of files. (I think the single-file variant is actually a zip container holding the same files as the directory variant.)
As of 1.4.8, omindex can handle the file variant, but it currently doesn't consider that a directory could itself be a document.
We'd need a way to indicate that a file-extension-to-mimetype mapping is for a directory, and then check directories whose leafname has an extension against the list of mimetype mappings that are valid for directories. If we get a mimetype, then we should handle that path as a document to index rather than recursing into it.
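To make the idea concrete, here's a minimal sketch of the lookup described above. All the names here (the map, the function, and the mimetype strings for illustration) are invented for this sketch and are not omindex's actual code:

```cpp
#include <map>
#include <string>

// Hypothetical: a separate extension-to-mimetype map whose entries are
// only valid for directories (as opposed to the regular file mapping).
static const std::map<std::string, std::string> dir_mime_map = {
    { "key",     "application/vnd.apple.keynote" },
    { "pages",   "application/vnd.apple.pages" },
    { "numbers", "application/vnd.apple.numbers" },
};

// Return the mapped mimetype for a directory leafname, or the empty
// string if the directory should just be recursed into as usual.
std::string mimetype_for_directory(const std::string& leafname) {
    auto dot = leafname.rfind('.');
    if (dot == std::string::npos) return std::string();
    auto it = dir_mime_map.find(leafname.substr(dot + 1));
    return it == dir_mime_map.end() ? std::string() : it->second;
}
```

The indexer's directory walk would then call something like this before recursing: a non-empty result means "index this path as a single document".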
Change History (9)
comment:1 by , 7 years ago
| Cc: | added |
|---|---|
| Owner: | changed from to |
| Status: | new → assigned |
comment:2 by , 7 years ago
| Cc: | removed |
|---|---|
| Owner: | changed from to |
| Status: | assigned → new |
comment:3 by , 7 years ago
| Owner: | changed from to |
|---|---|
| Status: | new → assigned |
comment:4 by , 3 years ago
| Owner: | changed from to |
|---|---|
| Status: | assigned → new |
comment:5 by , 3 years ago
I have something that basically works.
I'm not sure what's best to do about the checksum we store to support collapsing duplicates: it seems we'd have to iterate the directory recursively in a sorted order (which is more awkward to do) and checksum across all the files, or something like that. For now I think I'll leave the checksum blank for directory documents, which means duplicates won't get collapsed.
Other issues:
- The size we currently store is what stat() reports for the directory. That's somewhat FS-dependent, but it tends to vary with the number of entries in the directory, which isn't really a useful number for our purposes. If we iterated the contents we could sum the sizes of the files to get a better number.
- The mtime and ctime we store are for the directory itself, which means modifications to a directory document may not always be detected correctly. It depends how the programs which save them do it: if they always create a new directory under a temporary name, then once saved delete the old one and rename, we'll be fine. If we iterated the contents we could find the newest mtime and newest ctime among the files inside. For ctime, we should also include the directory itself if we take the user and group from it (as we currently do, and as seems reasonable).
comment:6 by , 3 years ago
I thought of a trick to avoid needing to sort the directory entries for the hash: compute the hash of each file separately and combine the hashes with an operator which is associative and commutative, such as addition modulo 2^n, which gives a final answer that doesn't depend on the order of processing. If we go this route we should probably consult someone who knows more about cryptography to see if this seems a terrible idea. The security of these hashes isn't a huge concern (or else we'd have stopped using md5 here long ago) but we want to avoid weakening the hash in a way which makes chance collisions more likely.
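A toy illustration of the combining trick. The per-file hash here is a stand-in FNV-1a (the real code would hash file contents with MD5 or similar); the point is only that unsigned addition, which wraps modulo 2^64, is associative and commutative, so the combined value is independent of the order in which the files are visited:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Placeholder per-file hash (FNV-1a over a string), standing in for a
// real content hash such as MD5.
std::uint64_t hash_one(const std::string& data) {
    std::uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Combine per-file hashes with addition modulo 2^64 (unsigned overflow
// wraps): the result doesn't depend on processing order, so no sorted
// traversal of the directory is needed.
std::uint64_t combine_hashes(const std::vector<std::string>& files) {
    std::uint64_t combined = 0;
    for (const auto& f : files) combined += hash_one(f);
    return combined;
}
```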
comment:7 by , 2 years ago
| Milestone: | 1.4.x → 2.0.0 |
|---|---|
Postponing. Once done this can probably be backported to the current stable series.
comment:9 by , 10 days ago
| Keywords: | GoodFirstBug removed |
|---|---|
| Milestone: | 3.0.0 → 2.x |
| Status: | new → assigned |
I've found my local branch, rebased it onto main, and pushed it to the repo as branch iwork-directory-docs.
Since this is mostly implemented, with just the more awkward aspects left to do, it's not really "GoodFirstBug" material now, so I've removed that keyword.
It's also suitable for 2.x so adjust milestone.
Seems the unresolved aspects are:
- Stored size
- Stored mtime and ctime
- Stored hash of contents
- Not mentioned above, but I just noticed we support a file with extension .apxl. The commit adding this says ".apxl is the extension used for the XML files inside .key bundles/directories which hold the text content of the presentation, and by handling them we can index .key directories more usefully. It seems they are also sometimes found by themselves." That suggests it's not useful to handle a directory with the .apxl extension, but also perhaps suggests we should investigate whether these are found by themselves, and whether they can be opened by themselves (quick tests with loimpress suggest not).

No activity for four years, and it'd be good to sort this out, so I'm taking on this ticket.