Opened 16 years ago
Last modified 2 years ago
#302 new enhancement
Option for scriptindex to preserve non-specified fields in existing records (nondestructive update)
Reported by: | Rod McFarland | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Omega | Version: | |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | #429 |
Blocking: | | Operating System: | All |
Description
I don't know if this is feasible, but it would be very useful in my use case. If there were some form of option (keepoldifnull?) in the .script file to indicate that a field should be preserved as is when re-indexing a record with the same unique reference, I wouldn't have to re-parse a large collection of large PDFs; instead, I could just update the title, author, etc. fields per record.
Currently using 1.0.8
Change History (4)
comment:1 by , 2 years ago
comment:2 by , 2 years ago
Blocked By: | 429 added |
---|---|
comment:3 by , 2 years ago
Ha, oh wow... funny that you found this, I'm still working at the same place (but much older) and just lately Xapian has become a front-burner item because of server updates. This isn't related to the ticket, but we're targeting Ubuntu 22, which (out of the box) only has PHP 8. I've done a bit of research and it looks like Xapian is blocked on SWIG supporting PHP 8. That's all out of my depth, but I was wondering if there was any timeline (preferably before April 2023, which is when Ubuntu 18 goes out of support).
For the purposes of this ticket, the project that motivated it is long gone but others might want such a feature. We are still using Xapian for a different project, an email archive (which doesn't have the same churn as the PDF indexer did).
comment:4 by , 2 years ago
I'm actually currently working on that, and have now updated the status in the relevant ticket.
I just came across this ticket while trying to tidy up the bug list. I'm not sure why it never got responded to before, but sorry about that.
It's kind of feasible to do, but it wouldn't be all that efficient as things stand.
The problem with only updating some fields is that it would require removing terms from an existing document selectively, and behind the scenes that reads in the existing term list for the document into a structure in memory, then iterates to find all terms which don't have a prefix matching a keepoldifnull field and removes those terms, then adds in the new terms.
To be worthwhile I think this would really need #429 doing first.
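For anyone picking this up, here is a rough sketch of the selective update discussed in this ticket: read the existing term list into memory, drop every term that doesn't belong to a preserved field, then add the new terms. It is plain Python that models a term list as a dict of term to wdf and does not call Xapian or scriptindex at all. The function name, the `XB` body prefix and the sample terms are made up for illustration, and prefix matching is simplified to a plain string-prefix test.

```python
def nondestructive_update(existing_terms, new_terms, keep_prefixes):
    """Model of a selective re-index (hypothetical helper, not a Xapian API).

    existing_terms, new_terms: dict mapping term -> within-document frequency.
    keep_prefixes: term prefixes whose existing terms should be preserved.
    """
    keep_prefixes = tuple(keep_prefixes)

    # Step 1: read the whole existing term list into an in-memory structure.
    updated = dict(existing_terms)

    # Step 2: remove every term whose prefix does not match a preserved field.
    # Simplification: a plain startswith() test stands in for real prefix rules.
    for term in list(updated):
        if not term.startswith(keep_prefixes):
            del updated[term]

    # Step 3: add the terms generated from the newly supplied fields.
    for term, wdf in new_terms.items():
        updated[term] = updated.get(term, 0) + wdf

    return updated


# Assumed example: "XB" marks terms from the expensive-to-parse PDF body,
# "S" marks title terms and "Q" is the unique reference term.
existing = {"Q/docs/a.pdf": 1, "Sold": 1, "Stitle": 1, "XBpdf": 3, "XBbody": 2}
new = {"Q/docs/a.pdf": 1, "Snew": 1, "Stitle": 1}

result = nondestructive_update(existing, new, keep_prefixes=("XB",))
print(result)
```

Note that step 2 walks every term in the document, including the ones being kept, so the cost scales with the size of the whole document rather than with the size of the fields being changed. That is the inefficiency mentioned above.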