Opened 16 years ago

Last modified 20 months ago

#302 new enhancement

Option for scriptindex to preserve non-specified fields in existing records (nondestructive update)

Reported by: Rod McFarland
Owned by: Olly Betts
Priority: normal
Milestone:
Component: Omega
Version:
Severity: normal
Keywords:
Cc:
Blocked By: #429
Blocking:
Operating System: All

Description

I don't know if this is feasible, but it would be very useful in my use case. If the .script file had some form of option (keepoldifnull?) to indicate that a field should be preserved as is when re-indexing a record with the same unique reference, I wouldn't have to re-parse a large collection of large PDFs. Instead, I could just update the title, author, etc. fields per record.

Currently using 1.0.8
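
To make the request concrete, here is a minimal Python sketch of the field-level behaviour being asked for. It is not scriptindex code and does not use the Xapian API: records are modelled as plain dicts, and the names `merge_fields` and `keep_if_null` are hypothetical, chosen only to mirror the suggested keepoldifnull option.

```python
def merge_fields(existing, incoming, keep_if_null):
    """Return the record that would be indexed under the requested option.

    existing     -- fields of the record already stored for this unique id
    incoming     -- fields supplied by the new input (may omit or blank some)
    keep_if_null -- names of fields to preserve when the new input has no value
    """
    # Start from the new input: by default, re-indexing replaces the record.
    merged = dict(incoming)
    for field in keep_if_null:
        value = incoming.get(field)
        # Treat a missing or empty field as "null" and fall back to the old value.
        if (value is None or value == "") and field in existing:
            merged[field] = existing[field]
    return merged


existing = {
    "id": "doc42",
    "title": "Old title",
    "author": "A. Writer",
    "body": "long extracted PDF text",
}
incoming = {"id": "doc42", "title": "New title", "author": ""}

merged = merge_fields(existing, incoming, keep_if_null={"author", "body"})
print(merged)
```

In this sketch the expensive `body` field never has to be regenerated from the PDF; only the fields present in the new input change.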

Change History (4)

comment:1 by Olly Betts, 20 months ago

I just came across this ticket while trying to tidy up the bug list. I'm not sure why it never got responded to before, but sorry about that.

It's kind of feasible to do, but it wouldn't be all that efficient as things stand.

The problem with updating only some fields is that it requires selectively removing terms from an existing document. Behind the scenes, that reads the document's existing term list into a structure in memory, iterates over it to find all terms whose prefix doesn't match a keepoldifnull field, removes those terms, and then adds the new terms.

To be worthwhile I think this would really need #429 doing first.
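
The term-level steps described in this comment can be sketched roughly as follows. This is an illustration in plain Python, not Xapian or scriptindex code: the term list is modelled as a list of strings, the function name `selective_update` is hypothetical, and the prefixes shown (`Q`, `S`, `XBODY`) are only example values. Matching with a simple `startswith` is also a simplification; a real implementation would need to take care with prefixes that are themselves prefixes of other prefixes, and with unprefixed terms.

```python
def selective_update(existing_terms, new_terms, kept_prefixes):
    """Mimic a nondestructive update of one document's term list.

    existing_terms -- terms currently indexed for the document
    new_terms      -- terms generated from the fields in the new input
    kept_prefixes  -- term prefixes belonging to fields that must be preserved
    """
    kept = tuple(kept_prefixes)

    # Step 1: read the whole existing term list into memory.
    terms = set(existing_terms)

    # Step 2: walk it and drop every term that is not under a kept prefix.
    for term in list(terms):
        if not term.startswith(kept):
            terms.discard(term)

    # Step 3: add the terms for the fields that were re-indexed.
    terms.update(new_terms)
    return sorted(terms)


existing = ["Qdoc42", "Sold", "Stitle", "XBODYlarge", "XBODYpdf"]
new = ["Qdoc42", "Snew", "Stitle"]

result = selective_update(existing, new, ["XBODY"])
print(result)
```

The cost concern is visible even here: every term of the document is loaded and inspected, including the large set of body terms that ends up unchanged.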

comment:2 by Olly Betts, 20 months ago

Blocked By: 429 added

comment:3 by Rod McFarland, 20 months ago

Ha, oh wow... funny that you found this, I'm still working at the same place (but much older) and just lately Xapian has become a front-burner item because of server updates. This isn't related to the ticket, but we're targeting Ubuntu 22, which (out of the box) only has PHP 8. I've done a bit of research and it looks like Xapian is blocked on SWIG supporting PHP 8. That's all out of my depth, but I was wondering if there was any timeline (preferably before April 2023, which is when Ubuntu 18 goes out of support).

For the purposes of this ticket, the project that motivated it is long gone but others might want such a feature. We are still using Xapian for a different project, an email archive (which doesn't have the same churn as the PDF indexer did).

comment:4 by Olly Betts, 20 months ago

I'm actually currently working on that, and have now updated the status in the relevant ticket.
