Opened 11 years ago

Last modified 5 months ago

#618 assigned enhancement

Omega: Improved indexing of leafname (intelligent split into several words)

Reported by: peterpan
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: Omega
Version: 1.2.14
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by peterpan)

Reference: http://article.gmane.org/gmane.comp.search.xapian.general/9561

Omega indexes file names. The file name seems to be indexed as several words if the name contains space characters or hyphens.

In my NAS share I often separate words in the file name using "-", "_", or even a capital letter at the beginning of each word (I guess this is also the case for many other users):

Examples:

"this_is_a_file.txt"

"thisIsAFile.txt"

In those cases, I noticed that Omega does not index the individual words, but only the full basename as a single word.

Therefore, omega should index each respective word (i.e. "this" "is" "a" "file") in addition to the full basename (i.e. "this_is_a_file"), in order to ease the search.
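
To make the request concrete, here is a minimal sketch of the terms I would expect for a leafname. It is only an illustration and not Omega's actual code: the function name `leafname_terms` and the exact separator set (whitespace, "-" and "_") are assumptions taken from the examples above.

```python
import os
import re


def leafname_terms(leafname):
    """Return the full basename followed by its individual words.

    Hypothetical helper for illustration only; it does not mirror
    how Omega generates terms internally.
    """
    # Drop the extension: "this_is_a_file.txt" -> "this_is_a_file".
    basename, _ext = os.path.splitext(leafname)
    # Split on runs of whitespace, underscores, or hyphens.
    words = [w for w in re.split(r"[\s_-]+", basename) if w]
    # Keep the full basename first, then add each word once.
    terms = [basename]
    for word in words:
        if word not in terms:
            terms.append(word)
    return terms


print(leafname_terms("this_is_a_file.txt"))
```

With this sketch "this_is_a_file.txt" yields the full basename plus "this", "is", "a" and "file", while "thisIsAFile.txt" still yields only "thisIsAFile", because capital-letter boundaries are not handled here.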

Change History (7)

comment:1 by peterpan, 11 years ago

Description: modified (diff)

comment:2 by peterpan, 11 years ago

Type: defect → enhancement

comment:3 by Olly Betts, 8 years ago

Milestone: 1.4.1
Status: new → assigned

_ (and also &) are handled as of [e66f0f0598a4a54243964fd4a7feca8080066b19] on git master. Marking for 1.4.1.

I've not attempted to handle camel-case yet. It seems some subtlety is needed there - e.g. "README.txt" shouldn't get indexed as "R E A D M E".

comment:4 by Olly Betts, 8 years ago

Backported to RELEASE/1.4 branch in [50b1129bb024b7995584d820335fa1535f09aa15].

comment:5 by Olly Betts, 8 years ago

Milestone: 1.4.1 → 1.4.x

Throwing the rest back into the 1.4.x pot.

comment:6 by Olly Betts, 13 months ago

Milestone: 1.4.x → 1.5.0

comment:7 by Olly Betts, 5 months ago

Milestone: 1.5.0 → 2.0.0

We need an algorithm that handles camel-case suitably, without doing stupid things to other cases.

Perhaps "word-split before an upper case character if it's followed by either a lower case character, or by another upper case character and then a lower case character", so:

  • thisIsAFile -> this Is A File
  • AndThis -> And This
  • README -> README
  • nothandled -> nothandled
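
The rule quoted above can be sketched as follows. This is only an illustration of the proposal as worded, not code from Omega; the function name `split_camel_case` is hypothetical, and never splitting before the first character is an assumption made so that "AndThis" does not produce an empty leading word.

```python
def split_camel_case(word):
    """Split `word` before an upper case character that is followed by
    a lower case character, or by an upper case character and then a
    lower case character.

    Hypothetical helper illustrating the proposed rule only.
    """
    pieces = []
    start = 0
    n = len(word)
    # Start at 1: assume we never split before the first character.
    for i in range(1, n):
        if not word[i].isupper():
            continue
        # Case 1: upper case followed by lower case, e.g. the "I" in "thisIs".
        next_is_lower = i + 1 < n and word[i + 1].islower()
        # Case 2: upper, upper, then lower, e.g. the "A" in "AFile".
        upper_then_lower = (
            i + 2 < n and word[i + 1].isupper() and word[i + 2].islower()
        )
        if next_is_lower or upper_then_lower:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


for name in ("thisIsAFile", "AndThis", "README", "nothandled"):
    print(name, "->", " ".join(split_camel_case(name)))
```

Run on the four examples in the list, this sketch reproduces the splits shown there.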

This would be reasonable to backport to a stable release series (especially early in the series) so not a blocker.
