Opened 13 years ago

Closed 13 years ago

Last modified 13 years ago

#552 closed task (fixed)

omindex extracts wrong extension

Reported by: Ditha
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.4
Component: Omega
Version: 1.2.6
Severity: minor
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by James Aylett)

If you index a directory structure like "/data/.../0/118/blog.laukien.com/software/admen" with "omindex --follow --preserve-nonduplicates --stemmer=german -M:text/html --db /data/INDEX /data/QUELLE", the indexer thinks ".com/software..." is an extension when the file being indexed has no extension of its own. Everything after the last dot is treated as the extension...

If you change the source of omindex.cc to

const char * dot_ptr = strrchr(d.leafname(), '.');
const char * dot_slash = strrchr(d.leafname(), '/');
 
if (dot_ptr && dot_slash && dot_ptr > dot_slash)

the extension will be interpreted correctly. ...I think. ;-)

Change History (4)

comment:1 by James Aylett, 13 years ago

Description: modified (diff)

It should probably be slash_ptr not dot_slash. Also, I think the conditional needs to be:

if (dot_ptr && dot_ptr > dot_slash)

since if you're indexing relative, "wibble.html" has no slash at all but still needs to be interpreted as having an extension of ".html".
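To make the two cases concrete, here is a minimal sketch of the check being discussed. It is not the actual omindex.cc code: extension_of() is a hypothetical helper, and it takes a plain C string rather than whatever d.leafname() returns. It looks for the last dot only after the last slash, so it never needs to compare dot_ptr against a slash_ptr that may be null.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper (not from omindex.cc): return the extension of the
// final path component without the leading dot, or "" if there is none.
std::string extension_of(const char* path) {
    // Last slash in the path, or a null pointer for a bare relative name.
    const char* slash_ptr = std::strrchr(path, '/');

    // The final component starts just past the last slash, or at the
    // beginning of the string when there is no slash at all.
    const char* leaf = slash_ptr ? slash_ptr + 1 : path;

    // Only dots inside the final component can start an extension.
    const char* dot_ptr = std::strrchr(leaf, '.');
    if (!dot_ptr) return std::string();
    return std::string(dot_ptr + 1);
}
```

With this sketch, "wibble.html" yields "html", while "0/118/blog.laukien.com/software/admen" yields an empty extension because the only dots are in a directory name.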

comment:2 by Olly Betts, 13 years ago

Resolution: fixed
Status: new → closed
Version: 1.2.6 → 1.2.4

Thanks for your report, but this bug isn't actually present in 1.2.6 - d.leafname() returns the leafname of the file, so in the situation you describe, d.leafname() will return "admen" and the extension is empty.

Testing (on the tip of browser:branches/1.2, but nothing relevant has changed there since 1.2.6):

mkdir -p 0/118/blog.laukien.com/software
echo testing > 0/118/blog.laukien.com/software/admen
./omindex --follow --preserve-nonduplicates --stemmer=german -M:text/html --db INDEX 0
../../xapian-core/examples/delve INDEX -r1

The output from delve is:

Term List for record #1: D20110628 E I* M201106 Oolly P/ Ttext/html U/118/blog.laukien.com/software/admen Y2011 Zadm Ztesting admen testing

Note there's an "E" term, not "Ecom/software/admen" which there would be if this bug were present.
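The difference between the two terms can be reproduced with a small sketch. All three function names are hypothetical, and building the term as "E" plus the raw extension is an assumption based only on the two terms quoted above, not on the omindex source.

```cpp
#include <string>

// Hypothetical: everything after the last dot in `name`, or "" if no dot.
std::string after_last_dot(const std::string& name) {
    std::string::size_type dot = name.rfind('.');
    return dot == std::string::npos ? std::string() : name.substr(dot + 1);
}

// Hypothetical: the final path component (the "leafname").
std::string leaf_of(const std::string& path) {
    std::string::size_type slash = path.rfind('/');
    return slash == std::string::npos ? path : path.substr(slash + 1);
}

// Assumed term layout: the prefix "E" followed by the extension.
std::string extension_term(const std::string& name) {
    return "E" + after_last_dot(name);
}
```

Calling extension_term() on the whole path "0/118/blog.laukien.com/software/admen" gives "Ecom/software/admen", the symptom of the old bug, whereas extension_term(leaf_of(path)) gives the bare "E" seen in the delve output.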

We did use to get this wrong, but it was fixed last year in r15181, and the fix was released in 1.2.4. Did you perhaps misreport the version you were using?

comment:3 by Olly Betts, 13 years ago

Milestone: 1.2.4
Version: 1.2.4 → 1.2.6

Oops, meant to set milestone not version.

comment:4 by Ditha, 13 years ago

I use xapian-omega 1.2.5 - This should be the problem. Thx!
