Opened 17 years ago

Closed 17 years ago

#188 closed defect (worksforme)

omindex throws "unknown exception" when indexing a website download with wget

Reported by: Haidut
Owned by: Richard Boulton
Priority: normal
Milestone:
Component: Omega
Version: 1.0.2
Severity: major
Keywords:
Cc: Olly Betts
Blocked By:
Blocking:
Operating System: NetBSD

Description

I downloaded a web site of about 26,000 pages for offline use and tried running omindex on it. After indexing about 4,000 pages without problems, omindex crashes with the following message:

Indexing "/yyyy.com/yyyy/07jun.html" as text/html ... Caught unknown exception in index_directory, rethrowing
Caught unknown exception in index_directory, rethrowing
Caught unknown exception

I tried deleting the "problematic" file 07jun.html but then it crashed again on another file while indexing things in the same folder. After deleting several more files in that folder, it is obvious that the problem is not with the files but some other issue. The database size at the time of crash is 94MB and the total size of all downloaded files is 1.8GB, so nothing looks out of the ordinary.

Xapian and Omega were compiled from source using the generic compile options that the "configure" script created. Nothing fancy has been added that may crash the indexer. The system is a NetBSD 3.1 stable release default install. Again nothing fancy was done to the base system to optimize it in any way.

Change History (3)

comment:1 by Richard Boulton, 17 years ago

Status: new → assigned

If you re-run the indexer, does it always fail at the same point? Is there any particular file which causes the problem: for example, the 07jun.html file you mention (ie, if you run the indexer on a single directory containing a copy of that file, does it fail?)

Are there any odd settings on the file permissions (eg, no read access)?
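One quick way to check the permissions question is to list every file in the downloaded tree that the indexing user cannot read. The following is a minimal sketch in portable shell; `find_unreadable` is a hypothetical helper name (not part of Omega), and the example path is a placeholder.

```shell
# List regular files under a directory that the current user cannot read.
# find_unreadable is a hypothetical helper, not something Omega provides.
find_unreadable() {
    find "$1" -type f -print | while IFS= read -r file; do
        # [ -r ] tests read access for the user running the script
        [ -r "$file" ] || printf '%s\n' "$file"
    done
}

# Example (placeholder path):
# find_unreadable /path/to/site
```

Run it as the same user that runs omindex. When run as root it will normally report nothing, since root can read files regardless of their mode bits.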

I downloaded the file here, and can't reproduce your problem, but as you say, it may not be an issue with an individual file. One step which would give us more information would be to run omindex under gdb, with a break point set to catch "throw"s - this should let us see what the exception is. (You can set such a break point by using the gdb command "catch throw" before running the program under gdb.)
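As a rough sketch of the gdb step described above: the wrapper below starts a command under gdb with the "catch throw" catchpoint already set, then runs it. `run_under_gdb` is a hypothetical helper name, the `GDB` variable is an assumption added so a different gdb binary can be substituted, and the database and site paths in the example are placeholders. It assumes a gdb that supports the `-ex` and `--args` command-line options.

```shell
# Start a program under gdb, stopping whenever a C++ exception is thrown.
# run_under_gdb is a hypothetical helper; GDB may name an alternative gdb binary.
run_under_gdb() {
    "${GDB:-gdb}" -ex 'catch throw' -ex 'run' --args "$@"
}

# Example (placeholder paths):
# run_under_gdb omindex --db /path/to/db --url / /path/to/site
#
# When gdb stops at a throw, at the (gdb) prompt:
#   bt         # show the backtrace for the throw
#   continue   # move on to the next throw if this one is not the failure
```

The first throw that gdb stops at may not be the one behind the "unknown exception" message, so it can take a few rounds of `bt` and `continue` to reach the throw immediately preceding the crash.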

Alternatively, you could try compiling a debugging version of xapian (see the details in xapian-core/HACKING - you want the --enable-log option) and run with XAPIAN_DEBUG_FLAGS set to display the progress of xapian - this will produce a very large amount of debug output, though.

comment:2 by Olly Betts, 17 years ago

Cc: olly@… added

Trying to index 5000 copies of 07jun.html works fine too.

Haidut: You're going to need to provide some way for us to reproduce this, or else do some detective work with gdb as Richard suggests...

comment:3 by Olly Betts, 17 years ago

Operating System: NetBSD
Resolution: worksforme
Status: assigned → closed

It'll be two months tomorrow without a response from the original reporter, so I'm closing this as "WORKSFORME". If anyone is able to reproduce this and can supply more information, please reopen this bug and do so!
