{5} Assigned, Active Tickets by Owner (Full Description) (58 matches)

Lists assigned tickets, grouped by ticket owner. This report demonstrates the use of full-row display.

olly (6 matches)

Ticket Summary Component Milestone Type Created
Description
#3 Get multierrhandler1 working again Test Suite enhancement 2003-03-27

Redo machinery in InMemory backend to allow multierrhandler1 to work. Probably leave until user database backends are possible, then do it by subclassing InMemory...


#51 Nightly snapshots Website enhancement 2004-09-09

Get nightly snapshot builds set up again - essentially, take the current SVN snapshots and "bless" them if they "look good" according to the tinderbox. The problem is that we can't easily tie a row of green lights in tinderbox to a particular snapshot - switching to buildbot will help this I believe.


#53 Xapian::Fields Library API enhancement 2004-09-09

Implement a Xapian::Fields class to serialise/unserialise name=value pairs to/from Document data field.


#62 How to use the Tcl binding so cleanup works Xapian-bindings defect 2005-06-01

The current Tcl binding has problems with cleanup, sometimes the destructor does not get called and other nuisances.

I did some small experiments with the binding and found that the constructor gets called in some cases and not in others:

This works:

 xapian::WritableDatabase xapiandb testdir $::xapian::DB_CREATE_OR_OVERWRITE
 rename xapiandb ""

This does not seem to:

 xapian::WritableDatabase xapiandb testdir $::xapian::DB_CREATE_OR_OVERWRITE
 set db xapiandb
 $db -delete

Neither does this:

 set db [xapian::WritableDatabase xapiandb testdir $::xapian::DB_CREATE_OR_OVERWRITE]
 $db -delete

Or this:

 set db [xapian::WritableDatabase xapiandb testdir $::xapian::DB_CREATE_OR_OVERWRITE]
 rename $db ""

I'm not sure if it is a problem with the SWIG wrapping, but I think there are some subtle problems somewhere in there.

Michael


#40 Alternative approach to tracking free blocks in btrees Backend-Chert enhancement 2004-09-09

Use chains of free blocks rather than a bitmap - then we can store many old revisions more cheaply (just the space they actually need, not a whole bitmap for each one too). Then readers use fcntl locking on a single byte corresponding to the revision they're using (bytes off the end of the file can be locked, and shared locks on read-only files are ok). Then a writer would only delete old revisions for which it could obtain an exclusive lock (otherwise it would preserve them).

The Btree manager is generally written with multiple old revisions in mind, so this shouldn't be a huge project.


#63 Improve visibility annotations for the library Library API enhancement 2005-06-11

See http://gcc.gnu.org/wiki/Visibility

This should make the shared library much smaller, and a little faster!

URL: http://gcc.gnu.org/wiki/Visibility


richard (4 matches)

#48 RangePostList Library API enhancement 2004-09-09

Provide explicit support for range searches, such as "RangePostList" - combine a sequence of adjacent terms...


#50 SynonymPostList Library API 1.1.0 enhancement 2004-09-09

Add synonym postlists, which represent a set of postlists merged together such that each document that occurs in any of the sublists occurs in the synonym list. The term frequency should ideally be the number of documents that one or more of the terms occurs in, but that's too expensive to find, so we'll need to estimate. Need to be able to take underlying postlists which aren't necessarily just postlists for single terms too.
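
As a rough illustration of the idea above, the sketch below merges sorted docid lists into one synonym list and estimates the term frequency without counting the union. The independence-based estimate and the names merge_postlists and estimate_termfreq are assumptions made for this example, not Xapian's implementation.

```python
import heapq
from typing import Iterable, Iterator, List, Sequence


def merge_postlists(postlists: Sequence[Iterable[int]]) -> Iterator[int]:
    """Yield each docid that occurs in any sorted sublist exactly once."""
    last = None
    for docid in heapq.merge(*postlists):
        if docid != last:
            yield docid
            last = docid


def estimate_termfreq(termfreqs: List[int], doccount: int) -> float:
    """Estimate the size of the union, assuming the terms occur independently."""
    if doccount == 0:
        return 0.0
    p_absent = 1.0
    for tf in termfreqs:
        p_absent *= 1.0 - tf / doccount
    estimate = doccount * (1.0 - p_absent)
    # The true value lies between the largest sublist and min(sum, doccount).
    lower = max(termfreqs, default=0)
    upper = min(sum(termfreqs), doccount)
    return min(max(estimate, lower), upper)
```

Because merge_postlists accepts any sorted iterables, the sublists need not be postlists for single terms; they could themselves be the output of another merge.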


#58 Convert from tinderbox to buildbot Buildbot enhancement 2004-11-25

Tinderbox isn't cleanly configurable in the way we want, so I've had to hack it around a lot. Buildbot looks a much better bet, as it's designed to allow modification by subclassing.


#138 Tidy up output of epydoc when processing xapian python bindings Xapian-bindings 1.1.0 enhancement 2007-04-23

Now that the python bindings have documentation strings, I've tried running epydoc on the xapian module to generate an HTML format version of the documentation. This does a pretty good job (good enough that we should include it on the xapian website for each released version). However, there are several issues that could do with being tidied up, so I'll list them here.

1. epydoc seems to consider some of the methods (eg Document.termlist) which are added to the classes by "extra.i" as "private", and therefore doesn't display them by default. This should be changed so that they're visible by default.

2. Methods which aren't intended to be called externally should be hidden so that epydoc considers them "private". This could be done by renaming them. For example, Document.termlist_begin() shouldn't be public; renaming it to Document._termlist_begin() would make epydoc consider it to be private, and would prevent users calling it and not knowing how to use the returned iterator. In particular, this would reduce the likelihood of confusion between classes like MSetIter and MSetIterator, since it would be impossible to get an instance of MSetIterator without accessing a private method or attribute.

3. "epydoc xapian" reports several errors, due to markup in the documentation comments being invalid reStructuredText. This should be fixed - in many cases the fix will lie in doxy2swig.py, but in some cases the documentation comments in the C++ headers could do with fixing up.


olly (28 matches)

#46 zero byte cleanliness in C# and Java bindings Xapian-bindings 1.1.0 defect 2004-09-09

Check for zero byte cleanness wherever strings are used. There are a number of c_str()s in the code, but I believe all in the core library are harmless as of 2002-04-29. There may be other zero byte issues though. xapian-applications/dbtools also uses c_str() where it should probably use data() and length(). xapian-bindings hasn't been checked.


#158 Query::MatchNothing and Query::MatchAll aren't wrapped Xapian-bindings 1.1.0 defect 2007-05-26

The obvious patch for this (below) doesn't work - in Python, you get a property of xapian.Query() added, which means that you have to instantiate xapian.Query to get at MatchNothing (ie, xapian.Query().MatchNothing works, but xapian.Query.MatchNothing doesn't). It should be easy enough to fix this with a python specific workaround though.

PHP also doesn't work; I can't seem to access the resulting function at all, but this may be more due to my lack of PHP knowledge.

I've not tested for other languages yet.

Index: xapian.i
===================================================================
--- xapian.i (revision 8676)
+++ xapian.i (working copy)
@@ -871,6 +871,9 @@
     ~Query();

+    static Xapian::Query MatchAll;
+    static Xapian::Query MatchNothing;
+
     termcount get_length() const;
     TermIterator get_terms_begin() const;
     TermIterator get_terms_end() const;
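
The symptom described in this ticket, where the constant is reachable only through an instance, matches how a Python property behaves. The sketch below uses a hypothetical stand-in class rather than the real xapian module; it shows the symptom and one possible Python-specific workaround (assigning a plain class attribute after the class is created). Whether the generated wrapper can be patched this way is an assumption.

```python
class Query:
    """Hypothetical stand-in for a wrapped class; not the real xapian.Query."""

    def __init__(self, description: str = "") -> None:
        self.description = description

    @property
    def MatchNothing(self) -> "Query":
        # Exposed as a property, so it only resolves on an instance.
        return Query("<match nothing>")


# Symptom: class-level access returns the property object, not a Query.
symptom_is_property = isinstance(Query.MatchNothing, property)
# Instance-level access works, as the ticket reports.
via_instance = Query().MatchNothing

# Possible workaround: replace the property with a plain class attribute.
Query.MatchNothing = Query("<match nothing>")
via_class = Query.MatchNothing
```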


#170 Windows structured exceptions produce RuntimeError with some MSVC versions Xapian-bindings defect 2007-06-19

We are having an issue on a testing machine. We are running stress tests on it, and Xapian eventually raises a RuntimeError, "unknown Xapian error".

This is on Windows, using the Xapian Python bindings.

Richard mentioned that exception.i has a catch(...) that catches all the unknown exceptions.

He also mentioned that this might have something to do with Windows Structured Exceptions, and that Mark had investigated this previously so he thought it had been fixed.


#175 xapian-compact functionality should be available from the C++ API Library API defect 2007-06-26

The ability to merge and compact databases efficiently would be a useful addition to the C++ API (and the language bindings), so it would be good to move most of the implementation of xapian-compact into the core, and change xapian-compact to just be a simple interface to this.

The first step is probably to refactor xapian-compact, such that it's not mainly a single massive function: I've made a start on this, and the patch will be attached to this bug shortly.

I'm happy to work on this, and don't think it's very much work, but Olly says that there are a few outstanding issues he needs to fix in xapian-compact, so I'll leave this bug assigned to him until then.


#193 NumberValueRangeProcessor_apply not working in the PHP-bindings Xapian-bindings 1.1.0 defect 2007-08-20

The following returns an error, even though the right arguments have been passed:

 $vrp = new XapianNumberValueRangeProcessor(0, "\$", true);
 $vrp->apply((string)"240", (string)"500");

The error returned is:

 Fatal error: Type error in argument 2 of NumberValueRangeProcessor_apply. Expected SWIGTYPE_p_stdstring in xapian.php on line 1217

Line 1217 being:

 return NumberValueRangeProcessor_apply($this->_cPtr,$begin,$end);

Have tried changing that line to say (string)$begin and (string)$end, same result.


#195 Flint writable databases should take a parameter indicating flush threshold. Backend-Flint defect 2007-09-11

Possibly, this should be a global parameter (ie, applies to all databases), or maybe it should be a database specific parameter (ie, set as a parameter to the "open" method for flint writable databases).

In any case, the current way of setting a flush threshold (ie, setting an environment variable) is unsatisfactory, due to being difficult to set in some circumstances (or on some OSes), and it being easy for users to forget to export the variable, resulting in bogus bug reports. A parameter to a Xapian function would be a cleaner API for this. However, we intend to improve the handling of automatic flushes in future, such that the count of added documents won't be the crucial factor; instead, amount of memory used will be. We need to ensure we don't add a parameter to the API which will shortly become meaningless.
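
To make the trade-off concrete, here is a minimal sketch of how an explicit parameter could take precedence over the environment variable. The function name resolve_flush_threshold and the default of 10000 are assumptions for this example; only the XAPIAN_FLUSH_THRESHOLD variable name comes from the tickets in this report.

```python
import os
from typing import Mapping, Optional

DEFAULT_FLUSH_THRESHOLD = 10000  # assumed default for this sketch


def resolve_flush_threshold(
    explicit: Optional[int] = None,
    environ: Mapping[str, str] = os.environ,
) -> int:
    """Prefer an explicit parameter, then the environment, then the default."""
    if explicit is not None:
        if explicit <= 0:
            raise ValueError("flush threshold must be positive")
        return explicit
    raw = environ.get("XAPIAN_FLUSH_THRESHOLD", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset or malformed variable: fall back to the default.
        return DEFAULT_FLUSH_THRESHOLD
    return value if value > 0 else DEFAULT_FLUSH_THRESHOLD
```

With this ordering, a forgotten "export" only affects callers who did not pass the parameter, which is the failure mode the ticket complains about.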


#201 Attempting to create a NEAR search with two OR nodes gives assertion error Library API 1.1.0 defect 2007-09-21

I've observed this from python, but I expect it can occur from C++ too. The following python script gives an assertion error:

 import xapian
 a=xapian.Query('A')
 b=xapian.Query('B')
 c=xapian.Query(xapian.Query.OP_OR, a, b)
 d=xapian.Query(xapian.Query.OP_NEAR, c, c)

The error is:

 xapian.AssertionError: /home/richard/private/Working/xapian/xapian-core/api/omqueryinternal.cc:770: op == Xapian::Query::OP_NEAR || op == Xapian::Query::OP_PHRASE

This is because an attempt is made to flatten the query "(A OR B) NEAR (A OR B)", which isn't supported (I believe). It would be nice to fix this by supporting such searches, but meanwhile we shouldn't raise AssertionError from the API; a more explanatory exception should be returned instead.


#216 Inconsistent return values for percentage weights Matcher 1.0.9 defect 2007-11-27

When results are being sorted primarily by an order other than relevance (e.g. sort_by_value()), the percentage values returned by the MSet object may be incorrect because they are calculated based on the document in the portion of the MSet requested which has the highest weight, instead of the document matching the query which has the highest weight.

This issue has existed in all previous Xapian releases, as far as we can tell.

There is currently no fix in progress, since it is probably not possible to fix without significant loss of efficiency, which would adversely affect users who aren't interested in the percentage scores.

If you really need percentage scores in this situation, one workaround would be to first run the search using relevance order, asking for only the top document, and to remember the weight and percentage assigned to that document. Then, re-run the search in sorted order, and calculate the percentages yourself from the weights assigned to the results, using this information.
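
The second step of that workaround can be sketched as follows. It assumes the percentage is proportional to the weight, so the top document's weight and percentage fix the scale; the function name and the plain lists of weights are stand-ins for whatever the application reads from its result sets.

```python
from typing import List


def percentages_from_weights(
    weights: List[float], top_weight: float, top_percent: float
) -> List[float]:
    """Scale each weight by the best match's percentage-to-weight ratio."""
    if top_weight <= 0:
        return [0.0 for _ in weights]
    factor = top_percent / top_weight
    return [w * factor for w in weights]
```

Here top_weight and top_percent come from the first, relevance-ordered search, and weights come from the second, sorted search.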

A testcase demonstrating this is attached to this ticket.

The issue is that in multimatch.cc, we calculate "best" by looking for the highest weighted document in the candidate mset, but when sorting by anything other than relevance, the highest weighted document may have been discarded already.

It is hard to see how to fix this - one obvious approach would be to check every candidate document's weight before discarding it during the match process, and keep track of the docid of the document with the highest weight seen so far. However, we currently don't calculate the weight for all the documents we see (because we first check the document against the lowest document in the mset using mcmp), so this would force us to calculate the weights on documents we wouldn't otherwise need to calculate it for. Since the percentages aren't necessarily even wanted, this seems a shame.

Perhaps a reasonable approach would be to add a flag on enquire which governed whether percentages were wanted or not; it would then be more reasonable to go to extra effort to keep track of the highest weighted document if the percentages were actually desired.


#228 Trying to build xapian package for dapper fails during fakeroot apt-get source -b xapian-core Build system defect 2008-01-18

Building on Ubuntu dapper or gutsy gives a similar failure for the command fakeroot apt-get source -b xapian-core.

 g++ -DHAVE_CONFIG_H -I. -I.. -I../common -I../include -I./include -I../languages -Ilanguages -I../queryparser -Wall -W -Wredundant-decls -Wpointer-arith -Wcast-qual -Wcast-align -Wno-multichar -Wno-long-long -fno-gnu-keywords -Wundef -Wshadow -fvisibility=hidden -O2 -c ../api/editdistance.cc -o api/editdistance.o >/dev/null 2>&1
 /bin/sh ./libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I. -I.. -I../common -I../include -I./include -I../languages -Ilanguages -I../queryparser -Wall -W -Wredundant-decls -Wpointer-arith -Wcast-qual -Wcast-align -Wno-multichar -Wno-long-long -fno-gnu-keywords -Wundef -Wshadow -fvisibility=hidden -O2 -c -o api/error.lo ../api/error.cc
 g++ -DHAVE_CONFIG_H -I. -I.. -I../common -I../include -I./include -I../languages -Ilanguages -I../queryparser -Wall -W -Wredundant-decls -Wpointer-arith -Wcast-qual -Wcast-align -Wno-multichar -Wno-long-long -fno-gnu-keywords -Wundef -Wshadow -fvisibility=hidden -O2 -c ../api/error.cc -fPIC -DPIC -o api/.libs/error.o
 In file included from ../api/error.cc:25:
 ../common/safeerrno.h:25:3: error: #error You must #include <config.h> before #include "safeerrno.h"
 make[3]: *** [api/error.lo] Error 1
 make[3]: Leaving directory `/home/rhatch/xapian-core-1.0.5/xapian-core-1.0.5/build'
 make[2]: *** [all-recursive] Error 1
 make[2]: Leaving directory `/home/rhatch/xapian-core-1.0.5/xapian-core-1.0.5/build'
 make[1]: *** [all] Error 2
 make[1]: Leaving directory `/home/rhatch/xapian-core-1.0.5/xapian-core-1.0.5/build'
 make: *** [build-stamp] Error 2
 Build command 'cd xapian-core-1.0.5 && dpkg-buildpackage -b -uc' failed.
 E: Child process failed

Prior to this failure, the command below does not seem to work:

 sudo apt-get build-dep xapian-core
 Reading package lists... Done
 Building dependency tree... Done
 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.


#230 C++-exceptions are not wrapped for Perl Search::Xapian 1.1.0 defect 2008-01-30

This results in exceptions being uncatchable, or catchable as string-exceptions.


#245 All-stopword queries with two or more terms should ignore stopword list QueryParser defect 2008-03-07

Currently, if a single word query is parsed, and that word is a stopword, the stopwording is ignored. However, if a multiple word query is parsed, and all words are stopwords, the stopwording is applied (resulting in an empty query).

If all the words in the query are stopwords, I think it may make sense to ignore the stopwording. However, even if we decide to apply the stopwording in this case, we should be consistent in our behaviour.

Some examples, in python:

 >>> import xapian
 >>> s=xapian.SimpleStopper()
 >>> s.add('foo')
 >>> s.add('bar')
 >>> qp=xapian.QueryParser()
 >>> qp.set_stopper(s)
 >>> str(qp.parse_query('foo'))
 'Xapian::Query(foo:(pos=1))'
 >>> str(qp.parse_query('foo foo'))
 'Xapian::Query()'
 >>> str(qp.parse_query('foo bar'))
 'Xapian::Query()'

Either the first parse_query() call should return Xapian::Query(), or the later ones should return non-empty queries.
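
One way to get consistent behaviour is to apply the same rule regardless of how many words the query has. The sketch below implements the "ignore the stopword list if every word is a stopword" option on plain lists of terms; apply_stopwords is a hypothetical helper, not part of the QueryParser API.

```python
from typing import Iterable, List, Set


def apply_stopwords(terms: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Drop stopwords, unless that would leave the query empty."""
    terms = list(terms)
    kept = [t for t in terms if t not in stopwords]
    # If every term is a stopword, ignore the stopword list entirely.
    return kept if kept else terms
```

With this rule, 'foo', 'foo foo' and 'foo bar' all produce non-empty queries, while a mixed query such as 'foo baz' still has its stopword removed.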


#254 Setting QueryParser default_op to OP_NEAR doesn't set an explicit window size QueryParser 1.1.0 defect 2008-04-24

When searching with more than 2 parameters on Boolean operator NEAR it throws an error: Exception: Can't use NEAR/PHRASE with a subexpression containing NEAR or PHRASE

Test case: http://myhealthcare.com/cgi-bin/search?q=american+actor+kevin&bool=near

-Kevin Duraj


#284 occasional DatabaseModifiedErrors Backend-Flint 1.0.9 defect 2008-07-23

I use xapian-core-1.0.7 with the corresponding perl bindings. I run a 1 writer/N reader setup, and I reopen() the database handle before each query. Nevertheless I occasionally get DatabaseModifiedErrors.

This is what I found out so far:

* The errors occur after explicitly flushing my most frequented index. The error occurs less often if I do a sleep(1) after each explicit flush() before applying any further changes (without flush) to the index, and it has never occurred so far with a sleep(4). This is my workaround.

* I already set XAPIAN_FLUSH_THRESHOLD to a large value (100000).

* I patched the xapian-core lib to log all calls of FlintDatabase::set_revision_number(), and the throw-points of the XapianModifiedErrors, which showed that the exception gets thrown in FlintTable::set_overwritten().

* I patched again to get the caller and found out that set_overwritten() got called by FlintTable::block_to_cursor(), which I patched again to expose the conditions:

if (REVISION(p) > REVISION(C_[j + 1].p)) {
 fprintf(stderr, "set_overwritten: from block_to_cursor() %d > %d\n", REVISION(p), REVISION(C_[j + 1].p));
 set_overwritten();
 return;
}

and it turned out:

set_overwritten: from block_to_cursor() 10194 > 10192
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'
(...)
set_overwritten: from block_to_cursor() 10195 > 10193
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'
set_overwritten: from block_to_cursor() 10195 > 10193
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'
(...)
set_overwritten: from block_to_cursor 10199 > 10197
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'
set_overwritten: from block_to_cursor 10199 > 10197
terminate called after throwing an instance of 'Xapian::DatabaseModifiedError'

I originally tested this with xapian-1.0.6, but it also occurs in 1.0.7.

I run xapian on Ubuntu Linux 8.04 (Hardy) with a 2.6.24-19-server kernel and an ext3 filesystem. The machine is an IBM x3650 with 40 GB RAM, and a ServeRAID-8k Controller running RAID 10 over 6 SAS disks.

My most frequented index (the one that throws the exceptions) contains about 850,000 documents, needs 11 GB of disk space, gets 5-15 updates per second, and about 20-25 search hits per second. I flush() this index every 10 minutes (which takes about 60-100 seconds + 4 seconds workaround delay ;-)
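
Independently of the root cause, a reader can usually cope with this error by reopening the handle and retrying the query rather than letting the exception terminate the process. The sketch below shows that pattern with stand-ins: the exception class and the db object with a reopen() method are defined or assumed here so the example runs on its own, and the retry limit is an arbitrary choice.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class DatabaseModifiedError(Exception):
    """Stand-in for the library's exception; defined here so the sketch runs."""


def query_with_reopen(
    db, run_query: Callable[[object], T], max_retries: int = 3
) -> T:
    """Run a query, reopening the handle if the database changed underneath us."""
    attempt = 0
    while True:
        try:
            return run_query(db)
        except DatabaseModifiedError:
            attempt += 1
            if attempt > max_retries:
                raise
            db.reopen()  # pick up the latest revision, then retry
```

This does not explain why the revisions in the log above are two apart; it only keeps readers alive while the writer is flushing.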


#288 Use F_FULLFSYNC ioctl where supported Backend-Chert 1.1.1 defect 2008-08-07

I've recently noticed that, when performing an fsync, sqlite and mysql use a special ioctl on OS X which makes an effort to ensure that the disk's internal write buffers are flushed to the platters. Perhaps we should be using this ioctl too.

http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html has some details about why this is needed.

http://www.sqlite.org/cvstrac/fileview?f=sqlite/src/os_unix.c&v=1.195 contains the sqlite implementation; search for the "full_fsync" function.
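
As a hedged sketch of the approach: on OS X the request is issued through fcntl() with the F_FULLFSYNC command, and a plain fsync is the fallback everywhere else. The Python version below looks the constant up at runtime, so it degrades to os.fsync() on platforms that do not define it; the function name full_sync is an assumption for this example, and it runs on POSIX systems only.

```python
import fcntl
import os


def full_sync(fd: int) -> None:
    """Flush a file descriptor, using F_FULLFSYNC where the platform offers it."""
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return
        except OSError:
            pass  # e.g. the filesystem does not support it; fall back below
    os.fsync(fd)
```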


#22 Eliminate common cases which cause a slow phrase search QueryParser 1.1.0 enhancement 2004-03-15

Some common punctuation (notably -) is treated as a word break when indexing, and as a phrase generator when searching. The problem is that many common cases end up creating phrase searches with one or two character terms which are very common, and these searches are slow with a big database.

Examples include: e-mail, cd-r, d-i-y

This could perhaps be addressed by a smarter word identifying algorithm. When indexing and searching, we could decide never to generate a single character term in certain circumstances (maybe also apply the same rules for two character terms).

So "e-mail" would be indexed as "email" not "e" and "mail". And similarly for searching. In general the extra conflation this gives seems useful (although email is apparently Dutch for enamel...)

The query parser probably wouldn't apply this rule to quoted phrase searches - otherwise searching for "o freddled gruntbuggly" would search for "ofreddled gruntbuggly" and tragically not find any matches (I'm sure there are less esoteric examples - a search for "i robot" say...)
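
A minimal sketch of such a rule, assuming "join a hyphenated run whenever any piece is a single character" as the circumstance; the exact rule, the ASCII-only pattern and the function name are assumptions for illustration, not the tokenizer Xapian uses.

```python
import re
from typing import List

# A run of alphanumeric pieces joined by single hyphens, e.g. "d-i-y".
_WORD_RUN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    """Split on hyphens, but join a hyphenated run if any piece is one character."""
    terms: List[str] = []
    for run in _WORD_RUN.findall(text.lower()):
        pieces = run.split("-")
        if len(pieces) > 1 and any(len(p) == 1 for p in pieces):
            terms.append("".join(pieces))  # "e-mail" -> "email"
        else:
            terms.extend(pieces)           # "well-known" -> "well", "known"
    return terms
```

Words separated by spaces are never joined, so a quoted phrase such as "i robot" keeps its single-character term.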


#52 Running postlists backwards Backend-Flint enhancement 2004-09-09

Ability to run a postlist backwards - it's chunked, so this is feasible (with a small change we can even decode the current encoding backwards!) This is useful as we can add articles in date order and do a boolean search running the posting lists backwards to do "sort by date" (which is good as it can terminate once we've enough matches). Need this for gmane.
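
The benefit can be sketched on plain sorted lists of docids. Assuming docids were assigned in date order, the hypothetical function below ANDs the lists together from the highest docid downwards and stops as soon as it has enough matches; it ignores chunking and encoding entirely.

```python
from typing import List, Sequence


def newest_matches(postlists: Sequence[Sequence[int]], limit: int) -> List[int]:
    """AND together sorted docid lists, walking from the highest docid down."""
    if not postlists or limit <= 0:
        return []
    positions = [len(p) - 1 for p in postlists]
    results: List[int] = []
    while all(pos >= 0 for pos in positions) and len(results) < limit:
        current = [p[pos] for p, pos in zip(postlists, positions)]
        target = min(current)
        if all(d == target for d in current):
            results.append(target)
            positions = [pos - 1 for pos in positions]
        else:
            # Step back every list that is still above the smallest candidate.
            positions = [
                pos - 1 if d > target else pos
                for d, pos in zip(current, positions)
            ]
    return results
```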


#59 Compress chert postlist changes buffered in memory Backend-Chert 1.1.0 enhancement 2004-11-26

If we could somehow reduce the memory used by the postlist changes chert buffers, we could buffer more and/or let the OS have more spare memory for buffering disk blocks. That should allow indexing to run faster. However we need to compress in such a way that we can still implement Xapian::Database methods including the effects of the buffered changes.


#113 QueryParser limitation/inconsistency QueryParser 1.1.0 enhancement 2007-03-15

Hi,

I've been using xapian (0.9.9 and now 0.9.10) recently at work and I've found that the exquisite QueryParser (no irony intended) imposes some serious limitations for certain queries, as it treats some characters specially, even when flags does not contain FLAG_PHRASE.

I'm talking about the method is_phrase_generator(). In the organization I work for we have a mixed setup of html documents and code. This includes several references to text in the word_word format. Unfortunately the QueryParser treats underscore as a phrase generator, making it impossible to search for terms indexed using whitespace separators, even when allterms() shows the term exists in the database.

I believe this is an inconsistency and also a limitation in the QueryParser, as it does not matter what flags are used: in such cases where the query string contains any of the characters defined in is_phrase_generator(), the query will be automatically converted to a phrase search (note that these characters can't be changed).

In an ideal world (mine at least), I'd expect the user to define a phrase (using " or any other previously defined character) and if this is not the case the QueryParser should not try to convert the query to anything else (except for the defined operations, OR, AND, etc).

OTOH, I could change the indexing to strip the underscores (and the other characters) and treat every part of the word_word as a separate term, but that would also mean that "word word" would match as well, when it's not what you wanted.

I hope you will take this into consideration. Feel free to contact me if you need further details or I can clarify anything else.

Many thanks,

f.-


#114 Use libmagic or libextractor instead of own MIME mappings and extractions Omega enhancement 2007-03-29

Hello,

I locally first modified omindex to use libmagic's MIME database, instead of hard coding the MIME type to file extension mapping. This ensures that the internally used MIME types are more consistent with accepted standard types.

Then I went further and instead of using file extensions to determine type, used libmagic to fingerprint the files. This is slower, but ensures that the file actually is identified correctly even if the extension is wrong.

Now I am using libextractor to actually extract the metadata from the file, instead of calling these external programs inside omindex based on the MIME type. Using libextractor greatly simplifies omindex.

Is anyone interested in these modifications?


#145 remote connection should pass 'writable' flag Backend-Remote 1.1.0 enhancement 2007-05-06

When a client application uses (say) xapian.remote_open() to connect to a server running in 'writable' mode, the server still opens the database for the connection in 'writable' mode, even though this was not requested by the caller.

This limitation means that an application might need to use 2 servers - one for writable and one for read-only - as usage of the writable server will lock out all other read-only requests, which would be unacceptable in some environments.

The fix is not trivial as the protocol doesn't provide a way of providing connection-specific options. A solution would be to have the client send a MSG_KNOCK (?) message at connection with options (just this flag in the first instance) and the server could respond with its REPLY_GREETING if all is well. I understand this isn't going to make 1.0 though (well, unless you are really keen and would accept a patch if I could make one :)


#150 Enhancements to Unicode support QueryParser 1.1.0 enhancement 2007-05-13

This bug is intended to just gather together enhancements we'd like to make to our Unicode support.

Currently I'm aware of two:

* Special cases for case conversion: http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Case_Mappings and in particular: http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing.txt

* Normalisation (mostly combining accents): http://www.unicode.org/Public/5.0.0/ucd/UCD.html#Decompositions_and_Normalization

I'd imagine we would probably want to target most such changes at 1.1.0, for reasons of database compatibility. There are probably cases where it would be reasonable to implement such changes sooner though - if we build a different database in a case where the existing behaviour is poor, or the difference isn't problematic for some other reason, say.


#151 Use function attributes to mark functions as "const", "pure", and "nothrow" Other 1.1.0 enhancement 2007-05-13

GCC allows functions to be annotated with __attribute__((const)) if they "do not examine any values except their arguments, and have no effects except the return value", which allows the compiler to use CSE to eliminate calls to them with identical arguments. This would probably be very useful for Xapian::Unicode::get_category() for example.

URL: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Function-Attributes.html#Function-Attributes


#167 Add mode to query parser to search for both stemmed and unstemmed forms QueryParser enhancement 2007-06-13

Now that we store both the stemmed and unstemmed forms of each word in the database, it might be nice to add a new stemming mode to the query parser which takes each word in the query and generates an "OR" query for it, with two parts; one being the unstemmed form and one being the stemmed form. This would mean that each query would match any document with words which match the stemmed form, but would give documents with the unstemmed form a higher weight.

We might call this option "STEM_BOTH", or some better name that someone other than me can think of.


#222 omindex should make use of O_NOATIME where available Omega enhancement 2007-12-18

On Linux >= 2.6.8, open() accepts a O_NOATIME flag which is intended for use by "indexing or backup programs". That means us!

I have a patch for this, which I'll attach shortly.

There's a wrinkle though - in some cases O_NOATIME will cause open to fail with EPERM and you need to retry the open call without O_NOATIME:

 EPERM The O_NOATIME flag was specified, but the effective user ID of the caller did not match the owner of the file and the caller was not privileged (CAP_FOWNER).

So for example, if we're indexing /usr/share/doc as a non-root user, we incur an extra syscall for each file - in this case it would be more efficient not to use O_NOATIME at all.

We need to quantify this overhead, and (if it's an issue) look at how to reduce it. One thought I had was, on a per-directory basis, to give up on using O_NOATIME if we failed to open a file using it. Then we only incur one syscall per directory for a read-only tree. Various tweaks to this are possible - e.g. give up for this directory and all subdirectories.
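
The basic retry described above can be sketched as follows. This is an illustration in Python rather than the attached patch: open_for_indexing is a hypothetical name, and the per-directory "give up" optimisation is left out.

```python
import os

# O_NOATIME is Linux-specific; fall back to 0 (no extra flag) elsewhere.
O_NOATIME = getattr(os, "O_NOATIME", 0)


def open_for_indexing(path: str) -> int:
    """Open a file read-only, avoiding atime updates when permitted."""
    if O_NOATIME:
        try:
            return os.open(path, os.O_RDONLY | O_NOATIME)
        except PermissionError:
            pass  # EPERM: not the file's owner; retry without the flag
    return os.open(path, os.O_RDONLY)
```

The extra syscall only happens on the failure path, which is exactly the cost the ticket proposes to measure for trees owned by another user.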


#234 add an option to specify whether filter terms of a given prefix should be ORed or ANDed together Omega enhancement 2008-02-01

Hi,

the patch at http://people.debian.org/~tviehmann/list-search/xapian_omega_add_option_filter_defaultop.diff adds an option map to allow overriding the filter behaviour from OR to AND among the terms of a given prefix. For example, if first and last name are indexed with prefix A, I would add

$setmap{filter_defaultop,A,AND}

to the query template in order to be able to handle first, last, or first and last name entered into the appropriate fields.

Kind regards

Thomas

URL: http://people.debian.org/~tviehmann/list-search/xapian_omega_add_option_filter_defaultop.diff


#235 store the sort specification in the option map instead of separate variables Omega 1.1.0 enhancement 2008-02-01

Hi,

the patch at http://people.debian.org/~tviehmann/list-search/xapian_omega_make_sortstuff_options.diff moves the sort specification into the option map and removes the sort_* variables in omega.h. This makes the sort specification better accessible from omegascript. Quite possibly, it could be improved by doing the same to docid_order.

The goal immediately at hand is to reduce the amount of changes to omega in the gmane/Debian list search patches, here to move the sort handling into the query templates.

Kind regards

Thomas

URL: http://people.debian.org/~tviehmann/list-search/xapian_omega_make_sortstuff_options.diff


#280 Review storage of parameters in Query Library API 1.1.0 enhancement 2008-06-27

Currently, Xapian::Query::Internal stores any "double" parameter value as a sortable_serialised string. There is a FIXME in the code for set_dbl_parameter() and get_dbl_parameter() (around line 976 of api/omqueryinternal.cc) saying: "FIXME: rework for 1.1.0". This hasn't been changed until now due to fear of breaking ABI compatibility.

Instead, we should store double parameters as doubles in Query::Internal.

While reorganising this, it might be worth making parameter storage a bit more general, and tidying it up. We currently have the following parameters stored in Query::Internal:

  • op: The operation to perform
  • subqs: A list of subqueries
  • parameter: A "termcount" - used by NEAR and PHRASE to be the window size, used by ELITE_SET to be the number of terms, and used by RANGE to be the value number to apply the range to. For the last of these, a "termcount" type isn't really appropriate (though it is probably the same storage size as "valueno" at present, so it probably works correctly).
  • tname: A string holding the term, for a leaf query. The start of the range, for a range query.
  • str_parameter: The end of the range for a range query. The result of sortable_serialise() on the multiplier for OP_SCALE_WEIGHT queries.
  • term_pos: The position of the term for leaf queries.
  • wqf: The within query frequency, for leaf queries.
  • external_source: The external source, for external source queries.

Two approaches seem plausible to me - firstly, we could define a union with the possible parameter types, and store the parameters in a list of these unions. Alternatively, we could subclass Query::Internal for each of the possible query types, and just store the appropriate parameters for each.

The latter approach seems cleaner to me, and more likely to be flexible for future expansion of the available query operators, but I've not thought about this much yet.
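
To show the shape of the subclassing approach, here is a sketch using Python dataclasses. The class and field names are hypothetical and only mirror the parameter list above; the point is that each query type carries just the parameters it needs, with the scale factor held as a real floating point value.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryInternal:
    """Base class; every query node records the operation it performs."""
    op: str


@dataclass
class LeafQuery(QueryInternal):
    tname: str = ""
    term_pos: int = 0
    wqf: int = 1


@dataclass
class WindowQuery(QueryInternal):
    """NEAR or PHRASE: subqueries plus a window size."""
    subqs: List[QueryInternal] = field(default_factory=list)
    window: int = 0


@dataclass
class RangeQuery(QueryInternal):
    valueno: int = 0  # a value number, rather than a reused "termcount"
    begin: str = ""
    end: str = ""


@dataclass
class ScaleWeightQuery(QueryInternal):
    subq: Optional[QueryInternal] = None
    factor: float = 1.0  # stored as a double, not a serialised string
```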


#290 Omega support for Office 2007 Word and Excel Documents Omega 1.1.0 enhancement 2008-08-26

This patch uses the xmlparser and unzip to extract and process strings from *.xlsx and *.docx files.

P.S. First time I have used svn to create a diff or Trac so forgive me if I've screwed something up :)


richard (20 matches)

Ticket Summary Component Milestone Type Created
Description
#169 Standard build system should support windows with MSVC Build system defect 2007-06-19

The standard build system (ie, configure, autotools, etc) should be able to detect and use MSVC on a windows system. This will require some unix support stuff to be installed, but MSys & Mingw should suffice (ie, a full cygwin installation shouldn't be needed).

Mark has made some progress in this direction, and Richard made a start at looking at it, so this bug is intended as a place to collaborate.

Olly says: "I don't know if libtool's support has bitrotted though. If it has, there's a (very poorly named) wrapper called "wgcc" which translated gcc options to msvc"


#178 No remote backend support for: spelling correction, synonyms, metadata Backend-Remote 1.1.0 defect 2007-07-04

The remote database was briefly feature complete, but it's fallen behind again - it doesn't support spelling correction, or synonym expansion.

It may also not support the new matchspy stuff.

We should add these in to it at some point.


#180 Add support for CJK text to queryparser and termgenerator QueryParser defect 2007-07-05

Some code to do this kind of tokenisation is now available at http://code.google.com/p/cjk-tokenizer/ which should probably be used as the basis for supporting this in Xapian.


#182 Match decider should be set on enquire object, not as get_mset() param Library API defect 2007-07-06

Currently, match deciders (and match spies) are specified by passing them as get_mset() parameters. It would be neater, and reduce the excessive number of parameters passed to get_mset(), if there was a "set_match_decider" function, instead of these parameters.

We could also use this style of API to support things like multiple match deciders, where each would be called in sequence, allowing only those documents which pass all deciders to be returned. This would be useful if only a limited set of predefined match deciders were available (for example, in a remote search, or when calling from Python), and a combination of restrictions was desired.
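A rough sketch of how chained deciders could behave, assuming a hypothetical add_match_decider() method on a stand-in object (EnquireSketch is not part of the Xapian API, and plain dicts stand in for documents). Each decider is called in sequence and a document is kept only if every decider accepts it.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]                    # stand-in for a real document
MatchDecider = Callable[[Document], bool]


class EnquireSketch:
    """Hypothetical stand-in for an enquire object; not the Xapian API."""

    def __init__(self) -> None:
        self._deciders: List[MatchDecider] = []

    def add_match_decider(self, decider: MatchDecider) -> None:
        self._deciders.append(decider)

    def get_mset(self, candidates: Iterable[Document]) -> List[Document]:
        # all() calls the deciders in order and stops at the first rejection.
        return [doc for doc in candidates
                if all(decider(doc) for decider in self._deciders)]
```

Combining two predefined deciders then needs no new decider class: register both and the restrictions are applied together.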


#183 Remote backend should support use of Xapian::MatchDecider Backend-Remote defect 2007-07-06

Currently, Enquire::register_match_decider() simply stores the values passed to it in the internals of the Enquire object. These values never get used.

Either register_match_decider() should be removed, or (more probably) the values should be used in the remote match case to allow match deciders registered with the server to be used.

For now, I've added a note in the documentation comment that this method effectively does nothing.


#185 Deadlocks with apache mod_python and mod_wsgi Xapian-bindings defect 2007-07-11

Summary of current known status

mod_python

Calling any Xapian methods or functions is likely to cause dead-lock unless you set this option in the Apache configuration section for all mod_python scripts which use Xapian:

PythonInterpreter main_interpreter

You may also need to use Python >= 2.4 (due to problems in Python 2.3 with the APIs the code uses; see http://issues.apache.org/jira/browse/MODPYTHON-217).

Even with main_interpreter and Python >= 2.4, calling from Xapian's C++ code back to Python code won't work properly (this means that you can't subclass Xapian objects in Python). This is apparently an issue with mod_python.

mod_wsgi

You'll need to set:

WSGIApplicationGroup %{GLOBAL}

For details see: http://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGIApplicationGroup and http://code.google.com/p/modwsgi/wiki/ApplicationIssues#Python_Simplified_GIL_State_API

The mod_wsgi developers say this should be sufficient, and you should be able to subclass Xapian objects in Python. If you encounter problems, please talk to us or the mod_wsgi developers so we can investigate.


Originally reported on the mailing list: http://thread.gmane.org/gmane.comp.search.xapian.general/4486


#191 Possible license conflict with the PHP bindings Xapian-bindings defect 2007-08-17

I am reporting this on behalf of Adel Gadllah <adel.gadllah@…>, who is looking into packaging the bindings for Fedora 7.

The PHP license and the GPL aren't compatible, but xapian-bindings links PHP-licensed and GPL-licensed code.

Quotes from the conversation on IRC with Fedora developers : "the problem i'm seeing is that xapian-bindings has bits of code that are GPLv2+ and PHP" "and it is merging them together into one .cc file and compiling _that_" "except, the GPLv2 and PHP are incompatible" "BOOM" "tell upstream that they can't compile PHP code with GPL* code"

We need this solved first before continuing with building the other bindings in Fedora.

Fabrice


#198 Add support for multiple values in each value slot in a Document. Backend-Flint 1.1.0 defect 2007-09-17

Currently, the value stored in a slot in a Document is a single string. It would sometimes be useful to be able to store multiple strings in the slot. For example, when using a value slot to store the set of facets that a document is relevant to, a given document may be relevant to multiple values. Also, if storing the set of tags matching a document, for use when generating a tag cloud, we want to be able to store multiple tags for each document.

However, we also need to preserve the existing API, and ensure that database formats are compatible.

Some discussion from IRC follows:

Richard Boulton: Do you think we could convert values as stored in databases
currently to allow multiple values, without breaking backwards compatibility?
ojwb: probably
ojwb: if only by checking the flint version
Richard Boulton: Hmm - if an old flint version is used to create a database, and
insert some values, it could be hard to then modify that database with a new
version of flint.
Richard Boulton: Unless we have a pass through the whole database to rewrite the
values.
ojwb: well, you could just disable the ability to add multiple values
Richard Boulton: Oh, no, its easy to store this.
Richard Boulton: Each value entry consists of a list of "valueno, entry" items.
Richard Boulton: (serialised, of course)
ojwb: or start the newly encoded ones in a way which is invalid
ojwb: oh, just duplicate?
Richard Boulton: Yep, there doesn't seem to be any thing to stop that.
ojwb: so the only question is if it's actually desirable!
Richard Boulton: They're kept in sorted order.
Richard Boulton: And the existing get_value() just returns the first of a
particular valueno found.
ojwb: that's nice then
Richard Boulton: So it would even be backwards compatible for reading purposes.
 (Old versions of xapian just wouldn't see the duplicate values)
ojwb: rewriting would mess up a document with multiple values, wouldn't it?
Richard Boulton: Not entirely.
Richard Boulton: add_value adds on the values at the end of the list.
Richard Boulton: without checking them.
ojwb: but aren't they unserialised into a map in Xapian::Document?
Richard Boulton: Oh.  Ah.
Richard Boulton: Yes, so getting a document out and then inserting it again
would lose the duplicates.
Richard Boulton: But that's a pretty nice way to degrade.
ojwb: yeah, it's not too bad
Richard Boulton: It would be nicer if we'd named Document::add_value()  as
document::set_value()
Richard Boulton: We can't change the behaviour of add_value() now, though: I
suppose we could add Document::append_value()
Richard Boulton: And leave Document::remove_value() as removing all values with
a given number.
Richard Boulton: Document::get_value() would return the first value for a given
valueno.
Richard Boulton: And we could add Document::get_values() which gets a list of
all the values for a given valueno.
Richard Boulton: Hmm - I wonder if the list of values for each valueno should be
kept in insertion order.  Or sorted in some way (binary sort, I would think).
ojwb: It shouldn't sort them I think
Richard Boulton: I think just in insertion order.
Richard Boulton: *snap*
ojwb: because you want a "primary version"
Richard Boulton: That's true.
ojwb: which is used for sorting, etc
ojwb: I'm not completely sure this is a good plan, but it seems to have merit
Richard Boulton: Yes.  That's the main thing I was unhappy about
StringListSerialiser for - you couldn't sensibly sort on the resulting values.
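The semantics discussed in the log above can be sketched as follows. This is an illustrative Python model only: DocumentSketch is a hypothetical class, and the assumption that the existing add_value() replaces the slot contents is read from the log rather than taken from the library source.

```python
from collections import defaultdict
from typing import DefaultDict, List


class DocumentSketch:
    """Hypothetical model of multi-value slots; not the Xapian class."""

    def __init__(self) -> None:
        self._values: DefaultDict[int, List[str]] = defaultdict(list)

    def add_value(self, valueno: int, value: str) -> None:
        # Assumed existing behaviour: replace whatever the slot held.
        self._values[valueno] = [value]

    def append_value(self, valueno: int, value: str) -> None:
        # Proposed: keep insertion order, so the first value stays "primary".
        self._values[valueno].append(value)

    def get_value(self, valueno: int) -> str:
        values = self._values.get(valueno)
        return values[0] if values else ""

    def get_values(self, valueno: int) -> List[str]:
        return list(self._values.get(valueno, []))

    def remove_value(self, valueno: int) -> None:
        # Removes every value stored under this number.
        self._values.pop(valueno, None)
```

Keeping insertion order means code that only knows get_value() still sees a stable primary value, which is what sorting would use.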

#213 Expose statistics to user defined Xapian::Weight subclasses Library API 1.1.0 defect 2007-11-24

Currently, the Xapian::Weight::Internal class (which is, as of last night, the class holding the statistics for the whole collection used by the weight objects) is not publicly visible. This means that it would be impossible, for example, for a user to write a weighting class equivalent to, say, the BM25Weight class, using the public API, because the statistics aren't available.

After cleaning up the weighting calculation system, I believe the Xapian::Weight::Internal class is now nearly clean enough that it could reasonably be made public, allowing custom weighting classes access to all the statistics currently available.

We might want to make the termfreq and reltermfreq members private, since they're likely to be accessed mainly through the accessor functions anyway. Also we might want to combine them into a single map with entries holding both the termfreq and the reltermfreq, since it's usual to want to access both the termfreq and the reltermfreq for a particular term at the same time.

Also, we might want to call the class Xapian::Stats, instead of Xapian::Weight::Internal, to reflect the Stats being part of the public API, but this would require an ABI change, so would have to wait for 1.1.0. (We could keep the API compatible by making Xapian::Weight::Internal a typedef for Xapian::Stats, I think; currently Stats (with no namespace) is a typedef for Xapian::Weight::Internal).
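As an illustration of the single-map idea, here is a small Python sketch. StatsSketch and TermStats are hypothetical names, and the accessors are invented for the example; the point is only that one lookup yields both frequencies for a term.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TermStats:
    termfreq: int = 0       # documents in the collection containing the term
    reltermfreq: int = 0    # relevant documents containing the term


class StatsSketch:
    """Hypothetical public statistics holder with a single per-term map."""

    def __init__(self, collection_size: int) -> None:
        self.collection_size = collection_size
        self._terms: Dict[str, TermStats] = {}

    def set_term(self, term: str, termfreq: int, reltermfreq: int = 0) -> None:
        self._terms[term] = TermStats(termfreq, reltermfreq)

    def get_term(self, term: str) -> TermStats:
        # One lookup returns both frequencies; unknown terms give zeros.
        return self._terms.get(term, TermStats())
```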


#229 Stub databases should be read with msvc_posix_open Other defect 2008-01-28

Currently, stub databases are read using a standard C++ ifstream. (See backends/database.cc, function open_stub()) This works fine, except that if a user (or the database replication code) tries, on Windows, to atomically rename a new stub db file over an existing one, it will receive an error if the old stub DB file was open.

This can be avoided if we instead use msvc_posix_open() (or just open() on unix) in open_stub() to get a file handle for the stub database, and access it using C file-handling routines.


#236 Implement automated tests of concurrent db replication and modification Backend-Flint defect 2008-02-05

Currently, there is no automated test of the behaviour when the replication function is doing a full copy of a database which gets modified while the copy is in progress.

I've done a manual test of this, so I'm moderately confident it works right, but I can't work out how to do a reliable automated test of it... at least, not without hacking a big sleep (or even a condition) into the database copying code, to allow me to be sure of getting some modifications done in the middle of it.
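One possible shape for such a test, sketched in Python with an in-memory dict standing in for the database blocks. The hook parameter (on_block_copied) is hypothetical: it plays the role of the "condition" hacked into the copying code, so the modification happens at a known point in the copy instead of relying on a sleep.

```python
from typing import Callable, Dict, Optional


def copy_blocks(source: Dict[int, bytes],
                on_block_copied: Optional[Callable[[int], None]] = None
                ) -> Dict[int, bytes]:
    """Copy blocks one at a time, calling the hook after each block."""
    copy: Dict[int, bytes] = {}
    for blockno in sorted(source):
        copy[blockno] = source[blockno]
        if on_block_copied is not None:
            on_block_copied(blockno)
    return copy


def run_concurrent_modification_case() -> Dict[int, bytes]:
    source = {0: b"a", 1: b"b", 2: b"c"}

    def modify_midway(blockno: int) -> None:
        if blockno == 0:
            source[2] = b"C"   # the "writer" changes a block not yet copied

    return copy_blocks(source, modify_midway)
```

The copy returned here mixes old and new blocks, which is exactly the state a real test would then feed to the replication code to check that it is detected and handled.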

Any suggestions appreciated.


#243 common/fileutils.cc needs tests Test Suite 1.1.0 defect 2008-03-05

This file was added for use by the replication stuff, and handles parsing and some simple manipulation of paths. This is particularly tricky for windows paths, unfortunately, and needs proper testing.


#268 Review ValueWeightPostingSource, possibly replace with a query operator Library API 1.1.0 defect 2008-05-12

External PostingSources have at least two annoying limitations (don't work with remote databases, don't work well with multi databases).

The newly added ValueWeightPostingSource simply reads a value slot, returns documents with a non-empty value, and returns the weight obtained by applying sortable_unserialise to the slot. Therefore, it could be implemented instead by a query operator, which would be similar to the existing OP_VALUE_... operators. This would make the feature available with remote databases and multi databases.

There may be a cleaner alternative which we haven't thought of yet, too.

Marking this for 1.1.0, since ValueWeightPostingSource isn't yet in the API for any release, and we should remove it before making a release if we're going to remove it at all.


#278 When changesets are being generated, old changesets aren't cleaned up Backend-Chert 1.1.0 defect 2008-06-23

Currently, changesets are generated when the "XAPIAN_MAX_CHANGESETS" environment variable is set to a non-empty value. However, they are never removed. Whenever a changeset is generated, the number of changesets around should be checked, and old changesets should be removed if too many old changesets exist.

Alternatively (or as well), a different criterion might be useful for the changesets: it might be useful to be able to set an absolute limit on the total size of the changesets, or perhaps, a limit on the total size of the changesets as a proportion of the total database size.
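A minimal sketch of the count-based cleanup, in Python. It assumes changeset files are named "changes" followed by a revision number; that naming scheme and the function name are assumptions made for the example. The function only decides which files are surplus and leaves the actual deletion to the caller.

```python
import re
from typing import Iterable, List

CHANGESET_RE = re.compile(r"^changes(\d+)$")   # assumed naming scheme


def changesets_to_remove(names: Iterable[str], max_changesets: int) -> List[str]:
    """Return the oldest changeset file names beyond max_changesets."""
    found = []
    for name in names:
        match = CHANGESET_RE.match(name)
        if match:
            found.append((int(match.group(1)), name))
    found.sort()                                 # oldest revision first
    surplus = max(0, len(found) - max_changesets)
    return [name for _, name in found[:surplus]]
```

A size-based limit would follow the same pattern, summing file sizes from the newest changeset backwards and removing everything past the limit.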


#104 Wildcard queries should use synonym instead of OR QueryParser 1.1.0 enhancement 2006-12-13

When the synonym query operator, and synonym postlists, are implemented, the queryparser should build wildcard queries using the synonym operator instead of the OR operator.


#107 We should have an automated performance test suite Test Suite enhancement 2007-01-02

We need to be able to keep track of how changes to the code affect the performance (ie, speed / resource usage) of Xapian. In particular, we should be able to test how fast a standard set of data is indexed and searched, simply by running a single command (ideally, integrated into the build system - eg, "make speedcheck").

I have the beginnings of such a system, in the shape of some python code which builds a wikipedia index. I'm starting this bug to keep track of progress on building this.


#128 Allow queryparser to treat some prefixes as literal text QueryParser 1.1.0 enhancement 2007-04-12

By default, the query parser splits words at spaces and applies lower-casing, stemming, and other normalisation to generate terms.

I believe that it should be possible to override the query parser's default behaviour for fields with a given set of prefixes, such that the query parser will treat some terms as literal text, allowing any character to occur in the term (including spaces and quotes), and not applying stemming or other normalisation to the term.

My thinking is that this can be implemented by adding a third prefix type (which I've called "EXACT_TEXT" for want of a better name), which causes the query parser to put all the characters following the prefix until the next space or ')' into the term (like terms with a "BOOL_FILTER" prefix type). The terms so generated are then included in the query structure in the same way as "FREE_TEXT" terms - ie, they obey surrounding boolean operators, and '+' and '-' prefixes.

In order to allow spaces (and ')' characters) in the terms, the query parser should support basic backslash escaping for the contents of such fields.
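The escaping rule described above might look roughly like this. It is a Python sketch of the scanning step only, with a hypothetical function name, and is not taken from the patch: a backslash makes the following character literal, and an unescaped space or ')' ends the term.

```python
from typing import Tuple


def read_exact_term(text: str, start: int) -> Tuple[str, int]:
    """Read a literal term from text[start:], honouring backslash escapes.

    Returns the term and the index of the first unconsumed character.
    """
    chars = []
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            chars.append(text[i + 1])   # take the escaped character literally
            i += 2
            continue
        if ch == " " or ch == ")":
            break                       # unescaped terminator ends the term
        chars.append(ch)
        i += 1
    return "".join(chars), i
```

For example, scanning `New\ York) AND x` from index 0 yields the term "New York" and stops at the closing parenthesis.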

I have a patch which implements this that I'll attach to this bug report shortly. The patch has a few test cases (but more are needed for such a new feature), and I've not written any documentation for it yet.

I know that Sidnei needs this for something he's working on, and I'd be delighted if we managed to get this into 1.0 since I'm going to have to maintain it until it gets committed, but it needs thorough review before being committed and timescales for 1.0 may not allow this.


#173 Bindings should have an explicit WritableDatabase::close() method Xapian-bindings enhancement 2007-06-22

In garbage-collected languages, it is hard to ensure that a WritableDatabase object has been closed, because this requires ensuring that no objects still hold a reference to it. To make this easier, WritableDatabases should have an explicit close() method, which would delete the underlying C++ object. After this method has been called, all other methods on the WritableDatabase object in the bindings would be invalid.
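A small Python sketch of the intended behaviour, using a hypothetical wrapper class in which a plain list stands in for the underlying C++ object. The names WritableDatabaseSketch and ClosedError are made up for the example; the point is that close() releases the object immediately and every later call fails cleanly.

```python
class ClosedError(Exception):
    """Raised when a closed handle is used (hypothetical name)."""


class WritableDatabaseSketch:
    """Hypothetical wrapper; a list stands in for the underlying C++ object."""

    def __init__(self) -> None:
        self._impl = []          # stand-in for the wrapped native object

    def _require_open(self) -> list:
        if self._impl is None:
            raise ClosedError("database has been closed")
        return self._impl

    def add_document(self, doc: str) -> int:
        impl = self._require_open()
        impl.append(doc)
        return len(impl)         # document ids start at 1

    def close(self) -> None:
        # Drop the native object now instead of waiting for garbage collection.
        self._impl = None
```

Other references to the wrapper may still exist after close(), but they can no longer reach the released object.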


#227 Implement database replication system Other 1.1.0 enhancement 2008-01-18

I have a setup where I would like to be able to perform index updates on one master database, and then replicate this database to multiple client machines for searching.

I've experimented with using an NFS setup for this, with the database kept local on the index server and mounted remotely on the search clients, hoping that the client machines would keep enough of the database cached that the network traffic would not slow down searches too much. However, this method doesn't work satisfactorily because the NFS protocol doesn't allow NFS clients to get information about file updates other than by polling the mtime of a file: therefore, whenever the index is updated, any cached pages from the database are discarded. This leads to many very slow searches.

For now, I'm looking at setting up a system to take snapshots of databases using filesystem features (eg, the snapshot functionality provided by ZFS) and then using xdelta to calculate the differences between the databases, transferring the differences manually, and then applying the differences to the database on the search machines.

However, this approach has two major drawbacks: firstly, it depends on filesystem specific features (to take filesystem snapshots - a standard file copy could be used, but this would have poor cache performance, which is exactly what we're trying to avoid). Secondly, it requires the whole database to be traversed on the index machine to calculate the binary diffs. This is undesirable because it imposes unnecessary load on the index machine.

Instead, I would like to have a hook into flint which writes out a list of the modified btree pages, so that these can then be distributed to the search servers. If this information was written to a log file, together with the points at which fsync was called, and with details of the changes made to the base files, this log file could be transferred to the search machines, and could be replayed there, with minimal work required there.
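To make the replay idea concrete, here is a rough Python sketch. The log entry types (PageWrite, Sync) and the in-memory page store are hypothetical, and changes to the base files are left out; the sketch only shows page writes being applied in groups delimited by the recorded fsync points.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Pages = Dict[Tuple[str, int], bytes]      # (table name, block number) -> data


@dataclass
class PageWrite:
    table: str
    blockno: int
    data: bytes


@dataclass
class Sync:
    revision: int        # marks a point where the master called fsync


LogEntry = Union[PageWrite, Sync]


def replay(log: List[LogEntry], pages: Pages, revision: int) -> int:
    """Apply complete, fsync-delimited groups of writes; return the new revision."""
    pending: List[PageWrite] = []
    for entry in log:
        if isinstance(entry, PageWrite):
            pending.append(entry)
        else:
            for write in pending:
                pages[(write.table, write.blockno)] = write.data
            pending.clear()
            revision = entry.revision
    return revision      # writes after the last Sync are left unapplied
```

Holding back writes that follow the last sync point means a truncated log never leaves the search copy in a state the master did not commit.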


#189 Add a place for translations of the documentation to the source tree Other task 2007-08-09

Yung-chung Lin has translated the intro_ir.html document into zh_TW. It would be good to have a place in the source tree to put such documents.

