Help Wanted

See GSoCProjectIdeas for a list more tailored to students taking part in Google's Summer of Code.

This page is intended to list possible projects for somebody who's interested in getting involved with Xapian development, but who isn't intimately familiar with the code yet (if you're a Xapian guru, you'll have to think up your own projects, or take a look at the bug database).

Some of these projects would benefit from special skills in other areas (for example, experience with a scripting language is required to produce decent bindings for it).

  • Add classes which provide more choice of weighting schemes (e.g. the Divergence from Randomness family of weighting schemes, and the traditional tf/idf schemes, such as those supported by SMART), and investigate how they compare with BM25 for speed and retrieval effectiveness. An efficient implementation may require keeping track of some extra statistics in the database.
  • Add a xapian-core feature to pull out the best N sentences/phrases from a string which contain terms from a specified list. Write an OmegaScript command to wrap this, and add an option to omindex to allow larger samples to be stored in the sample field. Then we can show a dynamic sample with matching terms in context. See also FAQ/Snippets.
  • Currently Omega's omindex needs to run an external program to index most formats (the exceptions are HTML, plain text, and uncompressed AbiWord documents). Some of these external programs are available as shared libraries, and using these instead would avoid the overhead of running an external filter and so speed up indexing. An obvious first target is a zipfile library instead of unzip.

Ideally we should load the libraries dynamically and, if they aren't available, disable those formats (as we currently do when the filter programs aren't available). It would be fine to just link directly against zlib though, as xapian-core already requires that. Requiring a zipfile reading library would probably also be reasonable, as many systems will already have one and a number of modern formats are based on a zipfile container (e.g. OpenDocument format).
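
As a rough illustration of the idea above, the following Python sketch reads a member of a zipfile container in-process rather than running unzip, and leaves a format disabled when the library it needs cannot be loaded. It is only an analogy: omindex is written in C++, and here Python's importlib stands in for dynamically loading a shared library. The names load_optional, enabled_formats, extract_member and make_sample_container, and the format table passed in, are hypothetical and not part of Omega.

```python
import importlib
import io
import zipfile


def load_optional(module_name):
    """Return the named module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def enabled_formats(requirements):
    """Map each format to its loaded library, skipping unavailable ones.

    `requirements` maps a (hypothetical) format name to the module it needs.
    """
    enabled = {}
    for fmt, module_name in requirements.items():
        module = load_optional(module_name)
        if module is not None:
            enabled[fmt] = module
    return enabled


def extract_member(data, member):
    """Read one member of an in-memory zip container without a subprocess."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.read(member)


def make_sample_container():
    """Build a tiny zip holding a content.xml member, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", "<text>hello</text>")
    return buffer.getvalue()
```

Calling enabled_formats once at startup and consulting the result for each file mirrors how missing filter programs are handled: a format whose library is absent is simply skipped rather than treated as an error.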

  • Implement bindings for Xapian in another language, ideally using SWIG. Currently we have pretty decent support for Python, PHP, Tcl, C#, Perl, and Ruby, though we could do with support from a user of C# or Ruby to help maintain those bindings better. The hand-coded JNI bindings for Java have fallen behind the current C++ API, because they are too much work to update by hand. We have some partially done SWIG-based Java bindings, which are much easier to keep up to date, but need input from someone with Java experience. Someone has been working on Pike bindings.
  • Some of the bindings would benefit from a more "idiomatic" wrapping. For example, wrapping the C++ iterators as PHP iterators, or eliminating SWIG's "SWIGTYPE" default wrappers in C#. This needs someone who understands the language concerned well enough to know what looks "right" in an API for it.
  • Add the ability to introspect on Query objects (see issue #159). This would allow queries generated by the query parser to be investigated and modified, and would be helpful in various scenarios. The trickiest bit is probably working out a suitable API - for this, you'd need to discuss possible APIs with the other developers on IRC or the mailing lists. It's probably a good task for a beginner, being fairly small and self-contained.
  • Improvements to spelling correction. The current spelling correction feature is useful for many users, but there is scope for improvement. For example, currently corrections aren't offered if the word is also misspelled in a document (#225), only a single suggested correction is available via the API, and no regard is given to other words in the query when ranking possible corrections. There are no doubt other aspects too.
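
To make two of the spelling correction points above concrete, here is a minimal Python sketch that returns several ranked suggestions and still offers corrections when the misspelled word itself occurs in the vocabulary. It is not how Xapian implements spelling correction: it uses plain Levenshtein distance with term frequency as a tie-break, and the function names, the frequencies mapping, and the max_distance and limit defaults are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, frequencies, max_distance=2, limit=3):
    """Return up to `limit` corrections, closest first, then most frequent.

    `frequencies` maps each known term to how often it occurs. The word
    being corrected is skipped as a candidate, so its presence in the
    vocabulary does not suppress suggestions.
    """
    scored = []
    for candidate, freq in frequencies.items():
        if candidate == word:
            continue
        distance = edit_distance(word, candidate)
        if distance <= max_distance:
            scored.append((distance, -freq, candidate))
    scored.sort()
    return [candidate for _, _, candidate in scored[:limit]]
```

Taking the other query words into account would need a further scoring term, for example how often a candidate co-occurs with those words; that is left out of this sketch.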