This is a draft list of project ideas for students taking part in Google Summer of Code 2010.

Previous experience with Xapian certainly isn't required for these projects, but if you do have some, tell us about it in your application.

We recommend you use Linux or another UNIX-like system for development work, as we're better set up for development on such platforms. In particular we use them ourselves, so can more easily help with any setup issues you may encounter. If you want to run a virtualised Linux installation for development and have no existing preference, we suggest Ubuntu as our developer documentation covers it well.

Note to mentors: We're inviting project ideas for working on other FOSS projects which build on Xapian, as well as on Xapian itself. This will allow us to put together a wider range of project ideas, and so have a broader appeal to students. If you have an idea of a suitable scope, feel free to add it below in a format similar to the other ideas, and then ping us on IRC or the mailing list to review it (or discuss first if you're unsure). You should be prepared to mentor the project, or nominate someone else who will be.

Project Ideas

We also encourage you to suggest your own ideas for Xapian-related projects. And if you'd like to work with Xapian, but nothing below appeals and you don't have an idea of your own, feel free to talk to us to see if we can come up with something. You can contact us via IRC on #xapian on irc.freenode.net (web interface), or on the xapian-devel mailing list.

Weighting Schemes

The aim of this project is to add support for more weighting schemes to Xapian.

Xapian provides the ability to rank search results by relevance. The relevance is calculated using a mathematical formula, which can be specified by sub-classing Xapian::Weight. Xapian currently includes built-in subclasses for BM25 and the "traditional" probabilistic weighting formulae.

To be supported, a weighting formula needs to be expressible as a sum of a weight from each matching term, optionally plus a per-document component. Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this - if additional statistics would be useful, you could add them as part of the project).
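
As an illustration of that shape, here is a minimal standalone sketch. It does not use Xapian's actual Xapian::Weight interface: the names TermStats, TfIdfSketch, MatchingTerm and document_score are hypothetical, and the formula (log(1 + wdf) multiplied by log(N / termfreq)) is just one simple TF/IDF variant chosen for the example. It shows a per-term weight, an upper bound on that weight derived from a stored summary statistic, and a document score formed as the sum of the term weights plus an optional per-document component.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical summary statistics for one term; not Xapian's actual API.
struct TermStats {
    double collection_size;  // number of documents in the database (N)
    double termfreq;         // number of documents containing the term
    double wdf_upper_bound;  // largest within-document frequency of the term
};

// Hypothetical scheme: weight = log(1 + wdf) * log(N / termfreq).
struct TfIdfSketch {
    TermStats stats;

    double idf() const {
        assert(stats.termfreq > 0 && stats.termfreq <= stats.collection_size);
        return std::log(stats.collection_size / stats.termfreq);
    }

    // Contribution of one matching term to a document's score.
    double term_weight(double wdf) const {
        return std::log(1.0 + wdf) * idf();
    }

    // The weight never decreases as wdf grows, so evaluating it at the
    // largest wdf seen for this term gives an upper bound.
    double max_term_weight() const {
        return term_weight(stats.wdf_upper_bound);
    }
};

// One matching term in a particular document.
struct MatchingTerm {
    TfIdfSketch scheme;
    double wdf;  // within-document frequency in this document
};

// Score = sum of per-term weights, plus an optional per-document component.
double document_score(const std::vector<MatchingTerm>& terms,
                      double per_document_part = 0.0) {
    double score = per_document_part;
    for (const MatchingTerm& t : terms) {
        score += t.scheme.term_weight(t.wdf);
    }
    return score;
}
```

The upper bound is what lets a matcher skip documents which cannot make it into the results, so a scheme whose bound is loose will still be correct but may search more slowly.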

There are other weighting schemes which can be expressed in this way, and it would be useful to support them (some because they're potentially more effective than BM25, others because they're of interest for Information Retrieval students and academics).

Some examples of interesting weighting schemes are the Divergence from Randomness (DfR) family of weighting schemes, and the traditional TF/IDF schemes, such as those supported by SMART.

After implementing these schemes, it would be useful to see how they compare for speed and retrieval effectiveness with BM25 to see if there's a better default scheme for Xapian to use. The parameter-free DfR schemes are particularly interesting as academic evaluations suggest they can outperform BM25 with tuned parameters, and "real world" users rarely seem to have the patience to tune parameters.

Resources:

Skills:

  • Basic (or better) knowledge of C++.
  • Knowledge of Information Retrieval would be useful.
  • Being comfortable rearranging algebraic formulae would be a bonus.

Difficulty: medium

Dynamic Snippets

A "snippet" is the sample from a matching document which is displayed in the results with matching words highlighted.

Currently Omega stores a static fixed-size sample from the start of the extracted text. It would be more useful for users to see a dynamically generated sample showing where the search terms match within each matching document (as most web search engines do). This shows the context in which the matching terms are used and helps the user to decide which documents are not worth further investigation without having to fetch and read them all.

The suggested approach is:

  • add a xapian-core API feature to pull out the best N sentences or phrases from a string, based on which contain terms from a specified set. This allows API users to either store extracted text in the document data, or to re-extract it from the source documents in the result set.
  • add an OmegaScript command to wrap this, and add an option to omindex to allow larger samples to be stored in the sample field to allow Omega to show a dynamic snippet.
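
To make the first step more concrete, here is a rough standalone sketch of picking the best N sentences from a string. It is not a proposed xapian-core API: the function names are hypothetical, sentences are split naively at '.', '!' and '?', the query terms are assumed to be lower-case ASCII and unstemmed, and a sentence's score is simply the number of distinct query terms it contains.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Split text into sentences at '.', '!' or '?' (a deliberately naive rule).
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> sentences;
    std::string current;
    auto flush = [&] {
        // Drop leading whitespace; skip pieces which are entirely whitespace.
        std::size_t start = current.find_first_not_of(" \t\n");
        if (start != std::string::npos) {
            sentences.push_back(current.substr(start));
        }
        current.clear();
    };
    for (char ch : text) {
        current += ch;
        if (ch == '.' || ch == '!' || ch == '?') flush();
    }
    flush();
    return sentences;
}

// Count how many distinct query terms occur as whole words in the sentence.
// Words are lower-cased before lookup, so terms must be lower-case.
std::size_t count_matching_terms(const std::string& sentence,
                                 const std::set<std::string>& terms) {
    std::set<std::string> seen;
    std::string word;
    auto flush = [&] {
        if (!word.empty() && terms.count(word)) seen.insert(word);
        word.clear();
    };
    for (unsigned char ch : sentence) {
        if (std::isalnum(ch)) {
            word += static_cast<char>(std::tolower(ch));
        } else {
            flush();
        }
    }
    flush();
    return seen.size();
}

// Return up to n sentences which contain at least one query term, choosing
// the highest scoring ones and presenting them in document order.
std::vector<std::string> best_sentences(const std::string& text,
                                        const std::set<std::string>& terms,
                                        std::size_t n) {
    struct Scored {
        std::size_t index;
        std::size_t score;
    };
    std::vector<std::string> sentences = split_sentences(text);
    std::vector<Scored> scored;
    for (std::size_t i = 0; i < sentences.size(); ++i) {
        std::size_t score = count_matching_terms(sentences[i], terms);
        if (score > 0) scored.push_back(Scored{i, score});
    }
    // Highest score first; ties keep their original (document) order.
    std::stable_sort(scored.begin(), scored.end(),
                     [](const Scored& a, const Scored& b) {
                         return a.score > b.score;
                     });
    if (scored.size() > n) scored.erase(scored.begin() + n, scored.end());
    // Put the chosen sentences back into document order.
    std::sort(scored.begin(), scored.end(),
              [](const Scored& a, const Scored& b) {
                  return a.index < b.index;
              });
    std::vector<std::string> result;
    for (const Scored& s : scored) result.push_back(sentences[s.index]);
    return result;
}
```

A real implementation would need to work with stemmed terms and non-ASCII text, and would probably also want to weight rarer terms more heavily than common ones.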

Resources:

Skills:

  • You'll need to know C++.

Difficulty: medium

Text-Extraction Libraries

This project would use libraries in preference to external programs to extract text from various file formats during indexing.

Omega's omindex indexer currently has built-in support for HTML, plain text, and uncompressed AbiWord documents. All other formats require an external filter program (or sometimes more than one) to be run for each file.

The functionality provided by some of these external programs is also available as shared libraries, and using these instead would avoid the overhead of running an external filter and so speed up indexing.

A number of modern file formats are based around the zip file format with XML contents (e.g. OpenDocument format) so using a zip file reading library instead of the unzip program would be an obvious first target.

There are also libraries for at least PDF (Poppler), .wps (libwps), .wpd (libwpd), and DjVu.

Currently Omega avoids a hard requirement on the filter programs it uses by automatically disabling those formats for which the filters aren't installed. We could do something similar for libraries by loading them dynamically (with dlopen() or similar) and, if they aren't available, disabling those formats which require them. But the additional complexity may not be justified.

It would certainly be fine to just link directly against zlib, as xapian-core already requires that. Also, hard requiring a zip file reading library seems reasonable as it covers several popular formats.

Resources:

  • See in-line links above.

Skills:

  • C++
  • Some familiarity with Linux/UNIX system programming would be useful.

Difficulty: easy-medium

Support Another Language

Xapian's core is written in C++ and provides a C++ API, but we have bindings which wrap this API to allow use from a number of other languages.

This project would add bindings which allow Xapian to be used from a language which isn't already supported, ideally using SWIG to reduce the work required to update the bindings for future changes to the C++ API.

Xapian currently has pretty decent support for Python, PHP5, Tcl, C#, Java, Perl, and Ruby. There has been some work on Pike bindings but not using SWIG and only wrapping part of the C++ API.

The "most wanted" language not already supported seems to be Lua, but you're welcome to pitch for other languages which Xapian bindings would be useful for.

Most of the work is likely to be customising the SWIG-generated bindings to produce a more natural API in your chosen language (for example, the semantics of C++ iterators aren't natural in scripting languages).

Resources:

Skills:

  • You'll need good familiarity with the language you want to add support for.
  • Familiarity with C++.
  • Knowledge of SWIG useful.
  • Knowledge of your chosen language's C/C++ API would be a bonus.

Difficulty: medium-hard

Improve Existing Bindings

This project would add more natural API wrapping in existing language bindings.

Some of the bindings would benefit from a more "idiomatic" wrapping in places. Examples include wrapping the C++ iterators as PHP iterators, and eliminating SWIG's "SWIGTYPE" default wrappers in C#.

This would allow us to provide an API which feels more natural to programmers of each language, making Xapian easier for them to use.

Resources:

Skills:

  • You need to be familiar with the language (or languages) you want to work with (you'll need to understand what "looks right" in an API for those languages).
  • Familiarity with C++.
  • Knowledge of SWIG useful.
  • Knowledge of your chosen language's C/C++ API would be useful.

Difficulty: easy-medium

SWIG-based Java Bindings

This project would reimplement Xapian's Java bindings using SWIG instead of hand-coded JNI.

Xapian's core is written in C++ and provides a C++ API, but we have bindings which wrap this API to allow use from a number of other languages. Most of these bindings are built using SWIG, which takes care of the mechanical work of wrapping new API features.

However, the Java bindings are currently implemented in hand-coded JNI, which requires some tedious and error-prone work to wrap new features. Consequently the JNI bindings don't wrap the whole of the current C++ API.

Some work has been done on generating Java bindings with SWIG, but it needs someone with a good understanding of Java to get it working, and to wrap things like C++ iterators in a more Java-like way.

Resources:

  • See the in-line links in the above description.

Skills:

  • You'll need a good understanding of Java (in particular, what "looks right" in a Java API).
  • Knowledge of SWIG and C++ would be useful.
  • Knowledge of JNI might be useful, but isn't required.

Difficulty: easy-medium

Improve Spelling Correction

The existing spelling correction feature is useful for many users, but there is scope for improvement. For example, currently:

  • corrections aren't offered if the word is also misspelled in a document in the database even if the correct spelling is far more common (ticket #225)
  • only a single suggested correction is available via the API
  • no regard is given to other words in the query when ranking possible corrections (approach used by sycamore wiki: http://github.com/rtucker/sycamore/blob/master/Sycamore/search.py#L434)
  • corrections involving addition/removal/transposing of/with spaces aren't considered
  • currently spellings are only handled for unprefixed terms, but we ought to be able to offer spelling correction for fields. Some fields should be able to share spellings (e.g. title and document), others should have their own (e.g. author) and some may not want spelling correction at all.
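
As a rough illustration of the space-related case in the list above, here is a standalone sketch which does not use Xapian's spelling API. The Frequencies dictionary and the function names are hypothetical, splitting is done per byte (so it assumes single-byte characters), and a split is ranked by the frequency of its rarer half, which is just one plausible heuristic.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical spelling dictionary: word -> number of occurrences.
using Frequencies = std::map<std::string, unsigned>;

// Frequency of a word in the dictionary, or 0 if it is unknown.
unsigned frequency(const Frequencies& dict, const std::string& word) {
    Frequencies::const_iterator it = dict.find(word);
    return it == dict.end() ? 0 : it->second;
}

// Try adding one space: return the best "left right" split where both
// halves are known words, or an empty string if there is no such split.
std::string suggest_split(const Frequencies& dict, const std::string& word) {
    std::string best;
    unsigned best_score = 0;
    for (std::size_t i = 1; i < word.size(); ++i) {
        std::string left = word.substr(0, i);
        std::string right = word.substr(i);
        // Rank a split by its rarer half, so both halves must be plausible.
        unsigned score = std::min(frequency(dict, left),
                                  frequency(dict, right));
        if (score > best_score) {
            best_score = score;
            best = left + ' ' + right;
        }
    }
    return best;
}

// Try removing the space between two adjacent query words: return the
// joined word if it is known, or an empty string otherwise.
std::string suggest_join(const Frequencies& dict,
                         const std::string& first,
                         const std::string& second) {
    std::string joined = first + second;
    return frequency(dict, joined) > 0 ? joined : std::string();
}
```

Candidates produced this way would still need to be ranked against ordinary edit-distance corrections, and transpositions involving a space are not covered by this sketch.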

There are no doubt other aspects too.

Skills: Good knowledge of C++ required.

Difficulty: medium

Matcher Optimisations

The matcher is the part of Xapian which does all the hard work when generating search results. It implements a number of optimisations to help make searching fast. There is potential to improve things further, at least in some cases.

When we have an idea for a new optimisation, we usually open a ticket to make sure we don't forget about it, and to track discussion and progress. Here are the currently open such tickets:

The idea for this project is to take several such optimisation ideas (either from the above, or ones you develop yourself) and, for each in turn, implement the optimisation and then do performance testing. The testing should check that the optimisation gives a performance improvement in at least some cases, and that it doesn't make things slower in other cases (or if it does, that the gains outweigh the losses).

Resources:

Skills:

  • Good C++
  • Up for a challenge!

Difficulty: hard

Gmane Search Improvements

Gmane is a public mailing list archive, with a total of over 94 million messages from over 12 thousand mailing lists. It has a Xapian-powered article search.

There are a number of enhancements which could be made to the current search:

  • Currently the group filter only allows a sub-hierarchy to be specified (users would like to be able to say cpan not gmane.comp.lang.perl.cpan.*).
  • Allow restricting a search to the Subject: header.
  • Allow searching for parts of the email address.
  • Provide a search API (e.g. RSS feeds of results).
  • Some sort of grouping or collapsing by thread?
  • Currently the search index is updated once per day, but it would be useful to be able to search more recently arrived articles.
  • Make use of Xapian's spelling correction functionality (prototype implementation)

Resources:

Skills:

  • Good C++
  • Perl useful
  • Experience with web technologies useful
  • Familiarity with Linux/UNIX system programming would be useful

Difficulty: medium/hard

CJK Support

This project idea is to improve support for indexing and searching Chinese, Japanese, and Korean text.

Xapian's TermGenerator class converts text to a series of terms for indexing, and its QueryParser class converts a user-entered query to a Query object tree for searching. At the moment, both of these mostly assume that terms are groups of letters delimited by whitespace.

Some work has been done on this already (see ticket:180) but performance is important (especially for indexing, when large amounts of text are usually being processed) and the prototype patch there will slow down indexing of non-CJK text unacceptably.
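
One possible way to limit the cost for non-CJK text (an assumption made for illustration here, not what the prototype patch does) is a cheap per-code-point check, so that only text which actually contains CJK characters takes a slower path. The sketch below is standalone, the function names are hypothetical, and the list of Unicode blocks is deliberately incomplete.

```cpp
#include <string>

// True for code points in a few well-known CJK blocks (not exhaustive).
bool is_cjk_codepoint(char32_t cp) {
    return (cp >= 0x4E00 && cp <= 0x9FFF)     // CJK Unified Ideographs
        || (cp >= 0x3040 && cp <= 0x309F)     // Hiragana
        || (cp >= 0x30A0 && cp <= 0x30FF)     // Katakana
        || (cp >= 0xAC00 && cp <= 0xD7AF);    // Hangul Syllables
}

// Scan already-decoded text once. A caller could use the result to choose
// between the existing whitespace-based handling and CJK-aware handling.
bool contains_cjk(const std::u32string& text) {
    for (char32_t cp : text) {
        if (is_cjk_codepoint(cp)) return true;
    }
    return false;
}
```

Whether a check like this is cheap enough in practice is exactly the kind of question the project would need to answer by benchmarking indexing of non-CJK text.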

Resources:

Skills:

  • C++
  • Familiarity with Chinese, Japanese, and/or Korean would be very useful
  • Knowledge of Unicode useful

Difficulty: medium