GSoC 2018 Project Ideas List

Xapian is taking part in Google Summer of Code this year.

You can contact us via IRC on #xapian on (web interface or link for IRC client), or on the xapian-devel mailing list. Unlike some of the larger projects, we will take the time to talk to you personally. Communication is an important part of open source development and of GSoC - if we haven't talked to you then your application is unlikely to be seriously considered.

Each idea below lists people who have expressed an interest in mentoring it.

Please don't contact individual potential mentors privately - use the IRC channel or the mailing list instead. One reason is that you'll likely get a faster response as other people may be able to help with many questions. Also interacting with the community is an important part of open source development, and we'll be looking for evidence of positive interactions when reviewing applications, but we can't easily take account of private interactions when doing this. The mentor names against each project are only an indication of who might want to mentor it - other people who haven't signed up yet may be interested too.

Project Ideas List

Note that these are ideas - some are more fully formed than others, but don't be afraid to take them and extend or adapt them in your proposal to produce something you're more interesting in working on. If you think you'd like to work with Xapian, but nothing below really appeals and you don't have an idea of your own, feel free to talk to us, and we'll see if we can come up with something suitable.

Click on the idea title in the list below to see more details. Each idea should have an explanation of what it's about, some references for further reading, a list of useful skills to already have, an estimated difficulty, and a list of people who have so far indicated they might be willing to mentor such projects.

Once you've have some idea what you might want to work on, follow the steps in our guide for GSoC students.

Looking at existing code should help you to get a grip on what's involved for most of these ideas - we suggest you check out the latest code from our git repository and start from there. If you can't find the right area, just come and ask.

Project: <Your Idea Here>

We invite students to suggest their own ideas for projects. Such projects need to benefit Xapian and/or the Xapian community, and be suitable in scope to be occupy you for 3 months full-time. GSoC rules require the projects to be primarily coding, but you are expected to document and test your code (ideally with automated testcases), so allow for that in your planning.

We strongly recommend talking to us about your idea before submitting your proposal so we can help you to polish it into something both you and we love.

You may find some inspiration on our general project ideas page, although some of the larger projects there have already been listed below, and others may not on their own be of sufficient size for a GSoC project.

Potential mentors: Don't worry about this, if we love your idea, we'll find a mentor for it

Library Internals

Project: Matcher Optimisations

The matcher is the part of Xapian which does all the hard work when generating search results. It implements a number of optimisations to help make searching fast. There is potential to improve things further, at least in some cases.

When we have an idea for a new optimisation, we usually open a ticket to make sure we don't forget about it, and to provide a single location for discussion and to track progress. Here are the currently open such tickets:

The idea for this project is to take several such optimisation ideas (either from the above, or ones you develop yourself), and for each in turn to implement the optimisation, and do performance testing to check that they give a performance improvement in at least some case, and to check that they don't make things slower in other cases (or if they do, then the gains outweigh the losses).



  • Good C++
  • Up for a challenge!

Difficulty: hard

Potential mentors: Olly Betts

Project: Math Aware Search

Math-aware search engines need to support formulae in queries. Mathematical Information Retrieval is concerned with finding information in documents that include mathematics.

This project aim is to implement support searching mathematical information in documents using Xapian.

For example, when a user searches for any mathematical expression like X² + Y², Xapian should return documents containing the expression a² + b² and even support complex expressions.

There has been a lot of interest in supporting mathematical expression in the search engine. NTCIR ran math retrieval tasks (, Recently there have been publications in SIGIR exploring the same idea.

For your proposal, you should review literature published on Math IR and decide on a suitable approach which can be implemented efficiently within Xapian.



  • Good C++

Difficulty: medium

Potential mentors: Gaurav Arora


Project: Weighting Schemes

The aim of this project is to add support for more weighting schemes to Xapian.

Xapian provides the ability to rank search results by relevance. The relevance is calculated using a mathematical formula, which can be specified by sub-classing Xapian::Weight. Xapian currently includes built-in subclasses for BM25, the "traditional" probabilistic weighting formulae] (which is essentially a special case of BM25), BM25+, several TF/IDF schemes, many of the Divergence from Randomness (DfR) family of models, coordinate matching, and Unigram Language Modelling. Bi-gram Language Modelling was also implemented during GSoC 2012 (but hasn't yet been merged).

To be supported, a weighting formula needs to be expressible as a sum of a weight from each matching term, optionally plus a per-document component. Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this - if additional statistics would be useful, you could add them as part of the project).

There are other weighting schemes which can be expressed in this way, and it would be useful to support them (some because they're potentially more effective than BM25, others because they're of interest for Information Retrieval students and academics).

Xapian currently supports some of TF/IDF schemes described by SMART, but only a subset of the documented normalisations are implemented. More could be supported if we tracked some extra statistics and made them available to weighting schemes.

Currently the number of unique terms in each document is available as a statistic for weighting schemes, but the value may be a slight overestimate. It would be better to store a chunked stream of unique terms values like how we store the document lengths as then we could provide the exact statistic, and it would also be more efficient.

It would also be very useful to see how the different schemes compare for speed and retrieval effectiveness, so can offer solid advice to users wondering which to use. It would also be good to know if there's a better default scheme than BM25. The parameter-free DfR schemes are particularly interesting as academic evaluations suggest they can outperform BM25 with tuned parameters, and "real world" users rarely seem to have the patience to tune parameters. However, it looks like the bounds on the weights aren't as tight as BM25, so Xapian is likely to have less scope to optimise the match process.



  • Basic (or better) knowledge of C++.
  • Knowledge of Information Retrieval would be useful.
  • Being comfortable rearranging algebraic formulae would be a bonus.

Difficulty: medium

Potential mentors: Olly Betts, Gaurav Arora

Project: Learning to Rank Stabilisation

Learning to Rank (Letor) is the application of Machine Learning (ML) to Information Retrieval (IR), in particular to the problem of ranking. Each document is represented by a vector of features. These features try to distinguish the levels of relevancy between documents (in the simple binary case between relevant and non-relevant documents). In the academic literature, Learning-to-Rank has been shown to perform better than unsupervised ranking models like TF-IDF or BM-25, especially in document retrieval and web-page retrieval.

Xapian has an experimental Letor API, with work split across several previous GSoC projects. The most recent two projects have worked on consolidating to get a stable, tested core of functionality that can be included in a future Xapian release. There is considerably more detail available on the project's individual page. There's still some consolidation work to do, cleaning up the code base, ensuring it has good automated test coverage, and making sure it's properly releasable.

In order to evaluate the implementation, you can index the INEX collection (without annotation tags), then get the set of queries and relevance judgements. It would be very nice if you can generate a number similar to that reported by the non-refactored code. The FIRE collection is another useful source of datasets for evaluation.

Beyond getting Letor releasable, there are a number of subsequent improvements that are worth considering, and more are likely:

  • Add support for the database backend to track the length of fields would avoid having to handle this specially as a feature for Letor, and would also allow implementation of weighting schemes like BM25F and PL2F.
  • Use a linear regression approach to combine the scores given by different Rankers
  • A feature reduction technique can be added to eliminate redundant features.
  • Currently the xapian-core api allows updating MSet as whole. We can modify this to update individual entries in an MSet.
  • We can add parallelization to xapian-letor using OpenMP or OpenCL to improve the performance.


  • Solid C++
  • High level of attention to detail
  • Knowledge of Information Retrieval (IR) and Machine Learning (ML) useful but not essential

Difficulty: medium

Potential mentors: Ayush Pandey, Ayush Tomar, Parth Gupta, James Aylett

Project: Diversification of Results

This project aims to provide more diversified search results to users.

For example, if a user of a general web search engine searches for "jaguar", they could be interested in big cats, or in cars, or in OS X 10.2, or in many other things. Without more context, the search engine will be more widely useful if it attempts to provide results on its first page which aren't all about the same interpretation of the query.

More generally, considering a single search result we want to take into account what additional relevance it adds given the context of other search results. A result which is almost identical to another doesn't add much even though both results might be highly relevant considered individually.

There's extensive academic literature on this topic - for your proposal you should review this and decide on a suitable approach which can be implemented relatively efficiently within Xapian.

The approach could be a post-processing step (like how clustering works in Xapian) or could happen as part of the search - either way it needs to be fast enough that users don't perceive the search as slow. You should probably target it taking under 0.1 seconds in addition to the time the search already takes.



  • Good C++
  • Reading and understanding academic papers

Difficulty: hard

Potential mentors: Olly Betts, Richhiey Thomas

Project: Learning to Rank Clickstream Data Mining

Learning to Rank (Letor) is the application of Machine Learning (ML) to Information Retrieval (IR), in particular to the problem of ranking.

Xapian has a stable Letor API, but a user wanting to deploy this ideally needs relevance judgments to train the machine learning (a pretrained model from a different document corpus can be used, but generally training with data from the same dataset will give better results).

Getting people to explicitly generate relevance judgments is likely to be expensive and time-consuming, but end users clicking on search results are implicitly making relevance judgments. A module which took such click data and mined it to produce relevance judgments for training would complement the existing Letor module.

Previous GSoC project, worked in that direction. So, we now have a way to generate letor training data from logs of user clicks on search results in omega.

There are a number of areas in which the project could go this year:

  • Enabling an end-to-end use of letor API with omega, that is, training the letor module on the training file obtained from omega click data and using it to improve search results on omega.
  • Currently, DBN click model training is based on a simple counting algorithm. There's an advanced version of training method given by a combination of EM and forward-backward algorithm in the paper which is worth having.
  • Also, since we only have one clickmodel, there is definitely a scope for implemeting more like Dependent Click Model and Intent Aware Models. We might then also be interested in providing a mechanism for their comparisons to the users.



  • Proficient in C++ and Python.
  • Knowledge of Information Retrieval (IR) and Machine Learning (ML) useful but not essential

Difficulty: medium

Potential mentors: James Aylett, Vivek Pal


Project: Text-Extraction Libraries

This project would use libraries in preference to external programs to extract text from various file formats during indexing.

Omega's omindex indexer currently has built-in support for HTML, plain text, and uncompressed AbiWord documents. All other formats require an external filter program (or sometimes more than one) to be run for each file.

The functionality provided by some of these external programs is also available as shared libraries, and using these instead would avoid the overhead of running an external filter and so speed up indexing.

A number of modern file formats are based around the zip file format with XML contents (e.g. OpenDocument format) so using a zip file reading library instead of the unzip program would be an obvious first target.

There are also libraries for at least PDF (Poppler), .doc (wvware), .wps (libwps), .wpd (libwpd), and DjVu.

Currently Omega avoids a hard requirement on the filter programs it uses by automatically disabling those formats for which the filters aren't installed. We could do something similar for libraries by loading them dynamically (with dlopen() or similar) and if they aren't available disable those formats which require them. But the additional complexity may not be justified.

It would certainly be fine to just link directly against zlib, as xapian-core already requires that. Also, hard requiring a zip file reading library seems reasonable as it covers several popular formats.

Another issue is that we want to isolate omindex from bugs in the extraction library - if the library crashes we want omindex to continue, and if the library ends up in a memory or CPU eating infinite loop, we don't want to loop forever.


  • See in-line links above.
  • Olly worked on a patch which adds support for using libwv2 (from wvware) to index .doc files, isolated in a subprocess to avoid bugs in the library from crashing omindex. No resource limits currently, and libwv2 seems rather buggy and no longer maintained, but the mechanism around it seems to work well.


  • Good C++
  • Familiarity with Linux/UNIX system programming would be useful.

Difficulty: medium

Potential mentors: James Aylett, Gaurav Arora

Programming Language Support

We have some specific projects and approaches we think would be useful. In all of them, and any others you think might make for valuable projects, it is worth also thinking about translating the examples in our Getting started guide to your target language. Doing this for a language we already support will both improve the usefulness of the getting started guide _and_ the language bindings; it will also likely suggest ways of improving the language bindings as you work on the examples.

Note that translating the getting started guide to a new language we already support is unlikely to be enough work on its own for a GSoC project. However it might be possible to do other work around the getting started guide, such as working on the automation and configuration required to automatically build and deploy the guide with all language variants, and an easy way of switching between the different languages. (If you're interested in something along these lines, it is likely that your proposal will require input from us to reach both a reasonable scope and a sensible approach to deployment that works with other aspects of Xapian's web presence.)

Project: Implement Bindings for Go

The aim of this project is to allow Xapian to be used from the Go language.

Xapian's core is written in C++ and provides a C++ API, which we'd like people to be able to use from Go. We'd prefer the bindings to be generated using SWIG to reduce the work required to update the bindings as the C++ API evolves.

Marius Tibeica has done some initial work on a proof of concept (see this repo too).

Most of the work is likely to be customising the SWIG-generated bindings to produce a more natural Go API.

The Go bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages.



  • Experience with Go
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Previous experience with calling from Go to C would be a bonus
  • Familiarity with reStructured text markup a bonus

Difficulty: medium

Potential mentors: Olly Betts

Project: Support Another Language using SWIG

The aim of this project is to allow Xapian to be used from a programming language which isn't already supported.

Xapian's core is written in C++ and provides a C++ API, but we have bindings which wrap this API to allow use from a number of other languages.

Where possible, we prefer bindings to be generated using SWIG to reduce the work required to update the bindings as the C++ API evolves.

Xapian currently has pretty decent support for Python, PHP, Tcl, C#, Java, Lua, Perl, Ruby, Erlang, R, and Node.js. There has been some work on Pike bindings but not using SWIG and only wrapping part of the C++ API. Work has also been done on bindings for Haskell and OCaml.

Most of the work is likely to be customising the SWIG-generated bindings to produce a more natural API in your chosen language (for example, the semantics of C++ iterators aren't natural in scripting languages).

The new bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages.



  • You'll need good familiarity with the language you want to add support for
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Knowledge of your chosen language's C/C++ API would be a bonus
  • Familiarity with reStructured text markup a bonus

Difficulty: medium

Potential mentors: Olly Betts, James Aylett

Project: Python Bindings Improvements

The aim of this project is to improve Xapian's Python bindings in various ways. Some ideas:

  • Support for PyPy
  • Improve the bindings speed and memory use. For example, SWIG's Python backend has a -builtin option "for better performance", which we should try out and see how it compares for speed and memory use, and if it causes any problems for code using the bindings.
  • Python doccomments only show information for one overloaded form
  • Put together a PyPI package of python bindings (optionally including xapian-core) to make it simpler to install the python bindings into a virtualenv.



  • Python
  • C++
  • Knowledge of Python's C API would be useful

Difficulty: medium

Potential mentors: Assem Chelli, James Aylett

Project: Support for HHVM

The aim of this project is to allow Xapian to be used from HHVM.

HHVM is a virtual machine for which PHP programs can be compiled. It doesn't support the same extension API as PHP, but it does provide the HHVM-Native Interface (HNI) which would allow writing bindings for Xapian.

A project in GSoC 2016 extended SWIG to support generating bindings for HHVM. This work is currently on a branch, mainly because the manual still needs updating. Most of the SWIG testsuite passes, but there are some failures.

The next stage is to use this new support to generate HHVM bindings for Xapian. This is likely to also require dealing with bugs in SWIG's HHVM support, and it would be good to get the SWIG testsuite fully passing as part of this project.



  • Good C/C++
  • PHP
  • Previous experience with HHVM would be useful

Difficulty: hard

Potential mentors: Olly Betts, James Aylett

Project: C# Bindings Improvements

This project would add more natural API wrapping in the C# bindings.

This project is really too small to make up a GSoC project by itself - look at combining it with another idea.

The C# bindings would benefit from a more "idiomatic" wrapping in places. For example:

This would allow us to provide an API which is felt more natural to C# programmers, making Xapian easier for them to use.

The bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages, but currently it doesn't cover C#.

If you use Debian or Ubuntu, another useful thing to work on would be expanding the xapian-bindings package to include a binary package for the C# bindings.



  • You need to be familiar with C# (you'll need to understand what "looks right" in a C# API)
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Knowledge of C#'s C API would be useful
  • Familiarity with reStructured text markup a bonus

Difficulty: easy-medium

Potential mentors: James Aylett

Project: Java Bindings Improvements

This project would improve Xapian's new SWIG-based Java bindings.

Xapian's core is written in C++ and provides a C++ API, but we have bindings which wrap this API to allow use from a number of other languages. Most of these bindings are built using SWIG which takes care of the mechanical work of wrapping new API features.

In Xapian 1.2, the Java bindings were implemented in hand-coded JNI which requires some tedious and error-prone work to wrap new features. Consequently the JNI bindings don't wrap the whole of the current C++ API.

On trunk and in 1.4, we have Java bindings with are generated by SWIG, but a few things don't work, and they aren't as compatible with the Java bindings in 1.2 as we'd like. This really needs someone with a good understanding of Java to get them fully working, and to wrap things like C++ iterators in a more Java-like way.

The bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages, but it doesn't fully cover Java. This will mostly involve implementing the remaining examples in Java (the initial work for this was done as a warm-up for GSoC in 2017.

If you use Debian or Ubuntu, another useful thing to work on would be expanding the xapian-bindings package to include a binary package for the Java bindings.

This project is at the easier end of the scale, so if you want a more challenging project, you could combine it with working on the C# bindings too. SWIG's support for Java and C# has a lot in common, and the two languages have a number of similarities, so it's a fairly natural combination.



  • You'll need a good understanding of Java (in particular, what "looks right" in a Java API)
  • Knowledge of SWIG and C++ would be useful
  • Knowledge of JNI might be useful, but isn't required
  • Familiarity with reStructured text markup a bonus

Difficulty: easy

Potential mentors: James Aylett


Project: Integrate Xapian into an application or framework

Many applications and frameworks would benefit from fast indexed search. If you're interested in adding Xapian integration to an application or framework, we're happy to consider that as a GSoC project (or as part of a GSoC project if the integration is straightforward). Similarly, improvements to existing integrations could form a good basis for a project (with the caveat that it might need pairing with some other work in order to create a project sufficiently large to make sense as a GSoC project).

One concrete application we'd like to see integration with is Trac - we use this for our bugtracker and wiki, but the built-in search is rather rudimentary. (A possible basis for this would be the TracAdvancedSearchPlugin.



  • A good working knowledge of the programming language you'll be using
  • Familiarity with the application or framework you want to integrate with would be very useful

Difficulty: variable

Potential mentors: Assem Chelli, James Aylett

Projects Related to Testing

Project: Performance Test Suite

It would be very useful if we had a testsuite which tested the performance of Xapian, in terms of things like time taken, memory usage, and disk space usage. This would allow us to easily see what effect a proposed change has, which currently usually involves setting up an ad-hoc test.

The tests should use real-world data for both the documents being indexed and the queries being run - synthetic data can have very different behaviour to real-world data.


  • Ticket #107 originally proposed creating this
  • xapian-core/tests/perftest/ contains some "performance tests", but they use randomly generated data, so the results may not reflect what users will see


  • Good C++

Difficulty: medium

Potential mentors: David Bremner, Olly Betts

Project: Speed up the Test Suite

Xapian has an extensive suite of feature tests, unit tests, and regression tests. The test harness can run tests under valgrind to detect use of uninitialised values, memory leaks, and other problems. The downside is that code runs significantly slower under valgrind, so a full testsuite run can take quite a long time. And that means developers are tempted not to do a full run, or not to run it under valgrind, and as a result sometimes code is committed which causes tests to fail, which isn't good for anyone.

Currently the test harness only runs one testcase at a time. Most modern machines have multiple processors, so if the harness could run testcases in parallel this should allow the testsuite run time to be reduced significantly - if you have 4 CPUs and can keep them all busy, it should complete in close to 1/4 of the time.

Our testcases often use one of a fairly small number of standard databases, which are reused if the testcase uses them read-only, but regenerated each time if the testcase modifies them. So the harness would need to take some care with scheduling testcases.

A second part to this project could be adding some instrumentation code to the test harness which records how long each testcase takes to run and can tell us which tests are taking a lot of time. It may well be that there are a few testcases which take a significant proportion of the time, and if so, it's likely that they don't need to be so slow.



  • Good C++

Difficulty: medium

Potential mentors: Olly Betts

Project: Idea Template

Note to mentors: We're inviting project ideas for working on other FOSS projects which build on Xapian, as well as on Xapian itself. This will allow us to offer a wider range of project ideas, and so have a broader appeal to students.

If you have an idea of a suitable scope, feel free to copy this template to a suitable place above and fill it in to match the other ideas here, then ping us on IRC or the mailing list to review it (or discuss first if you're unsure). You should be prepared to mentor the project, or nominate someone else who has said they are prepared to.

The aim of this project is to <brief summary>.

<Longer description>


  • <Links to useful background information>


  • <List of required skills>
  • <List of useful skills>

Difficulty: <easy/medium/hard>

Potential mentors: <list of people interested in mentoring this project>

Last modified 9 months ago Last modified on 26/03/18 06:00:32

Attachments (1)

Download all attachments as: .zip