wiki:GSoCProjectIdeas

GSoC 2019 Project Ideas List

GSoC isn't happening right now, but we're still happy to mentor anyone wanting to work on projects - please get in touch if you're interested.

You can contact us via IRC on #xapian on irc.freenode.net or on the xapian-devel mailing list. Unlike some of the larger projects, we will take the time to talk to you personally. Communication is an important part of open source development and of GSoC - if we haven't talked to you then your application is unlikely to be seriously considered.

Each idea below lists people who have expressed an interest in mentoring it.

Please don't contact individual potential mentors privately - use the IRC channel or the mailing list instead. One reason is that you'll likely get a faster response as other people may be able to help with many questions. Also interacting with the community is an important part of open source development, and we'll be looking for evidence of positive interactions when reviewing applications, but we can't easily take account of private interactions when doing this. The mentor names against each project are only an indication of who might want to mentor it - other people who haven't signed up yet may be interested too.

Project Ideas List

Note that these are ideas - some are more fully formed than others, but don't be afraid to take them and extend or adapt them in your proposal to produce something you're more interesting in working on. If you think you'd like to work with Xapian, but nothing below really appeals and you don't have an idea of your own, feel free to talk to us, and we'll see if we can come up with something suitable.

Click on the idea title in the list below to see more details. Each idea should have an explanation of what it's about, some references for further reading, a list of useful skills to already have, an estimated difficulty, and a list of people who have so far indicated they might be willing to mentor such projects.

Once you've have some idea what you might want to work on, follow the steps in our guide for GSoC students.

Looking at existing code should help you to get a grip on what's involved for most of these ideas - we suggest you check out the latest code from our git repository and start from there. If you can't find the right area, just come and ask.

Project: <Your Idea Here>

We invite students to suggest their own ideas for projects. Such projects need to benefit Xapian and/or the Xapian community, and be suitable in scope to be occupy you for 3 months full-time. GSoC rules require the projects to be primarily coding, but you are expected to document and test your code (ideally with automated testcases), so allow for that in your planning.

We strongly recommend talking to us about your idea before submitting your proposal so we can help you to polish it into something both you and we love.

You may find some inspiration on our general project ideas page, although some of the larger projects there have already been listed below, and others may not on their own be of sufficient size for a GSoC project.

Potential mentors: Don't worry about this, if we love your idea, we'll find a mentor for it

Library Internals

Project: Matcher Optimisations

The matcher is the part of Xapian which does all the hard work when generating search results. It implements a number of optimisations to help make searching fast. There is potential to improve things further, at least in some cases.

When we have an idea for a new optimisation, we usually open a ticket to make sure we don't forget about it, and to provide a single location for discussion and to track progress. Here are the currently open such tickets:

The idea for this project is to take several such optimisation ideas (either from the above, or ones you develop yourself), and for each in turn to implement the optimisation, and do performance testing to check that they give a performance improvement in at least some case, and to check that they don't make things slower in other cases (or if they do, then the gains outweigh the losses).

Resources:

Skills:

  • Good C++
  • Up for a challenge!

Difficulty: hard

Potential mentors: Olly Betts

Project: Improve estimated total number of results

The aim of this project is to improve the estimated total number of results, as returned by the Xapian::MSet::get_matches_estimated() method.

When requesting results using Xapian::Enquire::get_mset() the user specifies how many matching documents they're interested in - for the first page of a typical web search you're only interested in the top 10 results. Then Xapian's matcher is able to make use of various short-cuts which significantly reduce the time the search takes by not considering documents which it can prove can't rank highly enough to make that top 10.

But because of these short-cuts the matcher often doesn't know exactly how many matching documents there are in total, so it provides an estimate of the total, along with upper and lower bounds.

The estimate is currently calculated from the frequency of occurrence of the terms, assuming they occur independently of one another - this makes for easy probability calculations, but words in a query are usually somewhat correlated which means the estimate tends to be skewed.

Thanks to some recent changes, the first and last document which each term matches are now available to the matcher, and these could be used to calculate better estimates by taking into account how much these ranges overlap between terms.

Also, if we ran some experiments using real datasets and associated queries we could look at how the estimates compare to the true answers, and use the results to tune the calculations of the estimate to try to reduce the skewing effect we get from assuming independence.

Resources:

Skills:

  • C++
  • Understanding of probability

Difficulty: medium

Potential mentors: Guruprasad Hegde, Olly Betts

Ranking

Project: Weighting Schemes

The aim of this project is to add support for more weighting schemes to Xapian.

Xapian provides the ability to rank search results by relevance. The relevance is calculated using a mathematical formula, which can be specified by sub-classing Xapian::Weight. Xapian currently includes built-in subclasses for BM25, the "traditional" probabilistic weighting formulae] (which is essentially a special case of BM25), BM25+, several TF/IDF schemes, many of the Divergence from Randomness (DfR) family of models, coordinate matching, and Unigram Language Modelling. Bi-gram Language Modelling was also implemented during GSoC 2012 (but hasn't yet been merged).

To be supported, a weighting formula needs to be expressible as a sum of a weight from each matching term, optionally plus a per-document component. Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this - if additional statistics would be useful, you could add them as part of the project).

There are other weighting schemes which can be expressed in this way, and it would be useful to support them (some because they're potentially more effective than BM25, others because they're of interest for Information Retrieval students and academics).

Xapian currently supports some of TF/IDF schemes described by SMART, but only a subset of the documented normalisations are implemented. More could be supported if we tracked some extra statistics and made them available to weighting schemes.

Currently the number of unique terms in each document is available as a statistic for weighting schemes, but the value may be a slight overestimate. It would be better to store a chunked stream of unique terms values like how we store the document lengths as then we could provide the exact statistic, and it would also be more efficient.

It would also be very useful to see how the different schemes compare for speed and retrieval effectiveness, so can offer solid advice to users wondering which to use. It would also be good to know if there's a better default scheme than BM25. The parameter-free DfR schemes are particularly interesting as academic evaluations suggest they can outperform BM25 with tuned parameters, and "real world" users rarely seem to have the patience to tune parameters. However, it looks like the bounds on the weights aren't as tight as BM25, so Xapian is likely to have less scope to optimise the match process.

Resources:

Skills:

  • Basic (or better) knowledge of C++.
  • Knowledge of Information Retrieval would be useful.
  • Being comfortable rearranging algebraic formulae would be a bonus.

Difficulty: medium

Potential mentors: Gaurav Arora, Richhiey Thomas, Guruprasad Hegde

Project: Learning to Rank Stabilisation

Learning to Rank (Letor) is the application of Machine Learning (ML) to Information Retrieval (IR), in particular to the problem of ranking. Each document is represented by a vector of features. These features try to distinguish the levels of relevancy between documents (in the simple binary case between relevant and non-relevant documents). In the academic literature, Learning-to-Rank has been shown to perform better than unsupervised ranking models like TF-IDF or BM-25, especially in document retrieval and web-page retrieval.

Xapian has an experimental Letor API, with work split across several previous GSoC projects. The most recent two projects have worked on consolidating to get a stable, tested core of functionality that can be included in a future Xapian release. There is considerably more detail available on the project's individual page. There's still some consolidation work to do, cleaning up the code base, ensuring it has good automated test coverage, and making sure it's properly releasable.

In order to evaluate the implementation, you can index the INEX collection (without annotation tags), then get the set of queries and relevance judgements. It would be very nice if you can generate a number similar to that reported by the non-refactored code. The FIRE collection is another useful source of datasets for evaluation.

Skills:

  • Solid C++
  • High level of attention to detail
  • Knowledge of Information Retrieval (IR) and Machine Learning (ML) useful but not essential

Difficulty: medium

Potential mentors: Ayush Pandey, James Aylett, Richhiey Thomas

Project: Learning to Rank Clickstream Data Mining

Learning to Rank (Letor) is the application of Machine Learning (ML) to Information Retrieval (IR), in particular to the problem of ranking.

Xapian has a stable Letor API, but a user wanting to deploy this ideally needs relevance judgments to train the machine learning (a pretrained model from a different document corpus can be used, but generally training with data from the same dataset will give better results).

Getting people to explicitly generate relevance judgments is likely to be expensive and time-consuming, but end users clicking on search results are implicitly making relevance judgments. A module which took such click data and mined it to produce relevance judgments for training would complement the existing Letor module.

Previous GSoC project, worked in that direction. So, we now have a way to generate letor training data from logs of user clicks on search results in omega.

There are a number of areas in which the project could go this year:

  • Enabling an end-to-end use of letor API with omega, that is, training the letor module on the training file obtained from omega click data and using it to improve search results on omega.
  • Currently, DBN click model training is based on a simple counting algorithm. There's an advanced version of training method given by a combination of EM and forward-backward algorithm in the paper which is worth having.
  • Also, since we only have one clickmodel, there is definitely a scope for implemeting more like Dependent Click Model and Intent Aware Models. We might then also be interested in providing a mechanism for their comparisons to the users.

Resources:

Skills:

  • Proficient in C++ and Python.
  • Knowledge of Information Retrieval (IR) and Machine Learning (ML) useful but not essential

Difficulty: medium

Potential mentors: James Aylett, Amanda Jayanetti

Omega

Project: Text-Extraction Libraries

This project would use libraries in preference to external programs to extract text from various file formats during indexing.

Omega's omindex indexer currently has built-in support for HTML, plain text, CSV (Comma-Separated Values), SVG, Atom feeds and AbiWord documents. Thanks to a 2019 GSoC project, it also has a way of safely using external libraries, with support via that for PDF, various eBook formats, documents from Apple's iWork suite, OCR of some image formats, and MIME-formatted emails. Other formats require an external filter program (or sometimes more than one) to be run for each file.

The functionality provided by some of these external programs is also available as shared libraries, and using these instead would avoid the overhead of running an external filter and so speed up indexing.

A number of modern file formats are based around the zip file format with XML contents (e.g. OpenDocument format) so using a zip file reading library instead of the unzip program could be an useful target.

Omega already uses zlib to read gzip compressed Abiword files. Hard requiring a zip file reading library as well seems reasonable as that covers several popular formats - libarchive is probably a sensible option.

There are also libraries for a number of other formats.

Resources:

Skills:

  • Good C++
  • Familiarity with Linux/UNIX system programming would be useful.

Difficulty: medium

Potential mentors: James Aylett, Gaurav Arora

Programming Language Support

We have some specific projects and approaches we think would be useful. In all of them, and any others you think might make for valuable projects, it is worth also thinking about translating the examples in our Getting started guide to your target language. Doing this for a language we already support will both improve the usefulness of the getting started guide _and_ the language bindings; it will also likely suggest ways of improving the language bindings as you work on the examples.

Note that translating the getting started guide to a new language we already support is unlikely to be enough work on its own for a GSoC project. However it might be possible to do other work around the getting started guide, such as working on the automation and configuration required to automatically build and deploy the guide with all language variants, and an easy way of switching between the different languages. (If you're interested in something along these lines, it is likely that your proposal will require input from us to reach both a reasonable scope and a sensible approach to deployment that works with other aspects of Xapian's web presence.)

Project: Implement Bindings for Go

The aim of this project is to allow Xapian to be used from the Go language.

Xapian's core is written in C++ and provides a C++ API, which we'd like people to be able to use from Go. We'd prefer the bindings to be generated using SWIG to reduce the work required to update the bindings as the C++ API evolves.

Marius Tibeica has done some initial work on a proof of concept. Currently you can generate bindings using SWIG with basically no customisation, build them, and run simple tests.

What needs doing:

  • The current branch is based on Xapian 1.4.x, but really this work should be done on git master, with the aim of including it in the next stable release series. So one of the first things to do is rebase the existing work onto git master.
  • The SWIG-generated bindings need customising to wrap the places that SWIG can't handle unaided, and to produce a more natural Go API in places where SWIG doesn't produce a very nature Go API by itself.
  • The automated tests need filling out to give a similar level of coverage to what we have for other languages.
  • The Go bindings should be documented in

Getting Started with Xapian - the source code for this book is set up to allow versions for different languages.

Resources:

Skills:

  • Experience with Go
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Previous experience with calling from Go to C would be a bonus
  • Familiarity with reStructured text markup a bonus

Difficulty: medium

Potential mentors: Olly Betts

Project: Support Another Language using SWIG

The aim of this project is to allow Xapian to be used from a programming language which isn't already supported.

Xapian's core is written in C++ and provides a C++ API, but we have bindings which wrap this API to allow use from a number of other languages.

Where possible, we prefer bindings to be generated using SWIG to reduce the work required to update the bindings as the C++ API evolves.

Xapian currently has pretty decent support for Python, PHP, Tcl, C#, Java, Lua, Perl, Ruby, Erlang, R, and Node.js. There has been some work on Pike bindings but not using SWIG and only wrapping part of the C++ API. Work has also been done on bindings for Haskell and OCaml.

Most of the work is likely to be customising the SWIG-generated bindings to produce a more natural API in your chosen language (for example, the semantics of C++ iterators aren't natural in scripting languages).

The new bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages.

Resources:

Skills:

  • You'll need good familiarity with the language you want to add support for
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Knowledge of your chosen language's C/C++ API would be a bonus
  • Familiarity with reStructured text markup a bonus

Difficulty: medium

Potential mentors: Olly Betts, James Aylett

Project: Python Bindings Improvements

The aim of this project is to improve Xapian's Python bindings in various ways. Some ideas:

  • Support for PyPy (although some initial investigation has suggested this would be a lot of work)
  • Improve the bindings speed and memory use. For example, SWIG's Python backend has a -builtin option "for better performance", which we should try out and see how it compares for speed and memory use, and if it causes any problems for code using the bindings.
  • Python doccomments only show information for one overloaded form
  • Put together a PyPI package of python bindings (optionally including xapian-core) to make it simpler to install the python bindings into a virtualenv.
  • We currently have our own home-grown test runner, and it may be beneficial to use a common third party, or at least to use something like unittest (probably only for python3 bindings, simply because spending lots of time on py2 at this point is probably not valuable)

Resources:

Skills:

  • Python
  • C++
  • Knowledge of Python's C API would be useful

Difficulty: medium

Potential mentors: James Aylett

Project: C# Bindings Improvements

This project would add more natural API wrapping in the C# bindings.

This project is really too small to make up a GSoC project by itself - look at combining it with another idea.

The C# bindings would benefit from a more "idiomatic" wrapping in places. For example:

This would allow us to provide an API which is felt more natural to C# programmers, making Xapian easier for them to use.

The bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages, but currently it doesn't cover C#.

If you use Debian or Ubuntu, another useful thing to work on would be expanding the xapian-bindings package to include a binary package for the C# bindings.

Resources:

Skills:

  • You need to be familiar with C# (you'll need to understand what "looks right" in a C# API)
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Knowledge of C#'s C API would be useful
  • Familiarity with reStructured text markup a bonus

Difficulty: easy-medium

Potential mentors: James Aylett

Project: Java Bindings Improvements

This project would improve Xapian's new SWIG-based Java bindings.

Xapian's core is written in C++ and provides a C++ API, but we have bindings which wrap this API to allow use from a number of other languages. Most of these bindings are built using SWIG which takes care of the mechanical work of wrapping new API features.

In Xapian 1.2, the Java bindings were implemented in hand-coded JNI which requires some tedious and error-prone work to wrap new features. Consequently the JNI bindings don't wrap the whole of the current C++ API.

On trunk and in 1.4, we have Java bindings with are generated by SWIG, but a few things don't work, and they aren't as compatible with the Java bindings in 1.2 as we'd like. This really needs someone with a good understanding of Java to get them fully working, and to wrap things like C++ iterators in a more Java-like way.

The bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages, but it doesn't fully cover Java. This will mostly involve implementing the remaining examples in Java (the initial work for this was done as a warm-up for GSoC in 2017.

If you use Debian or Ubuntu, another useful thing to work on would be expanding the xapian-bindings package to include a binary package for the Java bindings.

This project is at the easier end of the scale, so if you want a more challenging project, you could combine it with working on the C# bindings too. SWIG's support for Java and C# has a lot in common, and the two languages have a number of similarities, so it's a fairly natural combination.

Resources:

Skills:

  • You'll need a good understanding of Java (in particular, what "looks right" in a Java API)
  • Knowledge of SWIG and C++ would be useful
  • Knowledge of JNI might be useful, but isn't required
  • Familiarity with reStructured text markup a bonus

Difficulty: easy

Potential mentors: James Aylett

Applications

Project: Integrate Xapian into an application or framework

Many applications and frameworks would benefit from fast indexed search. If you're interested in adding Xapian integration to an application or framework, we're happy to consider that as a GSoC project (or as part of a GSoC project if the integration is straightforward). Similarly, improvements to existing integrations could form a good basis for a project (with the caveat that it might need pairing with some other work in order to create a project sufficiently large to make sense as a GSoC project).

One concrete application we'd like to see integration with is Trac - we use this for our bugtracker and wiki, but the built-in search is rather rudimentary. (A possible basis for this would be the TracAdvancedSearchPlugin.

Skills:

  • A good working knowledge of the programming language you'll be using
  • Familiarity with the application or framework you want to integrate with would be very useful

Difficulty: variable

Potential mentors: James Aylett

Projects Related to Testing

Project: Performance Test Suite

It would be very useful if we had a testsuite which tested the performance of Xapian, in terms of things like time taken, memory usage, and disk space usage. This would allow us to easily see what effect a proposed change has, which currently usually involves setting up an ad-hoc test.

The tests should use real-world data for both the documents being indexed and the queries being run - synthetic data can have very different behaviour to real-world data.

Resources:

  • Ticket #107 originally proposed creating this
  • xapian-core/tests/perftest/ contains some "performance tests", but they use randomly generated data, so the results may not reflect what users will see

Skills:

  • Good C++

Difficulty: medium

Potential mentors: David Bremner, Olly Betts

Project: Evaluation Suite

It would be helpful if we had an automated system for evaluating the various ranking algorithms and options against large-scale and real-world datasets.

There has been previous work done on this, and there are some notes in the Letor stabilisation project page. Improvements in this area could be done either as part of a Letor project, or as part of a project around automated testing, perhaps including the performance test suite idea noted above.

Skills:

  • Good C++
  • Understanding of information retrieval evaluation helpful but not essential

Difficulty: medium

Potential mentors: James Aylett

Project: Speed up the Test Suite

Xapian has an extensive suite of feature tests, unit tests, and regression tests. The test harness can run tests under valgrind to detect use of uninitialised values, memory leaks, and other problems. The downside is that code runs significantly slower under valgrind, so a full testsuite run can take quite a long time. And that means developers are tempted not to do a full run, or not to run it under valgrind, and as a result sometimes code is committed which causes tests to fail, which isn't good for anyone.

Currently the test harness only runs one testcase at a time. Most modern machines have multiple processors, so if the harness could run testcases in parallel this should allow the testsuite run time to be reduced significantly - if you have 4 CPUs and can keep them all busy, it should complete in close to 1/4 of the time.

Work to prepare for this was carried out in a 2019 GSoC project.

Our testcases often use one of a fairly small number of standard databases, which are reused if the testcase uses them read-only, but regenerated each time if the testcase modifies them. So the harness would need to take some care with scheduling testcases.

The test suite currently has instrumentation code in the test harness which records how long each testcase takes to run and can tell us which tests are taking the longest time. There may be some optimisations possible in those tests, either by breaking them into smaller tests that can be run in parallel, or by improving the implementations of those tests more directly.

Resources:

Skills:

  • Good C++

Difficulty: medium

Potential mentors: Olly Betts

Project: Idea Template

Note to mentors: We're inviting project ideas for working on other FOSS projects which build on Xapian, as well as on Xapian itself. This will allow us to offer a wider range of project ideas, and so have a broader appeal to students.

If you have an idea of a suitable scope, feel free to copy this template to a suitable place above and fill it in to match the other ideas here, then ping us on IRC or the mailing list to review it (or discuss first if you're unsure). You should be prepared to mentor the project, or nominate someone else who has said they are prepared to.

The aim of this project is to <brief summary>.

<Longer description>

Resources:

  • <Links to useful background information>

Skills:

  • <List of required skills>
  • <List of useful skills>

Difficulty: <easy/medium/hard>

Potential mentors: <list of people interested in mentoring this project>

Last modified 5 days ago Last modified on 05/12/19 08:42:36

Attachments (1)

Download all attachments as: .zip

Note: See TracWiki for help on using the wiki.