wiki:GSoCProjectIdeas

GSoC 2017 Project Ideas List

Xapian is taking part in Google's Summer of Code this year. See the list of projects that students will be working on this year.

You can contact us via IRC on #xapian on irc.freenode.net (web interface or link for IRC client), or on the xapian-devel mailing list. Unlike some of the larger projects, we will take the time to talk to you personally. Communication is an important part of open source development and of GSoC - if we haven't talked to you then your application is unlikely to be seriously considered.

Each idea below lists people who have expressed an interest in mentoring it.

Please don't contact individual potential mentors privately - use the IRC channel or the mailing list instead. One reason is that you'll likely get a faster response as other people may be able to help with many questions. Also interacting with the community is an important part of open source development, and we'll be looking for evidence of positive interactions when reviewing applications, but we can't easily take account of private interactions when doing this. The mentor names against each project are only an indication of who might want to mentor it - other people who haven't signed up yet may be interested too.

Project Ideas List

Note that these are ideas - some are more fully formed than others, but don't be afraid to take them and extend or adapt them in your proposal to produce something you're more interesting in working on. If you think you'd like to work with Xapian, but nothing below really appeals and you don't have an idea of your own, feel free to talk to us, and we'll see if we can come up with something suitable.

Click on the idea title in the list below to see more details. Each idea should have an explanation of what it's about, some references for further reading, a list of useful skills to already have, an estimated difficulty, and a list of people who have so far indicated they might be willing to mentor such projects.

Once you've have some idea what you might want to work on, follow the steps in our guide for GSoC students.

Looking at existing code should help you to get a grip on what's involved for most of these ideas - we suggest you check out the latest code from our git repository and start from there. If you can't find the right area, just come and ask.

Project: <Your Idea Here>

We invite students to suggest their own ideas for projects. Such projects need to benefit Xapian and/or the Xapian community, and be suitable in scope to be occupy you for 3 months full-time. GSoC rules require the projects to be primarily coding, but you are expected to document and test your code (ideally with automated testcases), so allow for that in your planning.

We strongly recommend talking to us about your idea before submitting your proposal so we can help you to polish it into something both you and we love.

Potential mentors: Don't worry about this, if we love your idea, we'll find a mentor for it

Library Internals

Project: Matcher Optimisations

The matcher is the part of Xapian which does all the hard work when generating search results. It implements a number of optimisations to help make searching fast. There is potential to improve things further, at least in some cases.

When we have an idea for a new optimisation, we usually open a ticket to make sure we don't forget about it, and to provide a single location for discussion and to track progress. Here are the currently open such tickets:

The idea for this project is to take several such optimisation ideas (either from the above, or ones you develop yourself), and for each in turn to implement the optimisation, and do performance testing to check that they give a performance improvement in at least some case, and to check that they don't make things slower in other cases (or if they do, then the gains outweigh the losses).

Resources:

Skills:

  • Good C++
  • Up for a challenge!

Difficulty: hard

Potential mentors: Olly Betts

Ranking

Project: Weighting Schemes

The aim of this project is to add support for more weighting schemes to Xapian.

Xapian provides the ability to rank search results by relevance. The relevance is calculated using a mathematical formula, which can be specified by sub-classing Xapian::Weight. Xapian currently includes built-in subclasses for BM25, the "traditional" probabilistic weighting formulae] (which is essentially a special case of BM25), BM25+, several TF/IDF schemes, many of the Divergence from Randomness (DfR) family of models, coordinate matching, and Unigram Language Modelling. Bi-gram Language Modelling was also implemented during GSoC 2012 (but hasn't yet been merged).

To be supported, a weighting formula needs to be expressible as a sum of a weight from each matching term, optionally plus a per-document component. Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this - if additional statistics would be useful, you could add them as part of the project).

There are other weighting schemes which can be expressed in this way, and it would be useful to support them (some because they're potentially more effective than BM25, others because they're of interest for Information Retrieval students and academics).

Xapian currently supports some of TF/IDF schemes described by SMART, but only a subset of the documented normalisations are implemented. More could be supported if we tracked some extra statistics and made them available to weighting schemes.

Currently the number of unique terms in each document is available as a statistic for weighting schemes, but the value may be a slight overestimate. It would be better to store a chunked stream of unique terms values like how we store the document lengths as then we could provide the exact statistic, and it would also be more efficient.

It would also be very useful to see how the different schemes compare for speed and retrieval effectiveness, so can offer solid advice to users wondering which to use. It would also be good to know if there's a better default scheme than BM25. The parameter-free DfR schemes are particularly interesting as academic evaluations suggest they can outperform BM25 with tuned parameters, and "real world" users rarely seem to have the patience to tune parameters. However, it looks like the bounds on the weights aren't as tight as BM25, so Xapian is likely to have less scope to optimise the match process.

Resources:

Skills:

  • Basic (or better) knowledge of C++.
  • Knowledge of Information Retrieval would be useful.
  • Being comfortable rearranging algebraic formulae would be a bonus.

Difficulty: medium

Potential mentors: Olly Betts, Parth Gupta, Gaurav Arora

Project: Learning to Rank Stabilisation

Learning to Rank (Letor) is the application of Machine Learning (ML) to Information Retrieval (IR), in particular to the problem of ranking. Each document is represented by a vector of features. These features try to distinguish the levels of relevancy between documents (in the simple binary case between relevant and non-relevant documents). In the academic literature, Learning-to-Rank has been shown to perform better than unsupervised ranking models like TF-IDF or BM-25, especially in document retrieval and web-page retrieval.

Xapian has an experimental Letor API, with work split across a few previous GSoC projects, only some of which has been merged. The overall goal is to consolidate the work done so far to get to a stable, tested core of functionality that can be included in a future Xapian release. There is considerably more detail available on the project's individual page.

One of our GSoC 2016 projects worked on stabilising the code base and merging the useful-but-unmerged parts from previous projects, but there's still more to do. To prepare, you should look at the notes from last year's project, compare them to the overall plan on the project page, and also start thinking about automated tests you could write. There's information on Xapian's automated tests, and you'll probably want to look at the tests in xapian-core to get an idea of how to tackle things.

We can help you with all the required knowledge of IR and ML.

Skills:

  • Solid C++
  • High level of attention to detail
  • Knowledge of Information Retrieval (IR) and Machine Learning (ML) useful but not essential

Difficulty: medium

Potential mentors: Ayush Tomar, Parth Gupta, James Aylett

Project: Learning to Rank Click Data Mining

Learning to Rank (Letor) is the application of Machine Learning (ML) to Information Retrieval (IR), in particular to the problem of ranking.

Xapian has an experimental Letor API, but a user wanting to deploy this ideally needs relevance judgments to train the machine learning (a pretrained model from a different document corpus can be used, but generally training with data from the same dataset will give better results).

Getting people to explicitly generate relevance judgments is likely to be expensive and time-consuming, but end users clicking on search results are implicitly making relevance judgments. A module which took such click data and mined it to produce relevance judgments for training would complement the existing Letor module.

There are a number of papers in the academic IR literature exploring this idea, and we'd recommend using one of these as a theoretical basis for your project.

The programming language for implementing is largely up to you, provided it has Xapian bindings, but we'd expect you to justify your choice. You'll need to think about:

  • where you get your click data from (for instance, can it be extracted from webserver logs? what information do you need to be recorded?)
  • what format the click data should be in (we want this documented, so others can produce similar data easily)
  • how to process that data into relevance judgements, to make training data
  • what a sensible workflow is, for people who want this to be run automatically on a regular basis to update their Letor model

Resources:

Skills:

  • Proficient in the programming language you intend to use
  • Knowledge of Information Retrieval (IR) and Machine Learning (ML) useful but not essential

Difficulty: easy-medium

Potential mentors: Parth Gupta, James Aylett, Olly Betts

Omega

Project: Text-Extraction Libraries

This project would use libraries in preference to external programs to extract text from various file formats during indexing.

Omega's omindex indexer currently has built-in support for HTML, plain text, and uncompressed AbiWord documents. All other formats require an external filter program (or sometimes more than one) to be run for each file.

The functionality provided by some of these external programs is also available as shared libraries, and using these instead would avoid the overhead of running an external filter and so speed up indexing.

A number of modern file formats are based around the zip file format with XML contents (e.g. OpenDocument format) so using a zip file reading library instead of the unzip program would be an obvious first target.

There are also libraries for at least PDF (Poppler), .doc (wvware), .wps (libwps), .wpd (libwpd), and DjVu.

Currently Omega avoids a hard requirement on the filter programs it uses by automatically disabling those formats for which the filters aren't installed. We could do something similar for libraries by loading them dynamically (with dlopen() or similar) and if they aren't available disable those formats which require them. But the additional complexity may not be justified.

It would certainly be fine to just link directly against zlib, as xapian-core already requires that. Also, hard requiring a zip file reading library seems reasonable as it covers several popular formats.

Another issue is that we want to isolate omindex from bugs in the extraction library - if the library crashes we want omindex to continue, and if the library ends up in a memory or CPU eating infinite loop, we don't want to loop forever.

Resources:

  • See in-line links above.
  • Olly worked on a patch which adds support for using libwv2 (from wvware) to index .doc files, isolated in a subprocess to avoid bugs in the library from crashing omindex. No resource limits currently, and libwv2 seems rather buggy and no longer maintained, but the mechanism around it seems to work well.

Skills:

  • Good C++
  • Familiarity with Linux/UNIX system programming would be useful.

Difficulty: medium

Potential mentors: James Aylett, Gaurav Arora

Programming Language Support

Project: Implement Bindings for Go

The aim of this project is to allow Xapian to be used from the Go language.

Xapian's core is written in C++ and provides a C++ API, which we'd like people to be able to use from Go. We'd prefer the bindings to be generated using SWIG to reduce the work required to update the bindings as the C++ API evolves.

Marius Tibeica has done some initial work on a proof of concept (see this repo too).

Most of the work is likely to be customising the SWIG-generated bindings to produce a more natural Go API.

The Go bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages.

Resources:

Skills:

  • Experience with Go
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Previous experience with calling from Go to C would be a bonus
  • Familiarity with reStructured text markup a bonus

Difficulty: medium

Potential mentors: Olly Betts

Project: Support Another Language using SWIG

The aim of this project is to allow Xapian to be used from a programming language which isn't already supported.

Xapian's core is written in C++ and provides a C++ API, but we have bindings which wrap this API to allow use from a number of other languages.

Where possible, we prefer bindings to be generated using SWIG to reduce the work required to update the bindings as the C++ API evolves.

Xapian currently has pretty decent support for Python, PHP, Tcl, C#, Java, Lua, Perl, Ruby, Erlang, R, and Node.js. There has been some work on Pike bindings but not using SWIG and only wrapping part of the C++ API. Work has also been done on bindings for Haskell and OCaml.

Most of the work is likely to be customising the SWIG-generated bindings to produce a more natural API in your chosen language (for example, the semantics of C++ iterators aren't natural in scripting languages).

The new bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages.

Resources:

Skills:

  • You'll need good familiarity with the language you want to add support for
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Knowledge of your chosen language's C/C++ API would be a bonus
  • Familiarity with reStructured text markup a bonus

Difficulty: medium

Potential mentors: Olly Betts, James Aylett

Project: Python Bindings Improvements

The aim of this project is to improve Xapian's Python bindings in various ways. Some ideas:

  • Support for PyPy
  • Improve the bindings speed and memory use. For example, SWIG's Python backend has a -builtin option "for better performance", which we should try out and see how it compares for speed and memory use, and if it causes any problems for code using the bindings.
  • Python doccomments only show information for one overloaded form
  • Put together a PyPI package of python bindings (optionally including xapian-core) to make it simpler to install the python bindings into a virtualenv.

Resources:

Skills:

  • Python
  • C++
  • Knowledge of Python's C API would be useful

Difficulty: medium

Potential mentors: Assem Chelli, James Aylett

Project: PHP Bindings Improvements

The aim of this project is to improve Xapian's PHP bindings.

Note that this project requires both C++ and PHP skills.

There are three parts to this project:

The first would involve reimplementing SWIG's PHP backend to generate objects using the PHP/Zend C API. Currently SWIG generates a PHP module which wraps the C++ API as a set of flat functions, and also generates a PHP script containing classes with methods which call these flat functions. Using PHP's Zend C API to create objects would be simpler and more efficient (it wasn't done originally as this API lacked documentation at the time).

The second part of this project would make the API of the PHP bindings more natural for PHP programmers to use, for example:

  • Currently C++ methods taking a floating point value can't be passed an integer value, and you have to work around this by writing (double)$val instead of $val if $val is (or might be) an integer value. This is surprising behaviour for a PHP API.
  • simplify how we avoid garbage collecting objects which are still being used in the C++ library.
  • PHP now supports namespaces, and it would be good to put the bindings in a namespace (currently we prefix all classnames with "Xapian").

The final part would be updating Getting Started with Xapian for the changes, and adding PHP versions of the example code which hasn't yet been translated to PHP.

Currently Xapian supports PHP5 and PHP7, but this work should probably just target PHP7, and PHP5 support stays as is since PHP5 itself is now in maintenance mode.

Resources:

Skills:

  • You need to be familiar with PHP (you'll need to understand what "looks right" in a PHP API)
  • C++ experience
  • Previous experience with SWIG useful, but not essential
  • Knowledge of PHP's Zend C API also useful, but also not essential
  • Familiarity with reStructured text markup a bonus

Difficulty: medium-hard

Potential mentors: Olly Betts, James Aylett

Project: Support for HHVM

The aim of this project is to allow Xapian to be used from HHVM.

HHVM is a virtual machine for which PHP programs can be compiled. It doesn't support the same extension API as PHP, but it does provide the HHVM-Native Interface (HNI) which would allow writing bindings for Xapian.

A project in GSoC 2016 extended SWIG to support generating bindings for HHVM. This work is currently on a branch, mainly because the manual still needs updating. Most of the SWIG testsuite passes, but there are some failures.

The next stage is to use this new support to generate HHVM bindings for Xapian. This is likely to also require dealing with bugs in SWIG's HHVM support, and it would be good to get the SWIG testsuite fully passing as part of this project.

Resources:

Skills:

  • Good C/C++
  • PHP
  • Previous experience with HHVM would be useful

Difficulty: hard

Potential mentors: Olly Betts, James Aylett

Project: C# Bindings Improvements

This project would add more natural API wrapping in the C# bindings.

This project is really too small to make up a GSoC project by itself - look at combining it with another idea.

The C# bindings would benefit from a more "idiomatic" wrapping in places. For example:

This would allow us to provide an API which is felt more natural to C# programmers, making Xapian easier for them to use.

The bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages, but currently it doesn't cover C#.

If you use Debian or Ubuntu, another useful thing to work on would be expanding the xapian-bindings package to include a binary package for the C# bindings.

Resources:

Skills:

  • You need to be familiar with C# (you'll need to understand what "looks right" in a C# API)
  • Familiarity with C++
  • Knowledge of SWIG useful
  • Knowledge of C#'s C API would be useful
  • Familiarity with reStructured text markup a bonus

Difficulty: easy-medium

Potential mentors: James Aylett

Project: Java Bindings Improvements

This project would improve Xapian's new SWIG-based Java bindings.

Xapian's core is written in C++ and provides a C++ API, but we have bindings which wrap this API to allow use from a number of other languages. Most of these bindings are built using SWIG which takes care of the mechanical work of wrapping new API features.

In Xapian 1.2, the Java bindings were implemented in hand-coded JNI which requires some tedious and error-prone work to wrap new features. Consequently the JNI bindings don't wrap the whole of the current C++ API.

On trunk and in 1.4, we have Java bindings with are generated by SWIG, but a few things don't work, and they aren't as compatible with the Java bindings in 1.2 as we'd like. This really needs someone with a good understanding of Java to get them fully working, and to wrap things like C++ iterators in a more Java-like way.

The bindings should be documented in Getting Started with Xapian - the source code for this book is set up to allow versions for different languages, but it doesn't fully cover Java. This will mostly involve implementing the remaining examples in Java (the initial work for this was done as a warm-up for GSoC in 2017.

If you use Debian or Ubuntu, another useful thing to work on would be expanding the xapian-bindings package to include a binary package for the Java bindings.

This project is at the easier end of the scale, so if you want a more challenging project, you could combine it with working on the C# bindings too. SWIG's support for Java and C# has a lot in common, and the two languages have a number of similarities, so it's a fairly natural combination.

Resources:

Skills:

  • You'll need a good understanding of Java (in particular, what "looks right" in a Java API)
  • Knowledge of SWIG and C++ would be useful
  • Knowledge of JNI might be useful, but isn't required
  • Familiarity with reStructured text markup a bonus

Difficulty: easy

Potential mentors: James Aylett

Applications

Project: Integrate Xapian into an application or framework

Many applications and frameworks would benefit from fast indexed search. If you're interested in adding Xapian integration to an application or framework, we're happy to consider that as a GSoC project (or as part of a GSoC project if the integration is straightforward). Similarly, improvements to existing integrations could form a good basis for a project (with the caveat that it might need pairing with some other work in order to create a project sufficiently large to make sense as a GSoC project).

One concrete application we'd like to see integration with is Trac - we use this for our bugtracker and wiki, but the built-in search is rather rudimentary.

Resources:

Skills:

  • A good working knowledge of the programming language you'll be using
  • Familiarity with the application or framework you want to integrate with would be very useful

Difficulty: variable

Potential mentors: Assem Chelli, James Aylett

Projects Related to Testing

Project: Performance Test Suite

It would be very useful if we had a testsuite which tested the performance of Xapian, in terms of things like time taken, memory usage, and disk space usage. This would allow us to easily see what effect a proposed change has, which currently usually involves setting up an ad-hoc test.

The tests should really use real-world data for both the documents being indexed and the queries being run.

Resources:

  • Ticket #107 originally proposed creating this
  • xapian-core/tests/perftest/ contains some "performance tests", but they use randomly generated data, so the results may not reflect what users will see

Skills:

  • Good C++

Difficulty: medium

Potential mentors: Olly Betts, James Aylett

Project: Speed up the Test Suite

Xapian has an extensive suite of feature tests, unit tests, and regression tests. The test harness can run tests under valgrind to detect use of uninitialised values, memory leaks, and other problems. The downside is that code runs significantly slower under valgrind, so a full testsuite run can take quite a long time. And that means developers are tempted not to do a full run, or not to run it under valgrind, and as a result sometimes code is committed which causes tests to fail, which isn't good for anyone.

Currently the test harness only runs one testcase at a time. Most modern machines have multiple processors, so if the harness could run testcases in parallel this should allow the testsuite run time to be reduced significantly - if you have 4 CPUs and can keep them all busy, it should complete in close to 1/4 of the time.

Our testcases often use one of a fairly small number of standard databases, which are reused if the testcase uses them read-only, but regenerated each time if the testcase modifies them. So the harness would need to take some care with scheduling testcases.

A second part to this project could be adding some instrumentation code to the test harness which records how long each testcase takes to run and can tell us which tests are taking a lot of time. It may well be that there are a few testcases which take a significant proportion of the time, and if so, it's likely that they don't need to be so slow.

Resources:

Skills:

  • Good C++

Difficulty: medium

Potential mentors: Olly Betts

Project: Idea Template

Note to mentors: We're inviting project ideas for working on other FOSS projects which build on Xapian, as well as on Xapian itself. This will allow us to offer a wider range of project ideas, and so have a broader appeal to students.

If you have an idea of a suitable scope, feel free to copy this template to a suitable place above and fill it in to match the other ideas here, then ping us on IRC or the mailing list to review it (or discuss first if you're unsure). You should be prepared to mentor the project, or nominate someone else who has said they are prepared to.

The aim of this project is to <brief summary>.

<Longer description>

Resources:

  • <Links to useful background information>

Skills:

  • <List of required skills>
  • <List of useful skills>

Difficulty: <easy/medium/hard>

Potential mentors: <list of people interested in mentoring this project>

Last modified 5 months ago Last modified on 05/05/17 21:22:00

Attachments (1)

Download all attachments as: .zip