This page is intended to list possible projects for somebody who's interested in getting involved with Xapian development, but who isn't intimately familiar with our code yet (if you're a Xapian guru, you can of course tackle one of these, or you could think up your own projects or take a look at the bug database).
Some of these projects would benefit from special skills in other areas (for example, experience with a scripting language is required to produce decent bindings for it; being fluent in a particular human language would be very helpful for improving support for it).
We're happy to provide mentoring to anyone trying to get to grips with the Xapian codebase. You can contact us via IRC on #xapian on irc.freenode.net (web interface or link for IRC client), or on the xapian-devel mailing list.
The list below is split into "bite size" projects, which are probably a good place to start for someone wanting to get familiar with the Xapian code, and other projects which are larger in scope. This split is inevitably a little subjective.
GSoCProjectIdeas has a list of projects with a larger scope, aimed at students taking part in Google's Summer of Code.
Table of Contents
- Bite Size
- I/O Profiling
- Documentation and examples
- Use function attributes more internally
- Use GCC attribute warn_unused
- Add more stemming algorithms
- Check how well a human language is supported
- Fix a compiler warning
- Larger Projects
Rework Omega templates to use more modern web techniques
- Add classes to HTML elements to allow styling via CSS.
- Existing familiarity with HTML and CSS would be very useful
Support indexing another file format in Omindex
Omindex can currently extract text and meta-data from a number of file formats for indexing, but it would be good to support more.
- Currently supported formats
- Tips for adding support for new formats
- Document Liberation Project import filters
Implement "make installcheck"
make check is a standard target which runs the testsuite in a source tree, using the version of Xapian just built. Another standard target is make installcheck, which provides a standard way to run tests against the library after it's installed (which is done using make install). There's now support in git master for make installcheck for omega and most of the language bindings, but for xapian-core make installcheck doesn't do anything.
It would be useful if make installcheck ran most of the same tests as make check but against an installed version of Xapian.
Improve Test Coverage
We can generate reports of how well our testsuite covers the code by using gcov and lcov (see the documentation in HACKING for how to do this). A regularly updated report is available here:
Ideally we'd have good test coverage for the whole library. It would be useful to look through the test coverage report, see what causes the poor coverage in some places, and try to write new testcases which exercise that code. We've found a few bugs in the past by doing this, and also quite a bit of code which isn't actually ever used (whole unused functions in some places!).
Currently the 'bin' and 'examples' directories aren't really exercised by the testsuite. The exceptions are that there's a test to check that --version works for all the programs, and the servers for remote backends and replication are exercised.
I/O Profiling
We have a script in xapian-maintainer-tools/profiling/strace-analyse which can process a log from the strace tool of the system calls made by a program using Xapian, to allow disk I/O to be profiled. However, recent versions of strace only work on Linux. We also have an LD_PRELOAD library that allows intercepting calls to C library functions. That approach can be made to work on most platforms, but is likely to need some extra work for each one.
It should be possible to write a dtrace script to log the required information, and dtrace supports most popular platforms.
See ticket #390 for more background.
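To give a feel for the kind of processing involved, here is a minimal C++ sketch which totals the bytes transferred per system call from a strace-style log. It is not the strace-analyse script, and it makes assumptions: the log was produced without pid or timestamp prefixes, only read, write, pread64 and pwrite64 are counted, and the names IoTotals, summarise_io and usage_example are invented for this illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Totals for one system call name (hypothetical type for this sketch).
struct IoTotals {
    unsigned long calls = 0;
    unsigned long long bytes = 0;
};

// Read strace-style lines such as
//   pread64(3, "...", 8192, 0) = 8192
// and add up the successful transfers for each I/O system call.
std::map<std::string, IoTotals> summarise_io(std::istream& log) {
    std::map<std::string, IoTotals> totals;
    std::string line;
    while (std::getline(log, line)) {
        std::string::size_type paren = line.find('(');
        // Use the last " = " so that quoted data containing " = " is ignored.
        std::string::size_type equals = line.rfind(" = ");
        if (paren == std::string::npos || equals == std::string::npos ||
            equals < paren) {
            continue;
        }
        std::string name = line.substr(0, paren);
        if (name != "read" && name != "write" &&
            name != "pread64" && name != "pwrite64") {
            continue;
        }
        const char* start = line.c_str() + equals + 3;
        char* end = nullptr;
        long long result = std::strtoll(start, &end, 10);
        // Skip failed calls ("= -1 EAGAIN ...") and unparsable results.
        if (end == start || result < 0) continue;
        IoTotals& entry = totals[name];
        ++entry.calls;
        entry.bytes += static_cast<unsigned long long>(result);
    }
    return totals;
}

// Small usage check with a made-up log.
inline void usage_example() {
    std::istringstream log(
        "openat(AT_FDCWD, \"example.db\", O_RDONLY) = 3\n"
        "pread64(3, \"...\", 8192, 0) = 8192\n"
        "pread64(3, \"...\", 8192, 8192) = 8192\n"
        "read(4, 0x7ffd, 4096) = -1 EAGAIN (Resource temporarily unavailable)\n");
    std::map<std::string, IoTotals> totals = summarise_io(log);
    assert(totals["pread64"].calls == 2);
    assert(totals["pread64"].bytes == 16384);
    assert(totals.count("read") == 0);
}
```

A dtrace or LD_PRELOAD based logger could emit lines in a similar shape so that the same kind of summary code can be reused across platforms.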
Documentation and examples
Generate nicer e-book
There's an e-book version of Getting Started with Xapian, but it lacks a cover, and the formatting seems quite basic. It would be good to address these aspects.
Get PDF building
Currently sphinx fails to build a PDF of the Getting Started with Xapian guide.
Translate an example to another programming language
Translate one of the code examples in Getting Started with Xapian to a language which is missing a version. Currently there's a complete set of examples for python2 and python3, almost complete for C++, and the start of Java and PHP.
Use function attributes more internally
These would allow the compiler to optimise a little more in some places.
See ticket #151 (for pure, const, and non-NULL pointer parameters). There's also noexcept for functions which never throw an exception.
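As a rough illustration of what such annotations look like, here is a small C++ sketch using the GCC/Clang attribute syntax. The EXAMPLE_* macros and both functions are made up for this example and are not taken from the Xapian source. The macros expand to nothing on compilers which don't define __GNUC__.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper macros so other compilers simply ignore the attributes.
#if defined(__GNUC__)
# define EXAMPLE_PURE __attribute__((pure))
# define EXAMPLE_CONST __attribute__((const))
# define EXAMPLE_NONNULL __attribute__((nonnull))
#else
# define EXAMPLE_PURE
# define EXAMPLE_CONST
# define EXAMPLE_NONNULL
#endif

// const: the result depends only on the argument value; no memory is read.
EXAMPLE_CONST int clamp_percent(int value) noexcept;

// pure: reads memory through the pointer but has no side effects.
// nonnull: callers must never pass a null pointer.
EXAMPLE_PURE EXAMPLE_NONNULL std::size_t count_spaces(const char* text) noexcept;

int clamp_percent(int value) noexcept {
    return value < 0 ? 0 : (value > 100 ? 100 : value);
}

std::size_t count_spaces(const char* text) noexcept {
    std::size_t count = 0;
    for (; *text != '\0'; ++text) {
        if (*text == ' ') ++count;
    }
    return count;
}

// Small usage check for the two functions above.
inline void usage_example() {
    assert(clamp_percent(150) == 100);
    assert(count_spaces("a b c") == 2);
}
```

With these annotations the compiler is allowed to, for example, merge repeated calls with the same arguments, and it can warn when a literal null pointer is passed to count_spaces.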
Use GCC attribute warn_unused
GCC supports a warn_unused attribute, which allows classes to be marked to be treated like fundamental types, so that warnings about unused variables of that class can be issued. This would be useful for some of our classes.
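A minimal sketch of how the attribute is applied follows. TermLabel, describe and the EXAMPLE_WARN_UNUSED macro are hypothetical names for this illustration, not existing Xapian code. Because the class has a non-trivial constructor, the compiler would normally stay quiet about an instance which is never used.

```cpp
#include <cassert>
#include <string>

// Hypothetical wrapper macro; expands to nothing on compilers without __GNUC__.
#if defined(__GNUC__)
# define EXAMPLE_WARN_UNUSED __attribute__((warn_unused))
#else
# define EXAMPLE_WARN_UNUSED
#endif

// The attribute asks the compiler to treat unused variables of this class
// like unused variables of a fundamental type.
class EXAMPLE_WARN_UNUSED TermLabel {
    std::string text;

  public:
    explicit TermLabel(const std::string& text_) : text(text_) {
        assert(!text.empty());
    }

    const std::string& get() const { return text; }
};

std::string describe(const std::string& word) {
    // Uncommenting the next line leaves a variable which is never used, which
    // -Wunused-variable can then report thanks to the attribute:
    // TermLabel forgotten("never read");
    TermLabel label(word);
    return "term: " + label.get();
}
```

The attribute only makes sense for classes whose constructor and destructor have no side effects worth keeping, so each candidate class would need checking individually.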
Add more stemming algorithms
Xapian supports stemming algorithms for many languages, and for some languages we support more than one. But there are more algorithms out there.
- Paice/Husk (English) - Note that there's no explicit licence on Andy Stark's C implementation, so we can't use it (or code derived from it) in Xapian, unless you're able to contact him and get an explicit licence added.
- Krovetz stemmer (English)
Check how well a human language is supported
If you are fluent in a language other than English, you could check how well Xapian works on that language, and report back (either to the mailing list or by opening a ticket). If you're able, see what you can do to improve support.
The simplest way to do this is probably just to index some text and try searching it. You can use Omega to do this - the omindex indexer can index text files and HTML, plus many other formats if you have additional tools installed.
There's an FAQ entry which shows how to look at the terms produced from a document and compare that with how Xapian parsed a query string, which is likely to be useful if you want to see how Xapian handled a particular document or query.
Fix a compiler warning
We aim to have a warning-free compilation, but new compiler versions sometimes introduce new warnings, and different compilers warn about different things. If you see any compiler warnings while building Xapian, try to work out what's causing each one and how to fix it. Sometimes the warning indicates an actual problem; sometimes there isn't a problem as such, but by having a warning-free build we can know that we aren't overlooking warnings which indicate actual problems.
Add Query object introspection
Add more complete support for introspecting on Query objects (see issue #159). This would allow queries generated by the query parser to be investigated and modified, and would be helpful in various scenarios. The trickiest bit is probably working out a suitable API - for this, you'd need to discuss possible APIs with the other developers on IRC or the mailing lists. It's probably a good task for a beginner, being fairly small and self-contained.
Use flat intermediate file format for multipass compaction
This would speed up xapian-compact --multipass, and reduce intermediate disk space requirements.
See ticket #444.
Improve Omega's testsuite
As ticket #513 says, we should have tests for Omega which:
- index data with omindex and scriptindex, and inspect the resultant database to check it contains what we expect.
- run omega in command-line mode and check the output is as expected (this may be the easiest way to verify the results of omindex and scriptindex).
Xapian backend for advanced trac search
There's an advanced trac search plugin (https://github.com/dnephin/TracAdvancedSearchPlugin) which is intended to allow multiple backends. It'd be nice to have a backend onto Xapian, and then set that up on our own trac (we're currently running the latest stable version of trac).
Helper subclasses to use Lua code within a C++ application
Various aspects of Xapian can be customised by user code (e.g. MatchDeciders, PostingSources and so forth). If you are working in a language other than C++ (such as Python, Ruby, Lua etc.), we aim to allow you to write those subclasses in the language you're using, which is more convenient than writing C++, binding into your language and then passing through to Xapian. When working in C++, you have to write the subclasses in C++.
However there would be value, particularly during experimentation and prototyping, in being able to write those subclasses in another language. This would involve having C++ subclasses, probably as an auxiliary library, that manage a suitable language runtime and pass control flow down into user code. Lua is suggested as a target language since it is fairly easy to embed from C/C++, reportedly performs reasonably well when using the Lua JIT, and has good support in the Xapian Lua bindings for user subclasses.
In terms of API, you'd probably want to be able to construct say a LuaMatchDecider using the text of the Lua code itself (another possible variant would construct via a filename).
This was prototyped by James in early 2012, although he now can't find the code. His memory is that it wasn't an enormous amount of work to get functioning, although of course getting a polished solution is another matter.