Project ideas

This page is intended to list possible projects for somebody who's interested in getting involved with Xapian development, but who isn't intimately familiar with our code yet (if you're a Xapian guru, you can of course tackle one of these, or you could think up your own projects or take a look at the bug database).

Some of these projects would benefit from special skills in other areas (for example, experience with a scripting language is required to produce decent bindings for it; being fluent in a particular human language would be very helpful for improving support for it).

We're happy to provide mentoring to anyone trying to get to grips with the Xapian codebase. You can contact us via IRC on #xapian on irc.freenode.net, or on the xapian-devel mailing list.

The list below is split into "bite size" projects, which are probably a good place to start for someone wanting to get familiar with the Xapian code, and other projects which are larger in scope. This split is inevitably a little subjective though.

GSoCProjectIdeas has a list of projects with a larger scope, aimed at students taking part in Google's Summer of Code.

Bite Size

Omega

Rework Omega templates to use more modern web techniques

For example:

  • Add classes to HTML elements to allow styling via CSS.

Resources:

Skills:

  • Existing familiarity with HTML and CSS would be very useful

Support indexing another file format in Omindex

Omindex can currently extract text and meta-data from a number of file formats for indexing, but it would be good to support more.

Resources:

Bindings

Wrap C++ iterators better for Ruby

Supporting block-driven iteration for Ruby would be nicer and more efficient than turning iterators into arrays of values.

Resources:

Tests

Implement "make installcheck"

make check is a standard target which runs the testsuite in a source tree, using the version of Xapian just built. Another standard target is make installcheck which provides a standard way to run tests against the library after it's installed (which is done using make install). There's now support in git master for make installcheck for omega and most of the language bindings, but for xapian-core make installcheck doesn't do anything.

It would be useful if make installcheck ran most of the same tests as make check but against an installed version of Xapian.

Improve Test Coverage

We can generate reports of how well our testsuite covers the code by using gcov and lcov (see the documentation in HACKING for how to do this). A regularly updated report is available here:

http://lcov.xapian.org/

Ideally we'd have good test coverage for the whole library. It would be useful to look through the test coverage report, see what causes the poor coverage in some places, and try to write new testcases which exercise that code. We've found a few bugs in the past by doing this, and also quite a bit of code which isn't actually ever used (whole unused functions in some places!).

Currently the 'bin' and 'examples' directories aren't really exercised by the testsuite. The exceptions are that there's a test to check that --help and --version work for all the programs, and the servers for remote backends and replication are exercised.

I/O Profiling

We have a script in xapian-maintainer-tools/profiling/strace-analyse which can process a log from the strace tool of the system calls made by a program using Xapian, to allow disk I/O to be profiled. However, recent versions of strace only work on Linux. We also have an LD_PRELOAD library which intercepts calls to C library functions; that approach can be made to work on most platforms, but is likely to need some extra work for each one.
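As a rough illustration of the kind of processing involved in analysing an strace log (this is not the strace-analyse script itself), the C++ sketch below tallies calls and bytes transferred for a few I/O system calls. It assumes each log line holds one complete call such as `pread64(3, "..."..., 8192, 0) = 8192`, with no PID or timestamp prefix. The names `IoTotals` and `tally_io`, and the choice of which system calls to count, are hypothetical.

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>

// Per-syscall totals (hypothetical structure for this sketch).
struct IoTotals {
    long calls = 0;
    long bytes = 0;
};

// Tally successful read/write style calls found in strace-style text.
// Assumes one complete call per line and no PID or timestamp prefix.
std::map<std::string, IoTotals> tally_io(const std::string& log) {
    std::map<std::string, IoTotals> totals;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        // The syscall name is everything before the first '('.
        std::string::size_type paren = line.find('(');
        // The return value follows the last " = " on the line.
        std::string::size_type eq = line.rfind(" = ");
        if (paren == std::string::npos || eq == std::string::npos ||
            eq < paren) {
            continue;  // Not a complete syscall line.
        }
        std::string name = line.substr(0, paren);
        if (name != "read" && name != "write" &&
            name != "pread64" && name != "pwrite64") {
            continue;  // Only count the calls which transfer data.
        }
        long result = std::strtol(line.c_str() + eq + 3, nullptr, 10);
        if (result < 0) {
            continue;  // Failed call (e.g. "= -1 EAGAIN ..."), no data moved.
        }
        IoTotals& t = totals[name];
        ++t.calls;
        t.bytes += result;
    }
    return totals;
}
```

Passing the contents of a log file to `tally_io()` returns a map keyed by system call name. A fuller tool would also need to cope with the prefixes and split lines which strace emits when following multiple processes.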

It should be possible to write a dtrace script to log the required information, and dtrace supports most popular platforms.

See ticket #390 for more background.

Documentation and examples

Generate nicer e-book

There's an e-book version of Getting Started with Xapian, but it lacks a cover, and the formatting seems quite basic. It would be good to address these aspects.

Get PDF building

Currently sphinx fails to build a PDF of the Getting Started with Xapian guide.

Translate an example to another programming language

Translate one of the code examples in Getting Started with Xapian to a language which is missing a version. Currently there's a complete set of examples for python2 and python3, almost complete for C++, and the start of Java and PHP.

Use function attributes more internally

These would allow the compiler to optimise a little more in some places.

See tickets #151 (for pure, const, and non-NULL pointer parameters) and #454 (for nothrow).
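To give a flavour of what such annotations look like, here is a minimal sketch using GCC-style attributes. The macro names (`SKETCH_PURE` and so on) and the two functions are made up for illustration and are not taken from the Xapian source; the macros expand to nothing on compilers which don't define `__GNUC__`.

```cpp
#include <cstddef>

// Hypothetical wrapper macros: no-ops where GCC-style attributes
// aren't available.
#if defined(__GNUC__)
# define SKETCH_PURE __attribute__((pure))
# define SKETCH_CONST __attribute__((const))
# define SKETCH_NONNULL __attribute__((nonnull))
# define SKETCH_NOTHROW __attribute__((nothrow))
#else
# define SKETCH_PURE
# define SKETCH_CONST
# define SKETCH_NONNULL
# define SKETCH_NOTHROW
#endif

// const: the result depends only on the argument value, and no memory
// is read, so repeated calls with the same argument can be merged.
SKETCH_CONST SKETCH_NOTHROW
unsigned bits_needed(unsigned value);

// pure: no side effects, but the function reads memory through its
// pointer arguments.  nonnull (with no argument list) declares that
// all pointer parameters must be non-NULL.
SKETCH_PURE SKETCH_NONNULL SKETCH_NOTHROW
std::size_t common_prefix_length(const char* a, const char* b);

// Number of bits needed to represent value (0 for value == 0).
unsigned bits_needed(unsigned value) {
    unsigned bits = 0;
    while (value) {
        ++bits;
        value >>= 1;
    }
    return bits;
}

// Length of the longest common prefix of two NUL-terminated strings.
std::size_t common_prefix_length(const char* a, const char* b) {
    std::size_t n = 0;
    while (a[n] != '\0' && a[n] == b[n]) {
        ++n;
    }
    return n;
}
```

Putting the attributes on the declarations (as would appear in a header) is what lets callers in other source files benefit from them.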

Use GCC attribute warn_unused

GCC supports a warn_unused attribute which can be applied to a class. Variables of a class marked this way are treated like variables of fundamental types for the purposes of unused variable warnings, so the compiler can warn when one is declared but never used:

http://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Attributes.html

This would be useful for some of our classes.
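A minimal sketch of how a class might be marked, assuming a GCC-compatible compiler. The class `TermPrefix`, the function `make_author_term` and the macro `SKETCH_WARN_UNUSED` are invented for this example and don't correspond to anything in Xapian.

```cpp
#include <string>

// Hypothetical wrapper macro: a no-op where GCC-style attributes
// aren't available.
#if defined(__GNUC__)
# define SKETCH_WARN_UNUSED __attribute__((warn_unused))
#else
# define SKETCH_WARN_UNUSED
#endif

// A value-like class with a non-trivial constructor.  Without the
// attribute the compiler can't tell whether constructing an object
// and never using it is intentional, so it stays silent.
class SKETCH_WARN_UNUSED TermPrefix {
    std::string prefix_;

  public:
    explicit TermPrefix(const std::string& prefix) : prefix_(prefix) {}

    // Return term with the prefix prepended.
    std::string apply(const std::string& term) const {
        return prefix_ + term;
    }
};

std::string make_author_term(const std::string& name) {
    TermPrefix author("A");
    // Uncommenting the next line declares a variable which is never
    // used; with the attribute in place, an unused variable warning
    // would be expected when such warnings are enabled (e.g. -Wall).
    // TermPrefix subject("S");
    return author.apply(name);
}
```

The attribute is only appropriate for classes whose constructors and destructors have no side effects which callers rely on; a lock guard, for instance, is deliberately declared and then never referenced.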

Add more stemming algorithms

Xapian supports stemming algorithms for many languages, and for some languages we support more than one. But there are more algorithms out there.

Currently there's a limitation that the stemmer returns a single stem for each word (see #465), but aside from that, stemmers written in snowball, C, or C++ can easily be added.

Some possibilities:

  • Paice/Husk (English) - Note that there's no explicit licence on Andy Stark's C implementation, so we can't use it (or code derived from it) in Xapian, unless you're able to contact him and get an explicit licence added.
  • Krovetz stemmer (English)
  • Czech

Check how well a human language is supported

If you are fluent in a language other than English, you could check how well Xapian works on that language, and report back (either to the mailing list or by opening a ticket). If you're able, see what you can do to improve support.

The simplest way to do this is probably just to index some text and try searching it. You can use Omega to do this - the omindex indexer can index text files and HTML, plus many other formats if you have additional tools installed.

There's an FAQ entry which shows how to look at the terms produced from a document and compare that with how Xapian parsed a query string, which is likely to be useful if you want to see how Xapian handled a particular document or query.

Fix a compiler warning

We aim to have a warning-free compilation, but new compiler versions sometimes introduce new warnings, and different compilers warn about different things. If you see any compiler warnings while building Xapian, try to work out what's causing each one and how to fix it. Sometimes the warning indicates an actual problem; sometimes there isn't a problem as such, but by having a warning-free build we can know that we aren't overlooking warnings which indicate actual problems.

Larger Projects

Add bindings for Vala

Vala is a language used for various GNOME apps. We've had a request for bindings for it, but it really needs somebody with previous experience of Vala to work on it.

Resources:

  • Ticket #535 is the feature request and has more details

Add Query object introspection

Add more complete support for introspecting on Query objects (see issue #159). This would allow queries generated by the query parser to be investigated and modified, and would be helpful in various scenarios. The trickiest bit is probably working out a suitable API - for this, you'd need to discuss possible APIs with the other developers on IRC or the mailing lists. It's probably a good task for a beginner, being fairly small and self-contained.

Use flat intermediate file format for multipass compaction

This would speed up xapian-compact --multipass, and reduce intermediate disk space requirements.

See ticket #444.

Improve Omega's testsuite

As ticket #513 says, we should have tests for Omega which:

  • index data with omindex and scriptindex, and inspect the resultant database to check it contains what we expect.
  • run omega in command-line mode and check the output is as expected (this may be the easiest way to verify the results of omindex and scriptindex).

Xapian backend for advanced trac search

There's an advanced trac search plugin (https://github.com/dnephin/TracAdvancedSearchPlugin) which is intended to allow multiple backends. It'd be nice to have a backend onto Xapian, and then set that up on our own trac (we're currently running the latest stable version of trac).

Helper subclasses to use Lua code within a C++ application

Various aspects of Xapian can be customised by user code (eg: MatchDeciders, PostingSources and so forth). If you are working in a language other than C++ (such as Python, Ruby, Lua etc.), we aim to allow you to write those subclasses in the language you're using, which is more convenient than writing C++, binding into your language and then passing through to Xapian. When working in C++, you have to write the subclasses in C++.

However, there would be value, particularly during experimentation and prototyping, in being able to write those subclasses in another language. This would involve having C++ subclasses, probably as an auxiliary library, that manage a suitable language runtime and pass control flow down into user code. Lua is suggested as a target language since it is fairly easy to embed from C/C++, reportedly performs reasonably well when using LuaJIT, and has good support in the Xapian Lua bindings for user subclasses.

In terms of API, you'd probably want to be able to construct say a LuaMatchDecider using the text of the Lua code itself (another possible variant would construct via a filename).

(James has a proof of concept of this for both MatchDecider and ExpandDecider.)

Last modified on 18/04/19 00:06:17