This is a draft list of project ideas for students taking part in [http://code.google.com/soc/ Google Summer of Code 2010].

Previous experience with Xapian certainly isn't required for these projects, but if you do have some, tell us about it in your application.

We recommend you use Linux or another UNIX-like system for development work, as we're better set up for development on such platforms. In particular we use them ourselves, so can more easily help with any setup issues you may encounter. If you want to run a virtualised Linux installation for development and have no existing preference, we suggest [http://www.ubuntu.com/ Ubuntu] as our developer documentation covers it well.

> Note to mentors: We're inviting project ideas for working on other FOSS
> projects which build on Xapian, as well as on Xapian itself. This will
> allow us to put together a wider range of project ideas, and
> so have a broader appeal to students.
>
> If you have an
> idea of a suitable scope, feel free to add it below in a format similar to
> the other ideas, and then ping us on IRC or the mailing list to review it
> (or discuss first if you're unsure). You should be prepared to mentor the
> project, or nominate someone else who will be.

= Project Ideas =

We also encourage you to suggest your own ideas for Xapian-related projects. And if you'd like to work with Xapian, but nothing below appeals and you don't have an idea of your own, feel free to talk to us to see if we can come up with something.

You can contact us via IRC on #xapian on irc.freenode.net ([http://webchat.freenode.net/?channels=%23xapian web interface]), or on the [http://xapian.org/lists xapian-devel mailing list].

[[TOC(inline)]]

== Weighting Schemes ==

The aim of this project is to add support for more weighting schemes to Xapian.

Xapian provides the ability to rank search results by [http://en.wikipedia.org/wiki/Relevance_%28information_retrieval%29 relevance].
The relevance is calculated using a mathematical formula, which can be specified by subclassing [http://trac.xapian.org/browser/trunk/xapian-core/include/xapian/weight.h Xapian::Weight]. Xapian currently includes built-in subclasses for [http://xapian.org/docs/bm25.html BM25] and the "traditional" probabilistic weighting formula.

To be supported, a weighting formula needs to be expressible as a sum of a weight from each matching term, optionally plus a per-document component. Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this; if additional statistics would be useful, you could add them as part of the project).

There are other weighting schemes which can be expressed in this way, and it would be useful to support them (some because they're potentially more effective than BM25, others because they're of interest for Information Retrieval students and academics).

Some examples of interesting weighting schemes are the [http://ir.dcs.gla.ac.uk/wiki/DivergenceFromRandomness Divergence from Randomness (DfR)] family of weighting schemes, and the traditional TF/IDF schemes, such as those supported by [http://people.csail.mit.edu/jrennie/ecoc-svm/smart.html SMART].

After implementing these schemes, it would be useful to compare their speed and retrieval effectiveness with BM25, to see if there's a better default scheme for Xapian to use. The parameter-free DfR schemes are particularly interesting as academic evaluations suggest they can outperform BM25 with tuned parameters, and "real world" users rarely seem to have the patience to tune parameters.
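
As a rough illustration of the constraints described above, here is a minimal Python sketch of a TF/IDF-style scheme written as a sum of per-term weights, each with an upper bound. It does not use Xapian's API: the function names and the statistics passed in (`wdf`, `wdf_upper_bound`, `doc_count`, `term_freq`) are assumptions chosen for the example, and the formula is just one simple TF/IDF variant.

```python
import math


def idf(doc_count, term_freq):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(1.0 + doc_count / term_freq)


def term_weight(wdf, doc_count, term_freq):
    """Weight contributed by one matching term in one document.

    wdf is the number of times the term occurs in the document
    (a hypothetical statistic name for this sketch).
    """
    if wdf <= 0:
        return 0.0
    return (1.0 + math.log(wdf)) * idf(doc_count, term_freq)


def max_term_weight(wdf_upper_bound, doc_count, term_freq):
    """Upper bound on term_weight() for any document.

    term_weight() never decreases as wdf grows, so evaluating it at an
    upper bound on wdf gives a valid maximum.
    """
    return term_weight(wdf_upper_bound, doc_count, term_freq)


def document_weight(query_terms, doc_wdfs, doc_count, term_freqs):
    """Total weight of a document: the sum over the query terms."""
    return sum(
        term_weight(doc_wdfs.get(term, 0), doc_count, term_freqs[term])
        for term in query_terms
    )


if __name__ == "__main__":
    term_freqs = {"xapian": 10, "search": 500}
    doc = {"xapian": 3, "search": 1}
    print(document_weight(["xapian", "search"], doc, 1000, term_freqs))
```

A matcher can use bounds like `max_term_weight()` to skip candidates which cannot reach the current top results, which is why a scheme without a usable bound is harder to support efficiently.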

Resources:
 * http://en.wikipedia.org/wiki/Relevance_%28information_retrieval%29
 * http://trac.xapian.org/browser/trunk/xapian-core/include/xapian/weight.h
 * http://xapian.org/docs/bm25.html
 * http://ir.dcs.gla.ac.uk/wiki/DivergenceFromRandomness
 * http://people.csail.mit.edu/jrennie/ecoc-svm/smart.html

Skills:
 * Basic (or better) knowledge of C++.
 * Knowledge of Information Retrieval would be useful.
 * Being comfortable rearranging algebraic formulae would be a bonus.

Difficulty: medium

== Dynamic Snippets ==

A "snippet" is the sample from a matching document which is displayed in the results with matching words highlighted.

Currently [http://xapian.org/docs/omega/ Omega] stores a static fixed-size sample from the start of the extracted text. It would be more useful for users to see a dynamically generated sample showing where the search terms match within each matching document (as most web search engines do). This shows the context in which the matching terms are used and helps the user to decide which documents are not worth further investigation without having to fetch and read them all.

The suggested approach is:
 * Add a xapian-core API feature to pull out the best ''N'' sentences or phrases from a string, based on which contain terms from a specified set. This allows API users to either store extracted text in the document data, or to re-extract it from the source documents in the result set.
 * Add an !OmegaScript command to wrap this, and add an option to omindex to allow larger samples to be stored in the sample field so that Omega can show a dynamic snippet.

Resources:
 * See ["FAQ/Snippets"] for further related information.

Skills:
 * You'll need to know C++.

Difficulty: medium

== Text-Extraction Libraries ==

This project would use libraries in preference to external programs to extract text from various file formats during indexing.

[http://xapian.org/docs/omega/overview.html Omega]'s omindex indexer currently has built-in support for HTML, plain text, and uncompressed !AbiWord documents. All other formats require an external filter program (or sometimes more than one) to be run for each file.

The functionality provided by some of these external programs is also available as shared libraries, and using these instead would avoid the overhead of running an external filter and so speed up indexing.

A number of modern file formats are based around the zip file format with XML contents (e.g. !OpenDocument format), so using a zip file reading library instead of the unzip program would be an obvious first target. There are also libraries for ''at least'' PDF ([http://poppler.freedesktop.org/ Poppler]), .wps ([http://libwps.sourceforge.net/ libwps]), .wpd ([http://libwpd.sourceforge.net/ libwpd]), and [http://djvu.sourceforge.net/ DjVu].

Currently Omega avoids a hard requirement on the filter programs it uses by automatically disabling those formats for which the filters aren't installed. We could do something similar for libraries by loading them dynamically (with dlopen() or similar) and, if they aren't available, disabling those formats which require them. But the additional complexity may not be justified. It would certainly be fine to just link directly against zlib, as xapian-core already requires that. Making a zip file reading library a hard requirement also seems reasonable, as it covers several popular formats.

Resources:
 * See in-line links above.

Skills:
 * C++
 * Some familiarity with Linux/UNIX system programming would be useful.

Difficulty: easy-medium

== Support Another Language ==

Xapian's core is written in C++ and provides a [http://xapian.org/docs/apidoc/html/annotated.html C++ API], but we have [http://xapian.org/docs/bindings/ bindings] which wrap this API to allow use from a number of other languages.
This project would add bindings which allow Xapian to be used from a language which isn't already supported, ideally using [http://www.swig.org/ SWIG] to reduce the work required to update the bindings for future changes to the [http://xapian.org/docs/apidoc/html/annotated.html C++ API].

Xapian currently has pretty decent support for Python, PHP5, Tcl, C#, Java, Perl, and Ruby. There has been some work on [http://article.gmane.org/gmane.comp.lang.pike.user/5487 Pike bindings], but these don't use SWIG and only wrap part of the C++ API. The "most wanted" language not already supported seems to be Lua, but you're welcome to pitch for other languages for which Xapian bindings would be useful.

Most of the work is likely to be customising the SWIG-generated bindings to produce a more natural API in your chosen language (for example, the semantics of C++ iterators aren't natural in scripting languages).

Resources:
 * http://www.swig.org/Doc1.3/
 * http://xapian.org/docs/bindings/

Skills:
 * You'll need good familiarity with the language you want to add support for.
 * Familiarity with C++.
 * Knowledge of SWIG useful.
 * Knowledge of your chosen language's C/C++ API would be a bonus.

Difficulty: medium-hard

== Improve Existing Bindings ==

This project would add more natural API wrapping in existing language bindings.

Some of the bindings would benefit from a more "idiomatic" wrapping in places. For example, wrapping the C++ iterators as PHP iterators, or eliminating SWIG's "SWIGTYPE" default wrappers in C#. This would allow us to provide an API which feels more natural to programmers of each language, making Xapian easier for them to use.

Resources:
 * http://www.swig.org/Doc1.3/
 * http://xapian.org/docs/bindings/

Skills:
 * You need to be familiar with the language (or languages) you want to work with (you'll need to understand what "looks right" in an API for those languages).
 * Familiarity with C++.
 * Knowledge of SWIG useful.
 * Knowledge of your chosen language's C/C++ API would be useful.

Difficulty: easy-medium

== SWIG-based Java Bindings ==

This project would reimplement Xapian's Java bindings using SWIG instead of hand-coded JNI.

Xapian's core is written in C++ and provides a [http://xapian.org/docs/apidoc/html/annotated.html C++ API], but we have [http://xapian.org/docs/bindings/ bindings] which wrap this API to allow use from a number of other languages.

Most of these bindings are built using [http://swig.org/ SWIG] which takes care of the mechanical work of wrapping new API features. However the Java bindings are currently implemented in hand-coded [http://en.wikipedia.org/wiki/Java_Native_Interface JNI] which requires some tedious and error-prone work to wrap new features. Consequently the JNI bindings don't wrap the whole of the current C++ API.

Some work has been done on generating Java bindings with SWIG, but it needs someone with a good understanding of Java to get it working, and to wrap things like C++ iterators in a more Java-like way.

Resources:
 * See the in-line links in the above description.

Skills:
 * You'll need a good understanding of Java (in particular, what "looks right" in a Java API).
 * Knowledge of SWIG and C++ would be useful.
 * Knowledge of JNI might be useful, but isn't required.

Difficulty: easy-medium

== Improve Spelling Correction ==

The existing spelling correction feature is useful for many users, but there is scope for improvement.
For example, currently:
 * Corrections aren't offered if the word is also misspelled in a document in the database, even if the correct spelling is far more common (ticket #225).
 * Only a single suggested correction is available via the API.
 * No regard is given to other words in the query when ranking possible corrections (an approach used by sycamore wiki: http://github.com/rtucker/sycamore/blob/master/Sycamore/search.py#L434).
 * Corrections which involve adding, removing, or transposing spaces aren't considered.
 * Spellings are only handled for unprefixed terms, but we ought to be able to offer spelling correction for fields. Some fields should be able to share spellings (e.g. title and document), others should have their own (e.g. author), and some may not want spelling correction at all.

There are no doubt other aspects too.

Skills:
 * Good knowledge of C++ required.

Difficulty: medium

== Matcher Optimisations ==

The matcher is the part of Xapian which does all the hard work when generating search results. It implements a number of optimisations to help make searching fast. There is potential to improve things further, at least in some cases.

When we have an idea for a new optimisation, we usually open a ticket to make sure we don't forget about it, and to track discussion and progress.
The currently open tickets of this kind are:
 * [ticket:215 Boolean OR could be optimised further]
 * [ticket:224 Supply and optimise more OP_VALUE_ comparison operators]
 * [ticket:378 Optimise MultiAndPostList using current weight of lhs]
 * [ticket:400 Optimise AND_MAYBE when the RHS has a maxweight of 0]

The idea for this project is to take several such optimisation ideas (either from the above, or ones you develop yourself) and, for each in turn, implement the optimisation and then do performance testing to check that it gives an improvement in at least some cases and doesn't make things slower in others (or if it does, that the gains outweigh the losses).

Resources:
 * http://xapian.org/docs/matcherdesign.html
 * http://trac.xapian.org/browser/trunk/xapian-core/matcher/

Skills:
 * Good C++
 * Up for a challenge!

Difficulty: hard

== Gmane Search Improvements ==

[http://gmane.org/ Gmane] is a public mailing list archive, with a total of over 94 million messages from over 12 thousand mailing lists. It has a [http://search.gmane.org Xapian-powered article search].

There are a number of enhancements which could be made to the current search:
 * Currently the group filter only allows a sub-hierarchy to be specified (users would like to be able to say ''cpan'' not ''gmane.comp.lang.perl.cpan.*'').
 * Allow restricting a search to the ''Subject:'' header.
 * Allow searching for parts of the email address.
 * Provide a search API (e.g. RSS feeds of results).
 * Some sort of grouping or collapsing by thread?
 * Currently the search index is updated once per day, but it would be useful to be able to search more recently arrived articles.
 * Make use of Xapian's spelling correction functionality ([http://plane.gmane.org/?query=xapain+spelll prototype implementation]).

Resources:
 * ''"[http://survex.com/~olly/how_search.gmane.org_works_lca2010/ How search.gmane.org works]"'' talk from LCA2010
 * [http://search.gmane.org/~xapian/ Source code]

Skills:
 * Good C++
 * Perl useful
 * Experience with web technologies useful
 * Familiarity with Linux/UNIX system programming would be useful

Difficulty: medium/hard

== CJK Support ==

This project idea is to improve support for indexing and searching Chinese, Japanese, and Korean text.

Xapian's !TermGenerator class converts text to a series of terms for indexing, and its !QueryParser class converts a user-entered query to a Query object tree for searching. At the moment, both of these mostly assume that terms are groups of letters delimited by whitespace, which doesn't suit languages such as Chinese and Japanese that are normally written without spaces between words.

Some work has been done on this already (see ticket:180), but performance is important (especially for indexing, when large amounts of text are usually being processed) and the prototype patch there will slow down indexing of non-CJK text unacceptably.

Resources:
 * [http://en.wikipedia.org/wiki/CJK Wikipedia's CJK article]
 * http://www.linfo.org/cjkv.html - some general background
 * [ticket:180 Ticket #180] has some discussion and a prototype patch

Skills:
 * C++
 * Familiarity with Chinese, Japanese, and/or Korean would be very useful
 * Knowledge of Unicode useful

Difficulty: medium