Project Details
The main goal of this project is to make Xapian accessible to the R community. An R package, xapr (https://github.com/stewid/xapr), already provides an interface to the basic searching and indexing features of the Xapian search engine. However, it cannot serve as a foundation for this project because it is written against R's C API, which was the available option for this kind of task before Rcpp was introduced. Using C to interface with R is not recommended for the following reasons:
- It limits the extensibility of a system. [Quote from "Software for Data Analysis: Programming with R", Springer 2008: “including additional C code is a dangerous step with some added dangers and often a substantial amount of programming and debugging required. You should have a good reason.”]
- It increases the likelihood of bugs.
- The programmer must take care of many details by hand, which makes coding tedious.
Using C++ with Rcpp is the recommended approach, since it provides protection against many 'historical idiosyncrasies of the R API, takes care of memory management and provides many useful helper methods'. [Quote by Hadley Wickham, chief scientist at RStudio and an adjunct assistant professor of statistics at Rice University]
For the development of the R bindings, I will use the Rcpp package, which facilitates seamless integration of R and C++ by providing R functions as well as C++ classes. Rcpp provides a straightforward way of passing data between R and C++ and is arguably the best existing package for developing R bindings. Above all, the Rcpp::export attribute allows functions defined in standalone C++ files to be sourced into R, which greatly simplifies R/C++ integration.
Project Deliverable
A well-documented R package that interfaces with the Xapian search engine library. [In addition to the reference manual, a complete set of examples in R will be produced as part of the deliverable.]
Goals
- Providing an interface that gives convenient access to the advanced indexing and retrieval operations offered by the Xapian search engine library: Two main functions will be exposed to users: xapian_search() and xapian_index(). All complexity will be handled internally within the package, so users can interact with Xapian simply by calling the required function with appropriate arguments.
- Minimizing the inefficiencies inherent in R code by using C++: User inputs to R functions will follow a pre-defined format, and the R components extracted from them will be explicitly converted to their C++ equivalents using the as<> functionality provided by Rcpp. As far as possible, all computationally intensive tasks will be done in C++ code.
API design
xapian_index() draft I https://docs.google.com/document/d/1RWHgeju_BycgtMtcuvT7uz5PeuxCwZV7ApZ4YbOcg7Y/edit?usp=sharing
xapian_index() draft II https://docs.google.com/document/d/1X0WI5RjAEVpwJw9Ldga22sTTVfAexWQ3nFPZskDhmvE/edit?usp=sharing
xapian_search() https://docs.google.com/document/d/1bLSPVQq9MjEVbPjqfZa2Vr7bfrRFo-T3hVXlp5yjAgM/edit?usp=sharing
Project Timeline
22 April – 22 May
- Reading Xapian documentation
- Understanding existing bindings
- Modifying the originally proposed design of xapian_index()
- Modifying the originally proposed design of xapian_search()
- Determining the output structure of xapian_search()
- Determining the input structure of queries
- Determining an R package to use for testing
- Determining an R package to use for argument checks
- Developing code samples that would be helpful during implementation
23 May – 29 May, 30 May – 5 June | Simple Indexing
- Creating a package skeleton, generating Makevars with autoconf
- Structuring the package
- Validating user inputs to input parameters
- Extracting R components and converting those to C++ equivalents
- Creating objects and calling required Xapian functions with extracted arguments
- Testing the simple indexing feature
- Writing examples and explanations of the simple indexing feature
6 June – 12 June, 13 June – 19 June | Simple Search
- Validating user inputs to input parameters
- Extracting R components and converting those to C++ equivalents
- Creating objects and calling required Xapian functions with extracted arguments
- Wrapping the search results in a data frame
- Testing the simple search feature
- Writing examples and explanations of the simple search feature
- Finalizing deliverables for the mid evaluation
20 June – 26 June, 27 June – 3 July | Advanced Indexing and Delete
- Validating user inputs to input parameters
- Extracting R components and converting those to C++ equivalents
- Creating objects and calling required Xapian functions with extracted arguments
- Testing the advanced indexing feature
- Writing examples and explanations of the advanced indexing feature
- Developing the xapian_delete() function
- Testing and documenting the xapian_delete() function
4 July – 10 July, 11 July – 17 July | Advanced Search: Faceted Search
- Validating user inputs to input parameters
- Extracting R components and converting those to C++ equivalents
- Creating objects and calling required Xapian functions with extracted arguments
- Wrapping output spy values returned by TermIterator in a suitable R data structure
- Testing the faceted search feature
- Writing examples and explanations of the faceted search feature
18 July – 24 July, 25 July – 31 July | Advanced Search: Remaining features
- Validating user inputs to input parameters
- Extracting R components and converting those to C++ equivalents
- Creating objects and calling required Xapian functions with extracted arguments
- Testing remaining features of advanced search
- Writing examples and explanations of the advanced search
1 August – 7 August
- Writing a reference manual for the package
- Further feature enhancements depending on time availability
8 August – 14 August
- Package testing, bug fixing
- Finalizing project deliverables
15 August – 23 August
- Kept unallocated as recommended by Xapian
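The "wrapping the search results in a data frame" step planned for the Simple Search phase amounts to turning row-wise matches into equal-length columns, because an R data frame is a list of column vectors of the same length. The following is only a minimal sketch of that transposition in standard C++. The Match and ResultColumns types, their field names, and to_columns() are hypothetical and are not part of Xapian, Rcpp, or the proposed package; in the real package the columns would presumably be handed to Rcpp instead of being kept in std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical row-wise match, as it might be collected while walking
// the search results (this is NOT a Xapian type).
struct Match {
    unsigned docid;    // document identifier
    unsigned rank;     // 0-based position in the result list
    double weight;     // relevance weight of the match
    std::string data;  // stored document data
};

// Hypothetical column-wise layout: one vector per future data frame column.
struct ResultColumns {
    std::vector<unsigned> docid;
    std::vector<unsigned> rank;
    std::vector<double> weight;
    std::vector<std::string> data;

    // Number of rows, taken from the first column.
    std::size_t rows() const { return docid.size(); }
};

// Transpose row-wise matches into equal-length columns.
ResultColumns to_columns(const std::vector<Match>& matches) {
    ResultColumns cols;
    cols.docid.reserve(matches.size());
    cols.rank.reserve(matches.size());
    cols.weight.reserve(matches.size());
    cols.data.reserve(matches.size());

    for (const Match& m : matches) {
        cols.docid.push_back(m.docid);
        cols.rank.push_back(m.rank);
        cols.weight.push_back(m.weight);
        cols.data.push_back(m.data);
    }

    // A data frame requires every column to have the same length.
    assert(cols.rank.size() == cols.rows());
    assert(cols.weight.size() == cols.rows());
    assert(cols.data.size() == cols.rows());
    return cols;
}
```

Keeping the transposition in C++ means the R side only receives ready-made columns. The column names used here are illustrative; the actual output structure of xapian_search() is to be decided during the first phase of the timeline.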