wiki:GSoC2016/R/ProjectPlan

Project Details

The main goal of this project is to make Xapian accessible to the R community. An R package, xapr (https://github.com/stewid/xapr), already provides an interface to the basic searching and indexing features of the Xapian search engine. However, it cannot serve as a foundation for this project because it is built on R's C API, which was the available option for this kind of task before Rcpp was introduced. Interfacing with R through C is not recommended for the following reasons:

  • It limits the extensibility of a system. [Quote from "Software for Data Analysis: Programming with R", Springer 2008: "including additional C code is a dangerous step with some added dangers and often a substantial amount of programming and debugging required. You should have a good reason."]
  • It increases the likelihood of bugs.
  • It requires the programmer to manage many low-level details by hand, which makes coding tedious.

Using C++ with Rcpp is the recommended approach, since it provides protection against many 'historical idiosyncrasies of the R API, takes care of memory management and provides many useful helper methods'. [Quote by Hadley Wickham, chief scientist at RStudio and an adjunct assistant professor of statistics at Rice University]

For the development of the R bindings, I will use the Rcpp package, which integrates R and C++ by providing R functions as well as C++ classes. Rcpp offers a straightforward way of passing data between R and C++ and is widely regarded as the standard package for developing R bindings to C++ libraries. Above all, the Rcpp::export attribute allows functions defined in standalone C++ files to be sourced into R, which greatly simplifies the R/C++ integration process.
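As a minimal sketch of this workflow (not the package's actual code), the file below keeps the real work in a plain C++ function and exposes it to R through a thin wrapper marked with the export attribute. The names split_terms, split_terms_r, and sketch.cpp are hypothetical, and the preprocessor guard is only an assumption made here so that the helper also compiles where the Rcpp headers are not installed.

```cpp
// sketch.cpp (hypothetical file name)
#include <sstream>
#include <string>
#include <vector>

// Plain C++ helper with no dependency on R: splits text on whitespace.
// It stands in for real work such as passing text to Xapian.
std::vector<std::string> split_terms(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> terms;
    std::string term;
    while (in >> term) {
        terms.push_back(term);
    }
    return terms;
}

// Build the R-facing wrapper only when the Rcpp headers are available.
#if defined(__has_include)
#  if __has_include(<Rcpp.h>)
#    include <Rcpp.h>

// The attribute below asks Rcpp to generate the glue code that makes
// split_terms_r() callable from R. Rcpp converts the std::string argument
// and the std::vector<std::string> return value automatically.
// [[Rcpp::export]]
std::vector<std::string> split_terms_r(std::string text) {
    return split_terms(text);
}
#  endif
#endif
```

From R, assuming the file is saved as sketch.cpp, Rcpp::sourceCpp("sketch.cpp") would compile it and make split_terms_r() callable like any other R function.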

Project Deliverable

A well-documented R package that interfaces with the Xapian search engine library. [In addition to the reference manual, a complete set of examples in R will be produced as part of the deliverable.]

Goals

  • Providing an interface that gives convenient access to the advanced indexing and retrieval operations offered by the Xapian search engine library: two main functions will be exposed to users, xapian_search() and xapian_index(). All complexity will be handled inside the package, so users can interact with Xapian simply by calling the required function with appropriate arguments.
  • Minimizing the inherent inefficiencies of R code by using C++: user inputs to the R functions will follow a pre-defined format, and the R objects extracted from them will be explicitly converted to their C++ equivalents using the as<>() function template provided by Rcpp. As far as possible, all computationally intensive tasks will be done in C++ code.
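The conversion step in the second goal can be sketched as follows. This is an illustration under stated assumptions rather than the proposed design: the IndexRequest struct, the list element names "db_path" and "texts", and the function names are all hypothetical. Validation is written against plain C++ types so it can be tested without R, and the Rcpp part is guarded so the file still compiles where the Rcpp headers are absent.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical C++-side view of the arguments passed to xapian_index().
struct IndexRequest {
    std::string db_path;             // location of the Xapian database
    std::vector<std::string> texts;  // one entry per document to index
};

// Checks the converted arguments; throws if they are unusable.
void validate(const IndexRequest& req) {
    if (req.db_path.empty()) {
        throw std::invalid_argument("db_path must not be empty");
    }
    if (req.texts.empty()) {
        throw std::invalid_argument("at least one document is required");
    }
}

#if defined(__has_include)
#  if __has_include(<Rcpp.h>)
#    include <Rcpp.h>

// Converts an R list into the C++ struct with Rcpp::as<>().
// The element names used here are assumptions made for this sketch.
IndexRequest from_r(Rcpp::List args) {
    IndexRequest req;
    req.db_path = Rcpp::as<std::string>(args["db_path"]);
    req.texts = Rcpp::as<std::vector<std::string> >(args["texts"]);
    validate(req);
    return req;
}

// Returns the number of documents that would be indexed. An exception
// thrown during conversion or validation is reported to R as an error.
// [[Rcpp::export]]
int index_request_size(Rcpp::List args) {
    return static_cast<int>(from_r(args).texts.size());
}
#  endif
#endif
```

Keeping validation separate from the R-to-C++ conversion means the same checks could be reused by every exported function that accepts the same input format.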

API design

xapian_index() draft I https://docs.google.com/document/d/1RWHgeju_BycgtMtcuvT7uz5PeuxCwZV7ApZ4YbOcg7Y/edit?usp=sharing
xapian_index() draft II https://docs.google.com/document/d/1X0WI5RjAEVpwJw9Ldga22sTTVfAexWQ3nFPZskDhmvE/edit?usp=sharing
xapian_search() https://docs.google.com/document/d/1bLSPVQq9MjEVbPjqfZa2Vr7bfrRFo-T3hVXlp5yjAgM/edit?usp=sharing

Project Timeline

22 April – 22 May

  • Reading Xapian documentation
  • Understanding existing bindings
  • Modifying the originally proposed design of xapian_index()
  • Modifying the originally proposed design of xapian_search()
  • Determining the output structure of xapian_search()
  • Determining the input structure of queries
  • Determining an R package to use for testing
  • Determining an R package to use for argument checks
  • Developing code samples that would be helpful during implementation

23 May – 29 May, 30 May – 5 June
Simple Indexing

  • Creating a package skeleton, generating Makevars with autoconf
  • Structuring the package
  • Validating user inputs to input parameters
  • Extracting R components and converting those to C++ equivalents
  • Creating objects and calling required Xapian functions with extracted arguments
  • Testing the simple indexing feature
  • Writing examples and explanations of the simple indexing feature

6 June – 12 June, 13 June – 19 June
Simple Search

  • Validating user inputs to input parameters
  • Extracting R components and converting those to C++ equivalents
  • Creating objects and calling required Xapian functions with extracted arguments
  • Wrapping the search results in a data frame
  • Testing the simple search feature
  • Writing examples and explanations of the simple search feature
  • Finalizing deliverables for the mid evaluation

20 June – 26 June, 27 June – 3 July
Advanced Indexing and Delete

  • Validating user inputs to input parameters
  • Extracting R components and converting those to C++ equivalents
  • Creating objects and calling required Xapian functions with extracted arguments
  • Testing the advanced indexing feature
  • Writing examples and explanations of the advanced indexing feature
  • Developing the xapian_delete() function
  • Testing and documenting the xapian_delete() function

4 July – 10 July, 11 July – 17 July
Advanced Search: Faceted Search

  • Validating user inputs to input parameters
  • Extracting R components and converting those to C++ equivalents
  • Creating objects and calling required Xapian functions with extracted arguments
  • Wrapping output spy values returned by TermIterator in a suitable R data structure
  • Testing the faceted search feature
  • Writing examples and explanations of the faceted search feature

18 July – 24 July, 25 July – 31 July
Advanced Search: Remaining features

  • Validating user inputs to input parameters
  • Extracting R components and converting those to C++ equivalents
  • Creating objects and calling required Xapian functions with extracted arguments
  • Testing remaining features of advanced search
  • Writing examples and explanations of the advanced search

1 August – 7 August

  • Writing a reference manual for the package
  • Further feature enhancements depending on time availability

8 August – 14 August

  • Package testing, bug fixing
  • Finalizing project deliverables

15 August – 23 August

  • Kept unallocated as recommended by Xapian

Last modified on 24/06/16 16:52:35