GSoC2012/QueryParser/Notes – Xapian

wiki:GSoC2012/QueryParser/Notes

Context Navigation

Xapian QueryParser

Xapian uses the QueryParser generated by Lemon Parser Generator. Lemon is an LALR parser generator for C or C++. Rather than generating a complete and working program, it generates only a few subroutines that implement the parser.

The QueryParser contains a self-written lexer which tokenises the query, and each time after figuring out the token, calls the Lemon generated parser [to be specific, calls the method static void Parse(yyParser *, int, Term *, State *), described below] with the token detected and the corresponding information of the token.

The prototype and description of the subroutines that are generated in Xapian::QueryParser using Lemon are :

static yyParser *ParseAlloc();

Description : This routine allocates and initializes the new parser and returns a pointer to it.

static void ParseFree(yyParser *);

Description : This routine is used free all the memory allocated by the parser, called after the program is finished using the parser. The argument to be passed is the pointer returned by ParseAlloc().

static void Parse(yyParser *, int, Term *, State *);

Description : This is the main parser function. It is called each time a token is generated by the lexer.

The first argument is the pointer returned by ParseAlloc().

The second argument is the integer that tells the parser the type of the token to be parsed. There is one token for each TERMINAL symbol specified in the grammar.The TERMINAL symbols are mapped to appropriate integer values.

The third argument is the value of the given token. Xapian uses a class Term to store all the corresponding information related to the token. This class is used to pass the information about a token from lexer to parser.

Details of the class Term : It has six basic attributes :

State * state : State is a class used by QueryParser and is the same as the 4th argument of the Parse(). Described later.

string name : It represents the word string present in the query corresponding to the token generated by the lexer. E.g. for the input query : latest watches , two tokens shall be generated by the lexer, one corresponding to the string "latest" and other corresponding to the string "watches". Let term1 and term2 be the Term objects corresponding to the values of the two tokens respectively. Then term1.name will be "latest" and term2.name will be "watches". It is used by make_term() when making a term by appending the prefixes, "Z" etc. (IF required) to the Term::name.

const PrefixInfo * prefix_info : PrefixInfo is a struct defined in queryparser_internal.h (line 42). It contains information about how to handle a prefix in a query string.

Two main members of struct PrefixInfo are

filter_type type : Defined queryparser_internal.h (line 39). It describes the type of the filter. It can have 3 values :

NON_BOOLEAN : Represents a probabilistic term prefix.

E.g. Suppose we have a large database corresponding to different sites, such that the documents with different sites have "site" field associated with them.
Let us map the field "site" to prefix "S" This prefix can be added by the following code :
Xapian::QueryParser : qp; qp.add_prefix("site","S");
Now consider the following query :
watches site:google
The above query will return the following Query object :
watches@1 OR Agoogle

The operator OP_OR (corresponding to OR ) is used (and not OP_FILTER) since the type of prefix is NON-BOOLEAN.

Details of Xapian::Query::OP_OR (consider A OP_OR B):

What documents are matched? : matches documents which match query A or B (or both)

How is the weights of documents adjusted? : passes documents with the sum of weights from A and B

Multiple fields can be mapped to SAME prefix. E.g. Suppose our database have a field "title" along with with the field "site".

Let us map the field "site" to the same prefix as before, i.e. "S" and field "title" to the also to the prefix "S"
qp.add_prefix("site","S");
qp.add_prefix("title","S");
Now consider the following query :
watches site:google title:sale''[[BR]] The above query will return the following Query object :
watches@1 OR Sgoogle@2 OR Ssale@3

It is also possible to map a single field with to MULTIPLE prefixes. Multiples terms are generated for such a field, combined with Xapian::Query::OP_OR.

E.g. Let us map the field "site" to prefix "S" as well as prefix "T".
qp.add_prefix("site","S");
qp.add_prefix("site","T");
Now consider the following query :
watches site:google''[[BR]] The above query will return the following Query object :
watches@1 OR (Sgoogle@2 OR Tgoogle@2)

BOOLEAN_EXCLUSIVE : It allows the user to restrict a search with a boolean filter specified in the free text query.

E.g. Consider the same database as in previous example.
Now suppose we make the field "site" a Boolean filter.
qp.add_boolean_prefix("site","S")
Now consider the following query :
watches site:google''[[BR]] The above query will return the following Query object :
watches@1 FILTER Sgoogle
The corresponding search will return all the documents from site google ONLY (and not any other site since we made "site" a boolean filter) which have the term "watches" in it.

The operator OP_FILTER (corresponding to FILTER ) is used (and not OP_OR) since the type of prefix is BOOLEAN_EXCLUSIVE..

Details of Xapian::Query::OP_FILTER (consider A OP_FILTER B):

What documents are passed? : passes documents which match both the subqueries A as well as B

How is the weights of documents adjusted? : passes documents with the weight from A ONLY.

If there are boolean filters for different prefixes, they will be combined with the @c Xapian::Query::OP_AND operator.

E.g. Consider the same database with the fields "site" and "description". Let us make both of these boolean filters with DIFFERENT prefixes.
qp.add_boolean_prefix("site","S");
qp.add_boolean_prefix("title","T");
Now consider the following query :
watches site:google title:sale''[[BR]] The above query will return the following Query object :
watches@1 FILTER (Sgoogle AND Tsale).

If multiple boolean filters are specified in a query for the same prefix, they will be combined with the Xapian::Query::OP_OR operator.

E.g. Consider the same database with the fields "site" and "description". Let us make both of these boolean filters with SAME prefixes.
qp.add_boolean_prefix("site","S");
qp.add_boolean_prefix("title","S");
Now consider the following query :
watches site:google title:sale''[[BR]] The above query will return the following Query object :
watches@1 FILTER (Sgoogle OR Ssale)

BOOLEAN : It is same as BOOLEAN_EXCLUSIVE in all respects with only one difference.

Under this type, multiple boolean filters specified for SAME prefixes are combined with OP_AND (and not with OP_OR as is the case in BOOLEAN_EXCLUSIVE type).
E.g. Consider the same database with the fields "site" and "description". Let us make both of these boolean filters with SAME prefixes.
qp.add_boolean_prefix("site","S");
qp.add_boolean_prefix("title","S",true);
Now consider the following query :
watches site:google title:sale[[BR]] The above query will return the following Query object :

watches@1 FILTER (Sgoogle AND Ssale)

list<string> prefixes : This contains the list of the prefixes corresponding to a filter.

E.g. If we map the filter "site" to two prefixes "S" and "T", then the corresponding list<string> prefixes will have two terms "S" and "T". To manage different fields and correspondingly different prefixes associated with them, Xapian uses map<string, PrefixInfo> Xapian::QueryParser::Internal::prefixmap (line 74, queryparser_internal.h), which maps different filters to their corresponding prefix(es) and the type of filter via PrefixInfo.

string unstemmed : It contains the same information as Term::name, with the difference that term::name is lowercase version of Term::unstemmed.
QueryParser::stem_strategy stem : It represents the stemming strategy to be used by the QueryParser.

It can have three different values.

STEM_NONE : No word is stemmed.

STEM_SOME : Some words are stemmed (Note: stemmed words have a prefix "Z")

STEM_ALL : All words are stemmed. (Note: stemmed words DON'T have a prefix "Z")

[NOT ADDED YET] 4. STEM_ALL_Z : All words are stemmed. (Note: stemmed words have a prefix "Z")

termpos pos : It is used to keep track of the term position within the query. This is used to implement phrase searching and the NEAR operator.

The fourth argument is the is an instance of the class State. It is the Parser State that is shared between the lexer and the parser.
Lemon provides the feature of fourth parameter that can be of any type chosen by the programmer.The parser doesn't do anything with this argument except to pass it through to action routines. This is a convenient mechanism for passing state information down to the action routines without having to use global variables.

Details of class State: It has four basic attributes :

Query query : It is used to save the parsed query so that it can be returned. Thus we don't need a global parameter to save the query returned by the parser.

The highest level in the parse tree corresponds to the grammar rule : query ::= expr(E). where E is the parsed query. Under this rule the action taken is to save the parsed query in the State structure i.e. state->query = *E;

const char* error : It is used to store the type of error if the Parser fails. Thus we don't need a global parameter to store the type of error occurred.

E.g. For a query like A AND B, boolean operators like OP_AND require that both A as well as B are not null. Thus if we give a query like "spectacles AND", then that returns the QueryParserError: Syntax: <expression> AND <expression>

3. unsigned flags : It is used to store the information regarding the types of Xapian::QueryParser::feature_flag, that have been enabled while parsing the query.

While parsing, and tokenizing, state->flags provide a convenient way to access which all flags have been enabled, thus saving the need of global parameter.

QueryParser::Internal * qpi : It enables the parser to access as well as store different information and correspondingly update the attributes of

Xapian::QueryParser::Internal, while parsing. E.g. If we are doing a widcard query, like "cod*" or a synonym query, then parser reuqires access to the database to form the Query object. Under situations like these, the parser, while parsing can access the database via qpi, i.e qi->db .

static void yy_parse_failed(yyParser *); This is used if at any point while parsing the query, the parser fails. The argument to be passed is the pointer returned by ParseAlloc().

In Lemon, ALL Terminals must have the same type (as mentioned above, in Xapian, each terminal has the type Term thus all the information corresponding to a token is stored in the corresponding Term object) but Non-Terminals can have their own (different) types/values.

In Lemon a terminal symbol (token) is any string of alphanumeric and underscore characters that begins with an upper case letter.

The QueryParser grammar has the following 23 TERMINALS :

ERROR : Used to represent an error in the query i.e. a malformed query. E.g. If bool_arg(A) [described later] is empty, then that corresponds to ERROR
OR : Refers to OR operator ( Xapian::Query::OP_OR ).

E.g. subquery1 OR subquery2. This matches the documents which are matched by either of the subqueries.

XOR : Refers to XOR operator ( Xapian::Query::OP_XOR ).

E.g. subquery1 XOR subquery2. This matches the documents which are one or the other subqueries, but not both.

AND : Refers to AND operator ( Xapian::Query::OP_AND ).

E.g. subquery1 AND subquery2. This matches the documents which are matched by both the subqueries.

NOT : Refers to AND operator ( Xapian::Query::OP_AND ).

E.g. subquery1 AND subquery2. Another example : subquery1 AND NOT subquery2. This matches the documents that are matched only by first subquery and not the second subquery. If FLAG_PURE_NOT is enabled, then queries like "NOT subquery" can be used. This matches the documents that are not matched by the subquery.

NEAR : Refers to NEAR operator ( Xapian::Query::OP_NEAR ).

E.g. word1 NEAR word2. This matches documents containing the both the words - word1 and word2 such that they are within 10 words of each other. The default value of NEAR operator is 10. We change the default value by using NEAR/n which corresponds to the token NEAR(N). E.g. word1 NEAR/5 word2. This matches documents containing the both the words - word1 and word2 such that they are within 5 words of each other.

ADJ : ADJ is similar to NEAR with the difference that it matches only if the words specified in the query with ADJ operator appear in same order as mentioned in the query.

E.g. if I have a document containing "xapian parser provides a new stemming strategy". Then both the queries "xapian NEAR strategy" and "strategy NEAR xapian" will match this document. Also "xapian ADJ strategy" will match this document but "strategy ADJ xapian" will NOT MATCH this document. Similar to NEAR the default value of ADJ is 10. It can be changed to n by a query like following: word1 ADJ/n word2. The ADJ/n corresponds to ADJ(n) token.

LOVE : If FLAG_LOVEHATE is enabled then "+" after a whitespace or an open bracket corresponds to the token LOVE but with following conditions
1. if "+" is followed by space, then it is ignored. E.g. the query "xapian + strategy" returns the Query object "xapian@1 OR strategy@2" and not "strategy@2 AND_MAYBE xapian@1" which is returned if the query is "xapian +strategy".
2. a Postfix "+" (such as in google+) is not treated as a LOVE token. Under such case, the character "+" is regarded as a part of the term only by the lexer. E.g. The query "profile google+" returns the query object "profile@1 OR google+@2" i.e. "+" is treated as the part of the term google only and not as a separate token.
3. Ignored if present at the end of the query. Example query which involve LOVE token : As mentioned above, the query "xapian +strategy" returns the query object "strategy@2 AND_MAYBE xapian@1". Xapian::Query::OP_AND_MAYBE corresponds to the AND_MAYBE operator.

Details of Xapian::Query::OP_AND_MAYBE ( Consider A OP_AND_MAYBE B):

Which documents are matched? : matches documents which matches A or (A and B).

How is weight of documents adjusted? : 1. Documents which match A and B are passed, with weight of A+B

Documents which match A only are passed, with weight of A
Documents which match B only are not passed

HATE : If FLAG_LOVEHATE is enabled then "-" after a whitespace or an open bracket corresponds to the token HATE but with the following conditions
1. if "-" is followed by space, then it is ignored. E.g. the query "xapian - strategy" returns the Query object "xapian@1 OR strategy@2"

and not "xapian@1 AND_NOT strategy@2" which is returned if the query is "xapian -strategy".

a Postfix - (such as in xapian-) is not treated as a HATE token. Under such case, the character "-" is simply ignored by the lexer and is not regarded as a part of the term.

E.g. The query "xapian- core" returns the query object "xapian@1 OR core@2" i.e. "-" is simply ignored and is not treated as the part of the term xapian or as a separate token.

Ignored if present at the end of the query. Example query which involve HATE token : As mentioned above, the query "xapian -strategy" returns the query object "xapian@1 AND_NOT strategy@2". Xapian::Query::OP_AND_NOT corresponds to the AND_NOT operator.

Details of Xapian::Query::OP_AND_NOT ( Consider A OP_AND_NOT B ):

Which documents are matched? : matches documents which match query A but not B.

How is weight of documents adjusted? : passes documents with the weight from A only

HATE_AFTER_AND : If FLAG_LOVEHATE is enabled then "-" after AND operator corresponds to the token HATE_AFTER_AND.

Doubt - Why do we need this token? Why not simply treat such type of queries with HATE token ?

SYNONYM : If FLAG_SYNONYM is enabled then "~" after a whitespace, +, -, or an open bracket corresponds to the token SYNONYM but with the following conditions
1. It is ignored if not followed by a word character. E.g. if "happy" and "cheerful" are added as synonyms then the query "~happy" will return the Query object "happy@1 SYNONYM cheerful@1" but the query "~ happy" returns the Query object "happy@1"
2. Ignored if present at the end of the query.

Example query which involve SYNONYM token :

NOTE: we must call set_database() for this to work. Also we need to add the synonyms to the document. This can be done as follow

Xapian::WritableDatabase db(@param);
db.add_synonym("happy", "cheerful");
Xapian::QueryParser qp;
qp.set_database(db);
Now if we give a query "~happy" then the Query object returned is "happy@1 SYNONYM cheerful@1". Xapian::Query::OP_SYNONYM corresponds to the SYNONYM operator.

Details of Xapian::Query::OP_SYNONYM : Treats a set of queries as synonyms. It is identical to OP_OR except for the weightings returned.

Which documents are matched? : matches documents that match at least one of the queries.

How is the weight of documents adjusted? : Documents are weighted as if all the sub-queries are are instances of the same term, so multiple matching terms increase the wdf value used, and the term frequency is based on the number of documents which will match an OR of all the subqueries.

TERM : TERM is a query term, including prefix (if any).

GROUP_TERM : GROUP_TERM is a query term which follows a TERM or another GROUP_TERM and is only separated by whitespace characters.

PHR_TERM : PHR_TERM is a query term which follows a TERM or another PHR_TERM and is separated only by one or more phrase generator characters (hyphen and apostrophe are common

examples). Phrase generator characters (tested via is_phrase_generator, line 483,queryparser_internal.cc) are the characters that generate a phrase search.

Currently Xapian supports the following characters as phrase generator : "." , "-" , "/" , ":" , "
" , "@".

The phrase operator allows for searching for a specific phrase and returns only matches where all terms appear in the document, in the correct order, giving a weight of the sum of each term.

For example:The query object "a@1 PHRASE 3 b@2 PHRASE 3 c@3" matches the documents which match A followed by B followed by C and gives them a weight of A+B+C.

Examples of phrase search : The query: xapian.org ,returns the Query object "xapian@1 PHRASE 2 org@2" (since "." is a phrase generator)

The query: "A B C" , returns the Query object "a@1 PHRASE 3 b@2 PHRASE 3 c@3" whereas the query : A B C , returns the Query object "a@1 OR b@2 OR c@3".

The query : /home/user/xapian/xapian-core , returns the Query object "home@1 PHRASE 5 user@2 PHRASE 5 xapian@3 PHRASE 5 xapian@4 PHRASE 5 core@5".

Phrase search also plays an important role with the filters.
E.g. if we add the filter (non-boolean) for field "title" by mapping it to prefix "T" (by doing qp.add_prefix("title","T")), then the query: title:"''Harry Potter and the Chamber of Secrets''" , returns the Query object "Tharry@1 PHRASE 7 Tpotter@2 PHRASE 7 Tand@3 PHRASE 7 Tthe@4 PHRASE 7 Tchamber@5 PHRASE 7 Tof@6 PHRASE 7 Tsecrets" i.e. the whole title is treated as a single entity since the words are connected by OP_PHRASE and also that all words are prefixed by "T". whereas the query: title:Harry Potter and the Chamber of Secrets , returns the Query object "Tharry@1 OR potter@2 OR and@3 OR the@4 OR chamber@5 OR of@6 OR secrets@7" i.e. the whole title is not treated as a single entity since the words are connected by OP_OR and also all words are not prefixed by "T".

Note: For the phrase searches, FLAG_PHRASE should be enabled. (By default it is enabled)

WILD_TERM : WILD_TERM is like a TERM, but has a trailing wildcard which needs to be expanded. It is used to match any number of trailing characters within a term (Right Truncation).

Note: Like in the case of synonyms, for the wildcard expansion we must call set_database(). Also the wildcard expansion works ONLY IF FLAG_WILDCARD is enabled. (By default, it is not enabled). You can limit the number of terms a wildcard will expand to by calling Xapian::QueryParser::set_max_wildcard_expansion(). If a wildcard expands to more terms than that number, an exception will be thrown. The exception may be thrown by the QueryParser, or later when Enquire handles the query. The default is not to limit the expansion.

Example of wildcard query : If our database contains the terms "code" , "coding" , "coded" , "coder" , "codomain" and "codomain_new" , then the query "cod*"

will return the Query object "code@1 SYNONYM coded@1 SYNONYM coder@1 SYNONYM coding@1 SYNONYM codomain@1 SYNONYM codomain_new@1".

PARTIAL_TERM : PARTIAL_TERM is like a TERM, but it's at the end of the query string and we're doing "search as you type". It refers to the final term of a partial match query,

with no following characters and is thus treated as a wildcard, thus expands to something like WILD_TERM. Partial matching causes the parser to treat the query as a "partially entered" search. This will automatically treat the final word as a wildcard match, unless it is followed by whitespace, to produce more stable results from interactive searches.

Note : FLAG_PARTIAL should be enables to support the partial term query

Example of partial term query : Consider the same database as used above in wildcard query. Our database contains the terms "code" , "coding" , "coded" , "coder" ,

"codomain" and "codomain_new".

Thus the query "I am a cod" will treat the last word of the query ("cod") as wildcard term and thus return the following Query object

"(i@1 OR am@2 OR a@3) OR ((code@4 SYNONYM coded@4 SYNONYM coder@4 SYNONYM coding@4 SYNONYM codomain@4 SYNONYM codomain_new@4) OR cod@4)"

BOOLEAN_FILTER : BOOLEAN_FILTER is a query term with a prefix registered using add_bool_prefix(). It's added to the query using an OP_FILTER operator,(or OP_AND_NOT if it's negated)

e.g. site:xapian.org or -site:xapian.org. These were explained in detail along with examples earlier while describing the different types of filters.

RANGE :

QUOTE : Characters ' " ' , left curly double quote(0x201c) and the right curly double quote(0x201d) match to the token QUOTE. An unmatched " at the end of the query is ignored to

avoid generating an empty pair of QUOTEs which will cause a parse error. The grammar rule corresponding to the phrased searched is : QUOTE phrase(P) QUOTE. Examples of phrased search were given above.

BRA : Character '(' after a whitespace, bracket , '+' or '-' matches to the token BRA with the following conditions:
1. It is ignored if present at the end of the query.
2. It is ignored if the case corresponds to empty ().

The grammar rule corresponding to the bracketed expression is : compound_term ::= BRA expr KET

KET : Character ')' represents the token KET. It represents the end of a bracketed expression.

The grammar rule corresponding to the bracketed expression is : compound_term ::= BRA expr KET

CJKTERM : It corresponds to the case if CJK n-gram code is being used i.e. if CJK::is_cjk_enabled() true and CJK::codepoint_is_cjk(*itertor) returns true.

TODO : don't know much about the Xapian CJK n-gram at present. Will update this later.

EMPTY_GROUP_OK : The corresponding grammar rule is : group ::= group EMPTY_GROUP_OK

The QueryParser grammar has the following 15 NON-TERMINALS (as mentioned earlier, in Lemon, non-terminals can have different type/value):

error

query : %type query {int}

The whole query - just an expr or nothing.

The corresponding grammar rules and their description are as follow along with the pseudo code:

query ::= expr(E). The parsed query is saved in State structure i.e. { state->query = *E }
query ::= . If it is nothing, then point to empty query i.e. { state->query = Query() }

expr : %type expr {Query *}

expr - A query expression.

The corresponding grammar rules and their description are as follow along with the pseudo code:

expr(E) ::= prob_expr(P). Assign E equal to P i.e. { E = P }
expr(E) ::= bool_arg(A) AND bool_arg(B). Two bool_arg with AND operator in between them. Neither A nor B should be null. { E = new Query(OP_AND, *A, *B); }
expr(E) ::= bool_arg(A) NOT bool_arg(B). Two bool_arg with NOT operator in between them. If FLAG_PURE_NOT is enabled, then A can be null, otherwise neither A nor

B should be null. { E = new Query(OP_AND_NOT, *A, *B); }

expr(E) ::= bool_arg(A) AND NOT bool_arg(B). Two bool_arg with AND NOT in between them. Neither A nor B should be null. { E = new Query(OP_AND_NOT, *A, *B); }
expr(E) ::= bool_arg(A) AND HATE_AFTER_AND bool_arg(B). Two bool_arg with AND '-' in between them. Neither A nor B should be null. { E = new Query(OP_AND_NOT, *A, *B); }
expr(E) ::= bool_arg(A) OR bool_arg(B). Two bool_arg with OR in between them. Neither A nor B should be null. { E = new Query(OP_OR, *A, *B); }
expr(E) ::= bool_arg(A) XOR bool_arg(B). Two bool_arg with OR in between them. Neither A nor B should be null. { E = new Query(OP_XOR, *A, *B); }

bool_arg : %type bool_arg {Query *}

bool_arg - an argument to a boolean operator such as AND or OR.

The corresponfing grammar rules and their description are as follow along with the pseudo code:

bool_arg(A) ::= expr(E). Assign A equal to E. { A = E; }
bool_arg(A) ::= . This case corresponds to ERROR since boolean operators require two terms. Set the argument to NULL, which enables the bool_arg-using rules in

expr above to report uses of AND, OR, etc which don't have two arguments. { A = NULL; }

prob_expr : %type prob_expr {Query *}

prob_expr - a single compound term, or a prob.

The corresponding grammar rules and their description are as follow along with the pseudo code :

prob_expr(E) ::= prob(P). prob has type structure ProbQuery [line 1276 of queryparser_internal.cc]. Depending on the values of P->love, P->filter and P->hate, the actions are

taken. The query corresponding to P->love (query corresponding to terms with '+' infront of them in the query) is combined with P->query via OP_AND_MAYBE. { E = P->query; swap(E, P->love); E = Query(E,OP_AND_MAYBE,P->love) } . Then the filters are merged and are further combined to E. { E = Query(E,OP_FILTER,P->merge_filers); or E = Query(E,OP_SCALE_WEIGHT,P->merge_filters(),0.0) } Further the query corresponding to P->hate (query corresponding to terms with '-' in front of them in the query) is combined with E. { E = Query(Query::OP_AND_NOT, E, P->hate); } TODO : Explaining the details of structure ProbQuery may be relevant here.

prob_expr(E) ::= term(T). Assign E equal to T. {E = T;}

prob : %type prob {ProbQuery *}

prob - a probabilistic sub-expression consisting of stop_terms, "+" terms, "-" terms, boolean filters, and/or value ranges.

The corresponding grammar rules and their description are as follow along with the pseudo code :

prob(P) ::= RANGE(R). TODO : Later
prob(P) ::= stop_prob(Q) RANGE(R). TODO : Later
prob(P) ::= stop_term(T) stop_term(U). Assign P equal to a new instance of type ProbQuery. Assign P->query equal to T. Now depending on if the operator is

OP_NEAR or OP_OP_PHRASE, set the window size to 11 for the first pair of terms and it will automatically grow by one for each subsequent term. Then join the P->query to the query U, otherwise simply use the method add_to_query() to join P->query and U.

prob(P) ::= prob(Q) stop_term(T). Assign P equal to Q. If T is a stopword, then nothing to be done, else add the query described by T to P->Query via default operator.

{ P = Q; if (T) add_to_query(P->query, state->default_op(), T); }

prob(P) ::= LOVE term(T). Assign P equal to a new instance of type ProbQuery. Depending on whether the default operator is OP_AND or not, assign P->query equal to T

or P->love equal to T. { P=new ProbQuery; if (state->default_op() == Query::OP_AND) P->query = T; else P->love = T; }

6. prob(P) ::= stop_prob(Q) LOVE term(T). Assign P equal to Q. Since the next token is LOVE and a term follows it, add the query described by T to P->love via OP_AND.

But if the default operator is OP_AND, then add the query described by T to P->query. { P = Q; if(state->default_op() == Query::OP_AND) add_to_query(P->query, Query::OP_AND, T); else add_to_query(P->love, Query::OP_AND, T);

prob(P) ::= HATE term(T). Assign P equal to a new instance of type ProbQuery and assign P->hate equal to T, since it is the query after the token HATE.

{ P = new ProbQuery; P->hate = T; }

prob(P) ::= stop_prob(Q) HATE term(T). Assign P equal to Q. Since the next token is HATE, thus combine the query represented by T to Q->hate via OP_OR.

{ P = Q; add_to_query(P->hate, Query::OP_OR, T); }

prob(P) ::= HATE BOOLEAN_FILTER(T). Assign P equal to a new instance of type ProbQuery and add the query described by T to P->hate (since there is the token HATE).

{ P = new ProbQuery; P->hate = new Query(T->get_query()); }

prob(P) ::= stop_prob(Q) HATE BOOLEAN_FILTER(T). Add the query described by the filter T to the queries present in Q->hate.

{ P = Q; add_to_query(P->hate, Query::OP_OR, T->get_query()); }

prob(P) ::= BOOLEAN_FILTER(T). Add the filter described by T to the instance of ProbQuery referred by P.

{ P = new ProbQuery(); P->add_filter(T->get_filter_group_id(), T->get_query()); }

12. prob(P) ::= stop_prob(Q) BOOLEAN_FILTER(T). P is assigned to Q. The type of P is ProbQuery. Since the next token is BOOLEAN_FILTER(T), we need to add the filter

described by T to the instance of the structure ProbQuery referred by P, by adding (append_filter) the mapping of filter (T->get_filter_group_id()) to the corresponding query (T->get_query()). { P = Q; P->append_filter(T->get_filter_group_id(), T->get_query()); }

prob(P) ::= LOVE BOOLEAN_FILTER(T). LOVE BOOLEAN_FILTER(T) is just the same as BOOLEAN_FILTER. {P = Q; P->append_filter(T->get_filter_group_id(), T->get_query());}
prob(P) ::= stop_prob(Q) LOVE BOOLEAN_FILTER(T). LOVE BOOLEAN_FILTER(T) is just the same as BOOLEAN_FILTER. OR filters with same prefix.

{ P = Q; Query & q = P->filter[T->get_filter_group_id()]; q = Query(Query::OP_OR, q, T->get_query()); }

stop_prob : %type stop_prob {ProbQuery *}

stop_prob - A prob or a stop_term.

The corresponding grammar rules and their description are as follow along with the pseudo code :

stop_prob(P) ::= prob(Q). Assign P equal to Q. { P = Q; }
stop_prob(P) ::= stop_term(T). Assign P equal to a new instance of type ProbQuery. Make P->query equal to T. { P = new ProbQuery; P->query = T; }

stop_term : %type stop_term {Query *}

stop_term - A term which should be checked against the stopword list, or a compound_term. If a term is loved, hated, or in a phrase, we don't want to consult the stopword list, so stop_term isn't used there (instead term is).

The corresponding grammar rules and their description are as follow along with the pseudo code :

stop_term(T) ::= TERM(U). If u corresponds to a stop word, then T is made NULL, and the U is added to the stoplist. Otherwise, T is assigned to the query formed

by considering the synonyms IF the corresponding synonym flags are enabled, other the query is made from U and is assigned to T. [ if (state->is_stopword(U)) {T = NULL; state->add_to_stoplist(U);} else { T = new Query(U->get_query_with_auto_synonyms()); } ]

stop_term(T) ::= compound_term(U). Assign T equal to U. { T = U }

term : %type term {Query *}

term - A term or a compound_term. It is different from stop_term in the sense that here we don't consult the stopword list. This corresponds to the case if the term is loved, hated, or in a phrase.

The corresponding grammar rules and their description are as follow along with the pseudo code :

term(T) ::= TERM(U). Form a synonym query by checking for synonyms IF the synonym flag is enabled, otherwise form the query simply from U. { T = new Query(U->get_query_with_auto_synonyms()); }
term(T) ::= compound_term(U). Assign T equal to U. { T = U. }

compound_term : %type compound_term {Query *}

compound_term - A WILD_TERM, a quoted phrase (with or without prefix), a phrased_term, group, near_expr, adj_expr, or a bracketed subexpression (with or without prefix).

The corresponding grammar rules and their description are as follow along with the pseudo code :

compound_term(T) ::= WILD_TERM(U). { T = U->as_wildcarded_query(state); }
compound_term(T) ::= PARTIAL_TERM(U). { T = U->as_partial_query(state); }
compound_term(T) ::= QUOTE phrase(P) QUOTE. { T = P->as_phrase_query(); }
compound_term(T) ::= phrased_term(P). { T = P->as_phrase_query(); }
compound_term(T) ::= group(P). { T = P->as_group(state); }
compound_term(T) ::= near_expr(P). { T = P->as_near_query(); }
compound_term(T) ::= adj_expr(P). { T = P->as_adj_query(); }
compound_term(T) ::= BRA expr(E) KET. { T = E; }
compound_term(T) ::= SYNONYM TERM(U). { T = new Query(U->get_query_with_synonyms()); } 10 compound_term(T) ::= CJKTERM(U). { T = U->as_cjk_query(); }

phrase : %type phrase {Terms *}

phrase - The "inside the quotes" part of a double-quoted phrase.

The corresponding grammar rules and their description are as follow along with the pseudo code :

phrase(P) ::= TERM(T). { P = new Terms; P->add_positional_term(T); }
phrase(P) ::= CJKTERM(T). { P = new Terms; T->as_positional_cjk_term(P); }
phrase(P) ::= phrase(Q) TERM(T). { P = Q; P->add_positional_term(T); }

phrased_term : %type phrased_term {Terms *}

phrased_term - A phrased term works like a single term, but is actually 2 or more terms linked together into a phrase by punctuation. There must be at least 2 terms in order to be able to have punctuation between the terms[[BR]]

The corresponding grammar rules and their description are as follow along with the pseudo code :

phrased_term(P) ::= TERM(T) PHR_TERM(U). { P = new Terms; P->add_positional_term(T); P->add_positional_term(U); }
phrased_term(P) ::= phrased_term(Q) PHR_TERM(T). { P = Q; P->add_positional_term(T); }

group : %type group {TermGroup *}

group - A group of terms separated only by whitespace - candidates for multi-term synonyms. The corresponding grammar rules and their description are as follow along with the pseudo code :

group(P) ::= TERM(T) GROUP_TERM(U). { P = new TermGroup; P->add_term(T); P->add_term(U); }
group(P) ::= group(Q) GROUP_TERM(T). { P = Q; P->add_term(T); }
group(P) ::= group(Q) EMPTY_GROUP_OK. { P = Q; P->set_empty_ok(); }

near_expr : %type near_expr {Terms *}

near_expr - 2 or more terms with NEAR in between. There must be at least 2 terms in order for there to be any NEAR operators! The corresponding grammar rules and their description are as follow along with the pseudo code :

near_expr(P) ::= TERM(T) NEAR(N) TERM(U). { P = new Terms; P->add_positional_term(T); P->add_positional_term(U); if (N) P->adjust_window(N->get_termpos()); }
near_expr(P) ::= near_expr(Q) NEAR(N) TERM(T). { P = Q; P->add_positional_term(T); if (N) P->adjust_window(N->get_termpos()); }

adj_expr : %type adj_expr {Terms *}

adj_expr - 2 or more terms with ADJ in between. There must be at least 2 terms in order for there to be any ADJ operators! The corresponding grammar rules and their description are as follow along with the pseudo code :

adj_expr(P) ::= TERM(T) ADJ(N) TERM(U). { P = new Terms; P->add_positional_term(T); P->add_positional_term(U); if (N) P->adjust_window(N->get_termpos()); }
adj_expr(P) ::= adj_expr(Q) ADJ(N) TERM(T). { P = Q; P->add_positional_term(T); if (N) P->adjust_window(N->get_termpos()); }

To summarize, following are the grammar rules of QueryParser, listed together in the order :

"query ::= expr",

"query ::=",

"expr ::= prob_expr",

3 "expr ::= bool_arg AND bool_arg",

"expr ::= bool_arg NOT bool_arg",

"expr ::= bool_arg AND NOT bool_arg",

"expr ::= bool_arg AND HATE_AFTER_AND bool_arg",

"expr ::= bool_arg OR bool_arg",

"expr ::= bool_arg XOR bool_arg",

"bool_arg ::= expr",

"bool_arg ::=",

"prob_expr ::= prob",

"prob_expr ::= term",

"prob ::= RANGE",

"prob ::= stop_prob RANGE",

"prob ::= stop_term stop_term",

"prob ::= prob stop_term",

"prob ::= LOVE term",

"prob ::= stop_prob LOVE term",

"prob ::= HATE term",

"prob ::= stop_prob HATE term",

"prob ::= HATE BOOLEAN_FILTER",

"prob ::= stop_prob HATE BOOLEAN_FILTER",

"prob ::= BOOLEAN_FILTER",

"prob ::= stop_prob BOOLEAN_FILTER",

"prob ::= LOVE BOOLEAN_FILTER",

"prob ::= stop_prob LOVE BOOLEAN_FILTER",

"stop_prob ::= prob",

"stop_prob ::= stop_term",

"stop_term ::= TERM",

"stop_term ::= compound_term",

"term ::= TERM",

"term ::= compound_term",

"compound_term ::= WILD_TERM",

"compound_term ::= PARTIAL_TERM",

"compound_term ::= QUOTE phrase QUOTE",

"compound_term ::= phrased_term",

"compound_term ::= group",

"compound_term ::= near_expr",

"compound_term ::= adj_expr",

"compound_term ::= BRA expr KET",

"compound_term ::= SYNONYM TERM",

"compound_term ::= CJKTERM",

"phrase ::= TERM",

"phrase ::= CJKTERM",

"phrase ::= phrase TERM",

"phrase ::= phrase CJKTERM",

"phrased_term ::= TERM PHR_TERM",

"phrased_term ::= phrased_term PHR_TERM",

"group ::= TERM GROUP_TERM",

"group ::= group GROUP_TERM",

"group ::= group EMPTY_GROUP_OK",

"near_expr ::= TERM NEAR TERM",

"near_expr ::= near_expr NEAR TERM",

"adj_expr ::= TERM ADJ TERM",

"adj_expr ::= adj_expr ADJ TERM",

Last modified 13 years ago Last modified on 20/05/12 12:13:12

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text