Opened 8 years ago

Closed 8 years ago

Last modified 7 years ago

#719 closed defect (fixed)

Tokenized CJK query terms wrongly combined with respect to prefixes

Reported by: Aaron LI Owned by: Olly Betts
Priority: normal Milestone: 1.2.25
Component: QueryParser Version: 1.2.23
Severity: normal Keywords: CJK, prefix
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Aaron LI)

I first came across this issue when querying CJK text with mu (https://github.com/djcb/mu) and reported it there (https://github.com/djcb/mu/issues/123#issuecomment-180999233). However, after further investigation into mu and Xapian, I found that it is a bug in Xapian.


Here I demonstrate this issue with python-xapian:

# Run with the environment variable XAPIAN_CJK_NGRAM=1 set (see Environment).
import xapian

qp = xapian.QueryParser()

qp.add_prefix("subject", "S")
qp.add_prefix("s", "S")
qp.add_prefix("body", "B")
qp.add_prefix("b", "B")
qp.add_prefix("", "B")
qp.add_prefix("", "S")

qstr1 = "中文"
qstr2 = "b:中文"
qstr3 = "hello AND world"

q1 = qp.parse_query(qstr1)
q2 = qp.parse_query(qstr2)
q3 = qp.parse_query(qstr3)

print(q1)
# Xapian::Query((B中:(pos=1) AND S中:(pos=1) AND
#                B中文:(pos=1) AND S中文:(pos=1) AND
#                B文:(pos=1) AND S文:(pos=1)))

print(q2)
# Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))

print(q3)
# Xapian::Query(((Bhello:(pos=1) OR Shello:(pos=1)) AND
#                (Bworld:(pos=2) OR Sworld:(pos=2))))

The parsed queries for qstr2 and qstr3 are correct, but the parsed query q1 for qstr1 (the CJK query string without a prefix) is wrong: all of its terms are combined with OP_AND regardless of prefix. As the output shows, the same tokenized CJK term (e.g., 中) is combined with OP_AND across the prefixes (B and S here), when it should be combined with OP_OR. This is what causes the CJK search problem in mu, which gives me wrong or empty results.
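To illustrate why the flat OP_AND shape tends to return nothing, here is a minimal pure-Python sketch. It does not call Xapian at all: it models a document as the set of prefixed terms indexed for it, takes the three n-gram terms from the q1 output above as given, and uses a hypothetical helper name (`matches_flat_and`).

```python
# Hypothetical model, not Xapian code: a document is the set of prefixed
# terms that were indexed for it.
NGRAMS = ["中", "中文", "文"]  # the terms seen in the q1 output above
PREFIXES = ["B", "S"]         # both prefixes mapped to the empty field name


def matches_flat_and(doc_terms):
    """Query shape reported here: every term is required under every prefix."""
    return all(p + t in doc_terms for t in NGRAMS for p in PREFIXES)


# A message whose body contains 中文 but whose subject does not.
body_only = {"B中", "B中文", "B文"}
# A message with 中文 in both the body and the subject.
both_fields = body_only | {"S中", "S中文", "S文"}

print(matches_flat_and(body_only))    # False
print(matches_flat_and(both_fields))  # True
```

Under this model, a message only matches when the text occurs in every unprefixed field at once, which is rarely what an unprefixed search means.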


The expected parsed query for qstr1 should look like this:

Xapian::Query(((B中:(pos=1) OR S中:(pos=1)) AND
               (B中文:(pos=1) OR S中文:(pos=1)) AND
               (B文:(pos=1) OR S文:(pos=1))))

where each tokenized CJK term is combined with OP_OR across the prefixes, and the resulting groups are then combined with OP_AND across the terms.

Alternatively, the query could look like this (i.e., as if qstr1 = "b:中文 OR s:中文" in the above example):

Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
               (S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))

which seems to be more intuitive and maybe more logical to me.
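The two candidate shapes are not interchangeable. The following pure-Python sketch (no Xapian calls; documents are modeled as sets of prefixed terms, and all function names are hypothetical) shows a case where they disagree. It assumes the n-gram tokenizer emits unigrams plus overlapping bigrams, which reproduces the three terms shown for 中文 but is otherwise an assumption about longer input.

```python
PREFIXES = ["B", "S"]


def ngrams(text):
    """Unigrams plus overlapping bigrams (assumed tokenizer behaviour)."""
    terms = []
    for i, ch in enumerate(text):
        terms.append(ch)
        if i + 1 < len(text):
            terms.append(text[i:i + 2])
    return terms


def index(prefix, text):
    """Prefixed terms a field with this text would contribute to a document."""
    return {prefix + t for t in ngrams(text)}


def or_per_term(doc_terms, query):
    """First shape: OR across prefixes for each term, then AND across terms."""
    return all(any(p + t in doc_terms for p in PREFIXES) for t in ngrams(query))


def and_per_prefix(doc_terms, query):
    """Second shape: AND across terms for each prefix, then OR across prefixes."""
    return any(all(p + t in doc_terms for t in ngrams(query)) for p in PREFIXES)


# Body contains 中文 and subject contains 文字; neither field contains 中文字.
doc = index("B", "中文") | index("S", "文字")

print(ngrams("中文"))                 # ['中', '中文', '文']
print(or_per_term(doc, "中文字"))     # True
print(and_per_prefix(doc, "中文字"))  # False
```

In this model the first shape can match a document in which no single field contains the query text, while the second shape only matches when one field contains all of it.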


Environment:

  • Linux: Debian, testing, amd64
  • Xapian: libxapian22v5, version 1.2.23
  • python-xapian: version 1.2.23-1
  • environment variable: XAPIAN_CJK_NGRAM=1

Best regards!

Aly

Change History (9)

comment:1 by Aaron LI, 8 years ago

Description: modified (diff)

Explained the CJK query parsing issue more clearly.

comment:2 by Olly Betts, 8 years ago

Milestone: 1.4.x
Status: new → assigned

I think I agree that this is the better option:

Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
               (S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))

This case seems more analogous to a phrase search, which we already handle in a similar way:

>>> print(qp.parse_query('"hello world"'))
Query(((Bhello@1 PHRASE 2 Bworld@2) OR (Shello@1 PHRASE 2 Sworld@2)))

Though either would be an improvement.

Marking for 1.4.x (once fixed there we can consider backporting).

comment:3 by Olly Betts, 8 years ago

(I tested and the same issue is present in current git master).

comment:4 by Olly Betts, 8 years ago

Milestone: 1.4.x → 1.4.1

Let's try for 1.4.1.

comment:5 by Olly Betts, 8 years ago

Milestone: 1.4.1 → 1.2.25
Operating System: Linux → All
Resolution: fixed
Status: assigned → closed

Fixed in git master [a5b11842a6469ea37205455500094af3d7db85ec], backported to RELEASE/1.4 as [758a4f93a076f7aabb991a67b200cb3568fa4cdd] and svn/1.2 as [d724b413089329038e0cee05190b8c38cd794cc0]. So the fix will be in 1.4.1 and 1.2.25.

comment:6 by Aaron LI, 7 years ago

Thanks for your hard work!

I received this notification some time ago, but my distro (Gentoo Linux) still does not include the fixed release...

Once I get the fixed release, I will test again and report back.

Cheers,

Aly

comment:7 by Olly Betts, 7 years ago

Neither 1.2.25 nor 1.4.1 has been released yet. 1.4.1 should be soon though.

comment:8 by Aaron LI, 7 years ago

I recently upgraded to Xapian v1.4.1, and the Chinese query parser works as expected. Here is the new and correct behavior:

import xapian

xapian.version_string()
# '1.4.1'

qp = xapian.QueryParser()
qp.add_prefix("subject", "S")
qp.add_prefix("s", "S")
qp.add_prefix("body", "B")
qp.add_prefix("b", "B")
qp.add_prefix("", "B")
qp.add_prefix("", "S")

qstr1 = "中文"
q1 = qp.parse_query(qstr1)
print(q1)
# Query(((B中@1 AND B中文@1 AND B文@1) OR (S中@1 AND S中文@1 AND S文@1)))

I also rebuilt a recent mu (see issue https://github.com/djcb/mu/issues/123), and now the Chinese search works.

Thank you!

comment:9 by Olly Betts, 7 years ago

Great, thanks for confirming.
