#719 closed defect (fixed)
Tokenized CJK query terms wrongly combined with respect to prefixes
| Reported by: | Aaron LI | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.2.25 |
| Component: | QueryParser | Version: | 1.2.23 |
| Severity: | normal | Keywords: | CJK, prefix |
| Cc: | | Blocked By: | |
| Blocking: | | Operating System: | All |
Description (last modified by )
I first came across this issue when querying CJK text with mu (https://github.com/djcb/mu) and reported it there (https://github.com/djcb/mu/issues/123#issuecomment-180999233). However, after further investigation into mu and Xapian, I found that it is a bug in Xapian.
Here I demonstrate this issue with python-xapian:
import xapian

qp = xapian.QueryParser()
qp.add_prefix("subject", "S")
qp.add_prefix("s", "S")
qp.add_prefix("body", "B")
qp.add_prefix("b", "B")
qp.add_prefix("", "B")
qp.add_prefix("", "S")
qstr1 = "中文"
qstr2 = "b:中文"
qstr3 = "hello AND world"
q1 = qp.parse_query(qstr1)
q2 = qp.parse_query(qstr2)
q3 = qp.parse_query(qstr3)
print(q1)
# Xapian::Query((B中:(pos=1) AND S中:(pos=1) AND
# B中文:(pos=1) AND S中文:(pos=1) AND
# B文:(pos=1) AND S文:(pos=1)))
print(q2)
# Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))
print(q3)
# Xapian::Query(((Bhello:(pos=1) OR Shello:(pos=1)) AND
# (Bworld:(pos=2) OR Sworld:(pos=2))))
The parsed queries for qstr2 and qstr3 are correct. The parsed query q1 for qstr1 (the CJK query string without a prefix) is wrong: all of its terms are combined with OP_AND, regardless of prefix.
As we can see, the copies of each tokenized CJK term under the different prefixes (e.g., B中 and S中) are combined with OP_AND, whereas they should be combined with OP_OR, as happens for the non-CJK terms in q3.
As a result, q1 only matches documents that contain the text under every default prefix (both B and S here), which is why CJK searches in mu give me wrong or empty results.
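To make the consequence concrete, here is a toy model in plain Python (not the Xapian API). Each hypothetical document is reduced to the set of prefixed terms indexed for it, and the two functions mimic the buggy all-AND query and the expected query. The document names and the `match_*` helpers are made up for illustration only.

```python
# Toy model, NOT the Xapian API: a document is the set of terms indexed for it.
docs = {
    "body_only": {"B中", "B中文", "B文"},      # 中文 appears only in the body
    "subject_only": {"S中", "S中文", "S文"},   # 中文 appears only in the subject
    "both": {"B中", "B中文", "B文", "S中", "S中文", "S文"},
}

ngrams = ["中", "中文", "文"]   # terms produced for the query "中文"
prefixes = ["B", "S"]           # the two default prefixes from the example


def match_buggy(terms):
    # Shape seen in 1.2.23: every prefixed n-gram is ANDed together.
    return all(p + g in terms for g in ngrams for p in prefixes)


def match_expected(terms):
    # Expected shape: OR across prefixes per n-gram, AND across n-grams.
    return all(any(p + g in terms for p in prefixes) for g in ngrams)


buggy_hits = sorted(name for name, terms in docs.items() if match_buggy(terms))
expected_hits = sorted(name for name, terms in docs.items() if match_expected(terms))
print(buggy_hits)
print(expected_hits)
```

In this model only the document that has the text in both fields survives the buggy query, which is consistent with the wrong or empty results described above.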
The expected parsed query for qstr1 looks like this:
Xapian::Query(((B中:(pos=1) OR S中:(pos=1)) AND
(B中文:(pos=1) OR S中文:(pos=1)) AND
(B文:(pos=1) OR S文:(pos=1))))
Here the copies of each tokenized CJK term are combined with OP_OR across the prefixes, and the resulting groups are then combined with OP_AND across the terms.
Alternatively, the query could look like this (i.e., as if qstr1 were "b:中文 OR s:中文" in the above example):
Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
(S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
which seems more intuitive, and perhaps more logical, to me.
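The two candidate shapes are not equivalent in general. The sketch below, again a plain-Python toy model rather than Xapian code, builds both shapes as strings and then evaluates them against one hypothetical document. The helper names, the three-character query "中文字", and its assumed n-grams are illustrative assumptions, not taken from Xapian.

```python
PREFIXES = ["B", "S"]


def per_term_shape(ngrams):
    # OR across prefixes for each n-gram, then AND across n-grams.
    groups = ["(" + " OR ".join(p + g for p in PREFIXES) + ")" for g in ngrams]
    return "(" + " AND ".join(groups) + ")"


def per_prefix_shape(ngrams):
    # AND across n-grams for each prefix, then OR across prefixes.
    groups = ["(" + " AND ".join(p + g for g in ngrams) + ")" for p in PREFIXES]
    return "(" + " OR ".join(groups) + ")"


def per_term_match(doc_terms, ngrams):
    return all(any(p + g in doc_terms for p in PREFIXES) for g in ngrams)


def per_prefix_match(doc_terms, ngrams):
    return any(all(p + g in doc_terms for g in ngrams) for p in PREFIXES)


print(per_term_shape(["中", "中文", "文"]))
print(per_prefix_shape(["中", "中文", "文"]))

# Hypothetical document: the body holds "中文" and the subject holds "文字".
split_doc = {"B中", "B中文", "B文", "S文", "S文字", "S字"}
query_ngrams = ["中", "中文", "文", "文字", "字"]  # assumed n-grams for "中文字"
print(per_term_match(split_doc, query_ngrams))
print(per_prefix_match(split_doc, query_ngrams))
```

In this model the per-term shape accepts a document whose body and subject each contain only part of the query text, while the per-prefix shape requires the whole text within a single field.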
Environment:
- Linux: Debian, testing, amd64
- Xapian: libxapian22v5, version 1.2.23
- python-xapian: version 1.2.23-1
- environment variable: XAPIAN_CJK_NGRAM=1
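As background on where the terms 中, 中文 and 文 come from: with XAPIAN_CJK_NGRAM set, the parsed queries above are consistent with the CJK text being split into single characters plus overlapping two-character n-grams. The function below is a rough plain-Python model of that behaviour, assuming the input is CJK only; `cjk_ngrams` is a hypothetical name and this is not how Xapian implements it.

```python
def cjk_ngrams(text, n=2):
    """Return every n-gram of length 1..n, ordered by starting character.

    Simplified model: assumes every character in `text` is CJK.
    """
    terms = []
    for start in range(len(text)):
        for size in range(1, n + 1):
            if start + size <= len(text):
                terms.append(text[start:start + size])
    return terms


print(cjk_ngrams("中文"))
print(cjk_ngrams("中文字"))
```

For "中文" this yields the same three terms, in the same order, as the parsed queries shown in this ticket.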
Best regards!
Aly
Change History (9)
comment:1 by , 10 years ago
| Description: | modified (diff) |
|---|
comment:2 by , 10 years ago
| Milestone: | → 1.4.x |
|---|---|
| Status: | new → assigned |
I think I agree that this is the better option:
Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
(S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
This case seems more analogous to a phrase search, which we already handle in a similar way:
>>> print(qp.parse_query('"hello world"'))
Query(((Bhello@1 PHRASE 2 Bworld@2) OR (Shello@1 PHRASE 2 Sworld@2)))
Though either would be an improvement.
Marking for 1.4.x (once fixed there we can consider backporting).
comment:5 by , 9 years ago
| Milestone: | 1.4.1 → 1.2.25 |
|---|---|
| Operating System: | Linux → All |
| Resolution: | → fixed |
| Status: | assigned → closed |
Fixed in git master [a5b11842a6469ea37205455500094af3d7db85ec], backported to RELEASE/1.4 as [758a4f93a076f7aabb991a67b200cb3568fa4cdd] and svn/1.2 as [d724b413089329038e0cee05190b8c38cd794cc0]. So the fix will be in 1.4.1 and 1.2.25.
comment:6 by , 9 years ago
Thanks for your hard work!
I received this notification some time ago, but my distro (Gentoo Linux) still does not include the fixed release...
Once I get the fixed release, I will test again and report back.
Cheers,
Aly
comment:7 by , 9 years ago
Neither 1.2.25 nor 1.4.1 has been released yet. 1.4.1 should be out soon though.
comment:8 by , 9 years ago
I recently upgraded to Xapian v1.4.1, and the Chinese query parser works as expected. Here is the new and correct behavior:
xapian.version_string()
# '1.4.1'
qp = xapian.QueryParser()
qp.add_prefix("subject", "S")
qp.add_prefix("s", "S")
qp.add_prefix("body", "B")
qp.add_prefix("b", "B")
qp.add_prefix("", "B")
qp.add_prefix("", "S")
qstr1 = "中文"
q1 = qp.parse_query(qstr1)
print(q1)
# Query(((B中@1 AND B中文@1 AND B文@1) OR (S中@1 AND S中文@1 AND S文@1)))
I also rebuilt a recent mu (see issue https://github.com/djcb/mu/issues/123), and Chinese search now works.
Thank you!
