#719 closed defect (fixed)
Tokenized CJK query terms wrongly combined with respect to prefixes
Reported by: | Aaron LI | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.25 |
Component: | QueryParser | Version: | 1.2.23 |
Severity: | normal | Keywords: | CJK, prefix |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
I first came across this issue when querying CJK text with mu (https://github.com/djcb/mu) and reported it there (https://github.com/djcb/mu/issues/123#issuecomment-180999233). However, after investigating both mu and xapian further, I found that it is a bug in xapian.
Here I demonstrate the issue with python-xapian:
import xapian

qp = xapian.QueryParser()
qp.add_prefix("subject", "S")
qp.add_prefix("s", "S")
qp.add_prefix("body", "B")
qp.add_prefix("b", "B")
# Unprefixed terms should be searched in both body and subject.
qp.add_prefix("", "B")
qp.add_prefix("", "S")
qstr1 = "中文"
qstr2 = "b:中文"
qstr3 = "hello AND world"
q1 = qp.parse_query(qstr1)
q2 = qp.parse_query(qstr2)
q3 = qp.parse_query(qstr3)
print(q1)
# Xapian::Query((B中:(pos=1) AND S中:(pos=1) AND
#                B中文:(pos=1) AND S中文:(pos=1) AND
#                B文:(pos=1) AND S文:(pos=1)))
print(q2)
# Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))
print(q3)
# Xapian::Query(((Bhello:(pos=1) OR Shello:(pos=1)) AND
#                (Bworld:(pos=2) OR Sworld:(pos=2))))
The parsed queries for qstr2 and qstr3 are right, but the parsed query q1 for qstr1 (the CJK query string without a prefix) is wrong: all of its prefixed terms are combined with OP_AND.
As we can see, the same tokenized CJK term (e.g., 中) is combined with OP_AND across the prefixes (i.e., B and S here), when it should be combined with OP_OR, as happens for the non-CJK terms in q3.
As a result, CJK searches in mu give me wrong or empty results.
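To illustrate why the AND-combined form tends to return nothing, here is a minimal sketch in plain Python. It does not use the Xapian API: it simply models a document as a set of prefixed terms and evaluates both query shapes against it. The example document and the variable names are hypothetical; the terms follow the n-gram output shown above.

```python
# Toy model, not Xapian: a document is the set of prefixed terms indexed for it.
# Hypothetical message whose body contains "中文" but whose subject does not.
doc_terms = {"B中", "B中文", "B文", "Shello"}

tokens = ["中", "中文", "文"]   # n-gram tokens of the query "中文"
prefixes = ["B", "S"]           # both prefixes are mapped to the empty field name

# Buggy shape: one flat AND, so every token is required under every prefix.
buggy_match = all(p + t in doc_terms for t in tokens for p in prefixes)

# Expected shape: each token may match under either prefix (OR),
# and all tokens are required (AND).
expected_match = all(any(p + t in doc_terms for p in prefixes) for t in tokens)

print(buggy_match)     # False
print(expected_match)  # True
```

Under the buggy shape, a message matches only if the text appears in both the body and the subject, which is rarely the case.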
The expected parsed query for qstr1 should look like this:
Xapian::Query(((B中:(pos=1) OR S中:(pos=1)) AND (B中文:(pos=1) OR S中文:(pos=1)) AND (B文:(pos=1) OR S文:(pos=1))))
Here each tokenized CJK term is combined with OP_OR across the prefixes, and the resulting groups are then combined with OP_AND across the tokenized terms.
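The nesting described above can be sketched as a small string builder. This is plain Python for illustration only, not how the QueryParser builds queries internally; the function name `or_within_term` is hypothetical and positions are omitted for brevity.

```python
def or_within_term(tokens, prefixes):
    """OR the prefixed forms of each token, then AND the per-token groups."""
    groups = ["(" + " OR ".join(p + t for p in prefixes) + ")" for t in tokens]
    return "(" + " AND ".join(groups) + ")"

print(or_within_term(["中", "中文", "文"], ["B", "S"]))
# ((B中 OR S中) AND (B中文 OR S中文) AND (B文 OR S文))
```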
Alternatively, the query could look like this (i.e., equivalent to qstr1 = "b:中文 OR s:中文" for the above example):
Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR (S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
which seems more intuitive, and perhaps more logical, to me.
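The two candidate shapes are not always interchangeable. The sketch below, again a plain-Python toy model rather than Xapian code, compares them on a three-character query. It assumes that the tokenizer emits unigrams and bigrams in the same pattern as the output above; the helper names and the example message are hypothetical.

```python
def ngram_tokens(text):
    """Unigrams and bigrams in reading order (assumed to mirror the output above)."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i + 1 < len(text):
            out.append(text[i:i + 2])
    return out

def index_fields(fields):
    """Toy index: map {prefix: text} to a set of prefixed n-gram terms."""
    return {p + t for p, text in fields.items() for t in ngram_tokens(text)}

def match_or_per_term(terms, tokens, prefixes):
    # (B t1 OR S t1) AND (B t2 OR S t2) AND ...
    return all(any(p + t in terms for p in prefixes) for t in tokens)

def match_and_per_prefix(terms, tokens, prefixes):
    # (B t1 AND B t2 AND ...) OR (S t1 AND S t2 AND ...)
    return any(all(p + t in terms for t in tokens) for p in prefixes)

prefixes = ["B", "S"]
tokens = ngram_tokens("中文字")
# Hypothetical message: body "中文", subject "文字"; neither field has "中文字".
terms = index_fields({"B": "中文", "S": "文字"})

print(match_or_per_term(terms, tokens, prefixes))     # True
print(match_and_per_prefix(terms, tokens, prefixes))  # False
```

In this model the per-term OR shape matches a message in which no single field contains the whole string, while the per-prefix shape requires all tokens to come from the same field, which is closer to what a user searching for a CJK word would expect.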
Environment:
- Linux: Debian, testing, amd64
- Xapian: libxapian22v5, version 1.2.23
- python-xapian: version 1.2.23-1
- environment variable: XAPIAN_CJK_NGRAM=1
Best regards!
Aly
Change History (9)
comment:1 by , 9 years ago
Description: | modified (diff) |
---|
comment:2 by , 9 years ago
Milestone: | → 1.4.x |
---|---|
Status: | new → assigned |
I think I agree that this is the better option:
Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR (S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
This case seems more analogous to a phrase search, which we already handle in that way:
>>> print(qp.parse_query('"hello world"'))
Query(((Bhello@1 PHRASE 2 Bworld@2) OR (Shello@1 PHRASE 2 Sworld@2)))
Though either would be an improvement.
Marking for 1.4.x (once fixed there we can consider backporting).
comment:5 by , 8 years ago
Milestone: | 1.4.1 → 1.2.25 |
---|---|
Operating System: | Linux → All |
Resolution: | → fixed |
Status: | assigned → closed |
Fixed in git master [a5b11842a6469ea37205455500094af3d7db85ec], backported to RELEASE/1.4 as [758a4f93a076f7aabb991a67b200cb3568fa4cdd] and svn/1.2 as [d724b413089329038e0cee05190b8c38cd794cc0]. So the fix will be in 1.4.1 and 1.2.25.
comment:6 by , 8 years ago
Thanks for your hard work!
I received this notification some time ago, but my distro (Gentoo Linux) still does not include the fixed release...
Once I get the fixed release, I will test again and report back.
Cheers,
Aly
comment:7 by , 8 years ago
Neither 1.2.25 nor 1.4.1 has been released yet. 1.4.1 should be out soon though.
comment:8 by , 8 years ago
I recently upgraded to Xapian v1.4.1, and the Chinese query parser works as expected. Here is the new and correct behavior:
xapian.version_string()
# '1.4.1'
qp = xapian.QueryParser()
qp.add_prefix("subject", "S")
qp.add_prefix("s", "S")
qp.add_prefix("body", "B")
qp.add_prefix("b", "B")
qp.add_prefix("", "B")
qp.add_prefix("", "S")
qstr1 = "中文"
q1 = qp.parse_query(qstr1)
print(q1)
# Query(((B中@1 AND B中文@1 AND B文@1) OR (S中@1 AND S中文@1 AND S文@1)))
I also rebuilt a recent mu (see issue https://github.com/djcb/mu/issues/123), and Chinese search now works.
Thank you!