#741 closed defect (fixed)
"Empty termnames aren't allowed" by indexing text in Arabic
Reported by: | Kelson | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.4.2 |
Component: | Library API | Version: | 1.4.1 |
Severity: | normal | Keywords: | |
Cc: | Assem | Blocked By: | |
Blocking: | | Operating System: | Linux |
Description
When indexing the Arabic text in the attached file with index_text_without_positions(), Xapian::TermGenerator throws an "Empty termnames aren't allowed" Xapian::InvalidArgumentError exception. Indexing otherwise works fine with other Arabic texts, so this one seems to contain something "special". This might be a similar problem to the old bug #106.
Info:
- Xapian 1.4.1
- Using the Arabic stemmer (indexing works without it)
- Xapian::TermGenerator::STEM_ALL as stemming_strategy
Attachments (1)
Change History (14)
by , 8 years ago
comment:1 by , 8 years ago
Component: | Other → Library API |
---|---|
Milestone: | → 1.4.2 |
Status: | new → assigned |
I suspect the arabic stemmer is producing an empty stem for some input word which is in that file.
Assuming that's the cause, I feel that's a bug in this stemmer, but given we allow user-implemented stemmers, perhaps we ought to quietly ignore this anyway too.
comment:2 by , 8 years ago
@Olly Do you mean I can safely ignore "Xapian::InvalidArgumentError" exceptions here?
We use the stemmer of Xapian, here is the corresponding piece of code:
    /* Build ICU Local object to retrieve ISO-639 language code (from ISO-639-3) */
    icu::Locale *languageLocale = new icu::Locale(language.c_str());
    /* Configuring language base steemming */
    try {
        this->stemmer = Xapian::Stem(languageLocale->getLanguage());
        this->indexer.set_stemmer(this->stemmer);
        this->indexer.set_stemming_strategy(Xapian::TermGenerator::STEM_ALL);
    } catch (...) {
        std::cout << "No steemming for language '" << languageLocale->getLanguage() << "'" << std::endl;
    }
comment:3 by , 8 years ago
Do you mean I can safely ignore Xapian::InvalidArgumentError exceptions here?
You can if you're happy to skip the document containing the problematic text.
/* Configuring language base steemming */
[...]
std::cout << "No steemming for language '" << languageLocale->getLanguage() << "'" << std::endl;
"steemming" should be "stemming" there.
comment:5 by , 8 years ago
Cc: | added |
---|
The problematic word in wrong.txt consists of a single 'ARABIC TATWEEL' (U+0640) character, which indeed stems to an empty string.
I'd argue that's a bug in the Arabic stemmer (I've Cc:-ed assem who wrote that algorithm - what do you think, Assem?)
But we should handle this case better (especially as we support user-implemented stemming algorithms). At the minimum the error message should be improved, but I think overall it makes sense to just skip empty stems if they arise.
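A minimal sketch of the "skip empty stems" idea, not the actual Xapian change. It assumes a hypothetical `Stemmer` callable standing in for any stemmer (built-in or user-implemented) and a hypothetical helper `stemmed_terms`; the only detail taken from Xapian is the "Z" prefix it puts on stemmed terms.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a stemmer: any callable mapping a word to its stem.
using Stemmer = std::function<std::string(const std::string&)>;

// Returns one "Z"-prefixed term per word, skipping words whose stem is empty
// instead of trying to add an empty term name.
std::vector<std::string> stemmed_terms(const std::vector<std::string>& words,
                                       const Stemmer& stem) {
    assert(stem && "a callable stemmer is required");
    std::vector<std::string> terms;
    for (const std::string& word : words) {
        const std::string stemmed = stem(word);
        if (stemmed.empty()) continue;  // quietly ignore empty stems
        terms.push_back("Z" + stemmed);
    }
    return terms;
}
```

With a stemmer that maps some word to the empty string, that word simply contributes no stemmed term and the rest of the document is still indexed.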
comment:6 by , 8 years ago
@Olly
The Arabic stemmer does normalization before stemming, which is why it removes the 'ARABIC TATWEEL' (U+0640) character, which is used to make words longer without losing the shape of the word. In the wrong.txt case, the tatweel strangely appears alone (perhaps mistaken for a dash or underscore). I think a lone TATWEEL should be treated like a lone dot "." and should never be tokenized as an independent term.
What do you suggest as a solution for this: doing normalization before tokenization, or changing the tokenization so it does not generate the term in the first place?
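To illustrate why a lone tatweel ends up as an empty string, here is a small sketch of tatweel removal on UTF-8 input. The function name `strip_tatweel` is hypothetical and this is not the Snowball implementation; it only relies on U+0640 being encoded in UTF-8 as the bytes 0xD9 0x80.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-8 encoding of U+0640 ARABIC TATWEEL.
const std::string kTatweel = "\xD9\x80";

// Removes every tatweel from a UTF-8 string (hypothetical helper).
std::string strip_tatweel(const std::string& utf8) {
    std::string out;
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        if (utf8.compare(pos, kTatweel.size(), kTatweel) == 0) {
            pos += kTatweel.size();  // drop the two-byte tatweel
        } else {
            out += utf8[pos++];      // copy every other byte unchanged
        }
    }
    return out;
}
```

A word that consists only of tatweel characters normalizes to the empty string, which is the case that triggered the exception.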
comment:7 by , 8 years ago
Thanks for the insights - it's not a problem if a lone tatweel is ignored then.
Unicode classes U+0640 as "Letter, Modifier" (Lm) and the tokeniser treats all subclasses of "Letter" the same way. It could reject words that are comprised entirely of modifiers, but it looks like UAX#29 (Unicode Text Segmentation) also treats Lm the same as other Letter subcategories, and a quick experiment with ICU on "x ـ x" seems to confirm this. It seems rash to deviate from that without knowing a lot more than I do about the details of the various Lm characters in all the different scripts. (Also it would mean that "cute" stuff like ᴾᴼᴿᵀᴱᴿ wouldn't be indexed!)
To handle this outside the stemmer I think it's best to just quietly ignore empty stems.
Looking at the Arabic algorithm, I notice it seems overly aggressive in its removal of non-letters in general, unlike the other snowball stemmers which generally leave non-words alone (your example of "." stems to "." with all the other language stemmers I tried). While pure punctuation strings are perhaps a bit esoteric, leaving non-words alone generally seems a sensible approach - one problematic case with the current Arabic stemming algorithm is that real numbers lose their decimal point - e.g. 20.16 -> 2016
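The decimal point problem can be reproduced with a small sketch. It assumes a hypothetical normalization step, `strip_ascii_punctuation`, that deletes common ASCII punctuation the way an aggressive stemmer might; it is not the Snowball code.

```cpp
#include <cassert>
#include <string>

// Deletes common ASCII punctuation from a word (hypothetical helper).
std::string strip_ascii_punctuation(const std::string& word) {
    static const std::string kPunct = ".,;:?!/*%\\\"";
    std::string out;
    for (char c : word) {
        if (kPunct.find(c) == std::string::npos) out += c;  // keep non-punctuation
    }
    return out;
}
```

Under this assumption "20.16" becomes "2016", and a lone "." becomes an empty string rather than being left alone.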
comment:8 by , 8 years ago
The normalization done within the Arabic stemmer consists of:
- Strip vocalization marks
- Convert Eastern Arabic numerals (٠١٢٣٤٥٦٧٨٩) to Western Arabic numerals (0123456789)
- Convert shaped letters to their independent Unicode form
- Separate LAM-ALEF into independent LAM and ALEF (some systems still store them as a single symbol)
- Remove Kashida == ARABIC TATWEEL
'{_}' ( delete ) // strip kasheeda
- Remove punctuation marks
    // Punctuation marks
    '.' ',' ';' ':' '?' '!' '/' '*' '%' '\' '"' ( delete ) // General
    '{,}' '{;}' '{?}' ( delete ) // Arabic-specific
For punctuation marks, please suggest which ones you think are better kept. Just a note: in Arabic we generally use "," as the decimal mark, and it is not used as punctuation ("،" is the comma). Still, some people just use the English decimal mark ".".
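As an illustration of one of the normalization steps listed above, here is a sketch of converting Eastern Arabic numerals to Western ones on UTF-8 input. The function name `to_western_digits` is hypothetical and this is not the Snowball implementation; it relies only on U+0660 to U+0669 being encoded as the bytes 0xD9 0xA0 to 0xD9 0xA9.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replaces Eastern Arabic digits (U+0660..U+0669) with ASCII '0'..'9'
// and copies all other bytes unchanged (hypothetical helper).
std::string to_western_digits(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead == 0xD9 && i + 1 < utf8.size()) {
            const unsigned char trail = static_cast<unsigned char>(utf8[i + 1]);
            if (trail >= 0xA0 && trail <= 0xA9) {
                out += static_cast<char>('0' + (trail - 0xA0));
                i += 2;  // consumed one two-byte digit
                continue;
            }
        }
        out += utf8[i++];
    }
    return out;
}
```

Other characters, including the Arabic decimal separator U+066B, pass through untouched in this sketch.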
comment:9 by , 8 years ago
To handle this outside the stemmer I think it's best to just quietly ignore empty stems.
Implemented in b717dc0f3b9074cec38bc3f1cb0dff778bd44b73 on git master. Needs backporting for 1.4.2, and maybe considering for the next 1.2.x release (1.2.x doesn't have the arabic stemmer, but does support user stemmers).
For punctuation marks, suggest what you think it's better to keep them
I think it's probably better for the stemmers to leave punctuation alone as a general rule - the tokeniser should already have handled removing it where it isn't wanted. There may be a few language-specific exceptions for special cases (the English stemmer has some special handling for a "'s" suffix for example - e.g. "king's" and "king" really should be conflated).
comment:10 by , 8 years ago
Please check if the tokeniser is handling those arabic punctuation marks:
    hex '060C' // ARABIC COMMA
    hex '061B' // ARABIC SEMICOLON
    hex '061F' // ARABIC QUESTION
    hex '066a' // ARABIC PERCENT
    hex '066b' // ARABIC DECIMAL
    hex '066c' // ARABIC THOUSANDS SEPARATOR
comment:11 by , 8 years ago
    $ perl -CO -e 'print "A\x{60c}B\x{61b}C\x{61f}D\x{66a}E\x{66b}F\x{66c}G"'|examples/simpleindex ar.db
    $ xapian-delve ar.db -r1
    Term List for record #1: Za Zb Zc Zd Ze Zf Zg a b c d e f g
So Xapian currently splits up words at all those characters (simpleindex is currently hard-wired to use the English stemmer, so it's not that the Arabic stemmer is stripping them).
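The splitting behaviour seen in that experiment can be mimicked with a rough sketch. It assumes a hard-coded list of the six code points from the previous comment as word separators; Xapian's real tokeniser works from Unicode character properties, so the hypothetical `split_at_arabic_punctuation` below only illustrates the observable result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits a UTF-8 string at six Arabic punctuation characters (hypothetical helper).
std::vector<std::string> split_at_arabic_punctuation(const std::string& utf8) {
    static const std::array<std::string, 6> kSeparators = {{
        "\xD8\x8C",  // U+060C ARABIC COMMA
        "\xD8\x9B",  // U+061B ARABIC SEMICOLON
        "\xD8\x9F",  // U+061F ARABIC QUESTION MARK
        "\xD9\xAA",  // U+066A ARABIC PERCENT SIGN
        "\xD9\xAB",  // U+066B ARABIC DECIMAL SEPARATOR
        "\xD9\xAC",  // U+066C ARABIC THOUSANDS SEPARATOR
    }};
    std::vector<std::string> words;
    std::string current;
    std::size_t i = 0;
    while (i < utf8.size()) {
        bool matched = false;
        for (const std::string& sep : kSeparators) {
            if (utf8.compare(i, sep.size(), sep) == 0) {
                if (!current.empty()) words.push_back(current);  // close the word
                current.clear();
                i += sep.size();
                matched = true;
                break;
            }
        }
        if (!matched) current += utf8[i++];
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

Fed the same bytes as the perl one-liner, this yields the seven single-letter words A to G, matching the unstemmed terms in the xapian-delve listing once lowercased.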
comment:12 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Backported for 1.4.2 in 832f6a8b746247095c8cf6733ecb181b5c65601c.
Assem has opened a PR on the snowball repo for altering the punctuation handling (thanks for that - will get to that soon I hope), so the rest of this is now being handled at https://github.com/snowballstem/snowball/pull/50 and I'm closing this ticket.
comment:13 by , 8 years ago
Thank you very much for all the good work. Will test the fix as soon as 1.4.2 is released.
Text file in Arabic to index