Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#741 closed defect (fixed)

"Empty termnames aren't allowed" by indexing text in Arabic

Reported by: Kelson
Owned by: Olly Betts
Priority: normal
Milestone: 1.4.2
Component: Library API
Version: 1.4.1
Severity: normal
Keywords:
Cc: Assem
Blocked By:
Blocking:
Operating System: Linux

Description

When the Arabic text in the attached file is indexed with index_text_without_positions(), Xapian::TermGenerator throws an "Empty termnames aren't allowed" Xapian::InvalidArgumentError exception. Indexing otherwise works fine with other Arabic texts, so this one seems to contain something "special". This might be a similar problem to the old bug #106.

Info:

  • Xapian 1.4.1
  • Using the Arabic stemmer (indexing works without it)
  • Xapian::TermGenerator::STEM_ALL as stemming_strategy

Attachments (1)

wrong.txt (2.0 KB ) - added by Kelson 6 years ago.
Text file in Arabic to index


Change History (14)

by Kelson, 6 years ago

Attachment: wrong.txt added

Text file in Arabic to index

comment:1 by Olly Betts, 6 years ago

Component: Other → Library API
Milestone: 1.4.2
Status: new → assigned

I suspect the Arabic stemmer is producing an empty stem for some input word in that file.

Assuming that's the cause, I feel that's a bug in this stemmer, but given we allow user-implemented stemmers, perhaps we ought to quietly ignore this anyway too.

comment:2 by Kelson, 6 years ago

@Olly Do you mean I can safely ignore "Xapian::InvalidArgumentError" exceptions here?

We use the stemmer of Xapian, here is the corresponding piece of code:

/* Build ICU Locale object to retrieve ISO-639 language code (from
   ISO-639-3) */
icu::Locale *languageLocale = new icu::Locale(language.c_str());

/* Configuring language base steemming */
try {
   this->stemmer = Xapian::Stem(languageLocale->getLanguage());
   this->indexer.set_stemmer(this->stemmer);
   this->indexer.set_stemming_strategy(Xapian::TermGenerator::STEM_ALL);
} catch (...) {
   std::cout << "No steemming for language '" << languageLocale->getLanguage() << "'" << std::endl;
}

comment:3 by Olly Betts, 5 years ago

> Do you mean I can safely ignore Xapian::InvalidArgumentError exceptions here?

You can if you're happy to skip the document containing the problematic text.

> /* Configuring language base steemming */
>
> [...]
>
> std::cout << "No steemming for language '" << languageLocale->getLanguage() << "'" << std::endl;

"steemming" should be "stemming" there.

comment:4 by Kelson, 5 years ago

@Olly Thank you for the answer.

comment:5 by Olly Betts, 5 years ago

Cc: Assem added

The problematic word in wrong.txt consists of a single 'ARABIC TATWEEL' (U+0640) character, which indeed stems to an empty string.

I'd argue that's a bug in the Arabic stemmer (I've Cc:-ed Assem, who wrote that algorithm - what do you think, Assem?)

But we should handle this case better (especially as we support user-implemented stemming algorithms). At the minimum the error message should be improved, but I think overall it makes sense to just skip empty stems if they arise.
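A minimal sketch of the "skip empty stems" idea follows. It is not the actual Xapian change: `Stemmer` and `stemmed_terms` are hypothetical names, and the stemmer is assumed to be any callable that maps a word to its stem. The "Z" prefix mirrors how stemmed terms show up in a Xapian term list.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a stemmer: any callable mapping a word to its stem.
using Stemmer = std::function<std::string(const std::string&)>;

// Build the stemmed terms for a sequence of words. A word whose stem comes
// back empty is quietly skipped instead of being reported as an error.
std::vector<std::string> stemmed_terms(const std::vector<std::string>& words,
                                       const Stemmer& stem) {
    std::vector<std::string> terms;
    for (const std::string& word : words) {
        const std::string stemmed = stem(word);
        if (stemmed.empty()) {
            continue;  // e.g. a lone tatweel that the stemmer reduces to ""
        }
        terms.push_back("Z" + stemmed);  // "Z" marks a stemmed term
    }
    return terms;
}
```

With a toy stemmer that returns an empty string for a lone U+0640 and the word itself otherwise, the words "a", "ـ", "b" yield only the terms "Za" and "Zb".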

comment:6 by Assem, 5 years ago

@Olly

The Arabic stemmer performs normalization before stemming, which is why it removes the 'ARABIC TATWEEL' (U+0640) character, which is used to make words longer without losing the shape of the word. In the wrong.txt case, the tatweel strangely appears on its own (probably confused with a dash or underscore). I think a lone TATWEEL should be treated like a lone dot "." and should never be tokenized as an independent term.

What do you suggest as a solution for this: doing normalization before tokenization, or changing tokenization so that it does not generate the term in the first place?

Last edited 5 years ago by Assem

comment:7 by Olly Betts, 5 years ago

Thanks for the insights - it's not a problem if a lone tatweel is ignored then.

Unicode classes U+0640 as "Letter, Modifier" (Lm) and the tokeniser treats all subclasses of "Letter" the same way. It could reject words that are composed entirely of modifiers, but it looks like UAX#29 (Unicode Text Segmentation) also treats Lm the same as other Letter subcategories, and a quick experiment with ICU on "x ـ x" seems to confirm this. It seems rash to deviate from that without knowing a lot more than I do about the details of the various Lm characters in all the different scripts. (Also it would mean that "cute" stuff like ᴾᴼᴿᵀᴱᴿ wouldn't be indexed!)

To handle this outside the stemmer I think it's best to just quietly ignore empty stems.

Looking at the Arabic algorithm, I notice it seems overly aggressive in its removal of non-letters in general, unlike the other Snowball stemmers, which generally leave non-words alone (your example of "." stems to "." with all the other language stemmers I tried). While pure punctuation strings are perhaps a bit esoteric, leaving non-words alone generally seems a sensible approach. One problematic case with the current Arabic stemming algorithm is that real numbers lose their decimal point, e.g. 20.16 -> 2016.
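The decimal point problem can be reproduced with a small stand-alone sketch. This is not the Snowball code: `strip_punctuation` is a hypothetical helper, and its default punctuation set is an assumption made for illustration.

```cpp
#include <string>

// Return `word` with every character that appears in `punctuation` removed.
// The default set is an illustrative assumption, not taken from any library.
std::string strip_punctuation(const std::string& word,
                              const std::string& punctuation = ".,;:?!/*%\\\"") {
    std::string out;
    for (char c : word) {
        if (punctuation.find(c) == std::string::npos) {
            out.push_back(c);  // keep anything that is not punctuation
        }
    }
    return out;
}
```

Here `strip_punctuation("20.16")` returns "2016", and `strip_punctuation(".")` returns an empty string, which is the same kind of empty result that triggered this ticket.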

comment:8 by Assem, 5 years ago

The normalization done within the Arabic stemmer covers:

  • Strip vocalization marks
  • Convert Eastern Arabic numerals (٠١٢٣٤٥٦٧٨٩) to Western Arabic numerals (0123456789)
  • Convert shaped letters to their independent Unicode forms
  • Separate LAM-ALEF into independent LAM and ALEF (some systems still store them as a single symbol)
  • Remove kashida (ARABIC TATWEEL)
       '{_}' ( delete ) // strip kasheeda

  • Remove punctuation marks
       // Punctuation marks
       '.' ',' ';' ':' '?' '!' '/' '*' '%' '\' '"' ( delete ) // General
       '{,}' '{;}' '{?}' ( delete ) // Arabic-specific
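Two of the steps listed above (tatweel removal and numeral conversion) can be sketched as follows. This is an illustrative subset operating on UTF-32 text, not the Snowball implementation, and `normalize_sketch` is a hypothetical name.

```cpp
#include <string>

// Illustrative subset of the normalization: drop ARABIC TATWEEL (U+0640) and
// map Eastern Arabic digits (U+0660..U+0669) to the ASCII digits '0'..'9'.
std::u32string normalize_sketch(const std::u32string& word) {
    constexpr char32_t kTatweel = 0x0640;
    constexpr char32_t kEasternZero = 0x0660;
    constexpr char32_t kEasternNine = 0x0669;

    std::u32string out;
    for (char32_t c : word) {
        if (c == kTatweel) {
            continue;  // kashida only stretches the word, so it is dropped
        }
        if (c >= kEasternZero && c <= kEasternNine) {
            out.push_back(static_cast<char32_t>(U'0' + (c - kEasternZero)));
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

A word consisting of a single tatweel normalizes to an empty string under this sketch, which matches the behaviour reported in this ticket.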

For punctuation marks, suggest what you think it's better to keep them. Just a note: in Arabic we generally use "," as the decimal mark, which is not used as punctuation (the comma is "،"). Yet some people just use the English decimal mark ".".

Last edited 5 years ago by Assem

comment:9 by Olly Betts, 5 years ago

> To handle this outside the stemmer I think it's best to just quietly ignore empty stems.

Implemented in b717dc0f3b9074cec38bc3f1cb0dff778bd44b73 on git master. This needs backporting for 1.4.2, and is maybe worth considering for the next 1.2.x release (1.2.x doesn't have the Arabic stemmer, but does support user stemmers).

> For punctuation marks, suggest what you think it's better to keep them

I think it's probably better for the stemmers to leave punctuation alone as a general rule - the tokeniser should already have handled removing it where it isn't wanted. There may be a few language-specific exceptions for special cases (the English stemmer has some special handling for a 's suffix for example - e.g. "king's" and "king" really should be conflated).

comment:10 by Assem, 5 years ago

Please check whether the tokeniser handles these Arabic punctuation marks:

hex '060C' // ARABIC COMMA
hex '061B' // ARABIC SEMICOLON
hex '061F' // ARABIC QUESTION MARK
hex '066A' // ARABIC PERCENT SIGN
hex '066B' // ARABIC DECIMAL SEPARATOR
hex '066C' // ARABIC THOUSANDS SEPARATOR

Last edited 5 years ago by Assem

comment:11 by Olly Betts, 5 years ago

$ perl -CO -e 'print "A\x{60c}B\x{61b}C\x{61f}D\x{66a}E\x{66b}F\x{66c}G"'|examples/simpleindex ar.db
$ xapian-delve ar.db -r1 
Term List for record #1: Za Zb Zc Zd Ze Zf Zg a b c d e f g

So Xapian currently splits up words at all those characters (simpleindex is currently hard-wired to use the English stemmer, so it's not that the Arabic stemmer is stripping them).
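The splitting behaviour seen in that term list can be imitated with a small sketch. This is not Xapian's tokeniser: `is_arabic_punctuation` and `split_words` are hypothetical helpers that only know the six code points listed in comment:10, and unlike Xapian they neither lowercase the words nor add stemmed terms.

```cpp
#include <string>
#include <vector>

// True for the six Arabic punctuation code points discussed in this ticket.
bool is_arabic_punctuation(char32_t c) {
    switch (c) {
        case 0x060C:  // ARABIC COMMA
        case 0x061B:  // ARABIC SEMICOLON
        case 0x061F:  // ARABIC QUESTION MARK
        case 0x066A:  // ARABIC PERCENT SIGN
        case 0x066B:  // ARABIC DECIMAL SEPARATOR
        case 0x066C:  // ARABIC THOUSANDS SEPARATOR
            return true;
        default:
            return false;
    }
}

// Split UTF-32 text into words at those code points, dropping empty pieces.
std::vector<std::u32string> split_words(const std::u32string& text) {
    std::vector<std::u32string> words;
    std::u32string current;
    for (char32_t c : text) {
        if (is_arabic_punctuation(c)) {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

Feeding it the same test string as the perl one-liner gives seven single-letter words, A through G, one per segment between the punctuation marks.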

comment:12 by Olly Betts, 5 years ago

Resolution: fixed
Status: assigned → closed

Backported for 1.4.2 in 832f6a8b746247095c8cf6733ecb181b5c65601c.

Assem's opened a PR on the snowball repo for altering the punctuation handling (thanks for that - I hope to get to it soon), so the rest of this is now being handled at https://github.com/snowballstem/snowball/pull/50 and I'm closing this ticket.

comment:13 by Kelson, 5 years ago

Thank you very much for all the good work. I will test the fix as soon as 1.4.2 is released.
