wiki:GSoC2011/ChineseSegmentationAnalysis/Journal

Community Bonding Week 1: April 25-May 1

Community Bonding Week 2: May 2-May 8

Community Bonding Week 3: May 9-May 15

Community Bonding Week 4: May 16-May 22

Coding Week 1: May 23-May 29

May 26

using chinese segmentation This file is needed. https://trac.xapian.org/browser/trunk/xapian-bindings/java/native/unordered_map.h

May 27

using libunicode, read .dic file into dictionary, convert utf_8 string to unicode string ,and store it,using hash_map.

/this one is abolished, using xapian/unicode.h instead

Coding Week 2: May 30-June 5

May 30 - June 2:

using map, hash_map to organize the dictionary, the efficiency is not test yet

the organization of dictionary in txt is one line per word so ,just read one line , convert into unicode, and read into the word .

current problem: how to convert unicode unsigned int into string, so it can be stored.

Coding Week 3: June 6-June 12

June 9: try to find the fasted way to read files into memory

Coding Week 4: June 13-June 19

Coding Week 5: June 20-June 26

Coding Week 6: June 27-July 3

there are two different algorithm in my project, and the second one is better

algorithm one:

1: the general idea of the algorithm:

step 1: load dictionary words from file: compiled-base.dic

  1. the encoding format is UTF-8
  1. the structure of this file is: one line indicate a Chinese word

c: the words are sorted these ways:

1)words first sorted by unicode value of the first character

2)when the first character all same, words are sorted by the size, the front words contain two character , then three character, then four, then more than four

3)if first character and the size are all same, words are sorted by unicode value

4)if first character are all same, the words whose size are more than 5 are sorted by the each character's unicode value, such as AAAABB, AAAABC, AAAABCCC, AAAAC, and so on.

load dictionary words is just read words line by line into : string * ascWords

step 2: organize the data structure

  1. divide the ascWords, which may contain more than 200,0000 words
  1. first divided by the first character, all the words begin with same characters is put into FirstLevelDictionary?, which means each character map to one FirstLevelDictionary?
  1. divide the words begin with same characters by size, put them into four sub dictionary: twoWordDic, threeWordDic, fourWordDic,multWordDic,

  1. sub dictionary will be either SecondLevelDictionary? or DBinaryDictionary

5: DBinaryDictionary is a dictionary which store the begin index and the end index of this dictionary, which will be searched using binary search

6: SecondLevelDictionary?: to organize this dictionary, there are two situations:

  1. if the word size is 2(which is 6 of the UTF-8 length), then each second character will be different , so using map<int, string> to indicate this word

  1. if the word size is more than 2, then , there will be several words contains same second character, so just put these words which contains same second character into a DBinaryDictionary

step 3. segment the input string

  1. from the beginning, if the first character is in the dictionary , such as A, has a relative FirstLevelDictionary?, pick it out

  1. get the maxlength, which is the biggest length of words which begin with the A.

  1. search the sub dictionary to check out whether it is in the subdictionary

  1. DBinaryDictionary search, if it is the multWordDic, then first find out which position the five character string is in the dictionary, maybe it is in the dictionary , maybe not, then check the next word until word length is shorter than the former one, in this case, word is started with other second or third character.
  1. SecondLevelDictionary? search is just the opposite of how to organize the structure
  1. if certain length word exist, then put it into vector<string> output

  1. if no word is found, just move into next chinese character

algorithm two:(which is same as Paoding)

step 1:load dictionary: dictionary is different : t-base.dic

t-base.dic: this file contains same words as compiled-base.dic, but in different order

order: first sorted by unicode value of first character, then second character, then third character, and so on

such as, A, AA, AB, ABC, AC, ACDDG, ACE, BA, BDEG, BF......

step 2: organize the dictionary

first divide the ascWords by first character, and create a sub dicitonary for each character, if the number of words begin with same characters is less than 16, then put them into a BinaryDictionary?; if not, divide them by second character, and then check whether the number of words is less than 16, and so on, until every sub dictinary contains less than 16 words.

step 3: search the dictionary

according the first character to find out the sub dictionary

the stuff remains is using do some extra work before the segmentation, such as deal with names, numbers

and do some extra work after the segmentation

the reason i kept two algorithm is for further improvement, such as check the input string backward, in such situation , the first algorithm might be better, this is just and idea, i haven't tested it yet.

Coding Week 7: July 4-July 10

July 4 - July 5 deal with name:

the Chinese name combined by next two part: family name and first name. Unlike English name, the Chinese first name put behind the family name, and it can't be recognised like English name by the lower case or upper case. And generally, the first name contains one or two characters, it is very rare to have three characters as first name. The Chinese family names are fixed, the file x-confucian-family-name.dic contains almost every family name. The Chinese family names generally contain one character, some contain two character.

the general idea of deal with name is below:

when some strings cannot be found on dictionary, then check whether they contain some names.

first , load family files: x-confucian-family-name.dic into HashDictionary? familyNameDic

then, check whether the unfounded string contains a family name, if it is, add the family name and next one or two character into vector<string > ouput;

check the the string left to see whether it contains a title, if a family name is followed by a title, we can assume that a family name and the title combined into a name.

if between a family name and a title has one or two characters, we can assume the family name and the middle strings are combined into a name.

July 6: finish coding about how to recognize names

July 7: deal with numbers

numbers can be written in different ways in Chinese, and if it is written totally by Chinese character, it contains countless numbers

First: Chinese number style

English contains words like: hundred, thousand, million, billion, and so on

the same pattern in Chinese is, "十"(ten), "百"(hundred), "千"(thousand), "万"(ten thousand), "亿"(hundred million)

in English, numbers will be written by ways 300,000,000,

in Chinese numbers will be written like 3,0000,0000. it use "万"(ten thousand) and "亿"(hundred million) to form a big number,

also , in Chinese, numbers contains 零(0), 一(1), 二(2), 三(3), 四(4), 五(5), 六(6), 七(7), 八(8), 九(9)

xapian-core/queryparser/termgenerator_internal.cc

Coding Week 8: July 11-July 17 (Midterm)

finish the code about how to deal with numbers

first, segment the input string into Chinese string and none Chinese string

dealing with Chinese numbers which means check whether a piece of string A contains Chinese number.

check whether a character is a Chinese number-related character, if it is then search the next character until it is not a Chinese number-related character

then if this Chinese number is started at the beginning of the string, the input string before it should also be checked, because some numbers has already existed in the dictionary, so it might be segmented incorrect. For example, one hundred and one point three, in Chinese one point is a term in dictionary, so the first segment will cut the string into one hundred and/one point/three. So when deal with "one hundred and ",it should search the followed string one point three, and when search three, it should search the one hundred and one point. it will be redundant, but there will be not occurred often.

deal with Arab number or Arab number combined with Chinese number character.

when Arab number followed with '-', ',','.','/', it will be considered as accepted term. ',', '.' existed in number, and '-', and '/' are used in date.

when the none Chinese string ended with Arab number, check the followed character to find out whether it is a Chinese number related character, then, combined with the Arab number with the followed Chinese number as a term.

Coding Week 9: July 18-July 24

dealing with words that frequent exist in one file , but there are not in dictionaries

words like this exists quite often

the step to deal with such words is below:

step 1: count all the characters from substrings which are not recoginized by the dictionary, count every characters' existing number

step 2: if the characters exists more than three times, it maybe a frequency character in a frequency word, pick up such characters

step 3: check the unrecoginized substrings, if a frequency character followed by another frequency character, combined them may become a word, record all possible word

step 4: count all possible word, if a word exits more than twice, it be recognised a word

step 5: check all the original substrings to record the segmentation

July 21. so far , this step is done.

because the result form is changed, so it need to change the original result form, so they can be combined together

Coding Week 10: July 25-July 31

compiler the libxapian.lib and testsuite.lib

Coding Week 11: August 1-August 7

testing the result and write test file

Coding Week 12: August 8-August 14

August 9: add code about 2-gram segment 2-gram: if Chinese string is : ABC, then split the string AB/BC

if Chinese string is : ABCD, then split the string AB/BC/CD, and so on.

2-gram only worked after segmentation,chinese numbers, chinese names, freqency words.

Coding Week 13: August 15-August 22 (Final evaluation)

Last modified 4 years ago Last modified on 26/01/16 10:10:43