Table of Contents
- Community Bonding Week 1: April 25-May 1
- Community Bonding Week 2: May 2-May 8
- Community Bonding Week 3: May 9-May 15
- Community Bonding Week 4: May 16-May 22
- Coding Week 1: May 23-May 29
- Coding Week 2: May 30-June 5
- Coding Week 3: June 6-June 12
- Coding Week 4: June 13-June 19
- Coding Week 5: June 20-June 26
- Coding Week 6: June 27-July 3
- Coding Week 7: July 4-July 10
- Coding Week 8: July 11-July 17 (Midterm)
- Coding Week 9: July 18-July 24
- Coding Week 10: July 25-July 31
- Coding Week 11: August 1-August 7
- Coding Week 12: August 8-August 14
- Coding Week 13: August 15-August 22 (Final evaluation)
Community Bonding Week 1: April 25-May 1
Community Bonding Week 2: May 2-May 8
Community Bonding Week 3: May 9-May 15
Community Bonding Week 4: May 16-May 22
Coding Week 1: May 23-May 29
using chinese segmentation This file is needed. https://trac.xapian.org/browser/trunk/xapian-bindings/java/native/unordered_map.h
using libunicode, read .dic file into dictionary, convert utf_8 string to unicode string ,and store it,using hash_map.
/this one is abolished, using xapian/unicode.h instead
Coding Week 2: May 30-June 5
May 30 - June 2:
using map, hash_map to organize the dictionary, the efficiency is not test yet
the organization of dictionary in txt is one line per word so ,just read one line , convert into unicode, and read into the word .
current problem: how to convert unicode unsigned int into string, so it can be stored.
Coding Week 3: June 6-June 12
June 9: try to find the fasted way to read files into memory
Coding Week 4: June 13-June 19
Coding Week 5: June 20-June 26
Coding Week 6: June 27-July 3
there are two different algorithm in my project, and the second one is better
1: the general idea of the algorithm:
step 1: load dictionary words from file: compiled-base.dic
- the encoding format is UTF-8
- the structure of this file is: one line indicate a Chinese word
c: the words are sorted these ways:
1)words first sorted by unicode value of the first character
2)when the first character all same, words are sorted by the size, the front words contain two character , then three character, then four, then more than four
3)if first character and the size are all same, words are sorted by unicode value
4)if first character are all same, the words whose size are more than 5 are sorted by the each character's unicode value, such as AAAABB, AAAABC, AAAABCCC, AAAAC, and so on.
load dictionary words is just read words line by line into : string * ascWords
step 2: organize the data structure
- divide the ascWords, which may contain more than 200,0000 words
- first divided by the first character, all the words begin with same characters is put into FirstLevelDictionary, which means each character map to one FirstLevelDictionary
- divide the words begin with same characters by size, put them into four sub dictionary: twoWordDic, threeWordDic, fourWordDic,multWordDic,
- sub dictionary will be either SecondLevelDictionary or DBinaryDictionary
5: DBinaryDictionary is a dictionary which store the begin index and the end index of this dictionary, which will be searched using binary search
6: SecondLevelDictionary: to organize this dictionary, there are two situations:
- if the word size is 2(which is 6 of the UTF-8 length), then each second character will be different , so using map<int, string> to indicate this word
- if the word size is more than 2, then , there will be several words contains same second character, so just put these words which contains same second character into a DBinaryDictionary
step 3. segment the input string
- from the beginning, if the first character is in the dictionary , such as A, has a relative FirstLevelDictionary, pick it out
- get the maxlength, which is the biggest length of words which begin with the A.
- search the sub dictionary to check out whether it is in the subdictionary
- DBinaryDictionary search, if it is the multWordDic, then first find out which position the five character string is in the dictionary, maybe it is in the dictionary , maybe not, then check the next word until word length is shorter than the former one, in this case, word is started with other second or third character.
- SecondLevelDictionary search is just the opposite of how to organize the structure
- if certain length word exist, then put it into vector<string> output
- if no word is found, just move into next chinese character
algorithm two:(which is same as Paoding)
step 1：load dictionary: dictionary is different : t-base.dic
t-base.dic: this file contains same words as compiled-base.dic, but in different order
order: first sorted by unicode value of first character, then second character, then third character, and so on
such as, A, AA, AB, ABC, AC, ACDDG, ACE, BA, BDEG, BF......
step 2: organize the dictionary
first divide the ascWords by first character, and create a sub dicitonary for each character, if the number of words begin with same characters is less than 16, then put them into a BinaryDictionary; if not, divide them by second character, and then check whether the number of words is less than 16, and so on, until every sub dictinary contains less than 16 words.
step 3: search the dictionary
according the first character to find out the sub dictionary
the stuff remains is using do some extra work before the segmentation, such as deal with names, numbers
and do some extra work after the segmentation
the reason i kept two algorithm is for further improvement, such as check the input string backward, in such situation , the first algorithm might be better, this is just and idea, i haven't tested it yet.
Coding Week 7: July 4-July 10
July 4 - July 5 deal with name:
the Chinese name combined by next two part: family name and first name. Unlike English name, the Chinese first name put behind the family name, and it can't be recognised like English name by the lower case or upper case. And generally, the first name contains one or two characters, it is very rare to have three characters as first name. The Chinese family names are fixed, the file x-confucian-family-name.dic contains almost every family name. The Chinese family names generally contain one character, some contain two character.
the general idea of deal with name is below:
when some strings cannot be found on dictionary, then check whether they contain some names.
first , load family files: x-confucian-family-name.dic into HashDictionary familyNameDic
then, check whether the unfounded string contains a family name, if it is, add the family name and next one or two character into vector<string > ouput;
check the the string left to see whether it contains a title, if a family name is followed by a title, we can assume that a family name and the title combined into a name.
if between a family name and a title has one or two characters, we can assume the family name and the middle strings are combined into a name.
July 6: finish coding about how to recognize names
July 7: deal with numbers
numbers can be written in different ways in Chinese, and if it is written totally by Chinese character, it contains countless numbers
First: Chinese number style
English contains words like: hundred, thousand, million, billion, and so on
the same pattern in Chinese is, "十"(ten), "百"(hundred), "千"(thousand), "万"(ten thousand), "亿"(hundred million)
in English, numbers will be written by ways 300,000,000,
in Chinese numbers will be written like 3,0000,0000. it use "万"(ten thousand) and "亿"(hundred million) to form a big number,
also , in Chinese, numbers contains 零(0), 一(1), 二(2), 三(3), 四(4), 五(5), 六(6), 七(7), 八(8), 九(9)
Coding Week 8: July 11-July 17 (Midterm)
finish the code about how to deal with numbers
first, segment the input string into Chinese string and none Chinese string
dealing with Chinese numbers which means check whether a piece of string A contains Chinese number.
check whether a character is a Chinese number-related character, if it is then search the next character until it is not a Chinese number-related character
then if this Chinese number is started at the beginning of the string, the input string before it should also be checked, because some numbers has already existed in the dictionary, so it might be segmented incorrect. For example, one hundred and one point three, in Chinese one point is a term in dictionary, so the first segment will cut the string into one hundred and/one point/three. So when deal with "one hundred and ",it should search the followed string one point three, and when search three, it should search the one hundred and one point. it will be redundant, but there will be not occurred often.
deal with Arab number or Arab number combined with Chinese number character.
when Arab number followed with '-', ',','.','/', it will be considered as accepted term. ',', '.' existed in number, and '-', and '/' are used in date.
when the none Chinese string ended with Arab number, check the followed character to find out whether it is a Chinese number related character, then, combined with the Arab number with the followed Chinese number as a term.
Coding Week 9: July 18-July 24
dealing with words that frequent exist in one file , but there are not in dictionaries
words like this exists quite often
the step to deal with such words is below:
step 1: count all the characters from substrings which are not recoginized by the dictionary, count every characters' existing number
step 2: if the characters exists more than three times, it maybe a frequency character in a frequency word, pick up such characters
step 3: check the unrecoginized substrings, if a frequency character followed by another frequency character, combined them may become a word, record all possible word
step 4: count all possible word, if a word exits more than twice, it be recognised a word
step 5: check all the original substrings to record the segmentation
July 21. so far , this step is done.
because the result form is changed, so it need to change the original result form, so they can be combined together
Coding Week 10: July 25-July 31
compiler the libxapian.lib and testsuite.lib
Coding Week 11: August 1-August 7
testing the result and write test file
Coding Week 12: August 8-August 14
August 9: add code about 2-gram segment 2-gram: if Chinese string is : ABC, then split the string AB/BC
if Chinese string is : ABCD, then split the string AB/BC/CD, and so on.
2-gram only worked after segmentation,chinese numbers, chinese names, freqency words.