Word Tokenization
Lemma: same stem, part of speech, and rough word sense (cat and cats share one lemma)
Wordform: the full inflected surface form (cat and cats are two different wordforms)
Type: an element of the vocabulary
Token: an instance of that type in running text
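To make the type/token distinction concrete, here is a minimal sketch that counts both in one sentence. The `tokenize` helper is a hypothetical, deliberately naive tokenizer (whitespace split, edge punctuation stripped, lowercased); it is an assumption for illustration, not a recommended tokenizer.

```python
# Punctuation characters stripped from the edges of each whitespace-separated chunk.
EDGE_PUNCT = ".,!?;:\"'"

def tokenize(text):
    """Naive tokenizer: split on whitespace, strip edge punctuation, lowercase."""
    tokens = []
    for chunk in text.split():
        word = chunk.strip(EDGE_PUNCT).lower()
        if word:  # skip chunks that were pure punctuation
            tokens.append(word)
    return tokens

text = "They picnicked by the pool, then lay back on the grass and looked at the stars."

tokens = tokenize(text)   # every running-text instance
types = set(tokens)       # distinct vocabulary elements

num_tokens = len(tokens)
num_types = len(types)
print(num_tokens, num_types)
```

With this tokenizer the sentence has 16 tokens but only 14 types, because "the" occurs three times. A different tokenization policy (for example, keeping case or keeping punctuation as tokens) would give different counts.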
Word Normalization and Stemming
Normalization
Case folding
Lemmatization
Morphology
Stemming
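The difference between case folding and stemming can be sketched in a few lines. The rules in `toy_stem` below are a simplified, hand-picked subset of suffix-stripping rules, loosely in the spirit of the first step of the Porter stemmer; they are an assumption for illustration and not the full Porter algorithm. The function names are hypothetical.

```python
def case_fold(token):
    """Case folding: map every token to lowercase."""
    return token.lower()

def toy_stem(token):
    """Crude suffix stripping; the result need not be a dictionary word."""
    if token.endswith("sses"):
        return token[:-2]            # caresses -> caress
    if token.endswith("ies"):
        return token[:-2]            # ponies -> poni
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]            # walking -> walk (but "sing" is left alone)
    if token.endswith("ed") and len(token) > 4:
        return token[:-2]            # walked -> walk
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]            # cats -> cat
    return token

def normalize(token):
    """Case-fold first, then stem."""
    return toy_stem(case_fold(token))

for word in ["Caresses", "Ponies", "Walking", "Cats", "Sing"]:
    print(word, "->", normalize(word))
```

Note that stemming chops affixes without consulting a dictionary, so "ponies" becomes "poni". Lemmatization, by contrast, would map it to the dictionary headword "pony", which requires morphological analysis rather than string rules alone.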
Sentence Segmentation and Decision Trees
Sentence Segmentation
Deciding whether a word ends a sentence: Decision Tree
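A decision tree for end-of-sentence detection can be read as a chain of yes/no questions about a word and its neighbor. The sketch below hand-codes one such tree; the questions, their order, and the `ABBREVIATIONS` set are assumptions chosen for illustration. In practice the tree structure would typically be learned from labeled data rather than written by hand.

```python
# Hypothetical, very small abbreviation list; a real system would need a larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "inc.", "e.g.", "i.e.", "etc."}

def is_sentence_end(word, next_word):
    """Walk a hand-written decision tree for one word."""
    # Q1: does the word end in "?" or "!"? These are relatively unambiguous.
    if word.endswith(("?", "!")):
        return True
    # Q2: does the word end in a period at all?
    if not word.endswith("."):
        return False
    # Q3: is the word a known abbreviation?
    if word.lower() in ABBREVIATIONS:
        return False
    # Q4: is this the last word, or is the next word capitalized?
    if next_word is None:
        return True
    return next_word[0].isupper()

def split_sentences(text):
    """Group whitespace-separated words into sentences using the tree above."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        next_word = words[i + 1] if i + 1 < len(words) else None
        if is_sentence_end(word, next_word):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words with no closing punctuation
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("Dr. Smith arrived. He paid 4.3 dollars! Was it fair?")
for sentence in result:
    print(sentence)
```

The period in "Dr." is rejected by the abbreviation question, and the period inside "4.3" is never tested because the word does not end in a period. This tree still fails on cases such as an abbreviation that genuinely ends a sentence, which is one reason to learn the features and thresholds from data.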