Contents
- Some Definitions
- Reasons for Preprocessing
- Preprocessing Steps
- Sentence Segmentation
- Binary Classifier
- Word Tokenization: English
- Word Tokenization: Chinese
- Word Tokenization: German
- Subword Tokenization
- Word Normalization
- Inflectional Morphology
- Lemmatization
- Derivational Morphology
- Stemming
- Fixing Spelling Errors
- Other Word Normalization
- Stopwords
Some Definitions
- Word: Sequence of characters with a meaning and/or function
- Sentence: Sequence of words
- Document: One or more sentences
- Corpus: A collection of documents
- Word token: Each instance of a word
- Word type: Distinct words
- Lexicon (“dictionary”): A group of word types
- E.g. (a short counting sketch in Python follows this example):
- Sentence:
"The student is enrolled at the University of Melbourne."
- Word: 9 words in the sentence above:
["The", "student", "is", "enrolled", "at", "the", "University", "of", "Melbourne"]
- Word Token: 9 word tokens in the sentence above:
["the", "student", "is", "enrolled", "at", "the", "university", "of", "melbourne"]
- Word Type: 8 word types in the sentence above:
["the", "student", "is", "enrolled", "at", "university", "of", "melbourne"]
Reasons for Preprocessing
- Most NLP applications have documents as inputs
- Language is compositional. As humans, we can break these documents into individual components; to understand language, a computer should do the same
- Preprocessing is the first step in breaking documents into individual components
Preprocessing Steps
- Remove unwanted formatting. E.g. HTML tags
- Sentence Segmentation: Break documents into sentences
- Word Tokenization: Break sentences into words
- Word Normalization: Transform words into canonical forms
- Stopword removal: Delete unwanted words
- E.g. Sample Document (a rough code sketch of these steps follows this example):
“Hi there. I’m TARS.”
- Step 1: “Hi there. I’m TARS.”
- Step 2: [“Hi there.”, “I’m TARS.”]
- Step 3: [[“Hi”, “there”, “.”], [“I”, “'m”, “TARS”, “.”]]
- Step 4: [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
- Step 5: [[], [“tars”]]
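
A rough sketch of the five steps using NLTK is shown below. The NLTK calls (`sent_tokenize`, `word_tokenize`, `stopwords`) are standard, but the normalization step here is just lowercasing plus a small hand-written contraction map rather than a real lemmatizer, and the exact Step 5 output depends on the stopword list used (“hi”, for instance, is not in NLTK's default English list):

```python
import re
import nltk
from nltk.corpus import stopwords

# Requires: pip install nltk, then nltk.download('punkt') and nltk.download('stopwords')

doc = "<p>Hi there. I'm TARS.</p>"

# Step 1: remove unwanted formatting (strip HTML tags)
text = re.sub(r"<[^>]+>", "", doc)

# Step 2: sentence segmentation
sentences = nltk.sent_tokenize(text)

# Step 3: word tokenization
tokenized = [nltk.word_tokenize(s) for s in sentences]

# Step 4: word normalization (lowercase + expand a few contractions)
contractions = {"'m": "am", "n't": "not", "'re": "are"}
normalized = [[contractions.get(w.lower(), w.lower()) for w in sent]
              for sent in tokenized]

# Step 5: stopword and punctuation removal
stop = set(stopwords.words("english"))
cleaned = [[w for w in sent if w.isalpha() and w not in stop]
           for sent in normalized]

print(cleaned)  # e.g. [['hi'], ['tars']] with NLTK's default stopword list
```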
Sentence Segmentation
- Naïve approach: break on sentence punctuation marks. E.g.
[.?!]
- Problem: Some punctuation marks are also used in abbreviations. E.g. “U.S. dollar”, “Yahoo!”
- Second approach: Use a regex that requires a capital letter after the punctuation. E.g.
[.?!] [A-Z]
- Problem: Abbreviations of names also match this pattern. E.g. “Mr. Brown” (see the regex sketch at the end of this section)
- Better approach: Have lexicons of known abbreviations
- Problem: Difficult to enumerate all names and abbreviations
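
A quick illustrative sketch of the two naive approaches above, using Python's `re` module (the patterns are deliberately simple, not a robust segmenter):

```python
import re

text = "Yahoo! shares rose. Mr. Brown paid 5 U.S. dollars. He left."

# Naive approach: break on every sentence punctuation mark [.?!]
naive = re.split(r"[.?!]\s*", text)
# ['Yahoo', 'shares rose', 'Mr', 'Brown paid 5 U', 'S', 'dollars', 'He left', '']
# Abbreviations such as "Yahoo!", "Mr." and "U.S." are wrongly split.

# Second approach: only break where the punctuation is followed by a capital letter
better = re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)
# ['Yahoo! shares rose.', 'Mr.', 'Brown paid 5 U.S. dollars.', 'He left.']
# "U.S. dollars" and "Yahoo! shares" are now handled, but "Mr. Brown" still
# triggers a false sentence break, as noted above.

print(naive)
print(better)
```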