Contents
- Some Definitions
- Reasons for Preprocessing
- Preprocessing Steps
- Sentence Segmentation
- Binary Classifier
- Word Tokenization: English
- Word Tokenization: Chinese
- Word Tokenization: German
- Subword Tokenization
- Word Normalization
- Inflectional Morphology
- Lemmatization
- Derivational Morphology
- Stemming
- Fixing Spelling Errors
- Other Word Normalization
- Stopwords
Some Definitions
- Words: Sequence of characters with a meaning and/or function
- Sentence: Sequence of words
- Document: One or more sentences
- Corpus: A collection of documents
- Word token: Each instance of a word
- Word type: Distinct words
- Lexicon (“dictionary”): A group of word types
- E.g.:
- Sentence:
"The student is enrolled at the University of Melbourne."
- Word: 9 words in the sentence above:
["The", "student", "is", "enrolled", "at", "the", "University", "of", "Melbourne"]
- Word Token: 9 word tokens in the sentence above:
["the", "student", "is", "enrolled", "at", "the", "university", "of", "melbourne"]
- Word Type: 8 word types in the sentence above:
["the", "student", "is", "enrolled", "at", "university", "of", "melbourne"]
Reasons for Preprocessing
- Most NLP applications have documents as inputs
- Language is compositional. As humans, we can break these documents into individual components. To understand language, a computer should do the same
- Preprocessing is the first step in breaking documents into individual components
Preprocessing Steps
- Remove unwanted formatting. E.g. HTML tags
- Sentence Segmentation: Break documents into sentences
- Word Tokenization: Break sentences into words
- Word Normalization: Transform words into canonical forms
- Stopword Removal: Delete unwanted words
- E.g. Sample Document: “Hi there. I’m TARS.”
- Step 1: “Hi there. I’m TARS.”
- Step 2: [“Hi there.”, “I’m TARS.”]
- Step 3: [[“Hi”, “there”, “.”], [“I”, “'m”, “TARS”, “.”]]
- Step 4: [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
- Step 5: [[], [“tars”]]
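A minimal sketch of the five steps in Python, using NLTK for segmentation and tokenization. The contraction map and the stopword list are illustrative choices, not fixed parts of the pipeline; the sketch assumes NLTK's punkt and stopwords data are installed.

```python
# A minimal sketch of the five preprocessing steps using NLTK.
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)      # "punkt_tab" on newer NLTK versions
nltk.download("stopwords", quiet=True)

doc = "Hi there. I'm TARS."                      # Step 1: raw document (no HTML to strip here)
sents = nltk.sent_tokenize(doc)                  # Step 2: ["Hi there.", "I'm TARS."]
tokens = [nltk.word_tokenize(s) for s in sents]  # Step 3: [["Hi", "there", "."], ["I", "'m", "TARS", "."]]

# Step 4: normalization (lowercase plus a toy contraction map)
CONTRACTIONS = {"'m": "am", "n't": "not"}
norm = [[CONTRACTIONS.get(t.lower(), t.lower()) for t in s] for s in tokens]

# Step 5: stopword (and punctuation) removal; the exact result depends on
# the stopword list used: NLTK's drops "there" but keeps "hi", for instance
stop = set(stopwords.words("english"))
print([[t for t in s if t.isalpha() and t not in stop] for s in norm])
```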
Sentence Segmentation
Sentence Segmentation
- Naïve approach: break on sentence punctuation. E.g.
[.?!]
- Problem: Some punctuation marks are used in abbreviations. E.g. “U.S. dollar”, “Yahoo!”
- Second approach: Use a regex that requires a capital letter after the punctuation. E.g.
[.?!] [A-Z]
- Problem: Abbreviated names also match this pattern. E.g. “Mr. Brown”
- Better approach: Use lexicons
- Problem: Difficult to enumerate all names and abbreviations
- State-of-the-art approach: Use machine learning rather than rules
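Both rule-based approaches can be sketched with Python regexes; the sample text here is invented to show where each breaks.

```python
import re

text = "Yahoo! is a company. Mr. Brown works at the U.S. embassy. He is here."

# Naive: split right after any sentence punctuation.
print(re.split(r"(?<=[.?!])\s+", text))
# Wrongly breaks after "Yahoo!", "Mr." and "U.S." as well.

# Second approach: only split when a capital letter follows.
print(re.split(r"(?<=[.?!])\s+(?=[A-Z])", text))
# Fixes "Yahoo! is" and "U.S. embassy", but still splits "Mr. Brown" apart.
```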
Binary Classifier
- Looks at every “.” and decides whether it is the end of a sentence
- Applicable models: decision trees, logistic regression
- Features:
- Look at the words before and after “.”
- Word shapes:
- Uppercase
- Lowercase
- ALL_CAPS
- Number
- Character length
- Part-of-speech tags: determiners tend to start a sentence
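A toy version of such a classifier using scikit-learn's LogisticRegression. The word-shape features and the handful of labelled pairs are invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

def features(prev_word, next_word):
    """Word-shape features around a candidate '.'."""
    return [
        len(prev_word),                   # character length
        prev_word.isupper(),              # ALL_CAPS, e.g. "U.S"
        prev_word[:1].isupper(),          # previous word capitalised?
        next_word[:1].isupper(),          # next word capitalised?
        next_word in {"The", "A", "He"},  # crude determiner/pronoun cue
    ]

# ((word before ".", word after "."), label); 1 = end of sentence
train = [(("dollar", "The"), 1), (("U.S", "dollar"), 0),
         (("Mr", "Brown"), 0), (("here", "He"), 1),
         (("company", "It"), 1), (("Dr", "Smith"), 0)]
X = [features(p, n) for (p, n), _ in train]
y = [label for _, label in train]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features("embassy", "She")]))  # likely [1], a boundary
```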
Word Tokenization
Word Tokenization: English
- Naïve approach: separate out alphabetic strings. E.g.
\w+
- Problem: Some words contain special punctuation or no alphabetic characters at all:
- Abbreviations: E.g. U.S.A
- Hyphens: E.g. merry-go-round, well-respected, yes-but
- Numbers: E.g. 10000.01
- Dates: E.g. 3/1/2016
- Clitics: E.g. can’t
- Internet language: E.g. http://www.google.com, #metoo, 😃
- Multiword units: E.g. New Zealand
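A sketch of why \w+ is too naive, next to a hand-written pattern that keeps several of the cases above intact. The pattern is illustrative, not exhaustive.

```python
import re

s = "The U.S.A. well-respected merry-go-round can't cost 10,000.01 on 3/1/2016 #metoo"

print(re.findall(r"\w+", s))
# "U.S.A." becomes ['U', 'S', 'A'], "can't" becomes ['can', 't'], etc.

pattern = r"""(?x)        # verbose mode: whitespace ignored, '#' starts a comment
    (?:[A-Za-z]\.)+       # abbreviations: U.S.A.
  | \d+(?:[,./]\d+)*      # numbers and dates: 10,000.01, 3/1/2016
  | \#?\w+(?:[-']\w+)*    # words, hashtags, hyphens, clitics
"""
print(re.findall(pattern, s))
# ['The', 'U.S.A.', 'well-respected', 'merry-go-round', "can't", 'cost',
#  '10,000.01', 'on', '3/1/2016', '#metoo']
```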
Word Tokenization: Chinese
- Some Asian languages are written without spaces between words
- In Chinese, words often correspond to more than one character
- E.g.
墨大 的 学生 与众不同
Unimelb 's students (are) special
- Standard approach assumes an existing vocabulary
- MaxMatch algorithm: Greedily match the longest word in the vocabulary (sketched in code after the examples below)
- E.g. 墨大的学生与众不同
V = {墨, 大, 的, 学, 生, 与, 众, 不, 同, 墨大, 学生, 不同, 与众不同}
MaxMatch process:
1. Match 墨大
2. Move to 的
3. Match 的
4. Move to 学
5. Match 学生
6. Move to 与
7. Match 与众不同
8. Done
Result: 墨大|的|学生|与众不同
- Problem: Greedy matching does not always recover the intended words
- E.g.
Sentence: 去买新西兰花
1. 去|买|新西兰|花 -> go|buy|New Zealand|flowers
2. 去|买|新|西兰花 -> go|buy|new|broccoli
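A minimal MaxMatch sketch; the single-character fallback for out-of-vocabulary characters is an implementation choice.

```python
def max_match(text, vocab):
    """Greedily take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # longest candidate first
            if text[i:j] in vocab or j == i + 1:   # fall back to one character
                words.append(text[i:j])
                i = j
                break
    return words

V = {"墨", "大", "的", "学", "生", "与", "众", "不", "同",
     "墨大", "学生", "不同", "与众不同"}
print(max_match("墨大的学生与众不同", V))  # ['墨大', '的', '学生', '与众不同']

# The ambiguity above: greedy matching commits to 新西兰 (New Zealand)
# and never considers 西兰花 (broccoli). V2 is an assumed vocabulary.
V2 = {"去", "买", "新", "花", "新西兰", "西兰花"}
print(max_match("去买新西兰花", V2))       # ['去', '买', '新西兰', '花']
```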
Word Tokenization: German
- Lebensversicherungsgesellschaftsangestellter -> life insurance company employee
- Requires a compound splitter
Subword Tokenization
- Colorless green ideas sleep furiously ->
[color][less][green][idea][s][sleep][furious][ly]
- One popular algorithm: byte-pair encoding (BPE)
- Core idea: Iteratively merge frequent pairs of characters
- Advantages:
- Data-informed tokenization
- Works for different languages
- Deals better with unknown words
Byte-Pair Encoding
- E.g.
1. Count all word frequencies in the corpus ([n] is the number of times the word appears):
- [5] l o w _
- [2] l o w e s t _
- [6] n e w e r _
- [3] w i d e r _
- [2] n e w _
2. First add all individual characters to the vocabulary:
V = {_, d, e, i, l, n, o, r, s, t, w}
3. Find the next most frequent byte pair in the corpus and add it. The next most frequent pair is r_, which appears in newer_ and wider_ a total of 6 + 3 = 9 times:
V = {_, d, e, i, l, n, o, r, s, t, w, r_}
4. Continue adding the next most frequent byte pair until all subwords are in the vocabulary:
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_, low_}
...
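The merge loop can be written in a few lines over the same toy corpus. One caveat: ties between equally frequent pairs (e.g. er and r_ both occur 9 times at the start) are broken arbitrarily, so the merge order may differ slightly from the walkthrough above.

```python
from collections import Counter

# Word frequencies from the toy corpus; words are tuples of symbols.
corpus = {("l","o","w","_"): 5, ("l","o","w","e","s","t","_"): 2,
          ("n","e","w","e","r","_"): 6, ("w","i","d","e","r","_"): 3,
          ("n","e","w","_"): 2}
vocab = {s for word in corpus for s in word}  # start from single characters

def best_pair(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

for _ in range(8):                            # 8 merges, as in the example
    a, b = best_pair(corpus)
    vocab.add(a + b)
    merged = {}
    for word, freq in corpus.items():         # rewrite each word with the merge
        out, i = [], 0
        while i < len(word):
            if word[i:i + 2] == (a, b):
                out.append(a + b); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    corpus = merged
    print(a + b, end=" ")                     # the merges, in order

print("\n", sorted(vocab))
```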
- In practice, BPE runs thousands of merges, creating a large vocabulary
- Most frequent words will be represented as full words
- Rarer words will be broken into subwords
- In the worst case, unknown words in the test data will be broken into individual letters
Disadvantages of Subword Tokenization
- Creates non-sensical subwords
- Generally a large vocabulary, but the size is controllable (fewer merges)
Word Normalization
Word Normalization
- Lower casing: Australia -> australia
- Removing morphology: cooking -> cook
- Correcting spelling: definately -> definitely
- Expanding abbreviations: U.S.A -> USA
- Goals of word normalization:
- Reduce vocabulary
- Map words into the same type
Inflectional Morphology
- Inflectional morphology creates grammatical variants
- English inflects nouns, verbs, and adjectives:
- Nouns: number of the noun (-s)
- Verbs: number of the subject (-s), aspect of the action (-ing), and tense of the action (-ed)
- Adjectives: comparatives (-er) and superlatives (-est)
- Many languages have much richer inflectional morphology than English
- E.g. French inflects nouns for gender (un chat vs. une chatte)
Lemmatization
- Lemmatization means removing any inflection to reach the uninflected form, called the lemma
- E.g. speaking -> speak
- In English, there are irregularities that prevent a trivial solution:
- poked -> poke (not pok)
- stopping -> stop (not stopp)
- watches -> watch (not watche)
- was -> be (not wa)
- A lexicon of lemmas is needed for accurate lemmatization (see the sketch below)
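For example, NLTK's WordNet-backed lemmatizer handles the irregular cases above when given the word's part of speech (assumes the wordnet data is installed):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lem = WordNetLemmatizer()

for w in ["poked", "stopping", "watches", "was"]:
    print(w, "->", lem.lemmatize(w, pos="v"))  # poke, stop, watch, be
```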
Derivational Morphology
- Derivational morphology creates distinct words
- English derivational suffixes often change the lexical category
- E.g.:
- -ly: personal -> personally (adjective -> adverb)
- -ize: final -> finalize (adjective -> transitive verb)
- -er: write -> writer (verb -> noun)
- English derivational prefixes often change the meaning without changing the lexical category
- E.g.:
- write -> rewrite
- healthy -> unhealthy
Stemming
- Stemming strips off all suffixes, leaving a stem
- E.g. automate, automatic, automation -> automat
- The stem is often not an actual lexical item
- Results in even less lexical sparsity than lemmatization
- Popular in information retrieval
- The stem is not always interpretable
The Porter Stemmer
- The most popular stemmer for English
- Applies rewrite rules in stages:
- First strip inflectional suffixes: E.g. -ies -> -i
- Then derivational suffixes: E.g. -ization -> -ize
- Notation:
- c = consonant. E.g. ‘b’, ‘c’, ‘d’
- v = vowel. E.g. ‘a’, ‘e’, ‘i’, ‘o’, ‘u’
- C = a sequence of consonants. E.g. ‘s’, ‘ss’, ‘tr’, ‘bl’
- V = a sequence of vowels. E.g. ‘o’, ‘oo’, ‘ee’, ‘io’
- A word has one of four forms:
- CVCV … C
- CVCV … V
- VCVC … C
- VCVC … V
- Therefore, a word can be represented as:
[C]VCVC…[V] = [C](VC)m[V], where m is the measure
- E.g.
- Tree = C(tr)V(ee) = C(VC)0V
- Trees = C(tr)V(ee)C(s) = C(VC)1
- Troubles = C(tr)V(ou)C(bl)V(e)C(s) = C(VC)2
- Rules format: (condition) S1 -> S2
- E.g. (m > 1) ement -> null: replacement -> replac
- replac -> CVCVC = C(VC)2 -> m = 2
- Always use the longest matching S1
- E.g.
- Rules: sses -> ss, ies -> i, ss -> ss, s -> null
- caresses -> caress
- caress -> caress
- cares -> care
- Algorithm:
- Step 1: plurals and inflectional morphology
- Steps 2, 3, 4: derivational suffixes
- Step 5: tidying up
- Examples:
- computational -> comput
- Step 2: ational -> ate: computate
- Step 4: ate -> null: comput
- computer -> comput:
- Step 4: er -> null: comput
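The walkthrough examples, run through NLTK's implementation of the Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["computational", "computer", "replacement", "caresses", "cares"]:
    print(w, "->", stemmer.stem(w))
# computational -> comput, computer -> comput, replacement -> replac,
# caresses -> caress, cares -> care
```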
Fixing Spelling Errors
- Reasons:
- Spelling errors create new, rare types
- Disrupt various kinds of linguistic analysis
- Very common in internet corpora
- In web search, particularly important in queries
- Methods:
- String distance (Levenshtein)
- Modelling of error types (phonetic, typing, etc.)
- Use an n-gram language model
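A standard dynamic-programming implementation of the Levenshtein (edit) distance used for the string-distance method:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("definately", "definitely"))  # 1: one substitution away
```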
Other Word Normalization
- Normalizing spelling variations
- normalise -> normalize
- U r so coool! -> you are so cool
- Expanding abbreviations
- US, U.S. -> United States
- imho -> in my humble opinion
Stopword Removal
Stopwords
- Definition: a list of words to be removed from the document
- Typical in bag-of-words (BOW) representations
- Not appropriate when sequence is important
- Choosing stopwords:
- All closed-class or function words: E.g. the, a, of, for, he, …
- Any high-frequency words
- Stopword lists from NLP toolkits such as NLTK and spaCy
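For example, filtering with NLTK's built-in English stopword list (spaCy ships its own list):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))

tokens = ["the", "students", "are", "enrolled", "at", "the", "university"]
print([t for t in tokens if t not in stop])  # ['students', 'enrolled', 'university']
```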