Lecture 2 Text Preprocessing

Some Definitions

  • Words: Sequence of characters with a meaning and/or function
  • Sentence: Sequence of words
  • Document: One or more sentences
  • Corpus: A collection of documents
  • Word token: Each instance of a word
  • Word type: Distinct words
  • Lexicon (“dictionary”): A group of word types
  • E.g.:
    • Sentence: "The student is enrolled at the University of Melbourne."
    • Word: 9 words in the sentence above: ["The", "student", "is", "enrolled", "at", "the", "University", "of", "Melbourne"]
    • Word Token: 9 word tokens in the sentence above: ["the", "student", "is", "enrolled", "at", "the", "university", "of", "melbourne"]
    • Word Type: 8 word types in the sentence above: ["the", "student", "is", "enrolled", "at", "university", "of", "melbourne"]

Reasons for Preprocessing

  • Most NLP applications take documents as input
  • Language is compositional. As humans, we can break these documents into individual components; to understand language, a computer should do the same
  • Preprocessing is the first step in breaking documents into their individual components

Preprocessing Steps

  1. Remove unwanted formatting, e.g. HTML tags
  2. Sentence Segmentation: Break documents into sentences
  3. Word Tokenization: Break sentences into words
  4. Word Normalization: Transform words into canonical forms
  5. Stopword Removal: Delete unwanted words
  • E.g. Sample Document (a code sketch of these five steps follows the example below):

    Hi there. I’m TARS.

    • Step 1: “Hi there. I’m TARS.”
    • Step 2: [“Hi there.”, “I’m TARS.”]
    • Step 3: [[“Hi”, “there”, “.”], [“I”, “'m”, “TARS”, “.”]]
    • Step 4: [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
    • Step 5: [[], [“tars”]]
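  • A minimal sketch of the five steps applied to the sample document above. The regexes, the contraction table, and the toy stopword list are illustrative assumptions, not a standard pipeline:

    import re

    # Toy stopword list and contraction table, assumed only for this example.
    STOPWORDS = {"hi", "there", "i", "am", "."}
    CONTRACTIONS = {"'m": "am"}

    def preprocess(doc):
        # Step 1: remove unwanted formatting such as HTML tags.
        doc = re.sub(r"<[^>]+>", "", doc)
        # Step 2: naive sentence segmentation on sentence-final punctuation.
        sentences = re.findall(r"[^.?!]+[.?!]", doc)
        processed = []
        for sent in sentences:
            # Step 3: naive word tokenization (words, clitics, punctuation).
            tokens = re.findall(r"[A-Za-z]+|'\w+|[.?!]", sent)
            # Step 4: word normalization (lowercasing, expanding contractions).
            tokens = [CONTRACTIONS.get(t.lower(), t.lower()) for t in tokens]
            # Step 5: stopword removal.
            processed.append([t for t in tokens if t not in STOPWORDS])
        return processed

    print(preprocess("Hi there. I'm TARS."))   # [[], ['tars']]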

Sentence Segmentation

Sentence Segmentation

  • Naïve approach: break on sentence punctuation, e.g. [.?!] (a short regex sketch follows this list)
    • Problem: some punctuation is used in abbreviations, e.g. “U.S. dollar”, “Yahoo!”
  • Second approach: use a regex that requires a capital letter after the punctuation, e.g. [.?!] [A-Z]
    • Problem: abbreviations followed by names also match this pattern, e.g. “Mr. Brown”
  • Better approach: use lexicons of known abbreviations
    • Problem: difficult to enumerate all names and abbreviations
  • State-of-the-Art approach: use machine learning rather than rules
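  • A minimal sketch of the first two approaches with Python’s re module; the sample text is made up to show the failure cases:

    import re

    text = "The U.S. dollar rose. Mr. Brown said so! Prices fell."

    # Naive approach: split after every sentence punctuation mark.
    naive = re.split(r"(?<=[.?!])\s+", text)
    # ['The U.S.', 'dollar rose.', 'Mr.', 'Brown said so!', 'Prices fell.']

    # Second approach: only split where a capital letter follows.
    better = re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)
    # Fixes "U.S. dollar" (a lowercase word follows), but still splits after
    # "Mr." because "Brown" is capitalized.
    print(naive)
    print(better)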

Binary Classifier

  • Looks at every “.” and decides whether it is the end of a sentence
    • Applicable models: decision trees, logistic regression
  • Features (a feature-extraction sketch follows this list):
    • The words before and after the “.”
    • Word shapes:
      • Uppercase
      • Lowercase
      • ALL_CAPS
      • number
      • Character length
    • Part-of-speech tags: determiners tend to start a sentence
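  • A minimal sketch of the features above for one “.” occurrence in a token sequence; the feature names and the shape function are illustrative assumptions. The resulting dictionaries could be fed to e.g. scikit-learn’s DictVectorizer and LogisticRegression:

    def shape(tok):
        """Very rough word-shape feature."""
        if tok.isdigit():
            return "number"
        if tok.isupper():
            return "ALL_CAPS"
        if tok[:1].isupper():
            return "Uppercase"
        return "lowercase"

    def period_features(tokens, i):
        """Features for deciding whether tokens[i] == '.' ends a sentence."""
        prev_tok = tokens[i - 1] if i > 0 else "<s>"
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        return {
            "prev_word": prev_tok.lower(),
            "next_word": next_tok.lower(),
            "prev_shape": shape(prev_tok),
            "next_shape": shape(next_tok),
            "prev_length": len(prev_tok),
        }

    tokens = ["He", "works", "for", "the", "U.S", ".", "government", "."]
    print(period_features(tokens, 5))   # the '.' after 'U.S' — not a boundary
    print(period_features(tokens, 7))   # the final '.' — a sentence boundary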

Word Tokenization

Word Tokenization: English

  • Naïve approach: separate out alphabetic strings, e.g. \w+ (a regex sketch follows this list)
    • Problem: some words contain special punctuation or no alphabetic characters at all:
      • Abbreviations: E.g. U.S.A.
      • Hyphens: E.g. merry-go-round, well-respected, yes-but
      • Numbers: E.g. 10000.01
      • Dates: E.g. 3/1/2016
      • Clitics: E.g. can’t
      • Internet language: E.g. http://www.google.com, #metoo, 😃
      • Multiword units: E.g. New Zealand
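  • A minimal sketch contrasting the naïve \w+ pattern with a (still imperfect) hand-written pattern covering a few of the cases above; the sample string and the improved pattern are illustrative only:

    import re

    text = "I can't buy the U.S.A. ticket for $10000.01 at http://www.google.com #metoo"

    # Naive approach: word-character runs only.
    print(re.findall(r"\w+", text))
    # Splits "can't", breaks up "U.S.A.", the price, the URL and the hashtag.

    # A hand-written pattern covering a few special cases (order matters).
    pattern = r"""
        https?://\S+            # URLs
      | \#\w+                   # hashtags
      | (?:[A-Za-z]\.){2,}      # abbreviations such as U.S.A.
      | \w+(?:[-']\w+)*         # words with internal hyphens or clitics
      | \$?\d+(?:[.,]\d+)*      # numbers and prices
    """
    print(re.findall(pattern, text, re.VERBOSE))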

Word Tokenization: Chinese

  • Some Asian languages are written without spaces between words
  • In Chinese, words often correspond to more than one character
  • E.g.:
墨大      的   学生           与众不同
Unimelb  's   students(are)  special
  • The standard approach assumes an existing vocabulary
  • MaxMatch algorithm: greedily match the longest word in the vocabulary (a sketch of the algorithm follows this example)
    • E.g. 墨大的学生与众不同
    V = {墨,大,的,学,生,与,众,不,同,墨大,学生,不同,与众不同}
    MaxMatch process:
    1. match 墨大
    2. move to 的
    3. match 的
    4. move to 学
    5. match 学生
    6. move to 与
    7. match 与众不同
    8. done
    Result: 墨大|的|学生|与众不同
    
    • Problem: greedy longest-match does not always give the intended segmentation
      • E.g.
        Sentence: 去买新西兰花
        1. 去|买|新西兰|花 -> go|buy|New Zealand|flowers
        2. 去|买|新|西兰花 -> go|buy|new|broccoli
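  • A minimal sketch of MaxMatch in Python, assuming the toy vocabulary above; out-of-vocabulary characters fall back to single-character matches:

    def max_match(text, vocab):
        """Greedy left-to-right longest-match word segmentation."""
        words = []
        i = 0
        while i < len(text):
            # Try the longest candidate first; fall back to one character.
            for j in range(len(text), i, -1):
                if text[i:j] in vocab or j == i + 1:
                    words.append(text[i:j])
                    i = j
                    break
        return words

    vocab = {"墨", "大", "的", "学", "生", "与", "众", "不", "同",
             "墨大", "学生", "不同", "与众不同"}
    print(max_match("墨大的学生与众不同", vocab))
    # ['墨大', '的', '学生', '与众不同']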
      

Word Tokenization: German

  • Lebensversicherungsgesellschaftsangestellter -> life insurance company employee
  • Requires a compound splitter

Subword Tokenization

  • Colorless green ideas sleep furiously -> [color][less][green][idea][s][sleep][furious][ly]
  • One popular algorithm: byte-pair encoding (BPE)
    • Core idea: iteratively merge frequent pairs of characters
    • Advantages:
      • Data-informed tokenization
      • Works for different languages
      • Deals better with unknown words

Byte-Pair Encoding
  • E.g.
    1. Count all word frequencies in the corpus; the number in [] is how many
       times the word appears in the corpus:
    - [5] l o w _
    - [2] l o w e s t _
    - [6] n e w e r _
    - [3] w i d e r _
    - [2] n e w _
    
    2. First add all individual characters to the vocabulary:
    V = {_, d, e, i, l, n, o, r, s, t, w}
    
    3. Find the most frequent byte pair in the corpus and merge it:
      The most frequent pair is "r _", which appears in "newer_" (6 times) and
      "wider_" (3 times), i.e. 6 + 3 = 9 times in total.
      V = {_, d, e, i, l, n, o, r, s, t, w, r_}
    
    4. Continue adding the next most frequent byte pair until all subwords are
       in the vocabulary:
      V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_}
      V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew}
      V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new}
      V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo}
      V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low}
      V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_}
      V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_, low_}
      ... 
    
  • In practice BPE runs for thousands of merges, creating a large vocabulary
  • The most frequent words will be represented as full words
  • Rarer words will be broken into subwords
  • In the worst case, unknown words in the test data will be broken into individual letters (a sketch of the merge-learning loop follows)
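  • A minimal sketch of the BPE merge-learning loop on the toy corpus above. Note that ties between equally frequent pairs (e.g. "e r" vs. "r _", both 9 here) may be broken differently, so the exact merge order can differ from the walkthrough above:

    from collections import Counter

    # Toy corpus from the example: each word is a tuple of symbols, with its frequency.
    corpus = {
        ("l", "o", "w", "_"): 5,
        ("l", "o", "w", "e", "s", "t", "_"): 2,
        ("n", "e", "w", "e", "r", "_"): 6,
        ("w", "i", "d", "e", "r", "_"): 3,
        ("n", "e", "w", "_"): 2,
    }

    def pair_counts(corpus):
        """Count adjacent symbol pairs, weighted by word frequency."""
        counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                counts[(a, b)] += freq
        return counts

    def merge_pair(corpus, pair):
        """Replace every occurrence of `pair` with the merged symbol."""
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        return merged

    vocab = {c for word in corpus for c in word}   # start with individual characters
    for _ in range(8):                             # a handful of merges for illustration
        counts = pair_counts(corpus)
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        corpus = merge_pair(corpus, best)
        vocab.add("".join(best))
        print("merge:", best, "->", "".join(best))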

Disadvantages of Subword Tokenization

  • Creates non-sensical subwords
  • Generally a large vocabulary, but its size is controllable (fewer merges)

Word Normalization

Word Normalization

  • Lower casing: Australia -> australia
  • Removing morphology: cooking -> cook
  • Correcting spelling: definately -> definitely
  • Expanding abbreviations: U.S.A -> USA
  • Goal of word normalization:
    • Reduce vocabulary
    • Map words into the same type

Inflectional Morphology

  • Inflectional morphology creates grammatical variants
  • English inflects nouns, verbs, and adjectives:
    • Nouns: number (-s)
    • Verbs: number of the subject (-s), aspect (-ing), and tense (-ed)
    • Adjectives: comparatives (-er) and superlatives (-est)
  • Many languages have much richer inflectional morphology than English
    • E.g. French inflects nouns for gender (un chat vs. une chatte)

Lemmatization

  • Lemmatization means removing any inflection to reach the uninflected form, called the lemma
    • E.g. speaking -> speak
  • In English, there are irregularities that prevent a trivial solution:
    • poked -> poke (not pok)
    • stopping -> stop (not stopp)
    • watches -> watch (not watche)
    • was -> be (not wa)
  • A lexicon of lemmas is needed for accurate lemmatization (a sketch using NLTK’s lemmatizer follows)
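  • A minimal sketch using NLTK’s WordNet-based lemmatizer on the irregular examples above; it assumes NLTK is installed and the 'wordnet' data has been downloaded, and it needs a part-of-speech hint ('v' for verb) since the default is noun:

    # pip install nltk; then once: import nltk; nltk.download("wordnet")
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    for word in ["speaking", "poked", "stopping", "watches", "was"]:
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # Expected lemmas: speak, poke, stop, watch, be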

Derivational Morphology

  • Derivational morphology creates distinct words
  • English derivational suffixes often change the lexical category:
    • E.g.:
      • -ly: personal -> personally (adjective -> adverb)
      • -ize: final -> finalize (adjective -> transitive verb)
      • -er: write -> writer (verb -> noun)
  • English derivational prefixes often change the meaning without changing the lexical category:
    • E.g.:
      • write -> rewrite
      • healthy -> unhealthy

Stemming

  • Stemming strips off all suffixes, leaving a stem
    • E.g. automate, automatic, automation -> automat
    • The stem is often not an actual lexical item
  • Results in even lower lexical sparsity than lemmatization
  • Popular in information retrieval
  • Stems are not always interpretable

The Porter Stemmer

  • The most popular stemmer for English
  • Applies rewrite rules in stages:
    • First strip inflectional suffixes: E.g. -ies -> -i
    • Then derivational suffixes: E.g. -ization -> -ize -> null
  • Notation:
    • c = consonant, E.g. ‘b’, ‘c’, ‘d’
    • v = vowel, E.g. ‘a’, ‘e’, ‘i’, ‘o’, ‘u’
    • C = a sequence of consonants, E.g. ‘s’, ‘ss’, ‘tr’, ‘bl’
    • V = a sequence of vowels, E.g. ‘o’, ‘oo’, ‘ee’, ‘io’
  • A word has one of four forms:
    • CVCV … C
    • CVCV … V
    • VCVC … C
    • VCVC … V
  • Therefore, a word can be represented as:
    [C]VCVC…[V] -> [C](VC)^m[V], where m is called the measure
    • E.g.
      • Tree = C(tr) V(ee) = C(VC)^0 V, so m = 0
      • Trees = C(tr) V(ee) C(s) = C(VC)^1, so m = 1
      • Troubles = C(tr) V(ou) C(bl) V(e) C(s) = C(VC)^2, so m = 2
  • Rule format: (condition) S1 -> S2
    • E.g. (m > 1) ement -> null: replacement -> replac
      • replac = C(r) V(e) C(pl) V(a) C(c) = C(VC)^2, so m = 2 and the rule applies
  • Always use the longest matching S1
    • E.g.
      • Rules: sses -> ss, ies -> i, ss -> ss, s -> null
      • caresses -> caress
      • caress -> caress
      • cares -> care
  • Algorithm:
    • Step 1: plurals and inflectional morphology
    • Steps 2, 3, 4: derivational suffixes
    • Step 5: tidying up
  • Examples (a code sketch of the measure m follows):
    • computational -> comput
      • Step 2: ational -> ate: computate
      • Step 4: ate -> null: comput
    • computer -> comput:
      • Step 4: er -> null: comput
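  • A minimal sketch of computing the measure m with regular expressions (treating ‘y’ as a consonant, which simplifies Porter’s actual rule), plus a pointer to the full stemmer in NLTK:

    import re

    def measure(stem):
        """Porter's measure m, reading the stem as [C](VC)^m[V]."""
        form = re.sub(r"[aeiou]+", "V", stem.lower())   # vowel runs -> V
        form = re.sub(r"[^V]+", "C", form)              # consonant runs -> C
        return form.count("VC")

    for w in ["tree", "trees", "troubles", "replac"]:
        print(w, measure(w))   # 0, 1, 2, 2

    # The full rule set is available in NLTK:
    # from nltk.stem import PorterStemmer
    # PorterStemmer().stem("computational")   # 'comput'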

Fixing Spelling Errors

  • Reasons:
    • Spelling errors create new, rare types
    • They disrupt various kinds of linguistic analysis
    • They are very common in internet corpora
    • In web search, correction is particularly important for queries
  • Methods (a Levenshtein-distance sketch follows this list):
    • String distance (Levenshtein)
    • Modelling of error types (phonetic, typing, etc.)
    • Using an n-gram language model
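  • A minimal sketch of Levenshtein (edit) distance with unit costs; a simple corrector would propose dictionary words within a small distance of the misspelled token:

    def levenshtein(a, b):
        """Minimum number of insertions, deletions and substitutions to turn a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("definately", "definitely"))   # 1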

Other Word Normalization

  • Normalizing spelling variations
    • normalise -> normalize
    • U r so coool! -> you are so cool
  • Expanding abbreviations
    • US, U.S. -> United States
    • imho -> in my humble opinion

Stopword Removal

Stopwords

  • Definition: a list of words to be removed from the document
    • Typical in bag-of-words (BOW) representations
    • Not appropriate when word order is important
  • Choosing stopwords:
    • All closed-class or function words: E.g. the, a, of, for, he, …
    • Any high-frequency words
    • Predefined lists from NLP toolkits such as NLTK and spaCy (see the sketch below)
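  • A minimal sketch using NLTK’s English stopword list on the tokens from the earlier example; it assumes NLTK is installed and the 'stopwords' data has been downloaded:

    # pip install nltk; then once: import nltk; nltk.download("stopwords")
    from nltk.corpus import stopwords

    STOP = set(stopwords.words("english"))

    tokens = ["the", "student", "is", "enrolled", "at", "the",
              "university", "of", "melbourne"]
    print([t for t in tokens if t not in STOP])
    # e.g. ['student', 'enrolled', 'university', 'melbourne']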