Lecture 2 Text Preprocessing

Some Definitions

  • Word: A sequence of characters with a meaning and/or function
  • Sentence: A sequence of words
  • Document: One or more sentences
  • Corpus: A collection of documents
  • Word token: Each instance of a word
  • Word type: A distinct word
  • Lexicon (“dictionary”): A group of word types
  • E.g.:
    • Sentence: "The student is enrolled at the University of Melbourne."
    • Word: 9 words in the sentence above: ["The", "student", "is", "enrolled", "at", "the", "University", "of", "Melbourne"]
    • Word Token: 9 word tokens in the sentence above: ["the", "student", "is", "enrolled", "at", "the", "university", "of", "melbourne"]
    • Word Type: 8 word types in the sentence above: ["the", "student", "is", "enrolled", "at", "university", "of", "melbourne"]
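
The token/type distinction above can be checked with a few lines of Python. This is only a sketch: the regex `[A-Za-z]+` and the choice to lowercase before counting are assumptions made to match the example, not a general tokenizer.

```python
import re

sentence = "The student is enrolled at the University of Melbourne."

# Word tokens: every instance of a word, lowercased to match the example above.
tokens = re.findall(r"[A-Za-z]+", sentence.lower())

# Word types: the distinct words, kept in order of first appearance.
types = list(dict.fromkeys(tokens))

print(len(tokens), tokens)
print(len(types), types)
```

"the" occurs twice as a token but counts only once as a type, which is why the two counts differ by one.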

Reasons for Preprocessing

  • Most NLP applications take documents as input.
  • Language is compositional. As humans, we can break documents into individual components. To understand language, a computer should do the same.
  • Preprocessing is the first step in breaking documents into individual components.

Preprocessing Steps

  1. Remove unwanted formatting. E.g. HTML tags
  2. Sentence Segmentation: Break documents into sentences
  3. Word Tokenization: Break sentences into words
  4. Word Normalization: Transform words into canonical forms
  5. Stopword removal: Delete unwanted words
  • E.g. Sample Document:

    Hi there. I’m TARS.

    • Step 1: “Hi there. I’m TARS.”
    • Step 2: [“Hi there.”, “I’m TARS.”]
    • Step 3: [[“Hi”, “there”, “.”], [“I”, “'m”, “TARS”, “.”]]
    • Step 4: [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
    • Step 5: [[], [“tars”]]
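
The five steps can be sketched end to end as below. Everything here is illustrative: `preprocess`, `CONTRACTIONS` and `STOPWORDS` are hypothetical names, the lookup tables contain only what the sample document needs, the input is assumed to use a straight apostrophe (`'`), and the regexes are deliberately naive.

```python
import re

# Hypothetical lookup tables, chosen only so the sample document above works.
CONTRACTIONS = {"'m": "am"}
STOPWORDS = {"hi", "there", "i", "am"}

def preprocess(document):
    # Step 1: remove unwanted formatting (here, anything that looks like an HTML tag).
    text = re.sub(r"<[^>]+>", "", document).strip()
    # Step 2: naive sentence segmentation, breaking after . ? or ! followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", text)
    result = []
    for sentence in sentences:
        # Step 3: word tokenization into clitics ('m), words, and punctuation marks.
        tokens = re.findall(r"'\w+|\w+|[^\w\s]", sentence)
        # Step 4: normalization by lowercasing and expanding known contractions.
        tokens = [CONTRACTIONS.get(t.lower(), t.lower()) for t in tokens]
        # Step 5: drop stopwords and punctuation-only tokens.
        tokens = [t for t in tokens if t not in STOPWORDS and t.isalnum()]
        result.append(tokens)
    return result

print(preprocess("<p>Hi there. I'm TARS.</p>"))
```

In practice each step would usually be handled by a dedicated tool with a much larger stopword list and contraction table; the point of the sketch is only the order of the steps and the shape of the data passed between them.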

Sentence Segmentation

  • Naïve approach: Break on sentence-ending punctuation. E.g. [.?!]
    • Problem: Some punctuation marks are part of abbreviations and names. E.g. “U.S. dollar”, “Yahoo!”
  • Second approach: Use a regex that requires a capital letter after the punctuation. E.g. [.?!][A-Z]
    • Problem: Abbreviations followed by a name also match this pattern. E.g. “Mr.Brown”
  • Better approach: Use a lexicon of known abbreviations
    • Problem: Difficult to enumerate all names and abbreviations
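
To make the trade-offs above concrete, here is a minimal sketch that compares the naïve approach with a lexicon-based one. `ABBREVIATIONS`, `naive_split` and `split_sentences` are hypothetical names, and the lexicon is a tiny, intentionally incomplete assumption.

```python
import re

# Hypothetical, incomplete lexicon of abbreviations (lowercased).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s."}

def naive_split(text):
    # Naive approach: break after every . ? or !
    return [s.strip() for s in re.findall(r"[^.?!]+[.?!]", text)]

def split_sentences(text, abbreviations=ABBREVIATIONS):
    # Lexicon approach: a whitespace-separated token ends a sentence only if it
    # ends in . ? or ! and is not a known abbreviation.
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences

text = "Mr. Brown paid in U.S. dollars. He left!"
print(naive_split(text))
print(split_sentences(text))
```

The naïve splitter breaks the text at "Mr." and inside "U.S.", while the lexicon version keeps both intact. The lexicon version still fails when an abbreviation really does end a sentence, and it only knows the abbreviations that were listed by hand.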