Contents
- Some Definitions
- Reasons for Preprocessing
- Preprocessing Steps
- Sentence Segmentation
- Binary Classifier
- Word Tokenization: English
- Word Tokenization: Chinese
- Word Tokenization: German
- Subword Tokenization
- Word Normalization
- Inflectional Morphology
- Lemmatization
- Derivational Morphology
- Stemming
- Fixing Spelling Errors
- Other Word Normalization
- Stopwords
Some Definitions
- Words: Sequence of characters with a meaning and/or function
- Sentence: Sequence of words
- Document: One or more sentences
- Corpus: A collection of documents
- Word token: Each instance of a word
- Word type: Distinct words
- Lexicon (“dictionary”): A group of word types
- E.g.:
- Sentence:
"The student is enrolled at the University of Melbourne."
- Word: 9 words in the sentence above:
["The", "student", "is", "enrolled", "at", "the", "University", "of", "Melbourne"]
- Word Token: 9 word tokens in the sentence above:
["the", "student", "is", "enrolled", "at", "the", "university", "of", "melbourne"]
- Word Type: 8 word types in the sentence above:
["the", "student", "is", "enrolled", "at", "university", "of", "melbourne"]
Reasons for Preprocessing
- Most NLP applications have documents as inputs
- Language is compositional. As humans, we can break these documents into individual components. To understand language, a computer should do the same
- Preprocessing is the first step in breaking documents into individual components
Preprocessing Steps
- Remove unwanted formatting. E.g. HTML tags
- Sentence Segmentation: Break documents into sentences
- Word Tokenization: Break sentences into words
- Word Normalization: Transform words into canonical forms
- Stopword Removal: Delete unwanted words
- E.g. Sample Document: “Hi there. I’m TARS.”
- Step 1: “Hi there. I’m TARS.”
- Step 2: [“Hi there.”, “I’m TARS.”]
- Step 3: [[“Hi”, “there”, “.”], [“I”, “'m”, “TARS”, “.”]]
- Step 4: [[“hi”, “there”, “.”], [“i”, “am”, “tars”, “.”]]
- Step 5: [[], [“tars”]]
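A minimal sketch of the five steps in Python, using NLTK for segmentation and tokenization. The contraction map and the stopword list are illustrative choices, not fixed parts of the pipeline; the sketch assumes NLTK's punkt and stopwords data are installed.

```python
# A minimal sketch of the five preprocessing steps using NLTK.
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)      # "punkt_tab" on newer NLTK versions
nltk.download("stopwords", quiet=True)

doc = "Hi there. I'm TARS."                      # Step 1: raw document (no HTML to strip here)
sents = nltk.sent_tokenize(doc)                  # Step 2: ["Hi there.", "I'm TARS."]
tokens = [nltk.word_tokenize(s) for s in sents]  # Step 3: [["Hi", "there", "."], ["I", "'m", "TARS", "."]]

# Step 4: normalization (lowercase plus a toy contraction map)
CONTRACTIONS = {"'m": "am", "n't": "not"}
norm = [[CONTRACTIONS.get(t.lower(), t.lower()) for t in s] for s in tokens]

# Step 5: stopword (and punctuation) removal; the exact result depends on
# the stopword list used: NLTK's drops "there" but keeps "hi", for instance
stop = set(stopwords.words("english"))
print([[t for t in s if t.isalpha() and t not in stop] for s in norm])
```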
Sentence Segmentation
Sentence Segmentation
- Naïve approach: break on sentence punctuation. E.g.
[.?!]
- Problem: Some punctuation marks are used in abbreviations. E.g. “U.S. dollar”, “Yahoo!”
- Second approach: Use a regex that requires a capital letter after the punctuation. E.g.
[.?!] [A-Z]
- Problem: Abbreviated names also match this pattern. E.g. “Mr. Brown”
- Better approach: Use lexicons
- Problem: Difficult to enumerate all names and abbreviations
- State-of-the-art approach: Use machine learning rather than rules
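Both rule-based approaches can be sketched with Python regexes; the sample text here is invented to show where each breaks.

```python
import re

text = "Yahoo! is a company. Mr. Brown works at the U.S. embassy. He is here."

# Naive: split right after any sentence punctuation.
print(re.split(r"(?<=[.?!])\s+", text))
# Wrongly breaks after "Yahoo!", "Mr." and "U.S." as well.

# Second approach: only split when a capital letter follows.
print(re.split(r"(?<=[.?!])\s+(?=[A-Z])", text))
# Fixes "Yahoo! is" and "U.S. embassy", but still splits "Mr. Brown" apart.
```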
Binary Classifier
- Looks at every “.” and decides whether it is the end of a sentence
- Applicable models: decision trees, logistic regression
- Features:
- Look at the words before and after “.”
- Word shapes:
- Uppercase
- Lowercase
- ALL_CAPS
- Number
- Character length
- Part-of-speech tags: determiners tend to start a sentence
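A toy version of such a classifier using scikit-learn's LogisticRegression. The word-shape features and the handful of labelled pairs are invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

def features(prev_word, next_word):
    """Word-shape features around a candidate '.'."""
    return [
        len(prev_word),                   # character length
        prev_word.isupper(),              # ALL_CAPS, e.g. "U.S"
        prev_word[:1].isupper(),          # previous word capitalised?
        next_word[:1].isupper(),          # next word capitalised?
        next_word in {"The", "A", "He"},  # crude determiner/pronoun cue
    ]

# ((word before ".", word after "."), label); 1 = end of sentence
train = [(("dollar", "The"), 1), (("U.S", "dollar"), 0),
         (("Mr", "Brown"), 0), (("here", "He"), 1),
         (("company", "It"), 1), (("Dr", "Smith"), 0)]
X = [features(p, n) for (p, n), _ in train]
y = [label for _, label in train]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features("embassy", "She")]))  # likely [1], a boundary
```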
Word Tokenization
Word Tokenization: English
- Naïve approach: separate out alphabetic strings. E.g.
\w+
- Problem: Some words contain special punctuation or no alphabetic characters at all:
- Abbreviations: E.g. U.S.A
- Hyphens: E.g. merry-go-round, well-respected, yes-but
- Numbers: E.g. 10000.01
- Dates: E.g. 3/1/2016
- Clitics: E.g. can’t
- Internet language: E.g. http://www.google.com, #metoo, 😃
- Multiword units: E.g. New Zealand
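A sketch of why \w+ is too naive, next to a hand-written pattern that keeps several of the cases above intact. The pattern is illustrative, not exhaustive.

```python
import re

s = "The U.S.A. well-respected merry-go-round can't cost 10,000.01 on 3/1/2016 #metoo"

print(re.findall(r"\w+", s))
# "U.S.A." becomes ['U', 'S', 'A'], "can't" becomes ['can', 't'], etc.

pattern = r"""(?x)        # verbose mode: whitespace ignored, '#' starts a comment
    (?:[A-Za-z]\.)+       # abbreviations: U.S.A.
  | \d+(?:[,./]\d+)*      # numbers and dates: 10,000.01, 3/1/2016
  | \#?\w+(?:[-']\w+)*    # words, hashtags, hyphens, clitics
"""
print(re.findall(pattern, s))
# ['The', 'U.S.A.', 'well-respected', 'merry-go-round', "can't", 'cost',
#  '10,000.01', 'on', '3/1/2016', '#metoo']
```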
Word Tokenization: Chinese
- Some Asian languages are written without spaces between words
- In Chinese, words often correspond to more than one character
- E.g.
墨大 的 学生 与众不同
Unimelb 's students (are) special
- Standard approach assumes an existing vocabulary
- MaxMatch algorithm: Greedily match the longest word in the vocabulary (sketched in code after the examples below)
- E.g. 墨大的学生与众不同
V = {墨, 大, 的, 学, 生, 与, 众, 不, 同, 墨大, 学生, 不同, 与众不同}
MaxMatch process:
1. Match 墨大
2. Move to 的
3. Match 的
4. Move to 学
5. Match 学生
6. Move to 与
7. Match 与众不同
8. Done
Result: 墨大|的|学生|与众不同
- Problem: Greedy matching does not always recover the intended words
- E.g.
Sentence: 去买新西兰花
1. 去|买|新西兰|花 -> go|buy|New Zealand|flowers
2. 去|买|新|西兰花 -> go|buy|new|broccoli
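A minimal MaxMatch sketch; the single-character fallback for out-of-vocabulary characters is an implementation choice.

```python
def max_match(text, vocab):
    """Greedily take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # longest candidate first
            if text[i:j] in vocab or j == i + 1:   # fall back to one character
                words.append(text[i:j])
                i = j
                break
    return words

V = {"墨", "大", "的", "学", "生", "与", "众", "不", "同",
     "墨大", "学生", "不同", "与众不同"}
print(max_match("墨大的学生与众不同", V))  # ['墨大', '的', '学生', '与众不同']

# The ambiguity above: greedy matching commits to 新西兰 (New Zealand)
# and never considers 西兰花 (broccoli). V2 is an assumed vocabulary.
V2 = {"去", "买", "新", "花", "新西兰", "西兰花"}
print(max_match("去买新西兰花", V2))       # ['去', '买', '新西兰', '花']
```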
Word Tokenization: German
- Lebensversicherungsgesellschaftsangestellter -> life insurance company employee
- Requires a compound splitter
Subword Tokenization
- Colorless green ideas sleep furiously ->
[color][less][green][idea][s][sleep][furious][ly]
- One popular algorithm: byte-pair encoding (BPE)
- Core idea: Iteratively merge frequent pairs of characters
- Advantages:
- Data-informed tokenization
- Works for different languages
- Deals better with unknown words
Byte-Pair Encoding
- E.g.
1. Count all word frequencies in the corpus ([n] is the number of times the word appears):
- [5] l o w _
- [2] l o w e s t _
- [6] n e w e r _
- [3] w i d e r _
- [2] n e w _
2. First add all individual characters to the vocabulary:
V = {_, d, e, i, l, n, o, r, s, t, w}
3. Find the next most frequent byte pair in the corpus and add it. The next most frequent pair is r_, which appears in newer_ and wider_ a total of 6 + 3 = 9 times:
V = {_, d, e, i, l, n, o, r, s, t, w, r_}
4. Continue adding the next most frequent byte pair until all subwords are in the vocabulary:
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_}
V = {_, d, e, i, l, n, o, r, s, t, w, r_, er_, ew, new, lo, low, newer_, low_}
...
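The merge loop can be written in a few lines over the same toy corpus. One caveat: ties between equally frequent pairs (e.g. er and r_ both occur 9 times at the start) are broken arbitrarily, so the merge order may differ slightly from the walkthrough above.

```python
from collections import Counter

# Word frequencies from the toy corpus; words are tuples of symbols.
corpus = {("l","o","w","_"): 5, ("l","o","w","e","s","t","_"): 2,
          ("n","e","w","e","r","_"): 6, ("w","i","d","e","r","_"): 3,
          ("n","e","w","_"): 2}
vocab = {s for word in corpus for s in word}  # start from single characters

def best_pair(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

for _ in range(8):                            # 8 merges, as in the example
    a, b = best_pair(corpus)
    vocab.add(a + b)
    merged = {}
    for word, freq in corpus.items():         # rewrite each word with the merge
        out, i = [], 0
        while i < len(word):
            if word[i:i + 2] == (a, b):
                out.append(a + b); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    corpus = merged
    print(a + b, end=" ")                     # the merges, in order

print("\n", sorted(vocab))
```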
- In practice, BPE runs thousands of merges, creating a large vocabulary
- Most frequent words will be represented as full words
- Rarer words will be broken into subwords
- In the worst case, unknown words in the test data will be broken into individual letters
Disadvantages of Subword Tokenization
- Creates non-sensical subwords
- Generally a large vocabulary, but the size is controllable (fewer merges)
Word Normalization
Word Normalization
- Lower casing: Australia -> australia
- Removing morphology: cooking -> cook
- Correcting spelling: definately -> definitely
- Expanding abbreviations: U.S.A -> USA
- Goals of word normalization:
- Reduce vocabulary
- Map words into the same type
Inflectional Morphology
- Inflectional morphology creates grammatical variants
- English inflects nouns, verbs, and adjectives:
- Nouns: number of the noun (-s)
- Verbs: number of the subject (-s), aspect of the action (-ing), and tense of the action (-ed)
- Adjectives: comparatives (-er) and superlatives (-est)
- Many languages have much richer inflectional morphology than English
- E.g. French inflects nouns for gender (un chat vs. une chatte)
Lemmatization
- Lemmatization means removing any inflection to reach the uninflected form, called the lemma
- E.g. speaking -> speak
- In English, there are irregularities that prevent a trivial solution:
- poked -> poke (not pok)
- stopping -> stop (not stopp)
- watches -> watch (not watche)
- was -> be (not wa)
- A lexicon of lemmas is needed for accurate lemmatization (see the sketch below)
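For example, NLTK's WordNet-backed lemmatizer handles the irregular cases above when given the word's part of speech (assumes the wordnet data is installed):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lem = WordNetLemmatizer()

for w in ["poked", "stopping", "watches", "was"]:
    print(w, "->", lem.lemmatize(w, pos="v"))  # poke, stop, watch, be
```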
Derivational Morphology
- Derivational morphology creates distinct words
- English derivational suffixes often change the lexical category
- E.g.:
- -ly: personal -> personally (adjective -> adverb)
- -ize: final -> finalize (adjective -> transitive verb)
- -er: write -> writer (verb -> noun)
- English derivational prefixes often change the meaning without changing the lexical category
- E.g.:
- write -> rewrite
- healthy -> unhealthy
Stemming
- Stemming strips off all suffixes, leaving a stem
- E.g. automate, automatic, automation -> automat
- The stem is often not an actual lexical item
- Results in even less lexical sparsity than lemmatization
- Popular in information retrieval
- The stem is not always interpretable
The Porter Stemmer
- The most popular stemmer for English
- Applies rewrite rules in stages:
- First strip inflectional suffixes: E.g. -ies -> -i
- Then derivational suffixes: E.g. -ization -> -ize
- Notation:
- c = consonant. E.g. ‘b’, ‘c’, ‘d’
- v = vowel. E.g. ‘a’, ‘e’, ‘i’, ‘o’, ‘u’
- C = a sequence of consonants. E.g. ‘s’, ‘ss’, ‘tr’, ‘bl’
- V = a sequence of vowels. E.g. ‘o’, ‘oo’, ‘ee’, ‘io’
- A word has one of four forms:
- CVCV … C
- CVCV … V
- VCVC … C
- VCVC … V
- Therefore, a word can be represented as:
[C]VCVC…[V] = [C](VC)m[V], where m is the measure
- E.g.
- Tree = C(tr)V(ee) = C(VC)0V
- Trees = C(tr)V(ee)C(s) = C(VC)1
- Troubles = C(tr)V(ou)C(bl)V(e)C(s) = C(VC)2
- Rules format: (condition) S1 -> S2
- E.g. (m > 1) ement -> null: replacement -> replac
- replac -> CVCVC = C(VC)2 -> m = 2
- Always use the longest matching S1
- E.g.
- Rules: sses -> ss, ies -> i, ss -> ss, s -> null
- caresses -> caress
- caress -> caress
- cares -> care
- Algorithm:
- Step 1: plurals and inflectional morphology
- Steps 2, 3, 4: derivational suffixes
- Step 5: tidying up
- Examples:
- computational -> comput
- Step 2: ational -> ate: computate
- Step 4: ate -> null: comput
- computer -> comput:
- Step 4: er -> null: comput
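The walkthrough examples, run through NLTK's implementation of the Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["computational", "computer", "replacement", "caresses", "cares"]:
    print(w, "->", stemmer.stem(w))
# computational -> comput, computer -> comput, replacement -> replac,
# caresses -> caress, cares -> care
```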
Fixing Spelling Errors
- Reasons:
- Spelling errors create new, rare types
- Disrupt various kinds of linguistic analysis
- Very common in internet corpora
- In web search, particularly important in queries
- Methods:
- String distance (Levenshtein)
- Modelling of error types (phonetic, typing, etc.)
- Use an n-gram language model
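A standard dynamic-programming implementation of the Levenshtein (edit) distance used for the string-distance method:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("definately", "definitely"))  # 1: one substitution away
```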
Other Word Normalization
- Normalizing spelling variations
- normalise -> normalize
- U r so coool! -> you are so cool
- Expanding abbreviations
- US, U.S. -> United States
- imho -> in my humble opinion
Stopword Removal
Stopwords
- Definition: a list of words to be removed from the document
- Typical in bag-of-words (BOW) representations
- Not appropriate when sequence is important
- Choosing stopwords:
- All closed-class or function words: E.g. the, a, of, for, he, …
- Any high-frequency words
- Stopword lists from NLP toolkits such as NLTK and spaCy
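For example, filtering with NLTK's built-in English stopword list (spaCy ships its own list):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))

tokens = ["the", "students", "are", "enrolled", "at", "the", "university"]
print([t for t in tokens if t not in stop])  # ['students', 'enrolled', 'university']
```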