Chapter 2: Text Wrangling and Cleansing
The text-processing pipeline:
tokenization -> stopword removal -> stemming or lemmatization
1. A quick look at the contents of a JSON file
example.json:
{
    "array": [1, 2, 3, 4],
    "boolean": true,
    "object": {
        "a": "b"
    },
    "string": "Hello World"
}
A minimal processing script:
import json
# open the file
jsonfile = open("example.json")
# load the data
data = json.load(jsonfile)
print(data['array'], data['boolean'], data['object'], data['string'])
This prints the four parsed values: [1, 2, 3, 4] True {'a': 'b'} Hello World
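The same parsing step can be sketched without needing an example.json file on disk: json.loads parses an inline string (the data below simply mirrors the file above), while json.load does the same from a file object.

```python
import json

# Parse JSON from an inline string instead of a file.
raw = '{"array": [1, 2, 3, 4], "boolean": true, "object": {"a": "b"}, "string": "Hello World"}'
data = json.loads(raw)

print(data["array"])    # [1, 2, 3, 4]
print(data["boolean"])  # True  (JSON true maps to Python True)
print(data["object"])   # {'a': 'b'}
print(data["string"])   # Hello World
```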
2. Sentence Splitting
Text cleaning should happen first, e.g. stripping unnecessary characters left over from HTML and removing very short tokens.
Sentence splitting breaks a large block of raw text into a list of sentences.
Splitting sentences with sent_tokenize:
from nltk.tokenize import sent_tokenize
# sent_tokenize splits sentences based on sentence-boundary detection
inputstring = " This is an example sent. The sentence splitter will split on sent markers. Ohh really !!"
all_sent = sent_tokenize(inputstring)
print(all_sent)
The result is as follows:
D:\IR_lab\venv\Scripts\python.exe D:/IR_lab/learn.py
[' This is an example sent.', 'The sentence splitter will split on sent markers.', 'Ohh really !', '!']
Process finished with exit code 0
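To see why a dedicated sentence-boundary detector is useful, here is a naive regex-based splitter, offered only as a contrast sketch (this is not how sent_tokenize works internally): it breaks on '.', '!' or '?' followed by whitespace, and cannot handle abbreviations like "Mr." the way a trained model can.

```python
import re

def naive_sent_split(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(naive_sent_split("This is an example sent. The splitter will split on markers."))
# ['This is an example sent.', 'The splitter will split on markers.']
```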
3. Tokenization
There are various tokenizers. The simplest is the split() method of Python's built-in string type, which splits words on whitespace. word_tokenize() is a more powerful method that does the same job, and regexp_tokenize() is yet another option, which splits the string based on a regular expression.
The code is as follows:
- Using split():
s = "Hi Everyone ! This is the first day we go to school."
print(s.split())
Result:
D:\IR_lab\venv\Scripts\python.exe D:/IR_lab/learn.py
['Hi', 'Everyone', '!', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school.']
Process finished with exit code 0
- Using word_tokenize():
from nltk.tokenize import word_tokenize
s = "Hi Everyone ! This is the first day we go to school."
all_word = word_tokenize(s)
print(all_word)
Result (note that, unlike split(), the final '.' is separated from 'school'):
D:\IR_lab\venv\Scripts\python.exe D:/IR_lab/learn.py
['Hi', 'Everyone', '!', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school', '.']
Process finished with exit code 0
- Using regexp_tokenize():
The regular expression \w+ extracts words and numbers; \d+ extracts only digit sequences.
from nltk.tokenize import regexp_tokenize
s = "Hi Everyone ! This is the first day we go to school."
all_word = regexp_tokenize(s, pattern=r'\w+')
print(all_word)
Result:
D:\IR_lab\venv\Scripts\python.exe D:/IR_lab/learn.py
['Hi', 'Everyone', 'This', 'is', 'the', 'first', 'day', 'we', 'go', 'to', 'school']
Process finished with exit code 0
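The same patterns can be tried with the standard library's re module, which is a sketch of the idea behind regexp_tokenize (a findall over a pattern) without the NLTK dependency; the sample sentence below is made up for illustration.

```python
import re

s = "Room 101 opens at 9 am on day 1."

# \w+ keeps runs of word characters (letters and digits).
print(re.findall(r"\w+", s))  # ['Room', '101', 'opens', 'at', '9', 'am', 'on', 'day', '1']

# \d+ keeps only runs of digits.
print(re.findall(r"\d+", s))  # ['101', '9', '1']
```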
4. Stemming
For example:
eating, eaten, eats -> eat
Stemming reduces different inflected forms to a common root. On simple operations such as removing -s/-es, -ing, or -ed, it can reach over 70% accuracy.
A minimal example:
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
# create a Porter stemmer
pst = PorterStemmer()
# create a Lancaster stemmer
lst = LancasterStemmer()
print(lst.stem("eating"))
print(pst.stem("shopping"))
Result:
D:\IR_lab\venv\Scripts\python.exe D:/IR_lab/learn.py
eat
shop
Process finished with exit code 0
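The suffix-removal idea mentioned above (-s/-es/-ing/-ed) can be sketched as a naive stemmer. This is a toy assumption for illustration, far cruder than the Porter algorithm, which applies many extra rewrite rules.

```python
def naive_stem(word):
    """Strip one common inflectional suffix, keeping a stem of >= 3 letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("eating"))    # eat
print(naive_stem("walked"))    # walk
print(naive_stem("boxes"))     # box
print(naive_stem("shopping"))  # shopp -- Porter gives 'shop'; real stemmers need extra rules
```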
5. Lemmatization
Lemmatization is more principled: it uses context and part of speech to map an inflected word back to its dictionary form.
A minimal example (note that lemmatize() operates on a single word; a whole phrase passed in is treated as one unknown token and comes back unchanged):
from nltk.stem import WordNetLemmatizer
wlem = WordNetLemmatizer()
# the default part of speech is noun ('n'); pass pos="v" for verbs
print(wlem.lemmatize("ate", pos="v"))
Result:
D:\IR_lab\venv\Scripts\python.exe D:/IR_lab/learn.py
eat
Process finished with exit code 0
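To make the dictionary-form idea concrete, here is a toy lookup-based lemmatizer. The table and function below are hypothetical stand-ins for the WordNet lexicon that WordNetLemmatizer actually consults; the point is that lemmatization is a lookup keyed by word and part of speech, not suffix chopping.

```python
# Tiny hand-made lookup table standing in for a real lexicon (hypothetical data).
LEMMA_TABLE = {
    ("drove", "v"): "drive",
    ("ate", "v"): "eat",
    ("geese", "n"): "goose",
}

def lemmatize(word, pos="n"):
    """Return the dictionary form if (word, pos) is known, else the word unchanged."""
    return LEMMA_TABLE.get((word, pos), word)

print(lemmatize("drove", "v"))  # drive
print(lemmatize("geese", "n"))  # goose
print(lemmatize("table", "n"))  # table  (unknown words pass through)
print(lemmatize("drove"))       # drove  (default pos 'n' misses the verb entry)
```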
6. Stopword Removal
Stopwords are words that carry little value for a document or a query. There are two ways to build a stopword list:
Method 1: take a hand-crafted list, compiled manually or found online.
Method 2: build the list from word frequency.
NLTK ships with its own stopword corpus.
A minimal example:
# import the stopword corpus
from nltk.corpus import stopwords
# get the stopword list for English
stoplist = stopwords.words('english')
# inspect the stopwords
print(stoplist)
text = "This is just a test"
# lowercase the whole text
text = text.lower()
print(text)
# drop every word that appears in the stopword list
cleanwordlist = [word for word in text.split() if word not in stoplist]
print(cleanwordlist)
Result:
D:\IR_lab\venv\Scripts\python.exe D:/IR_lab/learn.py
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
this is just a test
['test']
Process finished with exit code 0
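Method two above (building a stopword list from frequency) can be sketched as follows. The tiny corpus and the top-2 cutoff are arbitrary assumptions for illustration; a real list would be tuned on a large corpus.

```python
from collections import Counter

# A toy corpus (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Count every word across all documents, then treat the most frequent
# words as stopwords.
counts = Counter(word for doc in docs for word in doc.split())
stoplist = {word for word, _ in counts.most_common(2)}
print(stoplist)  # 'the' is the most frequent word and ends up in the list
```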
7. Spelling Correction
A very basic spell checker can be built with a pure dictionary lookup. Fuzzy string matching is another option; the most commonly used technique there is the edit-distance algorithm, covered in detail in a later chapter.
A minimal example:
from nltk.metrics import edit_distance
print(edit_distance("rain", "shine"))
The result is as follows:
D:\IR_lab\venv\Scripts\python.exe D:/IR_lab/learn.py
3
Process finished with exit code 0
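The value 3 comes from the Levenshtein dynamic-programming recurrence, which can be sketched as follows (a minimal implementation of the standard recurrence, not NLTK's exact code): dp[i][j] is the cost of turning the first i characters of a into the first j characters of b.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (free on a match)
    return dp[m][n]

print(levenshtein("rain", "shine"))  # 3, matching edit_distance above
```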
What is the difference between stemming and lemmatization?
In my view, stemming is truncation: it chops off the tail of the word, so driving becomes driv rather than drive; lemmatization transforms the word according to context, so drove becomes drive.
Chapter summary:
This chapter was about text processing. We covered:
sentence splitting, word tokenization, stemming, lemmatization, stopword removal, and spelling correction.
Note:
Can we still perform other NLP operations after stopword removal?
The answer is no. Typical NLP applications such as part-of-speech tagging and sentence-boundary detection need the surrounding context to label the text correctly; once the stopwords are removed, that context is gone.