Mastering Natural Language Processing with Python, 1: String Operations

Code: https://github.com/PacktPublishing/Mastering-Natural-Language-Processing-with-Python
1. Tokenization

    Tokenization is the process of splitting text into smaller units called tokens. The sent_tokenize function uses an instance of the PunktSentenceTokenizer class from the NLTK package. This instance has been trained on the characters and punctuation marks that signal the beginning and end of sentences, so it can tokenize text in several European languages.

import nltk
text=" Welcom readers. I hope you find it interesting. Please do reply."
from nltk.tokenize import sent_tokenize
sent_tokenize(text)
Out[4]: [' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']

To tokenize a large number of sentences, load PunktSentenceTokenizer directly and call its tokenize() function; tokenizers trained for other languages can be loaded the same way:

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # or e.g. 'tokenizers/punkt/french.pickle' for French
tokenizer.tokenize(text)

To split a sentence into words, use the word_tokenize() function, which uses an instance of the TreebankWordTokenizer class from the NLTK package:

text=nltk.word_tokenize("I hope you find it interesting.")
print(text)
['I', 'hope', 'you', 'find', 'it', 'interesting', '.']

Tokenization also splits contractions apart:

text=nltk.word_tokenize("Don't hesitate to ask questions")
print(text)
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
The same result can be obtained by loading TreebankWordTokenizer and then calling its tokenize() function:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize("Have a nice day. I hope you find the book interesting")
Out[14]: 
['Have',
 'a',
 'nice',
 'day.',
  ...

Another tokenizer, PunktWordTokenizer, works by splitting punctuation off from words (note that it is no longer shipped with recent NLTK releases). There is also WordPunctTokenizer, which turns punctuation into entirely separate tokens:

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Don't hesitate to ask questions")
Out[17]: ['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

The tokenizers' inheritance tree: (diagram from the book, not reproduced here)

    Tokenization using regular expressions (p. 20 of the book): either by matching the words themselves or by matching the spaces/gaps between them (the gap-matching variant is sketched after the example below):

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w]+")
tokenizer.tokenize("Don't hesitate to ask questions")
Out[20]: ['Don', 't', 'hesitate', 'to', 'ask', 'questions']
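
    The complementary approach matches the whitespace between words rather than the words themselves; a minimal sketch using RegexpTokenizer's gaps parameter:

from nltk.tokenize import RegexpTokenizer
# gaps=True makes the pattern describe the separators, not the tokens
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokenizer.tokenize("Don't hesitate to ask questions")
# expected: ["Don't", 'hesitate', 'to', 'ask', 'questions']
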
2. Normalization

    Normalization mainly involves operations such as eliminating punctuation, converting to upper or lower case, converting numbers into words, expanding abbreviations, and canonicalizing text.

    Eliminating punctuation (sketched below):
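
    A minimal sketch using Python's built-in string.punctuation; the helper name remove_punctuation is illustrative, not from the book:

import string

def remove_punctuation(tokens):
    # Delete every character listed in string.punctuation,
    # then drop any token that becomes empty
    table = str.maketrans('', '', string.punctuation)
    stripped = [token.translate(table) for token in tokens]
    return [token for token in stripped if token]

remove_punctuation(["Don't", 'hesitate', 'to', 'ask', 'questions', '.'])
# expected: ['Dont', 'hesitate', 'to', 'ask', 'questions']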

    Case conversion of text (illustrated below):
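
    A short illustration with Python's built-in string methods:

text = "HARdWork IS KEy to SUCCESS"
text.lower()
# 'hardwork is key to success'
text.upper()
# 'HARDWORK IS KEY TO SUCCESS'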

    Handling stop words: stop words are words that should be filtered out, because they contribute little to the overall meaning of a sentence. Search engines remove stop words in order to narrow their search space. The stop word lists can be accessed under nltk_data/corpora/stopwords:

from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
words = ["Don't", "hesitate", "to", "ask", "questions"]
[word for word in words if word not in stops]
Out[28]: ["Don't", 'hesitate', 'ask', 'questions']
3. Substituting and correcting tokens

    Substituting words using regular expressions (sketched below):
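
    A minimal sketch of a regex-based replacer in the spirit of the book's replacers module; this pattern list is an abbreviated, illustrative subset:

import re

# Each pair maps a contraction pattern to its expansion
replacement_patterns = [
    (r"won't", 'will not'),
    (r"can't", 'cannot'),
    (r"n't", ' not'),
]

def replace_contractions(text):
    for pattern, repl in replacement_patterns:
        text = re.sub(pattern, repl, text)
    return text

replace_contractions("Don't hesitate to ask questions")
# expected: 'Do not hesitate to ask questions'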

    Substituting words with their synonyms (sketched below):
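
    A minimal dictionary-based synonym replacer; the word map here is illustrative:

class WordReplacer:
    # Replace a word with its mapped synonym, or return it unchanged
    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        return self.word_map.get(word, word)

replacer = WordReplacer({'bday': 'birthday', 'congrats': 'congratulations'})
replacer.replace('bday')
# expected: 'birthday'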

4. Applying Zipf's law to text

    Zipf's law states that the frequency of a token in a text is inversely proportional to its rank, i.e. its position in the sorted frequency list. The law describes how tokens are distributed in language: a few tokens occur very frequently, some occur with medium frequency, and many occur only rarely.

import nltk
from nltk.corpus import gutenberg
from nltk.probability import FreqDist
import matplotlib
matplotlib.use('TkAgg')  # select the backend before importing pyplot
import matplotlib.pyplot as plt

# Count the frequency of every token in the Gutenberg corpus
fd = FreqDist()
for text in gutenberg.fileids():
    for word in gutenberg.words(text):
        fd[word] += 1

# Ranks must follow descending frequency, so iterate over most_common()
ranks = []
freqs = []
for rank, (word, freq) in enumerate(fd.most_common(), start=1):
    ranks.append(rank)
    freqs.append(freq)

# Under Zipf's law the log-log plot is approximately a straight line
plt.loglog(ranks, freqs)
plt.xlabel('rank (r)', fontsize=14, fontweight='bold')
plt.ylabel('frequency (f)', fontsize=14, fontweight='bold')
plt.grid(True)
plt.show()


5. Similarity measures

    The nltk.metrics package provides various evaluation and similarity measures.

    Using the edit distance algorithm (example below):
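
    NLTK exposes this as nltk.metrics.edit_distance, the Levenshtein distance between two strings:

from nltk.metrics import edit_distance
edit_distance('relate', 'relation')
# 3: substitute 'e' with 'i', then insert 'o' and 'n'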

    Using the Jaccard coefficient (example below):
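
    nltk.metrics.jaccard_distance computes one minus the Jaccard coefficient of two sets:

from nltk.metrics import jaccard_distance
X = set([10, 20, 30, 40])
Y = set([20, 30, 60])
jaccard_distance(X, Y)
# 0.6: the sets share 2 of 5 distinct elements, so 1 - 2/5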

    Using the Smith-Waterman distance (sketched below):
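
    Smith-Waterman local alignment is not shipped with nltk.metrics; a minimal sketch of the alignment score, with illustrative match/mismatch/gap weights:

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    # H[i][j] holds the best local-alignment score ending at a[i-1] and b[j-1];
    # Smith-Waterman clamps every cell at zero so alignments can restart
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

smith_waterman_score('GATTACA', 'GCATGCU')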

    Other string similarity measures (examples below):
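
    nltk.metrics also provides, among others, binary_distance (0.0 for identical label sets, 1.0 otherwise) and masi_distance, which discounts partial agreement between sets:

from nltk.metrics import binary_distance, masi_distance
binary_distance(set([10, 20]), set([10, 20]))
# 0.0, because the two label sets are identical
masi_distance(set([10, 20]), set([20, 30]))
# between 0 and 1; larger when the sets agree less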

     

Reference: Deepti Chopra, Nisheeth Joshi, Iti Mathur, Mastering Natural Language Processing with Python, Packt Publishing, 2016, ISBN 1783989041, 238 pages.