This post was written on October 8, 2017, the last day of the National Day holiday, which also happened to be the Cold Dew solar term.
A short poem, taken from The Book of Time: dawn breaks over the empty mountain and the dew turns cold; alone, I lean on the railing.
Wild geese write their lines southward; from the deep valley, only the sound of a flowing spring.
Natural Language Processing: Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.
When doing text analysis in Python, the most commonly used library is NLTK (the Natural Language Toolkit). After installing the NLTK package, call nltk.download() to fetch the corpora and processing models you need.
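As a minimal sketch, the resources used later in this post (punkt for tokenization, wordnet for lemmatization, stopwords, and averaged_perceptron_tagger for POS tagging, all mentioned in the code comments below) can be fetched non-interactively like this:
import nltk
# Download only what this post needs; calling nltk.download() with no
# arguments opens an interactive downloader instead.
for resource in ['punkt', 'wordnet', 'stopwords', 'averaged_perceptron_tagger']:
    nltk.download(resource)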
A typical text preprocessing pipeline looks like this:
Tokenization: split sentences into semantically meaningful words. English uses the spaces between words as natural delimiters; Chinese is more complicated, and tools such as jieba handle Chinese word segmentation. Once the text has been tokenized, the downstream steps differ little between English and Chinese.
Word-form normalization: to improve the accuracy of learning from a corpus, the different inflected forms of a word are often merged into a single form. Be careful, though, about whether the goal of the analysis depends on tense or number; if it does, normalization should not be applied blindly.
Part-of-speech tagging: label each token with its part of speech. In NLTK, nltk.pos_tag() provides this.
Stop words: to save storage space and speed up search, NLP pipelines automatically filter out certain words, e.g. function words such as "the" and "is", and high-frequency content words such as "want" or patterns like 虽然…但是…. Stop-word lists are compiled by hand rather than generated automatically. In NLTK, stopwords.words() returns the list used to filter them out.
Tokenization:
import jieba
import nltk
# Tokenization with NLTK
sentence = ('In order to improve our quality of life, we must first determine '
            'what our goals and desires are and then put a plan into place to '
            'work toward achieving those goals. With each of our long term goals '
            'comes many choices and decisions, from what to try to how much effort '
            'to put forth. By assessing your current quality of life, you can '
            'focus on bridging the gaps and take advantage of opportunities you '
            'have to make improvements.')
tokens = nltk.word_tokenize(sentence)
print(tokens)
# Word segmentation with jieba
seg_list = jieba.cut("今天是国庆假期最后一天", cut_all=True)
print("Full mode: " + '/'.join(seg_list))
seg_list = jieba.cut("今天是国庆假期最后一天", cut_all=False)
print("Precise mode: " + '/'.join(seg_list))
Word-form normalization:
# Stemming
# PorterStemmer
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
print(porter_stemmer.stem('looking'))
# Lemmatization
from nltk.stem import WordNetLemmatizer  # requires the wordnet corpus
wordnet_lematizer = WordNetLemmatizer()
print(wordnet_lematizer.lemmatize('cats'))
print(wordnet_lematizer.lemmatize('boxes'))
print(wordnet_lematizer.lemmatize('are'))
print(wordnet_lematizer.lemmatize('went'))
# Specifying the part of speech makes lemmatization more accurate
# lemmatize() treats words as nouns by default
print(wordnet_lematizer.lemmatize('are', pos='v'))
print(wordnet_lematizer.lemmatize('went', pos='v'))
# Part-of-speech tagging
import nltk
words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words))  # requires the averaged_perceptron_tagger model
Removing stop words:
from nltk.corpus import stopwords  # requires the stopwords corpus
filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Original words:', words)
print('After stop-word removal:', filtered_words)
A typical text preprocessing pipeline:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# Raw text
raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'
# Tokenization
raw_words = nltk.word_tokenize(raw_text)
# Word-form normalization
wordnet_lematizer = WordNetLemmatizer()
words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]
# Remove stop words
filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Raw text:', raw_text)
print('Preprocessed result:', filtered_words)
Sentiment Analysis
First build a sentiment dictionary, then train on labeled samples to obtain a model.
This approach is quite practical, but it does not extend well to new or unusual words.
# A simple example
# Train a classifier with the Naive Bayes algorithm.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
text1 = 'I like the movie so much!'
text2 = 'That is a good movie.'
text3 = 'This is a great one.'
text4 = 'That is a really bad movie.'
text5 = 'This is a terrible movie.'
def proc_text(text):
    """Preprocess a piece of text."""
    # Tokenization
    raw_words = nltk.word_tokenize(text)
    # Word-form normalization
    wordnet_lematizer = WordNetLemmatizer()
    words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]
    # Remove stop words
    filtered_words = [word for word in words if word not in stopwords.words('english')]
    # True means the word occurs in the text; this dict form is what NLTK's classifiers expect
    return {word: True for word in filtered_words}
# Build the training samples (1 = positive, 0 = negative)
train_data = [[proc_text(text1), 1],
              [proc_text(text2), 1],
              [proc_text(text3), 1],
              [proc_text(text4), 0],
              [proc_text(text5), 0]]
# Train the model
nb_model = NaiveBayesClassifier.train(train_data)
# Test the model
text6 = 'That is a bad one.'
print(nb_model.classify(proc_text(text6)))
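As a small optional follow-up, NLTK's Naive Bayes classifier can also report which word features weighed most heavily and how confident it is about each label:
# Which word features were most informative for the decision?
nb_model.show_most_informative_features()
# Label probabilities for the test sentence (0 = negative, 1 = positive)
prob_dist = nb_model.prob_classify(proc_text(text6))
print(prob_dist.prob(0), prob_dist.prob(1))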
Text Similarity
When measuring how similar two texts are, word frequencies can be used as text features. We take the words that occur most frequently across the texts as the dimensions, and represent each text as a vector whose components are the counts of those words in that text; NLTK can do the frequency counting. The similarity between two texts can then be computed with the cosine similarity formula, cos = (A·B) / (|A||B|); the larger the value, the more similar the texts (see the short sketch after the code below).
import nltk
from nltk import FreqDist
text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'
text = text1 + text2 + text3 + text4 + text5
words = nltk.word_tokenize(text)  # tokenize the combined text
freq_dist = FreqDist(words)  # count word frequencies
print(freq_dist['is'])  # frequency of 'is'
# Take the n = 5 most common words
n = 5
# Build the "common words" list
most_common_words = freq_dist.most_common(n)
print(most_common_words)
def lookup_pos(most_common_words):
    """Map each common word to its position (dimension index)."""
    result = {}
    pos = 0
    for word in most_common_words:
        result[word[0]] = pos
        pos += 1
    return result
# Record positions: most_common_words is a list of (word, count) tuples; converting it
# to a {word: position} dict makes the words easy to look up in the code below.
std_pos_dict = lookup_pos(most_common_words)
print(std_pos_dict)
# New text
new_text = 'That one is a good movie. This is so good!'
# Initialize the frequency vector
freq_vec = [0] * n
# Tokenize
new_words = nltk.word_tokenize(new_text)
# Count frequencies over the "common words" list
for new_word in new_words:
    if new_word in std_pos_dict:
        freq_vec[std_pos_dict[new_word]] += 1
print(freq_vec)
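The code above only builds the frequency vector for the new text. As a minimal sketch of the final step (other_vec is a made-up example standing in for a vector built the same way from a second text), the cosine similarity can then be computed directly:
import math

def cosine_similarity(vec_a, vec_b):
    # cosine similarity: dot(a, b) / (|a| * |b|); larger means more similar
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# freq_vec comes from the code above; other_vec is a hypothetical vector
# built the same way for a second text
other_vec = [1, 1, 0, 1, 0]
print(cosine_similarity(freq_vec, other_vec))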
Text Classification:
Classification can be based on TF-IDF (term frequency-inverse document frequency) scores.
TF: term frequency, the number of times a word occurs in a given document.
IDF: inverse document frequency, a measure of how much general importance a word carries; words that appear in many documents get a low IDF, rare words a high one.
TF-IDF = TF * IDF
NLTK implements TF-IDF through TextCollection.tf_idf().
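To make the formula concrete, here is a small hand-rolled sketch (the helper names are made up for illustration; the IDF uses the same natural-log form as NLTK's TextCollection):
import math

def tf(term, tokens):
    # term frequency: occurrences of the term divided by the document length
    return tokens.count(term) / len(tokens)

def idf(term, token_docs):
    # inverse document frequency: log(number of documents / documents containing the term)
    matches = sum(1 for doc in token_docs if term in doc)
    return math.log(len(token_docs) / matches) if matches else 0.0

docs = [['a', 'good', 'movie'], ['a', 'bad', 'movie'], ['a', 'great', 'one']]
print(tf('movie', docs[0]) * idf('movie', docs))  # 'movie' appears in 2 of 3 docs -> lower score
print(tf('good', docs[0]) * idf('good', docs))    # 'good' appears in only 1 doc -> higher score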
Example code with NLTK:
import nltk
from nltk.text import TextCollection
text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'
# Build a TextCollection from the tokenized texts
tc = TextCollection([nltk.word_tokenize(t)
                     for t in [text1, text2, text3, text4, text5]])
new_text = 'That one is a good movie. This is so good!'
word = 'That'
# tf_idf expects the target document as a token list as well
tf_idf_val = tc.tf_idf(word, nltk.word_tokenize(new_text))
print('TF-IDF of {}: {}'.format(word, tf_idf_val))
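To turn this into features for an actual classifier, one option (just a sketch, reusing the most_common_words list from the similarity section above) is to compute a TF-IDF value for each common word and use the resulting vector as the document's features:
# Sketch: a TF-IDF feature vector over the n most common words
new_tokens = nltk.word_tokenize(new_text)
tfidf_vec = [tc.tf_idf(w, new_tokens) for w, count in most_common_words]
print(tfidf_vec)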