Python Text Data Analysis

This article was written on October 8, 2017, the last day of the National Day holiday and also the day of the Cold Dew solar term.

A short poem, taken from 《时间之书》 (The Book of Time): dawn comes to the empty mountain, cold with dew; alone I lean on the railing.

Wild geese write their line southward; all I hear is a spring flowing in the deep valley.

Natural Language Processing: Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

The most commonly used library for text analysis in Python is NLTK (Natural Language Toolkit). After installing the NLTK package, call nltk.download() to fetch the corpora and models the examples below rely on.
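For example, the specific resources used later in this article can be downloaded one by one (a minimal sketch; the resource names are NLTK's standard identifiers):

import nltk

# Download only what the examples below need
nltk.download('punkt')                       # tokenizer models for word_tokenize
nltk.download('wordnet')                     # corpus for WordNetLemmatizer
nltk.download('stopwords')                   # stopword lists
nltk.download('averaged_perceptron_tagger')  # POS tagger model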

A typical text preprocessing pipeline looks like this:

Tokenization: split a sentence into semantically meaningful words. English uses the spaces between words as natural delimiters, while Chinese is more complicated; a popular Chinese tokenizer is jieba (结巴分词). Once tokenization is done, the downstream processing differs little between Chinese and English.

Normalization: to improve the accuracy of learning from a corpus, the different inflected forms of a word are often reduced to a single form. Be careful, though: if tense or plurality actually matters for your analysis, you should not normalize it away.

POS tagging: label each token with its part of speech. In NLTK this is done with nltk.pos_tag(), applied to the tokens produced by nltk.word_tokenize().

Stopwords: to save storage and speed up search, NLP pipelines filter out certain words, such as function words (the, is, ...), very frequent content words such as want, and Chinese connectives such as 虽然…但是…. Stopword lists are curated by hand rather than generated automatically. In NLTK, stopwords.words() provides such a list, which you can use to filter tokens.

Tokenization:

import jieba
import nltk

# Tokenize an English sentence with NLTK
sentence = ('In order to improve our quality of life, we must first determine '
            'what our goals and desires are and then put a plan into place to '
            'work toward achieving those goals. With each of our long term goals '
            'comes many choices and decisions, from what to try to how much effort '
            'to put forth. By assessing your current quality of life, you can '
            'focus on bridging the gaps and take advantage of opportunities '
            'you have to make improvements.')
tokens = nltk.word_tokenize(sentence)
print(tokens)

# Tokenize a Chinese sentence with jieba
seg_list = jieba.cut("今天是国庆假期最后一天", cut_all=True)
print("Full mode: " + '/'.join(seg_list))

seg_list = jieba.cut("今天是国庆假期最后一天", cut_all=False)
print("Precise mode: " + '/'.join(seg_list))

Normalization:

# Stemming with PorterStemmer
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
print(porter_stemmer.stem('looking'))

# Lemmatization
from nltk.stem import WordNetLemmatizer  # requires the wordnet corpus

wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize('cats'))
print(wordnet_lemmatizer.lemmatize('boxes'))
print(wordnet_lemmatizer.lemmatize('are'))
print(wordnet_lemmatizer.lemmatize('went'))

# Specifying the part of speech gives a more accurate lemma;
# lemmatize() treats words as nouns by default
print(wordnet_lemmatizer.lemmatize('are', pos='v'))
print(wordnet_lemmatizer.lemmatize('went', pos='v'))
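Since lemmatize() defaults to treating every word as a noun, a common refinement is to run nltk.pos_tag first and map each Penn Treebank tag to a WordNet POS before lemmatizing. A sketch of that idea follows; the helper name treebank_to_wordnet is illustrative, not part of NLTK:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS constant; default to noun."""
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize('The cats went looking for better boxes.')
print([lemmatizer.lemmatize(w, treebank_to_wordnet(t)) for w, t in pos_tag(tokens)])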

# POS tagging
import nltk

words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words))  # requires the averaged_perceptron_tagger model

Removing stopwords:

from nltk.corpus import stopwords  # requires the stopwords corpus

filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Original words:', words)
print('After stopword removal:', filtered_words)

A typical preprocessing pipeline, end to end:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Raw text
raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'

# Tokenize
raw_words = nltk.word_tokenize(raw_text)

# Lemmatize
wordnet_lemmatizer = WordNetLemmatizer()
words = [wordnet_lemmatizer.lemmatize(raw_word) for raw_word in raw_words]

# Remove stopwords
filtered_words = [word for word in words if word not in stopwords.words('english')]

print('Raw text:', raw_text)
print('Preprocessed result:', filtered_words)

Sentiment Analysis

First build a sentiment dictionary, then train a model on labeled samples.

This approach is quite practical, but it generalizes poorly to new or unusual words.

# A simple example: train a Naive Bayes classifier on a handful of labeled sentences
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier

text1 = 'I like the movie so much!'
text2 = 'That is a good movie.'
text3 = 'This is a great one.'
text4 = 'That is a really bad movie.'
text5 = 'This is a terrible movie.'

def proc_text(text):
    """Preprocess a text into the feature dict format NLTK's classifiers expect."""
    # Tokenize
    raw_words = nltk.word_tokenize(text)
    # Lemmatize
    wordnet_lemmatizer = WordNetLemmatizer()
    words = [wordnet_lemmatizer.lemmatize(raw_word) for raw_word in raw_words]
    # Remove stopwords
    filtered_words = [word for word in words if word not in stopwords.words('english')]
    # True means the word occurs in the text
    return {word: True for word in filtered_words}

# Build the training set: 1 = positive, 0 = negative
train_data = [[proc_text(text1), 1],
              [proc_text(text2), 1],
              [proc_text(text3), 1],
              [proc_text(text4), 0],
              [proc_text(text5), 0]]

# Train the model
nb_model = NaiveBayesClassifier.train(train_data)

# Test the model
text6 = 'That is a bad one.'
print(nb_model.classify(proc_text(text6)))
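After training, the classifier can also be inspected, which helps sanity-check a model this small. A short usage sketch, assuming the code above has been run (both methods are standard on NLTK's NaiveBayesClassifier):

# Which features contributed most to the positive/negative decision
nb_model.show_most_informative_features(5)

# Class probabilities instead of a hard 0/1 label
prob = nb_model.prob_classify(proc_text(text6))
print(prob.prob(1), prob.prob(0))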

Text Similarity

When measuring similarity between texts, word frequencies can be used as text features. We pick the words that occur most often across the texts as the dimensions, and represent each text as a vector whose components are the counts of those words in that text; NLTK's FreqDist handles the counting. The similarity between two texts can then be computed with the cosine similarity formula: the larger the value, the more similar the texts (the cosine step itself is sketched right after the code below).

import nltk
from nltk import FreqDist

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

text = text1 + text2 + text3 + text4 + text5
words = nltk.word_tokenize(text)  # tokenize the combined text
freq_dist = FreqDist(words)       # count word frequencies
print(freq_dist['is'])            # frequency of 'is'

# Take the n = 5 most common words
n = 5

# Build the list of most common words
most_common_words = freq_dist.most_common(n)
print(most_common_words)

def lookup_pos(most_common_words):
    """Map each common word to its position (dimension index) in the vector."""
    result = {}
    pos = 0
    for word in most_common_words:
        result[word[0]] = pos
        pos += 1
    return result

# Record the position of each common word.
# most_common_words is a list of (word, count) tuples; converting it to a
# dict keyed by word makes the lookups below straightforward.
std_pos_dict = lookup_pos(most_common_words)
print(std_pos_dict)

# New text
new_text = 'That one is a good movie. This is so good!'

# Initialize the frequency vector
freq_vec = [0] * n

# Tokenize
new_words = nltk.word_tokenize(new_text)

# Count frequencies along the "most common words" dimensions
for new_word in new_words:
    if new_word in std_pos_dict:
        freq_vec[std_pos_dict[new_word]] += 1

print(freq_vec)
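The code above stops at the frequency vector; the cosine similarity mentioned earlier is the remaining step. A minimal sketch that reuses n, std_pos_dict and freq_vec from the code above (the helpers text_to_vec and cosine_sim are illustrative names, not NLTK functions):

import math

def text_to_vec(text):
    """Represent a text as a frequency vector over the n most common words."""
    vec = [0] * n
    for w in nltk.word_tokenize(text):
        if w in std_pos_dict:
            vec[std_pos_dict[w]] += 1
    return vec

def cosine_sim(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Compare the new text's vector with text2's vector
print(cosine_sim(freq_vec, text_to_vec(text2)))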

Text Classification:

Classification can be based on TF-IDF (term frequency - inverse document frequency) values.

TF: Term Frequency, the number of times a term appears in a document.

IDF: Inverse Document Frequency, a measure of how much general importance a term carries; terms that appear in many documents get a low IDF.

TF-IDF = TF * IDF

NLTK implements TF-IDF via TextCollection.tf_idf().
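Before reaching for the NLTK helper, the formula can be checked by hand on a toy corpus. A sketch using the same convention as NLTK's TextCollection (tf = count / document length, idf = log(N / number of documents containing the term)); other libraries apply different smoothing:

import math

docs = [['That', 'is', 'a', 'good', 'movie'],
        ['This', 'is', 'a', 'great', 'one'],
        ['That', 'is', 'a', 'really', 'bad', 'movie']]

term = 'movie'
doc = docs[0]

tf = doc.count(term) / len(doc)                 # term frequency in this document
containing = sum(1 for d in docs if term in d)  # documents containing the term
idf = math.log(len(docs) / containing)          # inverse document frequency
print(tf * idf)                                 # TF-IDF = TF * IDF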

Example code:

import nltk
from nltk.text import TextCollection

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

# Build the TextCollection from tokenized texts so that tf and idf count
# words rather than characters or substrings
tc = TextCollection([nltk.word_tokenize(t)
                     for t in [text1, text2, text3, text4, text5]])

new_text = 'That one is a good movie. This is so good!'
word = 'That'
tf_idf_val = tc.tf_idf(word, nltk.word_tokenize(new_text))
print('TF-IDF of {}: {}'.format(word, tf_idf_val))
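The call above scores a single word. For the classification task this section is named after, each document is usually turned into a TF-IDF feature vector over a fixed vocabulary, which can then be fed to a classifier such as the NaiveBayesClassifier used earlier. A minimal sketch, with a hand-picked vocabulary chosen only for illustration:

# Illustrative vocabulary; in practice, use the most frequent words in the corpus
vocab = ['movie', 'good', 'bad', 'great', 'terrible']

new_tokens = nltk.word_tokenize(new_text)
tfidf_vec = [tc.tf_idf(w, new_tokens) for w in vocab]
print(tfidf_vec)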
