Text Sentiment Analysis in Python: Sentiment Polarity Analysis (Common Methods for Sentiment Polarity Analysis in Python)

Author: 四星级Java导师

Published 2024-04-26 12:38:26


Article link: https://blog.csdn.net/m0_60707263/article/details/138215191


newSent.append(word)

return newSent`


Here I use **Jieba** for word segmentation.


###### 1.2.2 Removing stop words


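Go through every word of every text in the corpus and **remove the stop words**.  
e.g. 这样/的/酒店/配/这样/的/价格/还算/不错 (roughly "a hotel like this at a price like this is fairly good")  
--> 酒店/配/价格/还算/不错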
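The step above can be sketched as follows. This is a minimal illustration that assumes the sentence has already been segmented (for example by Jieba); the tiny `STOP_WORDS` set and the function name `remove_stop_words` are hypothetical stand-ins, not the stop-word list or code used in the article.

```python
# Hypothetical stop-word set; a real list would be loaded from a file.
STOP_WORDS = {"这样", "的"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words that are not stop words, keeping their order."""
    return [w for w in words if w not in stop_words]


segmented = "这样/的/酒店/配/这样/的/价格/还算/不错".split("/")
print("/".join(remove_stop_words(segmented)))
```

Using a set for the stop words keeps each membership test constant-time, which matters once the list grows to a few thousand entries.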


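##### 1.3 Building the model


###### 1.3.1 Classifying words and recording their positions


Store each category of word in the sentence separately, together with its position.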

```python
from collections import defaultdict


# 2. Locating sentiment words
def classifyWords(wordDict):
    # wordDict maps each word of the sentence to its position

    # (1) sentiment words: one "word score" pair per line
    senList = readLines('BosonNLP_sentiment_score.txt')
    senDict = defaultdict()
    for s in senList:
        senDict[s.split(' ')[0]] = s.split(' ')[1]

    # (2) negation words
    notList = readLines('notDict.txt')

    # (3) degree adverbs: one "word,weight" pair per line
    degreeList = readLines('degreeDict.txt')
    degreeDict = defaultdict()
    for d in degreeList:
        degreeDict[d.split(',')[0]] = d.split(',')[1]

    senWord = defaultdict()
    notWord = defaultdict()
    degreeWord = defaultdict()

    # file each word under exactly one category, keyed by its position
    for word in wordDict.keys():
        if word in senDict.keys() and word not in notList and word not in degreeDict.keys():
            senWord[wordDict[word]] = senDict[word]
        elif word in notList and word not in degreeDict.keys():
            notWord[wordDict[word]] = -1
        elif word in degreeDict.keys():
            degreeWord[wordDict[word]] = degreeDict[word]
    return senWord, notWord, degreeWord
```
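To see the classification step in isolation, here is a minimal sketch that replaces the three dictionary files with small in-memory lexicons. The names `SENTIMENT`, `NEGATIONS`, `DEGREES` and `classify_words` are assumptions made for illustration, and the sketch takes a list of words rather than a word-to-position dictionary; only the two numeric values come from the example used in this article.

```python
# Toy lexicons standing in for the real dictionary files.
SENTIMENT = {"交好": 0.747127733968}
NEGATIONS = {"不是"}
DEGREES = {"很": 1.25}


def classify_words(words):
    """Map each word position to its sentiment score, negation flag or degree weight."""
    sen_word, not_word, degree_word = {}, {}, {}
    for pos, word in enumerate(words):
        # Same priority as the listing above: degree adverb, then negation, then sentiment.
        if word in DEGREES:
            degree_word[pos] = DEGREES[word]
        elif word in NEGATIONS:
            not_word[pos] = -1
        elif word in SENTIMENT:
            sen_word[pos] = SENTIMENT[word]
    return sen_word, not_word, degree_word


print(classify_words(["不是", "很", "交好"]))
```

Keying the results by position rather than by word keeps repeated words apart, which a dictionary keyed by the word itself cannot do.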


###### 1.3.2 Computing the sentence score


> Here a simplified scoring rule is used: **the sum of the scores of all sentiment word groups**


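Define a **sentiment word group**: all negation words and degree adverbs between two sentiment words, together with the latter of those two sentiment words, form one group, i.e. `notWords + degreeWords + sentiWords`. For example, in `不是很交好` ("not on very good terms"), `不是` is a negation word, `很` is a degree adverb and `交好` is a sentiment word, so the score of this group is:  
`finalSentiScore = (-1) ^ 1 * 1.25 * 0.747127733968`  
Here `^` denotes exponentiation, the exponent `1` is the number of negation words, `1.25` is the weight of the degree adverb, and `0.747127733968` is the sentiment score of `交好`.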


In pseudocode:  
`finalSentiScore = (-1) ^ (num of notWords) * degreeNum * sentiScore`  
`finalScore = sum(finalSentiScore)`
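The pseudocode translates almost directly into Python, with one caveat: in Python `^` is bitwise XOR, so exponentiation has to be written `**`. The sketch below is only an illustration of the formula; the function names `group_score` and `final_score` and the tuple layout of a group are assumptions.

```python
def group_score(num_not_words, degree, senti_score):
    """Score of one sentiment word group: (-1)^n * degree * sentiment score."""
    return (-1) ** num_not_words * degree * senti_score


def final_score(groups):
    """Sum the group scores; each group is (num_not_words, degree, senti_score)."""
    return sum(group_score(*g) for g in groups)


# "不是很交好": one negation word, degree 1.25, sentiment score 0.747127733968
print(group_score(1, 1.25, 0.747127733968))
```

For the example group this gives roughly -0.934: the single negation flips the sign and the degree adverb scales the magnitude by 1.25.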

```python
# 3. Sentiment aggregation
def scoreSent(senWord, notWord, degreeWord, segResult):
    W = 1      # weight of the sentiment word group currently being read
    score = 0

    # walk through every word of the segmented sentence; i is its absolute position
    for i in range(0, len(segResult)):
        # a negation word flips the sign of the current group
        if i in notWord:
            W *= -1
        # a degree adverb scales the current group
        elif i in degreeWord:
            W *= float(degreeWord[i])
        # a sentiment word closes the group: add its weighted score
        elif i in senWord:
            score += W * float(senWord[i])
            # print("score = %f" % score)
            # start the next group with a fresh weight
            W = 1
    return score
```


##### 1.4 Model evaluation


After sorting the scores of some 600 WeChat Moments posts, the scatter plot looks like this:


![](https://upload-images.jianshu.io/upload_images/2434465-6ef2a6cec1c67a2b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700)


Score Distribution


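Most of the texts are judged positive, which matches reality, and for the vast majority of texts the absolute value of the score is below 10. This is because, when scoring a text, I treat the full stop as the end of a sentence and sum the scores of the sentiment word groups within each sentence; when a text contains several sentences, I take **the average of the sentence scores**.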
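The averaging rule can be sketched as below. This is a minimal illustration, not the article's code: `text_score` is a hypothetical helper, the per-sentence scorer is passed in as a function, and returning `0.0` for a text without any sentence is an assumption.

```python
def text_score(text, score_sentence):
    """Average the scores of the sentences in `text`, splitting on the Chinese full stop."""
    sentences = [s for s in text.split("。") if s.strip()]
    if not sentences:
        return 0.0  # assumption: an empty text is treated as neutral
    return sum(score_sentence(s) for s in sentences) / len(sentences)


# Hypothetical per-sentence scores used only for illustration.
toy_scores = {"酒店不错": 2.0, "价格太贵": -1.0}
print(text_score("酒店不错。价格太贵。", toy_scores.get))
```

Because the result is a mean, a long text with many mildly positive sentences and one strongly negative sentence ends up close to zero, which helps explain why most scores stay within a narrow range.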



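However, the **shortcomings and limitations** of this model are also obvious:


* First, the score of a paragraph is the average of the scores of its sentences, which does not reflect reality. Just as the paragraphs of an article differ in importance, so do the sentences within a paragraph.
* Second, some texts use negative words to convey a positive meaning, which is common in **promotional** copy. The same example again (an advertisement that jokes about traffic police pay being halved and about telecom, oil and insurance companies laying off staff, all as praise for the product):  
`有车一族都用了这个宝贝，后果很严重哦[偷笑][偷笑][偷笑]1，交警工资估计会打5折，没有超速罚款了[呲牙][呲牙][呲牙]2，移动联通公司大幅度裁员，电话费少了[呲牙][呲牙][呲牙]3，中石化中石油裁员2成，路痴不再迷路，省油[悠闲][悠闲][悠闲]5，保险公司裁员2成，保费折上折2成，全国通用[憨笑][憨笑][憨笑]买不买你自己看着办吧[调皮][调皮][调皮]2980元轩辕魔镜带回家，推广还有返利[得意]`  
The few texts in Score Distribution that score below `-10` all resemble this case. Handling them effectively probably requires **deep learning**; ordinary machine learning methods also struggle with them.
* When judging polarity, the algorithm ignores many other ways in which negation words, degree adverbs and sentiment words can combine, and it is also too simplistic for judging sentiment strength.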



> In short, this model is only fit to serve as a BENCHMARK...


### 2. Text sentiment polarity analysis based on machine learning


##### 2.1 Data preparation, again


###### 2.1.1 Stop words


(Same as 1.1.4)


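###### 2.1.2 Positive and negative corpora


The corpus comes from [a hotel review corpus for Chinese sentiment mining]( ), with 7,000 positive and 3,000 negative reviews (may I take this as a sign that the world is still full of goodwill…). You can also consult [Sentiment analysis resources (repost)]( ) and use another corpus as the training set.


###### 2.1.3 Validation set


Amazon reviews of the iPhone 6s; the original source can no longer be traced……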


##### 2.2 Data preprocessing


###### 2.2.1 Word segmentation, again


(Same as 1.2.1)

```python
import numpy as np
import sys
import re
import codecs
import os
import jieba
import joblib  # standalone package; no longer bundled as sklearn.externals.joblib
from gensim.models import word2vec
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation
from sklearn.preprocessing import scale
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from scipy import stats
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
from sklearn.metrics import f1_score
from bayes_opt import BayesianOptimization as BO
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt


def parseSent(sentence):
    # segment the sentence with Jieba
    seg_list = jieba.cut(sentence)
    output = ' '.join(list(seg_list))  # use space to join them
    return output
```


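###### 2.2.2 Removing stop words, again


(Same as 1.2.2)


###### 2.2.3 Training word vectors


> **(Here comes the key point!)** The model's input must be **data tuples**, so **the words of each record have to be converted into a numeric vector**


Common conversion algorithms include, but are not limited to: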


* **Bag of Words**
* **TF-IDF**
* **Word2Vec**

![](https://upload-images.jianshu.io/upload_images/2434465-49bc3214f731d079.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700)

![](https://upload-images.jianshu.io/upload_images/2434465-623ec3a4ab62fbc7.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700)

![](https://upload-images.jianshu.io/upload_images/2434465-014a34e304b66e1c.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/700)
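To make the first two schemes in the list concrete, here is a small pure-Python sketch of Bag of Words and TF-IDF over pre-segmented documents. It is an illustration only: the function names are assumptions, and the IDF used is the plain textbook `log(N / df)`, which differs from the smoothed variant some libraries apply.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    vocab = sorted({w for doc in docs for w in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each term count by log(N / document frequency)."""
    vocab, counts = bag_of_words(docs)
    n = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    weighted = [[row[j] * math.log(n / df[j]) for j in range(len(vocab))]
                for row in counts]
    return vocab, weighted


docs = [["酒店", "不错"], ["酒店", "太差"]]
print(bag_of_words(docs))
print(tf_idf(docs))
```

In this toy corpus `酒店` appears in both documents, so its TF-IDF weight is zero, while the words that distinguish the two reviews keep a positive weight.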


Here I use Word2Vec to convert the corpus into vectors; for the detailed steps, see my article [A Python classifier for a Q&A bot]( ).

```python
from gensim.models import KeyedVectors


def getWordVecs(wordList):
    # look up the vector of every word the Word2Vec model knows
    vecs = []
    for word in wordList:
        word = word.replace('\n', '')
        try:
            vecs.append(model[word])
        except KeyError:
            # skip out-of-vocabulary words
            continue
    # vecs = np.concatenate(vecs)
    return np.array(vecs, dtype='float')


def buildVecs(filename):
    posInput = []
    # text mode so that the lines are str, not bytes; UTF-8 encoding is assumed
    with open(filename, encoding='utf-8') as txtfile:
        # print(txtfile)
        for lines in txtfile:
            lines = lines.split('\n ')
            for line in lines:
                line = jieba.cut(line)
                resultList = getWordVecs(line)
                # for each sentence, the mean vector of all its vectors is used to represent this sentence
                if len(resultList) != 0:
                    resultArray = sum(np.array(resultList)) / len(resultList)
                    posInput.append(resultArray)
    return posInput


# load word2vec model
model = KeyedVectors.load_word2vec_format("corpus.model.bin", binary=True)

# txtfile = [u'标准间太差房间还不如3星的而且设施非常陈旧.建议酒店把老的标准间从新改善.', u'在这个西部小城市能住上这样的酒店让我很欣喜，提供的免费接机服务方便了我的出行，地处市中心，购物很方便。早餐比较丰富，服务人员很热情。推荐大家也来试试，我想下次来这里我仍然会住这里']

posInput = buildVecs('pos.txt')
negInput = buildVecs('neg.txt')  # negative corpus; file name assumed by analogy with pos.txt

# use 1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(posInput)), np.zeros(len(negInput))))

X = posInput[:]
for neg in negInput:
    X.append(neg)
X = np.array(X)
```
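The core idea of the listing, representing a sentence by the mean of its word vectors, can be tried without a trained model. The sketch below is a dependency-free illustration under stated assumptions: `toy_vectors` is a made-up two-dimensional lookup table standing in for Word2Vec, and `sentence_vector` is a hypothetical helper name.

```python
def sentence_vector(words, word_vectors):
    """Average the vectors of the known words; return None if no word is known."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Made-up 2-dimensional vectors standing in for a trained Word2Vec model.
toy_vectors = {"酒店": [1.0, 0.0], "不错": [0.0, 1.0]}
print(sentence_vector(["酒店", "不错", "未登录词"], toy_vectors))
```

Out-of-vocabulary words are simply skipped, mirroring the `KeyError` handling in the listing, so a sentence made up only of unknown words yields no vector at all and is dropped from the training data.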


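###### 2.2.4 Standardization


I think standardization has little effect on model accuracy for this problem, but other standardization methods are also worth trying.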


```python
# standardization
X = scale(X)
```
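For readers who want to see what this kind of standardization does to the numbers, here is a small pure-Python sketch that centres each feature column and divides it by its population standard deviation. It is an illustration of the idea rather than scikit-learn's implementation; `standardize` is a hypothetical name, and mapping constant columns to zero is a choice made here to avoid division by zero.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit (population) variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n)
            for c, m in zip(cols, means)]
    # Constant columns are left at zero instead of being divided by 0.
    return [[(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
            for row in rows]


print(standardize([[1.0, 10.0], [3.0, 10.0]]))
```

After this step every feature contributes on a comparable scale, which matters for distance-based models such as an RBF-kernel SVM.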


###### 2.2.5 Dimensionality reduction


According to the PCA result, the first 100 dimensions cover more than 95% of the variance.


![](https://upload-images.jianshu.io/upload_images/2434465-c8a22969a9cbe464.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/247)

PCA

```python
# Plot the PCA spectrum
pca = PCA()  # assumption: a full PCA; the original definition of `pca` is not shown
pca.fit(X)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

X_reduced = PCA(n_components = 100).fit_transform(X)
```
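One way to arrive at a figure like "100 dimensions for 95% of the variance" is to accumulate the explained variances until the chosen share is reached. The sketch below shows that bookkeeping on a made-up spectrum; `components_for_variance` is a hypothetical helper, and the input is assumed to be sorted in decreasing order, as PCA implementations usually report it.

```python
def components_for_variance(explained_variance, threshold=0.95):
    """Smallest number of leading components whose variance share reaches `threshold`."""
    total = sum(explained_variance)
    running = 0.0
    for k, var in enumerate(explained_variance, start=1):
        running += var
        if running / total >= threshold:
            return k
    return len(explained_variance)


# Made-up spectrum used only for illustration.
print(components_for_variance([6.0, 3.0, 0.7, 0.2, 0.1]))
```

With a fitted scikit-learn `PCA` object, the same function could presumably be fed `pca.explained_variance_` directly.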


##### 2.3 Building the model


###### 2.3.1 SVM (RBF) + PCA


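The SVM with an RBF kernel classifies more leniently, and the model performs noticeably better after **PCA** dimensionality reduction. The misclassified samples are mostly negative texts labelled as positive. The model reaches `AUC = 0.92` and `KSValue = 0.7`.  
For tuning the SVM, see my other article [Tuning hyper-parameters with a Gaussian Process in Python]( ).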
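The two metrics quoted above can be computed from predicted probabilities and true labels. The sketch below is a pure-Python illustration under two assumptions: that `KSValue` refers to the Kolmogorov-Smirnov statistic, i.e. the largest gap between the true-positive and false-positive rates along the ROC curve, and that both classes are present. The function names and the toy data are made up.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs obtained by sweeping the threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), reverse=True)
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc_and_ks(labels, scores):
    """AUC by the trapezoidal rule and KS as the largest gap between TPR and FPR."""
    pts = roc_points(labels, scores)
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    ks = max(tpr - fpr for fpr, tpr in pts)
    return auc, ks


# Toy labels (1 = positive) and predicted probabilities of the positive class.
print(auc_and_ks([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2]))
```

On the toy data the AUC is 5/6 and the KS value is 2/3: five of the six positive-negative pairs are ranked correctly, and the widest gap between the two rates occurs after the top two predictions.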

```python
# 2.1 SVM (RBF)
# using training data with 100 dimensions

clf = SVC(C = 2, probability = True)
```
