VADER:社交网络文本情感分析库

VADER(Valence Aware Dictionary and sEntiment Reasoner)是专门为社交媒体进行情感分析的工具,目前仅支持英文文本,大邓在这里推荐给大家使用。大家可以结合大邓的教程  

【视频课程】Python爬虫与文本数据分析 

,自己采集数据自己进行分析。

VADER情感信息会考虑:

  • 否定表达(如,"not good")

  • 能表达情感信息和强度的标点符号 (如, "Good!!!")

  • 大小写等形式带来的强调,(如,"FUNNY.")

  • 情感强度(强度增强,如"very" ;强度减弱如, "kind of")

  • 表达情感信息的俚语 (如, 'sux')

  • 能修饰俚语情感强度的词语 ('uber'、'friggin'、'kinda')

  • 表情符号 :) and :D

  • utf-8编码中的emoj情感表情 ( ? and ? and ?)

  • 首字母缩略语(如,'lol') ...

VADER目前只支持英文文本,如果有符合VADER形式的中文词典,也能使用VADER对中文进行分析。

安装VADER

pip3 install vaderSentiment

使用方法

VADER会对文本分析,得到的结果是一个字典信息,包含

  • pos,文本中正面信息得分

  • neg,文本中负面信息得分

  • neu,文本中中性信息得分

  • compound,文本综合情感得分

文本情感分类

依据compound综合得分对文本进行分类的标准

  • 正面:compound score >= 0.05

  • 中性: -0.05 < compound score < 0.05

  • 负面: compound score <= -0.05

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer	
analyzer = SentimentIntensityAnalyzer()	
test = "VADER is smart, handsome, and funny."	
analyzer.polarity_scores(test)

运行

{'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}

这里我们只使用 compound 得分,用更多的例子让大家看到感叹号、俚语、emoji、强调等不同方式对得分的影响。为了方便,我们想将结果以dataframe方式展示

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer	
import pandas as pd	
analyzer = SentimentIntensityAnalyzer()	
sentences = ["VADER is smart, handsome, and funny.",  	
             "VADER is smart, handsome, and funny!",  #带感叹号	
             "VADER is very smart, handsome, and funny.", 	
             "VADER is VERY SMART, handsome, and FUNNY.",   #FUNNY.强调	
             "VADER is VERY SMART, handsome, and FUNNY!!!",	
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!", 	
             "VADER is not smart, handsome, nor funny.",  	
             "The book was good.",  	
             "At least it isn't a horrible book.",  	
             "The book was only kind of good.", 	
             "The plot was good, but the characters are uncompelling and the dialog is not great.", 	
             "Today SUX!",  	
             "Today only kinda sux! But I'll get by, lol",  #lol缩略语	
             "Make sure you :) or :D today!",  	
             "Catch utf-8 emoji such as such as ? and ? and ?",  #emoji	
             "Not bad at all"  	
             ]	
def senti(text):	
    return analyzer.polarity_scores(text)['compound']	
df = pd.DataFrame(sentences)	
df.columns = ['text']	
#对text列使用senti函数进行批处理,得到的得分赋值给sentiment列	
df['sentiment'] = df.agg({'text':[senti]})	
df

640?wx_fmt=png


更多

VADER目前只支持英文文本,如果想要对中文文本进行分析,需要做两大方面改动。

对于初学者来说有难度,建议大家不着急的话可以系统学完python基础语法,大概学习使用python数周就能自己更改库的源代码。

首先要将库中的vaderSentiment.py中相应的英文词语改为中文词语

# (empirically derived mean sentiment intensity rating increase for booster words)	
B_INCR = 0.293	
B_DECR = -0.293	
# (empirically derived mean sentiment intensity rating increase for using ALLCAPs to emphasize a word)	
C_INCR = 0.733	
N_SCALAR = -0.74	
#否定词	
NEGATE = \	
    ["aint", "arent", "cannot", "cant", "couldnt", "darent", "didnt", "doesnt",	
     "ain't", "aren't", "can't", "couldn't", "daren't", "didn't", "doesn't",	
     "dont", "hadnt", "hasnt", "havent", "isnt", "mightnt", "mustnt", "neither",	
     "don't", "hadn't", "hasn't", "haven't", "isn't", "mightn't", "mustn't",	
     "neednt", "needn't", "never", "none", "nope", "nor", "not", "nothing", "nowhere",	
     "oughtnt", "shant", "shouldnt", "uhuh", "wasnt", "werent",	
     "oughtn't", "shan't", "shouldn't", "uh-uh", "wasn't", "weren't",	
     "without", "wont", "wouldnt", "won't", "wouldn't", "rarely", "seldom", "despite"]	
# booster/dampener 'intensifiers' or 'degree adverbs'	
# http://en.wiktionary.org/wiki/Category:English_degree_adverbs	
# 情感强度 副词	
BOOSTER_DICT = \	
    {"absolutely": B_INCR, "amazingly": B_INCR, "awfully": B_INCR, 	
     "completely": B_INCR, "considerable": B_INCR, "considerably": B_INCR,	
     "decidedly": B_INCR, "deeply": B_INCR, "effing": B_INCR, "enormous": B_INCR, "enormously": B_INCR,	
     ......	
     "thoroughly": B_INCR, "total": B_INCR, "totally": B_INCR, "tremendous": B_INCR, "tremendously": B_INCR,	
     "uber": B_INCR, "unbelievably": B_INCR, "unusually": B_INCR, "utter": B_INCR, "utterly": B_INCR,	
     "very": B_INCR,	
     "almost": B_DECR, "barely": B_DECR, "hardly": B_DECR, "just enough": B_DECR,	
     "kind of": B_DECR, "kinda": B_DECR, "kindof": B_DECR, "kind-of": B_DECR,	
     "less": B_DECR, "little": B_DECR, "marginal": B_DECR, "marginally": B_DECR,	
     "occasional": B_DECR, "occasionally": B_DECR, "partly": B_DECR,	
     "scarce": B_DECR, "scarcely": B_DECR, "slight": B_DECR, "slightly": B_DECR, "somewhat": B_DECR,	
     "sort of": B_DECR, "sorta": B_DECR, "sortof": B_DECR, "sort-of": B_DECR}	
# 不再情感形容词词典中,但包含情感信息的俚语表达(目前英文方面也未完成)	
SENTIMENT_LADEN_IDIOMS = {"cut the mustard": 2, "hand to mouth": -2,	
                          "back handed": -2, "blow smoke": -2, "blowing smoke": -2,	
                          "upper hand": 1, "break a leg": 2,	
                          "cooking with gas": 2, "in the black": 2, "in the red": -2,	
                          "on the ball": 2, "under the weather": -2}	
# 包含词典单词的特殊情况俚语	
SPECIAL_CASE_IDIOMS = {"the shit": 3, "the bomb": 3, "bad ass": 1.5, "badass": 1.5,	
                       "yeah right": -2, "kiss of death": -1.5, "to die for": 3}

然后还要将英文词典 vaderlexicon.txt改为对应格式的中文词典。vaderlexicon.txt格式

TOKEN, MEAN-SENTIMENT-RATING, STANDARD DEVIATION, and RAW-HUMAN-SENTIMENT-RATINGS

引用信息

如果使用VADER词典、代码、或者分析方法发表学术文章,请注明出处,格式如下

Hutto, C.J. &amp; Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.	
Refactoring for Python 3 compatibility, improved modularity, and incorporation into [NLTK] ...many thanks to Ewan &amp; Pierpaolo.

推荐阅读

【视频课程】Python爬虫与文本数据分析

640?wx_fmt=jpeg

如何用nbmerge合并多个notebook文件?   

Handout库:能将python脚本转化为html展示文件

自然语言处理库nltk、spacy安装及配置方法

datatable:比pandas更快的GB量级的库

国人开发的数据可视化神库 pyecharts

pandas_profiling:生成动态交互的数据探索报告

cufflinks: 让pandas拥有plotly的炫酷的动态可视化能力

使用Pandas、Jinja和WeasyPrint制作pdf报告

使用Pandas更好的做数据科学

使用Pandas更好的做数据科学(二)

少有人知的python数据科学库

folium:地图数据可视化库

学习编程遇到问题,该如何正确的提问?

如何用Google Colab高效的学习Python

大神kennethreitz写出requests-html号称为人设计的解析库

flashtext:大规模文本数据清洗利器

  • 1
    点赞
  • 28
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值