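An NLP Walkthrough: What Did Zuckerberg Say at the Hearing?

Zuckerberg has spent the past few days in front of a US Senate committee hearing over the Facebook data leak. Riding that wave of attention, let's use NLP techniques to analyze the hearing transcript and see what he actually said.

I did the analysis in Jupyter. The packages needed are: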

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk.corpus as corpus
import nltk
from wordcloud import WordCloud,STOPWORDS
from PIL import Image

Load the data:

df = pd.read_csv('/Users/Cyan/Desktop/Testimony/testimony.csv')

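The dataset looks like this (first 10 rows; the preview output is not reproduced here):

First, a word cloud to see the key words in the conversation:

(The mask used here is the same photo as the article's cover image.)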

mark_mask = np.array(Image.open("/Users/Cyan/Desktop/Testimony/mark-mask.jpg"))
col = df["Text"]
wordcloud = WordCloud(max_words=100, max_font_size=100, width=800, height=600, mask=mark_mask, background_color ="white").generate(' '.join(col))
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.title("What did Zuckerberg say?", fontsize=30)
plt.axis("off")
plt.show()

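Figure: word cloud

Many senators asked questions at the hearing. Afterwards the media criticized them for being too "illiterate" and for asking questions with no substance, and Zuckerberg came away more or less unscathed. Another theory is that Facebook and Congress were putting on a staged double act, with the script written in advance and a peaceful outcome agreed beforehand. Anyway, let's see which senators were there and who talked the most.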

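def get_count(x):
    # Count tokens with NLTK's tokenizer (needs the 'punkt' data: nltk.download('punkt'))
    return len(nltk.word_tokenize(x))
df['total words'] = df['Text'].map(get_count)
# Sum only the numeric word-count column per speaker, excluding Zuckerberg
tw = df.query("Person != 'ZUCKERBERG:'").groupby("Person")[["total words"]].sum()
# Keep the 30 most talkative senators; the ascending sort puts the longest bar on top in barh
twl = tw.sort_values("total words", ascending=True).tail(30).plot(kind="barh", color='#5b82c1', figsize=(10,8))
plt.title("Words Spoken By Senators", fontsize=30)
plt.ylabel("Senator", fontsize=20)
plt.yticks(fontsize=10)
plt.xlabel("Count", fontsize=20)

Figure: words spoken by each senator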

Next, Zuckerberg's speech habits. First we turn everything Zuckerberg said in the dataset into a list:

mark=df[df['Person']=="ZUCKERBERG:"]['Text'].tolist()

Then we use sklearn's CountVectorizer to see which two-word phrases he used most:

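from sklearn.feature_extraction import text
# CountVectorizer counts raw bigram occurrences (no TF-IDF weighting); the documents go to fit_transform, not the constructor
vectorizer = text.CountVectorizer(ngram_range=(2,2), stop_words='english')
matrix = vectorizer.fit_transform(mark)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
bigrams = pd.Series(np.asarray(matrix.sum(axis=0))[0], index=vectorizer.get_feature_names_out()).sort_values(ascending=False)
bigrams.head(15)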

The result is his 15 most frequent bigrams (output not reproduced here).
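One detail affects how to read these counts: CountVectorizer, as I understand its behaviour, drops stop words before it forms n-grams, so a reported bigram can join two words that were separated by a stop word in the original sentence. The sketch below imitates that in plain Python. The tiny `STOP` set and the sample sentences are made up for illustration, and the tokenizer is only an approximation of sklearn's default.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list; sklearn's built-in 'english' list is much longer.
STOP = {"the", "of", "to", "and", "is", "we", "our", "in"}

def bigram_counts(docs):
    """Count bigrams after lowercasing and removing stop words."""
    counts = Counter()
    for doc in docs:
        # Tokens of two or more word characters, similar to sklearn's default pattern
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        kept = [t for t in tokens if t not in STOP]
        # Pair each remaining token with its successor
        counts.update(" ".join(pair) for pair in zip(kept, kept[1:]))
    return counts

# Made-up sentences, not taken from the transcript
docs = ["We protect the privacy of people.", "Privacy of people is our priority."]
counts = bigram_counts(docs)
print(counts.most_common(3))
```

Here "privacy people" is counted twice even though the phrase never appears literally, because "of" is removed first. A top bigram in the real output may therefore be a compressed version of a longer phrase.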

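I noticed that the dataset marks a speaker's pauses with "--", for example: "Senator, that's a -- a great question." So let's see where Zuckerberg paused:

(The code for this part came out too ugly, so I won't post it.)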
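Since the original code is not shown, here is a minimal sketch of one way to collect such pause contexts. It assumes a window of up to four whitespace-separated words on each side of every "--"; the function name `pause_contexts` and the `window` parameter are hypothetical. The author's own output, listed after this sketch, keeps raw whitespace such as "\r", so the real code evidently split the text differently.

```python
def pause_contexts(utterances, window=4):
    """Return snippets with up to `window` words on each side of every '--'."""
    snippets = []
    for utterance in utterances:
        tokens = utterance.split()  # split on any whitespace
        for i, token in enumerate(tokens):
            if token == "--":
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                snippets.append(" ".join(left + ["--"] + right))
    return snippets

# The example sentence quoted in the post
sample = ["Senator, that's a -- a great question."]
print(pause_contexts(sample))
```

With the DataFrame from this post, the call would presumably be `pause_contexts(mark)`, where `mark` is the list of Zuckerberg's turns built earlier.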

['the company in 2004 -- I started in my',
"I don't -- I'm not sitting here",
'actually brings advertising online -- on Facebook to an',
'what I can say -- and I',
'political or issue-related ad -- this is basically what',
'the abuse cases that -- that are very sensitive,',
...
'subpoena or ability or -- or reason to get',
"a way that doesn't -- that's not overly restrictive",
'Google, Apple, Amazon, Microsoft -- we overlap with them',
" Senator, that's -- that's correct.\r \r ",
"don't have that on -- sitting here today. So"]

I also ran a simple sentiment analysis with textblob:

from textblob import TextBlob

df['Mark']=df['Text'].map(lambda x: str(x).replace('\r\n', '\n'))
df['Sentiment']=df['Mark'].map(lambda x: float(TextBlob(x).sentiment.polarity))

print('POSITIVE:\n')
sentiment=df[df['Sentiment']>0.5]
t=sentiment['Mark'][:5]
for n in t: print(n.replace('\n',' '))
    
print('\n\nNeutral:\n')
sentiment=df[df['Sentiment']==0]
t=sentiment['Mark'][:5]
for n in t: print(n.replace('\n',' '))
    
print('\n\nNEGATIVE:\n')
sentiment=df[df['Sentiment']<-0.4]
t=sentiment['Mark'][:5]
for n in t: print(n.replace('\n',' '))

To keep this short, I printed only five sentences each for positive, neutral and negative (output not reproduced here).
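Note that the three filters in the sentiment code are not symmetric: positive means polarity above 0.5, negative means below -0.4, and neutral means exactly 0, so mildly positive or mildly negative sentences are never printed. The helper below is a hypothetical `bucket` function, not part of the original notebook, that simply makes those cut-offs explicit. TextBlob's polarity score lies between -1 and 1.

```python
def bucket(polarity, pos=0.5, neg=-0.4):
    """Map a polarity score in [-1, 1] to the bucket the sentiment filters would print it under."""
    if polarity > pos:
        return "positive"
    if polarity < neg:
        return "negative"
    if polarity == 0:
        return "neutral"
    # Mildly positive or negative: none of the three filters selects it
    return "unlisted"

for score in (0.8, 0.0, -0.6, 0.2):
    print(score, bucket(score))
```

Applied to the DataFrame, something like `df['Sentiment'].map(bucket).value_counts()` would show how many rows fall outside all three printed groups.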

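Next, let's look at which questions Zuckerberg dodged. (These filters rely on an 'Answer' column holding his replies; the step that builds it is not shown in this post.)

df[df['Answer'].str.contains("don't know", case=False, na=False)]

And his noncommittal answers:

df[df['Answer'].str.contains("not sure", case=False, na=False)]

And the questions he handed off to his team:

df[df['Answer'].str.contains(" get back", case=False, na=False)]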
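The post does not show how the 'Answer' column is built. One plausible construction, offered only as a guess, pairs each Zuckerberg turn with the turn just before it and then searches the replies for a phrase. The sketch uses plain Python lists instead of the DataFrame; `pair_turns`, `find_answers`, the 'Question' key and the sample turns are all hypothetical.

```python
def pair_turns(turns, witness="ZUCKERBERG:"):
    """Pair each witness turn with the immediately preceding turn by someone else."""
    pairs = []
    for prev, cur in zip(turns, turns[1:]):
        if cur[0] == witness and prev[0] != witness:
            pairs.append({"Person": prev[0], "Question": prev[1], "Answer": cur[1]})
    return pairs

def find_answers(pairs, phrase):
    """Case-insensitive substring match on the answer text."""
    return [p for p in pairs if phrase.lower() in p["Answer"].lower()]

# Made-up sample turns for illustration, not quotes from the transcript
turns = [
    ("SENATOR A:", "How many accounts were affected?"),
    ("ZUCKERBERG:", "Senator, I don't know the exact number."),
    ("SENATOR B:", "Will you share the audit?"),
    ("ZUCKERBERG:", "My team will get back to you on that."),
]
pairs = pair_turns(turns)
print(find_answers(pairs, "don't know"))
```

Pairing by adjacency is a simplification: a long exchange in which the senator interrupts, or the chairman speaks in between, would need more careful handling.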

Let's end with this exchange from the hearing:

SULLIVAN: Thank you, Mr. Chairman. And Mr. Zuckerberg, quite a story, right? Dorm room to the global behemoth that you guys are. Only in America, would you agree with that?
ZUCKERBERG: Senator, mostly in America.
SULLIVAN: You couldn't — you couldn't do this in China, right? Or, what you did in 10 years.
ZUCKERBERG: Well — well, senator, there are — there are some very strong Chinese Internet companies.

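The senator cues up Zuckerberg's journey from a college dorm room to the whole world, says this could only happen in America, and then cheekily adds: you couldn't have done this in China, right? Zuckerberg's reply: emmmm... but China has plenty of impressive Internet companies too.

Feel free to follow my Zhihu column 【数据池塘】 (Data Pond): https://zhuanlan.zhihu.com/datapool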

 

