Riding the wave of Mark Zuckerberg's appearance before a US Senate committee over the Facebook data-leak scandal, let's use some NLP to dig into the hearing transcript and see what Zuckerberg actually said.
I did the analysis in Jupyter; the packages needed are:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
Load the data:
df = pd.read_csv('/Users/Cyan/Desktop/Testimony/testimony.csv')
The dataset looks like this; here are the first 10 rows:
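For reference, the frame has one row per utterance. A minimal stand-in (hypothetical data, since the CSV itself isn't shown here) looks like:

```python
import pandas as pd

# Hypothetical stand-in for testimony.csv: one row per utterance,
# speaker name in 'Person', the spoken text in 'Text'.
df = pd.DataFrame({
    'Person': ['GRASSLEY:', 'ZUCKERBERG:'],
    'Text': ["The hearing will come to order.",
             "Senator, thank you for having me here today."],
})
print(df.head(10))  # with only 2 toy rows, head(10) prints both
```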
First, a wordcloud of the key terms in the dialogue:
(the mask used here is the photo from the article's cover~)
mark_mask = np.array(Image.open("/Users/Cyan/Desktop/Testimony/mark-mask.jpg"))
col = df["Text"]
wordcloud = WordCloud(max_words=100, max_font_size=100, width=800, height=600,
                      mask=mark_mask, stopwords=STOPWORDS,
                      background_color="white").generate(' '.join(col))
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud)
plt.title("What did Zuckerberg say?", fontsize=30)
plt.axis("off")
plt.show()
Word cloud of the hearing
Several senators asked questions at the hearing. Afterwards the media mocked them as tech-"illiterate", saying their questions had no substance, and Zuckerberg arguably walked away unscathed. There's also a theory that Facebook and Congress were putting on a staged double act, with a script written in advance and a peaceful resolution agreed long before. Anyway, let's look at who the senators were and who talked the most~
def get_count(x):
    # requires the nltk 'punkt' tokenizer data: nltk.download('punkt')
    return len(nltk.word_tokenize(x))

df['total words'] = df['Text'].map(get_count)

# Keep the senators only, then sum words per speaker.
tw = df.query("Person != 'ZUCKERBERG:'").groupby("Person").sum()
# tail(30) after an ascending sort keeps the 30 biggest talkers,
# with the longest-winded at the top of the horizontal bar chart.
tw.sort_values("total words").tail(30).plot(kind="barh", color='#5b82c1', figsize=(10, 8))
plt.title("Words Spoken By Senators", fontsize=30)
plt.ylabel("Senator", fontsize=20)
plt.yticks(fontsize=10)
plt.xlabel("Count", fontsize=20)
plt.show()
Words spoken by each senator
Now for Zuckerberg's speech habits. First, pull his lines out of the dataset into a list:
mark=df[df['Person']=="ZUCKERBERG:"]['Text'].tolist()
Then use sklearn's CountVectorizer to see which two-word phrases (bigrams) he used most:
from sklearn.feature_extraction import text

# The corpus goes to fit_transform, not the constructor's first argument.
vectorizer = text.CountVectorizer(ngram_range=(2, 2), stop_words='english')
matrix = vectorizer.fit_transform(mark)
# On sklearn < 1.0, use get_feature_names() instead.
bigrams = pd.Series(np.asarray(matrix.sum(axis=0)).ravel(),
                    index=vectorizer.get_feature_names_out()).sort_values(ascending=False)
bigrams.head(15)
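Under the hood, all CountVectorizer is doing here is counting adjacent word pairs. A dependency-free sketch of the same idea, on toy sentences and without stop-word removal:

```python
from collections import Counter

def bigram_counts(docs):
    """Count word bigrams across documents -- roughly what
    CountVectorizer(ngram_range=(2, 2)) computes, minus stop-word removal."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # zip a token list against itself shifted by one to get adjacent pairs
        counts.update(zip(tokens, tokens[1:]))
    return counts

docs = ["thank you senator", "thank you very much senator"]
print(bigram_counts(docs).most_common(1))  # [(('thank', 'you'), 2)]
```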
The results:
I noticed that the dataset marks a speaker's pauses with "--", e.g. "Senator, that's a -- a great question." So let's see where Zuckerberg paused:
(my code for this was too ugly to paste)
['the company in 2004 -- I started in my',
"I don't -- I'm not sitting here",
'actually brings advertising online -- on Facebook to an',
'what I can say -- and I',
'political or issue-related ad -- this is basically what',
'the abuse cases that -- that are very sensitive,',
...
'subpoena or ability or -- or reason to get',
"a way that doesn't -- that's not overly restrictive",
'Google, Apple, Amazon, Microsoft -- we overlap with them',
" Senator, that's -- that's correct.\r \r ",
"don't have that on -- sitting here today. So"]
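The elided extraction can be sketched in a few lines: split on whitespace and keep a window of words around every "--" token (the window size here is my own choice, not necessarily what the original code used):

```python
def pause_contexts(texts, window=4):
    """Collect a few words of context around each '--' pause marker."""
    contexts = []
    for text in texts:
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok == '--':
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                contexts.append(' '.join(left + ['--'] + right))
    return contexts

print(pause_contexts(["Senator, that's a -- a great question."]))
# ["Senator, that's a -- a great question."]
```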
I also ran a quick sentiment analysis with textblob:
from textblob import TextBlob

df['Mark'] = df['Text'].map(lambda x: str(x).replace('\r\n', '\n'))
df['Sentiment'] = df['Mark'].map(lambda x: float(TextBlob(x).sentiment.polarity))

print('POSITIVE:\n')
sentiment = df[df['Sentiment'] > 0.5]
for n in sentiment['Mark'][:5]:
    print(n.replace('\n', ' '))

print('\n\nNEUTRAL:\n')
sentiment = df[df['Sentiment'] == 0]
for n in sentiment['Mark'][:5]:
    print(n.replace('\n', ' '))

print('\n\nNEGATIVE:\n')
sentiment = df[df['Sentiment'] < -0.4]
for n in sentiment['Mark'][:5]:
    print(n.replace('\n', ' '))
To keep things short, I picked only five sentences each for positive, neutral, and negative. The results:
Now let's see which questions Zuckerberg dodged outright (filtering his own rows, since the frame only has 'Person' and 'Text' columns):
answers = df[df['Person'] == 'ZUCKERBERG:']
answers[answers['Text'].str.contains("don't know", case=False)]
…as well as his hedged, noncommittal answers:
answers[answers['Text'].str.contains("not sure", case=False)]
…and the questions he punted to his team:
answers[answers['Text'].str.contains(" get back", case=False)]
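Taken together, those three filters amount to a tiny evasion counter. A sketch on hypothetical answers (the phrases are the ones used above; the sample rows are made up):

```python
import pandas as pd

# Hypothetical answers standing in for Zuckerberg's rows in the transcript.
answers = pd.DataFrame({'Text': [
    "Senator, I don't know the answer to that.",
    "I'm not sure, I will have my team follow up.",
    "My team will get back to you on that.",
    "Senator, that's correct.",
]})

evasions = {'dodged': "don't know", 'hedged': "not sure", 'deferred': " get back"}
for label, phrase in evasions.items():
    n = int(answers['Text'].str.contains(phrase, case=False).sum())
    print(label, n)  # each phrase matches exactly one toy answer
```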
Finally, let's end with this exchange from the hearing:
SULLIVAN: Thank you, Mr. Chairman. And Mr. Zuckerberg, quite a story, right? Dorm room to the global behemoth that you guys are. Only in America, would you agree with that?
ZUCKERBERG: Senator, mostly in America.
SULLIVAN: You couldn't — you couldn't do this in China, right? Or, what you did in 10 years.
ZUCKERBERG: Well — well, senator, there are — there are some very strong Chinese Internet companies.
The senator walks Zuckerberg through his journey, from a dorm room to a global behemoth, saying only in America could such a thing happen, then cheekily adds: you couldn't have done this in China, right? And Zuckerberg goes, emmmm… well, there are actually some very strong internet companies in China.
Feel free to follow my Zhihu column 【数据池塘】: https://zhuanlan.zhihu.com/datapool