Riding the wave of Mark Zuckerberg's appearance before a US Senate committee over the Facebook data-leak scandal, let's use some NLP to dig into the hearing transcript and see what Zuckerberg actually said.
I did the analysis in Jupyter; the packages needed are:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
Load the data:
df = pd.read_csv('/Users/Cyan/Desktop/Testimony/testimony.csv')
The dataset looks like this; here are the first 10 rows:
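For reference, the frame has one row per utterance. A minimal stand-in (hypothetical data, since the CSV itself isn't shown here) looks like:

```python
import pandas as pd

# Hypothetical stand-in for testimony.csv: one row per utterance,
# speaker name in 'Person', the spoken text in 'Text'.
df = pd.DataFrame({
    'Person': ['GRASSLEY:', 'ZUCKERBERG:'],
    'Text': ["The hearing will come to order.",
             "Senator, thank you for having me here today."],
})
print(df.head(10))  # with only 2 toy rows, head(10) prints both
```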
First, a wordcloud of the key terms in the dialogue:
(the mask used here is the photo from the article's cover~)
mark_mask = np.array(Image.open("/Users/Cyan/Desktop/Testimony/mark-mask.jpg"))
col = df["Text"]
wordcloud = WordCloud(max_words=100, max_font_size=100, width=800, height=600,
                      mask=mark_mask, stopwords=STOPWORDS,
                      background_color="white").generate(' '.join(col))
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud)
plt.title("What did Zuckerberg say?", fontsize=30)
plt.axis("off")
plt.show()
Word cloud of the hearing
Several senators asked questions at the hearing. Afterwards the media mocked them as tech-"illiterate", saying their questions had no substance, and Zuckerberg arguably walked away unscathed. There's also a theory that Facebook and Congress were putting on a staged double act, with a script written in advance and a peaceful resolution agreed long before. Anyway, let's look at who the senators were and who talked the most~
def get_count(x):
    # requires the nltk 'punkt' tokenizer data: nltk.download('punkt')
    return len(nltk.word_tokenize(x))

df['total words'] = df['Text'].map(get_count)

# Keep the senators only, then sum words per speaker.
tw = df.query("Person != 'ZUCKERBERG:'").groupby("Person").sum()
# tail(30) after an ascending sort keeps the 30 biggest talkers,
# with the longest-winded at the top of the horizontal bar chart.
tw.sort_values("total words").tail(30).plot(kind="barh", color='#5b82c1', figsize=(10, 8))
plt.title("Words Spoken By Senators", fontsize=30)
plt.ylabel("Senator", fontsize=20)
plt.yticks(fontsize=10)
plt.xlabel("Count", fontsize=20)
plt.show()
Words spoken by each senator
Now for Zuckerberg's speech habits. First, pull his lines out of the dataset into a list:
mark=df[df['Person']=="ZUCKERBERG:"]['Text'].tolist()
Then use sklearn's CountVectorizer to see which two-word phrases (bigrams) he used most:
from sklearn.feature_extraction import text

# The corpus goes to fit_transform, not the constructor's first argument.
vectorizer = text.CountVectorizer(ngram_range=(2, 2), stop_words='english')
matrix = vectorizer.fit_transform(mark)
# On sklearn < 1.0, use get_feature_names() instead.
bigrams = pd.Series(np.asarray(matrix.sum(axis=0)).ravel(),
                    index=vectorizer.get_feature_names_out()).sort_values(ascending=False)
bigrams.head(15)
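Under the hood, all CountVectorizer is doing here is counting adjacent word pairs. A dependency-free sketch of the same idea, on toy sentences and without stop-word removal:

```python
from collections import Counter

def bigram_counts(docs):
    """Count word bigrams across documents -- roughly what
    CountVectorizer(ngram_range=(2, 2)) computes, minus stop-word removal."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # zip a token list against itself shifted by one to get adjacent pairs
        counts.update(zip(tokens, tokens[1:]))
    return counts

docs = ["thank you senator", "thank you very much senator"]
print(bigram_counts(docs).most_common(1))  # [(('thank', 'you'), 2)]
```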
The results:
I noticed that the dataset marks a speaker's pauses with "--", e.g. "Senator, that's a -- a great question." So let's see where Zuckerberg paused:
(my code for this was too ugly to paste)
['the company in 2004 -- I started in my',
"I don't -- I'm not sitting here",
'actually brings advertising online -- on Facebook to an',
'what I can say -- and I',
'political or issue-related ad -- this is basically what',
'the abuse cases that -- that are very sensitive,',
...
'subpoena or ability or -- or reason to get',
"a way that doesn't -- that's not overly restrictive",
'Google, Apple, Amazon, Microsoft -- we overlap with them',
" Senator, that's -- that's correct.\r \r ",
"don't have that on -- sitting here today. So"]
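The elided extraction can be sketched in a few lines: split on whitespace and keep a window of words around every "--" token (the window size here is my own choice, not necessarily what the original code used):

```python
def pause_contexts(texts, window=4):
    """Collect a few words of context around each '--' pause marker."""
    contexts = []
    for text in texts:
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok == '--':
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                contexts.append(' '.join(left + ['--'] + right))
    return contexts

print(pause_contexts(["Senator, that's a -- a great question."]))
# ["Senator, that's a -- a great question."]
```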
I also ran a quick sentiment analysis with textblob:
from textblob import TextBlob

df['Mark'] = df['Text'].map(lambda x: str(x).replace('\r\n', '\n'))
df['Sentiment'] = df['Mark'].map(lambda x: float(TextBlob(x).sentiment.polarity))

print('POSITIVE:\n')
sentiment = df[df['Sentiment'] > 0.5]
for n in sentiment['Mark'][:5]:
    print(n.replace('\n', ' '))

print('\n\nNEUTRAL:\n')
sentiment = df[df['Sentiment'] == 0]
for n in sentiment['Mark'][:5]:
    print(n.replace('\n', ' '))

print('\n\nNEGATIVE:\n')
sentiment = df[df['Sentiment'] < -0.4]
for n in sentiment['Mark'][:5]:
    print(n.replace('\n', ' '))
To keep things short, I picked only five sentences each for positive, neutral, and negative. The results:
Now let's see which questions Zuckerberg dodged outright (filtering his own rows, since the frame only has 'Person' and 'Text' columns):
answers = df[df['Person'] == 'ZUCKERBERG:']
answers[answers['Text'].str.contains("don't know", case=False)]
…as well as his hedged, noncommittal answers:
answers[answers['Text'].str.contains("not sure", case=False)]
…and the questions he punted to his team:
answers[answers['Text'].str.contains(" get back", case=False)]
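Taken together, those three filters amount to a tiny evasion counter. A sketch on hypothetical answers (the phrases are the ones used above; the sample rows are made up):

```python
import pandas as pd

# Hypothetical answers standing in for Zuckerberg's rows in the transcript.
answers = pd.DataFrame({'Text': [
    "Senator, I don't know the answer to that.",
    "I'm not sure, I will have my team follow up.",
    "My team will get back to you on that.",
    "Senator, that's correct.",
]})

evasions = {'dodged': "don't know", 'hedged': "not sure", 'deferred': " get back"}
for label, phrase in evasions.items():
    n = int(answers['Text'].str.contains(phrase, case=False).sum())
    print(label, n)  # each phrase matches exactly one toy answer
```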
Finally, let's end with this exchange from the hearing:
SULLIVAN: Thank you, Mr. Chairman. And Mr. Zuckerberg, quite a story, right? Dorm room to the global behemoth that you guys are. Only in America, would you agree with that?
ZUCKERBERG: Senator, mostly in America.
SULLIVAN: You couldn't — you couldn't do this in China, right? Or, what you did in 10 years.
ZUCKERBERG: Well — well, senator, there are — there are some very strong Chinese Internet companies.
The senator walks Zuckerberg through his journey, from a dorm room to a global behemoth, saying only in America could such a thing happen, then cheekily adds: you couldn't have done this in China, right? And Zuckerberg goes, emmmm… well, there are actually some very strong internet companies in China.
Feel free to follow my Zhihu column 【数据池塘】: https://zhuanlan.zhihu.com/datapool