Along with a friend of mine, I recently finished building a website that scrapes data on particular topics from social media platforms and major news websites. It then uses deep learning (an out-of-the-box model) to infer public sentiment about each topic and displays it graphically. This blog is essentially a tutorial so that you can set up such a system on your own. So let’s get started!
We will be scraping information from the following sources:
- Twitter
- Reddit
- A variety of news websites of your choice
To do so, we will require the following libraries:
- GetOldTweets3
- pandas
- newspaper3k
- langdetect
- psaw
- praw
- numpy
- flair (for the deep learning model)
I will explain the code piece by piece. You can find the GitHub repo containing the whole code here. So let’s start coding!
import GetOldTweets3 as got
import pandas as pd
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import time
import schedule
import newspaper
from newspaper import Config, Article, Source
from langdetect import detect
from psaw import PushshiftAPI
import praw
import datetime as dt
import numpy as np
def TwitterDataGatherer(keyword, start_date, end_date, max_tweets=0, top_tweets=True):
    # Build the search criteria and fetch the matching tweets
    tweetCriteria = (got.manager.TweetCriteria()
                     .setQuerySearch(keyword)
                     .setSince(start_date)
                     .setUntil(end_date)
                     .setMaxTweets(max_tweets)
                     .setTopTweets(top_tweets))
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    df = pd.DataFrame()
    df['date'] = [tweet.date for tweet in tweets]
    df['title'] = [tweet.text for tweet in tweets]
    df['content'] = ''
    df['retweets'] = [tweet.retweets for tweet in tweets]
    df['favorites'] = [tweet.favorites for tweet in tweets]
    return df
def TwitterDataFinal(keyword, from_date, to_date, max_tweets=0, top_tweets=True):
    print(keyword)
    df_m = TwitterDataGatherer(keyword, start_date=from_date, end_date=to_date,
                               max_tweets=max_tweets, top_tweets=top_tweets)
    # Shift by 2 so that log() is defined even for zero retweets/favorites
    df_m['retweets'] = (df_m['retweets'] + 2).astype(float)
    df_m['favorites'] = (df_m['favorites'] + 2).astype(float)
    # Per-tweet weight: 2 * log(retweets) * log(favorites)
    df_m['retweets'] = np.log(df_m['retweets']) * 2
    df_m['upvote ratio'] = df_m['retweets'] * np.log(df_m['favorites'])
    return df_m
The above piece of code does the following:

- Imports all the necessary libraries we will be using going forward.
- Defines a function that takes a particular keyword and other details (date range, maximum number of tweets to scrape, etc.) and returns the relevant tweets and their metadata as a pandas dataframe.
- Defines another function that converts retweets and likes into a quantifiable weight for each tweet, which we later use to decide how much each tweet contributes to the final sentiment (more on that later!).
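As a concrete illustration, here is the weighting transform from `TwitterDataFinal` applied to a few made-up engagement counts (the shift by 2 keeps the logarithm defined for tweets with zero retweets or favorites):

```python
import numpy as np
import pandas as pd

# Made-up engagement counts for three tweets
df = pd.DataFrame({'retweets': [0, 10, 1000], 'favorites': [0, 50, 5000]})

# Same transform as TwitterDataFinal: shift by 2, then
# weight = 2 * log(retweets + 2) * log(favorites + 2)
df['weight'] = 2 * np.log(df['retweets'] + 2.0) * np.log(df['favorites'] + 2.0)

print(df['weight'].round(2).tolist())  # → [0.96, 19.64, 117.71]
```

The double logarithm grows slowly, so a viral tweet counts for more than an ignored one, but not a thousand times more.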
def listToLower(strings):
    # Helper (not shown in the original snippet): lowercase every string in a list
    return [s.lower() for s in strings]

def NewsGatherer(keywords, newspaper_link, number_of_news, total_news):
    # Pass a list of lowercase keywords and the number of articles to scrape
    df = pd.DataFrame(columns=['title', 'content', 'Summary', 'Keywords',
                               'Date', 'Name of Newspaper'])
    config = Config()
    config.fetch_images = False
    config.memoize_articles = False
    paper = newspaper.build(newspaper_link, config=config)
    print('Created Newspaper')
    for keyword in keywords:
        print(keyword)
        news_counter = 0
        total_counter = 0
        for i in range(len(paper.articles)):
            # Stop once we have enough matches or have scanned enough articles
            if total_counter > total_news or news_counter > number_of_news:
                break
            article = paper.articles[i]
            try:
                article.download()
                article.parse()
                article.nlp()
                language_article = detect(article.text)
                keyword_list = listToLower(article.keywords)
                if language_article == 'en':
                    if article.publish_date:
                        # Keep the article if the keyword appears in its
                        # extracted keywords, title, or body text
                        if (keyword in keyword_list
                                or keyword in article.title.lower()
                                or keyword in article.text.lower()):
                            df = df.append({'title': article.title,
                                            'content': article.text,
                                            'Summary': article.summary,
                                            'Keywords': article.keywords,
                                            'Date': article.publish_date,
                                            'Name of Newspaper': str(newspaper_link)},
                                           ignore_index=True)
                            news_counter += 1
                            print(news_counter)
                    total_counter += 1
                else:
                    print(language_article)
            except Exception:
                print('error', i)
    # News articles get a uniform weight of 1
    df['upvote ratio'] = 1
    return df
The above piece of code defines a function that takes a list of keywords as input and scrapes all articles pertaining to them from a particular news website (whose link is also taken as input). It then stores all this information in a pandas dataframe.
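The article-selection condition inside `NewsGatherer` can be isolated as a small helper, shown here with made-up inputs: an article is kept if the keyword appears among its extracted keywords, in its title, or in its body text.

```python
def matches(keyword, article_keywords, title, text):
    # Same check as in NewsGatherer: keyword in keywords, title, or body
    keyword = keyword.lower()
    return (keyword in [k.lower() for k in article_keywords]
            or keyword in title.lower()
            or keyword in text.lower())

print(matches('bitcoin', ['crypto', 'markets'], 'Bitcoin rallies again', ''))  # → True
print(matches('bitcoin', ['crypto'], 'Markets today', 'Stocks were flat.'))    # → False
```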
reddit = praw.Reddit(
    client_id='',
    client_secret='',
    password='',
    username='',
    user_agent=''
)
api = PushshiftAPI(reddit)

start_epoch = int(dt.datetime(2017, 1, 1).timestamp())

def redditsearch(key):
    gen = api.search_submissions(q=key, limit=100, before=start_epoch)
    result = list(gen)
    data = [[submission.title, submission.selftext, submission.upvote_ratio]
            for submission in result]
    df = pd.DataFrame(data, columns=['title', 'content', 'upvote ratio'])
    return df
The function defined above does exactly what its name suggests: it scrapes Reddit posts corresponding to a particular keyword, along with their upvote ratios. (To use it, you will have to generate your own Reddit API keys.)
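Pushshift filters by Unix epoch seconds rather than date objects, which is why `start_epoch` is built the way it is. A quick sanity check of the conversion (the exact integer depends on your local timezone):

```python
import datetime as dt

start_epoch = int(dt.datetime(2017, 1, 1).timestamp())

# Round-tripping recovers the original local datetime
assert dt.datetime.fromtimestamp(start_epoch) == dt.datetime(2017, 1, 1)
print(start_epoch)
```

Note that with `before=start_epoch`, `search_submissions` returns posts made before 1 January 2017; pass `after=start_epoch` instead if you want posts from that date onwards.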
So far, we have scraped all the necessary data from the various platforms. Now we will set up code to judge each post’s sentiment, and to determine how much weight each post should carry in the final sentiment, using metrics like upvote ratio (for Reddit) and likes/retweets (for Twitter).
import flair
from flair.data import Sentence

def sentiment_flair(df):
    # Load the pre-trained English sentiment classifier
    fid = flair.models.TextClassifier.load('en-sentiment')
    data = []
    for _, s in df.iterrows():
        dic_title = Sentence(s['title'])
        fid.predict(dic_title)
        dic = {'title': s['title'],
               'title_score': dic_title.labels[0].score,
               'title_label': dic_title.labels[0].value,
               'upvote': s['upvote ratio']}
        # Score the body text too, when the post has one
        if len(s['content']) != 0:
            dic_text = Sentence(s['content'])
            fid.predict(dic_text)
            dic['text_score'] = dic_text.labels[0].score
            if dic_text.labels[0].value == 'NEGATIVE':
                dic['text_score'] *= -1
        # Negative sentiment is represented as a negative score
        if dic_title.labels[0].value == 'NEGATIVE':
            dic['title_score'] *= -1
        data.append(dic)
    return pd.DataFrame.from_dict(data)
The above code takes a dataframe as input and outputs a dataframe containing a sentiment value for each post. Now we need to determine the overall sentiment, taking the number of retweets/upvotes for each post into account.
def scorer(df):
    df = sentiment_flair(df)
    if 'text_score' in df.columns:
        df['Net Sentiment'] = None
        for i in range(len(df)):
            # Prefer the body-text score; fall back to the title score.
            # pd.notna() guards against rows that had no content
            # (a bare truthiness check would let NaN through).
            if pd.notna(df.loc[i, 'text_score']):
                df.loc[i, 'Net Sentiment'] = df.loc[i, 'text_score']
            else:
                df.loc[i, 'Net Sentiment'] = df.loc[i, 'title_score']
    else:
        df['Net Sentiment'] = df['title_score']
    # Weighted mean of per-post sentiment, scaled to a percentage
    sum_of_sentiments = (df['Net Sentiment'] * df['upvote']).sum()
    sum_of_weights = df['upvote'].sum()
    net_sentiment = 100 * sum_of_sentiments / sum_of_weights
    return [net_sentiment, len(df)]
In the above code we defined a function that takes a dataframe containing all the posts and returns a list with the overall sentiment and the volume (number of posts) for a topic. It computes a weighted mean of each post’s sentiment, using retweets/upvotes/likes as the weights.
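A small worked example of that weighted mean, with made-up sentiment scores (flair scores lie in (0, 1], negated for NEGATIVE labels) and made-up weights:

```python
import pandas as pd

df = pd.DataFrame({
    'Net Sentiment': [0.9, -0.6, 0.4],  # per-post scores; NEGATIVE → negative
    'upvote':        [10.0, 2.0, 5.0],  # per-post weights
})

# Same formula as in scorer(): weighted mean scaled to a percentage
net_sentiment = 100 * (df['Net Sentiment'] * df['upvote']).sum() / df['upvote'].sum()
print(round(net_sentiment, 2))  # → 57.65
```

The heavily liked positive post dominates, so the overall sentiment stays well above zero even though one post was negative.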
So now all that is left is to tie all of the code together. Let’s do so!
now = dt.datetime.now()
yesterday = now - dt.timedelta(days=1)
day_before_yesterday = now - dt.timedelta(days=2)
today_date = '{0}-{1}-{2}'.format(now.year, now.month, now.day)

twitter_data_list = [today_date]
reddit_data_list = [today_date]
news_data_list = [today_date]
overall_data_list = [today_date]

# keyword_list (your topics) and news_links (your news site URLs) are
# lists you define yourself
for keyword in keyword_list:
    start_date = '{0}-{1}-{2}'.format(day_before_yesterday.year,
                                      day_before_yesterday.month,
                                      day_before_yesterday.day)
    end_date = '{0}-{1}-{2}'.format(yesterday.year, yesterday.month, yesterday.day)

    # Twitter
    df = TwitterDataFinal(keyword, start_date, end_date, 500, top_tweets=True)
    try:
        list_result = scorer(df)
        print(keyword + ' Twitter Done')
    except Exception:
        list_result = [0, 0]
        print(keyword + ' Twitter Issue')
    twitter_data_list = twitter_data_list + list_result

    # Reddit
    try:
        df = redditsearch(keyword)
        reddit_result = scorer(df)
    except Exception:
        reddit_result = [0, 0]
    list_result = list_result + reddit_result
    reddit_data_list = reddit_data_list + reddit_result

    # News
    df = pd.DataFrame()
    try:
        for news_link in news_links:
            df = df.append(NewsGatherer([keyword], news_link, 20, 200),
                           ignore_index=True)
        news_result = scorer(df)
    except Exception:
        news_result = [0, 0]
    list_result = list_result + news_result
    news_data_list = news_data_list + news_result

    # list_result is now [twitter_sent, twitter_vol, reddit_sent, reddit_vol,
    # news_sent, news_vol]. Replace the news volume with the mean of the
    # Twitter and Reddit volumes before combining the three platforms.
    news_volumes = list_result[5]
    list_result[5] = (list_result[1] + list_result[3]) / 2
    # Volume-weighted combination of the per-platform sentiments
    net_sentiment = (list_result[0] * list_result[1]
                     + list_result[2] * list_result[3]
                     + list_result[4] * list_result[5]) / (list_result[1]
                                                           + list_result[3]
                                                           + list_result[5])
    net_volume = list_result[1] + list_result[3] + list_result[5]
    list_result = list_result + [net_sentiment, net_volume, news_volumes]
    overall_data_list = overall_data_list + [net_sentiment, net_volume]
In the above code we do the following:

- Mine data from Twitter, Reddit, and major news websites for the previous day.
- Score each and every post, then use a weighted mean to calculate the final sentiment for each platform.
- Calculate a final overall sentiment value using the data obtained for each platform.
- Store all of this data in a list.
- Repeat for all the keywords.
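The `schedule` and `time` imports at the top hint at how this loop is meant to be run; here is one way (a sketch, not the original code) to wrap it in a daily job. The `date_window` helper reproduces the date-string format used above:

```python
import datetime as dt

def date_window(now=None):
    """Return (start_date, end_date) strings for the previous full day,
    in the same '{year}-{month}-{day}' format used in the loop above."""
    now = now or dt.datetime.now()
    yesterday = now - dt.timedelta(days=1)
    day_before = now - dt.timedelta(days=2)
    fmt = '{0}-{1}-{2}'.format
    return (fmt(day_before.year, day_before.month, day_before.day),
            fmt(yesterday.year, yesterday.month, yesterday.day))

print(date_window(dt.datetime(2021, 3, 10)))  # → ('2021-3-8', '2021-3-9')

# Sketch of the daily job (uncomment to run for real):
# import schedule, time
# def job():
#     start_date, end_date = date_window()
#     ...  # run the keyword loop shown above
# schedule.every().day.at("06:00").do(job)
# while True:
#     schedule.run_pending()
#     time.sleep(60)
```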
And voilà! We have our own system that scrapes data from multiple platforms and gives us a final sentiment value for a particular topic.

You can check out the website (a dashboard for all the data collected so far) here, and its GitHub repo here.
Thanks a lot for reading this blog!

P.S. Please feel free to connect with me or Harshit (co-creator of the website) for any questions or suggestions.
Translated from: https://towardsdatascience.com/social-media-sentiment-gauging-system-4b765acc1135