Social Media Sentiment Gauging System

Along with a friend of mine, I recently finished building a website that scrapes data from social media platforms and major news websites about particular topics. It then uses deep learning (an out-of-the-box model) to infer public sentiment on those topics and displays it graphically. This blog is essentially a tutorial so that you can set up such a system on your own. So let's get started!

We will be scraping information from the following sources:

  • Twitter
  • Reddit
  • A variety of news websites of your choice

To do so we will be requiring the following libraries:

  • GetOldTweets3
  • pandas
  • newspaper3k
  • langdetect
  • psaw
  • praw
  • numpy
  • flair (for the deep learning model)

I will be explaining the code piece by piece. You can find the GitHub repo containing the whole code here. So let’s start coding!

import GetOldTweets3 as got
import pandas as pd
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import time
import schedule
import newspaper
from newspaper import Config, Article, Source

from langdetect import detect
from psaw import PushshiftAPI
import praw
import datetime as dt
import numpy as np




def TwitterDataGatherer(keyword, start_date, end_date, max_tweets=0, top_tweets=True):
    # Scrape tweets matching the keyword between start_date and end_date
    # and return them, along with their metadata, as a pandas dataframe.
    df = pd.DataFrame()
    tweetCriteria = (got.manager.TweetCriteria()
                     .setQuerySearch(keyword)
                     .setSince(start_date)
                     .setUntil(end_date)
                     .setMaxTweets(max_tweets)
                     .setTopTweets(top_tweets))
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    df['date'] = [tweet.date for tweet in tweets]
    df['title'] = [tweet.text for tweet in tweets]
    df['content'] = ''  # tweets have no separate body text
    df['retweets'] = [tweet.retweets for tweet in tweets]
    df['favorites'] = [tweet.favorites for tweet in tweets]
    return df


def TwitterDataFinal(keyword, from_date, to_date, max_tweets=0, top_tweets=True):
    print(keyword)
    df_m = TwitterDataGatherer(keyword, start_date=from_date, end_date=to_date,
                               max_tweets=max_tweets, top_tweets=top_tweets)
    # Shift counts by 2 so that log() is well defined and positive even for
    # tweets with zero retweets or favorites.
    df_m['retweets'] = (df_m['retweets'] + 2).astype(float)
    df_m['favorites'] = (df_m['favorites'] + 2).astype(float)
    # Dampen raw counts with logs so viral tweets don't completely dominate,
    # then combine them into a single per-tweet weight, stored in
    # 'upvote ratio' to match the column name used for Reddit posts.
    df_m['retweets'] = 2 * np.log(df_m['retweets'])
    df_m['upvote ratio'] = df_m['retweets'] * np.log(df_m['favorites'])
    return df_m

The above piece of code does the following:

  • Imports all the necessary libraries we will be using moving forward.
  • Defines a function that takes a particular keyword and other details (date range, maximum number of tweets to scrape, etc.) and returns the relevant tweets and their metadata as a pandas dataframe.
  • Defines another function that converts the retweets and likes into a quantifiable weight, which we can use to decide how much each tweet contributes to the final sentiment (more on that later!). A quick usage example follows this list.
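
As a quick sanity check of the Twitter pipeline, a call could look like the snippet below (the keyword and dates are placeholders, not values from the project):

# Placeholder keyword and dates, purely for illustration.
df = TwitterDataFinal('bitcoin', '2020-06-01', '2020-06-02', max_tweets=50)
print(df[['title', 'retweets', 'favorites', 'upvote ratio']].head())
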
def NewsGatherer(keywords, newspaper_link, number_of_news, total_news):
    # Pass a list of keywords (all lowercase) and the number of articles to scrape.
    df = pd.DataFrame(columns=['title', 'content', 'Summary', 'Keywords',
                               'Date', 'Name of Newspaper'])
    config = Config()
    config.fetch_images = False
    config.memoize_articles = False
    paper = newspaper.build(newspaper_link, config=config)
    print('Created Newspaper')
    for keyword in keywords:
        print(keyword)
        news_counter = 0    # articles kept for this keyword
        total_counter = 0   # articles examined for this keyword
        for i in range(len(paper.articles)):
            if total_counter <= total_news:
                if news_counter <= number_of_news:
                    article = paper.articles[i]
                    try:
                        article.download()
                        article.parse()
                        article.nlp()
                        language_article = detect(article.text)
                        keyword_list = [k.lower() for k in article.keywords]
                        # Keep only English articles with a publish date that
                        # mention the keyword in their keywords, title, or body.
                        if language_article == 'en':
                            if article.publish_date:
                                if (keyword in keyword_list) or (keyword in article.title.lower()) or (keyword in article.text.lower()):
                                    df = df.append({'title': article.title,
                                                    'content': article.text,
                                                    'Summary': article.summary,
                                                    'Keywords': article.keywords,
                                                    'Date': article.publish_date,
                                                    'Name of Newspaper': str(newspaper_link)},
                                                   ignore_index=True)
                                    news_counter += 1
                                    print(news_counter)
                            total_counter += 1
                        else:
                            print(language_article)
                    except Exception:
                        print('error', i)
                else:
                    break
            else:
                break
    # News articles carry no engagement metrics, so give each equal weight.
    df['upvote ratio'] = 1
    return df

The above piece of code defines a function that takes a list of keywords as input and scrapes all the articles pertaining to them from a particular news website (whose link is also taken as an input). It then stores all this information in a pandas dataframe.
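
As an illustration, a single call might look like this (the outlet URL and keyword are placeholders):

# Placeholder keyword and outlet; keeps at most 20 matching articles
# while examining at most 200 in total.
news_df = NewsGatherer(['bitcoin'], 'https://www.reuters.com', 20, 200)
print(news_df[['title', 'Date', 'Name of Newspaper']].head())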

# Fill in your own Reddit API credentials here.
reddit = praw.Reddit(
    client_id='',
    client_secret='',
    password='',
    username='',
    user_agent=''
)


api = PushshiftAPI(reddit)
start_epoch = int(dt.datetime(2017, 1, 1).timestamp())


def redditsearch(key):
    # Fetch up to 100 submissions matching the keyword, posted after start_epoch.
    gen = api.search_submissions(q=key, limit=100, after=start_epoch)
    result = list(gen)
    data = [[submission.title, submission.selftext, submission.upvote_ratio]
            for submission in result]
    df = pd.DataFrame(data, columns=['title', 'content', 'upvote ratio'])
    return df

As its name suggests, the function defined above scrapes posts from Reddit matching a particular keyword, along with their upvote ratios (to use it, you will have to generate Reddit API keys).
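
Assuming the credentials above are filled in, a call is as simple as (the keyword is again a placeholder):

# Placeholder keyword, purely for illustration.
reddit_df = redditsearch('bitcoin')
print(reddit_df.head())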

So far, we have scraped all the necessary data from the various platforms. Now we will set up code to judge each post's sentiment, and to determine how much weight each post should carry in the final sentiment calculation, using metrics like upvote ratio (for Reddit) and likes/retweets (for Twitter).

import flair
from flair.data import Sentence


def sentiment_flair(df):
    # Load the pre-trained English sentiment classifier.
    fid = flair.models.TextClassifier.load('en-sentiment')
    data = []
    for _, s in df.iterrows():
        dic_title = Sentence(s['title'])
        fid.predict(dic_title)
        dic = {'title': s['title'],
               'title_score': dic_title.labels[0].score,
               'title_label': dic_title.labels[0].value,
               'upvote': s['upvote ratio']}
        # Score the body text too, if the post has one (news articles and
        # Reddit self-posts do; tweets have an empty 'content' field).
        if len(s['content']) != 0:
            dic_text = Sentence(s['content'])
            fid.predict(dic_text)
            dic['text_score'] = dic_text.labels[0].score
            if dic_text.labels[0].value == 'NEGATIVE':
                dic['text_score'] *= -1
        # Flair returns a confidence in [0, 1]; flip the sign for negative
        # labels so that scores span [-1, 1].
        if dic_title.labels[0].value == 'NEGATIVE':
            dic['title_score'] *= -1
        data.append(dic)
    return pd.DataFrame.from_dict(data)

The above code takes a dataframe as input and outputs a dataframe containing a sentiment value for each post. Now we need to determine the overall sentiment, taking the number of retweets/upvotes of each post into account.

def scorer(df):
    df = sentiment_flair(df)
    if 'text_score' in df.columns:
        df['Net Sentiment'] = None
        for i in range(len(df)):
            # Prefer the body-text score when a post has one; otherwise
            # fall back to the title score. (Posts without a body have NaN
            # here, so check with pd.notna rather than truthiness.)
            if pd.notna(df.loc[i, 'text_score']):
                df.loc[i, 'Net Sentiment'] = df.loc[i, 'text_score']
            else:
                df.loc[i, 'Net Sentiment'] = df.loc[i, 'title_score']
    else:
        df['Net Sentiment'] = df['title_score']
    # Weighted mean of per-post sentiment, weighted by engagement.
    sum_of_sentiments = (df['Net Sentiment'] * df['upvote']).sum()
    sum_of_weights = df['upvote'].sum()
    net_sentiment = 100 * sum_of_sentiments / sum_of_weights
    return [net_sentiment, len(df)]

In the above code we have defined a function that takes a dataframe containing all the posts and spits out a list containing the overall sentiment and the post volume for a topic. It computes a weighted mean of each post's sentiment, using retweets/upvotes/likes as the weights.
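
As a toy illustration of that weighted mean (the numbers are made up): a mildly positive post with little engagement is easily outweighed by a negative post with heavy engagement.

import pandas as pd

# Two hypothetical posts: sentiment in [-1, 1], weight = engagement.
toy = pd.DataFrame({'Net Sentiment': [0.9, -0.6], 'upvote': [1.0, 9.0]})
# (0.9*1 + (-0.6)*9) / (1 + 9) = -0.45, i.e. -45 on the 100 scale.
print(100 * (toy['Net Sentiment'] * toy['upvote']).sum() / toy['upvote'].sum())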

So now all that is left is to tie all of the code together. Let's do so!

# keyword_list and news_links are assumed to be defined elsewhere, e.g.:
# keyword_list = ['bitcoin', 'tesla']
# news_links = ['https://www.reuters.com']
now = dt.datetime.now()
yesterday = now - dt.timedelta(days=1)
day_before_yesterday = now - dt.timedelta(days=2)
today_date = '{0}-{1}-{2}'.format(now.year, now.month, now.day)
twitter_data_list = [today_date]
reddit_data_list = [today_date]
news_data_list = [today_date]
overall_data_list = [today_date]
for keyword in keyword_list:
    start_date = '{0}-{1}-{2}'.format(day_before_yesterday.year, day_before_yesterday.month, day_before_yesterday.day)
    end_date = '{0}-{1}-{2}'.format(yesterday.year, yesterday.month, yesterday.day)
    # Twitter: sentiment and volume for the previous day.
    df = TwitterDataFinal(keyword, start_date, end_date, 500, top_tweets=True)
    try:
        list_result = scorer(df)
        print(keyword + ' Twitter Done')
    except Exception:
        list_result = [0, 0]
        print(keyword + ' Twitter Issue')
    twitter_result = list_result
    twitter_data_list = twitter_data_list + twitter_result
    # Reddit: sentiment and volume.
    try:
        df = redditsearch(keyword)
        reddit_result = scorer(df)
    except Exception:
        reddit_result = [0, 0]
    list_result = list_result + reddit_result
    reddit_data_list = reddit_data_list + reddit_result
    # News: aggregate articles from all the configured news websites.
    df = pd.DataFrame()
    try:
        for news_link in news_links:
            df = df.append(NewsGatherer([keyword], news_link, 20, 200), ignore_index=True)
        news_result = scorer(df)
    except Exception:
        news_result = [0, 0]
    list_result = list_result + news_result
    news_data_list = news_data_list + news_result
    # list_result is now [twitter_sentiment, twitter_volume,
    #                     reddit_sentiment, reddit_volume,
    #                     news_sentiment, news_volume].
    # News volume counts articles rather than engagement, so replace it with
    # the mean of the Twitter and Reddit volumes before weighting.
    news_volumes = list_result[5]
    mean_of_all = (list_result[1] + list_result[3]) / 2
    list_result[5] = mean_of_all
    # Overall sentiment: volume-weighted mean across the three platforms.
    net_sentiment = (list_result[0] * list_result[1] + list_result[2] * list_result[3] + list_result[4] * list_result[5]) / (list_result[1] + list_result[3] + list_result[5])
    net_volume = list_result[1] + list_result[3] + list_result[5]
    list_result = list_result + [net_sentiment, net_volume, news_volumes]
    overall_data_list = overall_data_list + [net_sentiment, net_volume]

In the above code we perform the following steps:

  • Mine data from Twitter, Reddit, and major news websites for the previous day.
  • Score every post, then use a weighted mean to calculate the final sentiment for each platform.
  • Calculate a final overall sentiment value from the per-platform results.
  • Store all of this data in a list.
  • Repeat for all the keywords.
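
To run this automatically every day, one option is to wrap the loop above in a function and use the schedule library that was imported at the top. A minimal sketch (daily_job is a hypothetical wrapper, not part of the original code):

def daily_job():
    # Run the keyword loop above and persist the resulting lists
    # (e.g. to Google Sheets via gspread).
    ...


# Fire the job once a day; the time here is arbitrary.
schedule.every().day.at("06:00").do(daily_job)
while True:
    schedule.run_pending()
    time.sleep(60)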

And voilà! We have our own system that scrapes data from multiple platforms and gives us a final sentiment value for a particular topic.

You can check out the website (a dashboard for all the data collected so far) here, and its GitHub repo here.

Thanks a lot for reading this blog!

P.S. Please feel free to connect with me or Harshit (co-creator of the website) with any questions or suggestions.

Translated from: https://towardsdatascience.com/social-media-sentiment-gauging-system-4b765acc1135
