Crawling Weibo Sentiment Data for Specified Dates, Based on a Chinese Financial Sentiment Dictionary (Python)

二牛兄弟

Last modified 2023-01-18 23:07:00


Tags: python, web crawler, finance

First published 2023-01-18 22:52:04

Article link: https://blog.csdn.net/m0_51299665/article/details/128730388


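I have recently been working on a project that analyses Weibo public sentiment and financial markets, which requires crawling Weibo posts that match a keyword on specified dates. The code is now finished and works reasonably well in my tests: the sleep intervals are fairly long, but the crawled data is complete. During the three years of my master's degree I learned a great deal on CSDN, arguably more than at school, so I am sharing the code as a reference for anyone who needs it. Once my papers on ensemble learning, imbalanced data, and online sentiment analysis are accepted, I will share that work too; follows are welcome.

The main features of this source code are:

1. It retrieves Weibo keyword search results for specified dates. Keyword search on mobile Weibo returns a limited number of posts, and posts from earlier years are not shown at all, so this code searches and crawls through the advanced search of the Weibo web version. Due to technical limitations, each day yields at most 50 pages of results with at most 10 posts per page, that is, at most 500 posts per day.

2. Sentiment data is computed with a Chinese financial sentiment dictionary. For the dictionary source, see the CSDN post 中文金融情感词典发布啦 | 附代码_邓旭东HIT的博客-CSDN博客

3. Two dictionary configurations are used to compute the sentiment indicators: the first combines a general sentiment dictionary with the financial dictionary, and the second uses the financial dictionary alone.

4. Sentiment indicators are computed while the posts are being crawled. The advantage is that the raw text never has to be stored, which reduces memory pressure. For example, if 500 posts are crawled for one day, only one record is kept for that day because the indicators are computed per day, which greatly reduces memory overhead.

I. Code walkthrough

(1) Importing the required libraries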

import sys
import traceback
from time import sleep
import requests
from lxml import etree
from cnsenti import Sentiment
import numpy as np
import xlwt
import time

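(2) Parameter settings

begin_year and end_year define the range of years to crawl. The shared code can only search in whole years; if you need month or day granularity, you can extend it yourself.

name is the keyword to search for. In our study, for example, we use 贵州茅台 (Kweichow Moutai) as the search keyword.

senti and senti1 are the two dictionary configurations: senti (merge=True) merges the custom financial dictionary with cnsenti's built-in dictionary, while senti1 (merge=False) uses the financial dictionary alone.

host and ident are the fixed parts of the Weibo web advanced-search URL.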

begin_year = 2020
end_year = 2021  # e.g. to crawl 2021 and 2022, set begin_year = 2021, end_year = 2022
name = '贵州茅台'  # search keyword (Kweichow Moutai)
cookie = {'Cookie': 'cookie'}  # replace the quoted 'cookie' value with your own cookie; how to obtain it is not covered here
senti = Sentiment(pos='unformal_pos.txt',  neg='unformal_neg.txt', merge=True,   encoding='utf-8')  # both txt files are UTF-8 encoded
senti1 = Sentiment(pos='unformal_pos.txt',  neg='unformal_neg.txt',   merge=False, encoding='utf-8')  # both txt files are UTF-8 encoded
ident = '?q=%s&scope=ori&suball=1'%name
host = 'https://s.weibo.com/weibo'

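(3) Defining and initialising the crawler class

self.count records the number of days crawled.

self.num_data records the number of valid posts crawled under the first dictionary configuration.

self.num_data_f records the number of valid posts crawled under the financial-only dictionary.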

class pachong(object):
    def __init__(self, name,begin_year,end_year):
        self.count ,self.num_data,self.num_data_f= 0,0,0
        self.name = name
        self.begin_year = begin_year
        self.end_year = end_year
        self.workbook = xlwt.Workbook(encoding='utf-8')  # create a workbook
        self.worksheet = self.workbook.add_sheet('%s'%self.name)

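(4) Function 1: sentiment score of a single post

Tips: 1. If the text contains no positive or negative words, the score is provisionally set to -1 and the post is dropped later.

2. The sentiment score is $\frac{numpos}{numpos+numneg}$, that is, the number of positive sentiment words divided by the total number of positive and negative sentiment words.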

    def get_emotion(self,sentence,finacial):
        if finacial:
            result = senti1.sentiment_count(sentence)
        else:
            result = senti.sentiment_count(sentence)
        if result['pos']+result['neg'] == 0:
            # no sentiment words found: mark with -1 so the caller can drop this post
            score = -1
        else:
            score = result['pos']/(result['pos']+result['neg'])
        return score
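
As a quick illustration of the formula, here is a minimal sketch that works on plain word counts rather than on cnsenti output. The function name `sentiment_ratio` is hypothetical, and it returns None instead of the -1 sentinel used above.

```python
from typing import Optional


def sentiment_ratio(pos: int, neg: int) -> Optional[float]:
    """Share of positive words among all sentiment words, or None if there are none."""
    total = pos + neg
    if total == 0:
        return None  # no sentiment words: the post carries no signal
    return pos / total


print(sentiment_ratio(3, 1))  # 3 positive words, 1 negative word -> 0.75
print(sentiment_ratio(0, 0))  # no sentiment words -> None
```

A score of 1 means only positive words were found, 0 means only negative words, and 0.5 means an even split.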

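(5) Function 2: sentiment scores for one page of posts

Tips: 1. The selector for one results page is passed in and parsed; get_emotion is called for each post, and the scores are collected into lists and returned.

2. The XPath also matches the truncated previews of posts that are not fully displayed. If the text contains "展开" (expand), it is neither kept nor processed; this does not reduce the number of posts we obtain.

3. I no longer remember which blogger the text-cleaning line was borrowed from, as it was a long time ago. If it infringes on your rights, please let me know.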

    def get_onepage_emotion_score_list(self,selector):
        l1,l2 = [],[]
        info_text = selector.xpath('//*[@class="txt"]')
        for node in info_text:
            # strip zero-width spaces and drop characters the console encoding cannot represent
            text = node.xpath('string(.)').replace(u'\u200b', '').encode( sys.stdout.encoding, 'ignore').decode(sys.stdout.encoding)
            if u'展开c' in text:
                continue  # skip truncated previews of long posts
            s1 = self.get_emotion(text,False)
            s2 = self.get_emotion(text,True)
            if s1 != -1:
                l1.append(s1)
            if s2 != -1:
                l2.append(s2)
        return l1,l2
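
The cleaning and filtering step can be tried in isolation on plain strings. This is a minimal sketch, assuming the same '展开c' marker as the code above; the function name `keep_full_posts` and the sample texts are made up for illustration.

```python
from typing import Iterable, List

TRUNCATION_MARKER = "展开c"  # marker checked by the crawler for truncated previews


def keep_full_posts(texts: Iterable[str]) -> List[str]:
    """Remove zero-width spaces and drop empty or truncated texts."""
    cleaned = []
    for text in texts:
        text = text.replace("\u200b", "").strip()
        if not text or TRUNCATION_MARKER in text:
            continue  # nothing useful to score
        cleaned.append(text)
    return cleaned


sample = ["茅台股价上涨\u200b", "今天的行情... 展开c", "  业绩下滑  "]
print(keep_full_posts(sample))  # the truncated second entry is dropped
```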

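(6) Function 3: daily sentiment indicators

Tips: 1. The sentiment scores for one day are passed in and two daily indicators are built from them.

2. Indicator 1 is the mean of the sentiment scores of all posts that day.

3. Indicator 2 is the variance of the sentiment scores of all posts that day (np.var, which computes the population variance by default).

4. Because there are two dictionaries, there are four indicators in total.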

    def preprocessing_score_list(self,emotion_score_list,emotion_score_f_list):
        e_list,e_list_f = [],[]
        for i in range(len(emotion_score_list)):
            for j in range(len(emotion_score_list[i])):
                e_list.append(emotion_score_list[i][j])
        for i in range(len(emotion_score_f_list)):
            for j in range(len(emotion_score_f_list[i])):
                e_list_f.append(emotion_score_f_list[i][j])
        mean_emotion_score = np.mean(e_list)
        var_emotion_score = np.var(e_list)
        mean_emotion_score_f = np.mean(e_list_f)
        var_emotion_score_f = np.var(e_list_f)
        return mean_emotion_score,var_emotion_score,mean_emotion_score_f,var_emotion_score_f
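
To see what the two indicators look like on concrete numbers, here is a minimal sketch using only the standard library. `statistics.pvariance` is the population variance, which matches the default of `np.var`. The function name `daily_indicators` is hypothetical, and returning NaN for a day without any scored posts is an assumption of this sketch, not something the original code does explicitly.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def daily_indicators(pages: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Flatten per-page score lists and return (mean, population variance)."""
    scores: List[float] = [score for page in pages for score in page]
    if not scores:
        return math.nan, math.nan  # no scored posts that day
    return fmean(scores), pvariance(scores)


mean, var = daily_indicators([[1.0, 0.5], [0.0, 0.5]])
print(mean, var)  # 0.5 0.125
```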

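(7) Function 4: building the date segment of the URL

Tips: takes the year, month and day and returns the date-range string used in the Weibo advanced-search URL. The range runs from hour 0 of the given day to hour 0 of the next day, for example 2020-01-31-0:2020-02-01-0.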

    def get_date_range(self,year,month,day,day_max):
        if day < day_max:
            date_range = '%d-%s-%s-0:%d-%s-%s-0'%(year,str(month).zfill(2),str(day).zfill(2),year,str(month).zfill(2),str(day+1).zfill(2))
        else:
            if month <12:
                date_range = '%d-%s-%s-0:%d-%s-%s-0'%(year,str(month).zfill(2),str(day).zfill(2),year,str(month+1).zfill(2),str(1).zfill(2))
            else:
                date_range = '%d-%s-%s-0:%d-%s-%s-0'%(year,str(month).zfill(2),str(day).zfill(2),year+1,str(1).zfill(2),str(1).zfill(2))
        return date_range
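
The month and year rollover can also be delegated to the standard library. The following is a sketch of an alternative, not the author's code: `date_range_for` is a hypothetical name, and it produces the same string format as get_date_range without needing day_max.

```python
from datetime import date, timedelta


def date_range_for(day: date) -> str:
    """Return 'YYYY-MM-DD-0:YYYY-MM-DD-0' covering the given day."""
    next_day = day + timedelta(days=1)  # handles month and year boundaries
    return f"{day:%Y-%m-%d}-0:{next_day:%Y-%m-%d}-0"


print(date_range_for(date(2020, 12, 31)))  # 2020-12-31-0:2021-01-01-0
```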

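(8) Function 5: number of days in a given month

In principle there is a quicker way to do this, but I did not know it, and since reinventing this wheel takes little effort I simply wrote my own.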

    def get_day_max(self,year,month):
        if month<8:
            if month%2 == 1:
                day_max = 31
            elif month == 2:
                if (year%4 == 0 and year%100 != 0) or year%400 == 0:  # full Gregorian leap-year rule
                    day_max = 29
                else:
                    day_max = 28
            else:
                day_max = 30
        else:
            if month%2 == 0:
                day_max = 31
            else:
                day_max = 30
        return day_max
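
One quicker way is `calendar.monthrange` from the standard library, which returns the weekday of the first day of the month and the number of days in the month. A minimal sketch; the wrapper name `days_in_month` is hypothetical.

```python
import calendar


def days_in_month(year: int, month: int) -> int:
    """Number of days in the given month, leap years included."""
    return calendar.monthrange(year, month)[1]


print(days_in_month(2020, 2))  # 29 (leap year)
print(days_in_month(2021, 2))  # 28
```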

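(9) Function 6: building the URL

Tips: the Weibo advanced-search URL has the form host+ident+scope. ident determines how the search is performed, for example whether only original posts are returned or whether posts from big-V (verified) accounts are included; scope carries the date range, the page number, and so on.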

    def preprocessing_url(self,year,month,day,day_max,page):
        date_range = self.get_date_range(year,month,day,day_max)
        scope = '&timescope=custom:%s&Refer=g&page=%s' % (date_range, page)
        url = host+ident+scope
        return url
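
To make the host+ident+scope structure concrete, here is a standalone sketch that assembles the same URL from a keyword, a date-range string and a page number. The function name `build_search_url` is hypothetical, and percent-encoding the keyword explicitly with `urllib.parse.quote` is an assumption of this sketch; the original passes the raw keyword and leaves the encoding to requests.

```python
from urllib.parse import quote

HOST = "https://s.weibo.com/weibo"


def build_search_url(keyword: str, date_range: str, page: int) -> str:
    """Assemble an advanced-search URL: host + ident + scope."""
    ident = f"?q={quote(keyword)}&scope=ori&suball=1"  # search mode
    scope = f"&timescope=custom:{date_range}&Refer=g&page={page}"  # dates and page
    return HOST + ident + scope


print(build_search_url("贵州茅台", "2020-01-01-0:2020-01-02-0", 1))
```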

(10) Function 7: requesting the URL

    def deal_html(self,url):
        html = requests.get(url, cookies=cookie).content
        selector = etree.HTML(html)
        return selector
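
The main function wraps every call to deal_html in the same retry loop. One possible way to factor that out is a small helper with a bounded number of attempts. This is a sketch under stated assumptions, not the author's implementation: the name `fetch_with_retry`, the attempt limit and the injectable `sleep` parameter are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retry(fetch: Callable[[], T],
                     max_attempts: int = 5,
                     delay_seconds: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds or max_attempts is reached."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # broad on purpose, like the original loops
            last_error = error
            print(f"attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                sleep(delay_seconds)  # wait before the next try
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error
```

In the crawler this could be called as `fetch_with_retry(lambda: self.deal_html(url))`. Passing a `timeout` argument to `requests.get` inside deal_html would additionally keep a stalled connection from blocking forever.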

(11) Function 8: saving the data

Data is saved day by day. Because the text is fetched by date, the posting time inside each post is not scraped.

    def save_one_day(self,data):
        try:
            self.count = self.count+1
            self.worksheet.write(self.count, 0, label=data.get('year'))
            self.worksheet.write(self.count, 1, label=data.get('month'))
            self.worksheet.write(self.count, 2, label=data.get('day'))
            self.worksheet.write(self.count, 3, label=data.get('mean_emotion_score'))
            self.worksheet.write(self.count, 4, label=data.get('var_emotion_score'))
            self.worksheet.write(self.count, 5, label=data.get('mean_emotion_score_f'))
            self.worksheet.write(self.count, 6, label=data.get('var_emotion_score_f'))
            print('num_data:',self.num_data,'num_data_f:',self.num_data_f,'count:',self.count,data)
            self.workbook.save('%s.xls'%self.name)
        except Exception as e:
            print('Failed to save %d-%d-%d'%(data.get('year'), data.get('month'), data.get('day')))
            print(e)
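
If a plain-text file is preferred over .xls, the same daily records can be written with the standard `csv` module. This is an alternative sketch rather than part of the original code: it assumes the record keys used in save_one_day, and the function name `write_daily_records` is hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, TextIO, Union

FIELDS = ["year", "month", "day",
          "mean_emotion_score", "var_emotion_score",
          "mean_emotion_score_f", "var_emotion_score_f"]


def write_daily_records(stream: TextIO,
                        records: Iterable[Dict[str, Union[int, float]]]) -> None:
    """Write a header row followed by one row per day."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO()  # stands in for a file opened with newline=""
write_daily_records(buffer, [{
    "year": 2020, "month": 1, "day": 1,
    "mean_emotion_score": 0.5, "var_emotion_score": 0.125,
    "mean_emotion_score_f": 0.6, "var_emotion_score_f": 0.1,
}])
print(buffer.getvalue())
```

The numbers in the sample record are made up purely to show the layout.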

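(12) Function 9: the main crawler function

Tips: 1. Nested for loops step through the dates.

2. The number of result pages for the day is read from the first page.

3. If Weibo refuses the request, the crawler sleeps for 10 seconds and retries until the request succeeds.

4. The output file is named after the search keyword with the .xls extension, for example 贵州茅台.xls.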

    def begin(self):
        for year in range(self.begin_year,self.end_year+1):
            for month in range(1,13):
                day_max = self.get_day_max(year,month)
                for day in range(1,day_max+1):
                    date_range = self.get_date_range(year,month,day,day_max)
                    emotion_score_list = []
                    emotion_score_f_list = []
                    url = self. preprocessing_url(year,month,day,day_max,1)
                    loop =True
                    while loop:
                        try:
                            selector = self.deal_html(url)
                            loop = False
                        except Exception as e:
                            print(e)
                            print('Error: ', 'failed to fetch the page count, retrying')
                            time.sleep(10)
                    try:
                        q = 1
                        while(len(selector.xpath('//*[@id="pl_feedlist_index"]/div[3]/div/span/ul/li')) == 0):
                            q = q+1
                            url = self. preprocessing_url(year,month,day,day_max,q)
                            loop =True
                            while loop:
                                try:
                                    selector = self.deal_html(url)
                                    loop = False
                                except Exception as e:
                                    print(e)
                                    print('Error: ', 'failed to fetch the page count, retrying')
                                    time.sleep(10)
                        page_num =  len(selector.xpath('//*[@id="pl_feedlist_index"]/div[3]/div/span/ul/li'))
                        for page in range(1,page_num+1):
                            url =  self.preprocessing_url(year,month,day,day_max,page)
                            loop = True
                            while loop:
                                try:
                                    selector = self.deal_html(url)
                                    loop = False
                                except Exception as e:
                                    print(e)
                                    print('Error: ', 'failed to fetch the results page, retrying')
                                    print('%d-%d-%d page %d'%(year,month,day,page))
                                    time.sleep(10)    
                            l1 ,l2 = self.get_onepage_emotion_score_list(selector)
                            self.num_data = self.num_data+len(l1)
                            self.num_data_f = self.num_data_f + len(l2)
                            if len(l1)>0:
                                emotion_score_list.append(l1)
                            if len(l2)>0:
                                emotion_score_f_list.append(l2)
                            time.sleep(1)
                        mean_emotion_score,var_emotion_score,mean_emotion_score_f,var_emotion_score_f =  self.preprocessing_score_list(emotion_score_list,emotion_score_f_list)
                        data = {
                                    'year': year,
                                    'month': month,
                                    'day': day,
                                    'mean_emotion_score': mean_emotion_score,
                                    'var_emotion_score': var_emotion_score,
                                    'mean_emotion_score_f': mean_emotion_score_f,
                                    'var_emotion_score_f': var_emotion_score_f  }
                        self.save_one_day(data)
                    except Exception as e:
                         print(e)
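
The three nested date loops can also be expressed as a single generator over calendar days, which removes the need for day_max in the main function. A minimal sketch, assuming whole-year ranges as in the original; the name `iter_days` is hypothetical.

```python
from datetime import date, timedelta
from typing import Iterator


def iter_days(begin_year: int, end_year: int) -> Iterator[date]:
    """Yield every day from 1 January of begin_year to 31 December of end_year."""
    current = date(begin_year, 1, 1)
    last = date(end_year, 12, 31)
    while current <= last:
        yield current
        current += timedelta(days=1)


days = list(iter_days(2020, 2021))
print(days[0], days[-1], len(days))  # 2020-01-01 2021-12-31 731
```

Each yielded date exposes `.year`, `.month` and `.day`, so it can feed the existing URL-building and saving code directly.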

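II. Results

A successful run prints one line per day showing num_data, num_data_f, count and that day's record.

When a page request fails, the exception and a retry message are printed and the request is retried after a sleep; please be patient.

The saved .xls file contains, from left to right: year, month, day, the daily mean sentiment score under the first dictionary, the daily variance of the sentiment score under the first dictionary, the daily mean sentiment score under the financial-only dictionary, and the daily variance of the sentiment score under the financial-only dictionary.

III. Full code

If you have no special need for further development, you only need to set the parameters as instructed and install the dictionary files and open-source libraries to run it.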

import random
import sys
import traceback
from time import sleep
import requests
from lxml import etree
from cnsenti import Sentiment
import numpy as np
import xlwt
import time

begin_year = 2020
end_year = 2021  # e.g. to crawl 2021 and 2022, set begin_year = 2021, end_year = 2022
name = '贵州茅台'
cookie = {'Cookie': 'cookie'}  # replace the quoted 'cookie' value with your own cookie


senti = Sentiment(pos='unformal_pos.txt',  neg='unformal_neg.txt', merge=True,   encoding='utf-8')  # both txt files are UTF-8 encoded
senti1 = Sentiment(pos='unformal_pos.txt',  neg='unformal_neg.txt',   merge=False, encoding='utf-8')  # both txt files are UTF-8 encoded
ident = '?q=%s&scope=ori&suball=1'%name
host = 'https://s.weibo.com/weibo'

class pachong(object):
    def __init__(self, name,begin_year,end_year):
        self.count ,self.num_data,self.num_data_f= 0,0,0
        self.name = name
        self.begin_year = begin_year
        self.end_year = end_year
        self.workbook = xlwt.Workbook(encoding='utf-8')  # create a workbook
        self.worksheet = self.workbook.add_sheet('%s'%self.name)
        
    def get_day_max(self,year,month):
        if month<8:
            if month%2 == 1:
                day_max = 31
            elif month == 2:
                if (year%4 == 0 and year%100 != 0) or year%400 == 0:  # full Gregorian leap-year rule
                    day_max = 29
                else:
                    day_max = 28
            else:
                day_max = 30
        else:
            if month%2 == 0:
                day_max = 31
            else:
                day_max = 30
        return day_max
    
    def get_date_range(self,year,month,day,day_max):
        if day < day_max:
            date_range = '%d-%s-%s-0:%d-%s-%s-0'%(year,str(month).zfill(2),str(day).zfill(2),year,str(month).zfill(2),str(day+1).zfill(2))
        else:
            if month <12:
                date_range = '%d-%s-%s-0:%d-%s-%s-0'%(year,str(month).zfill(2),str(day).zfill(2),year,str(month+1).zfill(2),str(1).zfill(2))
            else:
                date_range = '%d-%s-%s-0:%d-%s-%s-0'%(year,str(month).zfill(2),str(day).zfill(2),year+1,str(1).zfill(2),str(1).zfill(2))
        return date_range
    
    def get_emotion(self,sentence,finacial):
        if finacial:
            result = senti1.sentiment_count(sentence)
        else:
            result = senti.sentiment_count(sentence)
        if result['pos']+result['neg'] == 0:
            # no sentiment words found: mark with -1 so the caller can drop this post
            score = -1
        else:
            score = result['pos']/(result['pos']+result['neg'])
        return score
    
    def get_onepage_emotion_score_list(self,selector):
        l1,l2 = [],[]
        info_text = selector.xpath('//*[@class="txt"]')
        for node in info_text:
            # strip zero-width spaces and drop characters the console encoding cannot represent
            text = node.xpath('string(.)').replace(u'\u200b', '').encode( sys.stdout.encoding, 'ignore').decode(sys.stdout.encoding)
            if u'展开c' in text:
                continue  # skip truncated previews of long posts
            s1 = self.get_emotion(text,False)
            s2 = self.get_emotion(text,True)
            if s1 != -1:
                l1.append(s1)
            if s2 != -1:
                l2.append(s2)
        return l1,l2
    
    def preprocessing_url(self,year,month,day,day_max,page):
        date_range = self.get_date_range(year,month,day,day_max)
        scope = '&timescope=custom:%s&Refer=g&page=%s' % (date_range, page)
        url = host+ident+scope
        return url
    
    def deal_html(self,url):
        html = requests.get(url, cookies=cookie).content
        selector = etree.HTML(html)
        return selector

        
    def preprocessing_score_list(self,emotion_score_list,emotion_score_f_list):
        e_list,e_list_f = [],[]
        for i in range(len(emotion_score_list)):
            for j in range(len(emotion_score_list[i])):
                e_list.append(emotion_score_list[i][j])
        for i in range(len(emotion_score_f_list)):
            for j in range(len(emotion_score_f_list[i])):
                e_list_f.append(emotion_score_f_list[i][j])
        mean_emotion_score = np.mean(e_list)
        var_emotion_score = np.var(e_list)
        mean_emotion_score_f = np.mean(e_list_f)
        var_emotion_score_f = np.var(e_list_f)
        return mean_emotion_score,var_emotion_score,mean_emotion_score_f,var_emotion_score_f
    
    def save_one_day(self,data):
        try:
            self.count = self.count+1
            self.worksheet.write(self.count, 0, label=data.get('year'))
            self.worksheet.write(self.count, 1, label=data.get('month'))
            self.worksheet.write(self.count, 2, label=data.get('day'))
            self.worksheet.write(self.count, 3, label=data.get('mean_emotion_score'))
            self.worksheet.write(self.count, 4, label=data.get('var_emotion_score'))
            self.worksheet.write(self.count, 5, label=data.get('mean_emotion_score_f'))
            self.worksheet.write(self.count, 6, label=data.get('var_emotion_score_f'))
            print('num_data:',self.num_data,'num_data_f:',self.num_data_f,'count:',self.count,data)
            self.workbook.save('%s.xls'%self.name)
        except Exception as e:
            print('Failed to save %d-%d-%d'%(data.get('year'), data.get('month'), data.get('day')))
            print(e)
        

                
                
    
    def begin(self):
        for year in range(self.begin_year,self.end_year+1):
            for month in range(1,13):
                day_max = self.get_day_max(year,month)
                for day in range(1,day_max+1):
                    date_range = self.get_date_range(year,month,day,day_max)
                    emotion_score_list = []
                    emotion_score_f_list = []
                    url = self. preprocessing_url(year,month,day,day_max,1)
                    loop =True
                    while loop:
                        try:
                            selector = self.deal_html(url)
                            loop = False
                        except Exception as e:
                            print(e)
                            print('Error: ', 'failed to fetch the page count, retrying')
                            time.sleep(10)
                    try:
                        q = 1
                        while(len(selector.xpath('//*[@id="pl_feedlist_index"]/div[3]/div/span/ul/li')) == 0):
                            q = q+1
                            url = self. preprocessing_url(year,month,day,day_max,q)
                            loop =True
                            while loop:
                                try:
                                    selector = self.deal_html(url)
                                    loop = False
                                except Exception as e:
                                    print(e)
                                    print('Error: ', 'failed to fetch the page count, retrying')
                                    time.sleep(10)
                        page_num =  len(selector.xpath('//*[@id="pl_feedlist_index"]/div[3]/div/span/ul/li'))
                        for page in range(1,page_num+1):
                            url =  self.preprocessing_url(year,month,day,day_max,page)
                            loop = True
                            while loop:
                                try:
                                    selector = self.deal_html(url)
                                    loop = False
                                except Exception as e:
                                    print(e)
                                    print('Error: ', 'failed to fetch the results page, retrying')
                                    print('%d-%d-%d page %d'%(year,month,day,page))
                                    time.sleep(10)    
                            l1 ,l2 = self.get_onepage_emotion_score_list(selector)
                            self.num_data = self.num_data+len(l1)
                            self.num_data_f = self.num_data_f + len(l2)
                            if len(l1)>0:
                                emotion_score_list.append(l1)
                            if len(l2)>0:
                                emotion_score_f_list.append(l2)
                            time.sleep(1)
                        mean_emotion_score,var_emotion_score,mean_emotion_score_f,var_emotion_score_f =  self.preprocessing_score_list(emotion_score_list,emotion_score_f_list)
                        data = {
                                    'year': year,
                                    'month': month,
                                    'day': day,
                                    'mean_emotion_score': mean_emotion_score,
                                    'var_emotion_score': var_emotion_score,
                                    'mean_emotion_score_f': mean_emotion_score_f,
                                    'var_emotion_score_f': var_emotion_score_f  }
                        self.save_one_day(data)
                    except Exception as e:
                         print(e) 


model = pachong(name,begin_year,end_year)
model.begin()

The retry logic for URL requests is still incomplete and will be updated later.
