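Scraping stock comments with a Python crawler, running sentiment analysis with Baidu AI, and using MATLAB to examine how price moves relate to the comments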

Article link: https://blog.csdn.net/zhyl4669/article/details/88742684

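I wrote this article and debugged the code myself, but the idea is borrowed, haha, so it cannot be called original in the strict sense.
I. Scraping the stock comments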

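Environment: Windows 7, Anaconda2 (Python 2.7), PyCharm 3.5 Professional
1. Analyzing the crawler structure
Start from the list page http://guba.eastmoney.com/list,600570_2.html and click through to one comment thread, for example

http://guba.eastmoney.com/news,600570,810797250.html. The goal is mainly to scrape the comment area on this page; the titles are not scraped.
The crawl therefore has two levels, so the code has two main functions: get_url, which collects the addresses of the second-level pages, and get_comments, which extracts the comment content.
Viewing the page source and analyzing the HTML shows the structure that holds the comments. These two statements are the key ones (the page layout may change; if it does, you need to analyze it again yourself, mainly by looking at the div tags):
urls = text.xpath('//div[@id="articlelistnew"]/div[@class="articleh normal_post"]/span[3]/a/@href')
times1 = text1.xpath('//div[@class="zwlitx"]/div/div[2]/text()')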
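
To see what the first of these expressions selects, here is a minimal sketch that runs the same element path against a tiny, made-up fragment of the list page. It uses the standard library's xml.etree.ElementTree instead of lxml, so the path stops at the a element and the href is read with get(). The sample markup and the name extract_thread_links are assumptions for illustration, not the real page or the article's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified list-page markup (the real page is much larger)
LIST_PAGE = """
<html><body>
<div id="articlelistnew">
  <div class="articleh normal_post">
    <span>10</span><span>2</span>
    <span><a href="/news,600570,810797250.html">title A</a></span>
  </div>
  <div class="articleh normal_post">
    <span>5</span><span>0</span>
    <span><a href="/news,600570,810797251.html">title B</a></span>
  </div>
</div>
</body></html>
"""


def extract_thread_links(page_source):
    """Return the href of the link inside the third span of every normal post."""
    root = ET.fromstring(page_source.strip())
    path = ".//div[@id='articlelistnew']/div[@class='articleh normal_post']/span[3]/a"
    return [a.get("href") for a in root.findall(path)]


links = extract_thread_links(LIST_PAGE)
print(links)
```

Each link found this way is relative, which is why get_comments prefixes it with 'http://guba.eastmoney.com' before requesting the thread page.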

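2. Writing to CSV. I never managed to append Chinese text to an Excel file; I tried many references found online, but none of them worked for me. I switched to writing CSV, where the write parameters can be set. So far, CSV has avoided the garbled characters.
3. The CSV file does not need to be created beforehand; it is created automatically and new rows are appended to it.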
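
The append behaviour can be sketched with the standard library alone. Note that this is a Python 3 sketch, whereas the article's script runs on Python 2.7 with pandas. Opening the file in mode 'a' creates it on first use and appends afterwards, and the utf-8-sig encoding writes a byte-order mark at the start of the file, which is what typically lets Excel recognize the Chinese text. The function names and sample rows are made up for illustration.

```python
import csv
import os
import tempfile


def append_rows(path, times, comments):
    """Append (time, comment) pairs; mode 'a' creates the file if it is missing."""
    with open(path, "a", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        for t, c in zip(times, comments):
            writer.writerow([t, c])


def read_rows(path):
    """Read the file back, dropping the byte-order mark."""
    with open(path, encoding="utf-8-sig", newline="") as f:
        return list(csv.reader(f))


# Two separate calls stand in for two scraped threads
demo_path = os.path.join(tempfile.mkdtemp(), "eastmoney_demo.csv")
append_rows(demo_path, ["2019-03-21"], ["今天大涨"])   # sample comment: "big rise today"
append_rows(demo_path, ["2019-03-22"], ["明天看跌"])   # sample comment: "bearish on tomorrow"
rows = read_rows(demo_path)
print(rows)
```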

# -*- coding:UTF-8 -*-
import sys
# Python 2 only: reset the default string encoding to UTF-8 so Chinese text can be handled
reload(sys)
sys.setdefaultencoding("utf-8")

import re, requests, codecs, time, random
import pandas as pd
from lxml import html

# proxies={"http" : "123.53.86.133:61234"}
proxies = None
headers = {
    'Host': 'guba.eastmoney.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}


def get_url(page):
    stocknum = 600570
    url = 'http://guba.eastmoney.com/list,' + str(stocknum) + '_' + str(page) + '.html'
    try:
        resp = requests.get(url, headers=headers, proxies=proxies, timeout=50)
        # Parse the list page and collect the relative links to the comment threads
        tree = html.fromstring(resp.text)
        urls = tree.xpath('//div[@id="articlelistnew"]/div[@class="articleh normal_post"]/span[3]/a/@href')

    except Exception as e:
        print(e)
        time.sleep(random.random() + random.randint(0, 3))
        urls = []  # on failure return an empty list so the caller has nothing to crawl
    return urls

def get_comments(urls):
    for newurl in urls[0:10]:
        newurl1 = 'http://guba.eastmoney.com' + newurl

        # try:
        resp1 = requests.get(newurl1, headers=headers, proxies=proxies, timeout=50)
        # Parse the thread page; text1 is the element tree queried below
        text1 = html.fromstring(resp1.text)
        # times1 = text1.xpath('//div[@class="zwli clearfix"]/div[3]/div/div[2]/text()')
        times1 = text1.xpath('//div[@class="zwlitx"]/div/div[2]/text()')

        if times1:  # xpath() returns a list (possibly empty), never None
            # times = '!'.join(re.sub(re.compile('fabiao| '), '', x)[:13] for x in times1).split('!')
            # Characters 3..13 of each timestamp text; this column is later used as the date
            times0 = [re.sub(re.compile('| '), '', x)[3:14] for x in times1]
            # times=list(map(lambda x:re.sub(re.compile('fabiao| '),'',x)[:10],times))
            # comments1 = text1.xpath('//div[@class="zwli clearfix"]/div[3]/div/div[3]/text()')
            comments1 = text1.xpath('//div[@class="zwlitx"]/div/div[3]/div[1]/text()')
            # Strip whitespace per comment; a plain list avoids splitting comments that contain '!'
            comments0 = [w.strip() for w in comments1]
            save_to_file(times0, comments0)

            # for  i  in  range(0,len(times)-1) :
            #
            #     dic = dict(zip(times[i], comments[i]))
            #     print  times[i], comments[i]
            #     save_to_file(dic)
            #     time.sleep(random.random() + random.randint(0, 3))
            #

            # #
            # dic = dict(zip(times, comments))
            # save_to_file(times,comments)
            # time.sleep(random.random() + random.randint(0, 3))

    # except
    #
    #     print('no comment!!!!')
    #     # time.sleep(random.random() + random.randint(0, 3))
    #     # print(dic)
    #     # if times and comments:
    #     # dic.append({'time':times,'comment':comments})
    #     # return dic


def save_to_file(times,comments):
    # if dic:
        # dic=dic
        # print(dic)
        df=pd.DataFrame([times,comments]).T

        # df.to_excel('eastnoney.xlsx')
        # df.to_csv('eastnoney.csv',encoding="utf_8_sig")
        df.to_csv('eastmoney.csv', encoding="utf_8_sig",mode='a', header=False)
        # print('xiele')

        # for i, j in dic.items():
        #     output = '{}\t{}\n'.format(i, j)
        #     f = codecs.open('eastmoney.xls', 'a+', 'utf-8')
        #     # f = codecs.open('eastmoney.xls')
        #     f.write(output)
        #     f.close()
for page in range(306, 1000):
    print('Crawling to page {}'.format(page))
    urls = get_url(page)
    get_comments(urls)
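
Note that get_url only sleeps after a failure and then gives up on that page. If you want a failed request to be tried again, a small retry wrapper is one option. The sketch below is an assumption, not part of the original script: fetch stands for any callable that downloads a URL (for example a function wrapping requests.get), and the sleep parameter exists only so that the delay can be swapped out when testing.

```python
import random
import time


def fetch_with_retry(fetch, url, attempts=3, max_wait=3.0, sleep=time.sleep):
    """Call fetch(url), retrying after a random pause whenever it raises."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as e:  # same broad catch as the crawler above
            last_error = e
            if attempt < attempts - 1:
                sleep(random.random() * max_wait)  # random pause before the next try
    raise last_error


# Demo with a fake downloader that fails twice and then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"


page = fetch_with_retry(flaky_fetch,
                        "http://guba.eastmoney.com/list,600570_306.html",
                        sleep=lambda seconds: None)
print(page, len(calls))
```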

II. Using Baidu AI
1. Register at https://developer.baidu.com/
Create an application; once it is created you get the three parameters used in client = AipNlp(APP_ID, API_KEY, SECRET_KEY).
https://console.bce.baidu.com/?_=1553240966654&fromai=1#/aip/overview

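You must create the application under the right product category. If you create it in the wrong place, the three parameters will not work and the call returns {"error_code":14,"error_msg":"IAM Certification failed"}. There is little discussion of this error online; I was stuck on it for a long time until one sentence in a forum thread gave me the idea. I had indeed created mine in the wrong place, apparently under another product (the console also offers face recognition, image recognition, speech recognition and so on, which might be worth trying later).
The application has to be created under Natural Language Processing. http://ai.baidu.com/forum/es/search?title={"error_code":14,"error_msg":"IAM Certification failed"}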
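
Since this failure arrives as an ordinary response body rather than as a Python exception, it helps to check for error_code before reading the result. A minimal sketch, assuming the response is a dict shaped either like the error above or like the success case used in the script further down (an items list whose first entry has positive_prob, confidence and sentiment). parse_sentiment_response is a hypothetical helper, not part of the Baidu SDK.

```python
def parse_sentiment_response(resp):
    """Return (positive_prob, confidence, sentiment) or raise on an API error."""
    if "error_code" in resp:
        raise RuntimeError("Baidu API error {}: {}".format(
            resp["error_code"], resp.get("error_msg", "")))
    item = resp["items"][0]
    return item["positive_prob"], item["confidence"], item["sentiment"]


# Made-up responses for illustration
ok = {"items": [{"positive_prob": 0.9, "confidence": 0.8, "sentiment": 2}]}
bad = {"error_code": 14, "error_msg": "IAM Certification failed"}

print(parse_sentiment_response(ok))
try:
    parse_sentiment_response(bad)
except RuntimeError as e:
    print(e)
```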


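2. The scraped comments sit in an Excel file with one date column and one comment column; no header row is needed.
Call Baidu AI to run sentiment analysis on the scraped comments in that Excel file: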

# -*- coding:UTF-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
# import myPython1.CommonTools.codeZh

import pandas as pd
import datetime
from aip import AipNlp
import codecs

startdate = datetime.date(2015, 4, 9).strftime('%Y-%m-%d')
enddate = datetime.date(2019, 3, 22).strftime('%Y-%m-%d')
# Fill in the three values from your own Baidu Natural Language Processing application
APP_ID = 'your_app_id'
API_KEY = 'your_api_key'
SECRET_KEY = 'your_secret_key'

client = AipNlp(APP_ID, API_KEY, SECRET_KEY)
def get_sentiments(text, dates):
    try:
        sitems = client.sentimentClassify(text)['items'][0]  # sentiment analysis

        # sitems = client.sentimentClassify(text)
        positive = sitems['positive_prob']  # probability of positive sentiment
        confidence = sitems['confidence']  # confidence
        sentiment = sitems['sentiment']  # 0 = negative, 1 = neutral, 2 = positive
        # tagitems = client.commentTag(text, {'type': 9})  # comment opinion tags
        # propertys=tagitems['prop']  # property
        # adj=tagitems['adj']  # descriptive word
        output = '{}\t{}\t{}\t{}\n'.format(dates, positive, confidence, sentiment)
        f = codecs.open('sentiment.xls', 'a+', 'utf-8')
        f.write(output)
        f.close()
        # print('Done')
    except Exception as e:
        print(e)



def get_content():
    data = pd.read_excel('eastmoney.xlsx', sheet_name=0)
    data.columns = ['Dates', 'viewpoints']  # reset the column headers
    # data = data.sort_values(by=['Dates'])  # sort by date
    # vdata = data[data.Dates >= startdate]  # keep rows from the start date on
    # newvdata = vdata.groupby('Dates').agg(lambda x: list(x))  # group by date, merging each day's comments

    # Group by date so that all comments from the same day end up in one list
    newvdata = data.groupby('Dates').agg(lambda x: list(x))
    return newvdata


if __name__ == "__main__":
    viewdata = get_content()
    total = viewdata.shape[0]
    for i in range(total):
        print('{} is being processed,{} remains'.format(i, total - 1 - i))
        dates = viewdata.index[i]
        for view in viewdata.viewpoints.iloc[i]:
            # print(view)
            get_sentiments(view, dates)

    print('ok')
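
The grouping step in get_content collects all comments that share a date into one list. For readers less familiar with pandas, the same idea in plain Python looks roughly like this. The sample rows are invented and group_by_date is a name chosen for this sketch; unlike the pandas groupby, which sorts by date, it keeps the dates in first-seen order.

```python
from collections import OrderedDict


def group_by_date(rows):
    """rows: iterable of (date, comment); returns date -> list of comments."""
    grouped = OrderedDict()
    for date, comment in rows:
        grouped.setdefault(date, []).append(comment)
    return grouped


rows = [
    ("2019-03-21", "comment a"),
    ("2019-03-22", "comment b"),
    ("2019-03-21", "comment c"),
]
grouped = group_by_date(rows)
for date, views in grouped.items():
    print(date, len(views))
```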

3. The sentiment analysis result file

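Nothing else is special here; the Python program runs as is. The catch is in the sentiment results: there are many comments per day, and therefore several sentiment results per day. I do not know how the original author handled this when plotting, but their chart shows a single value for each day.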

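III. Further processing of the data in MATLAB

Here I simply averaged the three parameters positive, confidence and sentiment over all rows that share the same date, so that each day ends up with exactly one set of three values. The program below, strcmparrayTime.m, implements this.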
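
If you would rather stay in Python, the same per-day averaging can be sketched in a few lines. This is not the MATLAB program described here, only an equivalent under the assumption that each input line has the tab-separated layout written by get_sentiments (date, positive, confidence, sentiment). daily_means and the sample lines are made up for illustration.

```python
from collections import defaultdict


def daily_means(lines):
    """Average positive, confidence and sentiment over all lines of the same date."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])  # three running sums plus a row count
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip malformed lines
        acc = sums[parts[0].strip()]
        for k in range(3):
            acc[k] += float(parts[k + 1])
        acc[3] += 1
    return {date: tuple(total / acc[3] for total in acc[:3])
            for date, acc in sums.items()}


sample = [
    "2019-03-21\t0.5\t0.25\t2\n",
    "2019-03-21\t1.0\t0.75\t0\n",
    "2019-03-22\t0.25\t0.5\t1\n",
]
means = daily_means(sample)
for date in sorted(means):
    print(date, means[date])
```

Averaging the sentiment label (0, 1 or 2) treats it as a number, which is the same simplification the MATLAB version makes.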

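MATLAB is what I use most at work these days. Writing this in Python felt tiring: it takes fiddly trial and error, and the routines would have to be written from scratch, so I fell back on brute-force processing in MATLAB.
Environment: Windows 7, MATLAB 2016b (cracked version). The code of the MATLAB program strcmparrayTime.m follows: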

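% Read the sentiment results; dataall holds every cell (text and numbers)
[data,tex,dataall]=xlsread('sentiment.xlsx');
% Dates to aggregate over, as returned by tdays (this overwrites data from the line above)
data=tdays('2015/04/09','2019/03/22');
dat=dataall(:,1);
% Normalize the date strings, e.g. ' 2019-03-05' becomes '2019/3/5'
Datet=cellfun(@(x) {strrep(x,'-0','/') },dat);
Datet=cellfun(@(x) {strrep(x,'-','/') },Datet);
Datet=cellfun(@(x) {strrep(x,' ','') },Datet);
dataal2=cell(size(dataall));
dataal2(1,:)=dataall(1,:);

% For each date, average the positive, confidence and sentiment values of that day's rows
for i= 1: size(data,1)
    ind =strcmp(Datet,data{i,1});
    dataal2{i+1,1}=data{i,1};
    dataal2{i+1,2}=mean(cell2mat(dataall(ind,2)));
    dataal2{i+1,3}=mean(cell2mat(dataall(ind,3)));
    dataal2{i+1,4}=mean(cell2mat(dataall(ind,4)));
end

% Write the per-day averages to the sheet named '合并分析' ("merged analysis")
xlswrite('sentiment.xlsx',dataal2,'合并分析');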

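IV. Plotting and comparison, which can be done in Excel
Stock prices can be pasted directly from 大智慧 (Dazhihui) or Wind; the closing price is enough. I did the plotting and comparison in Excel. A tool is a tool, never mind whether it looks low-end.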

Comparing the comment analysis results with the stock's daily return shows a certain similarity between the two.
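
To put a number on that similarity rather than judging it by eye, one option is to compute the daily return from the closing prices and correlate it with the daily sentiment average. The sketch below assumes simple returns (today's close divided by yesterday's close, minus one) and a Pearson correlation; the prices and sentiment values are invented, not the article's data.

```python
import math


def daily_returns(closes):
    """Simple return for each day after the first: close[t] / close[t-1] - 1."""
    return [(closes[i] - closes[i - 1]) / closes[i - 1]
            for i in range(1, len(closes))]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / float(n), sum(ys) / float(n)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


closes = [100.0, 110.0, 99.0, 108.9]   # invented closing prices for four days
sentiment = [0.8, 0.3, 0.7]            # invented daily averages for days two to four
returns = daily_returns(closes)
print(returns, pearson(returns, sentiment))
```

A coefficient near 1 or -1 would indicate a strong linear relationship and a value near 0 none; with only a handful of days the number means little, so treat it as a rough check.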

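The reference source is listed below; the original author covers the crawler very thoroughly.
I followed it step by step for the crawler part, and I recommend it to anyone who runs into other errors.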
https://blog.csdn.net/lbship/article/details/79721480

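Finally: beginners need a lot of patience,
otherwise version problems, package problems and changes in the comment page structure will throw one error after another.
It took me from 14:00 to 22:00 on 2019-03-21 to get the crawler part working.
The Baidu API took 2 hours on the morning of 2019-03-22.
The MATLAB data processing and the Excel charts were not finished until the afternoon.
As I write this it is 18:40 on 2019-03-22, and everything is finally done.
That is close to two days, even though I know the MATLAB functions used here well, have some Python basics and a rough idea of what HTML is, feel that I really made an effort, and had not been given many other tedious tasks by my boss.
So, good luck to you.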

Packaged code download: https://download.csdn.net/download/zhyl4669/11050153