Scraping A-share Market Indices and Analyzing Their Correlation

Article link: https://blog.csdn.net/weixin_42255757/article/details/133945376

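The market has performed really badly lately, so I decided to use some spare time to scrape the market indices and analyze their correlation, to see whether the Shanghai and Shenzhen indices are related in some way. There are many ways to get index data: third-party sources such as the Sina Finance or Tencent Finance interfaces, or the APIs offered by platforms like JoinQuant (聚宽). This time I'll start by scraping a web page.

Python is well suited to scraping data from web pages, and the simplest approach is to use the requests package.

First, find a web page with historical data for the Shanghai Composite Index. I found one on Sohu: Shanghai Composite Index (000001) - Historical Quotes - Stock Quotes Center - Sohu Securities (sohu.com)

Using the browser's network inspection tool, you can see the full URL of the query request: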

https://q.stock.sohu.com/hisHq?code=zs_000001&start=20230616&end=20231017&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.5518942061888761&0.21127149858625272

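For my purposes, only three parameters need to change: code, start and end, which are the index code and the start and end dates of the query. In the program I just set them to the values I need. The code is as follows: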

# Fetch historical index data

import requests

def get_index_history(index_code, start, end):
    """Fetch historical index data from Sohu and return the JSON part as a string."""
    url = 'https://q.stock.sohu.com/hisHq?code=zs_{}&start={}&end={}&stat=1&order=D&period=d&callback=historySearchHandler&rt=jsonp&r=0.7795196987720288&0.6939581161974786'.format(
        index_code, start, end
    )
    response = requests.get(url)
    response.encoding = 'gbk'
    # Strip the JSONP wrapper: first the callback prefix, then the
    # trailing whitespace and the closing ')'
    data = response.text.replace('historySearchHandler(', '')
    return data.strip().rstrip(')')

if __name__ == '__main__':
    data = get_index_history('000001', '20210615', '20231017')
    print(data)
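The query string can also be assembled from a dict rather than a long format string, which makes it easier to see which parameters change. The following is a minimal sketch, not the author's code: `build_history_url` and `BASE_URL` are hypothetical names, and it assumes the two random numbers at the end of the captured URL are only cache busters that the server does not need.

```python
from urllib.parse import urlencode

BASE_URL = 'https://q.stock.sohu.com/hisHq'


def build_history_url(index_code, start, end):
    """Build the query URL for one index and a YYYYMMDD date range."""
    params = {
        'code': 'zs_' + index_code,  # 'zs_' prefix as in the captured request
        'start': start,
        'end': end,
        'stat': 1,
        'order': 'D',
        'period': 'd',
        'callback': 'historySearchHandler',
        'rt': 'jsonp',
    }
    return BASE_URL + '?' + urlencode(params)


url = build_history_url('000001', '20210615', '20231017')
print(url)
```

The resulting string can be passed to `requests.get` in the same way as the formatted URL.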

As you can see, the data has been fetched. We strip the wrapper strings before and after the raw data and keep only the JSON part.
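Stripping the wrapper and parsing the result can be sketched as below. `parse_jsonp` is a hypothetical helper, and the sample payload is made up for illustration: its shape (a list holding one object with an `hq` list of rows) is an assumption about the response, not something shown in this post.

```python
import json


def parse_jsonp(text, callback='historySearchHandler'):
    """Strip a JSONP wrapper such as callback(...) and parse the JSON inside."""
    body = text.strip()
    prefix = callback + '('
    if body.startswith(prefix) and body.endswith(')'):
        body = body[len(prefix):-1]
    return json.loads(body)


# Made-up payload for illustration; not real market data.
sample = 'historySearchHandler([{"status": 0, "hq": [["2023-10-17", "1.0", "2.0"]]}])\n'
parsed = parse_jsonp(sample)
print(parsed[0]['hq'][0][0])
```

Checking for the prefix and the closing parenthesis explicitly avoids cutting off part of the JSON if the response ever arrives without the wrapper.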

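Next, I fetch the Shanghai Composite, the Shenzhen Component and the ChiNext index, convert each one to a pandas DataFrame, and use numpy's corrcoef function to compute the correlation between each pair of indices. The code is as follows: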

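import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def analysis():
    # convert() and convert_list() are not shown in the original post; from
    # their use here, convert() yields a DataFrame with a 'close' column and
    # convert_list() presumably a list of closing prices
    h_data = get_index_history('000001', '20210615', '20231017')  # Shanghai Composite
    df_h = convert(h_data)
    s_data = get_index_history('399001', '20210615', '20231017')  # Shenzhen Component
    df_s = convert(s_data)
    c_data = get_index_history('399006', '20210615', '20231017')  # ChiNext
    df_c = convert(c_data)
    # Correlation between the Shanghai Composite and the Shenzhen Component
    r_hs = np.corrcoef(df_h['close'].astype('float'), df_s['close'].astype('float'))
    print(r_hs)
    # Correlation between the Shanghai Composite and ChiNext
    r_hc = np.corrcoef(df_h['close'].astype('float'), df_c['close'].astype('float'))
    print(r_hc)
    # Correlation between the Shenzhen Component and ChiNext
    r_sc = np.corrcoef(df_s['close'].astype('float'), df_c['close'].astype('float'))
    print(r_sc)

    h_list = convert_list(h_data)
    s_list = convert_list(s_data)
    c_list = convert_list(c_data)
    df = pd.DataFrame({'h': h_list, 's': s_list, 'c': c_list})
    print(df)
    # Each column is one index, so pass rowvar=False; by default
    # np.corrcoef treats every row as a variable
    r2 = np.corrcoef(df, rowvar=False)
    print(r2)
    fig = sns.pairplot(df)
    fig.savefig('test.png', dpi=400)  # save the seaborn figure before showing it
    plt.show()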
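The analysis code calls `convert` and `convert_list`, which the post does not define. One possible sketch follows. Everything in it is an assumption: the response is taken to be a JSON list whose first element has an `hq` list of rows, and the column order in `HQ_COLUMNS` is a guess that should be checked against a real response. The sample rows are made up.

```python
import json

import pandas as pd

# Assumed column order of each row in 'hq'; verify against a real response.
HQ_COLUMNS = ['date', 'open', 'close', 'change', 'pct_change',
              'low', 'high', 'volume', 'amount', 'turnover']


def convert(data):
    """Turn the JSON string into a DataFrame with one row per trading day."""
    rows = json.loads(data)[0]['hq']
    return pd.DataFrame(rows, columns=HQ_COLUMNS)


def convert_list(data):
    """Return the closing prices as a list of floats."""
    return convert(data)['close'].astype(float).tolist()


# Made-up rows for illustration; not real market data.
sample = json.dumps([{'status': 0, 'hq': [
    ['2023-10-17', '10.0', '10.5', '0.5', '5.00%', '9.9', '10.6', '100', '1000', '-'],
    ['2023-10-16', '9.8', '10.0', '0.2', '2.04%', '9.7', '10.1', '90', '900', '-'],
]}])
print(convert_list(sample))
```

Note that `np.corrcoef` needs series of equal length, so this only works if all three indices return the same trading days for the requested range.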

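The correlation matrix for the Shanghai Composite and the Shenzhen Component is:

[[1.         0.93048987]
 [0.93048987 1.        ]]

So the correlation between the Shanghai and Shenzhen indices is fairly high.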

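The correlation matrix for the Shanghai Composite and the ChiNext index is:

[[1.        0.8779242]
 [0.8779242 1.       ]]

So the Shanghai Composite is less correlated with the ChiNext index than it is with the Shenzhen Component.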

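The correlation matrix for the Shenzhen Component and the ChiNext index is:

[[1.         0.98847916]
 [0.98847916 1.        ]]

This one is very high, close to 1, which means the Shenzhen Component and the ChiNext index are highly correlated. In plain language: they rise together and fall together.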
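To read the matrices above: for two series, `np.corrcoef` returns a 2x2 matrix whose diagonal is 1 (each series against itself) and whose two off-diagonal entries are the same Pearson correlation coefficient. The small sketch below checks this against the textbook formula, using made-up prices for two hypothetical indices rather than the real data.

```python
import numpy as np

# Made-up closing prices for two hypothetical indices; not real market data.
x = np.array([10.0, 10.5, 10.2, 10.8, 11.0])
y = np.array([20.0, 20.8, 20.5, 21.9, 22.0])

# Pearson r: covariance divided by the product of the standard deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

matrix = np.corrcoef(x, y)
print(matrix.shape)
print(np.isclose(matrix[0, 1], r_manual))
```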