Computing Stock Factors with Tushare

Juanxx

Article link: https://blog.csdn.net/Juanxx/article/details/117050135

Background
I. Getting the CSI 300 Constituents
II. Computing the Log of Market Capitalization
III. Computing Returns

Background

Author's Tushare ID: 414988

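I. Getting the CSI 300 Constituents

Basic approach:

Fetch all listed stocks
Fetch the CSI 300 stock codes
Use the CSI 300 stock codes to pick out the matching ts codes
Use the ts codes to fetch the CSI 300 stock list

1. Get all stock codes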

import tushare as ts
pro = ts.pro_api()  # assumes a token was saved earlier with ts.set_token(...)
stock_list = pro.stock_basic(exchange='',list_status='L', fields='ts_code,symbol,name,area,industry,market,exchange')
stock_list

	ts_code	symbol	name	area	industry	market	exchange
0	000001.SZ	000001	平安银行	深圳	银行	主板	SZSE
1	000002.SZ	000002	万科A	深圳	全国地产	主板	SZSE
2	000004.SZ	000004	国华网安	深圳	软件服务	主板	SZSE
3	000005.SZ	000005	世纪星源	深圳	环境保护	主板	SZSE
4	000006.SZ	000006	深振业A	深圳	区域地产	主板	SZSE
...	...	...	...	...	...	...	...
4231	688777.SH	688777	中控技术	浙江	软件服务	科创板	SSE
4232	688788.SH	688788	科思科技	深圳	通信设备	科创板	SSE
4233	688819.SH	688819	天能股份	浙江	电气设备	科创板	SSE
4234	688981.SH	688981	中芯国际	上海	半导体	科创板	SSE
4235	689009.SH	689009	九号公司-UWD	北京	专用机械	CDR	SSE
4236 rows × 7 columns

2. Get the CSI 300 constituent codes

hs300s = ts.get_hs300s() 
hs300s

	date	code	name	weight
0	2021-03-31	600000	浦发银行	0.68
1	2021-03-31	600004	白云机场	0.08
2	2021-03-31	600009	上海机场	0.29
3	2021-03-31	600010	包钢股份	0.19
4	2021-03-31	600011	华能国际	0.10
...	...	...	...	...
295	2021-03-31	300498	温氏股份	0.40
296	2021-03-31	300529	健帆生物	0.16
297	2021-03-31	300601	康泰生物	0.25
298	2021-03-31	300628	亿联网络	0.10
299	2021-03-31	300676	华大基因	0.13
300 rows × 4 columns

hs300_symbols = hs300s['code']
hs300_symbols

0      600000
1      600004
2      600009
3      600010
4      600011
        ...  
295    300498
296    300529
297    300601
298    300628
299    300676
Name: code, Length: 300, dtype: object

3. Use the CSI 300 stock codes to pick out the matching ts codes

# Collect the ts codes of all CSI 300 stocks
hs300_code_list = []

for i in range(stock_list.shape[0]):
    if stock_list.loc[i]['symbol'] in hs300_symbols.values:
        hs300_code_list.append(stock_list.loc[i]['ts_code'])
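
The row-by-row lookup above amounts to a membership filter on the `symbol` column. A minimal sketch of the same idea with pandas' `Series.isin`, run on made-up stand-in frames (the `*_demo` names and their rows are illustrative, not Tushare output):

```python
import pandas as pd

# Toy stand-in for stock_list (hypothetical rows, same column names as above)
stock_list_demo = pd.DataFrame({
    "ts_code": ["000001.SZ", "000004.SZ", "600000.SH"],
    "symbol": ["000001", "000004", "600000"],
})
# Toy stand-in for the constituent codes
hs300_symbols_demo = pd.Series(["600000", "000001"], name="code")

# Keep only the rows whose symbol appears among the constituent codes
mask = stock_list_demo["symbol"].isin(hs300_symbols_demo)
hs300_code_list_demo = stock_list_demo.loc[mask, "ts_code"].tolist()
print(hs300_code_list_demo)
```

The result keeps the row order of the stock table rather than the order of the constituent list, which matches what the loop produces.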

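4. Use the ts codes to fetch the CSI 300 stock list

Note: Tushare's pro.stock_basic endpoint can be called only 200 times per minute, so the data for the 300 stocks is read in batches, with a one-minute pause between batches.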

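import time
import pandas as pd

n_flag = 0
hs300_stock_list = pd.DataFrame()
for code in hs300_code_list:
    hs300_stock_list = pd.concat([hs300_stock_list, pro.stock_basic(ts_code=code)], axis=0)
    n_flag += 1  # count the calls made since the last pause
    if n_flag == 99:
        n_flag = 0
        time.sleep(60)  # wait a minute to stay under the rate limit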

# Rebuild the index
hs300_stock_list = hs300_stock_list.reset_index(drop=True)
hs300_stock_list
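
The pause-every-N-calls pattern in this loop recurs for each endpoint in this post, so it could be factored into a small helper. The sketch below is one possible shape and not part of Tushare: `fetch_in_batches`, `fake_fetch`, and the `sleep_fn` parameter are names invented here, and the fetch function is a stub so the example runs without network access.

```python
import time
import pandas as pd

def fetch_in_batches(codes, fetch, batch_size=100, pause_seconds=60, sleep_fn=time.sleep):
    """Call fetch(code) for every code, pausing after each full batch."""
    frames = []
    for n_calls, code in enumerate(codes, start=1):
        frames.append(fetch(code))
        # Pause after every batch_size calls, except after the final call
        if n_calls % batch_size == 0 and n_calls < len(codes):
            sleep_fn(pause_seconds)
    return pd.concat(frames, ignore_index=True)

def fake_fetch(code):
    # Hypothetical stub standing in for a call such as pro.stock_basic(ts_code=code)
    return pd.DataFrame({"ts_code": [code]})

pauses = []  # record the requested pauses instead of actually sleeping
demo = fetch_in_batches(["A", "B", "C", "D", "E"], fake_fetch,
                        batch_size=2, pause_seconds=60, sleep_fn=pauses.append)
print(len(demo), pauses)
```

Collecting the frames in a list and concatenating once also avoids re-copying the growing DataFrame on every iteration. Passing `sleep_fn` in is only a convenience for trying the helper without waiting; in real use the default `time.sleep` applies.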

# Merge the weights of the 300 stocks into the matching rows of hs300_stock_list
import numpy as np
hs300_weights = np.zeros(hs300s.shape[0])
for i in range(hs300s.shape[0]):
    index = hs300_stock_list[hs300_stock_list.symbol==hs300s.loc[i]['code']].index.tolist()[0]
    hs300_weights[index] = (hs300s.loc[i]['weight'])
    
hs300_weights = pd.DataFrame(hs300_weights)
hs300_weights.columns = ['weight']
hs300_stock_list = pd.concat([hs300_stock_list, hs300_weights['weight']], axis=1)
hs300_stock_list

	ts_code	symbol	name	area	industry	market	list_date	weight
0	000001.SZ	000001	平安银行	深圳	银行	主板	19910403	1.12
1	000002.SZ	000002	万科A	深圳	全国地产	主板	19910129	1.07
2	000063.SZ	000063	中兴通讯	深圳	通信设备	主板	19971118	0.42
3	000066.SZ	000066	中国长城	深圳	IT设备	主板	19970626	0.14
4	000069.SZ	000069	华侨城A	深圳	旅游景点	主板	19970910	0.22
...	...	...	...	...	...	...	...	...
295	603993.SH	603993	洛阳钼业	河南	小金属	主板	20121009	0.2
296	688008.SH	688008	澜起科技	上海	半导体	科创板	20190722	0.15
297	688009.SH	688009	中国通号	北京	运输设备	科创板	20190722	0.08
298	688012.SH	688012	中微公司	上海	专用机械	科创板	20190722	0.12
299	688036.SH	688036	传音控股	深圳	通信设备	科创板	20190930	0.18
300 rows × 8 columns
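
Attaching the weights can also be written as a key-based join, which does not depend on row positions. A minimal sketch with toy frames (the `*_demo` frames are hypothetical; the two weights are the ones shown for these stocks in the tables above), using `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-in for hs300_stock_list
stock_demo = pd.DataFrame({
    "ts_code": ["000001.SZ", "600000.SH"],
    "symbol": ["000001", "600000"],
})
# Toy stand-in for hs300s, deliberately in a different row order
weights_demo = pd.DataFrame({
    "code": ["600000", "000001"],
    "weight": [0.68, 1.12],
})

# Align the key name, then left-join so every stock row is kept
merged = stock_demo.merge(
    weights_demo.rename(columns={"code": "symbol"}),
    on="symbol",
    how="left",
)
print(merged)
```

With `how="left"`, a stock that is missing from the weight table ends up with `NaN` in `weight` instead of raising an error during the lookup.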

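II. Computing the Log of Market Capitalization

The pro.daily_basic endpoint likewise allows only 200 calls per minute.
The same approach applies: sleep for one minute inside the loop.
Here total_mv is the total market capitalization and circ_mv is the circulating (free-float) market capitalization; Tushare reports both in units of 10,000 CNY.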

n_flag = 0
indi_dic = pd.DataFrame()
for code in hs300_code_list:
    indi = pro.daily_basic(ts_code=code, start_date='20060101', end_date='20210401', fields='ts_code,trade_date,turnover_rate,total_mv,circ_mv')
    indi_dic = pd.concat([indi_dic, indi], axis=0)
    n_flag += 1  # count the calls made since the last pause
    if n_flag == 150:
        n_flag = 0
        time.sleep(60)  # wait a minute to stay under the rate limit

# Check whether all 300 stocks were collected
len(indi_dic['ts_code'].unique())
3

Compute the log of market capitalization with apply and a lambda

import math
indi_dic['totalmv_size_factor'] = indi_dic['total_mv'].apply(lambda x: math.log(x))
indi_dic['circmv_size_factor'] = indi_dic['circ_mv'].apply(lambda x: math.log(x))
# Rename the columns
indi_dic = indi_dic.rename(columns={'totalmv_size_factor':'ln_total_capital', 'circmv_size_factor':'ln_circ_capital'})
indi_dic

	ts_code	trade_date	turnover_rate	total_mv	circ_mv	ln_total_capital	ln_circ_capital
0	000001.SZ	20210401	0.2806	4.226609e+07	4.226573e+07	17.559496	17.559487
1	000001.SZ	20210331	0.4004	4.271243e+07	4.271207e+07	17.570000	17.569992
2	000001.SZ	20210330	0.3806	4.255718e+07	4.255682e+07	17.566359	17.566351
3	000001.SZ	20210329	0.4049	4.170332e+07	4.170297e+07	17.546091	17.546083
4	000001.SZ	20210326	0.4236	4.102411e+07	4.102376e+07	17.529671	17.529662
...	...	...	...	...	...	...	...
359	688036.SH	20191011	19.1273	3.805600e+06	3.404245e+05	15.151984	12.737949
360	688036.SH	20191010	18.9791	3.980000e+06	3.560252e+05	15.196792	12.782757
361	688036.SH	20191009	21.2596	3.842400e+06	3.437164e+05	15.161608	12.747572
362	688036.SH	20191008	32.4079	4.019200e+06	3.595318e+05	15.206593	12.792558
363	688036.SH	20190930	67.9496	4.624000e+06	4.136333e+05	15.346771	12.932735
787958 rows × 7 columns
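
As an alternative to `apply` with `math.log`, NumPy's `np.log` takes the natural log of a whole column at once, which tends to be faster on a frame of this size. A minimal sketch, using the first two rows of the table above as input (the `mv_demo` frame is an illustrative stand-in for `indi_dic`):

```python
import math
import numpy as np
import pandas as pd

mv_demo = pd.DataFrame({
    "total_mv": [4.226609e+07, 4.271243e+07],
    "circ_mv": [4.226573e+07, 4.271207e+07],
})

# Vectorized natural log of each column
mv_demo["ln_total_capital"] = np.log(mv_demo["total_mv"])
mv_demo["ln_circ_capital"] = np.log(mv_demo["circ_mv"])

# Cross-check against the element-wise math.log version
via_apply = mv_demo["total_mv"].apply(math.log)
print(np.allclose(mv_demo["ln_total_capital"], via_apply))
```

Writing the result directly into the final column names also makes the later rename step unnecessary.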

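III. Computing Returns

Fetch the monthly bar data. This endpoint allows at most 120 calls per minute, so the loop sleeps partway through.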

n_flag = 0
monthly_line = pd.DataFrame()
for code in hs300_code_list:
    mon = pro.monthly(ts_code=code, start_date='20060101', end_date='20210331', fields='ts_code,trade_date,open,close')
    monthly_line = pd.concat([monthly_line, mon], axis=0)
    n_flag += 1  # count the calls made since the last pause
    if n_flag == 99:
        n_flag = 0
        time.sleep(60)  # wait a minute to stay under the rate limit
monthly_line

	ts_code	trade_date	close	open
0	000001.SZ	20210331	22.01	21.54
1	000001.SZ	20210226	21.38	23.00
2	000001.SZ	20210129	23.09	19.10
3	000001.SZ	20201231	19.34	19.70
4	000001.SZ	20201130	19.74	17.65
...	...	...	...	...
8	300015.SZ	20200731	45.30	43.90
9	300015.SZ	20200630	43.45	39.20
10	300015.SZ	20200529	39.16	44.31
11	300015.SZ	20200430	44.39	39.25
12	300015.SZ	20200331	39.38	40.55
1300 rows × 4 columns

Set the date as the index

monthly_line0 = monthly_line
monthly_line0 = monthly_line0.set_index('trade_date')
monthly_line0.index = pd.to_datetime(monthly_line0.index,format='%Y%m%d').to_period('M')
monthly_line0

	ts_code	close	open
trade_date			
2021-03	000001.SZ	22.01	21.54
2021-02	000001.SZ	21.38	23.00
2021-01	000001.SZ	23.09	19.10
2020-12	000001.SZ	19.34	19.70
2020-11	000001.SZ	19.74	17.65
...	...	...	...
2020-07	688036.SH	100.10	70.00
2020-06	688036.SH	71.00	50.66
2020-05	688036.SH	50.00	53.50
2020-04	688036.SH	54.50	42.38
2020-03	688036.SH	42.81	58.60
3893 rows × 3 columns
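
The index conversion above turns each `YYYYMMDD` trading-date string into a monthly period, so that every row of a given month carries the same label regardless of which day was the last trading day. A minimal sketch on three of the dates shown earlier (`dates_demo` and `periods_demo` are illustrative names):

```python
import pandas as pd

dates_demo = pd.Index(["20210331", "20210226", "20210129"], name="trade_date")

# Parse the strings as dates, then keep only the year-month part
periods_demo = pd.to_datetime(dates_demo, format="%Y%m%d").to_period("M")
print(list(periods_demo.astype(str)))
```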

Group with groupby

price_month = (monthly_line0.groupby(['trade_date', 'ts_code']).sum())
price_month

		close	open
trade_date	ts_code		
2020-03	000001.SZ	12.80	14.55
000002.SZ	25.65	29.90
000063.SZ	42.80	52.00
000066.SZ	11.94	14.13
000069.SZ	6.39	6.59
...	...	...	...
2021-03	603993.SH	5.28	6.72
688008.SH	61.20	75.00
688009.SH	5.67	5.93
688012.SH	107.51	126.07
688036.SH	209.38	200.57
3893 rows × 2 columns

Compute the returns

price_month['return'] = (price_month['close'] - price_month['open']) / price_month['open']
price_month

		close	open	return
trade_date	ts_code			
2020-03	000001.SZ	12.80	14.55	-0.120275
000002.SZ	25.65	29.90	-0.142140
000063.SZ	42.80	52.00	-0.176923
000066.SZ	11.94	14.13	-0.154989
000069.SZ	6.39	6.59	-0.030349
...	...	...	...	...
2021-03	603993.SH	5.28	6.72	-0.214286
688008.SH	61.20	75.00	-0.184000
688009.SH	5.67	5.93	-0.043845
688012.SH	107.51	126.07	-0.147220
688036.SH	209.38	200.57	0.043925
3893 rows × 3 columns
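
The steps in this section can be sketched end to end on a few rows. The example below uses three monthly bars taken from the tables above (the `bars_demo` and `price_demo` names are illustrative) and assumes, as in the data shown, that there is exactly one bar per stock per month, so indexing by (month, stock) gives the same rows as the `groupby(...).sum()` call:

```python
import pandas as pd

bars_demo = pd.DataFrame({
    "ts_code": ["000001.SZ", "000001.SZ", "300015.SZ"],
    "trade_date": ["20210331", "20210226", "20200731"],
    "open": [21.54, 23.00, 43.90],
    "close": [22.01, 21.38, 45.30],
})
# Convert the trading date to a monthly period
bars_demo["month"] = pd.to_datetime(bars_demo["trade_date"], format="%Y%m%d").dt.to_period("M")

# One row per (month, stock), sorted by month and then by stock code
price_demo = bars_demo.set_index(["month", "ts_code"])[["open", "close"]].sort_index()

# Open-to-close return within each month
price_demo["return"] = (price_demo["close"] - price_demo["open"]) / price_demo["open"]
print(price_demo.round(6))
```

Note that this is the return from the month's open to the month's close. A close-to-close monthly return, computed from consecutive month-end closes, would be a different quantity because it also includes any gap between one month's close and the next month's open.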