AlphaNet: Building a Neural Network for Stock Factor Mining — Huatai Securities AI Series No. 32
Overview of the research report
I have recently been mining price-volume factors while interning at a private fund. Coming up with new factors is painful, so I turned to research reports and Zhihu for inspiration, and found this report. Although it is hard to run on my firm's platform, the idea is very interesting. In short, it treats a stock's price-volume data as a series of "data pictures": borrowing from convolutional neural networks in image recognition, it extracts features from each picture with custom operators and feeds them into a fully connected layer (v1) or an LSTM/GRU layer (v2). The overall pipeline is shown below.
The input data is structured as follows, where Label is the return over the interval from t to t+m.
In my view, this report does not involve particularly deep learning theory; the main difficulty lies in implementing the feature extraction layer, for reasons explained below. Huatai applied similar custom operators in the earlier report "Stock Selection Factor Mining Based on Genetic Programming", where the approach was to modify the gplearn source code directly. Being of limited ability, I had to find another way.
Custom feature extraction layer: the custom operators
The feature extraction layer consists mainly of functions such as ts_corr, ts_cov, ts_std, and ts_mean, whose definitions are shown in the figure below.
The layer's key property is that a stock's price-volume rows need not follow the ordering of Figure 6 (open, high, low, ...); any arrangement works. The reason is illustrated below, taking ts_corr(X, Y, d) as the example.
1. ts_corr(X, Y, d)
Let d = 3. The figure below shows how the ts_corr(X, Y, 3) layer works. It traverses the two-dimensional data along both the time dimension and the feature dimension. As in a CNN, the stride is a tunable parameter; with stride = 1, for example, the next computation steps one position to the right along the time axis. The traversal along the feature dimension is where it departs from CNN convolution: a CNN kernel only perceives a local patch, whereas ts_corr(X, Y, 3) visits every pair of rows, and the two rows in a pair need not be adjacent. In Figure 8, for instance, it performs $\binom{9}{2} = 36$ computations. This avoids the data-ordering problem caused by the CNN's local receptive field and lets the layer extract features from the data fully. The result of ts_corr(X, Y, 3) is a two-dimensional "feature picture", which can be flattened and fed straight into a fully connected network, or subjected to further feature extraction or pooling. Further extraction allows operators to be nested, e.g. ts_corr(ts_corr(X, Y, 3), ts_corr(Z, W, 3), 3). ts_cov(X, Y, d) works in the same way and is not detailed separately.
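As a quick sanity check of the pair count above, the Python standard library can enumerate the same combinations (a throwaway sketch of mine, not from the report):

```python
from itertools import combinations
from math import comb

# All unordered row pairs that ts_corr(X, Y, 3) would visit on a 9-row data picture
pairs = list(combinations(range(9), 2))
print(len(pairs))   # 36
print(comb(9, 2))   # 36
```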
2. ts_stddev(X, d)
Let d = 3. Figure 9 shows how the ts_stddev(X, 3) layer works. Its mechanism is simple, resembling a 1×3 convolution in a CNN. Other layers such as ts_zscore(X, d) and ts_return(X, d) behave like ts_stddev(X, d) and are not detailed here.
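To make the 1×3 analogy concrete, here is a minimal sketch of the operator on a single row; it assumes non-overlapping windows with stride equal to d, as in the implementation later in this post, and the `ts_stddev` helper name here is mine:

```python
import numpy as np

def ts_stddev(x, d):
    """Std over consecutive non-overlapping windows of length d (a 1xd 'kernel').
    Sketch only: assumes len(x) is divisible by d; population std like np.std."""
    x = np.asarray(x, dtype=float)
    return x.reshape(-1, d).std(axis=1)   # one std per window

x = np.arange(6, dtype=float)             # windows [0,1,2] and [3,4,5]
print(ts_stddev(x, 3))                    # same spread in both windows
```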
Implementing the custom operators of the feature extraction layer
Since ts_corr(X, Y, d) and ts_cov(X, Y, d) require two distinct rows X and Y, and there are $\binom{9}{2} = 36$ ways to choose them, the first thing to build is a generator of index pairs. This can be done with brute-force nested loops or with recursion. Loops are simple, but choosing 2 out of 9 columns needs two nested loops, choosing 3 needs three, and so on, which scales poorly.
1. Generating index pairs recursively
'''Brute-force loop version'''
def generateC(n):
    v = []
    for i in range(n):
        for j in range(n):
            if i < j:   # i < j already implies i != j
                v.append([i+1, j+1])
    return v

'''Recursive version'''
def generateC(l1):
    '''Index-pair generator; l1 should be a list, e.g. for all C(5,2) pairs use l1 = [0, 1, 2, 3, 4]'''
    if len(l1) == 1:
        return []
    v = [[l1[0], i] for i in l1[1:]]
    l1 = l1[1:]
    return v + generateC(l1)
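The recursive generator can be cross-checked against itertools.combinations from the standard library, which yields the same pairs in the same order:

```python
from itertools import combinations

def generateC(l1):
    if len(l1) == 1:
        return []
    v = [[l1[0], i] for i in l1[1:]]
    return v + generateC(l1[1:])

# Same pairs, same order, as the standard-library enumeration
assert generateC(list(range(5))) == [list(p) for p in combinations(range(5), 2)]
print(len(generateC(list(range(9)))))   # 36
```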
2. ts_cov(X, Y, d) for a single stock's "data picture"
With the pair generator above, we can pick different combinations and compute their covariances. For ordinary two-dimensional data this is easy with brute-force loops, as below. Since this function was written only as a test, stride defaults to 10 and divisibility is not handled.
import numpy as np
import torch

def covar(X, Y, d, stride):
    # NB: the window length below actually uses `stride`, not `d` (test code only)
    size = (len(X) - d)/stride + 1
    v = []
    for i in range(int(size)):
        x = X[i*stride:i*stride+stride]
        y = Y[i*stride:i*stride+stride]
        cov = np.cov(x, y)[0][1]
        v.append(cov)
    return v

def ts_covar(inputs, d, num, stride=10):
    '''num: list of index pairs'''
    v = []
    for n in num:
        c = covar(inputs[n[0]-1], inputs[n[1]-1], d, stride)
        v.append(c)
    s = np.array(v).reshape(len(num), -1)
    return torch.from_numpy(s)
This function can extract a single stock's features, but as in a CNN the input here is four-dimensional, [batch size, channels, height, width], and iterating stock by stock is far too slow to be practical. Is there a way to compute ts_cov in parallel over every sample in the batch? I was stuck on this for days until, one day on the bus, it suddenly clicked. A convolutional network can convolve all channels of every sample at once because it materializes a kernel tensor of shape [out_channels, in_channels, kernel_size, kernel_size]. For example:
import torch
import torch.nn as nn

input_data = torch.randn(10, 3, 28, 28)  # generate input data
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=(4, 4), bias=False)  # define the convolution layer
output = conv2d(input_data)
print(conv2d.weight.size())
print(output.size())
The output is:
torch.Size([16, 3, 4, 4])
torch.Size([10, 16, 25, 25])
3. Computing covariance in parallel, "convolution"-style
So can we build something like a bank of kernels when computing covariance? We can. Start from how covariance is computed:
$$\mathrm{Cov}(X,Y)=E\left[(X-\overline{X})(Y-\overline{Y})\right]=\sum_{i=1}^{N}\frac{(X_i-\overline{X})(Y_i-\overline{Y})}{N-1}$$
The rightmost expression looks a bit like a convolution. As an example, suppose we have the data:
$$\begin{bmatrix} x_1 & x_2 & x_3 & \cdots & x_m\\ y_1 & y_2 & y_3 & \cdots & y_m \end{bmatrix}$$
Compute the mean $\overline{x}$ of the $x$ row and the mean $\overline{y}$ of the $y$ row, then subtract the corresponding mean from each row, giving:
$$\begin{bmatrix} x_1-\overline{x} & x_2-\overline{x} & x_3-\overline{x} & \cdots & x_m-\overline{x}\\ y_1-\overline{y} & y_2-\overline{y} & y_3-\overline{y} & \cdots & y_m-\overline{y} \end{bmatrix}$$
Swap the two rows of the resulting matrix:
$$\begin{bmatrix} y_1-\overline{y} & y_2-\overline{y} & y_3-\overline{y} & \cdots & y_m-\overline{y}\\ x_1-\overline{x} & x_2-\overline{x} & x_3-\overline{x} & \cdots & x_m-\overline{x} \end{bmatrix}$$
Multiplying the two matrices element-wise, then summing and dividing by m−1, yields the covariance:
$$\begin{bmatrix} x_1-\overline{x} & x_2-\overline{x} & x_3-\overline{x} & \cdots & x_m-\overline{x}\\ y_1-\overline{y} & y_2-\overline{y} & y_3-\overline{y} & \cdots & y_m-\overline{y} \end{bmatrix} \times \begin{bmatrix} y_1-\overline{y} & y_2-\overline{y} & y_3-\overline{y} & \cdots & y_m-\overline{y}\\ x_1-\overline{x} & x_2-\overline{x} & x_3-\overline{x} & \cdots & x_m-\overline{x} \end{bmatrix}$$
For example:
import numpy as np

data = np.random.randn(3, 1, 2, 10)                  # generate data
mean = data.mean(axis=3, keepdims=True)              # keepdims must be True
x1 = data[:, :, [0, 1], :] - mean
y1 = data[:, :, [1, 0], :] - mean                    # row-swapped copy
coef = (x1*y1).sum(axis=3, keepdims=True)/(10 - 1)
print(coef)
The output is:
[[[[-0.40902133]
[-0.40902133]]]
[[[-0.27561585]
[-0.27561585]]]
[[[ 0.38524211]
[ 0.38524211]]]]
Verify with np.cov:
test_data = data[0].reshape(2,10)
print(np.cov(test_data))
The output is:
[[ 1.27875212 -0.40902133]
[-0.40902133 0.64090397]]
The results match.
This is where the index pairs from step one come into play. First we enumerate all $\binom{9}{2}$ pairs, then build the list of swapped pairs (e.g. [0, 1] becomes [1, 0]).
data = np.random.uniform(10, 100, (2000, 1, 9, 30))

def generateC(l1):
    if len(l1) == 1:
        return []
    v = [[l1[0], i] for i in l1[1:]]
    l1 = l1[1:]
    return v + generateC(l1)

feat_nums = data.shape[2]
list1 = list(range(feat_nums))  # feat_nums = 9 features
num = generateC(list1)
num_rev = []  # swapped pairs. NB: don't use [l.reverse() for l in num] -- reverse() mutates the original list in place and returns None
for l in num:
    l1 = l.copy()
    l1.reverse()
    num_rev.append(l1)
print(num, '\n', num_rev)
The output is:
[[0, 1], [0, 2], [0, 3], [0, 4], [0, 5], [0, 6], [0, 7], [0, 8], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [1, 7], [1, 8], [2, 3], [2, 4], [2, 5], [2, 6], [2, 7], [2, 8], [3, 4], [3, 5], [3, 6], [3, 7], [3, 8], [4, 5], [4, 6], [4, 7], [4, 8], [5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8]]
[[1, 0], [2, 0], [3, 0], [4, 0], [5, 0], [6, 0], [7, 0], [8, 0], [2, 1], [3, 1], [4, 1], [5, 1], [6, 1], [7, 1], [8, 1], [3, 2], [4, 2], [5, 2], [6, 2], [7, 2], [8, 2], [4, 3], [5, 3], [6, 3], [7, 3], [8, 3], [5, 4], [6, 4], [7, 4], [8, 4], [6, 5], [7, 5], [8, 5], [7, 6], [8, 6], [8, 7]]
Note that the two lists must correspond one to one: [a, b] must line up with [b, a].
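One way to guarantee that alignment is to build both lists in a single pass; slicing with p[::-1] creates a new reversed list, sidestepping the in-place list.reverse() pitfall (this is my own variant, not the report's code):

```python
from itertools import combinations

num = [list(p) for p in combinations(range(9), 2)]
num_rev = [p[::-1] for p in num]   # slicing copies, so num is left untouched

assert all(a == b[::-1] for a, b in zip(num, num_rev))
print(num[:3], num_rev[:3])   # [[0, 1], [0, 2], [0, 3]] [[1, 0], [2, 0], [3, 0]]
```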
With this preparation we can operate on the input data. Here I want to highlight how numpy indexing behaves. Given a 4-dimensional array of shape [N, C, H, W]: array[:, :, 0, :] has shape [N, C, W] and takes the first row of the third dimension; array[:, :, [0, 1], :] takes the first and second rows simultaneously and has shape [N, C, 2, W]; and indexing with a nested list such as list = [[0, 1], [0, 2]] yields shape [N, C, 2, 2, W], where the extra dimension of size 2 comes from the shape of the index list. Code:
import numpy as np

data = np.random.randn(300, 4, 9, 30)   # generate data
num = generateC(list(range(9)))         # generate the list of index pairs
data_cut = data[:, :, num, :]
print(data_cut.shape)
The output is:
(300, 4, 36, 2, 30)
Given the above, the reader can probably see why we build a swapped pair list:
numpy returns slices in the order you index them. array[:, :, [0, 1], :] returns row 0 on top and row 1 below, while array[:, :, [1, 0], :] returns row 1 on top and row 0 below. Code:
import numpy as np
arr = np.arange(64).reshape(2,1,4,8)
print(arr[:,:,[0,1],:])
print(arr[:,:,[1,0],:])
The output is:
array([[[[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15]]],
[[[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47]]]])
array([[[[ 8, 9, 10, 11, 12, 13, 14, 15],
[ 0, 1, 2, 3, 4, 5, 6, 7]]],
[[[40, 41, 42, 43, 44, 45, 46, 47],
[32, 33, 34, 35, 36, 37, 38, 39]]]])
With this groundwork we can write the ts_cov(X, Y, d) function:
def ts_cov4d(data, num, num_rev, stride):
    '''Covariance for 4-dimensional data.
    data: [N, C, H, W], W: price series length, N: batch size
    num: list of index pairs, num_rev: the swapped list'''
    if len(data.shape) != 4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    conv_feat = len(num)
    # Window boundaries: a remainder of at most 5 days is merged into the
    # last window, a longer remainder becomes its own window
    if data_length % stride == 0:
        step_list = list(range(0, data_length + stride, stride))
    elif data_length % stride <= 5:
        step_list = list(range(0, data_length - stride, stride)) + [data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
    l = []
    # keepdims=True must be kept throughout the computation
    for i in range(len(step_list) - 1):
        start = step_list[i]
        end = step_list[i + 1]
        sub_data1 = data[:, :, num, start:end]
        sub_data2 = data[:, :, num_rev, start:end]
        mean1 = sub_data1.mean(axis=4, keepdims=True)
        mean2 = sub_data2.mean(axis=4, keepdims=True)
        spread1 = sub_data1 - mean1
        spread2 = sub_data2 - mean2
        cov = ((spread1 * spread2).sum(axis=4, keepdims=True) / (sub_data1.shape[4] - 1)).mean(axis=3, keepdims=True)
        l.append(cov)
    cov = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, conv_feat, len(step_list) - 1)
    return torch.from_numpy(cov)
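The window-boundary branching inside ts_cov4d is reused by every operator below, so it helps to see it in isolation. The following standalone helper (the `make_step_list` name and `min_tail` parameter are mine, for illustration only) reproduces that logic:

```python
def make_step_list(data_length, stride, min_tail=5):
    """Window boundaries as in ts_cov4d: full windows of length `stride`;
    a remainder of at most `min_tail` days is merged into the last window,
    a longer remainder becomes its own (short) window."""
    mod = data_length % stride
    if mod == 0:
        return list(range(0, data_length + stride, stride))
    if mod <= min_tail:
        return list(range(0, data_length - stride, stride)) + [data_length]
    return list(range(0, data_length + stride - mod, stride)) + [data_length]

print(make_step_list(30, 10))   # [0, 10, 20, 30]
print(make_step_list(33, 10))   # remainder 3 merged:  [0, 10, 20, 33]
print(make_step_list(37, 10))   # remainder 7 kept:    [0, 10, 20, 30, 37]
```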
Timing the code:
import time

data = np.random.uniform(10, 100, (2000, 1, 9, 30))  # simulate data pictures for 2000 stocks
start = time.time()
ts_cov = ts_cov4d(data, num, num_rev, 10)
end = time.time()
print(end - start)
print(ts_cov.shape)
The output is:
0.16455984115600586
torch.Size([2000, 1, 36, 3])
Following the same analysis we can define the remaining operators, whose computation mirrors ts_cov4d. The complete code for the feature extraction layer is:
def ts_cov4d(data, num, num_rev, stride):
    '''Covariance for 4-dimensional data.
    data: [N, C, H, W], W: price series length, N: batch size
    num: list of index pairs, num_rev: the swapped list'''
    if len(data.shape) != 4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    conv_feat = len(num)
    # Window boundaries: a remainder of at most 5 days is merged into the
    # last window, a longer remainder becomes its own window
    if data_length % stride == 0:
        step_list = list(range(0, data_length + stride, stride))
    elif data_length % stride <= 5:
        step_list = list(range(0, data_length - stride, stride)) + [data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
    l = []
    for i in range(len(step_list) - 1):
        start = step_list[i]
        end = step_list[i + 1]
        sub_data1 = data[:, :, num, start:end]
        sub_data2 = data[:, :, num_rev, start:end]
        mean1 = sub_data1.mean(axis=4, keepdims=True)
        mean2 = sub_data2.mean(axis=4, keepdims=True)
        spread1 = sub_data1 - mean1
        spread2 = sub_data2 - mean2
        cov = ((spread1 * spread2).sum(axis=4, keepdims=True) / (sub_data1.shape[4] - 1)).mean(axis=3, keepdims=True)
        l.append(cov)
    cov = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, conv_feat, len(step_list) - 1)
    return torch.from_numpy(cov)

def ts_corr4d(data, num, num_rev, stride):
    '''Correlation: ts_cov4d divided by the product of the standard deviations'''
    if len(data.shape) != 4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    conv_feat = len(num)
    if data_length % stride == 0:
        step_list = list(range(0, data_length + stride, stride))
    elif data_length % stride <= 5:
        step_list = list(range(0, data_length - stride, stride)) + [data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
    l = []
    for i in range(len(step_list) - 1):
        start = step_list[i]
        end = step_list[i + 1]
        sub_data1 = data[:, :, num, start:end]
        sub_data2 = data[:, :, num_rev, start:end]
        std1 = sub_data1.std(axis=4, keepdims=True)
        std2 = sub_data2.std(axis=4, keepdims=True)
        std = (std1 * std2).mean(axis=3, keepdims=True)
        l.append(std)
    std = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, conv_feat, len(step_list) - 1)
    cov = ts_cov4d(data, num, num_rev, stride)
    # np.std divides by N (population) while the covariance divides by N-1,
    # so rescale by (N-1)/N; sub_data1 still holds the last window here
    fct = (sub_data1.shape[4] - 1) / sub_data1.shape[4]
    return (cov / torch.from_numpy(std)) * fct

def ts_stddev4d(data, stride):
    '''Rolling standard deviation of each feature row'''
    if len(data.shape) != 4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    if data_length % stride == 0:
        step_list = list(range(0, data_length + stride, stride))
    elif data_length % stride <= 5:
        step_list = list(range(0, data_length - stride, stride)) + [data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
    l = []
    for i in range(len(step_list) - 1):
        start = step_list[i]
        end = step_list[i + 1]
        sub_data1 = data[:, :, :, start:end]
        std1 = sub_data1.std(axis=3, keepdims=True)
        l.append(std1)
    std = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
    return torch.from_numpy(std)

def ts_zscore(data, stride):
    '''Rolling mean divided by rolling standard deviation'''
    if len(data.shape) != 4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    if data_length % stride == 0:
        step_list = list(range(0, data_length + stride, stride))
    elif data_length % stride <= 5:
        step_list = list(range(0, data_length - stride, stride)) + [data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
    l = []
    for i in range(len(step_list) - 1):
        start = step_list[i]
        end = step_list[i + 1]
        sub_data1 = data[:, :, :, start:end]
        mean = sub_data1.mean(axis=3, keepdims=True)
        std = sub_data1.std(axis=3, keepdims=True)
        z_score = mean / std
        l.append(z_score)
    z_score = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
    return torch.from_numpy(z_score)

def ts_return(data, stride):
    '''Return over each window: last price over first price, minus 1'''
    if len(data.shape) != 4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    if data_length % stride == 0:
        step_list = list(range(0, data_length + stride, stride))
    elif data_length % stride <= 5:
        step_list = list(range(0, data_length - stride, stride)) + [data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
    l = []
    for i in range(len(step_list) - 1):
        start = step_list[i]
        end = step_list[i + 1]
        sub_data1 = data[:, :, :, start:end]
        ret = sub_data1[:, :, :, -1] / sub_data1[:, :, :, 0] - 1
        l.append(ret)
    ret_data = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
    return torch.from_numpy(ret_data)

def ts_decaylinear(data, stride):
    '''Linearly-weighted average: more recent days get larger weights'''
    if len(data.shape) != 4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    if data_length % stride == 0:
        step_list = list(range(0, data_length + stride, stride))
    elif data_length % stride <= 5:
        step_list = list(range(0, data_length - stride, stride)) + [data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
    l = []
    for i in range(len(step_list) - 1):
        start = step_list[i]
        end = step_list[i + 1]
        time_spread = end - start
        weight = np.arange(1, time_spread + 1)
        weight = weight / weight.sum()
        sub_data1 = (data[:, :, :, start:end] * weight).sum(axis=3, keepdims=True)
        l.append(sub_data1)
    decay_data = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
    return torch.from_numpy(decay_data)

def ts_pool(data, stride, method):
    '''Pooling over time windows; method is 'max', 'min' or 'mean' '''
    if type(data) == torch.Tensor:
        data = data.detach().numpy()
    if len(data.shape) != 4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    if data_length <= stride:
        step_list = [0, data_length]  # a single window covering everything
    elif data_length % stride == 0:
        step_list = list(range(0, data_length + stride, stride))
    elif data_length % stride <= 5:
        step_list = list(range(0, data_length - stride, stride)) + [data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
    l = []
    for i in range(len(step_list) - 1):
        start = step_list[i]
        end = step_list[i + 1]
        if method == 'max':
            sub_data1 = data[:, :, :, start:end].max(axis=3, keepdims=True)
        if method == 'min':
            sub_data1 = data[:, :, :, start:end].min(axis=3, keepdims=True)
        if method == 'mean':
            sub_data1 = data[:, :, :, start:end].mean(axis=3, keepdims=True)
        l.append(sub_data1)
    try:
        pool_data = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
    except ValueError:
        # with a single window np.squeeze drops the time axis, so skip the transpose
        pool_data = np.squeeze(np.array(l)).reshape(-1, 1, feat_num, len(step_list) - 1)
    return torch.from_numpy(pool_data)
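Before wiring these operators into a network, the batched pair-indexing trick at the heart of ts_cov4d can be checked once more against np.cov. This is a self-contained sketch of mine (seed and shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(10, 100, (4, 1, 9, 10))           # 4 stocks, 9 features, 10 days
num = [[a, b] for a in range(9) for b in range(a + 1, 9)]
num_rev = [p[::-1] for p in num]

s1 = data[:, :, num, :]                              # shape (4, 1, 36, 2, 10)
s2 = data[:, :, num_rev, :]
spread1 = s1 - s1.mean(axis=4, keepdims=True)
spread2 = s2 - s2.mean(axis=4, keepdims=True)
cov = (spread1 * spread2).sum(axis=4) / (10 - 1)     # shape (4, 1, 36, 2)

a, b = num[5]
expected = np.cov(data[0, 0, a], data[0, 0, b])[0, 1]
assert np.allclose(cov[0, 0, 5], expected)           # both rows hold cov(a, b)
print("pair-indexed covariance matches np.cov")
```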
Building AlphaNet
A simple outline of AlphaNet is as follows.
We first walk through one forward pass using torch.nn built-ins such as nn.Linear() and nn.BatchNorm2d(), and then wrap everything up as the AlphaNet class.
Feature extraction
# Feature extraction
data = np.random.uniform(10, 100, (2000, 1, 9, 30))  # simulate data pictures for 2000 stocks
batch = nn.BatchNorm2d(1)  # batch normalization layer
conv1 = ts_cov4d(data,num,num_rev,10).to(torch.float)
bc1 = batch(conv1)
conv2 = ts_corr4d(data,num,num_rev,10).to(torch.float)
bc2 = batch(conv2)
conv3 = ts_stddev4d(data,10).to(torch.float)
bc3 = batch(conv3)
conv4 = ts_decaylinear(data,10).to(torch.float)
bc4 = batch(conv4)
conv5 = ts_zscore(data,10).to(torch.float)
bc5 = batch(conv5)
conv6 = ts_return(data,10).to(torch.float)
bc6 = batch(conv6)
# Feature concatenation
feat_cat = torch.cat([bc1,bc2,bc3,bc4,bc5,bc6],axis = 2)
print(bc1.size())
print(bc2.size())
print(bc3.size())
print(bc4.size())
print(bc5.size())
print(bc6.size())
print(feat_cat.size())
The output is:
torch.Size([2000, 1, 36, 3])
torch.Size([2000, 1, 36, 3])
torch.Size([2000, 1, 9, 3])
torch.Size([2000, 1, 9, 3])
torch.Size([2000, 1, 9, 3])
torch.Size([2000, 1, 9, 3])
torch.Size([2000, 1, 108, 3])
Pooling layer
ts_max = ts_pool(feat_cat, 3, method='max')
ts_max = batch(ts_max)
ts_min = ts_pool(feat_cat, 3, method='min')
ts_min = batch(ts_min)
ts_mean = ts_pool(feat_cat, 3, method='mean')
ts_mean = batch(ts_mean)
# Concatenate
data_pool = torch.cat([ts_max, ts_min, ts_mean], axis=2)
# Flatten the features
data_pool = data_pool.flatten(start_dim=1)
print(data_pool.size())
The output is:
torch.Size([2000, 324])
Fully connected layer
pipeline = nn.Sequential(nn.Linear(324, 30),
                         nn.ReLU(),
                         nn.Dropout(0.5),
                         nn.Linear(30, 1))
output = pipeline(data_pool)
print(output, '\n', output.size())
The output is:
tensor([[0.6031],
[0.0815],
[0.2301],
...,
[0.1955],
[0.3751],
[0.3509]], grad_fn=<AddmmBackward>)
torch.Size([2000, 1])
Once familiar with this flow, we can wrap it into a complete neural network. The full code is as follows.
class AlphaNet(nn.Module):
    def __init__(self, input_channel, fc1_neuron, fc2_neuron, fcast_neuron):
        super(AlphaNet, self).__init__()
        self.fc1_neuron = fc1_neuron
        self.fc2_neuron = fc2_neuron
        self.fcast_neuron = fcast_neuron
        self.batchnorm = nn.BatchNorm2d(input_channel)
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(self.fc1_neuron, self.fc2_neuron)
        self.out = nn.Linear(self.fc2_neuron, self.fcast_neuron)
        self.relu = nn.ReLU()

    def forward(self, data, num, num_rev):
        # Feature extraction, each followed by batch normalization
        conv1 = self.ts_cov4d(data, num, num_rev, 10).to(torch.float)
        bc1 = self.batchnorm(conv1)
        conv2 = self.ts_corr4d(data, num, num_rev, 10).to(torch.float)
        bc2 = self.batchnorm(conv2)
        conv3 = self.ts_stddev4d(data, 10).to(torch.float)
        bc3 = self.batchnorm(conv3)
        conv4 = self.ts_decaylinear(data, 10).to(torch.float)
        bc4 = self.batchnorm(conv4)
        conv5 = self.ts_zscore(data, 10).to(torch.float)
        bc5 = self.batchnorm(conv5)
        conv6 = self.ts_return(data, 10).to(torch.float)
        bc6 = self.batchnorm(conv6)
        data_conv = torch.cat([bc1, bc2, bc3, bc4, bc5, bc6], axis=2)
        # Pooling, each followed by batch normalization
        ts_max = self.ts_pool(data_conv, 3, method='max')
        ts_max = self.batchnorm(ts_max)
        ts_min = self.ts_pool(data_conv, 3, method='min')
        ts_min = self.batchnorm(ts_min)
        ts_mean = self.ts_pool(data_conv, 3, method='mean')
        ts_mean = self.batchnorm(ts_mean)
        data_fin = torch.cat([ts_max, ts_min, ts_mean], axis=2)
        # Flatten and run through the fully connected head
        data_fin = data_fin.flatten(start_dim=1)
        ful_connect = self.dropout(self.relu(self.fc1(data_fin)))
        output = self.out(ful_connect)
        return output.to(torch.float)

    def ts_cov4d(self, data, num, num_rev, stride):
        '''Covariance for 4-dimensional data.
        data: [N, C, H, W], H: rows (features) of the data picture, W: price series length
        num: list of index pairs, num_rev: the swapped list'''
        if len(data.shape) != 4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        conv_feat = len(num)
        # Window boundaries: a remainder of at most 5 days is merged into the
        # last window, a longer remainder becomes its own window
        if data_length % stride == 0:
            step_list = list(range(0, data_length + stride, stride))
        elif data_length % stride <= 5:
            step_list = list(range(0, data_length - stride, stride)) + [data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
        l = []
        for i in range(len(step_list) - 1):
            start = step_list[i]
            end = step_list[i + 1]
            sub_data1 = data[:, :, num, start:end]
            sub_data2 = data[:, :, num_rev, start:end]
            mean1 = sub_data1.mean(axis=4, keepdims=True)
            mean2 = sub_data2.mean(axis=4, keepdims=True)
            spread1 = sub_data1 - mean1
            spread2 = sub_data2 - mean2
            cov = ((spread1 * spread2).sum(axis=4, keepdims=True) / (sub_data1.shape[4] - 1)).mean(axis=3, keepdims=True)
            l.append(cov)
        cov = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, conv_feat, len(step_list) - 1)
        return torch.from_numpy(cov)

    def ts_corr4d(self, data, num, num_rev, stride):
        '''Correlation: ts_cov4d divided by the product of the standard deviations'''
        if len(data.shape) != 4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        conv_feat = len(num)
        if data_length % stride == 0:
            step_list = list(range(0, data_length + stride, stride))
        elif data_length % stride <= 5:
            step_list = list(range(0, data_length - stride, stride)) + [data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
        l = []
        for i in range(len(step_list) - 1):
            start = step_list[i]
            end = step_list[i + 1]
            sub_data1 = data[:, :, num, start:end]
            sub_data2 = data[:, :, num_rev, start:end]
            std1 = sub_data1.std(axis=4, keepdims=True)
            std2 = sub_data2.std(axis=4, keepdims=True)
            std = (std1 * std2).mean(axis=3, keepdims=True)
            l.append(std)
        std = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, conv_feat, len(step_list) - 1)
        cov = self.ts_cov4d(data, num, num_rev, stride)
        # np.std divides by N (population) while the covariance divides by N-1,
        # so rescale by (N-1)/N; sub_data1 still holds the last window here
        fct = (sub_data1.shape[4] - 1) / sub_data1.shape[4]
        return (cov / torch.from_numpy(std)) * fct

    def ts_stddev4d(self, data, stride):
        '''Rolling standard deviation of each feature row'''
        if len(data.shape) != 4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        if data_length % stride == 0:
            step_list = list(range(0, data_length + stride, stride))
        elif data_length % stride <= 5:
            step_list = list(range(0, data_length - stride, stride)) + [data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
        l = []
        for i in range(len(step_list) - 1):
            start = step_list[i]
            end = step_list[i + 1]
            sub_data1 = data[:, :, :, start:end]
            std1 = sub_data1.std(axis=3, keepdims=True)
            l.append(std1)
        std = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
        return torch.from_numpy(std)

    def ts_zscore(self, data, stride):
        '''Rolling mean divided by rolling standard deviation'''
        if len(data.shape) != 4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        if data_length % stride == 0:
            step_list = list(range(0, data_length + stride, stride))
        elif data_length % stride <= 5:
            step_list = list(range(0, data_length - stride, stride)) + [data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
        l = []
        for i in range(len(step_list) - 1):
            start = step_list[i]
            end = step_list[i + 1]
            sub_data1 = data[:, :, :, start:end]
            mean = sub_data1.mean(axis=3, keepdims=True)
            std = sub_data1.std(axis=3, keepdims=True)
            z_score = mean / std
            l.append(z_score)
        z_score = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
        return torch.from_numpy(z_score)

    def ts_return(self, data, stride):
        '''Return over each window: last price over first price, minus 1'''
        if len(data.shape) != 4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        if data_length % stride == 0:
            step_list = list(range(0, data_length + stride, stride))
        elif data_length % stride <= 5:
            step_list = list(range(0, data_length - stride, stride)) + [data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
        l = []
        for i in range(len(step_list) - 1):
            start = step_list[i]
            end = step_list[i + 1]
            sub_data1 = data[:, :, :, start:end]
            ret = sub_data1[:, :, :, -1] / sub_data1[:, :, :, 0] - 1
            l.append(ret)
        ret_data = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
        return torch.from_numpy(ret_data)

    def ts_decaylinear(self, data, stride):
        '''Linearly-weighted average: more recent days get larger weights'''
        if len(data.shape) != 4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        if data_length % stride == 0:
            step_list = list(range(0, data_length + stride, stride))
        elif data_length % stride <= 5:
            step_list = list(range(0, data_length - stride, stride)) + [data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
        l = []
        for i in range(len(step_list) - 1):
            start = step_list[i]
            end = step_list[i + 1]
            time_spread = end - start
            weight = np.arange(1, time_spread + 1)
            weight = weight / weight.sum()
            sub_data1 = (data[:, :, :, start:end] * weight).sum(axis=3, keepdims=True)
            l.append(sub_data1)
        decay_data = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
        return torch.from_numpy(decay_data)

    def ts_pool(self, data, stride, method):
        '''Pooling over time windows; method is 'max', 'min' or 'mean' '''
        if type(data) == torch.Tensor:
            data = data.detach().numpy()
        if len(data.shape) != 4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        if data_length <= stride:
            step_list = [0, data_length]  # a single window covering everything
        elif data_length % stride == 0:
            step_list = list(range(0, data_length + stride, stride))
        elif data_length % stride <= 5:
            step_list = list(range(0, data_length - stride, stride)) + [data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0, data_length + stride - mod, stride)) + [data_length]
        l = []
        for i in range(len(step_list) - 1):
            start = step_list[i]
            end = step_list[i + 1]
            if method == 'max':
                sub_data1 = data[:, :, :, start:end].max(axis=3, keepdims=True)
            if method == 'min':
                sub_data1 = data[:, :, :, start:end].min(axis=3, keepdims=True)
            if method == 'mean':
                sub_data1 = data[:, :, :, start:end].mean(axis=3, keepdims=True)
            l.append(sub_data1)
        try:
            pool_data = np.squeeze(np.array(l)).transpose(1, 2, 0).reshape(-1, 1, feat_num, len(step_list) - 1)
        except ValueError:
            # with a single window np.squeeze drops the time axis, so skip the transpose
            pool_data = np.squeeze(np.array(l)).reshape(-1, 1, feat_num, len(step_list) - 1)
        return torch.from_numpy(pool_data)
Training the network
Batch size: 2000
Optimizer: RMSprop, learning rate = 0.0001
Loss function: MSELoss
Defining the dataset
To use a custom dataset in PyTorch, subclass the Dataset base class. (PS: lacking an efficient source of stock data, I substitute random numbers for now; readers can try real data themselves.)
from torch.utils.data import Dataset, DataLoader

data = np.random.uniform(10, 100, (3124, 1, 9, 30))
label = np.random.randn(3124, 1)

class Testdataset(Dataset):
    def __init__(self, data, label):
        self.data = torch.from_numpy(data)
        self.label = torch.from_numpy(label)
    def __getitem__(self, index):
        return self.data[index], self.label[index]
    def __len__(self):
        return len(self.data)

trainset = Testdataset(data, label)
trainloader = DataLoader(trainset, batch_size=2000, shuffle=False)
for i, data in enumerate(trainloader):
    input_data, label = data
    print(i, input_data.size())
The output is:
0 torch.Size([2000, 1, 9, 30])
1 torch.Size([1124, 1, 9, 30])
A simple training loop
import torch.optim as optim

alphanet = AlphaNet(1, 324, 30, 1)
criterion = nn.MSELoss()
LR = 0.0001
optimizer = optim.RMSprop(alphanet.parameters(), lr=LR, alpha=0.9)
epoch_num = 20
loss_list = []
for epoch in range(epoch_num):
    train_loss = 0.0
    for data, label in trainloader:
        out_put = alphanet(data.detach().numpy(), num, num_rev)
        loss = criterion(out_put, label.to(torch.float))
        optimizer.zero_grad()
        train_loss += loss.item()  # .item() so the computation graph is not retained across batches
        loss.backward()
        optimizer.step()
    loss_list.append(train_loss)
    print("current epoch:", epoch + 1)
A reminder once more: lacking an efficient way to obtain stock data, this post does not include empirical analysis; it only builds a workable factor mining network, which readers can refine into something more complete. Also, I have only one or two months of exposure to deep learning, so I am not very familiar with PyTorch, and some of Huatai's data handling is unclear to me, especially the sentence in the original report "each training uses the past 1500 trading days as in-sample data, sampled every two days" — I am not sure what sampling every two days means. The network as built may therefore contain mistakes; comments and suggestions are welcome. Thanks!
Summary
The main goal of this post is not to mine factors, but to present one way of building AlphaNet's custom feature extraction and pooling layers. Readers can refine the rest of the network themselves, for example replacing the fully connected layers with sequence models such as LSTM/RNN/GRU, and test the resulting factors on real stock data. Feel free to reach out; my WeChat: QR_ZhangYX.