AlphaNet: Building a Neural Network for Stock-Factor Mining

Overview of the Research Report

I have recently been interning at a private fund mining price-volume factors. Failing to find new ones is painful, so I turned to research reports and Zhihu for inspiration, and found this report: hard to run on the firm's platform, but the idea is very interesting. Roughly, it treats a stock's price-volume data as a series of "data pictures", borrows the convolutional-network idea from image recognition, extracts features from the pictures with custom operators, and feeds them into a fully connected layer (v1) or an LSTM/GRU layer (v2). The overall flow is shown below.
Figure 1: AlphaNet architecture
The input data is structured as follows, where Label is the return over the interval t to t+m.

In my view, this report does not involve very deep learning machinery; the main difficulty lies in implementing the feature-extraction layer, for reasons explained below. Huatai applied similar custom operators in an earlier report, "Stock-Selection Factor Mining Based on Genetic Programming", where the approach was to modify the gplearn source code directly. Being of limited ability, I had to find another way.

Custom Feature-Extraction Layer: the Custom Operators

The feature-extraction layer consists mainly of functions such as ts_corr, ts_cov, ts_std and ts_mean, whose meanings are shown below.
Operator definitions
The layer's key property is that a stock's price-volume rows need not be ordered as in Exhibit 6 (open, high, low, ...); any ordering works. The reason is illustrated with ts_corr(X, Y, d) below.

1. ts_corr(X, Y, d)

With d = 3, the figure below shows how the ts_corr(X, Y, 3) layer works. It traverses the two-dimensional data along both the time dimension and the feature dimension. As in a CNN, the step size stride is a tunable parameter; with stride = 1, for example, the next computation moves one step to the right along the time dimension. The traversal along the feature dimension is where it differs from CNN convolution: a CNN kernel only perceives a local region, whereas ts_corr(X, Y, 3) visits every pair of feature rows, which need not be adjacent. In Exhibit 8, for example, it performs $\binom{9}{2} = 36$ computations per window. This avoids the data-ordering problem caused by local perception in CNNs and lets the layer fully exploit the features in the data. The result of ts_corr(X, Y, 3) is a two-dimensional "feature picture", which can be flattened and fed directly into a fully connected network, or subjected to further feature extraction or pooling. Further extraction allows operators to be nested, e.g. ts_corr(ts_corr(X, Y, 3), ts_corr(Z, W, 3), 3). ts_cov(X, Y, d) works analogously and is not repeated here.
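The pair-wise sliding computation described above can be sketched in a few lines of numpy. This is my own minimal illustration, not the report's code; the 4x9 toy shape, d, and stride are arbitrary choices:

```python
import numpy as np
from itertools import combinations

# Toy "data picture": 4 features x 9 time steps (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 9))
d, stride = 3, 3  # window length d, step size stride

def ts_corr_single(X, d, stride):
    """Sliding-window correlation over every unordered feature pair."""
    n_feat, n_time = X.shape
    starts = range(0, n_time - d + 1, stride)
    out = []
    for i, j in combinations(range(n_feat), 2):   # C(n_feat, 2) pairs
        row = [np.corrcoef(X[i, s:s+d], X[j, s:s+d])[0, 1] for s in starts]
        out.append(row)
    return np.array(out)   # shape: (C(n_feat, 2), number of windows)

feat_map = ts_corr_single(X, d, stride)
print(feat_map.shape)  # (6, 3): 6 pairs, 3 windows
```

The output is itself a small 2-D "feature picture", which is what allows the nesting mentioned above.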

2. ts_stddev(X, d)

With d = 3, Exhibit 9 shows how the ts_stddev(X, 3) layer works. Its mechanism is simple, similar to a 1x3 convolution in a CNN. Other layers such as ts_zscore(X, d) and ts_return(X, d) work analogously to ts_stddev(X, d) and are not repeated here.
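The "1xd convolution" view of ts_stddev can be sketched as follows (again a toy illustration of mine, not the report's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 9))   # 4 features x 9 time steps (toy sizes)
d, stride = 3, 3

def ts_stddev_single(X, d, stride):
    """One std per feature per window: a 1xd 'kernel' slid along time."""
    starts = range(0, X.shape[1] - d + 1, stride)
    return np.array([[X[f, s:s+d].std() for s in starts]
                     for f in range(X.shape[0])])

out = ts_stddev_single(X, d, stride)
print(out.shape)  # (4, 3): 4 features, 3 windows
```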

Implementing the Custom Operators of the Feature-Extraction Layer

Since ts_corr(X, Y, d) and ts_cov(X, Y, d) need two different rows X and Y, and there are $\binom{9}{2} = 36$ ways to choose them, the first thing we need is a generator for index pairs. You can produce them with brute-force loops or define them recursively. Loops are simple, but choosing 2 out of 9 columns takes two nested loops, choosing 3 takes three, and so on, which scales poorly.

1. Generating index pairs by recursion

'''Brute-force loop version'''
def generateC(n):
    v = []
    for i in range(n):
        for j in range(n):
            if i < j:  # i < j already implies i != j
                v.append([i+1,j+1])
    return v
'''Recursive version'''
def generateC(l1):
    '''Pair generator. l1 should be a list; e.g. for all C(5, 2) pairs, pass l1 = [0,1,2,3,4]'''
    if len(l1) == 1:
        return []
    v = [[l1[0],i] for i in l1[1:]]
    l1 = l1[1:]
    return v+generateC(l1)
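As a sanity check, the recursive generator can be compared against itertools.combinations from the standard library, an equivalent (if less instructive) way to get the same pairs:

```python
from itertools import combinations

def generateC(l1):
    '''Recursive pair generator: all C(len(l1), 2) index pairs of l1.'''
    if len(l1) == 1:
        return []
    v = [[l1[0], i] for i in l1[1:]]
    return v + generateC(l1[1:])

pairs = generateC(list(range(9)))
print(len(pairs))  # 36 = C(9, 2)
assert pairs == [list(p) for p in combinations(range(9), 2)]
```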

2. ts_cov(X, Y, d) for a single stock's "data picture"

With the pair generator above, we can pick each combination and compute its covariance. For ordinary two-dimensional data, a brute-force loop also works, as below. Since this function was written only as a test, the default is stride = 10 and divisibility of the series length is not handled.

def covar(X,Y,d,stride):
    size = (len(X) - d)/stride + 1
    v = []
    for i in range(int(size)):
        x = X[i*stride:i*stride+d]  # each window has length d, not stride
        y = Y[i*stride:i*stride+d]
        cov = np.cov(x,y)[0][1]
        v.append(cov)
    return v
def ts_covar(inputs,d,num,stride = 10):
    '''num: list of index pairs'''
    v = []
    for n in num:
        c = covar(inputs[n[0]-1],inputs[n[1]-1],d,stride)
        v.append(c)
    s = np.array(v).reshape(len(num),-1)
    return torch.from_numpy(s)
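Each windowed value should match np.cov applied to that window directly. Here is a standalone sanity check of mine (d and stride are both 10, matching the defaults above):

```python
import numpy as np

def covar(X, Y, d, stride):
    """Covariance of two 1-D series over windows of length d, stepped by stride."""
    n_windows = (len(X) - d) // stride + 1
    return [np.cov(X[i*stride:i*stride + d], Y[i*stride:i*stride + d])[0][1]
            for i in range(n_windows)]

rng = np.random.default_rng(0)
x, y = rng.normal(size=30), rng.normal(size=30)
v = covar(x, y, d=10, stride=10)
print(len(v))  # 3 windows for a 30-step series
assert np.isclose(v[0], np.cov(x[:10], y[:10])[0][1])
```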

This function can extract features for a single stock, but the input here is four-dimensional, [batch size, channels, height, width], and traversing stock by stock would be far too slow to be practical. Is there a way to compute ts_cov in parallel over the whole batch of 4-D data? I was stuck on this for days, until it suddenly clicked on a bus ride: a CNN can convolve all channels of every batch element simultaneously because it builds a kernel of shape [out_channels, in_channels, kernel_size, kernel_size]. For example:

import torch
import torch.nn as nn

input_data = torch.randn(10,3,28,28) # generate input data
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size = (4,4),bias=False) # define the convolution layer
output = conv2d(input_data)
print(conv2d.weight.size())
print(output.size())

Output:

torch.Size([16, 3, 4, 4])
torch.Size([10, 16, 25, 25])

3. Computing covariance in parallel, convolution-style

So can we produce a set of "kernels" when computing covariance? Certainly. Start from how covariance is computed:

$$\mathrm{Cov}(X,Y)=E\big[(X-\overline{X})(Y-\overline{Y})\big]=\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\overline{X})(Y_i-\overline{Y})$$

The right-hand side looks rather like a convolution. As an example, suppose we have the data

$$\begin{bmatrix} x_1 & x_2 & x_3 & \cdots & x_m\\ y_1 & y_2 & y_3 & \cdots & y_m \end{bmatrix}$$

Compute the means $\overline{x}$ and $\overline{y}$, then subtract each row's mean:

$$\begin{bmatrix} x_1-\overline{x} & x_2-\overline{x} & x_3-\overline{x} & \cdots & x_m-\overline{x}\\ y_1-\overline{y} & y_2-\overline{y} & y_3-\overline{y} & \cdots & y_m-\overline{y} \end{bmatrix}$$

Flip the rows of this matrix:

$$\begin{bmatrix} y_1-\overline{y} & y_2-\overline{y} & y_3-\overline{y} & \cdots & y_m-\overline{y}\\ x_1-\overline{x} & x_2-\overline{x} & x_3-\overline{x} & \cdots & x_m-\overline{x} \end{bmatrix}$$

Multiplying the two matrices elementwise, then summing along each row and dividing by $m-1$, gives the covariance.
For example:

import numpy as np

data = np.random.randn(3,1,2,10) # generate data
mean = data.mean(axis = 3, keepdims=True) # keepdims must be set to True
x1 = data[:,:,[0,1],:]-mean
y1 = data[:,:,[1,0],:]-mean  # row-flipped copy
coef = (x1*y1).sum(axis = 3, keepdims=True)/(10-1)
print(coef)

Output:

[[[[-0.40902133]
   [-0.40902133]]]
 [[[-0.27561585]
   [-0.27561585]]]
 [[[ 0.38524211]
   [ 0.38524211]]]]

Verify with np.cov:

test_data = data[0].reshape(2,10)
print(np.cov(test_data))
Output:
[[ 1.27875212 -0.40902133]
 [-0.40902133  0.64090397]]

The results agree.
This is where the index pairs from step one come in: first we generate all $\binom{9}{2}$ pairs, then a flipped pair list (e.g. [0, 1] becomes [1, 0]).

data=np.random.uniform(10,100,(2000,1,9,30))
def generateC(l1):
    if len(l1) == 1:
        return []
    v = [[l1[0],i] for i in l1[1:]]
    l1 = l1[1:]
    return v+generateC(l1)
feat_nums = data.shape[2]
list1 = list(range(feat_nums))  # feat_nums = 9 feature rows here
num = generateC(list1)
num_rev = [] # flipped pairs. PS: don't write [l.reverse() for l in num] -- reverse() mutates the list in place and returns None.
for l in num:
    l1 = l.copy()
    l1.reverse()
    num_rev.append(l1)
print(num,'\n',num_rev)
Output:
[[0, 1], [0, 2], [0, 3], [0, 4], [0, 5], [0, 6], [0, 7], [0, 8], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [1, 7], [1, 8], [2, 3], [2, 4], [2, 5], [2, 6], [2, 7], [2, 8], [3, 4], [3, 5], [3, 6], [3, 7], [3, 8], [4, 5], [4, 6], [4, 7], [4, 8], [5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8]]
[[1, 0], [2, 0], [3, 0], [4, 0], [5, 0], [6, 0], [7, 0], [8, 0], [2, 1], [3, 1], [4, 1], [5, 1], [6, 1], [7, 1], [8, 1], [3, 2], [4, 2], [5, 2], [6, 2], [7, 2], [8, 2], [4, 3], [5, 3], [6, 3], [7, 3], [8, 3], [5, 4], [6, 4], [7, 4], [8, 4], [6, 5], [7, 5], [8, 5], [7, 6], [8, 6], [8, 7]]

Note that the two lists must correspond one-to-one: [a, b] must line up with [b, a].
With this preparation we can operate on the input data. One point deserves emphasis: how numpy indexing behaves. Given a 4-D array of shape [N, C, H, W], array[:,:,0,:] has shape [N, C, W] (it takes the first row of the third dimension); array[:,:,[0,1],:] reads the first and second rows of the third dimension together, giving a slice of shape [N, C, 2, W]; and indexing with list = [[0,1],[0,2]] gives a slice of shape [N, C, 2, 2, W], where the extra dimension of size 2 comes from the nesting of the index list. Code:

import numpy as np
data = np.random.randn(300,4,9,30) # generate data
num = generateC(list(range(9))) # generate the index-pair list
data_cut = data[:,:,num,:]
print(data_cut.shape)
Output:
(300, 4, 36, 2, 30)

With the above, the reason for building a flipped pair list should now be clear:
numpy returns slices in the order you index them. array[:,:,[0,1],:] puts row 0 on top and row 1 below, while array[:,:,[1,0],:] puts row 1 on top and row 0 below. Code:

import numpy as np

arr = np.arange(64).reshape(2,1,4,8)
print(arr[:,:,[0,1],:])
print(arr[:,:,[1,0],:])
Output:
array([[[[ 0,  1,  2,  3,  4,  5,  6,  7],
         [ 8,  9, 10, 11, 12, 13, 14, 15]]],
       [[[32, 33, 34, 35, 36, 37, 38, 39],
         [40, 41, 42, 43, 44, 45, 46, 47]]]])
         
array([[[[ 8,  9, 10, 11, 12, 13, 14, 15],
         [ 0,  1,  2,  3,  4,  5,  6,  7]]],
       [[[40, 41, 42, 43, 44, 45, 46, 47],
         [32, 33, 34, 35, 36, 37, 38, 39]]]])

With this groundwork we can write the ts_cov(X, Y, d) function:

def ts_cov4d(data,num,num_rev,stride):
    '''Covariance of 4-D data.
    data: [N,C,H,W], W: series length, N: batch size
    num: index-pair list, num_rev: flipped copy of num'''
    # Build the step list: if the series length is not divisible by stride, the
    # remainder becomes its own window; a remainder of at most 5 is merged into the previous window.
    if len(data.shape)!=4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    conv_feat = len(num)
    if data_length % stride == 0:
        step_list = list(range(0,data_length+stride,stride))
    elif data_length % stride<=5:
        mod = data_length % stride
        step_list = list(range(0,data_length-stride,stride))+[data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
    l = []
    # keep keepdims=True throughout the computation
    for i in range(len(step_list)-1):
        start = step_list[i]
        end = step_list[i+1]
        sub_data1 = data[:,:,num,start:end]
        sub_data2 = data[:,:,num_rev,start:end]
        mean1 = sub_data1.mean(axis = 4,keepdims = True)
        mean2 = sub_data2.mean(axis = 4,keepdims = True)
        spread1 = sub_data1 - mean1
        spread2 = sub_data2 - mean2
        cov = ((spread1*spread2).sum(axis = 4,keepdims = True)/(sub_data1.shape[4] - 1)).mean(axis = 3,keepdims = True)
        l.append(cov)
    conv_feat = len(num)
    corr = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,conv_feat,len(step_list)-1)
    return torch.from_numpy(corr)

Timing the code:

import time
data = np.random.uniform(10,100,(2000,1,9,30)) # random "data pictures" for 2000 stocks
start = time.time()
ts_cov = ts_cov4d(data,num,num_rev,10)
end = time.time()
print(end-start)
print(ts_cov.shape)
Output:
0.16455984115600586
torch.Size([2000, 1, 36, 3])

Following the same line of analysis, the other operators can be defined analogously to ts_cov4d. The complete feature-extraction layer:

def ts_cov4d(data,num,num_rev,stride):
    '''Covariance of 4-D data.
    data: [N,C,H,W], W: series length, N: batch size
    num: index-pair list, num_rev: flipped copy of num'''
    # Build the step list; a remainder of at most 5 is merged into the previous window.
    if len(data.shape)!=4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    conv_feat = len(num)
    if data_length % stride == 0:
        step_list = list(range(0,data_length+stride,stride))
    elif data_length % stride<=5:
        mod = data_length % stride
        step_list = list(range(0,data_length-stride,stride))+[data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
    l = []
    for i in range(len(step_list)-1):
        start = step_list[i]
        end = step_list[i+1]
        sub_data1 = data[:,:,num,start:end]
        sub_data2 = data[:,:,num_rev,start:end]
        mean1 = sub_data1.mean(axis = 4,keepdims = True)
        mean2 = sub_data2.mean(axis = 4,keepdims = True)
        spread1 = sub_data1 - mean1
        spread2 = sub_data2 - mean2
        cov = ((spread1*spread2).sum(axis = 4,keepdims = True)/(sub_data1.shape[4] - 1)).mean(axis = 3,keepdims = True)
        l.append(cov)
    conv_feat = len(num)
    corr = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,conv_feat,len(step_list)-1)
    return torch.from_numpy(corr)
    
def ts_corr4d(data,num,num_rev,stride):
    if len(data.shape)!=4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    conv_feat = len(num)
    if data_length % stride == 0:
        step_list = list(range(0,data_length+stride,stride))
    elif data_length % stride<=5:
        mod = data_length % stride
        step_list = list(range(0,data_length-stride,stride))+[data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
    l = []
    for i in range(len(step_list)-1):
        start = step_list[i]
        end = step_list[i+1]
        sub_data1 = data[:,:,num,start:end]
        sub_data2 = data[:,:,num_rev,start:end]
        std1 = sub_data1.std(axis = 4,keepdims = True)
        std2 = sub_data2.std(axis = 4,keepdims = True)
        std = (std1*std2).mean(axis = 3,keepdims = True)
        l.append(std)
    conv_feat = len(num)
    std = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,conv_feat,len(step_list)-1)
    cov = ts_cov4d(data,num,num_rev,stride)
    fct = (sub_data1.shape[4]-1)/sub_data1.shape[4]
    return (cov/torch.from_numpy(std))*fct
    
def ts_stddev4d(data,stride):
    if len(data.shape)!=4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    conv_feat = len(num)
    if data_length % stride == 0:
        step_list = list(range(0,data_length+stride,stride))
    elif data_length % stride<=5:
        mod = data_length % stride
        step_list = list(range(0,data_length-stride,stride))+[data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
    l = []
    for i in range(len(step_list)-1):
        start = step_list[i]
        end = step_list[i+1]
        sub_data1 = data[:,:,:,start:end]
        std1 = sub_data1.std(axis = 3,keepdims = True)
        l.append(std1)
    std = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list)-1)
    return torch.from_numpy(std)
    
def ts_zscore(data,stride):
    if len(data.shape)!=4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    conv_feat = len(num)
    if data_length % stride == 0:
        step_list = list(range(0,data_length+stride,stride))
    elif data_length % stride<=5:
        mod = data_length % stride
        step_list = list(range(0,data_length-stride,stride))+[data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
    l = []
    for i in range(len(step_list)-1):
        start = step_list[i]
        end = step_list[i+1]
        sub_data1 = data[:,:,:,start:end]
        mean = sub_data1.mean(axis = 3,keepdims = True)
        std = sub_data1.std(axis = 3,keepdims = True)
        z_score = mean/std
        l.append(z_score)
    z_score = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list)-1)
#     z_data = np.squeeze(np.array(l)).transpose(1,2,0,3).reshape(-1,1,feat_num,data_length)
    return torch.from_numpy(z_score)
    
def ts_return(data,stride):
    if len(data.shape)!=4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    if data_length % stride == 0:
        step_list = list(range(0,data_length+stride,stride))
    elif data_length % stride<=5:
        mod = data_length % stride
        step_list = list(range(0,data_length-stride,stride))+[data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
    global l
    l = []
    for i in range(len(step_list)-1):
        start = step_list[i]
        end = step_list[i+1]
        sub_data1 = data[:,:,:,start:end]
        ret = sub_data1[:,:,:,-1]/sub_data1[:,:,:,0] - 1
        l.append(ret)
    z_data = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list)-1)
    return torch.from_numpy(z_data)
    
def ts_decaylinear(data,stride):
    if len(data.shape)!=4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    if data_length % stride == 0:
        step_list = list(range(0,data_length+stride,stride))
    elif data_length % stride<=5:
        mod = data_length % stride
        step_list = list(range(0,data_length-stride,stride))+[data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
    global l
    l = []
    for i in range(len(step_list)-1):
        start = step_list[i]
        end = step_list[i+1]
        time_spread = end - start
        weight = np.arange(1,time_spread+1)
        weight = weight/(weight.sum())
        sub_data1 = (data[:,:,:,start:end]*weight).sum(axis = 3,keepdims = True)
        l.append(sub_data1)
    decay_data = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list)-1)
    return torch.from_numpy(decay_data)
    
def ts_pool(data,stride,method):
    if type(data) == torch.Tensor:
        data = data.detach().numpy()
    if data.shape[-1] <= stride:
        step_list = [0,data.shape[-1]]
    if len(data.shape)!=4:
        raise Exception('Input data dimensions should be [N,C,H,W]')
    data_length = data.shape[3]
    feat_num = data.shape[2]
    conv_feat = len(num)
    if data_length % stride == 0:
        step_list = list(range(0,data_length+stride,stride))
    elif data_length % stride<=5:
        mod = data_length % stride
        step_list = list(range(0,data_length-stride,stride))+[data_length]
    else:
        mod = data_length % stride
        step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
    global l
    l = []
    for i in range(len(step_list)-1):
        start = step_list[i]
        end = step_list[i+1]
        if method == 'max':
            sub_data1 = data[:,:,:,start:end].max(axis = 3,keepdims = True)
        if method == 'min':
            sub_data1 = data[:,:,:,start:end].min(axis = 3,keepdims = True)
        if method == 'mean':
            sub_data1 = data[:,:,:,start:end].mean(axis = 3,keepdims = True)
        l.append(sub_data1)
    try:
        pool_data = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list) - 1)
    except:
        pool_data = np.squeeze(np.array(l)).reshape(-1,1,feat_num,len(step_list) - 1)
    return torch.from_numpy(pool_data)
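The functions above compute in numpy and only convert to torch at the end, so the windowed statistics sit outside the autograd graph (harmless here, since no learnable parameters precede them, but worth knowing). As a sketch, the same windowed covariance can be written in pure torch with Tensor.unfold, which keeps gradients flowing. This is my own variant, not the report's code, and it assumes W is divisible by stride:

```python
import torch

def ts_cov4d_torch(data, num, num_rev, stride):
    """Differentiable windowed covariance, pure torch.

    data: float tensor [N, C, H, W]; num / num_rev: the pair lists built earlier.
    Assumes W % stride == 0 (no remainder-window handling, unlike ts_cov4d)."""
    sub1 = data[:, :, num, :].unfold(4, stride, stride)      # [N, C, P, 2, nw, stride]
    sub2 = data[:, :, num_rev, :].unfold(4, stride, stride)
    s1 = sub1 - sub1.mean(dim=-1, keepdim=True)
    s2 = sub2 - sub2.mean(dim=-1, keepdim=True)
    cov = (s1 * s2).sum(dim=-1) / (stride - 1)               # [N, C, P, 2, nw]
    return cov.mean(dim=3)                                   # [N, C, P, nw]

# Toy sizes: 2 stocks, 4 feature rows, 20 time steps, C(4,2)=6 pairs
x = torch.randn(2, 1, 4, 20, dtype=torch.float64, requires_grad=True)
num = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
num_rev = [[b, a] for a, b in num]
out = ts_cov4d_torch(x, num, num_rev, 10)
print(out.shape)  # torch.Size([2, 1, 6, 2])
```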

Building AlphaNet

A simple AlphaNet pipeline looks like this:

raw data [N,C,H,W] -> feature-extraction layer -> BatchNorm layer -> pooling layer (ts_pool) -> fully connected layer (30 neurons) -> output (one neuron)

We first walk through one forward pass using torch.nn built-ins such as nn.Linear() and nn.BatchNorm2d(), and then wrap it all up as the AlphaNet class.

Feature extraction

# feature extraction
data = np.random.uniform(10,100,(2000,1,9,30)) # simulated data pictures for 2000 stocks
batch = nn.BatchNorm2d(1) # batch-normalization layer
conv1 = ts_cov4d(data,num,num_rev,10).to(torch.float)
bc1 = batch(conv1)
conv2 = ts_corr4d(data,num,num_rev,10).to(torch.float)
bc2 = batch(conv2)
conv3 = ts_stddev4d(data,10).to(torch.float)
bc3 = batch(conv3)
conv4 = ts_decaylinear(data,10).to(torch.float)
bc4 = batch(conv4)
conv5 = ts_zscore(data,10).to(torch.float)
bc5 = batch(conv5)
conv6 = ts_return(data,10).to(torch.float)
bc6 = batch(conv6)
# concatenate the features
feat_cat = torch.cat([bc1,bc2,bc3,bc4,bc5,bc6],axis = 2)
print(bc1.size())
print(bc2.size())
print(bc3.size())
print(bc4.size())
print(bc5.size())
print(bc6.size())
print(feat_cat.size())
Output:
torch.Size([2000, 1, 36, 3])
torch.Size([2000, 1, 36, 3])
torch.Size([2000, 1, 9, 3])
torch.Size([2000, 1, 9, 3])
torch.Size([2000, 1, 9, 3])
torch.Size([2000, 1, 9, 3])
torch.Size([2000, 1, 108, 3])

Pooling layer

ts_max = ts_pool(feat_cat ,3,method = 'max')
ts_max = batch(ts_max)
ts_min = ts_pool(feat_cat ,3,method = 'min')
ts_min = batch(ts_min)
ts_mean = ts_pool(feat_cat ,3,method = 'mean')
ts_mean = batch(ts_mean)
# concatenate
data_pool = torch.cat([ts_max,ts_min,ts_mean],axis = 2)
# flatten the features
data_pool = data_pool.flatten(start_dim = 1)
print(data_pool.size())
Output:
torch.Size([2000, 324])

Fully connected layer

pipeline = nn.Sequential(nn.Linear(324,30),
                         nn.ReLU(),
                         nn.Dropout(0.5),
                         nn.Linear(30,1))
output = pipeline(data_pool)
print(output,'\n',output.size())
Output:
tensor([[0.6031],
        [0.0815],
        [0.2301],
        ...,
        [0.1955],
        [0.3751],
        [0.3509]], grad_fn=<AddmmBackward>) 
 torch.Size([2000, 1])

Once this flow is familiar, it can be wrapped into a complete network. Full code:

class AlphaNet(nn.Module):
    def __init__(self,input_channel,fc1_neuron,fc2_neuron,fcast_neuron):
        super(AlphaNet,self).__init__()
        self.fc1_neuron = fc1_neuron
        self.fc2_neuron = fc2_neuron
        self.fcast_neuron = fcast_neuron
        self.batchnorm = nn.BatchNorm2d(input_channel)
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(self.fc1_neuron,self.fc2_neuron)
        self.out = nn.Linear(self.fc2_neuron,self.fcast_neuron)
        self.relu = nn.ReLU()
    def forward(self,data,num,num_rev):
        conv1 = self.ts_cov4d(data,num,num_rev,10).to(torch.float)
        bc1 = self.batchnorm(conv1)
        conv2 = self.ts_corr4d(data,num,num_rev,10).to(torch.float)
        bc2 = self.batchnorm(conv2)
        conv3 = self.ts_stddev4d(data,10).to(torch.float)
        bc3 = self.batchnorm(conv3)
        conv4 = self.ts_decaylinear(data,10).to(torch.float)
        bc4 = self.batchnorm(conv4)
        conv5 = self.ts_zscore(data,10).to(torch.float)
        bc5 = self.batchnorm(conv5)
        conv6 = self.ts_return(data,10).to(torch.float)
        bc6 = self.batchnorm(conv6)
        data_conv = torch.cat([bc1,bc2,bc3,bc4,bc5,bc6],axis = 2)
        ts_max = self.ts_pool(data_conv,3,method = 'max')
        ts_max = self.batchnorm(ts_max)
        ts_min = self.ts_pool(data_conv,3,method = 'min')
        ts_min = self.batchnorm(ts_min)
        ts_mean = self.ts_pool(data_conv,3,method = 'mean')
        ts_mean = self.batchnorm(ts_mean)
        data_fin = torch.cat([ts_max,ts_min,ts_mean],axis = 2)
        data_fin = data_fin.flatten(start_dim = 1)
        input_size = data_fin.size(1)
        ful_connect = self.dropout(self.relu(self.fc1(data_fin)))
        output = self.out(ful_connect)
        return output.to(torch.float)
    def ts_cov4d(self,data,num,num_rev,stride):
        '''Covariance of four-dimensional data.
        data: [N,C,H,W], H: feature rows of the data picture, W: series length, N: batch size
        num: index-pair list, num_rev: flipped copy of num'''
        # Build the step list; a remainder of at most 5 is merged into the previous window.
        if len(data.shape)!=4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        conv_feat = len(num)
        if data_length % stride == 0:
            step_list = list(range(0,data_length+stride,stride))
        elif data_length % stride<=5:
            mod = data_length % stride
            step_list = list(range(0,data_length-stride,stride))+[data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
        l = []
        for i in range(len(step_list)-1):
            start = step_list[i]
            end = step_list[i+1]
            sub_data1 = data[:,:,num,start:end]
            sub_data2 = data[:,:,num_rev,start:end]
            mean1 = sub_data1.mean(axis = 4,keepdims = True)
            mean2 = sub_data2.mean(axis = 4,keepdims = True)
            spread1 = sub_data1 - mean1
            spread2 = sub_data2 - mean2
            cov = ((spread1*spread2).sum(axis = 4,keepdims = True)/(sub_data1.shape[4] - 1)).mean(axis = 3,keepdims = True)
            l.append(cov)
        conv_feat = len(num)
        corr = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,conv_feat,len(step_list)-1)
        return torch.from_numpy(corr)
    def ts_corr4d(self,data,num,num_rev,stride):
        if len(data.shape)!=4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        conv_feat = len(num)
        if data_length % stride == 0:
            step_list = list(range(0,data_length+stride,stride))
        elif data_length % stride<=5:
            mod = data_length % stride
            step_list = list(range(0,data_length-stride,stride))+[data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
        l = []
        for i in range(len(step_list)-1):
            start = step_list[i]
            end = step_list[i+1]
            sub_data1 = data[:,:,num,start:end]
            sub_data2 = data[:,:,num_rev,start:end]
            std1 = sub_data1.std(axis = 4,keepdims = True)
            std2 = sub_data2.std(axis = 4,keepdims = True)
            std = (std1*std2).mean(axis = 3,keepdims = True)
            l.append(std)
        conv_feat = len(num)
        std = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,conv_feat,len(step_list)-1)
        cov = self.ts_cov4d(data,num,num_rev,stride)
        fct = (sub_data1.shape[4]-1)/sub_data1.shape[4]
        return (cov/torch.from_numpy(std))*fct
    def ts_stddev4d(self,data,stride):
        if len(data.shape)!=4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        conv_feat = len(num)
        if data_length % stride == 0:
            step_list = list(range(0,data_length+stride,stride))
        elif data_length % stride<=5:
            mod = data_length % stride
            step_list = list(range(0,data_length-stride,stride))+[data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
        l = []
        for i in range(len(step_list)-1):
            start = step_list[i]
            end = step_list[i+1]
            sub_data1 = data[:,:,:,start:end]
            std1 = sub_data1.std(axis = 3,keepdims = True)
            l.append(std1)
        std = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list)-1)
        return torch.from_numpy(std)
    def ts_zscore(self,data,stride):
        if len(data.shape)!=4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        conv_feat = len(num)
        if data_length % stride == 0:
            step_list = list(range(0,data_length+stride,stride))
        elif data_length % stride<=5:
            mod = data_length % stride
            step_list = list(range(0,data_length-stride,stride))+[data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
        l = []
        for i in range(len(step_list)-1):
            start = step_list[i]
            end = step_list[i+1]
            sub_data1 = data[:,:,:,start:end]
            mean = sub_data1.mean(axis = 3,keepdims = True)
            std = sub_data1.std(axis = 3,keepdims = True)
            z_score = mean/std
            l.append(z_score)
        z_score = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list)-1)
    #     z_data = np.squeeze(np.array(l)).transpose(1,2,0,3).reshape(-1,1,feat_num,data_length)
        return torch.from_numpy(z_score)
    def ts_return(self,data,stride):
        if len(data.shape)!=4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        if data_length % stride == 0:
            step_list = list(range(0,data_length+stride,stride))
        elif data_length % stride<=5:
            mod = data_length % stride
            step_list = list(range(0,data_length-stride,stride))+[data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
        global l
        l = []
        for i in range(len(step_list)-1):
            start = step_list[i]
            end = step_list[i+1]
            sub_data1 = data[:,:,:,start:end]
            ret = sub_data1[:,:,:,-1]/sub_data1[:,:,:,0] - 1
            l.append(ret)
        z_data = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list)-1)
        return torch.from_numpy(z_data)
    def ts_decaylinear(self,data,stride):
        if len(data.shape)!=4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        if data_length % stride == 0:
            step_list = list(range(0,data_length+stride,stride))
        elif data_length % stride<=5:
            mod = data_length % stride
            step_list = list(range(0,data_length-stride,stride))+[data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
        global l
        l = []
        for i in range(len(step_list)-1):
            start = step_list[i]
            end = step_list[i+1]
            time_spread = end - start
            weight = np.arange(1,time_spread+1)
            weight = weight/(weight.sum())
            sub_data1 = (data[:,:,:,start:end]*weight).sum(axis = 3,keepdims = True)
            l.append(sub_data1)
        decay_data = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list)-1)
        return torch.from_numpy(decay_data)
    def ts_pool(self,data,stride,method):
        if type(data) == torch.Tensor:
            data = data.detach().numpy()
        if data.shape[-1] <= stride:
            step_list = [0,data.shape[-1]]
        if len(data.shape)!=4:
            raise Exception('Input data dimensions should be [N,C,H,W]')
        data_length = data.shape[3]
        feat_num = data.shape[2]
        conv_feat = len(num)
        if data_length % stride == 0:
            step_list = list(range(0,data_length+stride,stride))
        elif data_length % stride<=5:
            mod = data_length % stride
            step_list = list(range(0,data_length-stride,stride))+[data_length]
        else:
            mod = data_length % stride
            step_list = list(range(0,data_length+stride-mod,stride))+[data_length]
        global l
        l = []
        for i in range(len(step_list)-1):
            start = step_list[i]
            end = step_list[i+1]
            if method == 'max':
                sub_data1 = data[:,:,:,start:end].max(axis = 3,keepdims = True)
            if method == 'min':
                sub_data1 = data[:,:,:,start:end].min(axis = 3,keepdims = True)
            if method == 'mean':
                sub_data1 = data[:,:,:,start:end].mean(axis = 3,keepdims = True)
            l.append(sub_data1)
        try:
            pool_data = np.squeeze(np.array(l)).transpose(1,2,0).reshape(-1,1,feat_num,len(step_list) - 1)
        except:
            pool_data = np.squeeze(np.array(l)).reshape(-1,1,feat_num,len(step_list) - 1)
        return torch.from_numpy(pool_data)

Training the Network

Batch size = 2000
Optimizer: RMSprop, learning rate = 0.0001
Loss function: MSELoss

Defining the dataset

Custom datasets in PyTorch subclass the Dataset base class. (PS: lacking an efficient source of stock data, I use random numbers for now; readers can substitute real data.)

from torch.utils.data import Dataset, DataLoader
data = np.random.uniform(10,100,(3124,1,9,30))
label = np.random.randn(3124,1)
class Testdataset(Dataset):
    def __init__(self, data,label):
        self.data = torch.from_numpy(data)
        self.label = torch.from_numpy(label)
    def __getitem__(self, index):
        return self.data[index], self.label[index]
    def __len__(self):
        return len(self.data)
trainset = Testdataset(data,label)
trainloader = DataLoader(trainset,batch_size = 2000,shuffle = False)
for i,data in enumerate(trainloader):
    input_data,label = data
    print(i,input_data.size())
Output:
0 torch.Size([2000, 1, 9, 30])
1 torch.Size([1124, 1, 9, 30])

A simple training loop


import torch.optim as optim
alphanet = AlphaNet(1,324,30,1)
criterion = nn.MSELoss()
LR = 0.0001
optimizer = optim.RMSprop(alphanet.parameters(), lr=LR, alpha=0.9)
epoch_num = 20
loss_list = []
for epoch in range(epoch_num):
    train_loss = 0.0
    for data,label in trainloader:
        out_put = alphanet(data.detach().numpy(),num,num_rev)
        loss = criterion(out_put,label.to(torch.float))
        optimizer.zero_grad()
        train_loss += loss.item()  # use .item() so the computation graph is not retained
        loss.backward()
        optimizer.step()
    loss_list.append(train_loss)
    print("current epoch:",epoch+1)

One more reminder: since I have no efficient way to obtain stock data, this post does no empirical analysis; it only builds a workable factor-mining network, and readers are welcome to extend the code. Also, I have only been doing deep learning for a month or two, so I am not very familiar with the PyTorch framework, and parts of Huatai's data handling are unclear to me, in particular the sentence "each training run uses the past 1500 trading days as in-sample data, sampled every two days"; what sampling every two days means is not clear to me. The network as built may therefore contain problems, and I would appreciate any suggestions, thanks!

Summary

The main purpose of this post is not to mine factors but to present one way to build AlphaNet's custom feature-extraction and pooling layers. Readers can complete the network further, for example by replacing the fully connected layer with a sequence model such as an LSTM/RNN/GRU, and test the factors on real stock data. I also welcome discussion; my WeChat is QR_ZhangYX.
