基于LSTM多维度特征的资金流预测

辰阳星宇

已于 2023-01-06 21:46:32 修改

阅读量7.9k

点赞数 4

分类专栏：深度学习 # Tensorflow 文章标签： LSTM 深度学习资金预测 Tensorflow 时间序列

于 2019-05-27 22:40:41 首次发布

本文链接：https://blog.csdn.net/qq_41094332/article/details/89049610

版权

深度学习同时被 2 个专栏收录

18 篇文章

订阅专栏

Tensorflow

8 篇文章

订阅专栏

LSTM多维度特征的资金流预测

项目背景

项目背景

本篇主要是使用天池比赛中余额宝资金流的数据，已知在2013年7月1日至2014年8月31日内余额宝每日申购和赎回的资金流，通过使用python对数据进行处理之后用LSTM进行回归预测，再进行回测来检验模型。

观察数据进行处理

首先是对这427天数据的整体分布进行观察，申购的资金流分布如下：
这里插入图片描述
可以看出数据的整体趋势是先逐渐上升，到200天之后有大幅度的跌涨（可能跟春节有关），然后在经过一段降幅后逐渐趋于稳定。根据此情况，可将数据按时间分为三个阶段，即增长期（2013.7—2013.12），降幅期（2014.1—2014.3）和平稳期（2014.4—2014.8）。

然后，观察用户操作情况，发现每月的用户数都在增长，但有一部分用户却在427天内未做过任何操作，便将这一部分的用户作为“异常值”而删除。将其余用户进行保留，尽管其余用户有的操作金额可能并不多，但这也是一部分信息的体现，就先将其进行保留处理。

再导入其余的三张表：mfd_day_share_interest,user_profile_table和mfd_bank_shibor，发现mdf_day_share_interest和user_profile_table中的数值正常无明显的缺失值和异常值，而mfd_bank_shibor在周六周日含有缺失值，要想保障数据的准确，不论是取平均值还是别的方式发现都不太令人满意，因此也就暂未处理（如果有朋友能有合适的方法多多留言哦~）。

在user_profile_table表中发现了，此次交易用户来自7座城市，因此可以考虑将用户按城市进行划分，表中也含有12星座的信息，可能也跟星座有关（人的性格啊等等等），还含有男性和女性的标记，也可按性别进行划分。

然后，我们将user_balance_table,user_profile_table和mdf_day_share_interest这三张表进行合并，对user_profile_table表中的相关信息进行One-hot编码，便于后续的相关性分析。

特征选择

对已合并的表用Sperman方法进行相关性分析，发现user_balance_table中的其余属性分别与申购和赎回的相关性都大于0.6而user_profile_table和mfd_bank_shibor中的属性与申购和赎回的相关性小于0.6，因此就选择相关性大于0.6的强相关性属性作为特征。

预测purchase和redeem时，均选取report_date、tBalance、yBalance、direct_purchase_amt、purchase_bal_amt、purchase_bank_amt、consume_amt、transafer_amt、tftobal_amt、tftocard_amt、share_amt、total_purchase_amt这十一个属性作为特征来进行预测。

代码实现

这里只放上了purchase的预测代码，redeem与此相同

# 加载数据分析常用库
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#% matplotlib inline
import os
import tensorflow as tf
from sklearn.preprocessing import StandardScaler


tf.reset_default_graph()

pd.set_option('display.max_columns',20)
pd.set_option('display.max_rows',31)

trian_path =  './test_data_purchase.csv'
df11 = pd.read_csv(trian_path)
//获取11个属性和目标值
data = df11.iloc[:,1:13]
ss = StandardScaler()
data_sd = ss.fit_transform(data)
plt.figure(figsize=(24,8))
plt.plot(data_sd[:,:11])
plt.plot(data_sd[:,11],label = 'target',color='red')
plt.legend(loc = 'upper left',fontsize = 24)
plt.show()

data = np.array(df11.ix[:,1:13])
# 清除图层，否则会报错
tf.reset_default_graph()

然后，设置参数

rnn_unit = 10  # 隐层数量
input_size = 11
output_size = 1
lr = 0.0006  # 学习率
epochs = 500

构建获取训练集数据的函数，使用均差的方法对数据进行标准化

# 获取训练集
def get_train_data(batch_size=14, time_step=2,train_begin=0, train_end=len(data)):
    batch_index = []
    data_train = data[train_begin:train_end]
    normalized_train_data = (
        data_train-np.mean(data_train, axis=0))/np.std(data_train, axis=0)  # 标准化
    train_x, train_y = [], []  # 训练集
    for i in range(len(normalized_train_data)-time_step):
        if i % batch_size == 0:
            batch_index.append(i)
        x = normalized_train_data[i:i+time_step, :11]
        y = normalized_train_data[i:i+time_step, 11, np.newaxis]
        train_x.append(x.tolist())
        train_y.append(y.tolist())
    batch_index.append((len(normalized_train_data)-time_step))
    return batch_index, train_x, train_y

定义输入输出层权重、偏置和LSTM

# 输入层、输出层权重、偏置
weights = {
    'in': tf.Variable(tf.random_normal([input_size, rnn_unit])),
    'out': tf.Variable(tf.random_normal([rnn_unit, 1]))
    }
biases = {
    'in': tf.Variable(tf.constant(0.1, shape=[rnn_unit, ])),
    'out': tf.Variable(tf.constant(0.1, shape=[1, ]))
    }

def lstm(X):
    batch_size = tf.shape(X)[0]
    time_step = tf.shape(X)[1]
    w_in = weights['in']
    b_in = biases['in']
    input = tf.reshape(X, [-1, input_size])  # 需要将tensor转成2维进行计算，计算后的结果作为隐藏层的输入
    input_rnn = tf.matmul(input, w_in)+b_in
    # 将tensor转成3维，作为lstm cell的输入
    input_rnn = tf.reshape(input_rnn, [-1, time_step, rnn_unit])
    cell = tf.contrib.rnn.BasicLSTMCell(rnn_unit)
    init_state = cell.zero_state(batch_size, dtype=tf.float32)
    output_rnn, final_states = tf.nn.dynamic_rnn(
        cell, input_rnn, initial_state=init_state, dtype=tf.float32)
    output = tf.reshape(output_rnn, [-1, rnn_unit])
    w_out = weights['out']
    b_out = biases['out']
    pred = tf.matmul(output, w_out)+b_out
    return pred, final_states

构建训练Lstm的函数，用最小二乘loss作为目标函数来预测

def train_lstm(batch_size=14, time_step=2,epochs=epochs, train_begin=0, train_end=len(data)):
    X = tf.placeholder(tf.float32, shape=[None, time_step, input_size])
    Y = tf.placeholder(tf.float32, shape=[None, time_step, output_size])
    batch_index, train_x, train_y = get_train_data(batch_size, time_step, train_begin, train_end)
    with tf.variable_scope("sec_lstm"):
        pred, _ = lstm(X)
    loss = tf.reduce_mean(
        tf.square(tf.reshape(pred, [-1])-tf.reshape(Y, [-1])))
    train_op = tf.train.AdamOptimizer(lr).minimize(loss)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=15)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(epochs):  # 这个迭代次数，可以更改，越大预测效果会更好，但需要更长时间
            for step in range(len(batch_index)-1):
                _, loss_ = sess.run([train_op, loss], feed_dict={X: train_x[batch_index[
                                    step]:batch_index[step+1]], Y: train_y[batch_index[step]:batch_index[step+1]]})
            if (i+1)%50==0:
                print("Number of epochs:", i+1, " loss:", loss_)
                print("model_save: ", saver.save(sess, 'model_save/modle.ckpt'))
        print("The train has finished")

开始训练

with tf.variable_scope('train'):
    train_lstm()

获取测试集的数据

def test_get_test_data(time_step=14,data=data,test_begin=0):
    data_test = data[test_begin:]
    mean = np.mean(data_test, axis=0)
    std = np.std(data_test, axis=0)
    normalized_test_data = (data_test-mean)/std  # 标准化
    size = (len(normalized_test_data)+time_step-1)//time_step  # 有size个sample
    test_x = []
    for i in range(size-1):
        x = normalized_test_data[i*time_step:(i+1)*time_step, :11]        
        test_x.append(x.tolist())    
    test_x.append((normalized_test_data[(i+1)*time_step:, :11]).tolist())
    return test_x,mean,std

构建预测模型

# 预测模型
def prediction(time_step=14):
    X=tf.placeholder(tf.float32, shape=[None,time_step,input_size])
    test_x,mean,std = test_get_test_data(time_step=14,data=data,test_begin=0)
    
    with tf.variable_scope("sec_lstm",reuse=True):
        pred,_=lstm(X)
    saver=tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        #参数恢复
        module_file = tf.train.latest_checkpoint('model_save')
        saver.restore(sess, module_file)
        # 测试训练
        prev_seq=test_x[0]
        test_predict=[]
        for step in range(len(test_x)-1):   
          if step <= len(test_x)-1:
              prev_seq=test_x[step]
          
          prob=sess.run(pred,feed_dict={X:[prev_seq]})   
          

          predict=prob.reshape((-1))
          test_predict.extend(predict)
          
        test_y=np.array(test_y)*std[11]+mean[11]
        test_predict=np.array(test_predict)*std[11]+mean[11]
        acc=np.average(np.abs(test_predict[:-30]-test_y[:len(test_predict)-30]))  #mean absolute error
        print("The MAE of this predict:",acc)
        #以折线图表示结果
        plt.figure(figsize=(15,12))
        plt.plot(list(range(len(test_predict))), test_predict, color='b',label = 'prediction')
        plt.plot(list(range(len(test_y))), test_y,  color='r',label = 'origin')
        plt.legend(fontsize=15)
        plt.show()
        return test_predict

开始预测

with tf.variable_scope('train',reuse=True):
    result = prediction()

得到结果预测图

在这里插入图片描述
赎回的原始数据和预测数据的情况

赎回的原始数据和预测数据的情况

回测的效果看起来不错，不过如果要用这个方法做未来30天的预测的话，也就需要能得到特征值尽量准确或者提高模型的泛化能力。

参考了以下博客的框架：
Tensorflow实例：利用LSTM预测股票每日最高价（一）