Building an LSTM Stock Prediction Model with Multiple Variables (Features) and Multiple Input/Output Timesteps (Keras, TensorFlow, Python)

There is plenty of LSTM example code online, but the annoying thing is that most of it uses the features at v(t-n) to predict v(t). When someone actually needs to predict v(t+1), v(t+2), ..., v(t+n), they often fall back on the clumsy approach of predicting v(t) first, then feeding v(t) back in to predict v(t+1), and so on, chipping away one step at a time. Since I'm writing this anyway, let's do something different: today's model uses v(t-n) to directly predict the feature value of interest at time v(t+m).

Model background: once the market opens, every stock produces several important features each day, for example the stock code index_code, the date date, the opening price open, the closing price close, the daily high and low high/low, the trading volume volume, the trade amount money, and the turnover rate change. Today's model uses these features to predict a stock's future opening price: the features of the previous three days (in_timestep=3) predict the opening price on the fourth day (out_timestep=1). The model is built with the Keras framework on a TensorFlow backend. The test stock is sh000001, with 4091 training samples and 2014 test samples, covering roughly a dozen years of data. If you have no data at hand, don't worry: after reading through, go to step 10 and write a small crawler to grab some (the headers in the crawled files differ from my example, so adjust them in your code before use).
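Before diving in, it helps to fix the shapes in your head: one training sample stacks three days of features into a single input, and the target is the fourth day's opening price. A minimal sketch, assuming 7 features remain after dropping index_code and date in step 2 (the exact count depends on your CSV):

import numpy as np

X_sample = np.zeros((3, 7))   # days t-3, t-2, t-1; 7 features each (dummy values)
y_sample = 0.0                # the open price on day t is the target
print(X_sample.shape)         # (3, 7)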

Enough chit-chat, let's get started.

Step 1: Import the required libraries

import numpy as np
import time
import argparse
import json
from math import sqrt, ceil
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

Step 2: Load the data. I work from a raw .csv file whose columns are the features listed above (if your own data has different column names, renaming them is enough). The loading function takes a few parameters: file_path, the path to the file; header_row_index, the row index of the column headers (usually the first row, 0); index_col_name, simply set to None here; col_to_predict, the name of the column to predict; and cols_to_drop, the columns to discard, here index_code and date. If you'd rather not pass so many arguments, use the commented-out definition with default parameter values instead. While loading, the unwanted columns are dropped and the column to predict is moved to position 0. The function returns col_names, the feature column names in use; values, the feature values (as float32); values.shape[1], the number of features used by the model; and output_col_name, the name of the predicted feature (open in this example).

# load data set
#def load_dataset(col_to_predict, file_path='dataset.csv', header_row_index=0, index_col_name=None, cols_to_drop=None):
def _load_dataset(file_path, header_row_index, index_col_name, col_to_predict, cols_to_drop):
    
    """
    file_path: the csv file path
    header_row_index: the header row index in the csv file
    index_col_name: the index column (can be None if no index is there)
    col_to_predict: the column name/index to predict
    cols_to_drop: the column names/indices to drop (single label or list-like)
    """
    # read dataset from disk
    dataset = read_csv(file_path, header=header_row_index, index_col=False)
    #print(dataset)

    # set the index column (pass a column name, or None to skip)
    if index_col_name:
        dataset.set_index(index_col_name, inplace=True)
    
    # drop unused columns (pass a list of column names)
    if cols_to_drop:
        dataset.drop(cols_to_drop, axis=1, inplace=True)
    
    #print('\nprint data set again\n',dataset)
    # get rows and column names
    col_names = dataset.columns.values.tolist()
    values = dataset.values
    #print(col_names, '\n values\n', values)
    
    # move the column to predict to the first position
    col_to_predict_index = col_to_predict if type(col_to_predict) == int else col_names.index(col_to_predict)
    output_col_name = col_names[col_to_predict_index]
    if col_to_predict_index > 0:
        col_names = [col_names[col_to_predict_index]] + col_names[:col_to_predict_index] + col_names[col_to_predict_index+1:]
    values = np.concatenate((values[:, col_to_predict_index].reshape((values.shape[0], 1)), values[:,:col_to_predict_index], values[:,col_to_predict_index+1:]), axis=1)
    #print(col_names, '\n values2\n', values)
    # ensure all data is float
    values = values.astype("float32")
    #print(col_names, '\n values3\n', values)
    return col_names, values,values.shape[1], output_col_name
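The column shuffle at the end is the only subtle part: the target column is moved to index 0 in both col_names and values. A minimal standalone sketch of that reordering, using toy data and hypothetical column names:

import numpy as np

col_names = ['close', 'open', 'high']            # toy frame; we want to predict 'open'
values = np.array([[10.0, 11.0, 12.0]])
idx = col_names.index('open')                    # 1
col_names = [col_names[idx]] + col_names[:idx] + col_names[idx+1:]
values = np.concatenate((values[:, idx:idx+1], values[:, :idx], values[:, idx+1:]), axis=1)
print(col_names, values)                         # ['open', 'close', 'high'] [[11. 10. 12.]]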

Step 3: Scale the data; scale_range in this example is (0, 1).

# scale dataset
#def _scale_dataset(values, scale_range = (0,1)):
def _scale_dataset(values, scale_range):
    """
    values: dataset values
    scale_range: scale range to fit data in
    """
    # normalize features
    scaler = MinMaxScaler(feature_range=scale_range or (0, 1))
    scaled = scaler.fit_transform(values)

    return (scaler, scaled)
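MinMaxScaler works per column: each feature is mapped into the target range independently, and the fitted scaler object is kept so that predictions can be mapped back to real prices later (step 7). A quick round-trip sketch with made-up numbers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

demo = np.array([[2000.0, 1e8], [2500.0, 3e8], [3000.0, 2e8]], dtype="float32")
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(demo)              # each column scaled to [0, 1] on its own
print(scaled[:, 0])                              # [0.  0.5 1. ]
restored = scaler.inverse_transform(scaled)      # back to the original scale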

Step 4: Convert the data into a supervised-learning format. In this example we set n_in_timestep and n_out_timestep to 3 and 1, meaning we use all the feature values at times T-3, T-2 and T-1 to predict the open value at time T. The conversion fills part of the new frame with NaN; setting dropnan to True removes those rows outright.

# convert series to supervised learning (ex: var1(t)_row1 = var1(t-1)_row2); print the resulting frame once and it all becomes obvious
#def _series_to_supervised(values, n_in=3, n_out=1, dropnan=True, col_names=None, verbose=True):
def _series_to_supervised(values, n_in, n_out, dropnan, col_names, verbose):
    """
    values: dataset scaled values
    n_in: number of past time steps (lags) used as input; the same quantity as n_intervals later on
    n_out: number of time steps into the future to predict
    dropnan: whether to drop rows with NaN values after conversion to supervised learning
    col_names: name of columns for dataset
    verbose: whether to output some debug data
    """

    n_vars = 1 if type(values) is list else values.shape[1]
    if col_names is None: col_names = ["var%d" % (j+1) for j in range(n_vars)]
    df = DataFrame(values)
    cols, names = list(), list()

    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [("%s(t-%d)" % (col_names[j], i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))         # after both loops, cols is a list of shifted copies of the frame
        if i == 0:
            names += [("%s(t)" % (col_names[j])) for j in range(n_vars)]
        else:
            names += [("%s(t+%d)" % (col_names[j], i)) for j in range(n_vars)]

    # put it all together
    agg = concat(cols, axis=1)    # line the shifted frames up side by side: varA(t-n_in), varB(t-n_in) ... varA(t), varB(t) ... varA(t+n_out-1), varB(t+n_out-1)
    agg.columns = names

    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)

    if verbose:
        print("\nsupervised data shape:", agg.shape)
    return agg

Once this runs successfully, the data is reorganized into the supervised layout.
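Here is a toy run of _series_to_supervised with two made-up features (hypothetical column names a and b, n_in=2, n_out=1), so you can see the layout without the real data:

import numpy as np

toy = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]], dtype="float32")
agg = _series_to_supervised(toy, 2, 1, True, ["a", "b"], False)
print(agg)
#    a(t-2)  b(t-2)  a(t-1)  b(t-1)  a(t)  b(t)
# 2     1.0    10.0     2.0    20.0   3.0  30.0
# 3     2.0    20.0     3.0    30.0   4.0  40.0
# 4     3.0    30.0     4.0    40.0   5.0  50.0

Each row holds a full input window plus its target, which is exactly what the split in step 5 relies on.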

Step 5: Split the data into training and test sets, and each of those into inputs and targets. Most dataset splits follow this same routine, so I won't belabor it; see the code. The training set ends up with 4091 rows and the test set with 2014.

# split into train and test sets
#def _split_data_to_train_test_sets(values, n_features, n_intervals=3, train_percentage=0.67, verbose=True):
def _split_data_to_train_test_sets(values, n_intervals, n_features, train_percentage, verbose):
    """
    values: dataset supervised values
    n_intervals: number of time lags (intervals) to use in each neuron
    n_features: number of features (variables) per neuron
    train_percentage: percentage of train data related to the dataset series size; (1-train_percentage) will be for test data
    verbose: whether to output some debug data
    """

    n_train_intervals = ceil(values.shape[0] * train_percentage) # ceil(x) -> the smallest integer not less than x, e.g. ceil(2.001) = 3
    train = values[:n_train_intervals, :]
    test = values[n_train_intervals:, :]

    # split into input and outputs
    n_obs = n_intervals * n_features
    train_X, train_y = train[:, :n_obs], train[:, -n_features]  # train_y takes the column n_features from the end: exactly the column-0 target at time t + n_out_timestep - 1
                                                                # train_X at this point has shape [train.shape[0], timesteps * features]
    #print('before reshape\ntrain_X shape:', train_X.shape)
    test_X, test_y = test[:, :n_obs], test[:, -n_features]

    # reshape input to be 3D [samples, timesteps, features]
    train_X = train_X.reshape((train_X.shape[0], n_intervals, n_features))
    test_X = test_X.reshape((test_X.shape[0], n_intervals, n_features))

    if verbose:
        print("")
        print("train_X shape:", train_X.shape)
        print("train_y shape:", train_y.shape)
        print("test_X shape:", test_X.shape)
        print("test_y shape:", test_y.shape)

    return (train_X, train_y, test_X, test_y)
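It is worth checking the column arithmetic once. With n_in_timestep=3, n_out_timestep=1 and n_features features per day, the supervised frame from step 4 has 3*n_features + n_features columns; the first 3*n_features are the inputs, and counting n_features back from the end lands exactly on open(t), because the target was moved to column 0 in step 2. A sanity-check sketch with an illustrative n_features=7 (plug in your own value):

# sanity check of the slicing above (numbers are illustrative)
n_intervals, n_features = 3, 7
n_obs = n_intervals * n_features     # 21 input columns: every feature at t-3, t-2, t-1
total_cols = n_obs + 1 * n_features  # 28 columns in the supervised frame (n_out = 1)
target_col = total_cols - n_features # 21 -> the frame's open(t) column (target is column 0 per block)
print(n_obs, total_cols, target_col) # 21 28 21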

Step 6: Build the model

# create the nn model
#def _create_model(train_X, train_y, test_X, test_y, output_col_name, n_neurons=20, n_batch=50, n_epochs=60, is_stateful=False, has_memory_stack=False, loss_function='mae', optimizer_function='adam', draw_loss_plot=True, verbose=True):
def _create_model(train_X, train_y, test_X, test_y, n_neurons, n_batch, n_epochs, is_stateful, has_memory_stack, loss_function, optimizer_function, draw_loss_plot, output_col_name, verbose):
    """
    train_X: train inputs
    train_y: train targets
    test_X: test inputs
    test_y: test targets
    n_neurons: number of neurons for LSTM nn
    n_batch: nn batch size
    n_epochs: training epochs
    is_stateful: whether the model has memory states
    has_memory_stack: whether the model has memory stack
    loss_function: the model loss function evaluator
    optimizer_function: the loss optimizer function
    draw_loss_plot: whether to draw the loss history plot
    output_col_name: name of the output/target column to be predicted
    verbose: whether to output some debug data
    """

    # design network
    model = Sequential()

    if is_stateful:
        # calculate new compatible batch size
        for i in range(n_batch, 0, -1):
            if train_X.shape[0] % i == 0 and test_X.shape[0] % i == 0:
                if verbose and i != n_batch:
                    print("\n*In a stateful network, the batch size must evenly divide both the training and test set sizes; decreasing it to %d." % i)
                n_batch = i
                break

        model.add(LSTM(n_neurons, batch_input_shape=(n_batch, train_X.shape[1], train_X.shape[2]), stateful=True, return_sequences=has_memory_stack))
        if has_memory_stack:
            model.add(LSTM(n_neurons, batch_input_shape=(n_batch, train_X.shape[1], train_X.shape[2]), stateful=True))
    else:
        model.add(LSTM(n_neurons, input_shape=(train_X.shape[1], train_X.shape[2])))

    model.add(Dense(1))

    model.compile(loss=loss_function, optimizer=optimizer_function)

    if verbose:
        print("")

    # fit network
    losses = []
    val_losses = []
    if is_stateful:
        for i in range(n_epochs):
            history = model.fit(train_X, train_y, epochs=1, batch_size=n_batch, 
                                validation_data=(test_X, test_y), verbose=0, shuffle=False)

            if verbose:
                print("Epoch %d/%d" % (i + 1, n_epochs))
                print("loss: %f - val_loss: %f" % (history.history["loss"][0], history.history["val_loss"][0]))

            losses.append(history.history["loss"][0])
            val_losses.append(history.history["val_loss"][0])

            model.reset_states()
    else:
        history = model.fit(train_X, train_y, epochs=n_epochs, batch_size=n_batch, 
                            validation_data=(test_X, test_y), verbose=2 if verbose else 0, shuffle=False)
    
    
    if draw_loss_plot:
        pyplot.plot(history.history["loss"] if not is_stateful else losses, label="Train Loss (%s)" % output_col_name)
        pyplot.plot(history.history["val_loss"] if not is_stateful else val_losses, label="Test Loss (%s)" % output_col_name)
        pyplot.legend()
        pyplot.show()
    
    print(history.history)
    #model.save('./my_model_%s.h5'%datetime.datetime.now())
    return (model, n_batch)
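The batch-size loop at the top only matters for the stateful case: a stateful Keras LSTM needs a fixed batch size that evenly divides both set sizes, so the code walks down from n_batch to the largest compatible value. The same search in isolation, with made-up sizes:

# toy illustration of the compatible-batch-size search (made-up sizes)
train_rows, test_rows, n_batch = 120, 60, 50
for i in range(n_batch, 0, -1):
    if train_rows % i == 0 and test_rows % i == 0:
        print(i)   # 30: the largest batch size <= 50 dividing both
        break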

Step 7: Check the model's performance by predicting on the test set

# make a prediction
#def _make_prediction(model, train_X, train_y, test_X, test_y, compatible_n_batch, n_intervals, n_features, scaler, output_col_name, draw_prediction_fit_plot=True, verbose=True):
def _make_prediction(model, train_X, train_y, test_X, test_y, compatible_n_batch, n_intervals, n_features, scaler, draw_prediction_fit_plot, output_col_name, verbose):
    """
    train_X: train inputs
    train_y: train targets
    test_X: test inputs
    test_y: test targets
    compatible_n_batch: modified (compatible) nn batch size
    n_intervals: number of time lags (intervals) to use in each neuron
    n_features: number of features (variables) per neuron
    scaler: the scaler object used to invert transformation to real scale
    draw_prediction_fit_plot: whether to draw the predicted vs actual fit plot
    output_col_name: name of the output/target column to be predicted
    verbose: whether to output some debug data
    """

    if verbose:
        print("")

    yhat = model.predict(test_X, batch_size=compatible_n_batch, verbose = 1 if verbose else 0)
    test_X = test_X.reshape((test_X.shape[0], n_intervals*n_features))

    # invert scaling for forecast
    inv_yhat = np.concatenate((yhat, test_X[:, (1-n_features):]), axis=1)
    inv_yhat = scaler.inverse_transform(inv_yhat)
    inv_yhat = inv_yhat[:,0]

    # invert scaling for actual
    test_y = test_y.reshape((len(test_y), 1))
    inv_y = np.concatenate((test_y, test_X[:, (1-n_features):]), axis=1)
    inv_y = scaler.inverse_transform(inv_y)
    inv_y = inv_y[:,0]

    # calculate RMSE
    rmse = sqrt(mean_squared_error(inv_y, inv_yhat))

    # calculate average error percentage
    avg = np.average(inv_y)
    error_percentage = rmse / avg

    if verbose:
        print("")
        print("Test Root Mean Square Error: %.3f" % rmse)
        print("Test Average Value for %s: %.3f" % (output_col_name, avg))
        print("Test Average Error Percentage: %.2f/100.00" % (error_percentage * 100))

    if draw_prediction_fit_plot:
        pyplot.plot(inv_y, label="Actual (%s)" % output_col_name)
        pyplot.plot(inv_yhat, label="Predicted (%s)" % output_col_name)
        pyplot.legend()
        pyplot.show()

    return (inv_y, inv_yhat, rmse, error_percentage)
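One detail worth spelling out: scaler was fitted on all n_features columns, so inverse_transform insists on that exact width. The prediction is a single column, so the code pads it with the last n_features-1 columns of test_X before inverting, then keeps only column 0. Because MinMaxScaler is strictly per-column, the filler values never affect the inverted prediction. A shape-only sketch of the idea (dummy values, illustrative n_features):

import numpy as np

n_features = 7
yhat = np.zeros((5, 1))                         # model output: one scaled column
filler = np.zeros((5, n_features - 1))          # any same-width columns will do
inv = np.concatenate((yhat, filler), axis=1)    # (5, 7): now scaler.inverse_transform(inv) works
# after inverting, only inv[:, 0] (the prediction) is kept; the filler columns are discarded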

Step 8: The main program. Ready for takeoff!

#!input
file_path= 'data2_2_2.csv'
header_row_index = 0
index_col_name = None

col_to_predict ='open'
cols_to_drop = ['index_code','date']



col_names, values,n_features, output_col_name = _load_dataset(file_path, header_row_index, 
                                                              index_col_name, col_to_predict, cols_to_drop)
scaler, values = _scale_dataset(values, None)
print('values before _series_to_supervised\n', values, '\nvalue shape:', values.shape)

#!input
n_in_timestep = 3
n_out_timestep = 1
verbose = 2
dropnan = True
agg1 = _series_to_supervised(values, n_in_timestep, n_out_timestep, dropnan, col_names, verbose)
#agg2 = _series_to_supervised(values, 1, 2, dropnan, col_names, verbose)
#agg3 = _series_to_supervised(values, 2, 1, dropnan, col_names, verbose)
#agg4 = _series_to_supervised(values, 3, 2, dropnan, col_names, verbose)

'''
#if you're not sure what n_in and n_out do in _series_to_supervised, uncomment and print the column lists below and it becomes clear
print('agg1:\n', agg1.columns)
print('agg2:\n', agg2.columns)
print('agg3:\n', agg3.columns)
print('agg4:\n', agg4.columns)
#print(agg1)
agg3
'''
print('agg1.values:\n', agg1.values, '\nagg1.shape:', agg1.shape, '\nagg1.columns:', agg1.columns)   # note: agg1 and agg1.values differ; agg1 is a DataFrame, agg1.values is a np.array
#print('\nagg1\n', agg1)

#!input
train_percentage = 0.67
train_X, train_Y, test_X, test_Y =_split_data_to_train_test_sets(agg1.values, n_in_timestep, n_features, 
                                                                 train_percentage, verbose)

#!input
n_neurons=20
n_batch=50
n_epochs=60
is_stateful=False
has_memory_stack=False
loss_function='mae'
optimizer_function='adam'
draw_loss_plot=True
model, compatible_n_batch = _create_model(train_X, train_Y, test_X, test_Y, n_neurons, n_batch, n_epochs, 
                                          is_stateful, has_memory_stack, loss_function, optimizer_function, 
                                          draw_loss_plot, output_col_name, verbose)
#model.save('./my_model_%s.h5'%datetime.datetime.now())
model.save('./my_model_in_timestep_%d_out_timestep_%d.h5' % (n_in_timestep, n_out_timestep))
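In a later session the saved file can be reloaded instead of retraining; a sketch, assuming the run above with n_in_timestep=3 and n_out_timestep=1:

from keras.models import load_model

model = load_model('./my_model_in_timestep_3_out_timestep_1.h5')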

#!input
draw_prediction_fit_plot = True
actual_target, predicted_target, error_value, error_percentage = _make_prediction(model, train_X, train_Y, 
                                                                                  test_X, test_Y, compatible_n_batch, 
                                                                                  n_in_timestep, n_features, scaler, 
                                                                                  draw_prediction_fit_plot, output_col_name, 
                                                                                  verbose)

Step 9: Results. The model does fairly well: across all 2014 test samples, the predicted values follow the overall trend of the actual values. Set different n_in_timestep and n_out_timestep values to compare the results, and have fun with it.

Step 10: If you have no historical stock data at hand, no problem, here is a small crawler to scrape some (the headers in the files it downloads differ from those in my example; remember to adjust them in your code before use, as sketched after the crawler code).

# import the modules we need
import urllib.request
import re
import pandas as pd
import os

# fetch a web page
def getHtml(url):
    html = urllib.request.urlopen(url).read()
    html = html.decode('gbk')
    return html

# extract the stock codes from the page
def getStackCode(html):
    s = r'<li><a target="_blank" href="http://quote.eastmoney.com/\S\S(.*?).html">'
    pat = re.compile(s)
    code = pat.findall(html)
    return code

Url = 'http://quote.eastmoney.com/stocklist.html'  # Eastmoney stock list page
filepath = 'C:\\Users\\rihang\\Desktop\\my data\\my project\\stock2 prediction\\stock data\\'  # directory for the downloaded data files
# run the scrape
code = getStackCode(getHtml(Url))
# collect all stock codes starting with 6 (these should be Shanghai-market stocks)
CodeList = []
for item in code:
    if item[0] == '6':
        CodeList.append(item)
# download each stock's data and save it to a local csv file
for code in CodeList:
    print('Fetching data for stock %s' % code)
    url = 'http://quotes.money.163.com/service/chddata.html?code=0'+code+\
        '&end=20161231&fields=TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP'
    urllib.request.urlretrieve(url, filepath+code+'.csv')

I successfully pulled down about 30 files before the server apparently figured out it was a crawler and started refusing requests. Still, even one file is enough to play with the program. Have fun.
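Since the headers in the downloaded files will not match the ones this article expects, here is a renaming sketch. The Chinese header names in the mapping and the file path are assumptions; open one downloaded file first and adjust the mapping to the headers you actually see:

import pandas as pd

# hypothetical file path; the downloaded files are typically gbk-encoded
df = pd.read_csv('stock data/600000.csv', encoding='gbk')
df = df.rename(columns={        # left side: headers assumed to be in the file; adjust as needed
    '日期': 'date', '股票代码': 'index_code', '开盘价': 'open',
    '收盘价': 'close', '最高价': 'high', '最低价': 'low',
    '成交量': 'volume', '成交金额': 'money', '换手率': 'change',
})
df.to_csv('data_renamed.csv', index=False)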

If you've read this far, a like is appreciated; if you found it useful, feel free to leave a tip.
