Building an LSTM Stock Prediction Model with Multiple Variables (Features) and Multiple Input/Output Timesteps (Keras, TensorFlow, Python)

There is plenty of LSTM example code online, but the annoying thing is that most of it uses the features at v(t-n) to predict v(t). When someone actually needs to predict v(t+1), v(t+2), ..., v(t+n), they often fall back on the clumsy approach of predicting v(t) first, then feeding v(t) back in to predict v(t+1), and so on, chipping away one step at a time. Since I'm writing this anyway, let's do something different: today's model uses v(t-n) to directly predict the feature value of interest at time v(t+m).

Model background: once the market opens, every stock produces several important features each day, for example the stock code index_code, the date date, the opening price open, the closing price close, the daily high and low high/low, the trading volume volume, the trade amount money, and the turnover rate change. Today's model uses these features to predict a stock's future opening price: the features of the previous three days (in_timestep=3) predict the opening price on the fourth day (out_timestep=1). The model is built with the Keras framework on a TensorFlow backend. The test stock is sh000001, with 4091 training samples and 2014 test samples, covering roughly a dozen years of data. If you have no data at hand, don't worry: after reading through, go to step 10 and write a small crawler to grab some (the headers in the crawled files differ from my example, so adjust them in your code before use).
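Before diving in, it helps to fix the shapes in your head: one training sample stacks three days of features into a single input, and the target is the fourth day's opening price. A minimal sketch, assuming 7 features remain after dropping index_code and date in step 2 (the exact count depends on your CSV):

import numpy as np

X_sample = np.zeros((3, 7))   # days t-3, t-2, t-1; 7 features each (dummy values)
y_sample = 0.0                # the open price on day t is the target
print(X_sample.shape)         # (3, 7)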

Enough chit-chat, let's get started.

Step 1: Import the required libraries

import numpy as np
import time
import argparse
import json
from math import sqrt, ceil
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

Step 2: Load the data. I work from a raw .csv file whose columns are the features listed above (if your own data has different column names, renaming them is enough). The loading function takes a few parameters: file_path, the path to the file; header_row_index, the row index of the column headers (usually the first row, 0); index_col_name, simply set to None here; col_to_predict, the name of the column to predict; and cols_to_drop, the columns to discard, here index_code and date. If you'd rather not pass so many arguments, use the commented-out definition with default parameter values instead. While loading, the unwanted columns are dropped and the column to predict is moved to position 0. The function returns col_names, the feature column names in use; values, the feature values (as float32); values.shape[1], the number of features used by the model; and output_col_name, the name of the predicted feature (open in this example).

# load data set
#def load_dataset(col_to_predict, file_path='dataset.csv', header_row_index=0, index_col_name=None, cols_to_drop=None):
def _load_dataset(file_path, header_row_index, index_col_name, col_to_predict, cols_to_drop):
    
    """
    file_path: the csv file path
    header_row_index: the header row index in the csv file
    index_col_name: the index column (can be None if no index is there)
    col_to_predict: the column name/index to predict
    cols_to_drop: the column names/indices to drop (single label or list-like)
    """
    # read dataset from disk
    dataset = read_csv(file_path, header=header_row_index, index_col=False)
    #print(dataset)

    # set the index column (pass a column name, or None to skip)
    if index_col_name:
        dataset.set_index(index_col_name, inplace=True)
    
    # drop unused columns (pass a list of column names)
    if cols_to_drop:
        dataset.drop(cols_to_drop, axis=1, inplace=True)
    
    #print('\nprint data set again\n',dataset)
    # get rows and column names
    col_names = dataset.columns.values.tolist()
    values = dataset.values
    #print(col_names, '\n values\n', values)
    
    # move the column to predict to the first position
    col_to_predict_index = col_to_predict if type(col_to_predict) == int else col_names.index(col_to_predict)
    output_col_name = col_names[col_to_predict_index]
    if col_to_predict_index > 0:
        col_names = [col_names[col_to_predict_index]] + col_names[:col_to_predict_index] + col_names[col_to_predict_index+1:]
    values = np.concatenate((values[:, col_to_predict_index].reshape((values.shape[0], 1)), values[:,:col_to_predict_index], values[:,col_to_predict_index+1:]), axis=1)
    #print(col_names, '\n values2\n', values)
    # ensure all data is float
    values = values.astype("float32")
    #print(col_names, '\n values3\n', values)
    return col_names, values,values.shape[1], output_col_name
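The column shuffle at the end is the only subtle part: the target column is moved to index 0 in both col_names and values. A minimal standalone sketch of that reordering, using toy data and hypothetical column names:

import numpy as np

col_names = ['close', 'open', 'high']            # toy frame; we want to predict 'open'
values = np.array([[10.0, 11.0, 12.0]])
idx = col_names.index('open')                    # 1
col_names = [col_names[idx]] + col_names[:idx] + col_names[idx+1:]
values = np.concatenate((values[:, idx:idx+1], values[:, :idx], values[:, idx+1:]), axis=1)
print(col_names, values)                         # ['open', 'close', 'high'] [[11. 10. 12.]]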

Step 3: Scale the data; scale_range in this example is (0, 1).

# scale dataset
#def _scale_dataset(values, scale_range = (0,1)):
def _scale_dataset(values, scale_range):
    """
    values: dataset values
    scale_range: scale range to fit data in
    """
    # normalize features
    scaler = MinMaxScaler(feature_range=scale_range or (0, 1))
    scaled = scaler.fit_transform(values)

    return (scaler, scaled)
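MinMaxScaler works per column: each feature is mapped into the target range independently, and the fitted scaler object is kept so that predictions can be mapped back to real prices later (step 7). A quick round-trip sketch with made-up numbers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

demo = np.array([[2000.0, 1e8], [2500.0, 3e8], [3000.0, 2e8]], dtype="float32")
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(demo)              # each column scaled to [0, 1] on its own
print(scaled[:, 0])                              # [0.  0.5 1. ]
restored = scaler.inverse_transform(scaled)      # back to the original scale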

Step 4: Convert the data into a supervised-learning format. In this example we set n_in_timestep and n_out_timestep to 3 and 1, meaning we use all the feature values at times T-3, T-2 and T-1 to predict the open value at time T. The conversion fills part of the new frame with NaN; setting dropnan to True removes those rows outright.

# convert series to supervised learning (ex: var1(t)_row1 = var1(t-1)_row2); print the resulting frame once and it all becomes obvious
#def _series_to_supervised(values, n_in=3, n_out=1, dropnan=True, col_names=None, verbose=True):
def _series_to_supervised(values, n_in, n_out, dropnan, col_names, verbose):
    """
    values: dataset scaled values
    n_in: number of past time steps (lags) used as input; the same quantity as n_intervals later on
    n_out: number of time steps into the future to predict
    dropnan: whether to drop rows with NaN values after conversion to supervised learning
    col_names: name of columns for dataset
    verbose: whether to output some debug data
    """

    n_vars = 1 if type(values) is list else values.shape[1]
    if col_names is None: col_names = ["var%d" % (j+1) for j in range(n_vars)]
    df = DataFrame(values)
    cols, names = list(), list()

    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [("%s(t-%d)" % (col_names[j], i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))         # after both loops, cols is a list of shifted copies of the frame
        if i == 0:
            names += [("%s(t)" % (col_names[j])) for j in range(n_vars)]
        else:
            names += [("%s(t+%d)" % (col_names[j], i)) for j in range(n_vars)]

    # put it all together
    agg = concat(cols, axis=1)    # line the shifted frames up side by side: varA(t-n_in), varB(t-n_in) ... varA(t), varB(t) ... varA(t+n_out-1), varB(t+n_out-1)
    agg.columns = names

    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)

    if verbose:
        print("\nsupervised data shape:", agg.shape)
    return agg

Once this runs successfully, the data is reorganized into the supervised layout.
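Here is a toy run of _series_to_supervised with two made-up features (hypothetical column names a and b, n_in=2, n_out=1), so you can see the layout without the real data:

import numpy as np

toy = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]], dtype="float32")
agg = _series_to_supervised(toy, 2, 1, True, ["a", "b"], False)
print(agg)
#    a(t-2)  b(t-2)  a(t-1)  b(t-1)  a(t)  b(t)
# 2     1.0    10.0     2.0    20.0   3.0  30.0
# 3     2.0    20.0     3.0    30.0   4.0  40.0
# 4     3.0    30.0     4.0    40.0   5.0  50.0

Each row holds a full input window plus its target, which is exactly what the split in step 5 relies on.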

Step 5: Split the data into training and test sets, and each of those into inputs and targets. Most dataset splits follow this same routine, so I won't belabor it; see the code. The training set ends up with 4091 rows and the test set with 2014.

# split into train and test sets
#def _split_data_to_train_test_sets(values, n_features, n_intervals=3, train_percentage=0.67, verbose=True):
def _split_data_to_train_test_sets(values, n_intervals, n_features, train_percentage, verbose):
    """
    values: dataset supervised values
    n_intervals: number of time lags (intervals) to use in each neuron
    n_features: number of features (variables) per neuron
    train_percentage: percentage of train data related to the dataset series size; (1-train_percentage) will be for test data
    verbose: whether to output some debug data
    """

    n_train_intervals = ceil(values.shape[0] * train_percentage) # ceil(x) -> the smallest integer not less than x, e.g. ceil(2.001) = 3
    train = values[:n_train_intervals, :]
    test = values[n_train_intervals:, :]

    # split into input and outputs
    n_obs = n_intervals * n_features
    train_X, train_y = train[:, :n_obs], train[:, -n_features]  # train_y takes the column n_features from the end: exactly the column-0 target at time t + n_out_timestep - 1
                                                                # train_X at this point has shape [train.shape[0], timesteps * features]
    #print('before reshape\ntrain_X shape:', train_X.shape)
    test_X, test_y = test[:, :n_obs], test[:, -n_features]

    # reshape input to be 3D [samples, timesteps, features]
    train_X = train_X.reshape((train_X.shape[0], n_intervals, n_features))
    test_X = test_X.reshape((test_X.shape[0], n_intervals, n_features))

    if verbose:
        print("")
        print("train_X shape:", train_X.shape)
        print("train_y shape:", train_y.shape)
        print("test_X shape:", test_X.shape)
        print("test_y shape:", test_y.shape)

    return (train_X, train_y, test_X, test_y)
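It is worth checking the column arithmetic once. With n_in_timestep=3, n_out_timestep=1 and n_features features per day, the supervised frame from step 4 has 3*n_features + n_features columns; the first 3*n_features are the inputs, and counting n_features back from the end lands exactly on open(t), because the target was moved to column 0 in step 2. A sanity-check sketch with an illustrative n_features=7 (plug in your own value):

# sanity check of the slicing above (numbers are illustrative)
n_intervals, n_features = 3, 7
n_obs = n_intervals * n_features     # 21 input columns: every feature at t-3, t-2, t-1
total_cols = n_obs + 1 * n_features  # 28 columns in the supervised frame (n_out = 1)
target_col = total_cols - n_features # 21 -> the frame's open(t) column (target is column 0 per block)
print(n_obs, total_cols, target_col) # 21 28 21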

Step 6: Build the model

# create the nn model
#def _create_model(train_X, train_y, test_X, test_y, output_col_name, n_neurons=20, n_batch=50, n_epochs=60, is_stateful=False, has_memory_stack=False, loss_function='mae', optimizer_function='adam', draw_loss_plot=True, verbose=True):
def _create_model(train_X, train_y, test_X, test_y, n_neurons, n_batch, n_epochs, is_stateful, has_memory_stack, loss_function, optimizer_function, draw_loss_plot, output_col_name, verbose):
    """
    train_X: train inputs
    train_y: train targets
    test_X: test inputs
    test_y: test targets
    n_neurons: number of neurons for LSTM nn
    n_batch: nn batch size
    n_epochs: training epochs
    is_stateful: whether the model has memory states
    has_memory_stack: whether the model has memory stack
    loss_function: the model loss function evaluator
    optimizer_function: the loss optimizer function
    draw_loss_plot: whether to draw the loss history plot
    output_col_name: name of the output/target column to be predicted
    verbose: whether to output some debug data
    """

    # design network
    model = Sequential()

    if is_stateful:
        # calculate new compatible batch size
        for i in range(n_batch, 0, -1):
            if train_X.shape[0] % i == 0 and test_X.shape[0] % i == 0:
                if verbose and i != n_batch:
                    print("\n*In a stateful network, the batch size must evenly divide both the training and test set sizes; decreasing it to %d." % i)
                n_batch = i
                break

        model.add(LSTM(n_neurons, batch_input_shape=(n_batch, train_X.shape[1], train_X.shape[2]), stateful=True, return_sequences=has_memory_stack))
        if has_memory_stack:
            model.add(LSTM(n_neurons, batch_input_shape=(n_batch, train_X.shape[1], train_X.shape[2]), stateful=True))
    else:
        model.add(LSTM(n_neurons, input_shape=(train_X.shape[1], train_X.shape[2])))

    model.add(Dense(1))

    model.compile(loss=loss_function, optimizer=optimizer_function)

    if verbose:
        print("")

    # fit network
    losses = []
    val_losses = []
    if is_stateful:
        for i in range(n_epochs):
            history = model.fit(train_X, train_y, epochs=1, batch_size=n_batch, 
                                validation_data=(test_X, test_y), verbose=0, shuffle=False)

            if verbose:
                print("Epoch %d/%d" % (i + 1, n_epochs))
                print("loss: %f - val_loss: %f" % (history.history["loss"][0], history.history["val_loss"][0]))

            losses.append(history.history["loss"][0])
            val_losses.append(history.history["val_loss"][0])

            model.reset_states()
    else:
        history = model.fit(train_X, train_y, epochs=n_epochs, batch_size=n_batch, 
                            validation_data=(test_X, test_y), verbose=2 if verbose else 0, shuffle=False)
    
    
    if draw_loss_plot:
        pyplot.plot(history.history["loss"] if not is_stateful else losses, label="Train Loss (%s)" % output_col_name)
        pyplot.plot(history.history["val_loss"] if not is_stateful else val_losses, label="Test Loss (%s)" % output_col_name)
        pyplot.legend()
        pyplot.show()
    
    print(history.history)
    #model.save('./my_model_%s.h5'%datetime.datetime.now())
    return (model, n_batch)
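The batch-size loop at the top only matters for the stateful case: a stateful Keras LSTM needs a fixed batch size that evenly divides both set sizes, so the code walks down from n_batch to the largest compatible value. The same search in isolation, with made-up sizes:

# toy illustration of the compatible-batch-size search (made-up sizes)
train_rows, test_rows, n_batch = 120, 60, 50
for i in range(n_batch, 0, -1):
    if train_rows % i == 0 and test_rows % i == 0:
        print(i)   # 30: the largest batch size <= 50 dividing both
        break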

Step 7: Check the model's performance by predicting on the test set

# make a prediction
#def _make_prediction(model, train_X, train_y, test_X, test_y, compatible_n_batch, n_intervals, n_features, scaler, output_col_name, draw_prediction_fit_plot=True, verbose=True):
def _make_prediction(model, train_X, train_y, test_X, test_y, compatible_n_batch, n_intervals, n_features, scaler, draw_prediction_fit_plot, output_col_name, verbose):
    """
    train_X: train inputs
    train_y: train targets
    test_X: test inputs
    test_y: test targets
    compatible_n_batch: modified (compatible) nn batch size
    n_intervals: number of time lags (intervals) to use in each neuron
    n_features: number of features (variables) per neuron
    scaler: the scaler object used to invert transformation to real scale
    draw_prediction_fit_plot: whether to draw the predicted vs actual fit plot
    output_col_name: name of the output/target column to be predicted
    verbose: whether to output some debug data
    """

    if verbose:
        print("")

    yhat = model.predict(test_X, batch_size=compatible_n_batch, verbose = 1 if verbose else 0)
    test_X = test_X.reshape((test_X.shape[0], n_intervals*n_features))

    # invert scaling for forecast
    inv_yhat = np.concatenate((yhat, test_X[:, (1-n_features):]), axis=1)
    inv_yhat = scaler.inverse_transform(inv_yhat)
    inv_yhat = inv_yhat[:,0]

    # invert scaling for actual
    test_y = test_y.reshape((len(test_y), 1))
    inv_y = np.concatenate((test_y, test_X[:, (1-n_features):]), axis=1)
    inv_y = scaler.inverse_transform(inv_y)
    inv_y = inv_y[:,0]

    # calculate RMSE
    rmse = sqrt(mean_squared_error(inv_y, inv_yhat))

    # calculate average error percentage
    avg = np.average(inv_y)
    error_percentage = rmse / avg

    if verbose:
        print("")
        print("Test Root Mean Square Error: %.3f" % rmse)
        print("Test Average Value for %s: %.3f" % (output_col_name, avg))
        print("Test Average Error Percentage: %.2f/100.00" % (error_percentage * 100))

    if draw_prediction_fit_plot:
        pyplot.plot(inv_y, label="Actual (%s)" % output_col_name)
        pyplot.plot(inv_yhat, label="Predicted (%s)" % output_col_name)
        pyplot.legend()
        pyplot.show()

    return (inv_y, inv_yhat, rmse, error_percentage)
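One detail worth spelling out: scaler was fitted on all n_features columns, so inverse_transform insists on that exact width. The prediction is a single column, so the code pads it with the last n_features-1 columns of test_X before inverting, then keeps only column 0. Because MinMaxScaler is strictly per-column, the filler values never affect the inverted prediction. A shape-only sketch of the idea (dummy values, illustrative n_features):

import numpy as np

n_features = 7
yhat = np.zeros((5, 1))                         # model output: one scaled column
filler = np.zeros((5, n_features - 1))          # any same-width columns will do
inv = np.concatenate((yhat, filler), axis=1)    # (5, 7): now scaler.inverse_transform(inv) works
# after inverting, only inv[:, 0] (the prediction) is kept; the filler columns are discarded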

Step 8: The main program. Ready for takeoff!

#!input
file_path= 'data2_2_2.csv'
header_row_index = 0
index_col_name = None

col_to_predict ='open'
cols_to_drop = ['index_code','date']



col_names, values,n_features, output_col_name = _load_dataset(file_path, header_row_index, 
                                                              index_col_name, col_to_predict, cols_to_drop)
scaler, values = _scale_dataset(values, None)
print('values before _series_to_supervised\n', values, '\nvalue shape:', values.shape)

#!input
n_in_timestep = 3
n_out_timestep = 1
verbose = 2
dropnan = True
agg1 = _series_to_supervised(values, n_in_timestep, n_out_timestep, dropnan, col_names, verbose)
#agg2 = _series_to_supervised(values, 1, 2, dropnan, col_names, verbose)
#agg3 = _series_to_supervised(values, 2, 1, dropnan, col_names, verbose)
#agg4 = _series_to_supervised(values, 3, 2, dropnan, col_names, verbose)

'''
#if you're not sure what n_in and n_out do in _series_to_supervised, uncomment and print the column lists below and it becomes clear
print('agg1:\n', agg1.columns)
print('agg2:\n', agg2.columns)
print('agg3:\n', agg3.columns)
print('agg4:\n', agg4.columns)
#print(agg1)
agg3
'''
print('agg1.values:\n', agg1.values, '\nagg1.shape:', agg1.shape, '\nagg1.columns:', agg1.columns)   # note: agg1 and agg1.values differ; agg1 is a DataFrame, agg1.values is a np.array
#print('\nagg1\n', agg1)

#!input
train_percentage = 0.67
train_X, train_Y, test_X, test_Y =_split_data_to_train_test_sets(agg1.values, n_in_timestep, n_features, 
                                                                 train_percentage, verbose)

#!input
n_neurons=20
n_batch=50
n_epochs=60
is_stateful=False
has_memory_stack=False
loss_function='mae'
optimizer_function='adam'
draw_loss_plot=True
model, compatible_n_batch = _create_model(train_X, train_Y, test_X, test_Y, n_neurons, n_batch, n_epochs, 
                                          is_stateful, has_memory_stack, loss_function, optimizer_function, 
                                          draw_loss_plot, output_col_name, verbose)
#model.save('./my_model_%s.h5'%datetime.datetime.now())
model.save('./my_model_in_timestep_%d_out_timestep_%d.h5' % (n_in_timestep, n_out_timestep))
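In a later session the saved file can be reloaded instead of retraining; a sketch, assuming the run above with n_in_timestep=3 and n_out_timestep=1:

from keras.models import load_model

model = load_model('./my_model_in_timestep_3_out_timestep_1.h5')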

#!input
draw_prediction_fit_plot = True
actual_target, predicted_target, error_value, error_percentage = _make_prediction(model, train_X, train_Y, 
                                                                                  test_X, test_Y, compatible_n_batch, 
                                                                                  n_in_timestep, n_features, scaler, 
                                                                                  draw_prediction_fit_plot, output_col_name, 
                                                                                  verbose)

Step 9: Results. The model does fairly well: across all 2014 test samples, the predicted values follow the overall trend of the actual values. Set different n_in_timestep and n_out_timestep values to compare the results, and have fun with it.

Step 10: If you have no historical stock data at hand, no problem, here is a small crawler to scrape some (the headers in the files it downloads differ from those in my example; remember to adjust them in your code before use, as sketched after the crawler code).

# import the modules we need
import urllib.request
import re
import pandas as pd
import os

# fetch a web page
def getHtml(url):
    html = urllib.request.urlopen(url).read()
    html = html.decode('gbk')
    return html

# extract the stock codes from the page
def getStackCode(html):
    s = r'<li><a target="_blank" href="http://quote.eastmoney.com/\S\S(.*?).html">'
    pat = re.compile(s)
    code = pat.findall(html)
    return code

Url = 'http://quote.eastmoney.com/stocklist.html'  # Eastmoney stock list page
filepath = 'C:\\Users\\rihang\\Desktop\\my data\\my project\\stock2 prediction\\stock data\\'  # directory for the downloaded data files
# run the scrape
code = getStackCode(getHtml(Url))
# collect all stock codes starting with 6 (these should be Shanghai-market stocks)
CodeList = []
for item in code:
    if item[0] == '6':
        CodeList.append(item)
# download each stock's data and save it to a local csv file
for code in CodeList:
    print('Fetching data for stock %s' % code)
    url = 'http://quotes.money.163.com/service/chddata.html?code=0'+code+\
        '&end=20161231&fields=TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP'
    urllib.request.urlretrieve(url, filepath+code+'.csv')

I successfully pulled down about 30 files before the server apparently figured out it was a crawler and started refusing requests. Still, even one file is enough to play with the program. Have fun.
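Since the headers in the downloaded files will not match the ones this article expects, here is a renaming sketch. The Chinese header names in the mapping and the file path are assumptions; open one downloaded file first and adjust the mapping to the headers you actually see:

import pandas as pd

# hypothetical file path; the downloaded files are typically gbk-encoded
df = pd.read_csv('stock data/600000.csv', encoding='gbk')
df = df.rename(columns={        # left side: headers assumed to be in the file; adjust as needed
    '日期': 'date', '股票代码': 'index_code', '开盘价': 'open',
    '收盘价': 'close', '最高价': 'high', '最低价': 'low',
    '成交量': 'volume', '成交金额': 'money', '换手率': 'change',
})
df.to_csv('data_renamed.csv', index=False)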

If you've read this far, a like is appreciated; if you found it useful, feel free to leave a tip.
