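Deep Learning Project: Predicting Stock Prices with an LSTM Model (Includes a Brief Introduction to LSTM Networks; Code and Data Are Downloadable)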

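Preface

  • A few days ago I was reading a paper that I plan to reproduce. The paper uses an LSTM, so this article is my set of study notes on the LSTM model.
  • LSTM looks complicated at first, but building the network in code makes it feel manageable.
  • The data for this case study comes from GitHub. A netdisk link to the data and my code files is given just before the case study for anyone who wants to follow along. Corrections and criticism are welcome, so that we can learn together.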

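1. LSTM Explained

Since I have not studied RNNs yet, these notes focus on only two aspects of LSTM:

  • The three gates of an LSTM: the input gate, the forget gate, and the output gate.
  • The hidden layer of an LSTM holds a "hidden state" and a "memory cell". Only the hidden state is passed on to the output layer; the memory cell is purely internal information.

How LSTM mitigates vanishing and exploding gradients is left for later, once I have studied RNNs.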

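1. Network Structure

A simplified diagram of the LSTM network (too hard to draw in PowerPoint):

[Figure: simplified LSTM network diagram]

  • C: the memory cell. C_{t-1} is the previous cell state and C_t is the current cell state.
  • H: the hidden state.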

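2. Explanation

  1. Forget gate

    • Decides how much of the previous cell state to forget, based on the previous hidden state h_{t-1} and the current input x_t. For example: after finishing the calculus exam we start preparing for linear algebra, so on passing through this gate we choose to forget the calculus material (not that this is possible in real life).

    $$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$$

    • Here W_f is the weight matrix, b_f is the bias term, and σ is the sigmoid activation function. The entries of f_t lie in (0, 1) and determine how much of the previous cell state is discarded.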
  2. Input gate

    • i_t selects what to memorize. For example: when reviewing linear algebra, some material does not need to be memorized; this gate filters out the knowledge that is not useful.

    $$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$$
    $$\tilde{c}_t=\tanh(W_c\cdot[h_{t-1},x_t]+b_c)$$

    • Here W_i and W_c are weight matrices, b_i and b_c are bias terms, σ is the sigmoid activation function, and tanh is the hyperbolic tangent activation function, which produces the candidate cell state \tilde{c}_t.
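  3. Cell state

    • How much do we remember at this point? This update (which is not a gate itself) corresponds to how much knowledge is left in our head after one round of review: the part of the old memory kept by the forget gate plus the new material admitted by the input gate.

    $$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t$$

    • Here ⊙ is element-wise multiplication (the Hadamard product), which is used to update the cell state.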
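  4. Output gate

    • Controls how much of the cell state is exposed as the hidden state h_t, much like an exam score reflects only part of what we know. The length of h_t is the hidden dimension of the network.

    $$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$$
    $$h_t=o_t\odot\tanh(c_t)$$

    • Here W_o is the weight matrix, b_o is the bias term, and σ is the sigmoid activation function; tanh is the hyperbolic tangent activation function, applied to c_t to produce the hidden state of the current time step.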
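The four gate equations above can be written out directly as a single time step. The following is only a minimal NumPy sketch for illustration: the names `lstm_step`, `init_params`, and `params` are hypothetical, the weights are random, and the input window is made up. It is not how PyTorch implements `nn.LSTM` internally.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the forget/input/cell/output equations."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t


def init_params(input_dim, hidden_dim, seed=0):
    """Random weights and zero biases for the four gates (illustrative values only)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "c", "o"):
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params


params = init_params(input_dim=4, hidden_dim=3)
h, c = np.zeros(3), np.zeros(3)
for x_t in np.ones((6, 4)):   # a made-up window: 6 time steps, 4 features each
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Because h_t is the product of a sigmoid output and a tanh output, every entry of the hidden state stays strictly between -1 and 1, while the cell state c_t is not bounded in the same way.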

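3. Preface to the Case Study

That said, the most important part is to work through a hands-on case and see how the network is actually built in code. Below is a stock price prediction case; the core is how to build the LSTM network structure and how to run the forward pass.

2. Case Study

The data comes from GitHub. The data and my code files can be downloaded from the netdisk link below:

File shared via netdisk: 基于LSTM的股价预测(入门).zip (LSTM-based stock price prediction, introductory)
Link: https://pan.baidu.com/s/1ZXFLl_TrhReexyvb5Gp8Xg?pwd=v7t2 Extraction code: v7t2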

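1. Data Analysis

1. Importing Libraries

# Import the commonly used libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
# Configure matplotlib so that Chinese plot labels render correctly
from pylab import mpl
mpl.rcParams["font.sans-serif"] = ["SimHei"]  # a font that supports Chinese characters
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly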

2. Loading the Data

dates = pd.date_range('2008-08-25', '2017-10-11', freq='B')
df_main = pd.DataFrame(index=dates)
df_aaxj = pd.read_csv("./data_stock/ETFs/aaxj.us.txt", parse_dates=True, index_col=0)  # use column 0 (the date) as the index
df_main = df_main.join(df_aaxj)   # left join: the business-day index defines the date range
df_main

Output:

              Open    High     Low   Close     Volume  OpenInt
2008-08-25  44.044  44.044  43.248  43.248    18975.0      0.0
2008-08-26  43.802  43.802  43.471  43.660     5507.0      0.0
2008-08-27  44.564  44.564  44.457  44.457     1675.0      0.0
2008-08-28  44.421  44.475  44.421  44.475     6687.0      0.0
2008-08-29  44.224  44.224  44.171  44.171      446.0      0.0
...            ...     ...     ...     ...        ...      ...
2017-10-05  73.500  74.030  73.500  73.970  2134323.0      0.0
2017-10-06  73.470  73.650  73.220  73.579  2092100.0      0.0
2017-10-09  73.500  73.795  73.480  73.770   879600.0      0.0
2017-10-10  74.150  74.490  74.150  74.480  1878845.0      0.0
2017-10-11  74.290  74.645  74.210  74.610  1168511.0      0.0

2383 rows × 6 columns

3. Data Preprocessing

# Inspect the data types
df_main.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2383 entries, 2008-08-25 to 2017-10-11
Freq: B
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Open     2298 non-null   float64
 1   High     2298 non-null   float64
 2   Low      2298 non-null   float64
 3   Close    2298 non-null   float64
 4   Volume   2298 non-null   float64
 5   OpenInt  2298 non-null   float64
dtypes: float64(6)
memory usage: 194.9 KB
  • Total rows: 2383; non-null rows: 2298, so the data contains missing values.
  • Data type: float64
# Count the missing values
df_main.isnull().sum()

Output:

Open       85
High       85
Low        85
Close      85
Volume     85
OpenInt    85
dtype: int64
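  • 85 / 2383 is roughly 3.6%, a fairly large share of missing values.
  • I treat the missing values as randomly scattered collection gaps. Since the index covers every business day (freq='B'), any day without a row in the file (a market holiday, for example) becomes NaN after the join.
  • Because this is a time series and a stock price depends strongly on the neighboring days, the gaps are filled by interpolation.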
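To make the effect of linear interpolation concrete, here is a small sketch on a toy series. The values and the variable names `close` and `filled` are made up for illustration; only `interpolate(method="linear")` matches what is applied to the real data.

```python
import numpy as np
import pandas as pd

# Toy closing prices on a business-day index with two consecutive gaps (made-up values).
idx = pd.date_range("2008-08-25", periods=4, freq="B")
close = pd.Series([44.0, np.nan, np.nan, 47.0], index=idx)

# method="linear" treats the rows as equally spaced and fills each gap
# on the straight line between the nearest known values.
filled = close.interpolate(method="linear")
print(filled.tolist())
```

The two gaps are replaced by evenly spaced values between 44.0 and 47.0, which is a reasonable default for short gaps in a price series but smooths over any real movement on those days.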
# Fill the missing values by linear interpolation
df_main = df_main.interpolate(method='linear')
# Check the missing values again
df_main.isnull().sum()

Output:

Open       0
High       0
Low        0
Close      0
Volume     0
OpenInt    0
dtype: int64
# Summary statistics
df_main.describe()

Output:

              Open         High          Low        Close        Volume  OpenInt
count  2383.000000  2383.000000  2383.000000  2383.000000  2.383000e+03   2383.0
mean     52.559695    52.835654    52.216654    52.552454  7.177284e+05      0.0
std       8.773809     8.687520     8.930144     8.805241  7.704731e+05      0.0
min      23.790000    24.605000    19.699000    22.726000  1.120000e+02      0.0
25%      48.988500    49.313000    48.552500    48.981500  2.789905e+05      0.0
50%      53.653000    53.932000    53.432000    53.653000  5.040570e+05      0.0
75%      57.270500    57.484000    56.983500    57.214500  8.812500e+05      0.0
max      74.290000    74.645000    74.210000    74.610000  1.048028e+07      0.0
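# Correlation analysis
df_main.corr()

Output:

             Open      High       Low     Close    Volume  OpenInt
Open     1.000000  0.999256  0.997143  0.998608  0.265971      NaN
High     0.999256  1.000000  0.996543  0.999276  0.268923      NaN
Low      0.997143  0.996543  1.000000  0.997468  0.261464      NaN
Close    0.998608  0.999276  0.997468  1.000000  0.264884      NaN
Volume   0.265971  0.268923  0.261464  0.264884  1.000000      NaN
OpenInt       NaN       NaN       NaN       NaN       NaN      NaN
  • Based on practical experience, the selected features are Open, High, Low, and Close. (OpenInt is constant at 0, which is why its correlations are NaN, and Volume is only weakly correlated with the prices.)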

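4. Feature Selection

# Select the features: Open, High, Low, Close
sel_features = ['Open', 'High', 'Low', 'Close']
df_main = df_main[sel_features].copy()  # select by column labels; copy so later assignments do not act on a view
# Look at the first few rows
df_main.head(3)

Output:

              Open    High     Low   Close
2008-08-25  44.044  44.044  43.248  43.248
2008-08-26  43.802  43.802  43.471  43.660
2008-08-27  44.564  44.564  44.457  44.457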
# Plot the closing price
df_main[['Close']].plot()
plt.title('股价收盘价走势')   # "Closing price trend"
plt.ylabel('股票价格')        # "Stock price"
plt.xlabel('时间')            # "Time"
plt.show()


[Figure: closing-price trend over time]

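5. Normalization

from sklearn.preprocessing import MinMaxScaler
# Create the scaler, mapping each column to [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
# Normalize each column separately. The scaler is re-fitted on every pass,
# so after the loop it holds the parameters of the last column, 'Close'.
for col in sel_features:
    df_main[col] = scaler.fit_transform(df_main[col].values.reshape(-1, 1)).ravel()  # reshape(-1, 1): one column, row count inferred
# Show the data
df_main.head(3)

Output:

                Open      High       Low     Close
2008-08-25 -0.197861 -0.223062 -0.135991 -0.208928
2008-08-26 -0.207446 -0.232734 -0.127809 -0.193046
2008-08-27 -0.177267 -0.202278 -0.091633 -0.162324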
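The scaling to [-1, 1] is a plain linear map, so it can be checked by hand. The sketch below assumes that the minimum and maximum of Close seen by the scaler are the values reported by describe() earlier (22.726 and 74.610); the helper names `minmax_scale` and `minmax_inverse` are hypothetical and only mirror what MinMaxScaler computes for a single column.

```python
def minmax_scale(x, x_min, x_max, lo=-1.0, hi=1.0):
    """Map x linearly from [x_min, x_max] onto [lo, hi]."""
    return (x - x_min) / (x_max - x_min) * (hi - lo) + lo


def minmax_inverse(x_scaled, x_min, x_max, lo=-1.0, hi=1.0):
    """Undo minmax_scale and return the value in original units."""
    return (x_scaled - lo) / (hi - lo) * (x_max - x_min) + x_min


# Min and max of Close, taken from the describe() output earlier in the article.
close_min, close_max = 22.726, 74.610

scaled = minmax_scale(43.248, close_min, close_max)      # Close on 2008-08-25
restored = minmax_inverse(scaled, close_min, close_max)  # back to the original price
print(round(scaled, 6), round(restored, 3))
```

The scaled value agrees with the first Close entry of the normalized table, and the inverse map recovers the original price. This is the same inverse that is needed later to turn predictions back into prices.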

6. Building the Target

The data has no target column, so we create one: the target is the next day's closing price.
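As a quick illustration of how a next-day target is obtained by shifting, here is a sketch on the first three closing prices from the raw table. The variable name `frame` is hypothetical.

```python
import pandas as pd

# First three closing prices from the raw table above.
dates = pd.to_datetime(["2008-08-25", "2008-08-26", "2008-08-27"])
frame = pd.DataFrame({"Close": [43.248, 43.660, 44.457]}, index=dates)

# shift(-1) moves every value up one row, so each row sees the next day's close.
frame["target"] = frame["Close"].shift(-1)
# The last row has no "next day", so its target is NaN and the row is dropped.
frame = frame.dropna()
print(frame)
```

Each remaining row now pairs a day's close with the close of the following day, and the data loses exactly one row at the end.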

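# Create the target
df_main['target'] = df_main['Close'].shift(-1)  # the next day's close
# shift(-1) moves values up by one row, so the last row has no target; drop it
df_main = df_main.dropna()
# Use a single dtype
df_main = df_main.astype(np.float32)
import seaborn as sns
# Compute the correlations
corr_matrix = df_main.corr()
# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('相关性分析')  # "Correlation analysis"
plt.show()


[Figure: correlation heatmap]

  • In hindsight this step feels redundant: for a stock, the open, high, low, and close prices are bound to be very strongly correlated.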

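7. Converting the Data into Sequences

So far each row of the table is a single day. For the LSTM to use the preceding days, the table has to be cut into fixed-length sliding windows, so that each sample is a short time series.

def create_time_data(data, seq):  # seq: length of the sliding window
    # Containers for the feature windows and the target windows
    data_feat, data_target = [], []
    # Starting at each index, take a window of length seq
    for index in range(len(data) - seq):
        data_feat.append(data[['Open', 'High', 'Low', 'Close']][index: index + seq].values)
        data_target.append(data['target'][index: index + seq].values)
        
    # Convert to NumPy arrays
    data_feat = np.array(data_feat)
    data_target = np.array(data_target)
    
    return data_feat, data_target
# Look at the first 20 rows of features as an array
df_main[['Open', 'High', 'Low', 'Close']][0: 20].values

Output: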

array([[-0.19786139, -0.22306155, -0.1359909 , -0.2089276 ],
       [-0.20744555, -0.23273382, -0.12780906, -0.19304602],
       [-0.17726733, -0.20227818, -0.09163288, -0.16232364],
       [-0.1829307 , -0.20583533, -0.09295372, -0.1616298 ],
       [-0.19073267, -0.21586731, -0.10212617, -0.17334823],
       [-0.19764356, -0.22284172, -0.10755628, -0.17905328],
       [-0.20455445, -0.22981615, -0.11298637, -0.1847583 ],
       [-0.26768318, -0.28892887, -0.17543249, -0.24797626],
       [-0.28574258, -0.3117506 , -0.21487406, -0.28968468],
       [-0.33833665, -0.33721024, -0.2418044 , -0.28833553],
       [-0.27168316, -0.29316548, -0.1908789 , -0.24585614],
       [-0.28011882, -0.30607513, -0.21553448, -0.29249865],
       [-0.3281584 , -0.34580335, -0.24672085, -0.31716907],
       [-0.37619802, -0.38553157, -0.27790722, -0.3418395 ],
       [-0.3779802 , -0.4044764 , -0.2841445 , -0.36458254],
       [-0.40669307, -0.43381295, -0.33151108, -0.41153342],
       [-0.45421782, -0.4803757 , -0.37579572, -0.44086808],
       [-0.472     , -0.49972022, -0.400488  , -0.48681673],
       [-0.47366336, -0.43888888, -0.375172  , -0.38705572],
       [-0.36376238, -0.32893685, -0.26047954, -0.28174388]],
      dtype=float32)
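The windowing can be checked independently of the real data. The sketch below uses made-up numbers with the same row count as the cleaned table (2383 rows minus the one dropped row); `make_windows` is a hypothetical NumPy-only stand-in for the windowing function above, not the author's code.

```python
import numpy as np


def make_windows(features, target, seq):
    """Cut (n_rows, n_features) features and (n_rows,) targets into windows of length seq."""
    n = len(features) - seq   # same window count as range(len(data) - seq)
    feat = np.stack([features[i:i + seq] for i in range(n)])
    targ = np.stack([target[i:i + seq] for i in range(n)])
    return feat, targ


# Made-up data: 2382 rows, 4 features, and one target value per row.
n_rows, seq = 2382, 6
features = np.arange(n_rows * 4, dtype=np.float32).reshape(n_rows, 4)
target = np.arange(n_rows, dtype=np.float32)

feat, targ = make_windows(features, target, seq)
test_size = int(n_rows * 0.2)
print(feat.shape, targ.shape, feat.shape[0] - test_size, test_size)
```

Each sample is a window of 6 consecutive days, and its target is also a sequence of 6 values (one next-day close per time step), so the model is trained to predict at every step of the window. With a 20% test share this yields 1900 training windows and 476 test windows.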

8. Building the Training and Test Sets

# Split function
def train_test(data_feat, data_target, test_size, seq):
    # Size of the training set
    train_size = data_feat.shape[0] - test_size 
    # Split into training and test sets and convert to float32 tensors
    train_x = torch.from_numpy(data_feat[: train_size].reshape(-1, seq, 4)).type(torch.Tensor)
    test_x = torch.from_numpy(data_feat[train_size:].reshape(-1, seq, 4)).type(torch.Tensor)
    train_y = torch.from_numpy(data_target[:train_size].reshape(-1, seq, 1)).type(torch.Tensor)
    test_y  = torch.from_numpy(data_target[train_size:].reshape(-1, seq, 1)).type(torch.Tensor)
    
    # Return the four tensors
    return train_x, train_y, test_x, test_y

# Settings
data = df_main 
seq = 6   # window length: 6 days, chosen as roughly one week of data (a trading week is usually 5 business days)
test_size = int(len(data) * 0.2)

# Build the windowed data
feat, target = create_time_data(data, seq)

# Split into training and test sets
train_x, train_y, test_x, test_y = train_test(feat, target, test_size, seq)
# Print the shapes
train_x.shape, train_y.shape, test_x.shape, test_y.shape

Output:

(torch.Size([1900, 6, 4]),
 torch.Size([1900, 6, 1]),
 torch.Size([476, 6, 4]),
 torch.Size([476, 6, 1]))

9. Loading Data in Batches

batch_size = 6   # each batch holds 6 windows (each window covers 6 days)

# Wrap the tensors as datasets
train_data = torch.utils.data.TensorDataset(train_x, train_y)
test_data = torch.utils.data.TensorDataset(test_x, test_y)

# Batch loaders
train_dl = torch.utils.data.DataLoader(dataset=train_data,
                                       batch_size=batch_size,
                                       shuffle=True)

test_dl = torch.utils.data.DataLoader(dataset=test_data,
                                      batch_size=batch_size,
                                      shuffle=False)  # evaluation does not need shuffling

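2. Building the LSTM Network

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super(LSTM, self).__init__()
        # Hidden dimension
        self.hidden_dim = hidden_dim
        # Number of stacked LSTM layers
        self.num_layers = num_layers
        # LSTM layers; batch_first=True means the input is (batch, seq, feature)
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        # Fully connected layer that maps each hidden state to the output
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Initialize the hidden state and the cell state with zeros
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim, device=x.device)
        
        # LSTM forward pass; out has shape (batch, seq, hidden_dim)
        out, (hn, cn) = self.lstm(x, (h0, c0))
        
        # Regression head applied at every time step: (batch, seq, output_dim)
        out = self.fc(out)
        
        # Return the per-step predictions
        return out 
# Create the model and print its structure
# Input features: 4, output features: 1
model = LSTM(input_dim=4, hidden_dim=32, num_layers=2, output_dim=1)
model

Output: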

LSTM(
  (lstm): LSTM(4, 32, num_layers=2, batch_first=True)
  (fc): Linear(in_features=32, out_features=1, bias=True)
)
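To see what the forward pass produces, it helps to trace the tensor shapes through an LSTM and a linear layer of the same sizes as the model above. This is only a shape check on random input, assuming a batch of 6 windows; the standalone `lstm` and `fc` objects are illustrative and are not the trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same sizes as the model above: 4 input features, 32 hidden units, 2 layers.
lstm = nn.LSTM(input_size=4, hidden_size=32, num_layers=2, batch_first=True)
fc = nn.Linear(32, 1)

x = torch.randn(6, 6, 4)      # (batch = 6 windows, seq = 6 days, 4 features)
out, (hn, cn) = lstm(x)       # initial states default to zeros when omitted
pred = fc(out)                # the linear layer acts on the last dimension

print(out.shape, hn.shape, cn.shape, pred.shape)
# For a unidirectional LSTM, the last time step of out is the top layer's final hidden state.
print(torch.allclose(out[:, -1, :], hn[-1]))
```

`out` holds the top layer's hidden state for every time step, which is why the model can emit one prediction per day in the window, while `hn` and `cn` hold only the final states of each of the two layers.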

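3. Model Training

1. Setting Hyperparameters

# Loss function
loss_fn = torch.nn.MSELoss()
# Learning rate
learn_rate = 0.01
# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)

2. Training Function

def train(dataloader, model, loss_fn, optimizer):
    # Number of batches, i.e. ceil(number of samples / batch size)
    num_batches = len(dataloader)
    
    # Accumulated loss
    train_loss = 0
    
    for X, y in dataloader:  # for the shape of each batch, see the data-loading step above
        
        # Forward pass
        pred = model(X)
        # Compute the loss
        loss = loss_fn(pred, y)
        
        # Reset the gradients
        optimizer.zero_grad()
        # Backpropagate
        loss.backward()
        # Update the parameters
        optimizer.step()
        
        # Accumulate the batch loss
        train_loss += loss.item()   # .item() extracts the Python float
    
    # Average loss per batch
    train_loss /= num_batches
    
    return train_loss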

3. Test Function

def test(dataloader, model, loss_fn):
    # Number of batches
    num_batches = len(dataloader)
    
    # Accumulated loss
    test_loss = 0
    
    with torch.no_grad():
        for X, y in dataloader:
            
            # Predict and compute the loss
            pred = model(X)
            loss = loss_fn(pred, y)
            
            test_loss += loss.item()
     
    # Average loss per batch
    test_loss /= num_batches
    
    return test_loss
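Both functions above average the per-batch losses. When the last batch is smaller than the others (476 test windows with a batch size of 6 leave a final batch of 2), this is slightly different from averaging over samples. The sketch below uses made-up loss values and hypothetical helper names purely to show the size of that difference.

```python
def mean_of_batch_means(batches):
    """Average the per-batch losses, as the train and test functions above do."""
    return sum(loss for loss, _ in batches) / len(batches)


def sample_weighted_mean(batches):
    """Weight each batch loss by its batch size before averaging."""
    total = sum(loss * size for loss, size in batches)
    count = sum(size for _, size in batches)
    return total / count


# Made-up losses: 79 full batches of 6 windows and one final batch of 2 windows.
batches = [(0.001, 6)] * 79 + [(0.009, 2)]

unweighted = mean_of_batch_means(batches)
weighted = sample_weighted_mean(batches)
print(unweighted, weighted)
```

The small final batch counts as much as a full batch in the unweighted average, so an unusual last batch shifts the reported loss a little. For this dataset the effect is minor, and either convention is fine as long as it is applied consistently.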

4. Training Loop

train_loss = []
test_loss = []

epochs = 15

for epoch in range(epochs):
    model.train()
    epoch_train_loss = train(train_dl, model, loss_fn, optimizer)
    
    model.eval()
    epoch_test_loss = test(test_dl, model, loss_fn)
    
    train_loss.append(epoch_train_loss)
    test_loss.append(epoch_test_loss)
    
    template = ('Epoch:{:2d}, Train_mse:{:.10f}, Test_mse:{:.10f}')
    print(template.format(epoch+1, epoch_train_loss, epoch_test_loss))
Epoch: 1, Train_mse:0.0055270789, Test_mse:0.0028169709
Epoch: 2, Train_mse:0.0014304496, Test_mse:0.0032940961
Epoch: 3, Train_mse:0.0016769003, Test_mse:0.0014444893
Epoch: 4, Train_mse:0.0013827066, Test_mse:0.0023709078
Epoch: 5, Train_mse:0.0013644575, Test_mse:0.0005126200
Epoch: 6, Train_mse:0.0011645519, Test_mse:0.0009766717
Epoch: 7, Train_mse:0.0010370992, Test_mse:0.0026354755
Epoch: 8, Train_mse:0.0011004983, Test_mse:0.0005752990
Epoch: 9, Train_mse:0.0011330271, Test_mse:0.0013168041
Epoch:10, Train_mse:0.0011555004, Test_mse:0.0016195212
Epoch:11, Train_mse:0.0015111874, Test_mse:0.0010681283
Epoch:12, Train_mse:0.0010495648, Test_mse:0.0008801822
Epoch:13, Train_mse:0.0009528522, Test_mse:0.0006430979
Epoch:14, Train_mse:0.0010829600, Test_mse:0.0006819312
Epoch:15, Train_mse:0.0011495422, Test_mse:0.0013490517

4. Results

1. Loss Curves

# Plot the loss curves
epoch_range = range(epochs)

plt.plot(epoch_range, train_loss, label='Training Mse')
plt.plot(epoch_range, test_loss, label='Test Mse')
plt.legend(loc='upper right')
plt.title('Mse')
plt.show()


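[Figure: training and test MSE per epoch]

Analysis

  • On the normalized data, the training and test MSE both stay below 0.01 in every epoch, which suggests that the model fits this data well.
  • Next, the predictions are inverse-normalized and plotted, which makes the results easier to judge visually.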
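An MSE on the normalized scale is hard to interpret on its own, so it can help to convert it into price units. The sketch below assumes the Close range from the describe() table (22.726 to 74.610) and uses the epoch-15 test MSE from the training log; the variable names are hypothetical.

```python
import math

# Close range taken from the describe() table earlier in the article.
close_min, close_max = 22.726, 74.610
# MinMaxScaler(feature_range=(-1, 1)) maps a span of (max - min) onto a span of 2.
price_per_scaled_unit = (close_max - close_min) / 2.0

scaled_mse = 0.0013490517   # test MSE of epoch 15 from the training log
# Squared errors scale with the square of the unit conversion factor.
price_mse = scaled_mse * price_per_scaled_unit ** 2
price_rmse = math.sqrt(price_mse)
print(round(price_mse, 3), round(price_rmse, 3))
```

The result is an error of roughly one price unit. The training log averages over all 6 time steps of each window, while the later error check uses only the last step, so the two figures are of the same order but do not match exactly.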

2. True vs. Predicted Values (After Inverse Normalization)

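# Predict on the full training and test tensors (no gradients needed)
with torch.no_grad():
    y_train_pred = model(train_x)
    y_test_pred = model(test_x)

# Keep the last time step of each window and map it back to prices.
# The scaler was last fitted on 'Close', so it inverts the target correctly.
y_train_pred = scaler.inverse_transform(y_train_pred.detach().numpy()[:,-1,0].reshape(-1,1))
y_train = scaler.inverse_transform(train_y.detach().numpy()[:,-1,0].reshape(-1,1))
y_test_pred = scaler.inverse_transform(y_test_pred.detach().numpy()[:,-1,0].reshape(-1,1))
y_test = scaler.inverse_transform(test_y.detach().numpy()[:,-1,0].reshape(-1,1))
# Plot the training-set predictions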
plt.plot(y_train_pred, label="pred_data")
plt.plot(y_train, label="true_data")
plt.legend()
plt.show()


[Figure: predicted vs. true closing price on the training set]

# Plot the test-set predictions
plt.plot(y_test_pred, label="pred_data")
plt.plot(y_test, label="true_data")
plt.legend()
plt.show()


[Figure: predicted vs. true closing price on the test set]

3. Error Check

from sklearn.metrics import mean_squared_error

trainScore = mean_squared_error(y_train, y_train_pred)
testScore = mean_squared_error(y_test, y_test_pred)

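print("Train mse: ", trainScore)
print("Test mse: ", testScore)
Train mse:  0.60466486
Test mse:  0.8240372

Analysis

  • Train MSE ≈ 0.60 and Test MSE ≈ 0.82 in squared price units, i.e. an RMSE of roughly 0.78 and 0.91. Compared with closing prices that range from about 23 to 75, this further supports the effectiveness of the model.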