PyTorch data-center traffic prediction: reference notes for the data preprocessing part

1.PyTorch–用循环神经网络LSTM预测时间序列 (PyTorch: predicting a time series with an LSTM recurrent neural network)
Focus mainly on the model-training code from this article.

epochs = 150

for i in range(epochs):
    for seq, labels in train_inout_seq:  # this loop over (seq, label) pairs is the key part!
        optimizer.zero_grad()
        # re-initialize the hidden and cell states before each sequence
        model.hidden_cell = (torch.zeros(1, 1, model.hidden_layer_size),
                             torch.zeros(1, 1, model.hidden_layer_size))

        y_pred = model(seq)

        single_loss = loss_function(y_pred, labels)
        single_loss.backward()
        optimizer.step()

    if i % 25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
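
The loop above assumes a model that exposes hidden_cell and hidden_layer_size attributes, plus a loss_function and optimizer that already exist. A minimal sketch consistent with that interface, loosely following the referenced article (the layer sizes are illustrative):

import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_layer_size=100, output_size=1):
        super().__init__()
        self.hidden_layer_size = hidden_layer_size
        self.lstm = nn.LSTM(input_size, hidden_layer_size)
        self.linear = nn.Linear(hidden_layer_size, output_size)
        # (hidden state, cell state); the training loop resets this per sequence
        self.hidden_cell = (torch.zeros(1, 1, hidden_layer_size),
                            torch.zeros(1, 1, hidden_layer_size))

    def forward(self, input_seq):
        # view the 1-D window as (seq_len, batch=1, features=1)
        lstm_out, self.hidden_cell = self.lstm(
            input_seq.view(len(input_seq), 1, -1), self.hidden_cell)
        predictions = self.linear(lstm_out.view(len(input_seq), -1))
        return predictions[-1]  # prediction for the step after the window

model = LSTM()
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)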

Working backwards, train_inout_seq looks like this (it is just a list): [(tensor_x1, tensor_y1), (tensor_x2, tensor_y2), (tensor_x3, tensor_y3), …]

[(tensor([-0.9648, -0.9385, -0.8769, -0.8901, -0.9253, -0.8637, -0.8066, -0.8066,
          -0.8593, -0.9341, -1.0000, -0.9385]), tensor([-0.9516])),
 (tensor([-0.9385, -0.8769, -0.8901, -0.9253, -0.8637, -0.8066, -0.8066, -0.8593,
          -0.9341, -1.0000, -0.9385, -0.9516]),
  tensor([-0.9033])),
 (tensor([-0.8769, -0.8901, -0.9253, -0.8637, -0.8066, -0.8066, -0.8593, -0.9341,
          -1.0000, -0.9385, -0.9516, -0.9033]), tensor([-0.8374])),
 (tensor([-0.8901, -0.9253, -0.8637, -0.8066, -0.8066, -0.8593, -0.9341, -1.0000,
          -0.9385, -0.9516, -0.9033, -0.8374]), tensor([-0.8637])),
 (tensor([-0.9253, -0.8637, -0.8066, -0.8066, -0.8593, -0.9341, -1.0000, -0.9385,
          -0.9516, -0.9033, -0.8374, -0.8637]), tensor([-0.9077]))]
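
A list in this shape can be produced by a simple sliding-window helper over the normalized series. A minimal sketch, assuming a normalized 1-D tensor (here called train_data_normalized) and a window size tw=12 to match the 12-element windows above:

def create_inout_sequences(input_data, tw):
    # slide a window of length tw over the series; the single element
    # right after each window becomes that window's label
    inout_seq = []
    for i in range(len(input_data) - tw):
        train_seq = input_data[i:i + tw]
        train_label = input_data[i + tw:i + tw + 1]
        inout_seq.append((train_seq, train_label))
    return inout_seq

train_inout_seq = create_inout_sequences(train_data_normalized, tw=12)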

2.Multivariate input LSTM in pytorch
How to Develop LSTM Models for Time Series Forecasting
The second article introduces several LSTM case studies, but it is written in Keras; my project mainly draws on the ideas from its Multiple Parallel Series case. The first article is a PyTorch take on its Multivariate input LSTM case.
The Multiple Parallel Series content is reproduced below. The dataset consists of three parallel series; with three time steps per sample, the first input/output sample looks like this:

[[ 10  15  25]
 [ 20  25  45]
 [ 30  35  65]
 [ 40  45  85]
 [ 50  55 105]
 [ 60  65 125]
 [ 70  75 145]
 [ 80  85 165]
 [ 90  95 185]]
#input
10, 15, 25
20, 25, 45
30, 35, 65
#output
40, 45, 85

The split_sequences() function below splits multiple parallel time series (rows are time steps, one series per column) into the required input/output shape.

from numpy import array

# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the dataset
        if end_ix > len(sequences)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix, :]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

We can demonstrate this on the contrived problem; the complete example is listed below.

# multivariate output data prep
from numpy import array
from numpy import hstack

# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
	X, y = list(), list()
	for i in range(len(sequences)):
		# find the end of this pattern
		end_ix = i + n_steps
		# check if we are beyond the dataset
		if end_ix > len(sequences)-1:
			break
		# gather input and output parts of the pattern
		seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix, :]
		X.append(seq_x)
		y.append(seq_y)
	return array(X), array(y)

# define input sequence
in_seq1 = array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = array([15, 25, 35, 45, 55, 65, 75, 85, 95])
out_seq = array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))])
# convert to [rows, columns] structure
in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
out_seq = out_seq.reshape((len(out_seq), 1))
# horizontally stack columns
dataset = hstack((in_seq1, in_seq2, out_seq))
# choose a number of time steps
n_steps = 3
# convert into input/output
X, y = split_sequences(dataset, n_steps)
print(X.shape, y.shape)
# summarize the data
for i in range(len(X)):
	print(X[i], y[i])

Running the example first prints the shape of the prepared X and y components.

The shape of X is three-dimensional, including the number of samples (6), the number of time steps chosen per sample (3), and the number of parallel time series or features (3).

The shape of y is two-dimensional, as we might expect, for the number of samples (6) and the number of time variables per sample to be predicted (3).

The data is ready to use in an LSTM model that expects three-dimensional input and two-dimensional output shapes for the X and y components of each sample.

Then each of the samples is printed, showing its input and output components.

(6, 3, 3) (6, 3)

[[10 15 25]
 [20 25 45]
 [30 35 65]] [40 45 85]
[[20 25 45]
 [30 35 65]
 [40 45 85]] [ 50  55 105]
[[ 30  35  65]
 [ 40  45  85]
 [ 50  55 105]] [ 60  65 125]
[[ 40  45  85]
 [ 50  55 105]
 [ 60  65 125]] [ 70  75 145]
[[ 50  55 105]
 [ 60  65 125]
 [ 70  75 145]] [ 80  85 165]
[[ 60  65 125]
 [ 70  75 145]
 [ 80  85 165]] [ 90  95 185]

We are now ready to fit an LSTM model on this data.

Any of the varieties of LSTMs in the previous section can be used, such as a Vanilla, Stacked, Bidirectional, CNN, or ConvLSTM model.

We will use a Stacked LSTM where the number of time steps and parallel series (features) are specified for the input layer via the input_shape argument. The number of parallel series is also used to specify the number of values the model predicts in the output layer; again, this is three.

...
# define model
model = Sequential()
model.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(100, activation='relu'))
model.add(Dense(n_features))
model.compile(optimizer='adam', loss='mse')

We can predict the next value in each of the three parallel series by providing an input of three time steps for each series.

70, 75, 145
80, 85, 165
90, 95, 185

The shape of the input for making a single prediction must be 1 sample, 3 time steps, and 3 features, or [1, 3, 3].

...
# demonstrate prediction
x_input = array([[70,75,145], [80,85,165], [90,95,185]])
x_input = x_input.reshape((1, n_steps, n_features))
yhat = model.predict(x_input, verbose=0)

We would expect the vector output to be:
[100, 105, 205]
We can tie all of this together and demonstrate a Stacked LSTM for multivariate output time series forecasting below.

# multivariate output stacked lstm example
from numpy import array
from numpy import hstack
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense

# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
	X, y = list(), list()
	for i in range(len(sequences)):
		# find the end of this pattern
		end_ix = i + n_steps
		# check if we are beyond the dataset
		if end_ix > len(sequences)-1:
			break
		# gather input and output parts of the pattern
		seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix, :]
		X.append(seq_x)
		y.append(seq_y)
	return array(X), array(y)

# define input sequence
in_seq1 = array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = array([15, 25, 35, 45, 55, 65, 75, 85, 95])
out_seq = array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))])
# convert to [rows, columns] structure
in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
out_seq = out_seq.reshape((len(out_seq), 1))
# horizontally stack columns
dataset = hstack((in_seq1, in_seq2, out_seq))
# choose a number of time steps
n_steps = 3
# convert into input/output
X, y = split_sequences(dataset, n_steps)
# the dataset knows the number of features, e.g. 3
n_features = X.shape[2]
# define model
model = Sequential()
model.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(100, activation='relu'))
model.add(Dense(n_features))
model.compile(optimizer='adam', loss='mse')
# fit model
model.fit(X, y, epochs=400, verbose=0)
# demonstrate prediction
x_input = array([[70,75,145], [80,85,165], [90,95,185]])
x_input = x_input.reshape((1, n_steps, n_features))
yhat = model.predict(x_input, verbose=0)
print(yhat)
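
Since my project is in PyTorch rather than Keras, a rough PyTorch counterpart of the stacked model above could look like the sketch below. This is my own approximation, not code from either referenced article (note that nn.LSTM uses tanh gates internally rather than the relu activation of the Keras layers):

import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    # two stacked LSTM layers plus a linear head, mirroring the Keras model above
    def __init__(self, n_features=3, hidden_size=100):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers=2, batch_first=True)
        self.linear = nn.Linear(hidden_size, n_features)

    def forward(self, x):                 # x: (batch, n_steps, n_features)
        out, _ = self.lstm(x)             # out: (batch, n_steps, hidden_size)
        return self.linear(out[:, -1])    # next value of every parallel series

model = StackedLSTM()
x_input = torch.tensor([[[70, 75, 145], [80, 85, 165], [90, 95, 185]]], dtype=torch.float32)
yhat = model(x_input)  # shape (1, 3); untrained, so the values are arbitrary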

3.pytorch LSTM的股价预测 (PyTorch LSTM stock-price prediction)
A big reason for referencing this one: it is one of the rare articles that clearly explains how the four pieces train_x, test_x, train_y, test_y should each be used.

import tushare as ts  # the ts below comes from the tushare package

class GetData:
    def __init__(self, stock_id, save_path):
        self.stock_id = stock_id
        self.save_path = save_path
        self.data = None

    def getData(self):
        # fetch historical quotes (oldest first) and keep five columns
        self.data = ts.get_hist_data(self.stock_id).iloc[::-1]
        self.data = self.data[["open", "close", "high", "low", "volume"]]
        # remember the close range so predictions can be de-normalized later
        self.close_min = self.data["close"].min()
        self.close_max = self.data["close"].max()
        # min-max normalize every column to [0, 1]
        self.data = self.data.apply(lambda x: (x - min(x)) / (max(x) - min(x)))
        self.data.to_csv(self.save_path)
        return self.data

    def process_data(self, n):
        if self.data is None:
            self.getData()
        # sliding windows of n rows as features; the next close as the label
        feature = [
            self.data.iloc[i: i + n].values.tolist()
            for i in range(len(self.data) - n + 2)
            if i + n < len(self.data)
        ]
        label = [
            self.data.close.values[i + n]
            for i in range(len(self.data) - n + 2)
            if i + n < len(self.data)
        ]
        # simple fixed split: first 500 windows for training, the rest for testing
        train_x = feature[:500]
        test_x = feature[500:]
        train_y = label[:500]
        test_y = label[500:]

        return train_x, test_x, train_y, test_y
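
A minimal sketch of how the four returned splits can then be consumed; the stock_id, file name, and window size are illustrative, and the last comment uses the close_min/close_max saved in getData() to map predictions back to prices:

import torch

gd = GetData(stock_id="600000", save_path="stock.csv")  # illustrative arguments
train_x, test_x, train_y, test_y = gd.process_data(n=10)
train_x = torch.tensor(train_x, dtype=torch.float32)  # (num_train, n, 5)
train_y = torch.tensor(train_y, dtype=torch.float32)  # (num_train,)
# ...train on train_x / train_y, evaluate on test_x / test_y...
# de-normalize a predicted close: pred * (gd.close_max - gd.close_min) + gd.close_min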

4.Next, some documents on feeding variable-length sequences into an LSTM:
pytorch中如何在lstm中输入可变长的序列 (how to feed variable-length sequences into an LSTM in PyTorch)
The GitHub code for that article is here.
Really well written! The main idea: the sequences in train_x (a list of tensors) have unequal lengths, so rnn_utils.pad_sequence pads every tensor in a batch to the length of the longest one.
That solves the input-format problem, but padding in lots of zeros wastes compute, so rnn_utils.pack_padded_sequence then compresses each padded batch back down to the original values (the padded zeros are dropped). The batch's tensors are merged into a single one, ordered roughly as tensor0[0], tensor1[0], tensor2[0], tensor0[1], tensor1[1], … (you get the idea), and the result is a PackedSequence. That type can go straight into an LSTM: when the LSTM validates its input, a PackedSequence is accepted directly, whereas anything else makes it carefully inspect each batch's size.
P.S. Both rnn_utils.pad_sequence and rnn_utils.pack_padded_sequence can be written inside the collate_fn function (covered in one of the documents linked below); that function is incredibly useful!
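
To make that interleaving concrete, here is a tiny demo of my own (not from the article):

import torch
import torch.nn.utils.rnn as rnn_utils

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5])
c = torch.tensor([6])
padded = rnn_utils.pad_sequence([a, b, c], batch_first=True, padding_value=0)
# padded: tensor([[1, 2, 3],
#                 [4, 5, 0],
#                 [6, 0, 0]])
packed = rnn_utils.pack_padded_sequence(padded, [3, 2, 1], batch_first=True)
# packed.data: tensor([1, 4, 6, 2, 5, 3])  <- time-major interleaving, zeros dropped
# packed.batch_sizes: tensor([3, 2, 1])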
↓ This is the version given in the original article

def collate_fn(train_data):
    train_data.sort(key=lambda data: len(data), reverse=True)
    data_length = [len(data) for data in train_data]
    train_data = rnn_utils.pad_sequence(train_data, batch_first=True, padding_value=0)
    return train_data, data_length

↓ This is the version in my project

from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence
from torch.utils.data import DataLoader

def collate_fn(train_data):  # train_data: a batch of (x, y) pairs from MyDataset(train_x, train_y)
    # sort the (x, y) pairs together by descending x length, so each label stays
    # aligned with its sequence (sorting only the x list would break the pairing)
    train_data.sort(key=lambda pair: len(pair[0]), reverse=True)
    train_data_x = [pair[0] for pair in train_data]  # works for batch_size=4 (train) and batch_size=2 (test) alike
    train_data_y = [pair[1] for pair in train_data]
    data_length_x = [len(datax) for datax in train_data_x]
    train_data_x = pad_sequence(train_data_x, batch_first=True, padding_value=0)  # padding
    train_data_x = pack_padded_sequence(train_data_x, data_length_x, batch_first=True)  # compression

    return train_data_x, train_data_y, data_length_x

train_data_loader = DataLoader(train_data, batch_size=4, shuffle=True, collate_fn=collate_fn)
test_data_loader = DataLoader(test_data, batch_size=2, shuffle=True, collate_fn=collate_fn)

Also watch out for the Dataset class (since I'm not very good at writing class-based code, adapting other people's code here took a lot of effort).
↓ Below is my code (my Dataset class takes two variables, x and y). One open issue remains: I haven't decided how to define the loss between pred_y and y, and pred_y and y will very likely have different lengths, so __len__ only returns len(self.x).

# First wrap the raw data as a torch.utils.data.Dataset, then pass that Dataset to torch.utils.data.DataLoader
# to get a data loader that returns one batch of data at a time for the model to train on.
from torch.utils.data import Dataset

class MyDataset(Dataset):  # a wrapper class: packages the data as a Dataset, which is passed to DataLoader so the data can be handled more conveniently
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        x1 = self.x[index]
        y1 = self.y[index]
        return x1, y1
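
As a quick sanity check of my own (the shapes below are made up), variable-length inputs flow through MyDataset, collate_fn, and DataLoader like this:

import torch
from torch.utils.data import DataLoader

# four variable-length sequences, 3 features per time step, plus dummy labels
train_x = [torch.randn(5, 3), torch.randn(3, 3), torch.randn(7, 3), torch.randn(4, 3)]
train_y = [torch.randn(1) for _ in range(4)]
loader = DataLoader(MyDataset(train_x, train_y), batch_size=4, collate_fn=collate_fn)
packed_x, batch_y, lengths = next(iter(loader))
# packed_x is a PackedSequence and can be fed directly to an nn.LSTM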

I haven't seriously studied the torch.nn.utils.rnn.pack_padded_sequence() part yet (I simply haven't gotten that far… sigh).
I also consulted 关于pack_padded_sequence 和 pad_packed_sequence最清楚的解释 (the clearest explanation of pack_padded_sequence and pad_packed_sequence).
If the collate_fn parameter is still unclear, see Pytorch技巧:DataLoader的collate_fn参数使用详解 (PyTorch tip: a detailed guide to DataLoader's collate_fn parameter).
collate_fn defines how samples are collated into a batch; you can write your own function to get exactly the behavior you want.
If your dataset's preprocessing is also complicated, say multi-dimensional multivariate input (predicting several features of several servers at the next time step, with many records per time step, sigh), the simple Dataset class above won't cut it; then study the underlying machinery: python之TensorDataset和DataLoader (Python: TensorDataset and DataLoader).

5.Finally, some miscellaneous documents I also consulted:
PyTorch搭建LSTM实现时间序列预测(负荷预测) (building an LSTM in PyTorch for time series forecasting: load forecasting)
The author includes GitHub code; good for getting an overall picture of the LSTM workflow.
lstm pytorch梳理之 batch_first 参数 和torch.nn.utils.rnn.pack_padded_sequence (sorting out the batch_first parameter and torch.nn.utils.rnn.pack_padded_sequence)
Mainly explains how torch.nn.utils.rnn.pack_padded_sequence works.
Pytorch 中如何处理 RNN 输入变长序列 padding (how to handle padding of variable-length RNN input sequences in PyTorch)
Same topic again (read enough of these and they are all much alike, heh).

And that's it; the data preprocessing I've been working on lately has finally come to a pause.
These are just my own research notes and I don't know whether anyone will read them, but please don't copy them verbatim into your paper; I'm graduating next year, sob.
