Sampling Sequence Data for RNNs


蜂蜜柚子茶。


Article link: https://blog.csdn.net/weixin_43742009/article/details/122951024


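Minibatches of sequence data can be drawn in two ways: random sampling and sequential (adjacent) sampling.

Random sampling

Code: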

import torch
import random

def seq_data_iter_random(corpus, batch_size, num_steps):  #@save
    """Generate minibatches of subsequences using random sampling."""
    # Start partitioning from a random offset in 0..num_steps-1 (inclusive),
    # so [0, 1, 2, ...] becomes [offset, offset+1, offset+2, ...]
    corpus = corpus[random.randint(0, num_steps - 1):]
    # Subtract 1 because every input needs a label one position later
    num_subseqs = (len(corpus) - 1) // num_steps
    # Starting indices of the subsequences of length num_steps
    initial_indices = list(range(0, num_subseqs * num_steps, num_steps))
    print('initial_indices:{}'.format(initial_indices))
    # With random sampling, subsequences from two consecutive minibatches
    # are not necessarily adjacent in the original sequence
    random.shuffle(initial_indices)
    print("After shuffle, initial_indices:{}".format(initial_indices))

    def data(pos):
        # Return the subsequence of length num_steps starting at pos
        print("Returning a sequence of length %d starting at position %d" % (num_steps, pos))
        return corpus[pos: pos + num_steps]

    num_batches = num_subseqs // batch_size
    for i in range(0, batch_size * num_batches, batch_size):
        # initial_indices holds the shuffled starting indices of the subsequences
        initial_indices_per_batch = initial_indices[i: i + batch_size]
        print("initial_indices_per_batch=initial_indices[%d: %d+%d]" % (i, i, batch_size))
        X = [data(j) for j in initial_indices_per_batch]
        Y = [data(j + 1) for j in initial_indices_per_batch]
        yield torch.tensor(X), torch.tensor(Y)

my_seq = list(range(30))
for X, Y in seq_data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y)

Output (from one run, in which the random offset was 3):

initial_indices:[0, 6, 12, 18]
After shuffle, initial_indices:[0, 6, 18, 12]
initial_indices_per_batch=initial_indices[0: 0+2]
Returning a sequence of length 6 starting at position 0
Returning a sequence of length 6 starting at position 6
Returning a sequence of length 6 starting at position 1
Returning a sequence of length 6 starting at position 7
X:  tensor([[ 3,  4,  5,  6,  7,  8],
        [ 9, 10, 11, 12, 13, 14]])
Y: tensor([[ 4,  5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14, 15]])
initial_indices_per_batch=initial_indices[2: 2+2]
Returning a sequence of length 6 starting at position 18
Returning a sequence of length 6 starting at position 12
Returning a sequence of length 6 starting at position 19
Returning a sequence of length 6 starting at position 13
X:  tensor([[21, 22, 23, 24, 25, 26],
        [15, 16, 17, 18, 19, 20]])
Y: tensor([[22, 23, 24, 25, 26, 27],
        [16, 17, 18, 19, 20, 21]])

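A few explanations

We build a sequence of length 30 with indices 0 to 29, and set batch_size=2 and num_steps=6. How many samples can it produce?
Ignoring batch_size for the moment, it yields (30-1) // num_steps = 29 // 6 = 4 subsequences. The 1 is subtracted because each label is the input shifted by one position, so only 30-1 elements can serve as inputs. (The function first drops a random offset of 0 to 5 elements, but for this sequence the count is still 4 for every possible offset.)
With batch_size=2, each minibatch holds 2 subsequences, so the 4 subsequences form only 4 // 2 = 2 minibatches, i.e. num_batches=2.
What are the starting indices of the 4 subsequences? That is what initial_indices = list(range(0, num_subseqs * num_steps, num_steps)) computes.

Here it becomes initial_indices = list(range(0, 4*6, 6)), i.e. [0, 6, 12, 18]. Shuffling these starting indices is what makes the sampling random.
The next line that can be hard to follow is for i in range(0, batch_size * num_batches, batch_size):
Here it becomes for i in range(0, 4, 2), so i takes the values 0 and 2. Suppose the shuffled list is initial_indices=[0, 6, 18, 12].
When i==0, the start indices are 0 and 6, so one might expect X=[[0,1,2,3,4,5],[6,7,8,9,10,11]]. Why is the actual output tensor([[ 3, 4, 5, 6, 7, 8],[ 9, 10, 11, 12, 13, 14]])? Because the function trimmed the sequence at a random offset at the very start (3 in this run), so index 0 of the trimmed sequence holds the value 3.
Likewise, when i==2 the start indices are 18 and 12 of the trimmed sequence, which gives the values 21 to 26 and 15 to 20.
Y is obtained by shifting each start index of X by 1.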
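The arithmetic above can be replayed without PyTorch. The following is a minimal sketch, not the original function: the offset (3) and the shuffled order ([0, 6, 18, 12]) are fixed by hand as assumptions so that the result matches the run shown earlier, and plain lists stand in for tensors.

```python
# Walk-through of the index arithmetic with plain Python lists.
# The offset and the shuffled order are hard-coded assumptions that mirror
# the run shown in the output; a real run draws both at random.
seq = list(range(30))
batch_size, num_steps = 2, 6

offset = 3                        # assumed result of random.randint(0, num_steps - 1)
corpus = seq[offset:]             # [3, 4, ..., 29], 27 elements

num_subseqs = (len(corpus) - 1) // num_steps                          # 26 // 6 = 4
initial_indices = list(range(0, num_subseqs * num_steps, num_steps))  # [0, 6, 12, 18]
num_batches = num_subseqs // batch_size                               # 4 // 2 = 2

shuffled = [0, 6, 18, 12]         # assumed result of random.shuffle(initial_indices)

batches = []
for i in range(0, batch_size * num_batches, batch_size):   # i = 0, 2
    starts = shuffled[i: i + batch_size]
    X = [corpus[j: j + num_steps] for j in starts]
    Y = [corpus[j + 1: j + 1 + num_steps] for j in starts]  # labels: inputs shifted by 1
    batches.append((X, Y))

# Number of usable subsequences for every admissible offset 0..num_steps-1.
counts = [(len(seq[o:]) - 1) // num_steps for o in range(num_steps)]

for X, Y in batches:
    print('X:', X, '\nY:', Y)
print('subsequences per offset:', counts)
```

The last line shows that, for this 30-element sequence, the number of subsequences does not depend on which offset is drawn.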

Sequential sampling

Code:

import random
import torch
def seq_data_iter_sequential(corpus, batch_size, num_steps):  #@save
    """Generate minibatches of subsequences using sequential partitioning."""
    # Start partitioning from a random offset
    offset = random.randint(0, num_steps)
    # Largest token count that divides evenly into batch_size rows,
    # leaving one element at the end for the last label
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    Xs = torch.tensor(corpus[offset: offset + num_tokens])
    Ys = torch.tensor(corpus[offset + 1: offset + 1 + num_tokens])
    # Each row is one contiguous stretch of the sequence
    Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)
    num_batches = Xs.shape[1] // num_steps
    for i in range(0, num_steps * num_batches, num_steps):
        # Consecutive minibatches take consecutive column windows of each row
        X = Xs[:, i: i + num_steps]
        Y = Ys[:, i: i + num_steps]
        yield X, Y


my_seq = list(range(30))
for X, Y in seq_data_iter_sequential(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y)

Output (from one run, in which the random offset was 5; the batch labels were added by hand):

batch 1:
X:  tensor([[ 5,  6,  7,  8,  9, 10],
        [17, 18, 19, 20, 21, 22]])
Y: tensor([[ 6,  7,  8,  9, 10, 11],
        [18, 19, 20, 21, 22, 23]])
batch 2:
X:  tensor([[11, 12, 13, 14, 15, 16],
        [23, 24, 25, 26, 27, 28]])
Y: tensor([[12, 13, 14, 15, 16, 17],
        [24, 25, 26, 27, 28, 29]])

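Note that "sequential" refers to consecutive minibatches: row k of one minibatch continues directly from row k of the previous one (row 0 of batch 1 ends at 10 and row 0 of batch 2 starts at 11). The rows inside a single minibatch are not adjacent to each other; they come from different regions of the sequence (here the first and second halves of the trimmed sequence).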
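The adjacency property can be checked with a small pure-Python sketch. The helper name seq_partition_lists is hypothetical, and unlike the function above it takes the offset as an argument (an assumption made here so the result is reproducible); list slicing stands in for the tensor reshape.

```python
def seq_partition_lists(corpus, batch_size, num_steps, offset):
    """Hypothetical list-based sketch of sequential partitioning.

    The offset is supplied by the caller instead of being drawn at random.
    """
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    xs = corpus[offset: offset + num_tokens]
    ys = corpus[offset + 1: offset + 1 + num_tokens]
    row_len = num_tokens // batch_size
    # List equivalent of reshape(batch_size, -1): cut the flat list into rows.
    x_rows = [xs[r * row_len: (r + 1) * row_len] for r in range(batch_size)]
    y_rows = [ys[r * row_len: (r + 1) * row_len] for r in range(batch_size)]
    num_batches = row_len // num_steps
    for i in range(0, num_steps * num_batches, num_steps):
        yield ([row[i: i + num_steps] for row in x_rows],
               [row[i: i + num_steps] for row in y_rows])


batches = list(seq_partition_lists(list(range(30)), batch_size=2, num_steps=6, offset=5))
(x1, y1), (x2, y2) = batches

# Row r of the second batch starts right after row r of the first batch ends.
adjacent = all(x2[r][0] == x1[r][-1] + 1 for r in range(2))
# Rows inside one batch start a full row length apart in the sequence.
gap = x1[1][0] - x1[0][0]

print('batch 1 X:', x1)
print('batch 2 X:', x2)
print('rows continue across batches:', adjacent, '| gap between rows:', gap)
```

With offset 5 this reproduces the two minibatches shown in the output. Because each row continues across minibatches, an RNN's hidden state from one minibatch can be carried over to the next, which is not meaningful under random sampling.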
