An introduction to pack_padded_sequence
Adapted from: https://github.com/HarshTrivedi/packing-unpacking-pytorch-minimal-tutorial
seq_tensor is the batch of sentences as ID sequences, each padded with 0 to the length of the longest sentence (shown here before sorting; a construction sketch follows below):
# seq_tensor => [[ 6  9  8  4  1 11 12 10]   # long_str
#                [12  5  8 14  0  0  0  0]   # tiny
#                [ 7  3  2  5 13  7  0  0]]  # medium
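Below is a minimal construction sketch (paraphrasing the linked tutorial); the vocabulary size of 15 and embedding dim of 4 are assumptions chosen to match the outputs shown later:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Pad the variable-length ID sequences with 0 (the <pad> index) to the max length.
seqs = [[6, 9, 8, 4, 1, 11, 12, 10],  # long_str, length 8
        [12, 5, 8, 14],               # tiny,     length 4
        [7, 3, 2, 5, 13, 7]]          # medium,   length 6
seq_lengths = torch.LongTensor([len(s) for s in seqs])
seq_tensor = torch.zeros(len(seqs), seq_lengths.max().item(), dtype=torch.long)
for i, s in enumerate(seqs):
    seq_tensor[i, :len(s)] = torch.LongTensor(s)

# Sort by descending length, as pack_padded_sequence requires by default
# (enforce_sorted=True); the row order becomes long_str, medium, tiny.
seq_lengths, perm_idx = seq_lengths.sort(0, descending=True)
seq_tensor = seq_tensor[perm_idx]

# Embedding table: IDs run up to 14, so vocab size 15; embedding dim 4.
embed = nn.Embedding(15, 4)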
Look up the embedding for each ID. After the sort above, the rows of seq_tensor (and of everything that follows) are ordered long_str, medium, tiny:
>>> embedded_seq_tensor = embed(seq_tensor)  # look up each word index
>>> embedded_seq_tensor.shape
torch.Size([3, 8, 4])
Pack the embedded tensor, shape (batch=3, max_len=8, embedding_dim=4), to serve as the LSTM input:
>>> packed_input = pack_padded_sequence(embedded_seq_tensor, seq_lengths.cpu().numpy(), batch_first=True)
>>> packed_input
PackedSequence(data=tensor([[-0.6474, -0.4581,  1.0932,  0.3649],
        [ 1.8927,  1.7140,  0.0307, -0.3000],
        [-0.3006,  1.4894,  0.8670, -0.8590],
        [ 0.0828, -0.6122,  0.9160, -1.5285],
        [ 0.5459, -0.4584, -0.2575,  0.7583],
        [ 1.0652, -1.0458, -1.2506, -0.4831],
        [-0.6197, -1.8122,  1.3841,  0.3279],
        [-0.9724,  0.5644,  0.1947, -0.8260],
        [-0.6197, -1.8122,  1.3841,  0.3279],
        [ 0.9421,  1.2762,  0.4542,  0.4911],
        [ 1.0652, -1.0458, -1.2506, -0.4831],
        [-1.1810,  0.6112,  1.6059,  0.3915],
        [-1.8854,  0.1875, -0.0161,  0.1068],
        [-0.5620, -0.8789, -0.9030,  1.0833],
        [-0.4251, -0.9332, -0.6854,  0.6752],
        [ 1.8927,  1.7140,  0.0307, -0.3000],
        [-0.3006,  1.4894,  0.8670, -0.8590],
        [-0.3473,  0.1230,  0.2848,  0.7579]],
       grad_fn=<PackPaddedSequenceBackward>), batch_sizes=tensor([3, 3, 3, 3, 2, 2, 1, 1]), sorted_indices=None, unsorted_indices=None)
In the packed data above, rows are grouped by time step: the first 3 rows are step 0 of all three sequences, the next 3 rows are step 1, and so on. batch_sizes records how many sequences are still active at each step: [3, 3, 3, 3, 2, 2, 1, 1].
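As a sanity check (a sketch using the tensors above), the packed rows line up exactly with the time-step columns of embedded_seq_tensor, which is printed below:

# Step 0 of all 3 sequences occupies rows 0-2, step 1 rows 3-5, ...;
# at step 4 only the two longest sequences (long_str, medium) remain.
assert torch.equal(packed_input.data[0:3],   embedded_seq_tensor[:, 0, :])
assert torch.equal(packed_input.data[3:6],   embedded_seq_tensor[:, 1, :])
assert torch.equal(packed_input.data[12:14], embedded_seq_tensor[:2, 4, :])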
>>> embedded_seq_tensor
tensor([[[-0.6474, -0.4581,  1.0932,  0.3649],
         [ 0.0828, -0.6122,  0.9160, -1.5285],
         [-0.6197, -1.8122,  1.3841,  0.3279],
         [ 0.9421,  1.2762,  0.4542,  0.4911],
         [-1.8854,  0.1875, -0.0161,  0.1068],
         [-0.4251, -0.9332, -0.6854,  0.6752],
         [-0.3006,  1.4894,  0.8670, -0.8590],
         [-0.3473,  0.1230,  0.2848,  0.7579]],

        [[ 1.8927,  1.7140,  0.0307, -0.3000],
         [ 0.5459, -0.4584, -0.2575,  0.7583],
         [-0.9724,  0.5644,  0.1947, -0.8260],
         [ 1.0652, -1.0458, -1.2506, -0.4831],
         [-0.5620, -0.8789, -0.9030,  1.0833],
         [ 1.8927,  1.7140,  0.0307, -0.3000],
         [-1.3707,  1.0039,  0.3160, -0.0382],   # <pad>
         [-1.3707,  1.0039,  0.3160, -0.0382]],  # <pad>

        [[-0.3006,  1.4894,  0.8670, -0.8590],
         [ 1.0652, -1.0458, -1.2506, -0.4831],
         [-0.6197, -1.8122,  1.3841,  0.3279],
         [-1.1810,  0.6112,  1.6059,  0.3915],
         [-1.3707,  1.0039,  0.3160, -0.0382],   # <pad>
         [-1.3707,  1.0039,  0.3160, -0.0382],   # <pad>
         [-1.3707,  1.0039,  0.3160, -0.0382],   # <pad>
         [-1.3707,  1.0039,  0.3160, -0.0382]]], # <pad>
       grad_fn=<EmbeddingBackward>)
Comparing the two, the <pad> rows never enter the packed data: pack_padded_sequence squeezes the padding out and regroups the remaining rows by time step, with the number of rows at each step recorded in batch_sizes.
# l o n g _ s t r    #(long_str)
# m e d i u m        #(medium)
# t i n y            #(tiny)
# 3 3 3 3 2 2 1 1    (sum = 18 [batch_sum_seq_len])
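Equivalently, batch_sizes[t] is just the number of sequences longer than t; a small pure-Python sketch:

lengths = [8, 6, 4]   # long_str, medium, tiny
batch_sizes = [sum(l > t for l in lengths) for t in range(max(lengths))]
# -> [3, 3, 3, 3, 2, 2, 1, 1], and sum(batch_sizes) == 18 packed rows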
A PackedSequence can be fed directly to any RNN-family module (RNN/LSTM/GRU).
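Here the LSTM is assumed to be single-layer, with input size 4 (the embedding dim) and hidden size 5, matching the 5-dim output vectors shown below:

lstm = nn.LSTM(input_size=4, hidden_size=5, batch_first=True)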
packed_output, (ht, ct) = lstm(packed_input)
packed_output has the same PackedSequence structure as packed_input (identical batch_sizes); only the feature dimension changes, from the embedding size (4) to the LSTM hidden size (5).
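A quick structural check (sketch):

assert torch.equal(packed_output.batch_sizes, packed_input.batch_sizes)
assert packed_output.data.shape == (18, 5)   # 18 packed rows, hidden size 5
# ht and ct have shape (1, 3, 5): each sequence's state after its true last
# step, since padding never passes through the LSTM.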
Finally, unpack the LSTM output, padding it back into a regular tensor:
output, input_sizes = pad_packed_sequence(packed_output, batch_first=True)
>>> output
tensor([[[-0.1913,  0.0616,  0.0053, -0.2993, -0.0519],
         [-0.2037,  0.0111, -0.0575, -0.3527, -0.0061],
         [-0.4200,  0.1399, -0.0600, -0.5751, -0.0459],
         [-0.1117,  0.0467,  0.1404, -0.5936, -0.0950],
         [-0.2933, -0.1411, -0.0714, -0.5874, -0.0953],
         [-0.2503, -0.0916, -0.0808, -0.5001, -0.1415],
         [-0.1736, -0.1496,  0.0066, -0.5162, -0.0836],
         [-0.2805, -0.1221,  0.0558, -0.5681, -0.1146]],

        [[ 0.0135, -0.0656,  0.2563, -0.2274, -0.1157],
         [-0.0546, -0.0350,  0.2173, -0.3502, -0.1615],
         [-0.1637, -0.1637,  0.0471, -0.3277, -0.1280],
         [-0.0885, -0.1080,  0.0852, -0.1958, -0.1846],
         [-0.2180, -0.1008,  0.0637, -0.3100, -0.1945],
         [-0.0160, -0.1177,  0.3077, -0.3295, -0.1802],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],   # pad
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000]],  # pad

        [[-0.0914, -0.1176,  0.1046, -0.2301,  0.0150],
         [-0.0793, -0.0833,  0.1397, -0.1591, -0.1546],
         [-0.3428,  0.0979, -0.0181, -0.4932, -0.1172],
         [-0.3916,  0.0067, -0.0245, -0.6498, -0.0518],
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],   # pad
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],   # pad
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],   # pad
         [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000]]], # pad
       grad_fn=<TransposeBackward0>)
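Since pad_packed_sequence zero-fills every step beyond a sequence's true length, the last valid output of each sequence sits at index length-1. A sketch of extracting it; for this single-layer, unidirectional LSTM it coincides with ht[-1]:

# input_sizes holds the true lengths [8, 6, 4] returned by pad_packed_sequence.
idx = (input_sizes - 1).view(-1, 1, 1).expand(-1, 1, output.size(2))
last_outputs = output.gather(1, idx).squeeze(1)   # shape (3, 5)
assert torch.allclose(last_outputs, ht[-1])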