Tensor Size Transformation Operations in PyTorch-Based Deep Learning Models

Date        Author    Version  Note
2023.08.16  Dog Tao   V1.0     Completed the initial draft of the document (in English)
2023.09.09  Dog Tao   V1.1     Revised the document; added notes on tensor concatenation and stacking operations

The Tensor Data Type

The tensor is a fundamental data type in PyTorch, and it’s essential for deep learning computations.

Brief introduction

  • Definition: A tensor in PyTorch is a multi-dimensional array, similar to NumPy’s ndarray. Tensors can be used on a GPU to accelerate computing.

  • Utility: Tensors are crucial for deep learning frameworks like PyTorch as they allow for efficient mathematical operations on GPUs. They’re used to store the input, output, and intermediate data as well as model parameters (like weights and biases of a neural network).

  • Types & Shapes: Tensors can have various data types such as float, integer, and boolean. They can exist in multiple shapes, representing scalar values (0-dimensional), vectors (1-dimensional), matrices (2-dimensional), or higher-dimensional structures.

  • Device Agnostic: One of the notable features of PyTorch tensors is their ability to be device agnostic. This means you can move tensors between CPU and GPU without much hassle, using the .to() method or .cuda() and .cpu() methods.

  • Creation: You can create tensors from Python lists, from NumPy arrays, or directly in PyTorch using functions like torch.tensor(), torch.zeros(), torch.ones(), torch.randn(), and many others.

  • Operations: Tensors support a plethora of operations, including arithmetic operations, reshaping, indexing, and mathematical functions. PyTorch provides an automatic differentiation system, which makes it easy to compute gradients with respect to tensors (important for training neural networks).

PyTorch’s tensor library provides the necessary tools for efficient computation needed in deep learning. The familiar syntax (especially if you come from a NumPy background) combined with its GPU acceleration capabilities makes it a go-to choice for many researchers and practitioners in the machine learning community.
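
As a minimal sketch of the points above (creation, data types, device movement, and autograd), the following example assumes only a standard PyTorch and NumPy installation; the GPU branch is taken only if CUDA is available:

import torch
import numpy as np

# Creation: from a Python list, from a NumPy array, and from a factory function
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # float32 tensor built from a nested list
b = torch.from_numpy(np.ones((2, 2)))        # float64 tensor sharing memory with the NumPy array
c = torch.randn(2, 2, requires_grad=True)    # random tensor tracked by autograd

# Device agnostic: move to the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = a.to(device)

# Operations and automatic differentiation
loss = (c * c).sum()    # a simple scalar function of c
loss.backward()         # fills c.grad with d(loss)/dc = 2 * c
print(c.grad)           # a 2x2 tensor of gradients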

Meanings of the dimensions

In deep learning and PyTorch, the dimensions of a tensor often have specific meanings based on the context in which they are used. However, it’s essential to note that the exact meaning of each dimension can vary based on the data type, the neural network architecture, or the specific operation being performed.

Here are some common interpretations of tensor dimensions based on different contexts:

  1. Standard Images (e.g., from torchvision datasets):

    • Shape: [batch_size, channels, height, width]
      • batch_size: Number of images in a mini-batch.
      • channels: Number of color channels (e.g., 3 for RGB, 1 for grayscale).
      • height: Height of the image in pixels.
      • width: Width of the image in pixels.
  2. Sequences (e.g., for RNNs, LSTMs):

    • Shape: [seq_len, batch_size, feature_size] or [batch_size, seq_len, feature_size] (depends on the batch_first argument)
      • seq_len: Length of the sequence.
      • batch_size: Number of sequences in a mini-batch.
      • feature_size: Number of features at each sequence step.
  3. Time Series:

    • Shape: [batch_size, sequence_length, num_features]
      • batch_size: Number of time series in a mini-batch.
      • sequence_length: Number of time steps in the time series.
      • num_features: Number of features at each time step.
  4. Embeddings:

    • Shape: [num_words, embedding_dim]
      • num_words: Number of words or unique tokens in the vocabulary.
      • embedding_dim: Dimensionality of the embedding vector for each word.
  5. FC Layers (Fully Connected Layers):

    • Shape: [batch_size, num_features]
      • batch_size: Number of samples in a mini-batch.
      • num_features: Number of features for each sample.
  6. 3D Medical Images (e.g., MRI scans):

    • Shape: [batch_size, channels, depth, height, width]
      • batch_size: Number of scans in a mini-batch.
      • channels: Number of channels (could be different modalities or types of scans).
      • depth: Depth or number of slices in the 3D scan.
      • height: Height of each slice.
      • width: Width of each slice.

In practice, it’s crucial to consult the documentation or specific context in which you’re working to determine the precise meaning of each dimension.
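
To make these conventions concrete, the sketch below builds dummy tensors with each of the layouts listed above; the specific sizes are arbitrary examples, not requirements:

import torch

images    = torch.randn(32, 3, 224, 224)     # [batch_size, channels, height, width]
sequences = torch.randn(32, 50, 128)         # [batch_size, seq_len, feature_size] (batch_first=True)
series    = torch.randn(32, 100, 8)          # [batch_size, sequence_length, num_features]
embedding = torch.randn(10000, 300)          # [num_words, embedding_dim]
fc_input  = torch.randn(32, 512)             # [batch_size, num_features]
scans     = torch.randn(4, 1, 64, 128, 128)  # [batch_size, channels, depth, height, width]

print(images.shape, scans.shape)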

Methods for Changing Tensor Dimensions

In PyTorch, squeeze(), unsqueeze(), and view() are used to change the dimensions (or shape) of a tensor, but they do so in different ways.

tensor.squeeze()

  • The squeeze() method removes dimensions of size 1 from the shape of a tensor.
  • By default, it removes all dimensions of size 1, but you can also specify a particular dimension to squeeze.

Examples:

import torch

# Tensor with shape (1, 3, 1, 2)
x = torch.zeros(1, 3, 1, 2)

# Remove all dimensions of size 1
y = x.squeeze()
print(y.shape)  # torch.Size([3, 2])

# Squeeze only the 0th dimension
z = x.squeeze(0)
print(z.shape)  # torch.Size([3, 1, 2])

In PyTorch, when you use negative indices with functions like squeeze() and unsqueeze(), the counting of dimensions starts from the end (rightmost) of the tensor shape, similar to negative indexing in Python lists.

  • squeeze(-1): This will attempt to remove the last dimension of the tensor, but only if its size is 1. If the last dimension isn’t of size 1, the tensor remains unchanged.

Example:

import torch

# Tensor with shape (3, 4, 1)
x = torch.zeros(3, 4, 1)

# Remove the last dimension, as it is of size 1
y = x.squeeze(-1)
print(y.shape) # torch.Size([3, 4])

tensor.unsqueeze()

  • The unsqueeze() method adds a dimension of size 1 at a specified position.
  • You need to specify where you want the new dimension.

Examples:

# Tensor with shape (3, 2)
x = torch.zeros(3, 2)

# Add a dimension at position 0
y = x.unsqueeze(0)
print(y.shape)  # torch.Size([1, 3, 2])

# Add a dimension at position 2
z = x.unsqueeze(2)
print(z.shape)  # torch.Size([3, 2, 1])
  • unsqueeze(-1): This will add a new last dimension of size 1 to the tensor.

Example:

# Tensor with shape (3, 4)
x = torch.zeros(3, 4)

# Add a new last dimension
y = x.unsqueeze(-1)
print(y.shape) # torch.Size([3, 4, 1])

In deep learning, especially when dealing with models like CNNs or RNNs, the input tensor’s shape often needs to match the model’s expected shape. For instance, a CNN may expect a 4D tensor as input (batch size, channels, height, width), but sometimes you might have a single image of shape (channels, height, width). In this case, you’d use unsqueeze() to add a batch dimension of size 1 before passing the image to the model. Conversely, the output from the model might have a singleton batch dimension that you want to remove with squeeze() before further processing.

In practice, unsqueeze(1) is commonly used to insert a new dimension at position 1 (e.g., a channel dimension), turning, for instance, a 2D tensor of shape [batch_size, features] into a 3D tensor of shape [batch_size, 1, features]. This is handy in various deep learning scenarios, such as when prepping data to meet the shape expectations of certain 1D convolutional layers, as in the sketch below.
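
Both patterns can be sketched as follows; the layer configurations and sizes here are arbitrary examples, not a required setup:

import torch
import torch.nn as nn

# Pattern 1: add a batch dimension to a single image before a 2D CNN
image = torch.randn(3, 64, 64)                     # [channels, height, width]
batch = image.unsqueeze(0)                         # [1, 3, 64, 64]
conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)
single = conv2d(batch).squeeze(0)                  # back to [16, 64, 64] without the batch dim

# Pattern 2: add a channel dimension before a 1D convolution
x = torch.randn(32, 100)                           # [batch_size, features]
x = x.unsqueeze(1)                                 # [32, 1, 100]
conv1d = nn.Conv1d(1, 8, kernel_size=3, padding=1)
print(conv1d(x).shape)                             # torch.Size([32, 8, 100])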

tensor.view()

The tensor.view() method in PyTorch is used to reshape a tensor. It returns a new tensor with the specified shape. The new tensor will share the same underlying data with the original tensor, which means if you modify the original tensor, the reshaped tensor will also get modified and vice versa. This behavior ensures efficient memory usage.

Here’s a breakdown of how tensor.view() works:

  1. Reshaping: You can provide the desired shape as arguments to the view() method to reshape the tensor.

  2. Automatic Inference: You can specify one dimension as -1, and PyTorch will automatically compute the correct size for that dimension based on the other dimensions you’ve provided. This is particularly useful when you don’t know the size of a specific dimension in advance.

  3. Requirements:

    • The new shape must contain the same number of elements as the original shape. For instance, if the original tensor has a shape of [4, 5] (i.e., 20 elements), the reshaped tensor might have shapes like [10, 2], [20], [2, 10], etc., but not [3, 7] (because that would be 21 elements).
    • The original tensor must be contiguous in memory. If it’s not, you’ll need to call tensor.contiguous() before using view().

Examples:

import torch

# Create a tensor of shape [2, 3]
x = torch.tensor([[1, 2, 3], [4, 5, 6]])

# Reshape to [3, 2]
y = x.view(3, 2)
print(y)
# tensor([[1, 2],
#         [3, 4],
#         [5, 6]])

# Reshape to a 1D tensor with 6 elements
z = x.view(-1)
print(z)
# tensor([1, 2, 3, 4, 5, 6])

# Reshape to [6, 1]
w = x.view(6, -1)
print(w)
# tensor([[1],
#         [2],
#         [3],
#         [4],
#         [5],
#         [6]])
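
The contiguity requirement mentioned above typically shows up after operations such as transposing. Continuing with the tensor x from the example above, a minimal sketch:

# A transpose returns a non-contiguous view of the same data
t = x.t()                      # shape [3, 2]
print(t.is_contiguous())       # False

# t.view(6) would raise a RuntimeError here; make the data contiguous first
y = t.contiguous().view(6)

# Alternatively, reshape() handles both cases and copies only when necessary
z = t.reshape(6)
print(y.shape, z.shape)        # torch.Size([6]) torch.Size([6])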

Stacking and Concatenating Tensors

Tensor Stacking (torch.stack)

torch.stack is a function in PyTorch used to stack tensors along a new dimension. This operation is similar to torch.cat, but it introduces an additional dimension.

When you have a series of tensors and wish to stack them into a larger tensor, you can utilize torch.stack. This is particularly useful when you want to stack a series of vectors into a matrix or stack matrices into a 3D tensor.

Examples and Usage:

Let’s say you have the following two 1-D tensors:

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

If you wish to stack these two 1-D tensors into a 2-D tensor (matrix):

c = torch.stack((a, b))

Now, c would be:

tensor([[1, 2, 3],
        [4, 5, 6]])

Another parameter of torch.stack is dim, which specifies the dimension along which the tensors are stacked. The default is 0, but changing it alters the stacking direction.

In short, torch.stack allows you to stack tensors of the same shape into a higher-dimensional tensor.
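
For example, with the same two vectors as above, changing dim moves where the new axis is created (a minimal sketch):

import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

print(torch.stack((a, b), dim=0).shape)  # torch.Size([2, 3]) -- new leading dimension
print(torch.stack((a, b), dim=1).shape)  # torch.Size([3, 2]) -- new trailing dimension
print(torch.stack((a, b), dim=1))
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])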

Tensor Concatenation (torch.cat)

torch.cat is a function in PyTorch used to concatenate tensors along a specified dimension. It lets you merge multiple tensors into a larger one.

The main difference between torch.cat and torch.stack is that torch.cat doesn’t introduce a new dimension; it extends the tensor on an existing dimension.

Examples and Usage:

  1. 1-D tensors:

For two 1-D tensors:

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

Use torch.cat to concatenate:

c = torch.cat((a, b))

Now, c is:

tensor([1, 2, 3, 4, 5, 6])
  2. 2-D tensors:

For two 2-D tensors:

x = torch.tensor([[1, 2], [3, 4]])
y = torch.tensor([[5, 6]])

To concatenate along dimension 0 (rows):

z = torch.cat((x, y), dim=0)

Now, z is:

tensor([[1, 2],
        [3, 4],
        [5, 6]])

Or if you have a y of the same shape as x:

y = torch.tensor([[5, 6], [7, 8]])

Concatenate along dimension 1 (columns):

z = torch.cat((x, y), dim=1)

Now, z is:

tensor([[1, 2, 5, 6],
        [3, 4, 7, 8]])

Note: For torch.cat, sizes for all dimensions, except for the one you wish to concatenate on, must match.

In summary, torch.cat enables you to concatenate tensors along a specified dimension, creating a larger tensor without adding new dimensions.
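
A common higher-dimensional use case is concatenating feature maps along the channel dimension (for example, in skip connections). A minimal sketch with arbitrary sizes:

import torch

x = torch.randn(8, 64, 32, 32)   # [batch, channels, height, width]
y = torch.randn(8, 32, 32, 32)   # identical except for the channel dimension

z = torch.cat((x, y), dim=1)     # concatenate along the channel dimension
print(z.shape)                   # torch.Size([8, 96, 32, 32])

# torch.cat((x, y), dim=0) would raise an error here, because the channel sizes differ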

How Convolutional Layers Affect Data Dimensions

Kernel Size, Stride, and Padding

The kernel size (often also referred to as the filter size) in a convolutional layer directly affects the size of the output (also called the feature map or activation map).

Here’s a breakdown of how the kernel size, along with other parameters, affects the output size:

  1. Kernel Size: The dimensions of the filter used in the convolution operation. Common sizes include (1 × 1), (3 × 3), (5 × 5), etc. in 2D convolutions. The kernel size determines how big of a region in the input we are looking at.

  2. Stride: The number of positions the kernel slides over the input tensor. A stride of 1 means the kernel moves one position at a time, while a stride of 2 means it jumps over one position. The greater the stride, the smaller the output size.

  3. Padding: The number of zeroes added to the border of the input tensor. Padding is used to control the spatial dimensions of the output tensor; with an appropriate amount of zero padding, the spatial dimensions can be kept the same when a kernel larger than (1 × 1) is used with a stride of 1.

To compute the spatial dimensions of the output feature map for a 2D convolution (assuming square inputs and filters for simplicity):
$$\text{output\_size} = \left\lfloor \frac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$$

For example, let’s consider a 2D input of size (28 × 28):

  • Using a (3 × 3) kernel with stride 1 and padding 1, the output size remains (28 × 28).
  • Using a (5 × 5) kernel with stride 1 and padding 2, the output size remains (28 × 28).
  • Using a (3 × 3) kernel with stride 2 and padding 1, the output size becomes (14 × 14).

Remember that the exact formula for calculating output size can change depending on the specific type of convolution (e.g., transposed convolution, dilated convolution).
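
The three examples above can be checked directly with nn.Conv2d; the channel counts in this sketch are arbitrary:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # [batch, channels, height, width]

print(nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)(x).shape)  # torch.Size([1, 8, 28, 28])
print(nn.Conv2d(1, 8, kernel_size=5, stride=1, padding=2)(x).shape)  # torch.Size([1, 8, 28, 28])
print(nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)(x).shape)  # torch.Size([1, 8, 14, 14])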

Keeping the size unchanged

To make the output size the same as the input size (often referred to as “same” padding), the padding P can be set based on the kernel size K and the stride S.

For a convolution operation with a stride of 1, the padding needed to maintain the same spatial dimensions for input and output is:

$$P = \frac{K - 1}{2}$$

For instance, with a kernel size of (3 × 3) (K = 3) and a stride of 1, you’d need:

$$P = \frac{3 - 1}{2} = 1$$

So, a padding of 1 would maintain the same dimensions.

However, when using a stride greater than 1, it becomes trickier to maintain exact input-output dimensions. Generally, a stride greater than 1 will downsample the input, and the exact amount of padding needed to keep dimensions consistent will depend on both the input size and the desired output size.

It’s also worth noting that, in deep learning libraries like TensorFlow or PyTorch, you can often specify padding as “same” to automatically ensure the output size matches the input size, at least for a stride of 1. But if you’re implementing convolutions from scratch or need a deep understanding for some advanced architectures or troubleshooting, knowing how to compute the padding manually is useful.

self.convs.append(nn.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=1, padding="same"))

Using padding=‘same’ with even kernel lengths and odd dilation may require a zero-padded copy of the input to be created.
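
A quick check that padding="same" preserves the sequence length for a stride of 1; the channel counts and kernel size in this sketch are arbitrary:

import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=5, stride=1, padding="same")
x = torch.randn(2, 4, 100)   # [batch, in_channels, length]
print(conv(x).shape)         # torch.Size([2, 8, 100]) -- the length dimension is unchanged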

When the stride is 1

Setting the padding value to kernel_size // 2 is a common practice when the stride is 1, especially for odd-sized kernels. This choice makes it easy to ensure that the output dimensions match the input dimensions.

  1. Odd-sized Kernels: When the kernel size is odd (e.g., 3, 5, 7, …), kernel_size // 2 effectively implements the formula for “same” padding:
    $$P = \frac{K - 1}{2}$$
    Using integer division (// in Python) ensures a whole number. For example, for a (3 × 3) kernel:
    $$P = \frac{3 - 1}{2} = 1$$
    For a (5 × 5) kernel:
    $$P = \frac{5 - 1}{2} = 2$$
    and so on.

  2. Even-sized Kernels: For even-sized kernels, using kernel_size // 2 as the padding doesn’t perfectly preserve dimensions. This is part of the reason why odd-sized kernels are more commonly used in practice. However, if even-sized kernels are used, the designer must decide on a specific padding scheme or adjust the kernel size.

  3. Stride: The above rationale holds when the stride is set to 1. If stride is greater than 1, the output dimensions will be reduced even with the padding set to kernel_size // 2.

The practice of using kernel_size // 2 makes it easier to design and adjust architectures without constantly recalculating padding, especially when using odd-sized kernels with a stride of 1.
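
In code, the convention looks like the sketch below; the channel counts and kernel size are arbitrary examples:

import torch
import torch.nn as nn

kernel_size = 5
conv = nn.Conv2d(16, 32, kernel_size=kernel_size, stride=1, padding=kernel_size // 2)

x = torch.randn(1, 16, 28, 28)
print(conv(x).shape)   # torch.Size([1, 32, 28, 28]) -- spatial size preserved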

How Pooling Layers Affect Data Dimensions

Pooling layers in neural networks, especially in convolutional neural networks (CNNs), are used to reduce the spatial dimensions of the data (i.e., width and height). This downsampling operation serves a few purposes:

  1. Reduces the number of parameters and computations in the network, which can help combat overfitting.
  2. Introduces translation invariance to some extent.
  3. Preserves the dominant features of the data due to the max or average operation.

There are several types of pooling operations, but the most common ones are:

  1. Max Pooling: Takes the maximum value from a group of values in a local region.
  2. Average Pooling: Takes the average value from a group of values in a local region.

The formula to compute the output size after pooling is similar to the formula used for convolution:

$$\text{output\_size} = \left\lfloor \frac{\text{input\_size} - \text{pooling\_size}}{\text{stride}} \right\rfloor + 1$$

Where:

  • input_size is the width or height of the input data.
  • pooling_size is the size of the pooling kernel.
  • stride is the number of pixels the pooling kernel moves per step. If not specified, it’s usually the same as the pooling size.

Examples:

  1. Max Pooling:
import torch.nn as nn

# Assume we have an input tensor of shape [batch_size, channels, height, width]
# For this example: [32, 3, 64, 64]

pooling_layer = nn.MaxPool2d(kernel_size=2, stride=2)
# This will reduce the spatial dimensions (height and width) by half.
# Output shape: [32, 3, 32, 32]
  2. Average Pooling:
pooling_layer = nn.AvgPool2d(kernel_size=2, stride=2)
# Again, this will reduce the spatial dimensions by half.
# Output shape: [32, 3, 32, 32]
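
The shape comments above can be verified by actually applying the layers to a dummy tensor (a minimal sketch):

import torch
import torch.nn as nn

x = torch.randn(32, 3, 64, 64)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x).shape)  # torch.Size([32, 3, 32, 32])
print(avg_pool(x).shape)  # torch.Size([32, 3, 32, 32])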

Note:

  1. Pooling operations are usually applied after convolutional layers.
  2. While they reduce the spatial dimensions, pooling operations don’t change the number of channels.
  3. Pooling layers don’t have trainable parameters, unlike convolutional layers.

In practice, modern architectures sometimes prefer using strided convolutions for downsampling instead of pooling layers, but pooling remains an important concept in the understanding and history of CNNs.
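
As a brief illustration of that alternative, a stride-2 convolution halves the spatial dimensions much like a 2 × 2 pooling layer, but with learnable weights; the channel counts in this sketch are arbitrary:

import torch
import torch.nn as nn

x = torch.randn(32, 3, 64, 64)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
strided_conv = nn.Conv2d(3, 3, kernel_size=3, stride=2, padding=1)

print(pool(x).shape)          # torch.Size([32, 3, 32, 32])
print(strided_conv(x).shape)  # torch.Size([32, 3, 32, 32]) -- same downsampling, but learnable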

How Fully Connected Layers Affect Data Dimensions

In a Convolutional Neural Network (CNN), a fully connected (FC) layer, also known as a dense layer, typically appears after a series of convolutional and pooling layers, and is used to make predictions or classifications based on the extracted features.

To properly set up the input and output dimensions for the FC layers, you need to understand the flow of the data:

  1. Input Dimension of the First FC Layer:

    • The input to the FC layer is usually a flattened version of the output from the last convolutional or pooling layer.
    • To calculate the size, you multiply the depth (number of channels or feature maps), height, and width of the tensor output by the last conv/pool layer.
      • For instance, if the output of your last pooling layer is [batch_size, 128, 5, 5] (with 128 feature maps of size 5 × 5), then the input dimension for your FC layer after flattening would be 128 * 5 * 5 = 3200 (channels × height × width).
  2. Output Dimension of the FC Layer(s):

    • The output dimension of an FC layer is a design choice that depends on the complexity of the model and the nature of the task.
    • Common values in CNN architectures are powers of 2 (like 512, 256, 128, etc.).
    • The very last FC layer (if you’re doing classification) should have an output size equal to the number of classes you are predicting.
    • For regression tasks, the last FC layer should match the number of regression outputs.

Here’s a simple illustration:

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes):
        super(SimpleCNN, self).__init__()

        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),  # Assuming 3-channel images as input
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        
        self.fc_layers = nn.Sequential(
            nn.Linear(128 * 16 * 16, 512),  # Assuming input image size is 128x128
            nn.ReLU(),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.fc_layers(x)
        return x

In the above example, for an input image of size 128x128 and 3 channels, the size of the tensor before the FC layers is [batch_size, 128, 16, 16]. The flattened size is 128 * 16 * 16 = 32768. The FC layers reduce this to 512 features, and finally, to num_classes outputs.
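
A quick way to sanity-check these dimensions is to push a dummy batch through the model; this usage sketch relies on the SimpleCNN class defined above:

import torch

model = SimpleCNN(num_classes=10)
dummy = torch.randn(4, 3, 128, 128)   # [batch_size, channels, height, width]
print(model(dummy).shape)             # torch.Size([4, 10])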
