torch.nn (Part 2)

This post walks through the recurrent layers commonly used in deep learning: the vanilla RNN, long short-term memory (LSTM), and gated recurrent unit (GRU). It covers their mathematical formulations, their layer and cell implementations, and how their parameters are initialized, with examples showing how to process sequence data with these layers in PyTorch.


Reference: torch.nn (Part 2) - Cloud+ Community - Tencent Cloud

目录

Recurrent layers

RNN

LSTM

GRU

RNNCell

LSTMCell

GRUCell

Transformer layers

Transformer

TransformerEncoder

TransformerDecoder

TransformerEncoderLayer

TransformerDecoderLayer

Linear layers

Identity

Linear

Bilinear

Dropout layers

Dropout

Dropout2d

Dropout3d

AlphaDropout

Sparse layers

Embedding

EmbeddingBag

Distance functions

CosineSimilarity

PairwiseDistance

Loss functions

L1Loss

MSELoss

CrossEntropyLoss

CTCLoss

NLLLoss

PoissonNLLLoss

KLDivLoss

BCELoss

BCEWithLogitsLoss

MarginRankingLoss

HingeEmbeddingLoss

MultiLabelMarginLoss

SmoothL1Loss

SoftMarginLoss

MultiLabelSoftMarginLoss

CosineEmbeddingLoss

MultiMarginLoss

TripletMarginLoss

Vision layers

PixelShuffle

Upsample

UpsamplingNearest2d

UpsamplingBilinear2d

DataParallel layers (multi-GPU, distributed)

DataParallel

DistributedDataParallel

Utilities

clip_grad_norm_

clip_grad_value_

parameters_to_vector

vector_to_parameters

weight_norm

remove_weight_norm

spectral_norm

remove_spectral_norm

PackedSequence

pack_padded_sequence

pad_packed_sequence

pad_sequence

pack_sequence


Recurrent layers

RNN

class torch.nn.RNN(*args, **kwargs)[source]

Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.

For each element in the input sequence, each layer computes the following function:

h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{(t-1)} + b_{hh})

where h_t is the hidden state at time t, x_t is the input at time t, and h_{(t-1)} is the hidden state of the previous layer at time t-1 or the initial hidden state at time 0. If nonlinearity is 'relu', then ReLU is used instead of tanh.

Parameters

  • input_size – The number of expected features in the input x

  • hidden_size – The number of features in the hidden state h

  • num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two RNNs together to form a stacked RNN, with the second RNN taking in outputs of the first RNN and computing the final results. Default: 1

  • nonlinearity – The non-linearity to use. Can be either 'tanh' or 'relu'. Default: 'tanh'

  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True

  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False

  • dropout – If non-zero, introduces a Dropout layer on the outputs of each RNN layer except the last layer, with dropout probability equal to dropout. Default: 0

  • bidirectional – If True, becomes a bidirectional RNN. Default: False

Inputs: input, h_0

  • input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.

  • h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.

Outputs: output, h_n

  • output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.

    For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case.

  • h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len.

    Like output, the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size).

Shape:

  • Input1: (L, N, H_in) tensor containing input features, where H_in = input_size and L is the sequence length.

  • Input2: (S, N, H_out) tensor containing the initial hidden state for each element in the batch, where H_out = hidden_size and S = num_layers * num_directions. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.

  • Output1: (L, N, H_all) where H_all = num_directions * hidden_size

  • Output2: (S, N, H_out) tensor containing the next hidden state for each element in the batch

Variables

  • ~RNN.weight_ih_l[k] – the learnable input-hidden weights of the k-th layer, of shape (hidden_size, input_size) for k = 0. Otherwise, the shape is (hidden_size, num_directions * hidden_size)

  • ~RNN.weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer, of shape (hidden_size, hidden_size)

  • ~RNN.bias_ih_l[k] – the learnable input-hidden bias of the k-th layer, of shape (hidden_size)

  • ~RNN.bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer, of shape (hidden_size)

Note

All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}.

Note

If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.

Examples:

>>> rnn = nn.RNN(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> output, hn = rnn(input, h0)
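Because the weights are exposed as the Variables listed above, the recurrence can be replayed by hand as a sanity check. A minimal sketch (single layer, default tanh, zero initial state chosen for brevity):

>>> rnn = nn.RNN(10, 20, 1)
>>> x = torch.randn(5, 3, 10)
>>> h0 = torch.zeros(1, 3, 20)
>>> output, hn = rnn(x, h0)
>>> ht = h0[0]
>>> for t in range(5):  # h_t = tanh(W_ih x_t + b_ih + W_hh h_{(t-1)} + b_hh)
...     ht = torch.tanh(x[t] @ rnn.weight_ih_l0.t() + rnn.bias_ih_l0
...                     + ht @ rnn.weight_hh_l0.t() + rnn.bias_hh_l0)
>>> torch.allclose(ht, hn[0], atol=1e-6)
True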

LSTM

class torch.nn.LSTM(*args, **kwargs)[source]

Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi})
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf})
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg})
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho})
c_t = f_t * c_{(t-1)} + i_t * g_t
h_t = o_t * \tanh(c_t)

where h_t is the hidden state at time t, c_t is the cell state at time t, x_t is the input at time t, h_{(t-1)} is the hidden state of the layer at time t-1 or the initial hidden state at time 0, and i_t, f_t, g_t, o_t are the input, forget, cell, and output gates, respectively. σ is the sigmoid function, and * is the Hadamard product.

In a multilayer LSTM, the input x_t^{(l)} of the l-th layer (l >= 2) is the hidden state h_t^{(l-1)} of the previous layer multiplied by dropout δ_t^{(l-1)}, where each δ_t^{(l-1)} is a Bernoulli random variable which is 0 with probability dropout.

Parameters

  • input_size – The number of expected features in the input x

  • hidden_size – The number of features in the hidden state h

  • num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1

  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True

  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False

  • dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0

  • bidirectional – If True, becomes a bidirectional LSTM. Default: False

Inputs: input, (h_0, c_0)

Outputs: output, (h_n, c_n)

Variables

Note

All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}.

Note

If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.

Examples:

>>> rnn = nn.LSTM(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> c0 = torch.randn(2, 3, 20)
>>> output, (hn, cn) = rnn(input, (h0, c0))
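The shapes follow the same (seq_len, batch, feature) layout as the RNN above. A short sketch (the sizes are illustrative) of how batch_first and bidirectional change them:

>>> lstm = nn.LSTM(10, 20, num_layers=2, bidirectional=True, batch_first=True)
>>> x = torch.randn(3, 5, 10)      # (batch, seq, feature) because batch_first=True
>>> out, (hn, cn) = lstm(x)        # h_0 and c_0 default to zeros
>>> out.shape                      # last dim is num_directions * hidden_size = 40
torch.Size([3, 5, 40])
>>> hn.shape                       # (num_layers * num_directions, batch, hidden_size)
torch.Size([4, 3, 20])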

GRU

class torch.nn.GRU(*args, **kwargs)[source]

Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr})
z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz})
n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)} + b_{hn}))
h_t = (1 - z_t) * n_t + z_t * h_{(t-1)}

where h_t is the hidden state at time t, x_t is the input at time t, h_{(t-1)} is the hidden state of the layer at time t-1 or the initial hidden state at time 0, and r_t, z_t, n_t are the reset, update, and new gates, respectively. σ is the sigmoid function, and * is the Hadamard product.

In a multilayer GRU, the input x_t^{(l)} of the l-th layer (l >= 2) is the hidden state h_t^{(l-1)} of the previous layer multiplied by dropout δ_t^{(l-1)}, where each δ_t^{(l-1)} is a Bernoulli random variable which is 0 with probability dropout.

Parameters

Inputs: input, h_0

Outputs: output, h_n

Shape:

Variables

Note

All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}.

Note

If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.

Examples:

>>> rnn = nn.GRU(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> output, hn = rnn(input, h0)
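As the note above mentions, the input may also be a PackedSequence. A minimal sketch of handling variable-length sequences (the lengths are illustrative, sorted in decreasing order as required here; newer PyTorch can skip the sorting requirement with enforce_sorted=False):

>>> from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
>>> gru = nn.GRU(10, 20, 2)
>>> padded = torch.randn(5, 3, 10)             # (seq_len, batch, input_size)
>>> packed = pack_padded_sequence(padded, torch.tensor([5, 3, 2]))
>>> packed_out, hn = gru(packed)               # output is also a PackedSequence
>>> out, lengths = pad_packed_sequence(packed_out)
>>> out.shape, lengths
(torch.Size([5, 3, 20]), tensor([5, 3, 2]))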

RNNCell

class torch.nn.RNNCell(input_size, hidden_size, bias=True, nonlinearity='tanh')[source]

An Elman RNN cell with tanh or ReLU non-linearity.

h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})

If nonlinearity is 'relu', then ReLU is used in place of tanh.

Parameters

Inputs: input, hidden

Outputs: h’

Shape:

Variables

Note

All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}.

Examples:

>>> rnn = nn.RNNCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
...     hx = rnn(input[i], hx)
...     output.append(hx)

LSTMCell

class torch.nn.LSTMCell(input_size, hidden_size, bias=True)[source]

A long short-term memory (LSTM) cell.

i = \sigma(W_{ii} x + b_{ii} + W_{hi} h + b_{hi})
f = \sigma(W_{if} x + b_{if} + W_{hf} h + b_{hf})
g = \tanh(W_{ig} x + b_{ig} + W_{hg} h + b_{hg})
o = \sigma(W_{io} x + b_{io} + W_{ho} h + b_{ho})
c' = f * c + i * g
h' = o * \tanh(c')

where σ is the sigmoid function, and * is the Hadamard product.

Parameters

Inputs: input, (h_0, c_0)

Outputs: (h_1, c_1)

Variables

Note

All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}.

Examples:

>>> rnn = nn.LSTMCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> cx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
...     hx, cx = rnn(input[i], (hx, cx))
...     output.append(hx)
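The cell's weights concatenate the four gates along dim 0 (PyTorch orders them i, f, g, o), so the equations above can be checked directly against the module. A hedged sketch of one step:

>>> cell = nn.LSTMCell(10, 20)
>>> x, h, c = torch.randn(3, 10), torch.zeros(3, 20), torch.zeros(3, 20)
>>> gates = x @ cell.weight_ih.t() + cell.bias_ih + h @ cell.weight_hh.t() + cell.bias_hh
>>> i, f, g, o = gates.chunk(4, dim=1)
>>> c1 = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
>>> h1 = torch.sigmoid(o) * torch.tanh(c1)
>>> h_ref, c_ref = cell(x, (h, c))
>>> torch.allclose(h1, h_ref, atol=1e-6) and torch.allclose(c1, c_ref, atol=1e-6)
True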

GRUCell

class torch.nn.GRUCell(input_size, hidden_size, bias=True)[source]

A gated recurrent unit (GRU) cell

r = \sigma(W_{ir} x + b_{ir} + W_{hr} h + b_{hr})
z = \sigma(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n = \tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn}))
h' = (1 - z) * n + z * h

where σ is the sigmoid function, and * is the Hadamard product.

Parameters

Inputs: input, hidden

Outputs: h’

Shape:

Variables

Note

All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}.

Examples:

>>> rnn = nn.GRUCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
...     hx = rnn(input[i], hx)
...     output.append(hx)
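As with the LSTM cell, the three gates are concatenated along dim 0 of the weights (ordered r, z, n), and the reset gate multiplies only the hidden-side term W_hn h + b_hn. A hedged sketch verifying one step:

>>> cell = nn.GRUCell(10, 20)
>>> x, h = torch.randn(3, 10), torch.zeros(3, 20)
>>> gi = x @ cell.weight_ih.t() + cell.bias_ih
>>> gh = h @ cell.weight_hh.t() + cell.bias_hh
>>> i_r, i_z, i_n = gi.chunk(3, dim=1)
>>> h_r, h_z, h_n = gh.chunk(3, dim=1)
>>> r, z = torch.sigmoid(i_r + h_r), torch.sigmoid(i_z + h_z)
>>> n = torch.tanh(i_n + r * h_n)    # r gates only the hidden contribution
>>> torch.allclose((1 - z) * n + z * h, cell(x, h), atol=1e-6)
True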

Transformer layers

Transformer

class torch.nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, custom_encoder=None, custom_decoder=None)[source]

A transformer model. Users can modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.

Parameters

Examples:

>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = transformer_model(src, tgt)

forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]

Take in and process masked source/target sequences.

Parameters

Shape:

Note: [src/tgt/memory]_mask should be filled with float('-inf') for the masked positions and float(0.0) otherwise. These masks ensure that predictions for position i depend only on the unmasked positions j, and are applied identically for each sequence in a batch. [src/tgt/memory]_key_padding_mask should be a ByteTensor where True values are positions that will be masked with float('-inf') and False values will be left unchanged. This mask ensures that no information is taken from position i if it is masked, and it has a separate mask for each sequence in a batch.

Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.

where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number

Examples

>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)

generate_square_subsequent_mask(sz)[source]

Generate a square mask for the sequence. The masked positions are filled with float('-inf'). Unmasked positions are filled with float(0.0).
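To make the mask semantics concrete, here is a short sketch (shapes are illustrative; newer PyTorch also accepts bool key-padding masks in place of ByteTensor):

>>> model = nn.Transformer(d_model=512, nhead=8)
>>> src = torch.rand(10, 32, 512)                         # (S, N, E)
>>> tgt = torch.rand(20, 32, 512)                         # (T, N, E)
>>> tgt_mask = model.generate_square_subsequent_mask(20)  # (T, T): -inf above the diagonal
>>> pad = torch.zeros(32, 10, dtype=torch.bool)           # (N, S): True = position is masked
>>> pad[0, 7:] = True                                     # e.g. sequence 0 has 7 real tokens
>>> out = model(src, tgt, tgt_mask=tgt_mask, src_key_padding_mask=pad)
>>> out.shape                                             # same length as tgt, per the note above
torch.Size([20, 32, 512])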

TransformerEncoder

class torch.nn.TransformerEncoder(encoder_layer, num_layers, norm=None)[source]

TransformerEncoder is a stack of N encoder layers

Parameters

Examples:

>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> out = transformer_encoder(torch.rand(10, 32, 512))

forward(src, mask=None, src_key_padding_mask=None)[source]

Pass the input through the encoder layers in turn.

Parameters

Shape:

see the docs in Transformer class.

TransformerDecoder

class torch.nn.TransformerDecoder(decoder_layer, num_layers, norm=None)[source]

TransformerDecoder is a stack of N decoder layers

Parameters

Examples:

>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> out = transformer_decoder(torch.rand(20, 32, 512), torch.rand(10, 32, 512))

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]

Pass the inputs (and mask) through the decoder layer in turn.

Parameters

Shape:

see the docs in Transformer class.

TransformerEncoderLayer

class torch.nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)[source]

TransformerEncoderLayer is made up of self-attn and feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.

Parameters

Examples:

>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> out = encoder_layer(torch.rand(10, 32, 512))

forward(src, src_mask=None, src_key_padding_mask=None)[source]

Pass the input through the encoder layer.

Parameters

Shape:

see the docs in Transformer class.

TransformerDecoderLayer

class torch.nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)[source]

TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.

Parameters

Examples:

>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> out = decoder_layer(torch.rand(20, 32, 512), torch.rand(10, 32, 512))

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]

Pass the inputs (and mask) through the decoder layer.

Parameters

Shape:

see the docs in Transformer class.

Linear layers

Identity

class torch.nn.Identity(*args, **kwargs)[source]

A placeholder identity operator that is argument-insensitive.

Parameters

Examples:

>>> m = nn.Identity(54, unused_argument1=0.1, unused_argument2=False)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 20])
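Since it ignores its arguments, Identity is a convenient drop-in for disabling a layer without touching the surrounding code. A small sketch (the model below is just illustrative):

>>> model = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 3))
>>> model[2] = nn.Identity()    # strip the final head; features now pass through unchanged
>>> model(torch.randn(128, 20)).shape
torch.Size([128, 10])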

Linear

class torch.nn.Linear(in_features, out_features, bias=True)[source]

Applies a linear transformation to the incoming data: y = xA^T + b

Parameters

  • in_features – size of each input sample

  • out_features – size of each output sample

  • bias – If set to False, the layer will not learn an additive bias. Default: True

Shape:

  • Input: (N, *, in_features) where * means any number of additional dimensions

  • Output: (N, *, out_features) where all but the last dimension are the same shape as the input

Variables

  • ~Linear.weight – the learnable weights of the module, of shape (out_features, in_features). The values are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}), where k = \frac{1}{\text{in\_features}}

  • ~Linear.bias – the learnable bias of the module, of shape (out_features). If bias is True, the values are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{in\_features}}
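A minimal usage sketch (the sizes are illustrative), matching the Shape section above:

Examples:

>>> m = nn.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])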

