Recurrent layers
RNN
class torch.nn.RNN(*args, **kwargs)[source]
Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.
For each element in the input sequence, each layer computes the following function:
h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{(t-1)} + b_{hh})
where h_t is the hidden state at time t, x_t is the input at time t, and h_{(t-1)} is the hidden state of the previous layer at time t-1 or the initial hidden state at time 0. If nonlinearity is 'relu', then ReLU is used instead of tanh.
Parameters
Inputs: input, h_0
Outputs: output, h_n
Shape:
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Note
If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.
Examples:
>>> rnn = nn.RNN(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> output, hn = rnn(input, h0)
LSTM
class torch.nn.LSTM(*args, **kwargs)[source]
Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:
\begin{array}{ll}
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\
c_t = f_t * c_{(t-1)} + i_t * g_t \\
h_t = o_t * \tanh(c_t)
\end{array}
where h_t is the hidden state at time t, c_t is the cell state at time t, x_t is the input at time t, h_{(t-1)} is the hidden state of the layer at time t-1 or the initial hidden state at time 0, and i_t, f_t, g_t, o_t are the input, forget, cell, and output gates, respectively. \sigma is the sigmoid function, and * is the Hadamard product.
In a multilayer LSTM, the input x^{(l)}_t of the l-th layer (l >= 2) is the hidden state h^{(l-1)}_t of the previous layer multiplied by dropout \delta^{(l-1)}_t, where each \delta^{(l-1)}_t is a Bernoulli random variable which is 0 with probability dropout.
Parameters
Inputs: input, (h_0, c_0)
Outputs: output, (h_n, c_n)
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Note
If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.
Examples:
>>> rnn = nn.LSTM(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> c0 = torch.randn(2, 3, 20)
>>> output, (hn, cn) = rnn(input, (h0, c0))
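For the two-layer module above, the documented conventions give output of shape (seq_len, batch, num_directions * hidden_size) and h_n, c_n of shape (num_layers * num_directions, batch, hidden_size); a quick check (illustrative):
>>> output.size(), hn.size(), cn.size()
(torch.Size([5, 3, 20]), torch.Size([2, 3, 20]), torch.Size([2, 3, 20]))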
GRU
class torch.nn.GRU(*args, **kwargs)[source]
Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:
\begin{array}{ll}
r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\
z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \\
n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)} + b_{hn})) \\
h_t = (1 - z_t) * n_t + z_t * h_{(t-1)}
\end{array}
where h_t is the hidden state at time t, x_t is the input at time t, h_{(t-1)} is the hidden state of the layer at time t-1 or the initial hidden state at time 0, and r_t, z_t, n_t are the reset, update, and new gates, respectively. \sigma is the sigmoid function, and * is the Hadamard product.
In a multilayer GRU, the input x^{(l)}_t of the l-th layer (l >= 2) is the hidden state h^{(l-1)}_t of the previous layer multiplied by dropout \delta^{(l-1)}_t, where each \delta^{(l-1)}_t is a Bernoulli random variable which is 0 with probability dropout.
Parameters
Inputs: input, h_0
Outputs: output, h_n
Shape:
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Note
If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.
Examples:
>>> rnn = nn.GRU(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> output, hn = rnn(input, h0)
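A GRU with batch_first=True and bidirectional=True, as an illustrative sketch: the hidden state then needs num_layers * num_directions = 4 leading entries, and the output carries 2 * hidden_size features:
>>> birnn = nn.GRU(10, 20, 2, batch_first=True, bidirectional=True)
>>> input = torch.randn(3, 5, 10)   # (batch, seq, feature) because batch_first=True
>>> h0 = torch.randn(4, 3, 20)      # (num_layers * num_directions, batch, hidden_size)
>>> output, hn = birnn(input, h0)
>>> output.size()
torch.Size([3, 5, 40])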
RNNCell
class torch.nn.RNNCell(input_size, hidden_size, bias=True, nonlinearity='tanh')[source]
An Elman RNN cell with tanh or ReLU non-linearity.
h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})
If nonlinearity is 'relu', then ReLU is used in place of tanh.
Parameters
Inputs: input, hidden
Outputs: h’
Shape:
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Examples:
>>> rnn = nn.RNNCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
hx = rnn(input[i], hx)
output.append(hx)
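The per-step hidden states collected above can be stacked into a single tensor; the result has the (seq_len, batch, hidden_size) layout that the full nn.RNN module would produce for its last layer (a small illustrative check):
>>> torch.stack(output).size()
torch.Size([6, 3, 20])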
LSTMCell
class torch.nn.LSTMCell(input_size, hidden_size, bias=True)[source]
A long short-term memory (LSTM) cell.
\begin{array}{ll}
i = \sigma(W_{ii} x + b_{ii} + W_{hi} h + b_{hi}) \\
f = \sigma(W_{if} x + b_{if} + W_{hf} h + b_{hf}) \\
g = \tanh(W_{ig} x + b_{ig} + W_{hg} h + b_{hg}) \\
o = \sigma(W_{io} x + b_{io} + W_{ho} h + b_{ho}) \\
c' = f * c + i * g \\
h' = o * \tanh(c')
\end{array}
where \sigma is the sigmoid function, and * is the Hadamard product.
Parameters
Inputs: input, (h_0, c_0)
Outputs: (h_1, c_1)
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Examples:
>>> rnn = nn.LSTMCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> cx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
hx, cx = rnn(input[i], (hx, cx))
output.append(hx)
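The gate equations can also be replayed by hand from the cell's packed parameters weight_ih, weight_hh, bias_ih, and bias_hh. The sketch below assumes the implementation's packed gate ordering (i, f, g, o) and is for illustration only:
>>> cell = nn.LSTMCell(10, 20)
>>> x = torch.randn(3, 10)
>>> h, c = torch.randn(3, 20), torch.randn(3, 20)
>>> gates = x @ cell.weight_ih.t() + cell.bias_ih + h @ cell.weight_hh.t() + cell.bias_hh
>>> i, f, g, o = gates.chunk(4, dim=1)
>>> c1 = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
>>> h1 = torch.sigmoid(o) * torch.tanh(c1)
>>> h_ref, c_ref = cell(x, (h, c))
>>> torch.allclose(h1, h_ref, atol=1e-6) and torch.allclose(c1, c_ref, atol=1e-6)
True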
GRUCell
class torch.nn.GRUCell(input_size, hidden_size, bias=True)[source]
A gated recurrent unit (GRU) cell.
\begin{array}{ll}
r = \sigma(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) \\
z = \sigma(W_{iz} x + b_{iz} + W_{hz} h + b_{hz}) \\
n = \tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn})) \\
h' = (1 - z) * n + z * h
\end{array}
where \sigma is the sigmoid function, and * is the Hadamard product.
Parameters
Inputs: input, hidden
Outputs: h’
Shape:
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Examples:
>>> rnn = nn.GRUCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
hx = rnn(input[i], hx)
output.append(hx)
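Likewise, one GRU step can be replayed from the packed parameters, assuming the implementation's gate ordering (r, z, n); note that the reset gate multiplies the hidden projection W_hn h + b_hn inside the tanh, as in the formula above. A sketch:
>>> cell = nn.GRUCell(10, 20)
>>> x, h = torch.randn(3, 10), torch.randn(3, 20)
>>> gi = x @ cell.weight_ih.t() + cell.bias_ih
>>> gh = h @ cell.weight_hh.t() + cell.bias_hh
>>> i_r, i_z, i_n = gi.chunk(3, dim=1)
>>> h_r, h_z, h_n = gh.chunk(3, dim=1)
>>> r = torch.sigmoid(i_r + h_r)
>>> z = torch.sigmoid(i_z + h_z)
>>> n = torch.tanh(i_n + r * h_n)
>>> torch.allclose((1 - z) * n + z * h, cell(x, h), atol=1e-6)
True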
Transformer layers
Transformer
class torch.nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, custom_encoder=None, custom_decoder=None)[source]
A transformer model. Users can modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.
Parameters
Examples::
>>> transformer_model = nn.Transformer()
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]
Take in and process masked source/target sequences.
Parameters
Shape:
Note: [src/tgt/memory]_mask should be filled with float('-inf') for the masked positions and float(0.0) otherwise. These masks ensure that predictions for position i depend only on the unmasked positions j and are applied identically for each sequence in a batch. [src/tgt/memory]_key_padding_mask should be a ByteTensor where True values are positions that should be masked with float('-inf') and False values will be unchanged. This mask ensures that no information will be taken from position i if it is masked, and has a separate mask for each sequence in a batch.
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number
Examples
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
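A complete call, as a sketch with illustrative sizes (S=10, T=20, N=32, and the default d_model=512; transformer_model as constructed above):
>>> src = torch.rand(10, 32, 512)   # (S, N, E)
>>> tgt = torch.rand(20, 32, 512)   # (T, N, E)
>>> tgt_mask = transformer_model.generate_square_subsequent_mask(20)
>>> out = transformer_model(src, tgt, tgt_mask=tgt_mask)   # (T, N, E)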
generate_square_subsequent_mask(sz)[source]
Generate a square mask for the sequence. The masked positions are filled with float('-inf'). Unmasked positions are filled with float(0.0).
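For example, for sz=4 (an illustrative sketch; position i may attend only to positions <= i):
>>> transformer_model.generate_square_subsequent_mask(4)
tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])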
TransformerEncoder
class torch.nn.TransformerEncoder(encoder_layer, num_layers, norm=None)[source]
TransformerEncoder is a stack of N encoder layers.
Parameters
Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
forward(src, mask=None, src_key_padding_mask=None)[source]
Pass the input through the encoder layers in turn.
Parameters
Shape:
see the docs in Transformer class.
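A concrete end-to-end sketch (illustrative sizes):
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.rand(10, 32, 512)   # (S, N, E)
>>> out = transformer_encoder(src)  # (S, N, E)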
TransformerDecoder
class torch.nn.TransformerDecoder(decoder_layer, num_layers, norm=None)[source]
TransformerDecoder is a stack of N decoder layers.
Parameters
Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model, nhead)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers)
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]
Pass the inputs (and masks) through the decoder layers in turn.
Parameters
Shape:
see the docs in Transformer class.
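A concrete end-to-end sketch (illustrative sizes; memory stands in for the encoder output):
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)   # (S, N, E)
>>> tgt = torch.rand(20, 32, 512)      # (T, N, E)
>>> out = transformer_decoder(tgt, memory)   # (T, N, E)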
TransformerEncoderLayer
class torch.nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)[source]
TransformerEncoderLayer is made up of self-attention and a feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement it in a different way during application.
Parameters
Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
forward(src, src_mask=None, src_key_padding_mask=None)[source]
Pass the input through the encoder layer.
Parameters
Shape:
see the docs in Transformer class.
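A single layer used on its own, as a sketch with illustrative sizes:
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = torch.rand(10, 32, 512)
>>> out = encoder_layer(src)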
TransformerDecoderLayer
class torch.nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)[source]
TransformerDecoderLayer is made up of self-attention, multi-head attention and a feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement it in a different way during application.
Parameters
Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model, nhead)
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]
Pass the inputs (and mask) through the decoder layer.
Parameters
Shape:
see the docs in Transformer class.
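Used on its own (a sketch with illustrative sizes; memory stands in for the encoder output):
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = decoder_layer(tgt, memory)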
Linear layers
Identity
class torch.nn.Identity(*args, **kwargs)[source]
A placeholder identity operator that is argument-insensitive.
Parameters
Examples:
>>> m = nn.Identity(54, unused_argument1=0.1, unused_argument2=False)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 20])
Linear
class torch.nn.Linear(in_features, out_features, bias=True)[source]
Applies a linear transformation to the incoming data: y = xA^T + b
Parameters
Shape:
Variables
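For example (a minimal usage sketch):
>>> m = nn.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])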
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
min_val – minimum value of the linear region range. Default: -1
-
max_val – maximum value of the linear region range. Default: 1
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
negative_slope – Controls the angle of the negative slope. Default: 1e-2
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
embed_dim – total dimension of the model.
-
num_heads – parallel attention heads.
-
dropout – a Dropout layer on attn_output_weights. Default: 0.0.
-
bias – add bias as module parameter. Default: True.
-
add_bias_kv – add bias to the key and value sequences at dim=0.
-
add_zero_attn – add a new batch of zeros to the key and value sequences at dim=1.
-
kdim – total number of features in key. Default: None.
-
vdim – total number of features in value. Default: None.
-
Note – if kdim and vdim are None, they will be set to embed_dim so that query, key, and value have the same number of features.
-
query, key, value – map a query and a set of key-value pairs to an output. See “Attention Is All You Need” for more details.
-
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. This is a binary mask. When the value is True, the corresponding value on the attention layer will be filled with -inf.
-
need_weights – output attn_output_weights.
-
attn_mask – mask that prevents attention to certain positions. This is an additive mask (i.e. the values will be added to the attention layer).
-
Inputs:
-
query: (L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension.
-
key: (S, N, E), where S is the source sequence length, N is the batch size, E is the embedding dimension.
-
value: (S, N, E) where S is the source sequence length, N is the batch size, E is the embedding dimension.
-
key_padding_mask: (N, S), ByteTensor, where N is the batch size, S is the source sequence length.
-
attn_mask: (L, S) where L is the target sequence length, S is the source sequence length.
-
Outputs:
-
attn_output: (L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension.
-
attn_output_weights: (N, L, S) where N is the batch size, L is the target sequence length, S is the source sequence length.
-
num_parameters (int) – number of a to learn. Although it takes an int as input, only two values are legitimate: 1, or the number of channels of the input. Default: 1
-
init (float) – the initial value of a. Default: 0.25
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
lower – lower bound of the uniform distribution. Default: 1/8
-
upper – upper bound of the uniform distribution. Default: 1/3
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
alpha – the α value for the CELU formulation. Default: 1.0
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
beta – the β value for the Softplus formulation. Default: 1
-
threshold – values above this revert to a linear function. Default: 20
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
threshold – The value to threshold at
-
value – The value to replace with
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (∗)(*)(∗) where * means, any number of additional dimensions
-
Output: (∗)(*)(∗) , same shape as the input
-
Input: (∗)(*)(∗) where * means, any number of additional dimensions
-
Output: (∗)(*)(∗) , same shape as the input
-
Input: (N,C,H,W)(N, C, H, W)(N,C,H,W)
-
Output: (N,C,H,W)(N, C, H, W)(N,C,H,W) (same shape as input)
-
Input: (∗)(*)(∗) where * means, any number of additional dimensions
-
Output: (∗)(*)(∗) , same shape as the input
-
cutoffs should be an ordered Sequence of integers sorted in increasing order. It controls the number of clusters and the partitioning of targets into clusters. For example, setting cutoffs = [10, 100, 1000] means that the first 10 targets will be assigned to the ‘head’ of the adaptive softmax, targets 11, 12, …, 100 will be assigned to the first cluster, and targets 101, 102, …, 1000 will be assigned to the second cluster, while targets 1001, 1002, …, n_classes - 1 will be assigned to the last, third cluster.
-
div_value is used to compute the size of each additional cluster, which is given as \left\lfloor\frac{\text{in\_features}}{\text{div\_value}^{idx}}\right\rfloor, where idx is the cluster index (with clusters for less frequent words having larger indices, and indices starting from 1).
-
head_bias if set to True, adds a bias term to the ‘head’ of the adaptive softmax. See paper for details. Set to False in the official implementation.
-
in_features (int) – Number of features in the input tensor
-
n_classes (int) – Number of classes in the dataset
-
cutoffs (Sequence) – Cutoffs used to assign targets to their buckets
-
div_value (float, optional) – value used as an exponent to compute sizes of the clusters. Default: 4.0
-
head_bias (bool, optional) – If
True
, adds a bias term to the ‘head’ of the adaptive softmax. Default:False
-
output is a Tensor of size
N
containing computed target log probabilities for each example -
loss is a Scalar representing the computed negative log likelihood loss
-
input: (N, in_features)
-
target: (N) where each value satisfies 0 <= target[i] <= n_classes
-
output1: (N)
-
output2: Scalar
-
Input: (N, in_features)
-
Output: (N, n_classes)
-
Input: (N, in_features)
-
Output: (N)
-
num_features – C from an expected input of size (N, C, L) or L from input of size (N, L)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Can be set to
None
for cumulative moving average (i.e. simple average). Default: 0.1 -
affine – a boolean value that when set to
True
, this module has learnable affine parameters. Default:True
-
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:True
-
Input: (N,C)(N, C)(N,C) or (N,C,L)(N, C, L)(N,C,L)
-
Output: (N,C)(N, C)(N,C) or (N,C,L)(N, C, L)(N,C,L) (same shape as input)
-
num_features – C from an expected input of size (N, C, H, W)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Can be set to
None
for cumulative moving average (i.e. simple average). Default: 0.1 -
affine – a boolean value that when set to
True
, this module has learnable affine parameters. Default:True
-
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:True
-
Input: (N,C,H,W)(N, C, H, W)(N,C,H,W)
-
Output: (N,C,H,W)(N, C, H, W)(N,C,H,W) (same shape as input)
-
num_features – C from an expected input of size (N, C, D, H, W)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Can be set to
None
for cumulative moving average (i.e. simple average). Default: 0.1 -
affine – a boolean value that when set to
True
, this module has learnable affine parameters. Default:True
-
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:True
-
Input: (N,C,D,H,W)(N, C, D, H, W)(N,C,D,H,W)
-
Output: (N,C,D,H,W)(N, C, D, H, W)(N,C,D,H,W) (same shape as input)
-
num_groups (int) – number of groups to separate the channels into
-
num_channels (int) – number of channels expected in input
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
affine – a boolean value that when set to
True
, this module has learnable per-channel affine parameters initialized to ones (for weights) and zeros (for biases). Default:True
. -
Input: (N,C,∗)(N, C, *)(N,C,∗) where C=num_channelsC=\text{num\_channels}C=num_channels
-
Output: (N,C,∗)(N, C, *)(N,C,∗) (same shape as input)
-
num_features – C from an expected input of size (N, C, +)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Can be set to
None
for cumulative moving average (i.e. simple average). Default: 0.1 -
affine – a boolean value that when set to
True
, this module has learnable affine parameters. Default:True
-
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:True
-
process_group – synchronization of stats happen within each process group individually. Default behavior is synchronization across the whole world
-
Input: (N,C,+)(N, C, +)(N,C,+)
-
Output: (N,C,+)(N, C, +)(N,C,+) (same shape as input)
-
module (nn.Module) – containing module
-
process_group (optional) – process group to scope synchronization,
-
num_features – C from an expected input of size (N, C, L) or L from input of size (N, L)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Default: 0.1
-
affine – a boolean value that when set to
True
, this module has learnable affine parameters, initialized the same way as done for batch normalization. Default:False
. -
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:False
-
Input: (N,C,L)(N, C, L)(N,C,L)
-
Output: (N,C,L)(N, C, L)(N,C,L) (same shape as input)
-
num_features – C from an expected input of size (N, C, H, W)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Default: 0.1
-
affine – a boolean value that when set to
True
, this module has learnable affine parameters, initialized the same way as done for batch normalization. Default:False
. -
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:False
-
Input: (N,C,H,W)(N, C, H, W)(N,C,H,W)
-
Output: (N,C,H,W)(N, C, H, W)(N,C,H,W) (same shape as input)
-
num_features – C from an expected input of size (N, C, D, H, W)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Default: 0.1
-
affine – a boolean value that when set to
True
, this module has learnable affine parameters, initialized the same way as done for batch normalization. Default:False
. -
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:False
-
Input: (N,C,D,H,W)(N, C, D, H, W)(N,C,D,H,W)
-
Output: (N,C,D,H,W)(N, C, D, H, W)(N,C,D,H,W) (same shape as input)
-
normalized_shape (int or list or torch.Size) –
input shape from an expected input of size
[* \times \text{normalized\_shape}[0] \times \text{normalized\_shape}[1] \times \ldots \times \text{normalized\_shape}[-1]]
If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size.
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
elementwise_affine – a boolean value that when set to
True
, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases). Default:True
. -
Input: (N,∗)(N, *)(N,∗)
-
Output: (N,∗)(N, *)(N,∗) (same shape as input)
-
size – amount of neighbouring channels used for normalization
-
alpha – multiplicative factor. Default: 0.0001
-
beta – exponent. Default: 0.75
-
k – additive factor. Default: 1
-
Input: (N,C,∗)(N, C, *)(N,C,∗)
-
Output: (N,C,∗)(N, C, *)(N,C,∗) (same shape as input)
-
input_size – The number of expected features in the input x
-
hidden_size – The number of features in the hidden state h
-
num_layers – Number of recurrent layers. E.g., setting
num_layers=2
would mean stacking two RNNs together to form a stacked RNN, with the second RNN taking in outputs of the first RNN and computing the final results. Default: 1 -
nonlinearity – The non-linearity to use. Can be either
'tanh'
or'relu'
. Default:'tanh'
-
bias – If
False
, then the layer does not use bias weights b_ih and b_hh. Default:True
-
batch_first – If
True
, then the input and output tensors are provided as (batch, seq, feature). Default:False
-
dropout – If non-zero, introduces a Dropout layer on the outputs of each RNN layer except the last layer, with dropout probability equal to
dropout
. Default: 0 -
bidirectional – If
True
, becomes a bidirectional RNN. Default:False
-
input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.
-
h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.
-
output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
For the unpacked case, the directions can be separated using
output.view(seq_len, batch, num_directions, hidden_size)
, with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case. -
h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len.
Like output, the layers can be separated using
h_n.view(num_layers, num_directions, batch, hidden_size)
. -
Input1: (L, N, H_in) tensor containing input features, where H_in = input_size and L represents a sequence length.
-
Input2: (S, N, H_out) tensor containing the initial hidden state for each element in the batch, where S = num_layers * num_directions and H_out = hidden_size. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.
-
Output1: (L, N, H_all) where H_all = num_directions * hidden_size
-
Output2: (S, N, H_out) tensor containing the next hidden state for each element in the batch
-
weight_ih_l[k] – the learnable input-hidden weights of the k-th layer, of shape (hidden_size, input_size) for k = 0. Otherwise, the shape is (hidden_size, num_directions * hidden_size)
-
weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer, of shape (hidden_size, hidden_size)
-
bias_ih_l[k] – the learnable input-hidden bias of the k-th layer, of shape (hidden_size)
-
bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer, of shape (hidden_size)
-
input_size – The number of expected features in the input x
-
hidden_size – The number of features in the hidden state h
-
num_layers – Number of recurrent layers. E.g., setting
num_layers=2
would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1 -
bias – If
False
, then the layer does not use bias weights b_ih and b_hh. Default:True
-
batch_first – If
True
, then the input and output tensors are provided as (batch, seq, feature). Default:False