Recurrent layers
RNN
class torch.nn.RNN(*args, **kwargs)[source]
Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.
For each element in the input sequence, each layer computes the following function:
h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{(t-1)} + b_{hh})
where h_t is the hidden state at time t, x_t is the input at time t, and h_{(t-1)} is the hidden state of the previous layer at time t-1 or the initial hidden state at time 0. If nonlinearity is 'relu', then ReLU is used instead of tanh.
Parameters
Inputs: input, h_0
Outputs: output, h_n
Shape:
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Note
If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.
Examples:
>>> rnn = nn.RNN(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> output, hn = rnn(input, h0)
LSTM
class torch.nn.LSTM(*args, **kwargs)[source]
Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:
\begin{array}{ll}
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\
c_t = f_t * c_{(t-1)} + i_t * g_t \\
h_t = o_t * \tanh(c_t)
\end{array}
where h_t is the hidden state at time t, c_t is the cell state at time t, x_t is the input at time t, h_{(t-1)} is the hidden state of the layer at time t-1 or the initial hidden state at time 0, and i_t, f_t, g_t, o_t are the input, forget, cell, and output gates, respectively. \sigma is the sigmoid function, and * is the Hadamard product.
In a multilayer LSTM, the input x^{(l)}_t of the l-th layer (l >= 2) is the hidden state h^{(l-1)}_t of the previous layer multiplied by dropout \delta^{(l-1)}_t, where each \delta^{(l-1)}_t is a Bernoulli random variable which is 0 with probability dropout.
Parameters
Inputs: input, (h_0, c_0)
Outputs: output, (h_n, c_n)
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Note
If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.
Examples:
>>> rnn = nn.LSTM(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> c0 = torch.randn(2, 3, 20)
>>> output, (hn, cn) = rnn(input, (h0, c0))
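For the two-layer module above, the documented conventions give output of shape (seq_len, batch, num_directions * hidden_size) and h_n, c_n of shape (num_layers * num_directions, batch, hidden_size); a quick check (illustrative):
>>> output.size(), hn.size(), cn.size()
(torch.Size([5, 3, 20]), torch.Size([2, 3, 20]), torch.Size([2, 3, 20]))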
GRU
class torch.nn.GRU(*args, **kwargs)[source]
Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:
\begin{array}{ll}
r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\
z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \\
n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)} + b_{hn})) \\
h_t = (1 - z_t) * n_t + z_t * h_{(t-1)}
\end{array}
where h_t is the hidden state at time t, x_t is the input at time t, h_{(t-1)} is the hidden state of the layer at time t-1 or the initial hidden state at time 0, and r_t, z_t, n_t are the reset, update, and new gates, respectively. \sigma is the sigmoid function, and * is the Hadamard product.
In a multilayer GRU, the input x^{(l)}_t of the l-th layer (l >= 2) is the hidden state h^{(l-1)}_t of the previous layer multiplied by dropout \delta^{(l-1)}_t, where each \delta^{(l-1)}_t is a Bernoulli random variable which is 0 with probability dropout.
Parameters
Inputs: input, h_0
Outputs: output, h_n
Shape:
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Note
If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU, 3) input data has dtype torch.float16, 4) a V100 GPU is used, 5) input data is not in PackedSequence format, then the persistent algorithm can be selected to improve performance.
Examples:
>>> rnn = nn.GRU(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> output, hn = rnn(input, h0)
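A GRU with batch_first=True and bidirectional=True, as an illustrative sketch: the hidden state then needs num_layers * num_directions = 4 leading entries, and the output carries 2 * hidden_size features:
>>> birnn = nn.GRU(10, 20, 2, batch_first=True, bidirectional=True)
>>> input = torch.randn(3, 5, 10)   # (batch, seq, feature) because batch_first=True
>>> h0 = torch.randn(4, 3, 20)      # (num_layers * num_directions, batch, hidden_size)
>>> output, hn = birnn(input, h0)
>>> output.size()
torch.Size([3, 5, 40])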
RNNCell
class torch.nn.RNNCell(input_size, hidden_size, bias=True, nonlinearity='tanh')[source]
An Elman RNN cell with tanh or ReLU non-linearity.
h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})
If nonlinearity is 'relu', then ReLU is used in place of tanh.
Parameters
Inputs: input, hidden
Outputs: h’
Shape:
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Examples:
>>> rnn = nn.RNNCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
hx = rnn(input[i], hx)
output.append(hx)
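The per-step hidden states collected above can be stacked into a single tensor; the result has the (seq_len, batch, hidden_size) layout that the full nn.RNN module would produce for its last layer (a small illustrative check):
>>> torch.stack(output).size()
torch.Size([6, 3, 20])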
LSTMCell
class torch.nn.LSTMCell(input_size, hidden_size, bias=True)[source]
A long short-term memory (LSTM) cell.
\begin{array}{ll}
i = \sigma(W_{ii} x + b_{ii} + W_{hi} h + b_{hi}) \\
f = \sigma(W_{if} x + b_{if} + W_{hf} h + b_{hf}) \\
g = \tanh(W_{ig} x + b_{ig} + W_{hg} h + b_{hg}) \\
o = \sigma(W_{io} x + b_{io} + W_{ho} h + b_{ho}) \\
c' = f * c + i * g \\
h' = o * \tanh(c')
\end{array}
where \sigma is the sigmoid function, and * is the Hadamard product.
Parameters
Inputs: input, (h_0, c_0)
Outputs: (h_1, c_1)
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Examples:
>>> rnn = nn.LSTMCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> cx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
hx, cx = rnn(input[i], (hx, cx))
output.append(hx)
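The gate equations can also be replayed by hand from the cell's packed parameters weight_ih, weight_hh, bias_ih, and bias_hh. The sketch below assumes the implementation's packed gate ordering (i, f, g, o) and is for illustration only:
>>> cell = nn.LSTMCell(10, 20)
>>> x = torch.randn(3, 10)
>>> h, c = torch.randn(3, 20), torch.randn(3, 20)
>>> gates = x @ cell.weight_ih.t() + cell.bias_ih + h @ cell.weight_hh.t() + cell.bias_hh
>>> i, f, g, o = gates.chunk(4, dim=1)
>>> c1 = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
>>> h1 = torch.sigmoid(o) * torch.tanh(c1)
>>> h_ref, c_ref = cell(x, (h, c))
>>> torch.allclose(h1, h_ref, atol=1e-6) and torch.allclose(c1, c_ref, atol=1e-6)
True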
GRUCell
class torch.nn.GRUCell(input_size, hidden_size, bias=True)[source]
A gated recurrent unit (GRU) cell.
\begin{array}{ll}
r = \sigma(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) \\
z = \sigma(W_{iz} x + b_{iz} + W_{hz} h + b_{hz}) \\
n = \tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn})) \\
h' = (1 - z) * n + z * h
\end{array}
where \sigma is the sigmoid function, and * is the Hadamard product.
Parameters
Inputs: input, hidden
Outputs: h’
Shape:
Variables
Note
All the weights and biases are initialized from \mathcal{U}(-\sqrt{k}, \sqrt{k}) where k = \frac{1}{\text{hidden\_size}}
Examples:
>>> rnn = nn.GRUCell(10, 20)
>>> input = torch.randn(6, 3, 10)
>>> hx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
hx = rnn(input[i], hx)
output.append(hx)
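Likewise, one GRU step can be replayed from the packed parameters, assuming the implementation's gate ordering (r, z, n); note that the reset gate multiplies the hidden projection W_hn h + b_hn inside the tanh, as in the formula above. A sketch:
>>> cell = nn.GRUCell(10, 20)
>>> x, h = torch.randn(3, 10), torch.randn(3, 20)
>>> gi = x @ cell.weight_ih.t() + cell.bias_ih
>>> gh = h @ cell.weight_hh.t() + cell.bias_hh
>>> i_r, i_z, i_n = gi.chunk(3, dim=1)
>>> h_r, h_z, h_n = gh.chunk(3, dim=1)
>>> r = torch.sigmoid(i_r + h_r)
>>> z = torch.sigmoid(i_z + h_z)
>>> n = torch.tanh(i_n + r * h_n)
>>> torch.allclose((1 - z) * n + z * h, cell(x, h), atol=1e-6)
True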
Transformer layers
Transformer
class torch.nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, custom_encoder=None, custom_decoder=None)[source]
A transformer model. Users can modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.
Parameters
Examples::
>>> transformer_model = nn.Transformer()
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]
Take in and process masked source/target sequences.
Parameters
Shape:
Note: [src/tgt/memory]_mask should be filled with float('-inf') for the masked positions and float(0.0) otherwise. These masks ensure that predictions for position i depend only on the unmasked positions j and are applied identically for each sequence in a batch. [src/tgt/memory]_key_padding_mask should be a ByteTensor where True values are positions that should be masked with float('-inf') and False values will be unchanged. This mask ensures that no information will be taken from position i if it is masked, and has a separate mask for each sequence in a batch.
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number
Examples
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
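A complete call, as a sketch with illustrative sizes (S=10, T=20, N=32, and the default d_model=512; transformer_model as constructed above):
>>> src = torch.rand(10, 32, 512)   # (S, N, E)
>>> tgt = torch.rand(20, 32, 512)   # (T, N, E)
>>> tgt_mask = transformer_model.generate_square_subsequent_mask(20)
>>> out = transformer_model(src, tgt, tgt_mask=tgt_mask)   # (T, N, E)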
generate_square_subsequent_mask(sz)[source]
Generate a square mask for the sequence. The masked positions are filled with float('-inf'). Unmasked positions are filled with float(0.0).
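For example, for sz=4 (an illustrative sketch; position i may attend only to positions <= i):
>>> transformer_model.generate_square_subsequent_mask(4)
tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])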
TransformerEncoder
class torch.nn.TransformerEncoder(encoder_layer, num_layers, norm=None)[source]
TransformerEncoder is a stack of N encoder layers.
Parameters
Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
forward(src, mask=None, src_key_padding_mask=None)[source]
Pass the input through the encoder layers in turn.
Parameters
Shape:
see the docs in Transformer class.
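A concrete end-to-end sketch (illustrative sizes):
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.rand(10, 32, 512)   # (S, N, E)
>>> out = transformer_encoder(src)  # (S, N, E)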
TransformerDecoder
class torch.nn.TransformerDecoder(decoder_layer, num_layers, norm=None)[source]
TransformerDecoder is a stack of N decoder layers.
Parameters
Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model, nhead)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers)
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]
Pass the inputs (and masks) through the decoder layers in turn.
Parameters
Shape:
see the docs in Transformer class.
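A concrete end-to-end sketch (illustrative sizes; memory stands in for the encoder output):
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)   # (S, N, E)
>>> tgt = torch.rand(20, 32, 512)      # (T, N, E)
>>> out = transformer_decoder(tgt, memory)   # (T, N, E)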
TransformerEncoderLayer
class torch.nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)[source]
TransformerEncoderLayer is made up of self-attention and a feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement it in a different way during application.
Parameters
Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
forward(src, src_mask=None, src_key_padding_mask=None)[source]
Pass the input through the encoder layer.
Parameters
Shape:
see the docs in Transformer class.
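A single layer used on its own, as a sketch with illustrative sizes:
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = torch.rand(10, 32, 512)
>>> out = encoder_layer(src)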
TransformerDecoderLayer
class torch.nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1)[source]
TransformerDecoderLayer is made up of self-attention, multi-head attention and a feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement it in a different way during application.
Parameters
Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model, nhead)
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]
Pass the inputs (and mask) through the decoder layer.
Parameters
Shape:
see the docs in Transformer class.
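Used on its own (a sketch with illustrative sizes; memory stands in for the encoder output):
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = decoder_layer(tgt, memory)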
Linear layers
Identity
class torch.nn.Identity(*args, **kwargs)[source]
A placeholder identity operator that is argument-insensitive.
Parameters
Examples:
>>> m = nn.Identity(54, unused_argument1=0.1, unused_argument2=False)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 20])
Linear
class torch.nn.Linear(in_features, out_features, bias=True)[source]
Applies a linear transformation to the incoming data: y = xA^T + b
Parameters
Shape:
Variables
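For example (a minimal usage sketch):
>>> m = nn.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])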
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
min_val – minimum value of the linear region range. Default: -1
-
max_val – maximum value of the linear region range. Default: 1
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
negative_slope – Controls the angle of the negative slope. Default: 1e-2
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
embed_dim – total dimension of the model.
-
num_heads – parallel attention heads.
-
dropout – a Dropout layer on attn_output_weights. Default: 0.0.
-
bias – add bias as module parameter. Default: True.
-
add_bias_kv – add bias to the key and value sequences at dim=0.
-
add_zero_attn – add a new batch of zeros to the key and value sequences at dim=1.
-
kdim – total number of features in key. Default: None.
-
vdim – total number of features in value. Default: None.
-
Note – if kdim and vdim are None, they will be set to embed_dim so that query, key, and value have the same number of features.
-
query, key, value – map a query and a set of key-value pairs to an output. See “Attention Is All You Need” for more details.
-
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. This is a binary mask. When the value is True, the corresponding value on the attention layer will be filled with -inf.
-
need_weights – output attn_output_weights.
-
attn_mask – mask that prevents attention to certain positions. This is an additive mask (i.e. the values will be added to the attention layer).
-
Inputs:
-
query: (L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension.
-
key: (S, N, E), where S is the source sequence length, N is the batch size, E is the embedding dimension.
-
value: (S, N, E) where S is the source sequence length, N is the batch size, E is the embedding dimension.
-
key_padding_mask: (N, S), ByteTensor, where N is the batch size, S is the source sequence length.
-
attn_mask: (L, S) where L is the target sequence length, S is the source sequence length.
-
Outputs:
-
attn_output: (L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension.
-
attn_output_weights: (N, L, S) where N is the batch size, L is the target sequence length, S is the source sequence length.
-
num_parameters (int) – number of a to learn. Although it takes an int as input, only two values are legitimate: 1, or the number of channels of the input. Default: 1
-
init (float) – the initial value of a. Default: 0.25
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
lower – lower bound of the uniform distribution. Default: 1/8
-
upper – upper bound of the uniform distribution. Default: 1/3
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
alpha – the α value for the CELU formulation. Default: 1.0
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
beta – the β value for the Softplus formulation. Default: 1
-
threshold – values above this revert to a linear function. Default: 20
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
threshold – The value to threshold at
-
value – The value to replace with
-
inplace – can optionally do the operation in-place. Default:
False
-
Input: (N,∗)(N, *)(N,∗) where * means, any number of additional dimensions
-
Output: (N,∗)(N, *)(N,∗) , same shape as the input
-
Input: (∗)(*)(∗) where * means, any number of additional dimensions
-
Output: (∗)(*)(∗) , same shape as the input
-
Input: (∗)(*)(∗) where * means, any number of additional dimensions
-
Output: (∗)(*)(∗) , same shape as the input
-
Input: (N,C,H,W)(N, C, H, W)(N,C,H,W)
-
Output: (N,C,H,W)(N, C, H, W)(N,C,H,W) (same shape as input)
-
Input: (∗)(*)(∗) where * means, any number of additional dimensions
-
Output: (∗)(*)(∗) , same shape as the input
-
cutoffs should be an ordered Sequence of integers sorted in increasing order. It controls the number of clusters and the partitioning of targets into clusters. For example, setting cutoffs = [10, 100, 1000] means that the first 10 targets will be assigned to the ‘head’ of the adaptive softmax, targets 11, 12, …, 100 will be assigned to the first cluster, and targets 101, 102, …, 1000 will be assigned to the second cluster, while targets 1001, 1002, …, n_classes - 1 will be assigned to the last, third cluster.
-
div_value is used to compute the size of each additional cluster, which is given as \left\lfloor\frac{\text{in\_features}}{\text{div\_value}^{idx}}\right\rfloor, where idx is the cluster index (with clusters for less frequent words having larger indices, and indices starting from 1).
-
head_bias if set to True, adds a bias term to the ‘head’ of the adaptive softmax. See paper for details. Set to False in the official implementation.
-
in_features (int) – Number of features in the input tensor
-
n_classes (int) – Number of classes in the dataset
-
cutoffs (Sequence) – Cutoffs used to assign targets to their buckets
-
div_value (float, optional) – value used as an exponent to compute sizes of the clusters. Default: 4.0
-
head_bias (bool, optional) – If
True
, adds a bias term to the ‘head’ of the adaptive softmax. Default:False
-
output is a Tensor of size
N
containing computed target log probabilities for each example -
loss is a Scalar representing the computed negative log likelihood loss
-
input: (N, in_features)
-
target: (N) where each value satisfies 0 <= target[i] <= n_classes
-
output1: (N)
-
output2: Scalar
-
Input: (N, in_features)
-
Output: (N, n_classes)
-
Input: (N, in_features)
-
Output: (N)
-
num_features – C from an expected input of size (N, C, L) or L from input of size (N, L)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Can be set to
None
for cumulative moving average (i.e. simple average). Default: 0.1 -
affine – a boolean value that when set to
True
, this module has learnable affine parameters. Default:True
-
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:True
-
Input: (N,C)(N, C)(N,C) or (N,C,L)(N, C, L)(N,C,L)
-
Output: (N,C)(N, C)(N,C) or (N,C,L)(N, C, L)(N,C,L) (same shape as input)
-
num_features – C from an expected input of size (N, C, H, W)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Can be set to
None
for cumulative moving average (i.e. simple average). Default: 0.1 -
affine – a boolean value that when set to
True
, this module has learnable affine parameters. Default:True
-
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:True
-
Input: (N,C,H,W)(N, C, H, W)(N,C,H,W)
-
Output: (N,C,H,W)(N, C, H, W)(N,C,H,W) (same shape as input)
-
num_features – C from an expected input of size (N, C, D, H, W)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Can be set to
None
for cumulative moving average (i.e. simple average). Default: 0.1 -
affine – a boolean value that when set to
True
, this module has learnable affine parameters. Default:True
-
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:True
-
Input: (N,C,D,H,W)(N, C, D, H, W)(N,C,D,H,W)
-
Output: (N,C,D,H,W)(N, C, D, H, W)(N,C,D,H,W) (same shape as input)
-
num_groups (int) – number of groups to separate the channels into
-
num_channels (int) – number of channels expected in input
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
affine – a boolean value that when set to
True
, this module has learnable per-channel affine parameters initialized to ones (for weights) and zeros (for biases). Default:True
. -
Input: (N,C,∗)(N, C, *)(N,C,∗) where C=num_channelsC=\text{num\_channels}C=num_channels
-
Output: (N,C,∗)(N, C, *)(N,C,∗) (same shape as input)
-
num_features – C from an expected input of size (N, C, +)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Can be set to
None
for cumulative moving average (i.e. simple average). Default: 0.1 -
affine – a boolean value that when set to
True
, this module has learnable affine parameters. Default:True
-
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:True
-
process_group – synchronization of stats happen within each process group individually. Default behavior is synchronization across the whole world
-
Input: (N,C,+)(N, C, +)(N,C,+)
-
Output: (N,C,+)(N, C, +)(N,C,+) (same shape as input)
-
module (nn.Module) – containing module
-
process_group (optional) – process group to scope synchronization,
-
num_features – C from an expected input of size (N, C, L) or L from input of size (N, L)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Default: 0.1
-
affine – a boolean value that when set to
True
, this module has learnable affine parameters, initialized the same way as done for batch normalization. Default:False
. -
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:False
-
Input: (N,C,L)(N, C, L)(N,C,L)
-
Output: (N,C,L)(N, C, L)(N,C,L) (same shape as input)
-
num_features – C from an expected input of size (N, C, H, W)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Default: 0.1
-
affine – a boolean value that when set to
True
, this module has learnable affine parameters, initialized the same way as done for batch normalization. Default:False
. -
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:False
-
Input: (N,C,H,W)(N, C, H, W)(N,C,H,W)
-
Output: (N,C,H,W)(N, C, H, W)(N,C,H,W) (same shape as input)
-
num_features – C from an expected input of size (N, C, D, H, W)
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
momentum – the value used for the running_mean and running_var computation. Default: 0.1
-
affine – a boolean value that when set to
True
, this module has learnable affine parameters, initialized the same way as done for batch normalization. Default:False
. -
track_running_stats – a boolean value that when set to
True
, this module tracks the running mean and variance, and when set toFalse
, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default:False
-
Input: (N,C,D,H,W)(N, C, D, H, W)(N,C,D,H,W)
-
Output: (N,C,D,H,W)(N, C, D, H, W)(N,C,D,H,W) (same shape as input)
-
normalized_shape (int or list or torch.Size) –
input shape from an expected input of size
[* \times \text{normalized\_shape}[0] \times \text{normalized\_shape}[1] \times \ldots \times \text{normalized\_shape}[-1]]
If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size.
-
eps – a value added to the denominator for numerical stability. Default: 1e-5
-
elementwise_affine – a boolean value that when set to
True
, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases). Default:True
. -
Input: (N,∗)(N, *)(N,∗)
-
Output: (N,∗)(N, *)(N,∗) (same shape as input)
-
size – amount of neighbouring channels used for normalization
-
alpha – multiplicative factor. Default: 0.0001
-
beta – exponent. Default: 0.75
-
k – additive factor. Default: 1
-
Input: (N,C,∗)(N, C, *)(N,C,∗)
-
Output: (N,C,∗)(N, C, *)(N,C,∗) (same shape as input)
-
input_size – The number of expected features in the input x
-
hidden_size – The number of features in the hidden state h
-
num_layers – Number of recurrent layers. E.g., setting
num_layers=2
would mean stacking two RNNs together to form a stacked RNN, with the second RNN taking in outputs of the first RNN and computing the final results. Default: 1 -
nonlinearity – The non-linearity to use. Can be either
'tanh'
or'relu'
. Default:'tanh'
-
bias – If
False
, then the layer does not use bias weights b_ih and b_hh. Default:True
-
batch_first – If
True
, then the input and output tensors are provided as (batch, seq, feature). Default:False
-
dropout – If non-zero, introduces a Dropout layer on the outputs of each RNN layer except the last layer, with dropout probability equal to
dropout
. Default: 0 -
bidirectional – If
True
, becomes a bidirectional RNN. Default:False
-
input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.
-
h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.
-
output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
For the unpacked case, the directions can be separated using
output.view(seq_len, batch, num_directions, hidden_size)
, with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case. -
h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len.
Like output, the layers can be separated using
h_n.view(num_layers, num_directions, batch, hidden_size)
. -
Input1: (L, N, H_in) tensor containing input features, where H_in = input_size and L represents a sequence length.
-
Input2: (S, N, H_out) tensor containing the initial hidden state for each element in the batch, where S = num_layers * num_directions and H_out = hidden_size. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.
-
Output1: (L, N, H_all) where H_all = num_directions * hidden_size
-
Output2: (S, N, H_out) tensor containing the next hidden state for each element in the batch
-
weight_ih_l[k] – the learnable input-hidden weights of the k-th layer, of shape (hidden_size, input_size) for k = 0. Otherwise, the shape is (hidden_size, num_directions * hidden_size)
-
weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer, of shape (hidden_size, hidden_size)
-
bias_ih_l[k] – the learnable input-hidden bias of the k-th layer, of shape (hidden_size)
-
bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer, of shape (hidden_size)
-
input_size – The number of expected features in the input x
-
hidden_size – The number of features in the hidden state h
-
num_layers – Number of recurrent layers. E.g., setting
num_layers=2
would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1 -
bias – If
False
, then the layer does not use bias weights b_ih and b_hh. Default:True
-
batch_first – If
True
, then the input and output tensors are provided as (batch, seq, feature). Default:False