1. Problem scenario:
When training a deep learning network in MATLAB and selecting a solver via trainingOptions(), the following error appears:
'BatchNormalizationStatistics' is not an option for solver 'adam' (or 'sgdm')
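A minimal call that reproduces the message is sketched below; the solver and option value are illustrative assumptions, not taken from the original error report:

```matlab
% Hypothetical repro: on MATLAB R2020a this call is rejected, because
% the 'BatchNormalizationStatistics' option did not exist yet in that
% release. trainingOptions() validates option names immediately, so no
% network or data is needed to trigger the error.
options = trainingOptions('adam', ...
    'BatchNormalizationStatistics', 'moving');
```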
2. Cause analysis:
Opening the trainingOptions() function to inspect it shows that no such option exists in this release. A quick web search suggests it is most likely a MATLAB version issue; the installation in use here is MATLAB R2020a.
3. Solution:
Upgrade MATLAB to R2022b and open the trainingOptions() function again: several new parameter options have been added, among them 'BatchNormalizationStatistics', which resolves the problem. The updated help text is shown below, followed by a working example call:
function opts = trainingOptions(solverName, varargin)
% trainingOptions Options for training a neural network
%
% options = trainingOptions(solverName) creates a set of training options
% for the solver specified by solverName. Possible values for solverName
% include:
%
% 'sgdm' - Stochastic gradient descent with momentum.
% 'adam' - Adaptive moment estimation (ADAM).
% 'rmsprop' - Root mean square propagation (RMSProp).
%
% options = trainingOptions(solverName, 'PARAM1', VAL1, 'PARAM2', VAL2, ...)
% specifies optional parameter name/value pairs for creating the training
% options:
%
% 'Momentum' - This parameter only applies if the solver
% is 'sgdm'. The momentum determines the
% contribution of the gradient step from the
% previous iteration to the current
% iteration of training. It must be a value
% between 0 and 1, where 0 will give no
% contribution from the previous step, and 1
% will give a maximal contribution from the
% previous step. The default value is 0.9.
% 'GradientDecayFactor' - This parameter only applies if the solver
% is 'adam'. It specifies the exponential
% decay rate for the gradient moving average
% in solver 'adam'. It must be a value >= 0
% and < 1. This parameter is denoted by the
% symbol 'beta1' in the original paper on
% 'adam'. The default value is 0.9.
% 'SquaredGradientDecayFactor'
% - This parameter only applies if the solver
% is 'adam' or 'rmsprop'. It specifies the
% exponential decay rate for the squared
% gradient moving average in solver 'adam'
% and 'rmsprop'. It must be a value >= 0 and
% < 1. This parameter is denoted by the
% symbol 'beta2' in the original paper on
% 'adam'. The default value is 0.999 for
% 'adam' and 0.9 for 'rmsprop'.
% 'Epsilon' - This parameter only applies if the solver
% is 'adam' or 'rmsprop'. It specifies the
% offset to use in the denominator for the
% update in solver 'adam' and 'rmsprop'. It
% must be a value > 0. The default value is
% 1e-8.
% 'InitialLearnRate' - The initial learning rate that is used for
% training. If the learning rate is too low,
% training will take a long time, but if it
% is too high, the training is likely to get
% stuck at a suboptimal result. The default
% is 0.01 for solver 'sgdm' and 0.001 for
% solvers 'adam' and 'rmsprop'.
% 'LearnRateSchedule' - This option allows the user to specify a
% method for lowering the global learning
% rate during training. Possible options
% include:
% - 'none' - The learning rate does not
% change and remains constant.
% - 'piecewise' - The learning rate is
% multiplied by a factor every time a
% certain number of epochs has passed.
% The multiplicative factor is controlled
% by the parameter 'LearnRateDropFactor',
% and the number of epochs between
% multiplications is controlled by
% 'LearnRateDropPeriod'.
% The default is 'none'.
% 'LearnRateDropFactor' - This parameter only applies if the
% 'LearnRateSchedule' is set to 'piecewise'.
% It is a multiplicative factor that is
% applied to the learning rate every time a
% certain number of epochs has passed.
% The default is 0.1.
% 'LearnRateDropPeriod' - This parameter only applies if the
% 'LearnRateSchedule' is set to 'piecewise'.
% The learning rate drop factor will be
% applied to the global learning rate every
% time this number of epochs is passed. The
% default is 10.
% 'L2Regularization' - The factor for the L2 regularizer. It
% should be noted that each set of parameters
% in a layer can specify a multiplier for
% this L2 regularizer. The default is 0.0001.
% 'GradientThresholdMethod'
% - Method used for gradient thresholding.
% Possible options are:
% - 'global-l2norm' - If the global L2 norm
% of the gradients, considering all
% learnable parameters, is larger than
% GradientThreshold, then scale all
% gradients by a factor of
% GradientThreshold/L where L is the
% global L2 norm.
% - 'l2norm' - If the L2 norm of the
% gradient of a learnable parameter is
% larger than GradientThreshold, then
% scale the gradient so that its norm
% equals GradientThreshold.
% - 'absolute-value' - If the absolute values
% of the individual partial derivatives
% in the gradient of a learnable
% parameter are larger than
% GradientThreshold, then scale them to
% have absolute value equal to
% GradientThreshold, retaining their signs.
% The default is 'l2norm'.
% 'GradientThreshold' - Positive threshold for the gradient. The
% default is Inf.
% 'MaxEpochs' - The maximum number of epochs that will be
% used for training. The default is 30.
% 'MiniBatchSize' - The size of the mini-batch used for each
% training iteration. The default is 128.
% 'Verbose' - If this is set to true, information on
% training progress will be printed to the
% command window. The default is true.
% 'VerboseFrequency' - This only has an effect if 'Verbose' is set
% to true. It specifies the number of
% iterations between printing to the command
% window. The default is 50.
% 'ValidationData' - Data to use for validation during training.
% This can be:
% - A datastore with categorical labels or
% numeric responses
% - A table, where the first column
% contains either image paths or images
% - A cell array {X, Y}, where X is a
% numeric array with the input data and Y
% is an array of responses
% 'ValidationFrequency' - Number of iterations between evaluations of
% validation metrics. This only has an effect
% if you also specify 'ValidationData'. The
% default is 50.
% 'ValidationPatience' - Number of times that the validation loss is
% allowed to be larger than or equal to the
% previously smallest loss before network
% training is stopped, specified as a
% positive integer or Inf. The default is
% Inf.
% 'Shuffle' - This controls if the training data is
% shuffled. The options are:
% - 'never' - No shuffling is applied.
% - 'once' - The data will be shuffled once
% before training.
% - 'every-epoch' - The data will be
% shuffled before every training epoch.
% The default is 'once'.
% 'CheckpointPath' - Path to save checkpoint networks, specified
% as a character vector or string scalar. If
% CheckpointPath is '', then the software
% does not save checkpoints.
% The default is ''.
% 'CheckpointFrequency' - Frequency of saving checkpoint networks,
% specified as a positive integer. To save
% checkpoint networks, specify a path using
% the 'CheckpointPath' training option.
% The default is 1.
% 'CheckpointFrequencyUnit'
% - Unit of 'CheckpointFrequency', specified
% as one of the following:
% - 'epoch' - Save checkpoints every
% CheckpointFrequency
% epochs.
% - 'iteration' - Save checkpoints every
% CheckpointFrequency
% iterations.
% The default is 'epoch'.
% 'ExecutionEnvironment'
% - The execution environment for the
% network. This determines what hardware
% resources will be used to train the
% network. To use a GPU, you must have
% Parallel Computing Toolbox(TM) and a
% supported GPU device. To use a compute
% cluster, you must have Parallel Computing
% Toolbox.
% - 'auto' - Use a GPU if it is
% available, otherwise use the CPU.
% - 'gpu' - Use the GPU.
% - 'cpu' - Use the CPU.
% - 'multi-gpu' - Use multiple GPUs on one
% machine, using a local parallel pool.
% If no pool is open, one is opened with
% one worker per supported GPU device.
% - 'parallel' - Use a compute cluster. If
% no pool is open, one is opened using
% the default cluster profile. If the
% pool has access to GPUs then they will
% be used and excess workers will be
% idle. If the pool has no GPUs then
% training will take place on all cluster
% CPUs.
% The default is 'auto'.
% 'WorkerLoad' - For the 'multi-gpu' and 'parallel'
% execution environments. Determines how to
% divide up computation between GPUs or CPUs,
% by specifying relative load on parallel
% workers. For MiniBatchable Datastore inputs
% to training, if the 'DispatchInBackground'
% value is set to true, then workers with a
% load of 0 prefetch data in the background.
% Specify as one of
% - an integer specifying the number, or
% fraction specifying the proportion, of
% workers on each machine to use for
% training computation. The remaining
% workers perform background prefetch, if
% enabled.
% - a vector with one element per worker in
% the parallel pool, specifying the load
% for each worker. For a vector W, each
% worker gets W/sum(W) of the work. A
% worker with a WorkerLoad of 0 performs
% background prefetch, if enabled.
% For GPU environments, the default is to use
% one worker per GPU for training
% computation, with the remaining available
% for background prefetch. For CPU
% environments, the default is to use all
% workers for computation, except when
% 'DispatchInBackground' is enabled, in which
% case one worker on each machine is set
% aside for background prefetch.
% 'OutputFcn' - Specifies one or more functions to be
% called during training at the end of each
% iteration. Typically, you might use an
% output function to display or plot progress
% information, or determine whether training
% should be terminated early. The function
% will be passed a struct containing
% information from the current iteration. It
% may also return true, which will trigger
% early termination.
% 'Plots' - Plots to display during training, specified
% as 'training-progress' or 'none' (default).
% 'SequenceLength' - Strategy to determine the length of the
% sequences used per mini-batch. Options are:
% - 'longest' to pad all sequences in a
% batch to the length of the longest
% sequence.
% - 'shortest' to truncate all sequences in
% a batch to the length of the shortest
% sequence.
% - Positive integer - For each mini-batch,
% pad the sequences to the nearest
% multiple of the specified length that
% is greater than the longest sequence
% length in the mini-batch, and then
% split the sequences into smaller
% sequences of the specified length. If
% splitting occurs, then the software
% creates extra mini-batches.
% The default is 'longest'.
% 'SequencePaddingValue' - Scalar value used to pad sequences where
% necessary. The default is 0.
% 'SequencePaddingDirection'
% - Direction of padding or truncation,
% specified as:
% - 'right' - Pad or truncate sequences
% on the right.
% - 'left' - Pad or truncate sequences on
% the left.
% The default is 'right'.
% 'DispatchInBackground' - Scalar logical used to control whether
% asynchronous pre-fetch queueing is
% used when reading training data. The
% default is false. Requires Parallel
% Computing Toolbox.
% 'ResetInputNormalization'
% - Flag to reset input layer normalization,
% specified as one of the following:
% - true - Clear the input layer statistics
% and recalculate them during training.
% - false - Do not clear existing input
% layer statistics and calculate values
% for empty statistics only.
% The default is true.
% 'BatchNormalizationStatistics'  **newly added option**
% - Mode to evaluate the statistics in batch
% normalization layers, specified as one of
% the following:
% - 'population' - use the population
% statistics. This requires a
% finalization phase after training.
% - 'moving' - approximate the statistics
% using a running estimate accumulated
% during training. To use this option,
% 'ExecutionEnvironment' must be 'auto',
% 'cpu', or 'gpu'.
% The default is 'population'.
% 'OutputNetwork' - Network to return when training ends,
% specified as one of the following:
% - 'last-iteration' - return the network
% corresponding to the last training
% iteration.
% - 'best-validation-loss' - return the
% network at the iteration corresponding
% to the best validation loss. To use
% this option, you must also specify
% 'ValidationData'.
% The default is 'last-iteration'.
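With the updated version installed, a call such as the following is accepted. This is a minimal sketch with an adam solver and illustrative hyperparameter values; the commented trainNetwork line assumes hypothetical variables XTrain, YTrain, and layers that are not part of the original post:

```matlab
% Minimal sketch (illustrative values): build training options using
% the newly added 'BatchNormalizationStatistics' name-value pair.
% 'moving' keeps a running estimate of the batch normalization
% statistics during training; 'population' (the default) computes
% them in a finalization phase after training.
options = trainingOptions('adam', ...
    'InitialLearnRate', 1e-3, ...
    'MaxEpochs', 30, ...
    'MiniBatchSize', 128, ...
    'BatchNormalizationStatistics', 'moving', ...
    'Plots', 'training-progress');

% The options are then passed to training together with a network that
% contains batchNormalizationLayer objects, e.g. (hypothetical data):
% net = trainNetwork(XTrain, YTrain, layers, options);
```

Note that, as documented above, 'moving' requires 'ExecutionEnvironment' to be 'auto', 'cpu', or 'gpu', so it cannot be combined with multi-GPU or parallel training.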