Handling imbalanced data in Keras with class_weight and sample_weight

In machine learning and deep learning we often face imbalanced positive and negative samples, especially in scenarios such as advertising and push notifications, where the imbalance can be severe. The conventional remedies are oversampling and undersampling (sketched below).
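For comparison before turning to loss weighting, here is a minimal sketch of random oversampling with NumPy (the arrays and class counts are made up for illustration):

```python
import numpy as np

# Made-up imbalanced toy data: 95 negatives, 5 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Random oversampling: resample the minority class with replacement
# until both classes have the same number of examples.
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
idx = rng.permutation(np.concatenate([neg_idx, pos_idx, extra]))

X_balanced, y_balanced = X[idx], y[idx]
print(np.bincount(y_balanced))  # [95 95]
```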

This post looks at two arguments Keras provides for this:

`class_weight` and `sample_weight`

1. `class_weight` assigns a weight to each class in the training set. A majority class with many samples can be given a small weight, while a minority class can be given a larger one.

2. `sample_weight` weights each individual sample; the idea is the same: samples belonging to a majority class get lower weights.

For example: `model.fit(..., class_weight={0: 1., 1: 100.})`; some Keras versions also allow `model.fit(..., class_weight='auto')` directly. Both arguments are shown in the sketch below.
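A minimal sketch of both arguments in use (the model, data, and weight values are invented for illustration; the weights here are simple inverse class frequencies):

```python
import numpy as np
import tensorflow as tf

# Made-up imbalanced data: 95 negatives, 5 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)).astype("float32")
y = np.array([0] * 95 + [1] * 5, dtype="int32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")

# Option 1: one weight per class, e.g. inverse class frequency.
counts = np.bincount(y)
class_weight = {c: len(y) / (2.0 * n) for c, n in enumerate(counts)}
model.fit(X, y, epochs=2, class_weight=class_weight)

# Option 2: one weight per sample -- equivalent here, since each
# sample simply inherits the weight of its class.
sample_weight = np.array([class_weight[c] for c in y])
model.fit(X, y, epochs=2, sample_weight=sample_weight)
```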

Note the following:

1. Using `class_weight` changes the scale of the loss, which can hurt training stability. This becomes a problem when the optimizer's step size depends on the magnitude of the gradient, as with plain SGD; optimizers such as Adam, whose updates are largely invariant to gradient scale, are less affected. Also note that the loss of a model trained with `class_weight` cannot be compared to the loss of one trained without it.

2. As mentioned above, some Keras versions accept `'auto'` to determine the weights automatically; since the tf.keras signature below documents only a dictionary, computing a balanced weight dict yourself (as sketched below) is the portable approach.
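A portable way to get such "balanced" weights is scikit-learn's `compute_class_weight` (a sketch, assuming `X`, `y`, and `model` from the example above):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# "balanced" weights: n_samples / (n_classes * count_per_class)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = {int(c): w for c, w in zip(classes, weights)}
# e.g. {0: 0.526..., 1: 10.0} for 95 negatives and 5 positives
model.fit(X, y, epochs=2, class_weight=class_weight)
```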

Below is the corresponding excerpt from the tf.keras `Model.fit` signature and docstring:

```python

def fit(self,
        x=None,
        y=None,
        batch_size=None,
        epochs=1,
        verbose=1,
        callbacks=None,
        validation_split=0.,
        validation_data=None,
        shuffle=True,
        class_weight=None,
        sample_weight=None,
        initial_epoch=0,
        steps_per_epoch=None,
        validation_steps=None,
        validation_freq=1,
        max_queue_size=10,
        workers=1,
        use_multiprocessing=False,
        **kwargs):
  """Trains the model for a fixed number of epochs (iterations on a dataset).

  Arguments:
      x: Input data. It could be:
        - A Numpy array (or array-like), or a list of arrays
          (in case the model has multiple inputs).
        - A TensorFlow tensor, or a list of tensors
          (in case the model has multiple inputs).
        - A dict mapping input names to the corresponding array/tensors,
          if the model has named inputs.
        - A `tf.data` dataset. Should return a tuple
          of either `(inputs, targets)` or
          `(inputs, targets, sample_weights)`.
        - A generator or `keras.utils.Sequence` returning `(inputs, targets)`
          or `(inputs, targets, sample_weights)`.
      y: Target data. Like the input data `x`,
        it could be either Numpy array(s) or TensorFlow tensor(s).
        It should be consistent with `x` (you cannot have Numpy inputs and
        tensor targets, or inversely). If `x` is a dataset, generator,
        or `keras.utils.Sequence` instance, `y` should
        not be specified (since targets will be obtained from `x`).
      batch_size: Integer or `None`.
          Number of samples per gradient update.
          If unspecified, `batch_size` will default to 32.
          Do not specify the `batch_size` if your data is in the
          form of symbolic tensors, datasets,
          generators, or `keras.utils.Sequence` instances (since they generate
          batches).
      epochs: Integer. Number of epochs to train the model.
          An epoch is an iteration over the entire `x` and `y`
          data provided.
          Note that in conjunction with `initial_epoch`,
          `epochs` is to be understood as "final epoch".
          The model is not trained for a number of iterations
          given by `epochs`, but merely until the epoch
          of index `epochs` is reached.
      verbose: 0, 1, or 2. Verbosity mode.
          0 = silent, 1 = progress bar, 2 = one line per epoch.
          Note that the progress bar is not particularly useful when
          logged to a file, so verbose=2 is recommended when not running
          interactively (eg, in a production environment).
      callbacks: List of `keras.callbacks.Callback` instances.
          List of callbacks to apply during training.
          See `tf.keras.callbacks`.
      validation_split: Float between 0 and 1.
          Fraction of the training data to be used as validation data.
          The model will set apart this fraction of the training data,
          will not train on it, and will evaluate
          the loss and any model metrics
          on this data at the end of each epoch.
          The validation data is selected from the last samples
          in the `x` and `y` data provided, before shuffling. This argument is
          not supported when `x` is a dataset, generator or
         `keras.utils.Sequence` instance.
      validation_data: Data on which to evaluate
          the loss and any model metrics at the end of each epoch.
          The model will not be trained on this data.
          `validation_data` will override `validation_split`.
          `validation_data` could be:
            - tuple `(x_val, y_val)` of Numpy arrays or tensors
            - tuple `(x_val, y_val, val_sample_weights)` of Numpy arrays
            - dataset
          For the first two cases, `batch_size` must be provided.
          For the last case, `validation_steps` must be provided.
      shuffle: Boolean (whether to shuffle the training data
          before each epoch) or str (for 'batch').
          'batch' is a special option for dealing with the
          limitations of HDF5 data; it shuffles in batch-sized chunks.
          Has no effect when `steps_per_epoch` is not `None`.
      class_weight: Optional dictionary mapping class indices (integers)
          to a weight (float) value, used for weighting the loss function
          (during training only).
          This can be useful to tell the model to
          "pay more attention" to samples from
          an under-represented class.
      sample_weight: Optional Numpy array of weights for
          the training samples, used for weighting the loss function
          (during training only). You can either pass a flat (1D)
          Numpy array with the same length as the input samples
          (1:1 mapping between weights and samples),
          or in the case of temporal data,
          you can pass a 2D array with shape
          `(samples, sequence_length)`,
          to apply a different weight to every timestep of every sample.
          In this case you should make sure to specify
          `sample_weight_mode="temporal"` in `compile()`. This argument is not
          supported when `x` is a dataset, generator, or
         `keras.utils.Sequence` instance, instead provide the sample_weights
          as the third element of `x`.
      initial_epoch: Integer.
          Epoch at which to start training
          (useful for resuming a previous training run).
      steps_per_epoch: Integer or `None`.
          Total number of steps (batches of samples)
          before declaring one epoch finished and starting the
          next epoch. When training with input tensors such as
          TensorFlow data tensors, the default `None` is equal to
          the number of samples in your dataset divided by
          the batch size, or 1 if that cannot be determined. If x is a
          `tf.data` dataset, and 'steps_per_epoch'
          is None, the epoch will run until the input dataset is exhausted.
          This argument is not supported with array inputs.
      validation_steps: Only relevant if `validation_data` is provided and
          is a `tf.data` dataset. Total number of steps (batches of
          samples) to draw before stopping when performing validation
          at the end of every epoch. If validation_data is a `tf.data` dataset
          and 'validation_steps' is None, validation
          will run until the `validation_data` dataset is exhausted.
      validation_freq: Only relevant if validation data is provided. Integer
          or `collections_abc.Container` instance (e.g. list, tuple, etc.).
          If an integer, specifies how many training epochs to run before a
          new validation run is performed, e.g. `validation_freq=2` runs
          validation every 2 epochs. If a Container, specifies the epochs on
          which to run validation, e.g. `validation_freq=[1, 2, 10]` runs
          validation at the end of the 1st, 2nd, and 10th epochs.
      max_queue_size: Integer. Used for generator or `keras.utils.Sequence`
          input only. Maximum size for the generator queue.
          If unspecified, `max_queue_size` will default to 10.
      workers: Integer. Used for generator or `keras.utils.Sequence` input
          only. Maximum number of processes to spin up
          when using process-based threading. If unspecified, `workers`
          will default to 1. If 0, will execute the generator on the main
          thread.
      use_multiprocessing: Boolean. Used for generator or
          `keras.utils.Sequence` input only. If `True`, use process-based
          threading. If unspecified, `use_multiprocessing` will default to
          `False`. Note that because this implementation relies on
          multiprocessing, you should not pass non-picklable arguments to
          the generator as they can't be passed easily to children processes.
      **kwargs: Used for backwards compatibility.

  Returns:
      A `History` object. Its `History.history` attribute is
      a record of training loss values and metrics values
      at successive epochs, as well as validation loss values
      and validation metrics values (if applicable).

  Raises:
      RuntimeError: If the model was never compiled.
      ValueError: In case of mismatch between the provided input data
          and what the model expects.
  """
A related topic: how to define a custom loss function in Keras, implement sample weighting inside it, and use metrics to evaluate the model.

Keras ships common loss functions such as mean squared error (MSE) and cross-entropy, but in practice you may need to define your own. A custom loss in Keras is simply a Python function that takes `(y_true, y_pred)`, the true labels and the model's predictions, and returns a scalar loss value. For example, a hand-rolled MSE:

```python
import tensorflow as tf
from keras import backend as K

def custom_mse(y_true, y_pred):
    return K.mean(K.square(y_true - y_pred))
```

To weight samples inside the loss itself, multiply each sample's loss element-wise by its weight before averaging. In the weighted MSE below the weights are hard-coded for illustration, so it only works for batches of exactly 3 samples; in real use you would pass weights through `fit(sample_weight=...)` instead:

```python
def weighted_mse(y_true, y_pred):
    # Illustration only: fixed weights for a batch of exactly 3 samples,
    # shaped (3, 1) so they broadcast against (batch, 1) model outputs.
    sample_weight = tf.constant([[1.0], [2.0], [3.0]], dtype=tf.float32)
    return K.mean(K.square(y_true - y_pred) * sample_weight)
```

Finally, to evaluate the model's performance, specify one or more metrics in `compile`. Keras provides many common metrics, such as accuracy and mean absolute error (MAE); to define your own, subclass `keras.metrics.Metric`. For example, a metric that tracks the mean absolute percentage error (as a fraction, not scaled by 100):

```python
from keras import backend as K
from keras.metrics import Metric

class PercentageError(Metric):
    def __init__(self, name='percentage_error', **kwargs):
        super(PercentageError, self).__init__(name=name, **kwargs)
        # Running sums, updated batch by batch.
        self.total_error = self.add_weight(name='total_error', initializer='zeros')
        self.total_count = self.add_weight(name='total_count', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        abs_error = K.abs(y_true - y_pred)
        # Clip the denominator away from zero to avoid division by zero.
        percentage_error = abs_error / K.clip(K.abs(y_true), K.epsilon(), None)
        if sample_weight is not None:
            percentage_error *= sample_weight
        self.total_error.assign_add(K.sum(percentage_error))
        self.total_count.assign_add(K.sum(K.ones_like(y_true)))

    def result(self):
        return self.total_error / self.total_count
```

In `update_state` we compute each element's absolute percentage error, multiply it by the sample weights when they are given, and accumulate the result into `total_error`; the element count is accumulated into `total_count`. `result` then divides the accumulated error by the count to return the running mean.

To use a custom metric, pass it to the `metrics` argument of `compile`. A model definition using both the custom loss and the custom metric above:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(10, input_shape=(3,), activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(loss=weighted_mse, optimizer='adam', metrics=[PercentageError()])
```

This model uses the weighted MSE loss and the percentage-error metric defined above. Because `weighted_mse` hard-codes three weights, it must be trained with `batch_size=3`.
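For completeness, the same per-sample weighting can usually be achieved without a custom loss by passing weights through `fit`, as described at the top of this post. A sketch reusing the model above, recompiled with a plain MSE loss (the data is invented):

```python
import numpy as np

# Recompile with a standard loss; weighting now comes from fit().
model.compile(loss='mse', optimizer='adam')

# Made-up data: 6 samples, 3 features, 1 regression target each.
x = np.random.rand(6, 3).astype('float32')
y = np.random.rand(6, 1).astype('float32')

# One weight per sample; Keras scales each sample's loss by it.
w = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0], dtype='float32')
model.fit(x, y, sample_weight=w, batch_size=3, epochs=1)
```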