Handling overfitting in deep learning models

This post examines overfitting in deep learning models: how to identify it and how to reduce it. Worked examples show how to curb overfitting by reducing the network's capacity, applying regularization, and adding dropout layers. In the end, the model with dropout layers performs best on the test data.

Overfitting occurs when your model fits the training data well but fails to generalize to new, unseen data. In other words, the model has learned patterns that are specific to the training data and irrelevant for other data.

We can identify overfitting by looking at validation metrics such as the loss or accuracy. Usually, the validation metric stops improving after a certain number of epochs and starts to deteriorate afterward: the validation loss begins to rise (or the validation accuracy begins to drop). The training metric, in contrast, keeps improving, because the model keeps seeking the best possible fit to the training data.
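
As an aside, Keras also ships an EarlyStopping callback that automates this check by monitoring a validation metric and halting training once it stops improving. It is not part of the workflow below (we will instead locate the best epoch manually with a helper function), but a minimal sketch would look like this:

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',          # watch the validation loss
                           patience=3,                  # tolerate 3 epochs without improvement
                           restore_best_weights=True)   # roll back to the best weights seen
# passed to model.fit(..., callbacks=[early_stop]) it stops training once overfitting sets in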

There are several ways to reduce overfitting in deep learning models. The best option is to get more training data. Unfortunately, in real-world situations you often do not have that possibility due to time, budget, or technical constraints.

Another way to reduce overfitting is to lower the model's capacity to memorize the training data. The model then has to focus on the relevant patterns in the training data, which results in better generalization. In this post, we'll discuss three options to achieve this.

Set up the project

We start by importing the necessary packages and configuring some parameters. We will use Keras to fit the deep learning models. The training data is the Twitter US Airline Sentiment data set from Kaggle.


# Basic packages
import pandas as pd 
import numpy as np
import re
import collections
import matplotlib.pyplot as plt
from pathlib import Path
# Packages for data preparation
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import LabelEncoder
# Packages for modeling
from keras import models
from keras import layers
from keras import regularizers
NB_WORDS = 10000  # Parameter indicating the number of words we'll put in the dictionary
NB_START_EPOCHS = 20  # Number of epochs we usually start to train with
BATCH_SIZE = 512  # Size of the batches used in the mini-batch gradient descent
MAX_LEN = 20  # Maximum number of words in a sequence
root = Path('../')
input_path = root / 'input/' 
output_path = root / 'output/'
source_path = root / 'source/'

Some helper functions

We will use some helper functions throughout this post.


def deep_model(model, X_train, y_train, X_valid, y_valid):
    '''
    Function to train a multi-class model. The number of epochs and 
    batch_size are set by the constants at the top of the
    notebook. 
    
    Parameters:
        model : model with the chosen architecture
        X_train : training features
        y_train : training target
        X_valid : validation features
        y_valid : validation target
    Output:
        model training history
    '''
    model.compile(optimizer='rmsprop'
                  , loss='categorical_crossentropy'
                  , metrics=['accuracy'])
    
    history = model.fit(X_train
                       , y_train
                       , epochs=NB_START_EPOCHS
                       , batch_size=BATCH_SIZE
                       , validation_data=(X_valid, y_valid)
                       , verbose=0)
    return history
def eval_metric(model, history, metric_name):
    '''
    Function to evaluate a trained model on a chosen metric. 
    Training and validation metric are plotted in a
    line chart for each epoch.
    
    Parameters:
        model : model whose name is used in the plot title
        history : model training history
        metric_name : loss or accuracy
    Output:
        line chart with epochs on the x-axis and the metric on
        the y-axis
    '''
    metric = history.history[metric_name]
    val_metric = history.history['val_' + metric_name]
    e = range(1, NB_START_EPOCHS + 1)
    plt.plot(e, metric, 'bo', label='Train ' + metric_name)
    plt.plot(e, val_metric, 'b', label='Validation ' + metric_name)
    plt.xlabel('Epoch number')
    plt.ylabel(metric_name)
    plt.title('Comparing training and validation ' + metric_name + ' for ' + model.name)
    plt.legend()
    plt.show()
def test_model(model, X_train, y_train, X_test, y_test, epoch_stop):
    '''
    Function to test the model on new data after training it
    on the full training data with the optimal number of epochs.
    
    Parameters:
        model : trained model
        X_train : training features
        y_train : training target
        X_test : test features
        y_test : test target
        epoch_stop : optimal number of epochs
    Output:
        test accuracy and test loss
    '''
    model.fit(X_train
              , y_train
              , epochs=epoch_stop
              , batch_size=BATCH_SIZE
              , verbose=0)
    results = model.evaluate(X_test, y_test)
    print()
    print('Test accuracy: {0:.2f}%'.format(results[1]*100))
    return results
    
def remove_stopwords(input_text):
    '''
    Function to remove English stopwords from a text string.
    
    Parameters:
        input_text : text to clean
    Output:
        cleaned text string
    '''
    stopwords_list = stopwords.words('english')
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"]
    words = input_text.split() 
    clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
    return " ".join(clean_words) 
    
def remove_mentions(input_text):
    '''
    Function to remove mentions (words preceded by @) from a text string
    
    Parameters:
        input_text : text to clean
    Output:
        cleaned text string
    '''
    return re.sub(r'@\w+', '', input_text)
def compare_models_by_metric(model_1, model_2, model_hist_1, model_hist_2, metric):
    '''
    Function to compare a metric between two models 
    
    Parameters:
        model_1 : first model (used for the plot legend)
        model_2 : second model (used for the plot legend)
        model_hist_1 : training history of model 1
        model_hist_2 : training history of model 2
        metric : metric to compare: loss, acc, val_loss or val_acc
        
    Output:
        plot of metrics of both models
    '''
    metric_model_1 = model_hist_1.history[metric]
    metric_model_2 = model_hist_2.history[metric]
    e = range(1, NB_START_EPOCHS + 1)
    
    metrics_dict = {
        'acc' : 'Training Accuracy',
        'loss' : 'Training Loss',
        'val_acc' : 'Validation accuracy',
        'val_loss' : 'Validation loss'
    }
    
    metric_label = metrics_dict[metric]
    plt.plot(e, metric_model_1, 'bo', label=model_1.name)
    plt.plot(e, metric_model_2, 'b', label=model_2.name)
    plt.xlabel('Epoch number')
    plt.ylabel(metric_label)
    plt.title('Comparing ' + metric_label + ' between models')
    plt.legend()
    plt.show()
    
def optimal_epoch(model_hist):
    '''
    Function to return the epoch number where the validation loss is
    at its minimum
    
    Parameters:
        model_hist : training history of model
    Output:
        epoch number with minimum validation loss
    '''
    min_epoch = np.argmin(model_hist.history['val_loss']) + 1
    print("Minimum validation loss reached in epoch {}".format(min_epoch))
    return min_epoch

Data preparation

Data cleaning

We load the CSV with the tweets and perform a random shuffle. It's good practice to shuffle the data before splitting it into a train and test set; that way the sentiment classes end up evenly distributed over the train and test sets. We'll only keep the text column as input and the airline_sentiment column as the target.

The next thing we’ll do is remove stopwords. Stopwords do not have any value for predicting the sentiment. Furthermore, as we want to build a model that can be used for other airline companies as well, we remove the mentions.


df = pd.read_csv(input_path / 'Tweets.csv')
df = df.reindex(np.random.permutation(df.index))  
df = df[['text', 'airline_sentiment']]
df.text = df.text.apply(remove_stopwords).apply(remove_mentions)

Train-test split

Model performance needs to be evaluated on a separate test set, so that we can estimate how well the model generalizes. This is done with the train_test_split method of scikit-learn.

X_train, X_test, y_train, y_test = train_test_split(df.text, df.airline_sentiment, test_size=0.1, random_state=37)

Converting words to numbers

To use the text as input for a model, we first need to convert the words into tokens, which simply means mapping each word to an integer that refers to an index in a dictionary. Here we will only keep the NB_WORDS most frequent words of the training set.

We clean up the text by filtering out punctuation and converting the words to lowercase. Words are split on spaces.

tk = Tokenizer(num_words=NB_WORDS,
               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
               lower=True,
               char_level=False,
               split=' ')
tk.fit_on_texts(X_train)

After having created the dictionary, we can convert the text of a tweet to a vector with NB_WORDS values. With mode=binary, each value indicates whether the corresponding word appears in the tweet or not. This is done with the texts_to_matrix method of the Tokenizer.

X_train_oh = tk.texts_to_matrix(X_train, mode='binary')
X_test_oh = tk.texts_to_matrix(X_test, mode='binary')
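
A quick sanity check (hypothetical, not part of the original notebook) on the shape and contents of the resulting matrix:

print(X_train_oh.shape)       # (number of training tweets, NB_WORDS)
print(np.unique(X_train_oh))  # with mode='binary' the matrix contains only 0. and 1.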

Converting the target classes to numbers

We need to convert the target classes to numbers as well, which in turn are one-hot-encoded with the to_categorical method in Keras.


le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_test_le = le.transform(y_test)
y_train_oh = to_categorical(y_train_le)
y_test_oh = to_categorical(y_test_le)
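
As another hypothetical check (not in the original notebook), the fitted LabelEncoder exposes the class labels in the order of their integer codes, and the one-hot target should have one column per class:

print(le.classes_)       # for this data set: the three sentiment labels, e.g. negative, neutral, positive
print(y_train_oh.shape)  # (number of training tweets, 3)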

Splitting off a validation set

Now that our data is ready, we split off a validation set. This validation set will be used to evaluate the model performance when we tune the parameters of the model.


X_train_rest, X_valid, y_train_rest, y_valid = train_test_split(X_train_oh, y_train_oh, test_size=0.1, random_state=37)

Deep learning

Creating a model that overfits

We start with a model that overfits. It has two densely connected layers of 64 units each. The input_shape for the first layer is equal to the number of words we kept in the dictionary and for which we created one-hot-encoded features.

As we need to predict 3 different sentiment classes, the last layer has 3 elements. The softmax activation function makes sure the three probabilities sum up to 1.


The number of parameters to train per layer is computed as (number of inputs x number of units) + number of bias terms (one per unit). The number of inputs for the first layer equals the number of words we kept in the dictionary (NB_WORDS). The subsequent layers take the number of outputs of the previous layer as their number of inputs. So the number of parameters per layer is:

  • First layer : (10000 x 64) + 64 = 640064
  • Second layer : (64 x 64) + 64 = 4160
  • Last layer : (64 x 3) + 3 = 195

base_model = models.Sequential()
base_model.add(layers.Dense(64, activation='relu', input_shape=(NB_WORDS,)))
base_model.add(layers.Dense(64, activation='relu'))
base_model.add(layers.Dense(3, activation='softmax'))
base_model.name = 'Baseline model'
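
As a quick check (not part of the original post), printing the model summary should confirm these counts:

base_model.summary()  # should report 640064 + 4160 + 195 = 644419 trainable parameters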

Because this project is a multi-class, single-label prediction, we use categorical_crossentropy as the loss function and softmax as the final activation function. We fit the model on the train data and validate on the validation set. We run for a predetermined number of epochs and will see when the model starts to overfit.


base_history = deep_model(base_model, X_train_rest, y_train_rest, X_valid, y_valid)
base_min = optimal_epoch(base_history)
eval_metric(base_model, base_history, 'loss')

In the beginning, the validation loss goes down. But at epoch 3 this stops, and the validation loss starts increasing rapidly. This is the point where the model begins to overfit.

The training loss continues to go down and almost reaches zero at epoch 20. This is normal as the model is trained to fit the train data as well as possible.


Handling overfitting

Now, we can try to do something about the overfitting. There are different options to do that.


  • Reduce the network’s capacity by removing layers or reducing the number of elements in the hidden layers

  • Apply regularization, which comes down to adding a cost to the loss function for large weights

  • Use dropout layers, which will randomly remove certain features by setting them to zero

Reducing the network’s capacity

Our first model has a large number of trainable parameters. The higher this number, the more easily the model can memorize the target class for each training sample. Obviously, that is not ideal for generalizing to new data.

By lowering the capacity of the network, you force it to learn the patterns that matter or that minimize the loss. On the other hand, reducing the network’s capacity too much will lead to underfitting. The model will not be able to learn the relevant patterns in the train data.


We reduce the network’s capacity by removing one hidden layer and lowering the number of elements in the remaining layer to 16.


reduced_model = models.Sequential()
reduced_model.add(layers.Dense(16, activation='relu', input_shape=(NB_WORDS,)))
reduced_model.add(layers.Dense(3, activation='softmax'))
reduced_model.name = 'Reduced model'
reduced_history = deep_model(reduced_model, X_train_rest, y_train_rest, X_valid, y_valid)
reduced_min = optimal_epoch(reduced_history)
eval_metric(reduced_model, reduced_history, 'loss')

We can see that it takes more epochs before the reduced model starts overfitting. Its validation loss also goes up more slowly than that of our first model.

compare_models_by_metric(base_model, reduced_model, base_history, reduced_history, 'val_loss')

When we compare the validation losses of the baseline and the reduced model, it is clear that the reduced model starts overfitting at a later epoch, and its validation loss stays low for much longer than the baseline model's.

Applying regularization

To address overfitting, we can apply weight regularization to the model. This will add a cost to the loss function of the network for large weights (or parameter values). As a result, you get a simpler model that will be forced to learn only the relevant patterns in the train data.


There are two common flavours: L1 regularization and L2 regularization (a short sketch of how each is specified in Keras follows the list below).

  • L1 regularization will add a cost with regard to the absolute value of the parameters. It results in some of the weights being exactly equal to zero.

  • L2 regularization will add a cost with regard to the squared value of the parameters. This results in smaller weights overall.
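
A minimal sketch (using the regularizers module already imported above; these layer variables are illustrative and not part of the original notebook) of how an L1 or a combined L1+L2 penalty could be attached to a layer:

# L1 penalty: 0.001 * sum of absolute weight values is added to the loss during training
l1_layer = layers.Dense(64, kernel_regularizer=regularizers.l1(0.001), activation='relu')
# both penalties at once
l1_l2_layer = layers.Dense(64, kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.001), activation='relu')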

Let’s try with L2 regularization.


reg_model = models.Sequential()
reg_model.add(layers.Dense(64, kernel_regularizer=regularizers.l2(0.001), activation='relu', input_shape=(NB_WORDS,)))
reg_model.add(layers.Dense(64, kernel_regularizer=regularizers.l2(0.001), activation='relu'))
reg_model.add(layers.Dense(3, activation='softmax'))
reg_model.name = 'L2 Regularization model'
reg_history = deep_model(reg_model, X_train_rest, y_train_rest, X_valid, y_valid)
reg_min = optimal_epoch(reg_history)

For the regularized model, we notice that it starts overfitting in the same epoch as the baseline model, but the loss increases much more slowly afterward.

eval_metric(reg_model, reg_history, 'loss')
compare_models_by_metric(base_model, reg_model, base_history, reg_history, 'val_loss')

Adding dropout layers

The last option we’ll try is to add dropout layers. A dropout layer randomly sets a fraction of a layer’s output features to zero during training (with a rate of 0.5, roughly half of the outputs are dropped in each update); at inference time, dropout is switched off.

drop_model = models.Sequential()
drop_model.add(layers.Dense(64, activation='relu', input_shape=(NB_WORDS,)))
drop_model.add(layers.Dropout(0.5))
drop_model.add(layers.Dense(64, activation='relu'))
drop_model.add(layers.Dropout(0.5))
drop_model.add(layers.Dense(3, activation='softmax'))
drop_model.name = 'Dropout layers model'
drop_history = deep_model(drop_model, X_train_rest, y_train_rest, X_valid, y_valid)
drop_min = optimal_epoch(drop_history)
eval_metric(drop_model, drop_history, 'loss')

The model with dropout layers starts overfitting later than the baseline model, and its loss also increases more slowly.

compare_models_by_metric(base_model, drop_model, base_history, drop_history, 'val_loss')

The model with the dropout layers starts overfitting later. Compared to the baseline model the loss also remains much lower.


Training on the full train data and evaluating on the test data

At first sight, the reduced model seems to be the best model for generalization. But let’s check that on the test set.


base_results = test_model(base_model, X_train_oh, y_train_oh, X_test_oh, y_test_oh, base_min)
reduced_results = test_model(reduced_model, X_train_oh, y_train_oh, X_test_oh, y_test_oh, reduced_min)
reg_results = test_model(reg_model, X_train_oh, y_train_oh, X_test_oh, y_test_oh, reg_min)
drop_results = test_model(drop_model, X_train_oh, y_train_oh, X_test_oh, y_test_oh, drop_min)

Conclusion

As shown above, all three options help to reduce overfitting. We manage to increase the accuracy on the test data substantially. Among these three options, the model with the dropout layers performs the best on the test data.


You can find the notebook on GitHub. Have fun with it!


Translated from: https://www.freecodecamp.org/news/handling-overfitting-in-deep-learning-models/
