Introduction to Deep Learning

  • These are the notes I took while learning deep learning (DL).

  • This is my first article of this kind; if there are any mistakes, corrections are very welcome, and I look forward to discussing and learning with everyone!

  • The material is Kaggle's Intro to Deep Learning. I already have some familiarity with deep learning; for now I'm working through the course once, and I plan to start on some hands-on projects soon.

L1: A Single Neuron

Learn about linear units, the building blocks of deep learning.

The Linear Unit

  • The Linear Unit as a Model
    A single neuron with one input computes y = w * x + b, where w is the weight on the input x and b is the bias. [figure]

  • Multiple Inputs
    With more inputs, each input gets its own weight and the weighted inputs are summed together with the bias: y = w0*x0 + w1*x1 + w2*x2 + b. [figure] (A small numeric sketch follows below.)
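
  • As a quick illustration (with made-up weight and bias values, not numbers from the course data), a linear unit is just a weighted sum of its inputs plus a bias:

import numpy as np

# A three-input linear unit computed by hand (hypothetical weights and bias)
w = np.array([2.5, 1.0, 4.0])   # one weight per input feature
b = 90.0                        # the bias
x = np.array([5.0, 2.0, 3.0])   # e.g. sugars, fiber, protein for one cereal

y = np.dot(w, x) + b            # the unit's output
print(y)                        # 116.5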

Linear Units in Keras

  • The easiest way to create a model in Keras is through keras.Sequential, which creates a neural network as a stack of layers. We can create models like those above using a dense layer (which we’ll learn more about in the next lesson).

  • We could define a linear model accepting three input features (‘sugars’, ‘fiber’, and ‘protein’) and producing a single output (‘calories’) like so:

from tensorflow import keras
from tensorflow.keras import layers

# Create a network with 1 linear unit
model = keras.Sequential([
    layers.Dense(units=1, input_shape=[3])  # units=1: one output; input_shape=[3]: three input features
])
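
  • An untrained model starts with randomly initialized weights. As a quick check (standard Keras usage, not part of the course text), you can look at the weight vector and bias of this linear unit:

# Inspect the (randomly initialized) weights of the linear unit
w, b = model.weights
print("Weights:", w.numpy())   # shape (3, 1): one weight per input feature
print("Bias:", b.numpy())      # shape (1,): a single bias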

L2: Deep Neural Networks

  • Add hidden layers to your network to uncover complex relationships.

Layers

  • Neural networks typically organize their neurons into layers. When we collect together linear units having a common set of inputs, we get a dense layer (also called a fully-connected layer).

The Activation Function

  • Without activation functions, neural networks can only learn linear relationships. In order to fit curves, we’ll need to use activation functions.
  • ReLU: When we attach the rectifier function, max(0, x), to a linear unit, we get a rectified linear unit, whose output is max(0, w * x + b). (See the quick sketch below.)
    [figure: the graph of the rectifier function]
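
  • A quick way to see what the rectifier does (plain TensorFlow usage, not code from the course):

import tensorflow as tf

# ReLU keeps positive values and zeroes out negative ones
x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])
print(tf.keras.activations.relu(x).numpy())   # [0. 0. 0. 1. 3.]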

Stacking Dense Layers

[figure: a stack of dense layers makes a fully-connected network]

  • The layers before the output layer are sometimes called hidden since we never see their outputs directly.
  • The final (output) layer is a linear unit (meaning, no activation function). That makes this network appropriate to a regression task, where we are trying to predict some arbitrary numeric value. Other tasks (like classification) might require an activation function on the output.

Building Sequential Models

  • The Sequential model will connect together a list of layers in order from first to last: the first layer gets the input, the last layer produces the output.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # the hidden ReLU layers
    layers.Dense(units=4, activation='relu', input_shape=[2]),
    layers.Dense(units=3, activation='relu'),
    # the linear output layer 
    layers.Dense(units=1),
])
  • Only the first layer needs the input_shape= argument; note that the 1 in the final Dense(units=1) is the number of outputs, not part of the inputs.
  • The code above can be rewritten as below if you want to insert other layers between a Dense layer and its activation function.
model = keras.Sequential([
    # the hidden ReLU layers
    layers.Dense(units=4, input_shape=[2]),
    layers.Activation('relu'),
    layers.Dense(units=3),
    layers.Activation('relu'),
    # the linear output layer 
    layers.Dense(units=1),
])

L3: Stochastic Gradient Descent

  • Use Keras and Tensorflow to train your first neural network.
  • Training the network means adjusting its weights in such a way that it can transform the features into the target.
  • In addition to the training data, we need two more things:
    1. A “loss function” that measures how good the network’s predictions are.
    2. An “optimizer” that can tell the network how to change its weights.

The Loss Function

  • The loss function measures the disparity between the target's true value and the value the model predicts.
  • Different problems call for different loss functions.
  • A common loss function for regression problems is the mean absolute error, or MAE. For each prediction y_pred, MAE measures the absolute difference abs(y_true - y_pred) from the true target y_true. (See the quick sketch after this list.)
    1. The total MAE loss on a dataset is the mean of all these absolute differences.
    2. Besides MAE, other loss functions for regression problems are the mean-squared error (MSE) and the Huber loss (both available in Keras).
      [figure: MAE is the mean distance between the fitted curve and the data points]
  • During training, the model will use the loss function as a guide for finding the correct values of its weights.
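
  • A minimal sketch of MAE computed by hand and with the Keras loss class (values are made up for illustration):

import numpy as np
from tensorflow import keras

y_true = np.array([3.0, 5.0, 2.5])   # true targets
y_pred = np.array([2.5, 5.0, 4.0])   # model predictions

# MAE by hand: the mean of the absolute differences
print(np.mean(np.abs(y_true - y_pred)))    # 0.666...

# The same loss as Keras computes it
mae = keras.losses.MeanAbsoluteError()
print(mae(y_true, y_pred).numpy())         # 0.666...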

The Optimizer: Stochastic Gradient Descent (SGD)

  • The optimizer is an algorithm that adjusts the weights to minimize the loss. Optimizers are iterative algorithms that train a network in steps. One step of training goes like this:
  1. Sample some training data and run it through the network to make predictions.
  2. Measure the loss between the predictions and the true values.
  3. Finally, adjust the weights in a direction that makes the loss smaller.
  • Then just do this over and over until the loss is as small as you like (or until it won’t decrease any further.)
  • Some terminology: each iteration's sample of training data is called a minibatch (or often just "batch"), while a complete round of the training data is called an epoch. The number of epochs you train for is how many times the network will see each training example.

Learning Rate and Batch Size

  • Each minibatch only shifts the weights a small amount; the size of these shifts is determined by the learning rate. A smaller learning rate means the network needs to see more minibatches before its weights converge to their best values.
  • The learning rate and the size of the minibatches are the two parameters that have the largest effect on how SGD training proceeds. Their interaction is often subtle, and the right choice for these parameters isn't always obvious. (A sketch of setting them explicitly follows after this list.)
    Note: Adam is an SGD algorithm that has an adaptive learning rate, which makes it suitable for most problems without any parameter tuning (it is "self tuning", in a sense). Adam is a great general-purpose optimizer.
  • !!! Important:
  1. Smaller batch sizes give noisier weight updates and loss curves. This is because each batch is a small sample of data, and smaller samples tend to give noisier estimates. Smaller batches can have an "averaging" effect, though, which can be beneficial.

  2. Smaller learning rates make the updates smaller, and training takes longer to converge. Larger learning rates can speed up training, but they don't "settle in" to a minimum as well. When the learning rate is too large, training can fail completely (the updates keep overshooting the minimum).
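
  • The compile call below just uses the string "adam", which takes Adam's default learning rate. If you want to set the learning rate and batch size yourself, the standard Keras way looks like this sketch (it reuses the model, X_train, and y_train names from the red-wine example further down):

from tensorflow import keras

# Set the learning rate explicitly by passing an optimizer instance instead of a string
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),  # or keras.optimizers.Adam(learning_rate=0.001)
    loss="mae",
)

# The batch size is chosen when calling fit
history = model.fit(
    X_train, y_train,
    batch_size=64,   # smaller batches -> noisier updates; larger batches -> smoother updates
    epochs=10,
)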

Adding the Loss and Optimizer

model.compile(
    optimizer="adam",
    loss="mae",
)

What's in a Name? (SGD)

The gradient is a vector that tells us in what direction the weights need to go. More precisely, it tells us how to change the weights to make the loss change fastest. We call our process gradient descent because it uses the gradient to descend the loss curve towards a minimum. Stochastic means “determined by chance.” Our training is stochastic because the minibatches are random samples from the dataset.
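
To make one step of SGD concrete, here is a minimal NumPy sketch of gradient descent for a single linear unit with a mean-squared-error loss (illustrative only; Keras does all of this for you inside fit):

import numpy as np

rng = np.random.default_rng(0)

# A tiny dataset: y = 3*x + 2 plus a little noise
X = rng.uniform(-1, 1, size=32)
y = 3 * X + 2 + 0.1 * rng.normal(size=32)

w, b = 0.0, 0.0        # initial weight and bias
learning_rate = 0.1

for step in range(100):
    # 1. sample a minibatch and run it through the "network" to make predictions
    idx = rng.choice(32, size=8, replace=False)
    xb, yb = X[idx], y[idx]
    preds = w * xb + b
    # 2. measure the loss between the predictions and the true values (MSE here)
    error = preds - yb
    # 3. adjust the weights in the direction that makes the loss smaller
    grad_w = 2 * np.mean(error * xb)   # d(loss)/dw
    grad_b = 2 * np.mean(error)        # d(loss)/db
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # should end up close to 3 and 2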

  • Example:
# Data: the Red Wine Quality dataset from Kaggle
# Data preparation
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/red-wine.csv')

# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

# Scale to [0, 1]
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# Split features and target
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']
# Build the model
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])
# Add the optimizer and loss function
model.compile(
    optimizer='adam',
    loss='mae',
)
# Start training and save the training history (the loss values)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=10,
)
# Convert to a DataFrame and plot to make the result clearer
import pandas as pd

# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# use Pandas native plot method
history_df['loss'].plot();

[figure: the training loss curve over 10 epochs]

L4: Overfitting and Underfitting

  • Improve performance with extra capacity or early stopping.

Learning Curves

  • The information in the training data comes in two kinds: signal and noise. The signal is the part that generalizes, the part that can help our model make predictions from new data. The noise is the part that is only true of the training data: all of the random fluctuation that comes from data in the real world, and all of the incidental, non-informative patterns that can't actually help the model make predictions. The noise is the part that might look useful but really isn't. We can see this by plotting the learning curves, as in the sketch after this list.
    [figure: the training and validation loss plotted against epochs (learning curves)]
  • The training loss will go down either when the model learns signal or when it learns noise. But the validation loss will go down only when the model learns signal. (Whatever noise the model learned from the training set won’t generalize to new data.) So, when a model learns signal both curves go down, but when it learns noise a gap is created in the curves. The size of the gap tells you how much noise the model has learned.
  • Underfitting the training set is when the loss is not as low as it could be because the model hasn't learned enough signal.
  • Overfitting the training set is when the loss is not as low as it could be because the model learned too much noise.
  • The trick to training deep learning models is finding the best balance between the two.
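
  • A quick way to plot both curves at once, reusing the history from the red-wine example above (standard pandas plotting):

import pandas as pd

history_df = pd.DataFrame(history.history)
# Plot training and validation loss together; a widening gap means the model is learning noise
history_df.loc[:, ['loss', 'val_loss']].plot();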

Capacity

  • A model’s capacity refers to the size and complexity of the patterns it is able to learn.
  • You can increase the capacity of a network either by making it wider (more units in existing layers) or by making it deeper (adding more layers). Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones. Which is better just depends on the dataset. (See the sketch below.)
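
  • For example (a sketch with arbitrary unit counts), a wider and a deeper variant of the same baseline regression model might look like this:

from tensorflow import keras
from tensorflow.keras import layers

wider = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=[11]),  # more units in the hidden layer
    layers.Dense(1),
])

deeper = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=[11]),
    layers.Dense(16, activation='relu'),                    # an extra hidden layer
    layers.Dense(1),
])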

Early Stopping

  • We can simply stop the training whenever it seems the validation loss isn’t decreasing anymore. Interrupting the training this way is called early stopping.
    [figure: early stopping keeps the weights from the epoch where the validation loss was at its minimum]
  • Once we detect that the validation loss is starting to rise again, we can reset the weights back to where the minimum occurred.
  • Training with early stopping also means we’re in less danger of stopping the training too early, before the network has finished learning signal. So besides preventing overfitting from training too long, early stopping can also prevent underfitting from not training long enough. Just set your training epochs to some large number (more than you’ll need), and early stopping will take care of the rest.

Adding Early Stopping

  • Think about how the ‘patience’ and ‘min_delta’ arguments interact: training stops only after patience epochs pass without an improvement of at least min_delta in the validation loss.
# "If there hasn't been at least an improvement of 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found."
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)

# pass this callback to the fit method
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,   # choose a large number of epochs when using early stopping, more than you'll need.
    callbacks=[early_stopping], # put your callbacks in a list
    verbose=0,  # turn off training log
)
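
  • After fitting with early stopping, you can check how long training actually ran and what the best validation loss was (standard pandas/Keras usage):

import pandas as pd

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
print("Stopped after {} epochs".format(len(history_df)))
print("Minimum validation loss: {}".format(history_df['val_loss'].min()))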

L5: Dropout and Batch Normalization

Dropout

  • Background: In the last lesson we talked about how overfitting is caused by the network learning spurious patterns in the training data. To recognize these spurious patterns, a network will often rely on very specific combinations of weights, a kind of "conspiracy" of weights; being so specific, these conspiracies tend to be fragile.
  • This is the idea behind dropout. To break up these conspiracies, we randomly drop out some fraction of a layer's input units every step of training, making it much harder for the network to learn those spurious patterns in the training data. Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust.
  • Another way to think about it: dropout creates a kind of ensemble, much as a random forest is an ensemble of decision trees.
    The predictions will no longer be made by one big network, but instead by a committee of smaller networks. Individuals in the committee tend to make different kinds of mistakes, but be right at the same time, making the committee as a whole better than any individual.

Adding Dropout

  • rate defines what percentage of the input units to shut off. Put the Dropout layer just before the layer you want the dropout applied to:
keras.Sequential([
    # ...
    layers.Dropout(rate=0.3), # apply 30% dropout to the next layer
    layers.Dense(16),
    # ...
])

Batch Normalization

  • Just as it's good to normalize the data before it goes into the network, normalizing inside the network can help as well. It can help correct training that is slow or unstable.
  • A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then also putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.
  • Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also help prediction performance). Models with batchnorm tend to need fewer epochs to complete training. Moreover, batchnorm can also fix various problems that can cause the training to get “stuck”. Consider adding batch normalization to your models, especially if you’re having trouble during training.

Adding Batch Normalization

layers.Dense(16, activation='relu'),
layers.BatchNormalization(),
#  or between a layer and its activation function:
layers.Dense(16),
layers.BatchNormalization(),
layers.Activation('relu'),
# And if you add it as the first layer of your network it can act as a kind of adaptive preprocessor, standing in for something like Sci-Kit Learn's StandardScaler.
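
  • Following that last comment, batch normalization as the very first layer would look something like this sketch (input_shape=[11] matches the red-wine example above):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.BatchNormalization(input_shape=[11]),  # acts as an adaptive preprocessor for the raw features
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])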

Example

  • Add dropout to control overfitting and batch normalization to speed up optimization (batchnorm can also stand in for normalizing the inputs).
from tensorflow import keras
from tensorflow.keras import layers
# When adding dropout, you may need to increase the number of units in your Dense layers.
model = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1),
])

L6: Binary Classification

Binary Classification

  • In your raw data, the classes might be represented by strings like “Yes” and “No”, or “Dog” and “Cat”. Before using this data we’ll assign a class label: one class will be 0 and the other will be 1.

Accuracy and Cross-Entropy

  • Accuracy is one of the many metrics in use for measuring success on a classification problem. Accuracy is the ratio of correct predictions to total predictions: accuracy = number_correct / total.
    The problem with accuracy (and most other classification metrics) is that it can’t be used as a loss function. SGD needs a loss function that changes smoothly, but accuracy, being a ratio of counts, changes in “jumps”.
  • Cross-Entropy: a sort of measure for the distance from one probability distribution to another. For classification, what we want instead is a distance between probabilities, and this is what cross-entropy provides. (See the quick sketch after this list.)
    [figure: cross-entropy penalizes incorrect probability predictions]
    The idea is that we want our network to predict the correct class with probability 1.0. The further away the predicted probability is from 1.0, the greater the cross-entropy loss will be.
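
  • A minimal sketch of binary cross-entropy computed by hand and with the Keras loss class (values are made up for illustration):

import numpy as np
from tensorflow import keras

y_true = np.array([1.0, 1.0, 0.0])   # class labels
y_pred = np.array([0.9, 0.6, 0.2])   # predicted probabilities of class 1

# Binary cross-entropy by hand: -[y*log(p) + (1-y)*log(1-p)], averaged over examples
print(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))   # ~0.28

# The same loss as Keras computes it
bce = keras.losses.BinaryCrossentropy()
print(bce(y_true, y_pred).numpy())                                             # ~0.28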

Making Probabilities with the Sigmoid Function

  • The cross-entropy and accuracy functions both require probabilities as inputs, meaning numbers from 0 to 1. To convert the real-valued outputs produced by a dense layer into probabilities, we attach a new kind of activation function, the sigmoid activation. (A small sketch follows after this list.)
    [figure: the sigmoid function maps real numbers into the interval (0, 1)]
    To get the final class prediction, we define a threshold probability. Typically this will be 0.5, so that rounding will give us the correct class: below 0.5 means the class with label 0, and 0.5 or above means the class with label 1. A 0.5 threshold is what Keras uses by default with its accuracy metric.
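
  • For instance (a sketch assuming a trained model like the one in the example below, so the names are illustrative), turning sigmoid outputs into hard class labels is just a thresholding step:

# model.predict returns probabilities because the final layer uses a sigmoid activation
probs = model.predict(X_valid)          # shape (n_samples, 1), values between 0 and 1
classes = (probs >= 0.5).astype(int)    # the same 0.5 threshold Keras uses for its accuracy metric
print(probs[:5], classes[:5])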

Example

from tensorflow import keras
from tensorflow.keras import layers
# In the final layer include a 'sigmoid' activation so that the model will produce class probabilities.
model = keras.Sequential([
    layers.Dense(4, activation='relu', input_shape=[33]),
    layers.Dense(4, activation='relu'),    
    layers.Dense(1, activation='sigmoid'),
])
# For two-class problems, be sure to use 'binary' versions. 
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)

early_stopping = keras.callbacks.EarlyStopping(
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)
# The model in this particular problem can take quite a few epochs to complete training, so we'll include an early stopping callback for convenience.
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0, # hide the output because we have so many epochs
)