Neural Networks and Deep Learning (Based on Dive into Deep Learning, PyTorch Edition) (Part 1)

Latest recommended article published on 2024-07-21 11:19:08

loongxl


Article tags: deep learning, neural networks, pytorch

Article link: https://blog.csdn.net/weixin_58500689/article/details/137179419


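[Introduction]

This article covers neural networks and deep learning. The main textbook is the PyTorch edition of Dive into Deep Learning (动手学深度学习) by Mu Li et al., supplemented with additional material. The textbook website is https://zh.d2l.ai/, where the accompanying code can be downloaded; for downloading and installing the runtime environment, see https://www.zhihu.com/zvideo/1363284223420436480.

1. Overview

When deep learning is mentioned, we often forget that it is only a small part of machine learning and instead regard it as a separate module independent of machine learning. This is because machine learning, although a discipline with a longer history, had a narrow range of real-world applications before deep learning emerged. In fields such as speech recognition, computer vision and natural language processing, which demand extensive domain knowledge and face extremely complex real-world conditions, machine learning was often only a small part of the overall solution. In the past few years, however, the emergence and application of deep learning has surprised the world: it has driven rapid progress in computer vision, natural language processing, automatic speech recognition, reinforcement learning and statistical modeling, has gradually come to lead the field, and has set off a wave of artificial intelligence revolution worldwide.

The Dive into Deep Learning course contains both a small amount of machine learning fundamentals, such as linear neural networks and multilayer perceptrons, and the deep learning models used in today's frontier applications, including LeNet, ResNet, LSTM, BERT and others. Each chapter also comes with PyTorch code and textbook material, which helps us master the basic models and frontier knowledge of deep learning in a short time and improve our practical skills.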

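2. Linear regression

2.1 Linear regression

Regression is a class of methods for modeling the relationship between one or more independent variables and a dependent variable. In the natural and social sciences, regression is often used to describe the relationship between inputs and outputs.

The elements of linear regression are: the training set, i.e. our input data, denoted x; the output data, denoted y; the fitted function, also called the hypothesis or the model, usually written $y = h(x)$; the number of training examples (#training set), where one training example is a pair of input and output data; and the dimension $n$ of the input data (the number of features, #features).

2.2 Linear classification

A typical linear regression model requires attributes to be continuous-valued, so discrete attributes must first be converted to continuous values.

Consider the case where each sample has a single attribute x and the training set has m samples.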

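Hypothesis function (model function), used for prediction: $f(x) = \theta_0 + \theta_1 x$

To write f(x) in vectorized form (convenient for programming), prepend a constant 1 to the variable x to form the 2-dimensional vector $X = [1, x]^T$; the parameter $\theta$ is the 2-dimensional vector $[\theta_0, \theta_1]^T$ (all vectors here are column vectors). Then

$f(x) = \theta^T X$

Cost function: the mean squared error, also called the squared loss, taken over the m training samples and denoted $J(\theta)$.

Objective: $\min_{\theta} J(\theta)$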
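
The vectorized hypothesis and the squared-loss cost can be sketched in a few lines of plain Python. The names `predict` and `mse_cost`, the toy data, and the $1/(2m)$ scaling of the cost are assumptions made for illustration; the text above only says that the cost is the mean squared error.

```python
def predict(theta, x):
    """Vectorized hypothesis f(x) = theta^T X with X = [1, x]^T."""
    X = [1.0, x]  # prepend the constant 1 to the single attribute x
    return sum(t * xi for t, xi in zip(theta, X))

def mse_cost(theta, xs, ys):
    """Squared-loss cost J(theta); the 1/(2m) scaling is an assumed convention."""
    m = len(xs)
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

cost_exact = mse_cost([1.0, 2.0], xs, ys)  # the true parameters fit perfectly
cost_off = mse_cost([0.0, 2.0], xs, ys)    # every residual is -1
```

With the true parameters the cost is zero; shifting the intercept by 1 makes every residual equal to -1, so the cost becomes 4 / (2 * 4) = 0.5.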

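2.3 Gradient descent

When the model has no closed-form solution, gradient descent is used to approach the optimum.
The steps of gradient descent:
- Pick an initial value for the parameters.
- Update the parameters repeatedly with the iteration formula $\theta \leftarrow \theta - \eta \nabla J(\theta)$, where the negative gradient $-\nabla J(\theta)$ is the direction in which the function value decreases fastest and the learning rate $\eta$ is the step size.
Choosing the learning rate
- The learning rate is the step size: it determines how far we move along the negative gradient direction. It is a hyperparameter (a value set by hand, not learned during training).
- The learning rate must be neither too large nor too small: too large a value can overshoot the minimum or diverge, and too small a value makes convergence slow.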
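
The steps above can be sketched for the single-attribute linear model of the previous section. This is a minimal illustration and not the textbook's code: the function name `gradient_descent`, the toy data, the learning rate 0.1 and the step count are all assumptions.

```python
def gradient_descent(xs, ys, lr=0.1, num_steps=1000):
    """Fit f(x) = theta0 + theta1 * x by gradient descent on the squared loss."""
    theta0, theta1 = 0.0, 0.0  # step 1: pick an initial value
    m = len(xs)
    for _ in range(num_steps):  # step 2: iterate the update rule
        residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(residuals) / m                             # dJ/dtheta0
        grad1 = sum(r * x for r, x in zip(residuals, xs)) / m  # dJ/dtheta1
        theta0 -= lr * grad0  # move along the negative gradient
        theta1 -= lr * grad1
    return theta0, theta1

# Toy data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this toy data the iterates approach the generating parameters (1, 2). The gradient here corresponds to a cost scaled by $1/(2m)$, which is an assumed convention.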

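2.4 Multi-class regression

Softmax regression, also called multinomial or multi-class logistic regression, is the generalization of logistic regression to multi-class problems. Like linear regression, softmax regression forms a linear combination of the input features and the weights. The main difference from linear regression is that the number of outputs of softmax regression equals the number of classes in the labels. Finally, the softmax function is applied to these outputs.

The figure depicts the computation above as a neural network diagram. Like linear regression, softmax regression is a single-layer neural network. Because computing each output depends on all the inputs $x_1, x_2, x_3, x_4$, the output layer of softmax regression is also a fully connected layer.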
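
As a small worked example of the final softmax step, the sketch below normalizes one row of output values in plain Python. Subtracting the maximum before exponentiating is an addition made here for numerical safety and is not part of the textbook implementation; the function operates on a single list rather than a tensor.

```python
import math

def softmax(logits):
    """Turn a list of real-valued outputs into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max does not change the result but avoids overflow
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
# The same differences between outputs give the same probabilities,
# even though math.exp(1000.0) on its own would overflow.
big = softmax([1000.0, 1001.0, 1002.0])
```

The largest output receives the largest probability, and only the differences between outputs matter.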

To verify and study softmax regression, we run the code from Dive into Deep Learning:

import torch
import torchvision
import time
import matplotlib.pyplot as plt
import numpy as np 
from torch.utils import data
from torchvision import transforms
from d2l import torch as d2l
from IPython import display

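# Initialize parameters
batch_size=256
train_iter,test_iter=d2l.load_data_fashion_mnist(batch_size)
num_inputs=784
# Each image has 28*28 pixels; this section treats it as a vector of length 784
num_outputs=10
# Softmax regression has as many outputs as classes; the dataset has 10 classes
W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)
b = torch.zeros(num_outputs, requires_grad=True)

# Define the softmax operation
def softmax(X):
    X_exp=torch.exp(X)
    partition=X_exp.sum(1,keepdim=True)
    return X_exp/partition  # each row of the result sums to 1

# Define the softmax regression model
# Before passing data to the model, use reshape to flatten each original image into a vector
def net(X):
    return softmax(torch.matmul(X.reshape((-1,W.shape[0])),W)+b)

y = torch.tensor([0, 2])
# Given y, the first class is correct for the first sample and the third class for the second sample
y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
# Predicted probabilities of 2 samples over 3 classes
#print(y_hat[[0, 1], y])
# Use y as indices into the probabilities in y_hat:
# this selects the first-class probability of the first sample and the third-class probability
# of the second sample, i.e. [y_hat[0, y[0]], y_hat[1, y[1]]] = [0.1, 0.5]

# Define the cross-entropy loss function
def cross_entropy(y_hat,y):
    return -torch.log(y_hat[range(len(y_hat)),y])

# Define a class for accumulating sums over multiple variables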
class Accumulator:#@save
    def __init__(self,n):
        self.data=[0.0]*n
    
    def add(self,*args):
        self.data=[a+float(b) for a,b in zip(self.data,args)]

    def reset(self):
        self.data=[0.0]*len(self.data)
        
    def __getitem__(self,idx):
        return self.data[idx]
    
# Compute classification accuracy
def accuracy(y_hat,y):#@save
    """Compute the number of correct predictions."""
    if len(y_hat.shape)>1 and y_hat.shape[1]>1:  # if `y_hat` is a matrix, assume the second dimension stores the prediction scores of each class
        y_hat=y_hat.argmax(axis=1)
        # use `argmax` to take the index of the largest element in each row as the predicted class
    cmp=y_hat.type(y.dtype)==y
    # the equality operator `==` is sensitive to data types, so `y_hat` is cast to the dtype of `y`
    return float(cmp.type(y.dtype).sum())

def evaluate_accuracy(net,data_iter):#@save
    """Compute the accuracy of a model on the given dataset."""
    if isinstance(net,torch.nn.Module):
        net.eval()  # set the model to evaluation mode
    metric=Accumulator(2)  # two running sums: number of correct predictions, total number of predictions
    with torch.no_grad():  # disable gradient computation
        for X,y in data_iter:
            metric.add(accuracy(net(X),y),y.numel())  # y.numel() returns the total number of elements in y
    return metric[0]/metric[1]

#if __name__ == '__main__':
    #print(evaluate_accuracy(net,test_iter))
    # Because net is initialized with random weights, its accuracy is close to random guessing, i.e. about 0.1 with 10 classes

# Define a class for plotting data in an animation
class Animator:
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
                ylim=None, xscale='linear', yscale='linear',
                fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
                figsize=(3.5, 2.5)):
        # Incrementally plot multiple lines
        if legend is None:  # no legend labels were given
            legend = []
        d2l.use_svg_display()
        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
        # plt.subplots() creates a figure (self.fig) and one or more subplots (self.axes)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        # Use a lambda to capture the arguments so the axes can be reconfigured after every redraw
        self.config_axes = lambda: d2l.set_axes(self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):#Adds data points to the plot.
        if not hasattr(y, "__len__"):#If y is not iterable (doesn't have a length), it is converted to a list to ensure it can be processed as a collection of values.
            y = [y]
        n = len(y)
        if not hasattr(x, "__len__"):#If x is not iterable, it is repeated n times to match the length of y.
            x = [x] * n
        if not self.X:#If "self.X" is not initialized, it is initialized as a list of empty lists, with one list for each element in y.
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()# clears the current axis to prepare for the new data
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        #configures the axis using specified parameters.
        display.display(self.fig)
        display.clear_output(wait=True)
        
# Training
def train_epoch_ch3(net, train_iter, loss, updater):  #@save
    """Train the model for one epoch."""
    # updater is a general function for updating the model parameters; it is defined later
    # Set the model to training mode
    if isinstance(net, torch.nn.Module):
        net.train()
    # Sum of training loss, sum of training accuracy, number of samples
    metric = Accumulator(3)
    for X, y in train_iter:
        # Compute gradients and update the parameters
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(updater, torch.optim.Optimizer):
            # Using a built-in PyTorch optimizer and loss function
            updater.zero_grad()  # clear previously computed gradients
            l.mean().backward()
            updater.step()
        else:
            # Using a custom optimizer and loss function
            l.sum().backward()
            updater(X.shape[0])
        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
    # Return the training loss and training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]

def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
    """Train the model."""
    # Create an instance for animated plotting
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
                        legend=['train loss', 'train acc', 'test acc'])

    for epoch in range(num_epochs):
        # Train for one epoch and get the training loss and accuracy
        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
        # Evaluate the model's accuracy on the test set
        test_acc = evaluate_accuracy(net, test_iter)
        # Add the training loss, training accuracy and test accuracy to the animation
        animator.add(epoch + 1, train_metrics + (test_acc,))

    # Get the training loss and training accuracy of the last epoch
    train_loss, train_acc = train_metrics
    # Sanity-check the training loss, training accuracy and test accuracy
    assert train_loss < 0.5, train_loss#If the condition is False, it raises an AssertionError exception.
    assert train_acc <= 1 and train_acc > 0.7, train_acc
    assert test_acc <= 1 and test_acc > 0.7, test_acc

def updater(batch_size):
    return d2l.sgd([W, b], lr, batch_size)

if __name__ == '__main__':
    lr = 0.1
    num_epochs = 10
    train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)

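# Prediction
# Given a series of images, we compare their actual labels (first line of the text output) with the model's predictions (second line of the text output)
def predict_ch3(net, test_iter, n=6):
    """Predict labels."""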
    # Iterate over the test dataset to get a batch of images and their true labels
    for X, y in test_iter:
        break
    
    # Get the true labels in text format
    trues = d2l.get_fashion_mnist_labels(y)
    # Use the trained model to make predictions on the test batch and convert predictions to text labels
    preds = d2l.get_fashion_mnist_labels(net(X).argmax(axis=1)) 
    # Create titles for the images by combining true labels and predicted labels
    titles = [true + '\n' + pred for true, pred in zip(trues, preds)]
    # Display a subset (n) of the images along with their true and predicted labels
    d2l.show_images(X[0:n].reshape((n, 28, 28)), 1, n, titles=titles[0:n])

    
if __name__ == '__main__':
    predict_ch3(net, test_iter)
    
plt.show()  # display the line plot and the prediction images together

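2.5 The neuron model

The most basic component of a neural network is the neuron model. The most widely used one today is the M-P neuron model, first proposed in 1943 by the psychologist McCulloch and the mathematician W. Pitts. As shown in the figure, each neuron is a multi-input, single-output information-processing unit: input signals arrive through weighted connections, their weighted sum is compared with a threshold to give the total input, and an activation function then produces a single output.

Whether a neuron fires depends on a threshold level: only when the sum of its inputs exceeds the threshold $\theta$ is the neuron activated and emits a pulse; otherwise it produces no output signal. The whole process can be expressed by the following function:

$y = f\left(\sum_{i=1}^{n} w_i x_i - \theta\right)$

Activation function: $f$, a step (threshold) function in the original M-P model.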
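
A minimal sketch of such a threshold neuron in plain Python follows. The function name `mp_neuron` and the particular weights and thresholds are illustrative assumptions; they are chosen so that the same unit behaves as a logical AND or OR gate.

```python
def mp_neuron(inputs, weights, threshold):
    """Fire (output 1) only when the weighted input sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return 1 if total > 0 else 0

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
# With unit weights, a threshold of 1.5 needs both inputs active (AND),
# while a threshold of 0.5 needs only one (OR).
and_outputs = [mp_neuron([a, b], [1, 1], 1.5) for a, b in pairs]
or_outputs = [mp_neuron([a, b], [1, 1], 0.5) for a, b in pairs]
```

Changing only the threshold switches the unit between the two gates, which shows how the comparison with the threshold shapes the output.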

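3. BP networks

3.1 Multilayer perceptrons

The simplest deep network is called a multilayer perceptron. It consists of multiple layers of neurons; each layer is connected to the layer before it, from which it receives input, and to the layer after it, whose neurons it in turn influences.

Compared with the single-layer perceptron, the multilayer perceptron makes the following improvements:

1. It introduces hidden layers. A hidden layer usually refers to a structure of N neurons between the input layer and the output layer. Adjacent layers are fully connected, and there are no connections that skip layers.

2. It introduces new nonlinear activation functions.

3. It uses the back propagation algorithm.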
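
To see why the nonlinear activation in point 2 matters, the sketch below compares two stacked linear layers with and without a ReLU in between. The weight values and helper names are illustrative assumptions, and biases are omitted for brevity.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, x) for x in v]

W1 = [[1.0, -1.0], [-1.0, 1.0]]  # hidden-layer weights (illustrative values)
W2 = [[1.0, 1.0]]                # output-layer weights (illustrative values)
x = [2.0, 0.5]

stacked_linear = matvec(W2, matvec(W1, x))   # two linear layers, no activation
collapsed = matvec(matmul(W2, W1), x)        # one linear layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # ReLU between the two layers
```

Without an activation the two layers are equivalent to the single matrix `W2 @ W1`, which here maps every input to 0. With the ReLU the same weights compute $|x_1 - x_2|$, a function that no single linear layer can represent.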

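3.2 A brief description of the BP algorithm

The full name of the BP neural network is Back-Propagation Network. It is a multilayer feedforward network trained with the error back-propagation algorithm. Its structure consists of an input layer, hidden layers and an output layer; it is structurally simple and highly adaptable. The nodes of the input layer act only as buffers that pass the network's input data to the first hidden layer, so they apply no transfer function. Structurally the BP neural network is a feedforward network, as shown below:

Adjacent layers of a BP neural network are fully connected, while neurons within the same layer are not connected to each other. When a training sample pair is presented to the network, neuron activations propagate from the input layer through the hidden layers to the output layer. Then, in the direction that reduces the error between the target output and the actual output, the connection weights are corrected step by step from the output layer back through the intermediate layers to the input layer. Unlike the perceptron, error back-propagation differentiates the transfer function, so the transfer function of a BP neural network must be differentiable. The hard-threshold (step) transfer function of the perceptron therefore cannot be used. Commonly used transfer functions are the Sigmoid function and piecewise-linear functions such as ReLU, which is differentiable everywhere except at 0.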
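
The forward pass and the backward error propagation can be sketched for the smallest possible network: one input, one Sigmoid hidden unit and one linear output with a squared loss. The network shape, parameter values and function names are assumptions chosen for illustration. The analytic gradients from back propagation are checked against finite differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: input -> Sigmoid hidden unit -> linear output."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    return h, w2 * h + b2

def loss_fn(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y):
    """Propagate the output error back to every parameter with the chain rule."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    delta_out = y_hat - y                           # dL/dy_hat
    grad_w2, grad_b2 = delta_out * h, delta_out
    delta_hidden = delta_out * w2 * h * (1.0 - h)   # uses sigmoid'(z) = h * (1 - h)
    grad_w1, grad_b1 = delta_hidden * x, delta_hidden
    return [grad_w1, grad_b1, grad_w2, grad_b2]

def numerical_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss_fn(plus, x, y) - loss_fn(minus, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]  # w1, b1, w2, b2 (illustrative values)
analytic = backward(params, x=1.5, y=1.0)
numeric = numerical_grad(params, x=1.5, y=1.0)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The factor `h * (1 - h)` is the derivative of the Sigmoid, which is why the transfer function has to be differentiable for this procedure to work.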

Program code

import torch
from torch import nn

from d2l import torch as d2l

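batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
# 784 input features (28*28 pixels), 10 classes, one hidden layer with 256 units
num_inputs, num_outputs, num_hiddens = 784, 10, 256
# Hidden-layer and output-layer parameters: small random weights, zero biases
W1 = nn.Parameter(torch.randn(
     num_inputs, num_hiddens, requires_grad=True) * 0.01)
b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
W2 = nn.Parameter(torch.randn(
     num_hiddens, num_outputs, requires_grad=True) * 0.01)
b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))
params = [W1, b1, W2, b2]

def relu(X):
    """ReLU activation: element-wise max(X, 0)."""
    a = torch.zeros_like(X)
    return torch.max(X, a)

def net(X):
    """MLP with one hidden layer; returns unnormalized class scores."""
    X = X.reshape((-1, num_inputs))  # flatten each image into a vector
    H = relu(X @ W1 + b1)            # hidden layer
    return (H @ W2 + b2)             # output layer

# nn.CrossEntropyLoss applies log-softmax internally; reduction='none' keeps one loss per sample
loss = nn.CrossEntropyLoss(reduction='none')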


num_epochs, lr = 10, 0.1
updater = torch.optim.SGD(params, lr=lr)
# Train and predict with the d2l versions of the helper functions
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
d2l.predict_ch3(net, test_iter)

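4. Performance optimization

When comparing training and validation errors, two common situations deserve attention. In the first, both the training error and the validation error are large, but the gap between them is small. If the model cannot reduce the training error, it may be too simple (insufficiently expressive) to capture the pattern we are trying to learn. Moreover, because the generalization gap between training and validation error is small, we have reason to believe that a more complex model could reduce the training error. This phenomenon is called underfitting.

On the other hand, we should be wary when the training error is significantly lower than the validation error, which indicates severe overfitting.

4.1 Overfitting and underfitting

Overfitting:
The model chosen during learning has so many parameters that it predicts known data well but unknown data poorly. In this case the model may merely have memorized the training data rather than learned the features of the data.
Underfitting:
The model's descriptive capacity is too weak to learn the regularities in the data well. Underfitting is usually caused by a model that is too simple.

The fundamental task of machine learning is to solve the problems of optimization and generalization.
Optimization: adjusting the model to obtain the best performance on the training data.
Generalization: how well the trained model performs on previously unseen data (the test set).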

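4.2 Adaptive gradient methods

Gradient descent can suffer from excessive computation or from getting stuck in local optima. Momentum and adaptive gradient methods have been proposed to address these problems.

1. Problems with gradient descent
Batch gradient descent (BGD)
Every step of gradient descent uses all the training samples: computing the derivative term requires a sum over the whole training set, so each step is expensive when the training set is large.

Stochastic gradient descent (SGD)
Because stochastic gradient descent uses one sample per iteration, the update direction varies widely from step to step, so it cannot converge quickly to a local optimum.

Mini-batch gradient descent (MBGD)
As with SGD, the gradient direction of each step is uncertain, and the method may get stuck in a local optimum.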
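
The three variants differ only in how many samples enter each gradient estimate, which a single function with a `batch_size` argument can show. This is a sketch under assumptions: the function name `train`, the noise-free toy data, the learning rate and the epoch count are all chosen for illustration.

```python
import random

def train(xs, ys, batch_size, lr=0.05, num_epochs=2000, seed=0):
    """Fit f(x) = theta0 + theta1 * x; batch_size selects BGD, SGD or MBGD."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            residuals = [theta0 + theta1 * xs[i] - ys[i] for i in batch]
            # The gradient is averaged over the current batch only
            grad0 = sum(residuals) / len(batch)
            grad1 = sum(r * xs[i] for r, i in zip(residuals, batch)) / len(batch)
            theta0 -= lr * grad0
            theta1 -= lr * grad1
    return theta0, theta1

# Noise-free toy data generated from y = 1 + 2x
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [1.0 + 2.0 * x for x in xs]

bgd = train(xs, ys, batch_size=len(xs))  # all samples per update
sgd = train(xs, ys, batch_size=1)        # one sample per update
mbgd = train(xs, ys, batch_size=4)       # a small batch per update
```

Because the toy data contain no noise, all three settings reach the same parameters. On real data, smaller batches make each update cheaper but noisier, which is the trade-off described above.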

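4.3 Momentum

With plain gradient descent the gradient becomes small near the optimum; because the learning rate is fixed, convergence slows down, and the method can sometimes get stuck in a local optimum.
Goal of the improvement: fix these problems of gradient descent, i.e. reduce oscillation and speed up the descent toward the bottom of the valley.
Idea of the improvement: update the parameters using accumulated historical gradient information.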
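
A minimal sketch of this idea compares plain gradient descent with momentum on the function $f(x_1, x_2) = 0.1 x_1^2 + 2 x_2^2$. The update form `v = beta * v + g`, the starting point and the values of `eta` and `beta` are assumptions for illustration.

```python
def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def gd_step(x1, x2, eta):
    """Plain gradient descent: step along the current gradient only."""
    return x1 - eta * 0.2 * x1, x2 - eta * 4 * x2

def momentum_step(x1, x2, v1, v2, eta, beta):
    """Momentum: the velocity v accumulates past gradients with decay beta."""
    v1 = beta * v1 + 0.2 * x1
    v2 = beta * v2 + 4 * x2
    return x1 - eta * v1, x2 - eta * v2, v1, v2

eta, beta, num_steps = 0.6, 0.5, 50
plain_x1, plain_x2 = -5.0, -2.0
mom_x1, mom_x2, v1, v2 = -5.0, -2.0, 0.0, 0.0
for _ in range(num_steps):
    plain_x1, plain_x2 = gd_step(plain_x1, plain_x2, eta)
    mom_x1, mom_x2, v1, v2 = momentum_step(mom_x1, mom_x2, v1, v2, eta, beta)
```

With this learning rate plain gradient descent multiplies `x2` by 1 - 0.6 * 4 = -1.4 at every step and diverges along the steep direction, while the momentum iterates oscillate with shrinking amplitude and approach the minimum at the origin.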

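4.4 Adaptive gradients

Adaptive gradient methods reduce oscillation and speed up the descent toward the valley floor by decreasing the step size in oscillating directions and increasing it in flat directions. But how do we tell the two apart? Directions in which the squared gradient magnitude is large are oscillating directions; directions in which it is small are flat directions.

4.4.1 The AdaGrad method

⚫ Parameter-wise adaptation: parameters with large partial derivatives see their learning rate decrease quickly, while parameters with small partial derivatives see a relatively small decrease.

⚫ Specifically, each parameter's learning rate is scaled inversely proportional to the square root of the sum of its historical squared gradients.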

%matplotlib inline
import math
import torch
from d2l import torch as d2l

def init_adagrad_states(feature_dim):
    # One accumulator of squared gradients per parameter (weights and bias)
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6  # small constant for numerical stability
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] += torch.square(p.grad)  # accumulate squared gradients
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adagrad, init_adagrad_states(feature_dim),
               {'lr': 0.1}, data_iter, feature_dim);
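
To watch the per-coordinate scaling without the d2l training helpers, the sketch below applies the same AdaGrad update to the two-variable function $0.1 x_1^2 + 2 x_2^2$ in plain Python. The starting point, the learning rate `eta=2.0` and the step count are assumptions for illustration.

```python
import math

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def adagrad_2d(x1, x2, s1, s2, eta, eps=1e-6):
    """One AdaGrad step; s1 and s2 hold the running sums of squared gradients."""
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    # Each coordinate is divided by the square root of its own accumulator
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0
for _ in range(20):
    x1, x2, s1, s2 = adagrad_2d(x1, x2, s1, s2, eta=2.0)
```

The steep coordinate `x2` starts with a large gradient, so its accumulator is large and its effective step is scaled down strongly. The flat coordinate `x1` keeps a comparatively larger effective learning rate.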

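4.4.2 The RMSProp method

⚫ RMSProp addresses the excessive learning-rate decay of AdaGrad.

⚫ RMSProp uses an exponentially decaying average to discard the distant history, which lets it converge quickly; in addition, RMSProp introduces a hyperparameter $\rho$ that controls the decay rate.

⚫ Specifically (compared with the description of AdaGrad), the accumulator $r$ of squared gradients is changed to

$r \leftarrow \rho r + (1 - \rho)\, g \odot g$

where $g$ is the gradient. In the code below, $\rho$ is named gamma and $r$ is named s.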

import math
import torch
from d2l import torch as d2l

def rmsprop_2d(x1, x2, s1, s2):
    # Gradients of f_2d and the numerical-stability constant
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    # Exponentially decaying average of squared gradients
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta, gamma = 0.4, 0.9
# Trace the trajectory of rmsprop_2d on f_2d
d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))

def init_rmsprop_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(rmsprop, init_rmsprop_states(feature_dim),
               {'lr': 0.01, 'gamma': 0.9}, data_iter, feature_dim);
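
The same two-variable update can also be run without d2l to see the effect of the decaying average. This sketch passes `eta` and `gamma` as arguments instead of reading globals; the starting point and the 20 steps are assumptions for illustration.

```python
import math

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def rmsprop_2d(x1, x2, s1, s2, eta=0.4, gamma=0.9, eps=1e-6):
    """One RMSProp step on f_2d with explicit hyperparameters."""
    g1, g2 = 0.2 * x1, 4 * x2
    # Old squared gradients are forgotten at rate gamma instead of summed forever
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0
for _ in range(20):
    x1, x2, s1, s2 = rmsprop_2d(x1, x2, s1, s2)
```

Because the accumulator decays, the effective learning rate does not shrink toward zero as it does in AdaGrad, so a moderate `eta` is enough to bring both coordinates close to the minimum in 20 steps.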

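4.4.3 The Adam algorithm

⚫ Adam goes one step further than RMSProp:

➢ in addition to the exponentially decaying average of historical squared gradients ($r$),

➢ it also keeps an exponentially decaying average of the historical gradients ($s$), which acts like momentum.

⚫ Adam behaves like a ball with friction, tending toward flat minima on the error surface.

Note that the code below uses different names: the gradient average is called v and the squared-gradient average is called s.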

%matplotlib inline
import torch
from d2l import torch as d2l
def init_adam_states(feature_dim):
    # v: decaying average of gradients (momentum); s: decaying average of squared gradients
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            # Bias correction compensates for initializing v and s at zero
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr)
                                                       + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1  # time step used by the bias correction

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);
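
For comparison with the two-variable AdaGrad and RMSProp updates, the sketch below applies the same Adam step to $0.1 x_1^2 + 2 x_2^2$ in plain Python. The function name `adam_2d`, the learning rate 0.1, the starting point and the 300 steps are assumptions for illustration; the beta and eps values follow the code above.

```python
import math

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def adam_2d(x, v, s, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam step on f_2d; x, v and s are lists with one entry per coordinate."""
    grads = [0.2 * x[0], 4 * x[1]]
    new_x, new_v, new_s = [], [], []
    for xi, gi, vi, si in zip(x, grads, v, s):
        vi = beta1 * vi + (1 - beta1) * gi       # decaying average of gradients
        si = beta2 * si + (1 - beta2) * gi ** 2  # decaying average of squared gradients
        v_hat = vi / (1 - beta1 ** t)            # bias correction for zero initialization
        s_hat = si / (1 - beta2 ** t)
        new_x.append(xi - lr * v_hat / (math.sqrt(s_hat) + eps))
        new_v.append(vi)
        new_s.append(si)
    return new_x, new_v, new_s

x, v, s = [-5.0, -2.0], [0.0, 0.0], [0.0, 0.0]
start_value = f_2d(*x)
for t in range(1, 301):  # t starts at 1 so the bias correction is well defined
    x, v, s = adam_2d(x, v, s, t)
final_value = f_2d(*x)
```

In the first steps the bias-corrected ratio `v_hat / sqrt(s_hat)` has magnitude close to 1, so each coordinate moves by roughly `lr` per step regardless of how steep its direction is.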