2. Deep Learning Preliminaries

Latest recommended article published on 2023-08-08 09:15:25

怕晒的向日葵--

Tags: deep learning, python, machine learning

Original link: https://blog.csdn.net/sumshine_/article/details/126807296?spm=1001.2014.3001.5502

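2.1 Data Manipulation

To get anything done, we need some way to store and manipulate data. In general there are two important things to do: (1) acquire the data; (2) process it once it has been read into the computer.

2.1.1 Getting Started

Learn and run some basic numerical computing tools.

A tensor represents an array of numerical values, and this array may have multiple dimensions. A tensor with one axis corresponds to a vector in mathematics. A tensor with two axes corresponds to a matrix. Tensors with more than two axes have no special mathematical name.

Further reading: "A Geometric Understanding of Multidimensional Tensors", Jianshu (jianshu.com)

Function	Purpose	What the example does
`arange()`	Creates a row vector	Creates a row vector containing the first 12 integers starting from 0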

import paddle 
x = paddle.arange(12)

Tensor(shape=[12], dtype=int64, place=CPUPlace, stop_gradient=True,
    [0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10, 11])

Function	Purpose	What the example does
`zeros()`	Sets every element of the tensor to 0	Creates a tensor of shape (2, 3, 4) with all elements set to 0

paddle.zeros((2, 3, 4))

Tensor(shape=[2, 3, 4], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       [[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])

Function	Purpose	What the example does
`ones()`	Sets every element of the tensor to 1	Creates a tensor of shape (2, 3, 4) with all elements set to 1

paddle.ones((2, 3, 4))

Tensor(shape=[2, 3, 4], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       [[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

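Function	Purpose	What the example does
`randn()`	Randomly initializes the values of a tensor	Creates a tensor of shape (3, 4) in which each element is sampled from a standard Gaussian (normal) distribution with mean 0 and standard deviation 1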

paddle.randn((3, 4))

Tensor(shape=[3, 4], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       [[ 1.25745058,  0.16176178, -1.47831631,  0.33923152],
        [-1.79979932, -1.00609720, -0.13190337,  0.63592905],
        [-1.07031310, -1.92815316, -0.35858944, -0.83898044]])

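Function	Purpose	What the example does
`to_tensor()`	Builds a tensor from a Python list (or nested list) of numbers, assigning a definite value to each element. The outermost list corresponds to axis 0 and the inner lists to axis 1	Creates a tensor with the specified values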

paddle.to_tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

Tensor(shape=[3, 4], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [[2, 1, 4, 3],
        [1, 2, 3, 4],
        [4, 3, 2, 1]])

Function	Purpose
`shape`	Accesses the shape of the tensor (the length along each axis); it is an attribute, not a method call

x.shape

[12]

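Function	Purpose
`size`	Accesses the total number of elements in the tensor (in Paddle, `x.size` is the element count rather than the per-axis shape)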

x.size

Function	Purpose
`numel()`	Total number of elements in the tensor

print(x.numel())
print(x.numel().item())

Tensor(shape=[1], dtype=int64, place=Place(cpu), stop_gradient=True,
       [12])
12

Function	Purpose
`reshape()`	Changes the shape of a tensor without changing the number of elements or their values

X = x.reshape([3, 4])
X

Tensor(shape=[3, 4], dtype=int64, place=Place(cpu), stop_gradient=True,
       [[0 , 1 , 2 , 3 ],
        [4 , 5 , 6 , 7 ],
        [8 , 9 , 10, 11]])
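To make explicit what reshape does with the element order, here is a small framework-free sketch using nested Python lists. It assumes row-major (C-style) order, which is what the output above shows. The helper name `reshape_2d` is made up for illustration and is not a Paddle function.

```python
def reshape_2d(flat, rows, cols):
    """Arrange a flat list into `rows` x `cols` in row-major order."""
    if rows * cols != len(flat):
        # Reshaping may change the shape but never the number of elements.
        raise ValueError("reshape must keep the number of elements unchanged")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


x = list(range(12))          # plays the role of paddle.arange(12)
X = reshape_2d(x, 3, 4)      # plays the role of x.reshape([3, 4])
for row in X:
    print(row)
```

The values and their order are untouched; only the grouping into rows changes.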

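2.1.2 Operations

Elementwise operations

For any tensors of the same shape, the common standard arithmetic operators (+, -, *, / and **) can all be lifted to elementwise operations. We can call elementwise operations on any two tensors of the same shape.

Operator in code	Meaning
`+`	Addition
`-`	Subtraction
`*`	Multiplication
`/`	Division
`**`	Exponentiation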

x = paddle.to_tensor([1.0, 2, 4, 8])
y = paddle.to_tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x**y

(Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [3. , 4. , 6. , 10.]),
 Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [-1.,  0.,  2.,  6.]),
 Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [2. , 4. , 8. , 16.]),
 Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [0.50000000, 1.        , 2.        , 4.        ]),
 Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [1. , 4. , 16., 64.]))

Function	Meaning
`exp()`	Exponential function, applied elementwise

paddle.exp(x)

Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
       [2.71828175   , 7.38905621   , 54.59814835  , 2980.95800781])

Function	Meaning
`sum()`	Sums all elements of the tensor, producing a single-element tensor

X = paddle.arange(12, dtype='float32').reshape((3, 4))
X,X.sum()

(Tensor(shape=[3, 4], dtype=float32, place=Place(cpu), stop_gradient=True,
        [[0. , 1. , 2. , 3. ],
         [4. , 5. , 6. , 7. ],
         [8. , 9. , 10., 11.]]),
 Tensor(shape=[1], dtype=float32, place=Place(cpu), stop_gradient=True,
        [66.]))

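Function	Meaning
`paddle.concat(tensors, axis=0/1)`, for example `paddle.concat((X, Y), axis=0)` and `paddle.concat((X, Y), axis=1)`, where X and Y are both tensors of shape (3, 4)	Tensor concatenation; `axis` is the dimension along which the tensors are joined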

X = paddle.arange(12, dtype='float32').reshape((3, 4))
Y = paddle.to_tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
paddle.concat((X, Y), axis=0), paddle.concat((X, Y), axis=1)

(Tensor(shape=[6, 4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [[0. , 1. , 2. , 3. ],
         [4. , 5. , 6. , 7. ],
         [8. , 9. , 10., 11.],
         [2. , 1. , 4. , 3. ],
         [1. , 2. , 3. , 4. ],
         [4. , 3. , 2. , 1. ]]),
 Tensor(shape=[3, 8], dtype=float32, place=CPUPlace, stop_gradient=True,
        [[0. , 1. , 2. , 3. , 2. , 1. , 4. , 3. ],
         [4. , 5. , 6. , 7. , 1. , 2. , 3. , 4. ],
         [8. , 9. , 10., 11., 4. , 3. , 2. , 1. ]]))

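The example above shows what happens when we concatenate two matrices along rows (axis 0, the first element of the shape) and along columns (axis 1, the second element of the shape). The axis-0 length of the first output tensor (6) is the sum of the axis-0 lengths of the two input tensors (3 + 3); the axis-1 length of the second output tensor (8) is the sum of the axis-1 lengths of the two input tensors (4 + 4).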
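The shape arithmetic of concatenation can be checked without any framework. The sketch below represents matrices as lists of rows; `concat` and `shape` are illustrative helpers written for this note, not Paddle functions.

```python
def concat(a, b, axis):
    """Concatenate two matrices given as lists of rows."""
    if axis == 0:
        # Stacking rows: the row lengths (axis 1) must agree.
        if len(a[0]) != len(b[0]):
            raise ValueError("axis-1 lengths must match")
        return [row[:] for row in a] + [row[:] for row in b]
    if axis == 1:
        # Joining rows side by side: the row counts (axis 0) must agree.
        if len(a) != len(b):
            raise ValueError("axis-0 lengths must match")
        return [ra + rb for ra, rb in zip(a, b)]
    raise ValueError("axis must be 0 or 1")


def shape(m):
    return (len(m), len(m[0]))


X = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Y = [[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]]
rows_joined = concat(X, Y, axis=0)
cols_joined = concat(X, Y, axis=1)
print(shape(rows_joined), shape(cols_joined))  # (6, 4) (3, 8)
```

Only the length along the chosen axis grows; every other axis must already match.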

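2.1.3 Broadcasting

In the section above, we saw how to perform elementwise operations on two tensors of the same shape. Under certain conditions, even when the shapes differ, we can still perform elementwise operations by invoking the broadcasting mechanism. It works as follows: first, one or both arrays are expanded by copying elements appropriately so that after this transformation the two tensors have the same shape. Second, the elementwise operation is carried out on the resulting arrays.

In most cases, we broadcast along an axis where an array has length 1, as in the following example: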

a = paddle.reshape(paddle.arange(3), (3, 1))
b = paddle.reshape(paddle.arange(2), (1, 2))
a, b

(Tensor(shape=[3, 1], dtype=int64, place=CPUPlace, stop_gradient=True,
        [[0],
         [1],
         [2]]),
 Tensor(shape=[1, 2], dtype=int64, place=CPUPlace, stop_gradient=True,
        [[0, 1]]))

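Since a and b are 3×1 and 1×2 matrices respectively, their shapes do not match if we add them directly. Both matrices are broadcast into a larger 3×2 matrix as follows: matrix a is replicated along the columns, matrix b is replicated along the rows, and the two are then added elementwise.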

a + b

Tensor(shape=[3, 2], dtype=int64, place=CPUPlace, stop_gradient=True,
       [[0, 1],
        [1, 2],
        [2, 3]])
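The shape rule behind broadcasting can be written down in a few lines. This is a sketch of the usual NumPy-style rule, which is assumed here to be the one Paddle applies as well: shapes are aligned from the right, and on each axis the two lengths must be equal or one of them must be 1. `broadcast_shape` is a hypothetical helper name.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError on a mismatch."""
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both have the same number of axes.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for axis, (da, db) in enumerate(zip(a, b)):
        if da == db or da == 1 or db == 1:
            # A length-1 axis is stretched to the other operand's length.
            result.append(db if da == 1 else da)
        else:
            raise ValueError(f"axis {axis}: lengths {da} and {db} cannot be broadcast")
    return tuple(result)


print(broadcast_shape((3, 1), (1, 2)))        # (3, 2)
print(broadcast_shape((3, 1, 2), (1, 2, 2)))  # (3, 2, 2)
```

A pair such as (3, 1, 2) and (1, 2, 3) fails the check on the last axis, because 2 and 3 differ and neither is 1.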

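2.1.4 Indexing and Slicing

The first element has index 0, and the last element has index -1.

Indexing and slicing examples, where X is a tensor:

Example	Meaning
`X[-1]`	Selects the last element
`X[1:3]`	Selects the second and third elements
`X[0:2,:]`	Accesses rows 1 and 2; ":" stands for all elements along axis 1 (the columns)

2.1.5 Saving Memory

Underlying logic: running some operations can cause new memory to be allocated for the result. Machine learning involves an enormous amount of computation, so memory keeps being claimed by fresh results.

Practical significance: updating in place saves memory and, more importantly, prevents bugs in which code that needs the latest value still points at the address of the old one.

Method: slice notation, e.g. `Y[:] =`. For contrast, the code below uses the plain assignment `Z = X + Y`, which binds Z to newly allocated memory, so `id(Z)` changes.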

Z = paddle.zeros_like(Y)
print('id(Z):', id(Z))
Z = X + Y
print('id(Z):', id(Z))

id(Z): 139847818344880
id(Z): 139847859902128
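The difference between rebinding a name and writing into an existing object can be seen with plain Python lists, which support the same slice-assignment syntax. This is only an analogy for the tensor case, sketched under the assumption that tensors behave the same way for `Z[:] = ...`.

```python
X = [1.0, 2.0, 3.0]
Y = [10.0, 20.0, 30.0]

# Plain assignment: the name Z is rebound to a brand-new object.
Z = [0.0] * len(X)
before = id(Z)
Z = [a + b for a, b in zip(X, Y)]
rebinding_kept_object = (id(Z) == before)

# Slice assignment: the existing object is overwritten in place.
Z = [0.0] * len(X)
before = id(Z)
Z[:] = [a + b for a, b in zip(X, Y)]
slice_kept_object = (id(Z) == before)

print(rebinding_kept_object, slice_kept_object)  # False True
```

Anything else that holds a reference to Z sees the new values only in the second case.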

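2.1.6 Conversion to Other Python Objects

Converting to a NumPy tensor is easy, and so is the reverse. The converted result does not share memory with the original. This minor inconvenience is actually quite important: when you perform operations on the CPU or GPU, you do not want to halt the computation and wait while Python's NumPy package does something else with the same chunk of memory.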

A = X.numpy()
B = paddle.to_tensor(A)
type(A), type(B)

(numpy.ndarray, paddle.Tensor)

To convert a size-1 tensor to a Python scalar, we can call the item function or Python's built-in functions.

a = paddle.to_tensor([3.5])
a, a.item(), float(a), int(a)

(Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [3.50000000]),
 3.5,
 3.5,
 3)

2.1.7 Exercises

1. Run the code in this section. Change the conditional statement X == Y to X < Y or X > Y, and see what kind of tensor you get.

X = paddle.arange(12, dtype=paddle.float32).reshape((3,4))
Y = paddle.to_tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
print(X)
print(Y)
X == Y,X > Y,X< Y

Tensor(shape=[3, 4], dtype=float32, place=Place(cpu), stop_gradient=True,
       [[0. , 1. , 2. , 3. ],
        [4. , 5. , 6. , 7. ],
        [8. , 9. , 10., 11.]])
Tensor(shape=[3, 4], dtype=float32, place=Place(cpu), stop_gradient=True,
       [[2., 1., 4., 3.],
        [1., 2., 3., 4.],
        [4., 3., 2., 1.]])
(Tensor(shape=[3, 4], dtype=bool, place=Place(cpu), stop_gradient=True,
        [[False, True , False, True ],
         [False, False, False, False],
         [False, False, False, False]]),
 Tensor(shape=[3, 4], dtype=bool, place=Place(cpu), stop_gradient=True,
        [[False, False, False, False],
         [True , True , True , True ],
         [True , True , True , True ]]),
 Tensor(shape=[3, 4], dtype=bool, place=Place(cpu), stop_gradient=True,
        [[True , False, True , False],
         [False, False, False, False],
         [False, False, False, False]]))

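2. Replace the two tensors that operate elementwise in the broadcasting mechanism with other shapes, e.g. three-dimensional tensors. Is the result the same as expected?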

a = paddle.arange(6).reshape((3, 1, 2))
b = paddle.arange(4).reshape((1, 2, 2))
a, b

(Tensor(shape=[3, 1, 2], dtype=int64, place=Place(cpu), stop_gradient=True,
        [[[0, 1]],
 
         [[2, 3]],
 
         [[4, 5]]]),
 Tensor(shape=[1, 2, 2], dtype=int64, place=Place(cpu), stop_gradient=True,
        [[[0, 1],
          [2, 3]]]))

a + b

Tensor(shape=[3, 2, 2], dtype=int64, place=Place(cpu), stop_gradient=True,
       [[[0, 2],
         [2, 4]],

        [[2, 4],
         [4, 6]],

        [[4, 6],
         [6, 8]]])

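Yes, the result is as expected, but note the following:

(1) Along every axis the two lengths must be equal or one of them must be 1. In the example below the third axis has lengths 2 and 3, so the addition raises an error.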

a = paddle.arange(6).reshape((3, 1, 2))
b = paddle.arange(6).reshape((1, 2, 3))
a, b

(Tensor(shape=[3, 1, 2], dtype=int64, place=Place(cpu), stop_gradient=True,
        [[[0, 1]],
 
         [[2, 3]],
 
         [[4, 5]]]),
 Tensor(shape=[1, 2, 3], dtype=int64, place=Place(cpu), stop_gradient=True,
        [[[0, 1, 2],
          [3, 4, 5]]]))

a + b

ValueError: (InvalidArgument) Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [3, 1, 2] and the shape of Y = [1, 2, 3]. Received [2] in X is not equal to [3] in Y at i:2.
  [Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at /paddle/paddle/fluid/operators/elementwise/elementwise_op_function.h:240)
  [operator < elementwise_add > error]

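2.2 Data Preprocessing

Among the data analysis tools commonly used in Python, the pandas package is the usual choice. Like many other extension packages in the vast Python ecosystem, pandas works together with tensors.

2.2.1 Reading the Dataset

First we create an artificial dataset and store it in a csv (comma-separated values) file, writing the dataset to the file row by row.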

import os

os.makedirs(os.path.join('.', 'data'), exist_ok=True)
data_file = os.path.join('.', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row represents one data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

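To load the raw dataset from the csv file we created, we import the pandas package and call the read_csv function. The dataset has four rows and three columns, where each row describes the number of rooms ("NumRooms"), the alley type ("Alley") and the house price ("Price").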

import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000

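2.2.2 Handling Missing Data

To handle missing data, typical methods include imputation and deletion.

Using integer-location based indexing (iloc), we split data into inputs and outputs, where the former takes the first two columns of data and the latter keeps only the last column. For numerical values missing from inputs, we replace the "NaN" entries with the mean of the same column.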

inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
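inputs = inputs.fillna(inputs.mean(numeric_only=True))  # average numeric columns only; recent pandas versions raise an error on the text column "Alley" otherwise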
print(inputs)

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN

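For categorical or discrete values in inputs, we treat "NaN" as a category. A row whose alley type is "Pave" has "Alley_Pave" set to 1 and "Alley_nan" set to 0. A row with a missing alley type has "Alley_Pave" and "Alley_nan" set to 0 and 1 respectively.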

inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
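To show what the two preprocessing steps compute, here is a framework-free sketch that performs mean imputation and one-hot encoding by hand on the same four rows. It does not reproduce how pandas implements `fillna` or `get_dummies`; the dictionary layout and variable names are assumptions made for this illustration.

```python
rows = [
    {"NumRooms": None, "Alley": "Pave"},
    {"NumRooms": 2.0, "Alley": None},
    {"NumRooms": 4.0, "Alley": None},
    {"NumRooms": None, "Alley": None},
]

# Mean imputation: average only the values that are actually present.
observed = [r["NumRooms"] for r in rows if r["NumRooms"] is not None]
mean_rooms = sum(observed) / len(observed)

encoded = []
for r in rows:
    rooms = mean_rooms if r["NumRooms"] is None else r["NumRooms"]
    # One-hot encoding in which "missing" counts as its own category.
    alley_pave = 1 if r["Alley"] == "Pave" else 0
    alley_nan = 1 if r["Alley"] is None else 0
    encoded.append([rooms, alley_pave, alley_nan])

for row in encoded:
    print(row)
```

Each printed row has the columns NumRooms, Alley_Pave and Alley_nan, in the same order as the table above.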

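2.2.3 Conversion to the Tensor Format

Now that all entries in inputs and outputs are numerical, they can be converted to the tensor format. Once the data is in this format, it can be manipulated further with the tensor functions introduced in Section 2.1.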

import paddle

X, y = paddle.to_tensor(inputs.values), paddle.to_tensor(outputs.values)
X, y

(Tensor(shape=[4, 3], dtype=float64, place=CPUPlace, stop_gradient=True,
        [[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]]),
 Tensor(shape=[4], dtype=int64, place=CPUPlace, stop_gradient=True,
        [127500, 106000, 178100, 140000]))

2.3 Linear Algebra

2.3.1 Scalars

A scalar is represented by a tensor with just one element.

1. Example

import paddle

x = paddle.to_tensor([3.0])
y = paddle.to_tensor([2.0])

x + y, x * y, x / y, x**y

(Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [5.]),
 Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [6.]),
 Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [1.50000000]),
 Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [9.]))

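2. Mathematical background

2.3.2 Vectors

You can think of a vector as a list of scalar values: a one-dimensional tensor, which generalizes scalars from order zero to order one.

1. Example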

x = paddle.arange(4)
x

Tensor(shape=[4], dtype=int64, place=CPUPlace, stop_gradient=True,
       [0, 1, 2, 3])

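2. Mathematical background

Note: the two Chinese terms 点乘 and 点积 both refer to the dot product, i.e. the sum of the products of the elements at the same position in two given vectors.

2.3.3 Matrices

A matrix is a two-dimensional tensor, generalizing vectors from order one to order two.

Just as vectors generalize scalars from order zero to order one, matrices generalize vectors from order one to order two. Matrices, which we usually denote with bold capital letters (e.g. $\mathbf{X}$ , $\mathbf{Y}$ and $\mathbf{Z}$ ), are represented in code as tensors with two axes.

In mathematical notation, we use $\mathbf{A} \in \mathbb{R}^{m \times n}$ to express that the matrix $\mathbf{A}$ consists of $m$ rows and $n$ columns of real-valued scalars. Visually, we can picture any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ as a table, where each element $a_{ij}$ belongs to the $i$-th row and $j$-th column:

$\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}.$ (2.3.2)

For any $\mathbf{A} \in \mathbb{R}^{m \times n}$ , the shape of $\mathbf{A}$ is ( $m$ , $n$ ) or $m \times n$ . When a matrix has the same number of rows and columns, its shape becomes a square; thus, it is called a square matrix.

When calling functions to instantiate tensors, we can create a matrix of shape $m \times n$ by specifying its two components $m$ and $n$ .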

A = paddle.reshape(paddle.arange(20), (5, 4))
A

Tensor(shape=[5, 4], dtype=int64, place=CPUPlace, stop_gradient=True,
       [[0 , 1 , 2 , 3 ],
        [4 , 5 , 6 , 7 ],
        [8 , 9 , 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]])

Matrix transpose

paddle.transpose(A, perm=[1, 0])

Tensor(shape=[4, 5], dtype=int64, place=CPUPlace, stop_gradient=True,
       [[0 , 4 , 8 , 12, 16],
        [1 , 5 , 9 , 13, 17],
        [2 , 6 , 10, 14, 18],
        [3 , 7 , 11, 15, 19]])

A symmetric matrix $\mathbf{A}$ equals its transpose: $\mathbf{A} = \mathbf{A}^\top$

B = paddle.to_tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
B

Tensor(shape=[3, 3], dtype=int64, place=CPUPlace, stop_gradient=True,
       [[1, 2, 3],
        [2, 0, 4],
        [3, 4, 5]])

Now we compare B with its transpose.

B == paddle.transpose(B, perm=[1, 0])

Tensor(shape=[3, 3], dtype=bool, place=CPUPlace, stop_gradient=True,
       [[True, True, True],
        [True, True, True],
        [True, True, True]])

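Matrices are useful data structures: they let us organize data that vary in different ways. For example, the rows of our matrix might correspond to different houses (data examples), while the columns might correspond to different attributes. This should sound familiar if you have ever used spreadsheet software or have read Section 2.2. Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset it is more common to treat each data example as a row vector. We will return to this in later chapters. This convention supports common deep learning practice. For example, along the outermost axis of a tensor we can access or enumerate minibatches of data examples, or just single data examples if there are no minibatches.

2.3.4 Reduction

Methods:

1. Summation

Plain summation reduces a tensor along all of its axes to a scalar; it works for vectors and matrices alike. (The outputs below have dtype float32, which suggests that A was recreated as a float32 tensor before this point, e.g. with `paddle.arange(20, dtype='float32')`.)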

x = paddle.arange(4, dtype=paddle.float32)
x, x.sum()

(Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [0., 1., 2., 3.]),
 Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [6.]))

A.shape, A.sum()

([5, 4],
 Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [190.]))

To specify the axis along which the tensor is summed:

A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape

(Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [40., 45., 50., 55.]),
 [4])

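Specifying axis=1 reduces the column dimension (axis 1) by summing up the elements of all the columns. Thus, axis 1 of the input no longer appears in the output shape.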

A_sum_axis1 = A.sum(axis=1)
A_sum_axis1, A_sum_axis1.shape

(Tensor(shape=[5], dtype=float32, place=CPUPlace, stop_gradient=True,
        [6. , 22., 38., 54., 70.]),
 [5])

Summing a matrix along both rows and columns is equivalent to summing all of its elements.

A.sum(axis=[0, 1])  # Same as `A.sum()`

Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
       [190.])

2. Mean

Plain mean:

A.mean(), A.sum() / A.numel()

(Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [9.50000000]),
 Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [9.50000000]))

Along a specified axis:

A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [8. , 9. , 10., 11.]),
 Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [8. , 9. , 10., 11.]))

2.3.4.1 Non-Reduction Sum

Sometimes it is useful to keep the number of axes unchanged when calling a function to compute the sum or the mean.

sum_A = paddle.sum(A, axis=1, keepdim=True)
sum_A

Tensor(shape=[5, 1], dtype=float32, place=CPUPlace, stop_gradient=True,
       [[6. ],
        [22.],
        [38.],
        [54.],
        [70.]])

For instance, since sum_A still keeps its two axes after summing each row, we can divide A by sum_A with broadcasting.

A / sum_A

Tensor(shape=[5, 4], dtype=float32, place=CPUPlace, stop_gradient=True,
       [[0.        , 0.16666667, 0.33333334, 0.50000000],
        [0.18181819, 0.22727273, 0.27272728, 0.31818181],
        [0.21052632, 0.23684211, 0.26315790, 0.28947368],
        [0.22222222, 0.24074075, 0.25925925, 0.27777779],
        [0.22857143, 0.24285714, 0.25714287, 0.27142859]])

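If we want to compute the cumulative sum of the elements of A along some axis, say axis=0 (row by row), we can call the cumsum function. This function does not reduce the input tensor along any axis.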

A.cumsum(axis=0)

Tensor(shape=[5, 4], dtype=float32, place=CPUPlace, stop_gradient=True,
       [[0. , 1. , 2. , 3. ],
        [4. , 6. , 8. , 10.],
        [12., 15., 18., 21.],
        [24., 28., 32., 36.],
        [40., 45., 50., 55.]])
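The axis conventions used in this subsection are easy to mix up, so here is a plain-Python sketch of the same reductions on the same 5×4 matrix. It mirrors the numbers shown above but is not how Paddle computes them; all variable names are chosen for this illustration.

```python
from itertools import accumulate

# Same values as A above: 0..19 laid out in 5 rows of 4.
A = [[float(4 * r + c) for c in range(4)] for r in range(5)]

sum_axis0 = [sum(col) for col in zip(*A)]   # collapse axis 0: one value per column
sum_axis1 = [sum(row) for row in A]         # collapse axis 1: one value per row
sum_keepdim = [[s] for s in sum_axis1]      # shape (5, 1): axis 1 kept with length 1

# Dividing A by the (5, 1) sums: each row is divided by its own row sum,
# which is what broadcasting does for A / sum_A.
normalized = [[v / s[0] for v in row] for row, s in zip(A, sum_keepdim)]

# Cumulative sum along axis 0: running totals down each column,
# then transposed back into rows.
cumsum_axis0 = [list(row) for row in zip(*(accumulate(col) for col in zip(*A)))]

print(sum_axis0)         # [40.0, 45.0, 50.0, 55.0]
print(sum_axis1)         # [6.0, 22.0, 38.0, 54.0, 70.0]
print(cumsum_axis0[-1])  # [40.0, 45.0, 50.0, 55.0]
```

The last row of the cumulative sum equals the axis-0 sum, which is a quick consistency check between the two operations.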

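2.3.5 Dot Product

So far we have only performed elementwise operations, sums and means. If that were all we could do, linear algebra would probably not deserve its own section.
However, one of the most fundamental operations is the dot product. Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ , their dot product $\mathbf{x}^\top \mathbf{y}$ (or $\langle \mathbf{x}, \mathbf{y} \rangle$ ) is the sum of the products of the elements at the same position: $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$ .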

y = paddle.ones(shape=[4], dtype='float32')
x, y, paddle.dot(x, y)

(Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [0., 1., 2., 3.]),
 Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=True,
        [1., 1., 1., 1.]),
 Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
        [6.]))

Note that we can express the dot product of two vectors equivalently by performing an elementwise multiplication and then a sum:

paddle.sum(x * y)

Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
       [6.])

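Dot products are useful in many contexts. For example, given a set of values denoted by a vector $\mathbf{x} \in \mathbb{R}^d$ and a set of weights denoted by $\mathbf{w} \in \mathbb{R}^d$ , the weighted sum of the values in $\mathbf{x}$ according to the weights $\mathbf{w}$ can be expressed as the dot product $\mathbf{x}^\top \mathbf{w}$ . When the weights are non-negative and sum to one (i.e. $\left(\sum_{i=1}^{d} {w_i} = 1\right)$ ), the dot product expresses a weighted average. After normalizing two vectors to unit length, the dot product expresses the cosine of the angle between them. We will formally introduce this notion of length later in this section.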
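The three uses of the dot product mentioned above (plain sum of products, weighted average, cosine of an angle) can be sketched in plain Python. The weights and the two 2-D vectors below are made-up example values.

```python
import math


def dot(x, y):
    """Sum of the products of elements at the same position."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 1.0, 1.0, 1.0]
print(dot(x, y))  # 6.0

# Non-negative weights that sum to 1 turn the dot product into a weighted average.
w = [0.1, 0.2, 0.3, 0.4]
weighted_average = dot(x, w)


def cosine(u, v):
    """Cosine of the angle between u and v: the dot product of the unit vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


cos_angle = cosine([1.0, 0.0], [1.0, 1.0])  # the two vectors are 45 degrees apart
```

Here `weighted_average` comes out at about 2.0 and `cos_angle` at about 0.707, the cosine of 45 degrees.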

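2.3.6 Matrix-Vector Products

Now that we know how to calculate dot products, we can begin to understand matrix-vector products. Recall the matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ and the vector $\mathbf{x} \in \mathbb{R}^n$ defined and visualized in equations 2.3.2 and 2.3.1 respectively. Let us write the matrix $\mathbf{A}$ in terms of its row vectors:

$\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix},$ (2.3.5)

where each $\mathbf{a}^\top_{i} \in \mathbb{R}^n$ is a row vector representing the $i$-th row of the matrix. The matrix-vector product $\mathbf{A}\mathbf{x}$ is a column vector of length $m$ whose $i$-th element is the dot product $\mathbf{a}^\top_i \mathbf{x}$ :

$\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{x} \\ \mathbf{a}^\top_{2} \mathbf{x} \\ \vdots\\ \mathbf{a}^\top_{m} \mathbf{x}\\ \end{bmatrix}.$ (2.3.6)

We can think of multiplication by a matrix $\mathbf{A}\in \mathbb{R}^{m \times n}$ as a transformation that maps vectors from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$ . These transformations turn out to be remarkably useful. For example, we can represent rotations as multiplications by a square matrix.
As we will see in later chapters, we can also use matrix-vector products to describe the heavy computation needed to obtain each layer of a neural network from the values of the previous layer.

To express a matrix-vector product in code with tensors, Paddle provides the mv function. When we call paddle.mv(A, x) with a matrix A and a vector x, the matrix-vector product is performed. Note that the column dimension of A (its length along axis 1) must equal the dimension of x (its length).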

A.shape, x.shape, paddle.mv(A, x)

([5, 4],
 [4],
 Tensor(shape=[5], dtype=float32, place=CPUPlace, stop_gradient=True,
        [14. , 38. , 62. , 86. , 110.]))

2.3.7 Matrix-Matrix Multiplication

If you have gotten the hang of dot products and matrix-vector products, then matrix-matrix multiplication should be straightforward.

Say that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times m}$ :

$\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \\ \end{bmatrix},\quad \mathbf{B}=\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \\ \end{bmatrix}.$ (2.3.7)

Denote by the row vector $\mathbf{a}^\top_{i} \in \mathbb{R}^k$ the $i$-th row of the matrix $\mathbf{A}$ , and let the column vector $\mathbf{b}_{j} \in \mathbb{R}^k$ be the $j$-th column of the matrix $\mathbf{B}$ . To produce the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B}$ , it is easiest to think of $\mathbf{A}$ in terms of its row vectors and $\mathbf{B}$ in terms of its column vectors:

$\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix}, \quad \mathbf{B}=\begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix}.$ (2.3.8)
We then simply compute each element $c_{ij}$ as the dot product $\mathbf{a}^\top_i \mathbf{b}_j$ :

$\mathbf{C} = \mathbf{AB} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix} \begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\ \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\ \vdots & \vdots & \ddots &\vdots\\ \mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m \end{bmatrix}.$ (2.3.9)

We can think of the matrix-matrix multiplication $\mathbf{AB}$ as simply performing $m$ matrix-vector products and stitching the results together to form an $n \times m$ matrix. In the code below, we perform matrix multiplication on A and B. Here, A is a matrix with 5 rows and 4 columns, and B is a matrix with 4 rows and 3 columns. After multiplication, we obtain a matrix with 5 rows and 3 columns.

B = paddle.ones(shape=[4, 3], dtype='float32')
paddle.mm(A, B)

Tensor(shape=[5, 3], dtype=float32, place=CPUPlace, stop_gradient=True,
       [[6. , 6. , 6. ],
        [22., 22., 22.],
        [38., 38., 38.],
        [54., 54., 54.],
        [70., 70., 70.]])

Matrix-matrix multiplication can simply be called matrix multiplication, and should not be confused with the Hadamard product.
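The statement that a matrix product is just a collection of matrix-vector products, each of which is a collection of dot products, can be sketched directly in plain Python. The helper names `dot`, `matvec` and `matmul` are illustrative; real frameworks use optimized kernels rather than nested lists.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    """Matrix-vector product: entry i is the dot product of row i of A with x."""
    if len(A[0]) != len(x):
        raise ValueError("number of columns of A must equal the length of x")
    return [dot(row, x) for row in A]


def matmul(A, B):
    """Matrix product built from one matrix-vector product per column of B."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    columns = [matvec(A, list(col)) for col in zip(*B)]  # m results, each of length n
    return [list(row) for row in zip(*columns)]          # place them side by side


A = [[float(4 * r + c) for c in range(4)] for r in range(5)]  # 5 x 4, values 0..19
x = [0.0, 1.0, 2.0, 3.0]
B = [[1.0] * 3 for _ in range(4)]                             # 4 x 3, all ones

Ax = matvec(A, x)
C = matmul(A, B)
print(Ax)    # [14.0, 38.0, 62.0, 86.0, 110.0]
print(C[0])  # [6.0, 6.0, 6.0]
```

The Hadamard product, by contrast, multiplies two matrices of the same shape entry by entry and involves no dot products.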

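2.3.8 Norms

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big the vector is.
The notion of size considered here concerns not dimensionality but the magnitude of the components.

In linear algebra, a vector norm is a function $f$ that maps a vector to a scalar and satisfies a handful of properties.
Given any vector $\mathbf{x}$ , the first property says that if we scale all the elements of a vector by a constant factor $\alpha$ , its norm also scales by the absolute value of the same constant factor:

$f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x}).$ (2.3.10)

The second property is the familiar triangle inequality:

$f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}).$ (2.3.11)

The third property simply says that the norm must be non-negative:

$f(\mathbf{x}) \geq 0.$ (2.3.12)

That makes sense, as in most contexts the smallest size of anything is 0. The final property requires that the smallest norm, 0, is attained if and only if the vector consists entirely of zeros.

$\forall i, [\mathbf{x}]_i = 0 \Leftrightarrow f(\mathbf{x})=0.$ (2.3.13)

You might notice that norms sound a lot like measures of distance. If you remember Euclidean distances (think of Pythagoras' theorem) from grade school, then the concepts of non-negativity and the triangle inequality might ring a bell.
In fact, the Euclidean distance is a norm: specifically, it is the $L_2$ norm. Suppose that the elements of the $n$-dimensional vector $\mathbf{x}$ are $x_1, \ldots, x_n$ . Its $L_2$ norm is the square root of the sum of the squares of the vector elements:

$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2},$ (2.3.14)

where the subscript $2$ is often omitted in $L_2$ norms, i.e. $\|\mathbf{x}\|$ is equivalent to $\|\mathbf{x}\|_2$ . In code, we can calculate the $L_2$ norm of a vector as follows.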

u = paddle.to_tensor([3.0, -4.0])
paddle.norm(u)

Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
       [5.])

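In deep learning, we work more often with the squared $L_2$ norm. You will also frequently encounter the $L_1$ norm, which is expressed as the sum of the absolute values of the vector elements:

$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$ (2.3.15)

Compared with the $L_2$ norm, the $L_1$ norm is less influenced by outliers. To calculate the $L_1$ norm, we compose the absolute value function with a sum over the elements.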

paddle.abs(u).sum()

Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
       [7.])

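Both the $L_2$ norm and the $L_1$ norm are special cases of the more general $L_p$ norm:

$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.$ (2.3.16)

Analogous to the $L_2$ norm of vectors, the Frobenius norm of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ is the square root of the sum of the squares of the matrix elements:

$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$ (2.3.17)

The Frobenius norm satisfies all the properties of vector norms. It behaves as if it were an $L_2$ norm of a matrix-shaped vector. Calling the following function computes the Frobenius norm of a matrix.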

paddle.norm(paddle.ones(shape=[4, 9], dtype='float32'))

Tensor(shape=[1], dtype=float32, place=CPUPlace, stop_gradient=True,
       [6.])
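The norms above all follow from the single $L_p$ formula, which the following plain-Python sketch makes concrete. `lp_norm` and `frobenius` are illustrative helper names, and the example vectors are the ones used in this section.

```python
def lp_norm(x, p):
    """L_p norm of a vector given as a list of numbers (intended for p >= 1)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def frobenius(M):
    """Frobenius norm: the L2 norm of all matrix entries taken as one long vector."""
    return lp_norm([v for row in M for v in row], 2)


u = [3.0, -4.0]
l2 = lp_norm(u, 2)
l1 = lp_norm(u, 1)
fro = frobenius([[1.0] * 9 for _ in range(4)])
print(l2, l1, fro)  # 5.0 7.0 6.0

# Scaling property: multiplying the vector by -2 multiplies the norm by |-2| = 2.
scaled = lp_norm([-2.0 * v for v in u], 2)
```

The 4×9 matrix of ones has 36 entries, each contributing 1 to the sum of squares, which is why its Frobenius norm is 6.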

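2.4 Calculus

- Differential calculus and integral calculus are the two branches of calculus; the former can be applied to the optimization problems of deep learning.

- A derivative can be interpreted as the instantaneous rate of change of a function with respect to its variable. It is also the slope of the tangent line to the function's curve.

- The derivative of a scalar with respect to a column vector is a row vector.

- The derivative of a column vector with respect to a scalar is a column vector.

- The derivative of a vector with respect to a vector is a matrix.

- A gradient is a vector whose components are the partial derivatives of a multivariate function with respect to all of its variables; it points in the direction in which the function value increases fastest.

- The chain rule lets us differentiate composite functions.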
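The points about derivatives and gradients can be checked numerically. The sketch below estimates each partial derivative with a central difference, taking $f(\mathbf{x}) = 2\mathbf{x}^\top\mathbf{x}$ as an example function whose gradient is known to be $4\mathbf{x}$. The step size `h` and the helper name `numerical_gradient` are choices made for this illustration.

```python
def f(x):
    """f(x) = 2 * x^T x for a vector x given as a list."""
    return 2.0 * sum(v * v for v in x)


def numerical_gradient(func, x, h=1e-5):
    """Central-difference estimate of every partial derivative of func at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]     # nudge only coordinate i upwards
        down = x[:i] + [x[i] - h] + x[i + 1:]   # and downwards
        grad.append((func(up) - func(down)) / (2 * h))
    return grad


x = [0.0, 1.0, 2.0, 3.0]
approx = numerical_gradient(f, x)
exact = [4.0 * v for v in x]   # analytic gradient of 2 * x^T x
print(approx)
print(exact)
```

The two printed vectors agree up to rounding error, with one partial derivative per input coordinate.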

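2.5 Automatic Differentiation

Deep learning frameworks speed up this work by computing derivatives automatically, i.e. by automatic differentiation. In practice, based on the model we design, the system builds a computational graph that tracks which data were combined through which operations to produce the output. Automatic differentiation then lets the system backpropagate gradients (accumulating them in reverse). Here, to backpropagate means to trace through the whole computational graph, filling in the partial derivatives with respect to each parameter.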
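To give a feel for what "building a computational graph and backpropagating through it" means, here is a deliberately tiny reverse-mode sketch for scalars. It is a toy written for this note, not how Paddle or PyTorch are implemented; the class name `Value` and its attributes are hypothetical, and only addition and multiplication are supported.

```python
class Value:
    """A scalar that remembers which inputs and operation produced it."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs of the operation that made this value
        self._local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Post-order walk: inputs are listed before the values computed from them.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk the list in reverse so each node's gradient is complete
        # before it is passed on to the node's inputs.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad  # chain rule, accumulated


x = Value(3.0)
two = Value(2.0)
y = two * x * x   # y = 2 * x^2, recorded as a small graph
y.backward()
print(x.grad)     # 12.0, i.e. dy/dx = 4 * x at x = 3
```

Note that gradients are added to `grad` rather than overwritten, which is why x correctly collects contributions from both places where it is used.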

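1. Computing gradients with the backward() backpropagation function and the grad attribute

The most basic example: differentiate the function y=2x⊤x (y equals the scalar 2 times the dot product of the column vector x with itself) with respect to the column vector x. By the differentiation rules, the gradient is 4x.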

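import torch

x = torch.arange(4.0)
x

x.requires_grad_(True)  # equivalent to x = torch.arange(4.0, requires_grad=True); x.grad is where the gradient is stored
x.grad  # None by default

y = 2 * torch.dot(x, x)  # y is a scalar
y

y.backward()  # call the backpropagation function to compute the gradient of y with respect to each component of x
x.grad  # show the gradient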

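By default, PyTorch accumulates gradients. When the same variable is reused in a new computation, the previous values must be cleared first with the following line: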

x.grad.zero_()

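2. Backpropagation for non-scalar variables

When y is not a scalar, machine learning typically computes the sum of the partial derivatives calculated individually for each example in the batch, as the code below shows.

# Calling backward on a non-scalar requires passing a gradient argument, which specifies the gradient of the differentiated function with respect to self.
# In our example we only want the sum of the partial derivatives, so passing a gradient of ones is appropriate
x.grad.zero_()
y = x * x
# equivalent to y.backward(torch.ones(len(x)))
y.sum().backward()
x.grad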

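3. Using detach() so that the system treats a computed value as a constant

Example: compute the partial derivative of z=u*x with respect to x while treating u as a constant, rather than the partial derivative of z=x*x*x with respect to x.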

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x
 
z.sum().backward()
x.grad == u
 
output:tensor([True, True, True, True])
 
 
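# Because the computation of y was recorded, we can subsequently call backpropagation on y to get the derivative of y = x * x with respect to x, which is 2 * x.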
x.grad.zero_()
y.sum().backward()
x.grad == 2 * x
 
output:tensor([True, True, True, True])

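4. Gradients can still be computed for variables produced through Python control flow (e.g. conditionals, loops or arbitrary function calls).

Example: in the code below, the number of iterations of the while loop and the outcome of the if statement both depend on the value of the input a.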

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c
 
# compute the gradient
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
 
a.grad == d / a
 
output:tensor(True)
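The check `a.grad == d / a` works because, for any fixed input, f only ever multiplies a by constants, so f(a) = k * a for some k that depends on the path taken. This can be confirmed without PyTorch by comparing a finite-difference slope with the ratio d / a. The sketch below is a scalar re-implementation of f written for this note; the input value 0.37 and the step size are arbitrary choices.

```python
def f(a):
    """Scalar version of the control-flow function above (abs plays the role of norm)."""
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    if b > 0:
        c = b
    else:
        c = 100 * b
    return c


a = 0.37   # arbitrary non-zero input; a == 0 would never leave the while loop
d = f(a)

h = 1e-6
finite_difference = (f(a + h) - f(a - h)) / (2 * h)   # numerical slope at a
slope_from_ratio = d / a                              # what a.grad == d / a relies on

print(finite_difference, slope_from_ratio)
```

Both numbers come out at about 4096 for this input, because the loop doubles b eleven times after the initial factor of 2.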