Week 1 Prerequisites
Performance metrics
Accuracy
As the name suggests, the proportion of all predictions (both positive and negative classes) that are correct, out of the total.
Recall
The proportion of actual positives that are correctly predicted as positive. Intuitively: of everything that is truly positive, how much did the model catch?
Precision
The proportion of predicted positives that are actually positive. Intuitively: of everything the model predicted as positive, how much is truly positive?
F-score
The harmonic mean of precision and recall. Formula: $F_1=\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}$
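The four metrics above can be sketched from raw confusion-matrix counts. A minimal sketch; the counts (TP=8, FP=2, FN=4, TN=6) are made-up illustrative values:

```python
# Minimal sketch: accuracy, precision, recall, and F1 from raw counts.
# TP/FP/FN/TN values below are hypothetical.
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # correct positives / predicted positives
    recall = tp / (tp + fn)     # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics(tp=8, fp=2, fn=4, tn=6)
print(acc, p, r, f1)  # 0.7 0.8 0.666... 0.727...
```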
Two main tasks in machine learning
Classification
Regression
Discriminative machine learning
Activation functions
Softmax
Used to transform real-valued discriminative scores (ranging over $(-\infty,+\infty)$) into probability values (ranging over $[0,1]$) while preserving their order.
Formula: $\operatorname{Softmax}(x)_{i}=\frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}$
eg:
Assume that we do sentiment classification.
Data point $x$ is a sentence (e.g., $x$ = "it is a beautiful day today"); we need to predict the sentiment of this sentence. Three sentiment labels: positive (happy, class 1), negative (sad, class 2), neutral (neither happy nor sad, class 3). The model gives three discriminative scores for $x$.
- $h_1=2, h_2=-1, h_3=1$ means the highest likelihood is to classify $x$ into class 1 with the label positive.
We apply softmax function on the discriminative scores ℎ.
$p=\operatorname{softmax}(h)$:
$p_{1}=\frac{\exp \{2\}}{\exp \{2\}+\exp \{-1\}+\exp \{1\}} \approx 0.705$
$p_{2}=\frac{\exp \{-1\}}{\exp \{2\}+\exp \{-1\}+\exp \{1\}} \approx 0.035$
$p_{3}=\frac{\exp \{1\}}{\exp \{2\}+\exp \{-1\}+\exp \{1\}} \approx 0.259$
$p=[0.705, 0.035, 0.259]$ are the probabilities of classifying $x$ into classes 1, 2, 3 respectively.
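The computation above can be sketched directly; a minimal softmax on the example scores $h=[2,-1,1]$:

```python
import math

# Minimal sketch of softmax: exponentiate each score, normalize by the sum.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2, -1, 1])
print([round(v, 3) for v in p])  # [0.705, 0.035, 0.259]
```

Note that the order of the scores is preserved: the largest score (2) gets the largest probability.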
Sigmoid
Formula: $S(x)=\frac{1}{1+e^{-x}}$
Derivative of the sigmoid function: $S^{\prime}(x)=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}=S(x)(1-S(x))$
Properties, advantages, and disadvantages of the sigmoid function:
- The output range is 0 to 1. Because outputs are bounded between 0 and 1, it normalizes the output of each neuron.
- Well suited to models that output a predicted probability, since probabilities also lie between 0 and 1.
- Smooth gradient, avoiding jumps in the output value.
- The function is differentiable, meaning the slope of the sigmoid curve can be found at any point.
- Clear predictions, i.e., values very close to 1 or 0.
- The output is not zero-centered, which reduces the efficiency of weight updates.
- Sigmoid performs an exponential operation, which is slow for computers.
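The sigmoid formula and the derivative identity $S'(x)=S(x)(1-S(x))$ can be sketched as:

```python
import math

# Minimal sketch: sigmoid and its derivative S'(x) = S(x)(1 - S(x)).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0))       # 0.5  (midpoint of the (0, 1) output range)
print(sigmoid_grad(0))  # 0.25 (the derivative's maximum)
```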
Tanh
It solves the problem that the sigmoid output is not zero-centered; however, the vanishing-gradient problem and the cost of the exponential operation remain.
Formula: $\tanh (x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$
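A minimal sketch of tanh from its definition, showing the zero-centered (odd-symmetric) output that sigmoid lacks:

```python
import math

# Minimal sketch: tanh from its definition; compare with math.tanh.
def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(tanh(0))               # 0.0
print(round(tanh(1), 4))     # matches round(math.tanh(1), 4)
print(tanh(-1) == -tanh(1))  # odd symmetry: output is centered at 0
```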
ReLU
Formula: $\operatorname{ReLU}(x)=\max (0, x)$
Derivative: $\sigma^{\prime}(t)=\begin{cases} 1 & \text{if } t \geq 0 \\ 0 & \text{otherwise} \end{cases}$
Properties of the rectified linear unit (ReLU):
- No gradient-saturation problem when the input is positive.
- Much faster to compute. ReLU involves only a linear relation, so it is faster than the sigmoid and tanh functions.
- The dead ReLU problem: when the input is negative, ReLU fails completely. During forward propagation this is not a problem (some regions are sensitive, others are not), but during backpropagation the gradient is exactly zero for negative inputs; the sigmoid and tanh functions suffer from a similar problem.
- The output of ReLU is 0 or positive, which means ReLU is not a zero-centered function.
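ReLU and its (sub)gradient from the piecewise definition above, as a minimal sketch:

```python
# Minimal sketch: ReLU and its piecewise derivative.
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Gradient is exactly 0 for negative inputs: the "dead ReLU" region.
    return 1.0 if x >= 0 else 0.0

print(relu(3.0), relu(-2.0))            # 3.0 0.0
print(relu_grad(3.0), relu_grad(-2.0))  # 1.0 0.0
```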
Loss
Cross-entropy loss
Formula: $l(y, p)=\operatorname{CE}\left(1_{y}, p\right)=-\log p_{y}$
eg1:
Given a sentence $x$ with the label positive/happy (class 1), assume that our model predicts it with probabilities $p=[0.3, 0.4, 0.3]$; the cross-entropy loss for this prediction is:
$l(y, p)=\operatorname{CE}([1,0,0],[0.3,0.4,0.3])=-1 \cdot \log 0.3-0 \cdot \log 0.4-0 \cdot \log 0.3=-\log 0.3 \approx 1.204$
eg2:
| Prediction | Ground truth | Correct? |
| --- | --- | --- |
| 0.1 0.2 0.7 | 0 0 1 (pig) | correct |
| 0.1 0.7 0.2 | 0 1 0 (dog) | correct |
| 0.3 0.4 0.3 | 1 0 0 (cat) | wrong |
$\text{sample 1 loss}=-(0 \cdot \log 0.1+0 \cdot \log 0.2+1 \cdot \log 0.7) \approx 0.36$
$\text{sample 2 loss}=-(0 \cdot \log 0.1+1 \cdot \log 0.7+0 \cdot \log 0.2) \approx 0.36$
$\text{sample 3 loss}=-(1 \cdot \log 0.3+0 \cdot \log 0.4+0 \cdot \log 0.3) \approx 1.20$
Average the loss over all samples:
$L=\frac{0.36+0.36+1.20}{3} \approx 0.64$
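The three-sample computation can be sketched end to end (natural log, as in the worked numbers):

```python
import math

# Minimal sketch: per-sample cross-entropy and the mean over the batch.
def cross_entropy(one_hot, probs):
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

samples = [
    ([0, 0, 1], [0.1, 0.2, 0.7]),  # pig, predicted correctly
    ([0, 1, 0], [0.1, 0.7, 0.2]),  # dog, predicted correctly
    ([1, 0, 0], [0.3, 0.4, 0.3]),  # cat, predicted wrongly
]
losses = [cross_entropy(y, p) for y, p in samples]
mean_loss = sum(losses) / len(losses)
print([round(l, 2) for l in losses], round(mean_loss, 2))  # [0.36, 0.36, 1.2] 0.64
```

Note that the wrong prediction (low probability on the true class) dominates the average loss.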
L1 loss
Formula: $S=\sum_{i=1}^{n}\left|Y_{i}-f\left(x_{i}\right)\right|$
L2 loss
Formula: $S=\sum_{i=1}^{n}\left(Y_{i}-f\left(x_{i}\right)\right)^{2}$
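Both regression losses can be sketched in a few lines; the target and prediction values below are made-up illustrative numbers:

```python
# Minimal sketch: L1 (sum of absolute errors) and L2 (sum of squared errors).
def l1_loss(y_true, y_pred):
    return sum(abs(y - f) for y, f in zip(y_true, y_pred))

def l2_loss(y_true, y_pred):
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred))

y_true = [3.0, -0.5, 2.0]  # hypothetical targets Y_i
y_pred = [2.5, 0.0, 2.0]   # hypothetical predictions f(x_i)
print(l1_loss(y_true, y_pred))  # 1.0
print(l2_loss(y_true, y_pred))  # 0.5
```

L2 squares each error, so it penalizes large errors more heavily than L1.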
Forward propagation
Classification
eg: Example of spam email detection. From emails, assume that we extract three features.
$x=[x_1, x_2, x_3]$. There are two classes and labels: spam ($y=1$) and non-spam ($y=2$).
Network diagram:
Each hidden layer uses the sigmoid function; the final output layer uses the softmax function.
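One forward pass for this spam example can be sketched as below. This is a minimal sketch: the hidden-layer size (2), all weights, biases, and the input features are made-up illustrative values, not anything from the course:

```python
import math

# Minimal sketch of forward propagation for the spam example:
# 3 features -> sigmoid hidden layer (hypothetical size 2) -> softmax over 2 classes.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 0.5, -0.2]                        # features of one email (hypothetical)
W1 = [[0.2, -0.1, 0.4], [0.3, 0.8, -0.5]]   # hidden weights, 2 x 3 (hypothetical)
b1 = [0.1, -0.2]
W2 = [[0.7, -0.3], [-0.6, 0.9]]             # output weights, 2 x 2 (hypothetical)
b2 = [0.0, 0.0]

# Hidden layer: sigmoid of the affine transform W1 x + b1.
h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
# Output layer: softmax of the affine transform W2 h + b2.
scores = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]
p = softmax(scores)  # [P(spam), P(non-spam)]
print(p)
```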
Regression
Training deep nets
Deep model parameters: $\theta:=\left\{\left(W^{l}, b^{l}\right)\right\}_{l=1}^{L}$
Find model parameters (weight matrices and biases) so that the model predictions fit the training set as well as possible.
$\min _{\theta} L(D ; \theta):=-\frac{1}{N} \sum_{i=1}^{N} \log p_{y_{i}}\left(x_{i}\right)$ (minimize the negative log-likelihood)
Use optimizers such as SGD, Adagrad, Adam, or RMSProp to update the model parameters gradually.
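All of these optimizers build on the same gradient-descent update. A minimal sketch using a toy loss $L(w)=(w-3)^2$ in place of the negative log-likelihood (real training computes the gradient via backpropagation):

```python
# Minimal sketch of the gradient-descent update behind these optimizers:
# repeatedly move the parameter against the gradient of the loss.
# Toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)  # dL/dw

w = 0.0   # initial parameter (hypothetical)
lr = 0.1  # learning rate (hypothetical)
for _ in range(100):
    w -= lr * grad(w)  # the plain SGD update rule

print(round(w, 4))  # converges to ~3.0
```

Adagrad, Adam, and RMSProp refine this rule by adapting the learning rate per parameter, but the gradient step itself is the same idea.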