National Taiwan University, Machine Learning (Hung-yi Lee), Spring 2020
Classification models: probabilistic generative model, naive Bayes, logistic regression, multi-class logistic regression, cross-entropy loss, softmax function
These are study notes I compiled while taking the course; I present them in English because I also want to practice the language along the way. If you find any mistakes, please point them out. Discussion is welcome.
Please do not repost without permission.
Classification
Classification as Regression?
Limitations
If we use a regression method to perform classification, it penalizes the examples that are "too correct": the loss sums every sample's distance to its target value, so points that lie far on the correct side still contribute large errors and drag the decision boundary toward them.
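Here is a minimal sketch of that effect, assuming a 1-D toy dataset with labels ±1 and a plain least-squares fit (all values are made up for illustration). The decision boundary, where the fitted line crosses zero, shifts toward the "too correct" points:

```python
import numpy as np

def fit_regression_boundary(x, y):
    """Least-squares fit y ≈ w*x + b; the boundary is where w*x + b = 0."""
    X = np.stack([x, np.ones_like(x)], axis=1)
    w, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return -b / w

# Class 1 (y = +1) clustered near x = 1, Class 2 (y = -1) near x = -1
x = np.array([0.8, 1.0, 1.2, -0.8, -1.0, -1.2])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
print(fit_regression_boundary(x, y))   # ~0.0, a sensible boundary

# Add two "too correct" Class 1 points far to the right
x2 = np.append(x, [8.0, 9.0])
y2 = np.append(y, [1.0, 1.0])
print(fit_regression_boundary(x2, y2))  # ~0.6: boundary dragged toward Class 1
```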
Ideal Alternatives
- Function (Model)

  $$f(x) = \begin{cases} 1, & g(x) > 0 \\ 0, & \text{otherwise} \end{cases}$$

- Loss function

  The number of times $f(x)$ gives an incorrect result on the training data (see the sketch after this list):

  $$L(f) = \sum_n \delta(f(x^n) \ne \hat{y}^n)$$

- Example: Perceptron, SVM
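As the sketch referenced above, here is a minimal implementation of the ideal model and its 0/1 loss, assuming a hypothetical linear discriminant $g(x) = w \cdot x + b$ (the weights and toy data are made up):

```python
import numpy as np

def f(x, w, b):
    """Ideal model: predict class 1 when g(x) = w·x + b > 0, else class 0."""
    return (x @ w + b > 0).astype(int)

def zero_one_loss(x, y, w, b):
    """L(f): the number of training examples where f(x^n) != y^n."""
    return int(np.sum(f(x, w, b) != y))

# Hypothetical weights and four 2-D toy points
w, b = np.array([1.0, -1.0]), 0.0
x = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 0.0], [0.0, 3.0]])
y = np.array([1, 0, 1, 0])
print(zero_one_loss(x, y, w, b))  # 0 -> every point classified correctly
```

Because $\delta(\cdot)$ is a step function, this loss is not differentiable and cannot be minimized by gradient descent; Perceptron and SVM sidestep this with their own optimization strategies.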
Generative Model
This approach is a probabilistic generative model: it models how the data of each class is generated (the class-conditional distribution), and then classifies a new point with Bayes' theorem.
Generative Laws
Bayes Theorem
Bayes' theorem relates the posterior probability of a class to its prior probability and the class-conditional likelihood.
$C_i$: the event that the target belongs to Class $i$
$x$: the feature vector of the target

$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}$$

$$P(x) = P(x|C_1)P(C_1) + P(x|C_2)P(C_2)$$
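A minimal numeric sketch of the two-class posterior; the prior and likelihood values here are hypothetical, chosen only to exercise the formula:

```python
def posterior_c1(lik_c1, prior_c1, lik_c2, prior_c2):
    """P(C1|x) via Bayes' theorem for a two-class problem."""
    evidence = lik_c1 * prior_c1 + lik_c2 * prior_c2  # P(x)
    return lik_c1 * prior_c1 / evidence

# Hypothetical values: P(x|C1)=0.6, P(C1)=0.4, P(x|C2)=0.2, P(C2)=0.6
print(posterior_c1(0.6, 0.4, 0.2, 0.6))  # 0.24 / (0.24 + 0.12) = 2/3
```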
Distribution
Assume the points are sampled from a specific distribution, chosen according to the real-world nature of the training data. For instance, if the features are binary, we may choose a Bernoulli distribution; if they are continuous, a Gaussian distribution is a common choice.
Here we take the Gaussian distribution as an example.
Gaussian Distribution
$$f_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{D/2}(\det\Sigma)^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$$
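Here $\mu$ is the mean vector, $\Sigma$ the covariance matrix, and $D$ the dimension of $x$. A minimal NumPy sketch of this density (the 2-D example values are made up):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density f_{mu,Sigma}(x)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# Hypothetical 2-D example with identity covariance
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
print(gaussian_pdf(np.array([0.0, 0.0]), mu, sigma))  # 1/(2*pi) ≈ 0.159
```

In the generative model, a separate $(\mu, \Sigma)$ pair would be estimated from the training points of each class (e.g., by maximum likelihood) to obtain $P(x|C_i)$.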