Stanford Machine Learning: CS229 Lecture Notes. Logistic Regression (2): Perceptron Learning Algorithm

风先生

Category: ML. Tags: machine learning algorithms

Article link: https://blog.csdn.net/qq_30490125/article/details/52492142

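What this part actually covers is the perceptron algorithm.
The Perceptron Learning Algorithm is placed here, next to logistic regression, because the two algorithms are so similar in form.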

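In logistic regression we consider the binary classification problem, where the output label is $y\in\{0,1\}$, while the prediction of the linear model, $H(x) = w^Tx + b$, is real-valued, so the real value $H(x)$ has to be converted into a 0/1 value. The ideal choice is the unit-step function. In the end we used the logistic function as $g(\cdot)$ in place of the unit-step function and obtained the logistic regression hypothesis: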

$h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$

When we replace the logistic function with a $sign$-like threshold function, we obtain a binary linear classifier: the perceptron.

$g(z)= \begin{cases} 1 &\text{if } z\geq 0\\ 0 &\text{if } z < 0 \end{cases}$
Now let $h_\theta(x)=g\left(\theta^Tx\right)$.
The parameter update rule is:
$\theta_j^{new}=\theta_j^{old}-\alpha\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}_j$
With this we already have the complete perceptron learning algorithm.
Note that this update rule has the same form as the stochastic gradient descent update for logistic regression; only the definition of $h_\theta$ differs.
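As an illustrative sketch (not code from the lecture notes), the threshold function and a single update step can be written in plain Python. The helper names `step`, `dot`, and `perceptron_update`, and the default `alpha=1.0`, are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def step(z):
    """Threshold function g(z): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def dot(theta, x):
    """Inner product theta^T x for two equal-length lists."""
    return sum(t * xi for t, xi in zip(theta, x))


def perceptron_update(theta, x, y, alpha=1.0):
    """One update: theta_j <- theta_j - alpha * (h(x) - y) * x_j, with y in {0, 1}."""
    error = step(dot(theta, x)) - y  # 0 when the prediction is already correct
    return [t - alpha * error * xi for t, xi in zip(theta, x)]


theta = [0.0, 0.0]
# x = [1.0, 2.0] has label 0, but step(0) = 1, so the point is misclassified
theta = perceptron_update(theta, [1.0, 2.0], 0)
print(theta)  # [-1.0, -2.0]
```

When the prediction is already correct, `error` is 0 and `theta` is returned unchanged, which is the only practical difference from the logistic regression update, where the error term is almost never exactly zero.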

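Looking closely, $\theta^Tx=0$ describes a point in 1-dimensional space, a line in 2-dimensional space, and a plane in 3-dimensional space.
Take the 2-dimensional case as an example:

All $x$ with $\theta^Tx<0$ fall in the region on one side of the line (blue in the figure), and all $x$ with $\theta^Tx>0$ fall in the region on the other side (red in the figure).
(Figure: the line $\theta^Tx=0$ dividing the plane into a blue region and a red region.)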
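A small sketch of this geometric picture, assuming each point is written as $(1, x_1, x_2)$ so that the first component of $\theta$ acts as the offset of the line. The function name `side` and the example $\theta$ are hypothetical choices for illustration.

```python
def side(theta, x):
    """Return +1, -1 or 0 depending on where x lies relative to theta^T x = 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return (z > 0) - (z < 0)


# Hypothetical example: theta = (-1, 1, 1) encodes the line x1 + x2 = 1
theta = [-1.0, 1.0, 1.0]
print(side(theta, [1.0, 2.0, 2.0]))  # 1  (one side of the line)
print(side(theta, [1.0, 0.0, 0.0]))  # -1 (the other side)
print(side(theta, [1.0, 0.5, 0.5]))  # 0  (exactly on the line)
```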

Perceptron Learning

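The goal of the Perceptron Learning Algorithm is to find a perceptron that correctly separates points of different classes. In the 2-dimensional plane, any line can serve as a perceptron; some perceptrons classify well (few mistakes) and some classify poorly (many mistakes).
Naturally, the Perceptron Learning Algorithm needs to find a good perceptron.
Under the assumption that the data are linearly separable, at least one perceptron achieves 100% accuracy: for every $(x_i,y_i)$ it satisfies $h_\theta(x_i)=y_i$. Perceptron learning aims to find such a perceptron.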
We denote this perfect perceptron by

$h_{\theta_{best}}(x) = g\left(\theta_{best}^Tx\right)$
Starting from an initial perceptron, learning repeatedly adjusts the parameters $\theta$ of $h_\theta(x)$ until it finally becomes a perfect perceptron.

Perceptron Learning Algorithm (PLA): a "correct your mistakes" algorithm

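PLA works as follows.

The cost function of PLA is $J(\theta,b) = - \sum_{i\in M}y_i(\theta^Tx_i+b)$, where $M$ is the set of indices of the misclassified samples and the labels are $y_i\in\{-1,+1\}$.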

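To simplify the argument that follows, we make a few small changes:
1. Negative samples are labeled -1.
2. Because of change 1, the function $g(\cdot)$ is replaced by:

$sign(z)= \begin{cases} 1 &\text{if } z\geq 0\\ -1 &\text{if } z < 0 \end{cases}$
3. Given the above, the parameter update rule changes as well.
To minimize the cost function $J$, take its partial derivatives with respect to $\theta$ and $b$, where the sums run over the set $M$ of misclassified samples:
$\nabla_\theta J = -\sum_{i\in M} y_ix_i\\ \nabla_b J = -\sum_{i\in M} y_i$
So for each misclassified sample we only need to apply the following update (a stochastic gradient step with step size 1):
$\theta_j^{new}=\theta_j^{old}+y_ix^{(j)}_i$
where $x^{(j)}_i$ is the $j$-th component of sample $x_i$. From here on only $\theta$ is written; one way to read this is that the bias $b$ is absorbed into $\theta$ by appending a constant feature 1 to every $x_i$.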
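To make the cost function and its gradient concrete, here is a minimal sketch in plain Python. The function name `perceptron_cost_and_grad` and the three toy samples are assumptions for illustration; a sample is counted as misclassified when $y_i(\theta^Tx_i+b)\leq 0$.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron_cost_and_grad(theta, b, X, Y):
    """Cost -sum_{i in M} y_i (theta^T x_i + b) over misclassified samples, plus gradients."""
    cost = 0.0
    grad_theta = [0.0] * len(theta)
    grad_b = 0.0
    for x, y in zip(X, Y):
        margin = y * (dot(theta, x) + b)
        if margin <= 0:  # misclassified, or exactly on the boundary
            cost -= margin
            grad_theta = [g - y * xi for g, xi in zip(grad_theta, x)]
            grad_b -= y
    return cost, grad_theta, grad_b


# Toy data with labels in {-1, +1}; only the third sample is misclassified
X = [[2.0, 1.0], [-1.0, -1.0], [1.0, -2.0]]
Y = [1, -1, -1]
print(perceptron_cost_and_grad([1.0, 0.0], 0.0, X, Y))  # (1.0, [1.0, -2.0], 1.0)
```

Moving against this gradient for the single misclassified sample adds $y_ix_i$ to $\theta$, which is exactly the per-sample update given above.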

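For $t=0,1,\dots$ (where $t$ counts the updates made so far):
Cycle through the whole training set, where $x_i$ denotes the $i$-th sample.

If $y_{i(t)}\neq sign\left(\theta_t^Tx_{i(t)}\right)$, where $(x_{i(t)},y_{i(t)})$ is the misclassified sample found at step $t$, then

$\theta_{t+1}=\theta_{t}+y_{i(t)}x_{i(t)}$
Repeat until no misclassified point is found, and return the $\theta$ of the last iteration as $\theta_{best}$.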
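The loop above can be sketched in Python roughly as follows. This is a minimal illustration under a few assumptions that the notes do not state: the function name `pla`, the `max_passes` safety cap, the toy data, and the convention that the first component of every $x_i$ is a constant 1 standing in for the bias.

```python
def sign(z):
    """sign(z) as defined in the notes: 1 if z >= 0, else -1."""
    return 1 if z >= 0 else -1


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def pla(X, Y, max_passes=1000):
    """Cycle through the data, correcting theta on every mistake, until a clean pass."""
    theta = [0.0] * len(X[0])
    updates = 0
    for _ in range(max_passes):
        mistakes = 0
        for x, y in zip(X, Y):
            if sign(dot(theta, x)) != y:
                # theta_{t+1} = theta_t + y * x
                theta = [t + y * xi for t, xi in zip(theta, x)]
                updates += 1
                mistakes += 1
        if mistakes == 0:
            return theta, updates
    raise RuntimeError("no separating theta found within max_passes")


# Linearly separable toy data; labels are in {-1, +1}
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -1.0], [1.0, -2.0, 0.0]]
Y = [1, 1, -1, -1]
theta, updates = pla(X, Y)
print(theta, updates)  # [-1.0, 1.0, 1.0] 1
```

The `max_passes` cap is there only so that the sketch terminates on data that are not linearly separable; on separable data the loop ends by itself.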

Convergence proof of the Perceptron Learning Algorithm (PLA)

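When the data are linearly separable, we have in mind a perfect $\theta_{best}$ that separates the circles from the crosses without error. How do we show that PLA moves $\theta$ steadily closer to $\theta_{best}$?

Here we use the formula for the cosine of the angle between two vectors. Let $t$ be the number of updates of $\theta$. If the cosine of the angle between the updated $\theta_{t+1}$ and $\theta_{best}$ is larger than it was for $\theta_t$ (the angle is smaller), we can regard PLA as effective.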

When the data are linearly separable, $\theta_{best}$ necessarily exists, so every $(x_i,y_i)$ in the training set is classified correctly by it, and therefore:

$y_{i(t)}\theta_{best}^Tx_{i(t)}\geq \min_{i}y_i\theta_{best}^Tx_i>0$

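Next we prove that the simple perceptron algorithm converges when the data are linearly separable.
Initially $\theta_{0}=0$, and an update happens only when a misclassified sample is encountered:

$y_{i(t)}\neq sign\left(\theta_t^Tx_{i(t)}\right)\implies y_{i(t)}\theta_t^Tx_{i(t)}\leq0$
(The converse can fail only in the boundary case $\theta_t^Tx_{i(t)}=0$; the proof needs only this direction.)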
After $t$ updates:

$\begin{align} \theta_{best}^T\theta_{t}=\theta_{best}^T(\theta_{t-1}+y_{i(t-1)}x_{i(t-1)})&\geq\theta_{best}^T\theta_{t-1}+\min_{i}y_i\theta_{best}^Tx_i \tag 1\\ &\geq \dots\tag i \\ &\geq\theta_{best}^T\theta_{0}+t\min_{i}y_i\theta_{best}^Tx_i \tag t\\ &=t\min_{i}y_i\theta_{best}^Tx_i \tag {t+1}\\ \end{align}$

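It looks as if $\theta_t$ is getting closer to $\theta_{best}$, but a growing inner product does not by itself mean a shrinking angle; it could also be that $|\theta_{t}|$ has grown. So the growth of $|\theta_t|$ has to be bounded as well.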
Proof:
We want the cosine of the angle between $\theta_{best}$ and $\theta_{t}$, where $\theta_{t}$ is reached from $\theta_{0}$ after $t$ corrections. First bound the norm of $\theta_t$. Because an update is made only on a misclassified sample, $y_{i(t)}\theta_t^Tx_{i(t)}\leq0$, which gives step (3):

$\begin{align} |\theta_{t+1}|^2&=|\theta_{t}+y_{i(t)}x_{i(t)}|^2 \tag 1\\ &= |\theta_{t}|^2+2y_{i(t)}\theta_t^Tx_{i(t)}+|y_{i(t)}x_{i(t)}|^2\tag 2 \\ &\leq|\theta_{t}|^2+|y_{i(t)}x_{i(t)}|^2 \tag 3\\ &\leq|\theta_{t}|^2+\max_i|y_{i}x_{i}|^2 \tag {4}\\ \end{align}$
Applying this inequality $t$ times, starting from $\theta_0=0$, gives:

$|\theta_{t}|^2\leq t\max_{i}|y_{i}x_{i}|^2$

Writing $\angle$ for the angle between $\theta_{best}$ and $\theta_t$, the bound on the inner product and the bound on the norm combine to:

$\cos(\angle)=\frac{\theta_{best}^T\theta_{t}}{|\theta_{best}||\theta_{t}|}\geq \frac{t\min_{i}y_i\theta_{best}^Tx_i}{|\theta_{best}||\theta_{t}|}\geq\frac{t\min_{i}y_i\theta_{best}^Tx_i}{|\theta_{best}|\sqrt{t}\max_{i}|y_{i}x_{i}|}=\frac{\sqrt{t}\min_{i}y_i\theta_{best}^Tx_i}{|\theta_{best}|\max_{i}|y_{i}x_{i}|}=\sqrt{t}\cdot\text{const}$
Since the cosine of an angle is at most 1, we have:

$1\geq\frac{\theta_{best}^T\theta_{t}}{|\theta_{best}||\theta_{t}|}\geq\sqrt{t}\cdot\text{const}$

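This inequality tells us two things:

PLA helps $\theta_{t}$ improve: the lower bound on the cosine of the angle between $\theta_{t}$ and $\theta_{best}$ grows like $\sqrt{t}$ with the number of corrections $t$, so $\theta_{t}$ gets closer and closer to $\theta_{best}$ in direction.
PLA halts: since $\sqrt{t}\cdot\text{const}\leq1$, the number of updates satisfies $t\leq 1/\text{const}^2$, so when the data are linearly separable a perceptron that separates the data perfectly is found after finitely many updates.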
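The inequality $1\geq\sqrt{t}\cdot\text{const}$ caps the number of updates at $1/\text{const}^2=\frac{|\theta_{best}|^2\max_i|x_i|^2}{(\min_i y_i\theta_{best}^Tx_i)^2}$ (using $|y_ix_i|=|x_i|$ for labels in $\{-1,+1\}$). The sketch below checks this numerically on a toy data set. Everything specific in it is an assumption for illustration: the function names, the data, and the hand-picked separating vector `theta_best` (any vector that separates the data with positive margin can play that role in the proof). A sample is treated as a mistake when $y\,\theta^Tx\leq0$, the condition the proof uses.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def count_updates(X, Y, max_passes=1000):
    """Run PLA from theta = 0 and return (theta, number of updates)."""
    theta = [0.0] * len(X[0])
    t = 0
    for _ in range(max_passes):
        clean = True
        for x, y in zip(X, Y):
            if y * dot(theta, x) <= 0:  # mistake condition used in the proof
                theta = [th + y * xi for th, xi in zip(theta, x)]
                t += 1
                clean = False
        if clean:
            return theta, t
    raise RuntimeError("no separating theta found within max_passes")


def update_bound(theta_best, X, Y):
    """Upper bound |theta_best|^2 * max|x|^2 / (min margin)^2 on the number of updates."""
    rho = min(y * dot(theta_best, x) for x, y in zip(X, Y))
    if rho <= 0:
        raise ValueError("theta_best does not separate the data")
    r_squared = max(dot(x, x) for x in X)
    return r_squared * dot(theta_best, theta_best) / rho ** 2


X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -1.0], [1.0, -2.0, 0.0]]
Y = [1, 1, -1, -1]
theta_best = [0.0, 1.0, 1.0]  # hypothetical separating vector chosen by hand

theta, t = count_updates(X, Y)
bound = update_bound(theta_best, X, Y)
print(t, bound)  # 1 5.5
```

The bound is usually far from tight; it only guarantees that the number of updates is finite.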

Pocket Algorithm

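When the data are not linearly separable (noise is present), the simple PLA algorithm cannot converge: some sample is always misclassified, so the updates never stop. What we want instead is an approximate result, that is, a $\theta_g$ that misclassifies as few samples as possible.

Differences from simple PLA: the number of iterations is finite and fixed in advance; misclassified samples are picked at random rather than by cycling through the data; and $\theta_g$ is updated only when the newly obtained $\theta$ is better than the best $\theta_g$ found so far (better meaning fewer misclassified samples).
Because every new $\theta$ has to be compared with the previous $\theta_g$ by error rate, which takes a full pass over the data, before deciding whether to update $\theta_g$, the pocket algorithm is less efficient than simple PLA.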
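A minimal sketch of the pocket idea in plain Python follows. The function names, the `max_updates` and `seed` parameters, and the toy data are assumptions for illustration; the fifth sample is deliberately mislabeled noise that lies midway between two positive samples, so no $\theta$ can classify everything correctly.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def errors(theta, X, Y):
    """Number of samples with y * theta^T x <= 0."""
    return sum(1 for x, y in zip(X, Y) if y * dot(theta, x) <= 0)


def pocket(X, Y, max_updates=100, seed=0):
    """PLA updates on randomly chosen mistakes, keeping the best theta seen so far."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    best_theta, best_err = theta, errors(theta, X, Y)
    for _ in range(max_updates):
        wrong = [i for i in range(len(X)) if Y[i] * dot(theta, X[i]) <= 0]
        if not wrong:
            break  # the current theta is already perfect
        i = rng.choice(wrong)  # pick a misclassified sample at random
        theta = [t + Y[i] * xi for t, xi in zip(theta, X[i])]
        err = errors(theta, X, Y)  # full pass over the data: the extra cost of pocket
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err


X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -1.0], [1.0, -2.0, 0.0],
     [1.0, 2.5, 1.5]]
Y = [1, 1, -1, -1, -1]  # the last label is noise
theta_g, err_g = pocket(X, Y)
print(theta_g, err_g)
```

The call to `errors` after every update is the full pass over the data that makes the pocket algorithm slower than simple PLA.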
