Summary
The perceptron is a linear model for binary classification whose output takes the two values $+1$ and $-1$. The perceptron corresponds to a separating hyperplane that divides instances in the input space into positive and negative classes, and it is a discriminative model.
The Perceptron Model
Definition (Perceptron): Suppose the input space is $\mathcal{X} \subseteq \mathbf{R}^n$ and the output space is $\mathcal{Y} = \{+1, -1\}$. The input $x \in \mathcal{X}$ is the feature vector of an instance, corresponding to a point in the input space; the output $y \in \mathcal{Y}$ is the class of the instance. The following function from the input space to the output space:
$$f(x) = \mathrm{sign}(w \cdot x + b)$$
is called the perceptron.
where $\mathrm{sign}(x)$ denotes:
$$\mathrm{sign}(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases}$$
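A minimal sketch of this decision function in NumPy. Note that `np.sign(0)` returns 0, which does not match the definition above, so a small wrapper is used; the weights and input here are arbitrary values for illustration only.

import numpy as np

def sign(x):
    # Perceptron sign function: +1 for x >= 0, -1 for x < 0.
    return np.where(x >= 0, 1, -1)

w, b = np.array([1.0, -2.0]), 0.5   # hypothetical parameters
x = np.array([3.0, 1.0])            # hypothetical input
print(sign(w.dot(x) + b))           # f(x) = sign(w·x + b) -> 1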
Geometric interpretation of the perceptron:
The linear equation
$$w \cdot x + b = 0$$
corresponds to a hyperplane $S$ in the feature space $\mathbf{R}^n$, where $w$ is the normal vector of the hyperplane and $b$ is its intercept. This hyperplane divides the feature space into two parts, and the points lying in the two parts are classified as positive and negative respectively. The hyperplane $S$ is therefore called the separating hyperplane.
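To make the geometry concrete, the short sketch below (using a hypothetical $w$ and $b$) determines on which side of the hyperplane each point lies from the sign of $w \cdot x + b$.

import numpy as np

w = np.array([2.0, -1.0])        # hypothetical normal vector of S
b = -1.0                         # hypothetical intercept

points = np.array([[2.0, 1.0],   # w·x + b =  2 -> positive side
                   [0.0, 2.0]])  # w·x + b = -3 -> negative side

# The sign of w·x + b tells which side of S each point falls on.
print(np.where(points.dot(w) + b >= 0, 1, -1))  # [ 1 -1]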
Linear Separability of a Data Set
Definition 2.2 (Linear separability of a data set): Given a data set $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} = \mathbf{R}^n$, $y_i \in \mathcal{Y} = \{+1, -1\}$, $i = 1, 2, \cdots, N$, if there exists some hyperplane $S$:
$$w \cdot x + b = 0$$
that divides the positive and negative instance points of the data set completely and correctly onto the two sides of the hyperplane, i.e. $w \cdot x_i + b > 0$ for all instances with $y_i = +1$ and $w \cdot x_i + b < 0$ for all instances with $y_i = -1$, then the data set $T$ is called a linearly separable data set; otherwise, $T$ is said to be linearly inseparable.
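As a quick illustration of the definition, the sketch below (with a small hypothetical data set) checks whether a given $(w, b)$ separates every point, i.e. whether $y_i(w \cdot x_i + b) > 0$ holds for all $i$.

import numpy as np

def separates(w, b, X, y):
    # True if the hyperplane w·x + b = 0 classifies every point correctly.
    return bool(np.all(y * (X.dot(w) + b) > 0))

# Tiny hypothetical data set, for illustration only.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(separates(np.array([1.0, 1.0]), -3.5, X, y))  # True: this hyperplane separates T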
Perceptron Learning Strategy
In what follows we assume that the training data set is linearly separable.
Definition of the loss function:
A natural choice of loss function is the total number of misclassified points, but such a loss function is not a continuous, differentiable function of the parameters, which makes it hard to optimize. We therefore choose the total distance from the misclassified points to the hyperplane $S$. First, the distance from an arbitrary point $x_0$ in the input space to the separating hyperplane is
$$\frac{1}{||w||}\,|w \cdot x_0 + b|$$
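A one-line numerical check of this distance formula, with a hypothetical hyperplane and point:

import numpy as np

def distance_to_hyperplane(w, b, x0):
    # |w·x0 + b| / ||w||
    return abs(w.dot(x0) + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0                          # hypothetical hyperplane
print(distance_to_hyperplane(w, b, np.array([0.0, 0.0])))  # |0 - 5| / 5 = 1.0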
Second, for a misclassified point $(x_i, y_i)$ we have
$$-y_i(w \cdot x_i + b) > 0$$
Therefore the distance above can be rewritten as
$$-\frac{1}{||w||}\,y_i(w \cdot x_i + b)$$
The benefit of doing this is that the absolute value in the distance disappears, which makes the later optimization more convenient.
Let $M$ be the set of points misclassified by the hyperplane $S$. Then the total distance from all misclassified points to $S$ is
$$-\frac{1}{||w||}\sum_{x_i \in M} y_i(w \cdot x_i + b)$$
The loss function is defined as
$$L(w, b) = -\frac{1}{||w||}\sum_{x_i \in M} y_i(w \cdot x_i + b)$$
The next step is to minimize this loss function:
$$\min_{w, b} L(w, b) = -\frac{1}{||w||}\sum_{x_i \in M} y_i(w \cdot x_i + b)$$
For the moment, consider a single misclassified point $(x_i, y_i)$ in $M$:
$$L(w, b) = -\frac{y_i(w \cdot x_i + b)}{||w||} = -\frac{y_i \cdot w \cdot x_i}{||w||} - \frac{y_i \cdot b}{||w||} \tag{1}$$
When $w$ and $b$ are scaled by a factor $c$, i.e. $w \rightarrow c \cdot w$, $b \rightarrow c \cdot b$, the norm scales as $||w|| \rightarrow c\,||w||$, and the numerator of the first term on the right-hand side of (1) is also scaled by $c$, i.e. $y_i \cdot w \cdot x_i \rightarrow c \cdot y_i \cdot w \cdot x_i$, so the term $-\frac{y_i \cdot w \cdot x_i}{||w||}$ is unchanged as a whole. Similarly, because $b$ changes as well, $y_i \cdot b \rightarrow c \cdot y_i \cdot b$, and the second term is also unchanged as a whole.
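A quick numerical check of this scale invariance, with a hypothetical misclassified point and hyperplane:

import numpy as np

w, b = np.array([1.0, -2.0]), 0.5    # hypothetical parameters
x_i, y_i = np.array([1.0, 1.0]), 1   # y_i (w·x_i + b) = -0.5 < 0: misclassified

def point_term(w, b):
    # The per-point term of equation (1): -y_i (w·x_i + b) / ||w||
    return -y_i * (w.dot(x_i) + b) / np.linalg.norm(w)

c = 7.0
print(point_term(w, b))          # ~0.2236
print(point_term(c * w, c * b))  # same value: scaling w and b by c changes nothing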
In other words, when the separating hyperplane $w \cdot x + b = 0$ is scaled by a factor $c$ as a whole, the distance from the point $(x_i, y_i)$ to the hyperplane does not change. We can therefore impose the condition $||w|| = 1$ on the coefficient $w$ in $w \cdot x + b = 0$; that is, we divide the equation $w \cdot x + b = 0$ through by $||w||$, after which the hyperplane becomes

$$\frac{w}{||w||} \cdot x + \frac{b}{||w||} = 0$$
Writing $w' = \frac{w}{||w||}$ and $b' = \frac{b}{||w||}$, this is

$$w' \cdot x + b' = 0$$
We can therefore use this hyperplane directly as the separating hyperplane; in simplified form,

$$\begin{aligned} & w \cdot x + b = 0 \\ \text{s.t.} \quad & ||w|| = 1 \end{aligned}$$
To summarize the argument above: when we normalize $w$, i.e. require $||w|| = 1$, the minimization of $L(w, b)$ is not affected. So, to simplify (1), we define the loss function as
$$L(w, b) = -\sum_{x_i \in M} y_i(w \cdot x_i + b) \tag{2}$$
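The loss (2) is straightforward to compute directly; a minimal sketch, with a small hypothetical data set and deliberately bad parameters:

import numpy as np

def perceptron_loss(w, b, X, y):
    # Loss (2): minus the sum of y_i (w·x_i + b) over misclassified points.
    margins = y * (X.dot(w) + b)
    return -np.sum(margins[margins <= 0])   # points with non-positive margin form M

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])        # hypothetical data
y = np.array([1, 1, -1])
print(perceptron_loss(np.array([-1.0, 0.0]), 2.0, X, y))  # 4.0: all three points are misclassified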
The minimization problem is
$$\min_{w, b} L(w, b) = -\sum_{x_i \in M} y_i(w \cdot x_i + b)$$
The gradients of the loss function are
$$\begin{aligned} \nabla_w L(w, b) &= -\sum_{x_i \in M} y_i x_i \\ \nabla_b L(w, b) &= -\sum_{x_i \in M} y_i \end{aligned}$$
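These gradients are simply sums over the misclassified set $M$; a minimal sketch, using the same hypothetical data as above:

import numpy as np

def perceptron_gradients(w, b, X, y):
    # Gradients of loss (2) over the currently misclassified points.
    margins = y * (X.dot(w) + b)
    M = margins <= 0                                     # misclassified points
    grad_w = -np.sum(y[M].reshape(-1, 1) * X[M], axis=0)  # -sum_{x_i in M} y_i x_i
    grad_b = -np.sum(y[M])                                # -sum_{x_i in M} y_i
    return grad_w, grad_b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])   # hypothetical data
y = np.array([1, 1, -1])
print(perceptron_gradients(np.zeros(2), 0.0, X, y))  # (array([-6., -5.]), -1)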
Randomly pick a misclassified point $(x_i, y_i)$ and update $w$ and $b$:
$$\begin{aligned} w &\leftarrow w + \eta y_i x_i \\ b &\leftarrow b + \eta y_i \end{aligned}$$
where $\eta$ ($0 < \eta \leq 1$) is the step size, also called the learning rate.
Algorithm:
Algorithm (original form of the perceptron learning algorithm)
Input: data set T = {(x_1, y_1), ..., (x_N, y_N)}, learning rate eta
Output: w, b; perceptron model f(x) = sign(w*x + b)
(1) Choose initial values w, b
(2) Pick a sample (x_i, y_i) from the training set;
(3) If y_i*(w*x_i + b) <= 0:
        w = w + eta*y_i*x_i
        b = b + eta*y_i
(4) Go back to (2) until the training set contains no misclassified points.
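A minimal, self-contained sketch of this original-form algorithm (the fuller class implementation follows under Program):

import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    # Repeatedly scan the data, updating on each misclassified point,
    # until no point is misclassified (assumes linearly separable data).
    w, b = np.zeros(X.shape[1]), 0.0          # step (1): initial values
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):            # step (2): pick a sample
            if y_i * (w.dot(x_i) + b) <= 0:   # step (3): misclassified
                w += eta * y_i * x_i
                b += eta * y_i
                errors += 1
        if errors == 0:                       # step (4): stop when no mistakes remain
            return w, b
    return w, b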
Program:
import numpy as np
import matplotlib.pyplot as plt

class Perceptron():
    def __init__(self, alpha=0.1, iterator_num=200):
        self.alpha = alpha                # learning rate eta
        self.iterator_num = iterator_num  # maximum number of updates
        self.weight = None
        self.bais = 0

    def __call__(self, x_train, y_train, mothed='SGD', plot=False):
        self.train(x_train, y_train, mothed=mothed, plot=plot)
        return self.weight, self.bais

    def train(self, x_train, y_train, mothed='SGD', plot=False):
        x_train = np.array(x_train)
        y_train = np.array(y_train)
        self.x_train = x_train
        self.y_train = y_train
        N, n_features = x_train.shape
        assert N == len(y_train)
        self.weight = np.zeros((n_features, 1))
        if mothed == 'GD':
            self.__gradient_decent__(x_train, y_train)
        elif mothed == 'SGD':
            self.__stochastic_gradient_decent__(x_train, y_train)
        else:
            raise ValueError("method must choose between 'GD' and 'SGD'")
        if plot:
            self.__plot(x_train, y_train)

    def pricision(self, x_test, y_test):
        # Fraction of test points predicted with the correct sign.
        y_hat = np.sign(x_test.dot(self.weight) + self.bais)
        y_hat = y_hat.reshape(y_test.shape)
        N = y_test.shape[0]
        return np.sum(y_hat == y_test) / N

    def __gradient_decent__(self, x_train, y_train):
        # Batch updates using all currently misclassified points.
        y_train = y_train.reshape((len(y_train), 1))
        f = y_train * (np.dot(x_train, self.weight) + self.bais)
        index = np.where(f <= 0)[0]
        i = 0
        while i < self.iterator_num and len(index):
            self.weight += self.alpha * x_train[index, :].T.dot(y_train[index])
            self.bais += self.alpha * np.sum(y_train[index, :])
            f = y_train * (np.dot(x_train, self.weight) + self.bais)
            index = np.where(f <= 0)[0]
            i += 1

    def __stochastic_gradient_decent__(self, x_train, y_train):
        # Update with one randomly chosen misclassified point at a time.
        y_train = y_train.reshape((len(y_train), 1))
        f = (x_train.dot(self.weight) + self.bais) * y_train
        index = np.where(f <= 0)[0]
        i = 0
        while i < self.iterator_num and len(index):
            random_index = np.random.randint(f[index].shape[0])
            x, y = x_train[index[random_index], :], y_train[index[random_index]]
            x = x.reshape(self.weight.shape)
            self.weight += x * self.alpha * y[0]
            self.bais += self.alpha * y[0]
            f = y_train * (np.dot(x_train, self.weight) + self.bais)
            index = np.where(f <= 0)[0]
            i += 1

    def __plot(self, x_train, y_train):
        # Scatter the training points and draw the learned separating line.
        w, b = self.weight.ravel(), self.bais
        x1, x2 = x_train[:, 0], x_train[:, 1]
        plt.scatter(x1, x2, marker='o', c=y_train)
        x_min, x_max = min(x1), max(x1)
        y_min = (-b - w[0] * x_min) / w[1]
        y_max = (-b - w[0] * x_max) / w[1]
        plt.plot([x_min, x_max], [y_min, y_max], color='black')
        plt.show()
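The driver code below uses a helper `makeLinearDatasets` that is not shown in this post. A minimal sketch of what such a helper might look like, assuming it samples 2-D points and labels them by which side of the line defined by the given weight vector they fall on (producing a linearly separable array of rows [x1, x2, label]):

def makeLinearDatasets(w, n_samples, bias=0.0, seed=0):
    # Hypothetical helper: generate a linearly separable data set.
    rng = np.random.RandomState(seed)
    X = rng.uniform(-10, 10, size=(n_samples, len(w)))
    y = np.where(X.dot(w) + bias >= 0, 1, -1)
    return np.hstack([X, y.reshape(-1, 1)])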
percepton = Perceptron()
datasets = makeLinearDatasets(np.array([1, 1]), 1000)  # 1000 linearly separable samples
X, y = datasets[:, :-1], datasets[:, -1]
x_train, y_train = X[:750], y[:750]                    # 75% / 25% train-test split
x_test, y_test = X[750:], y[750:]
w, b = percepton(x_train, y_train, plot=True)
print(percepton.pricision(x_test, y_test))             # test accuracy