Machine Learning Illustrated with Examples: Perceptron





As shown in Figure 1, suppose we are given a linearly separable training set with three examples $\mathrm x_1,\mathrm x_2,\mathrm x_3$, each labeled either positive (red squares) or negative (blue circles). Here $x^{(1)},x^{(2)}$ denote the $2$ features of a training example. Our goal is to find a hyperplane (a straight line in two dimensions) that separates the positive examples from the negative ones. Clearly there are infinitely many such lines; the two lines drawn in the figure, $f(\mathrm x)=x^{(1)}+x^{(2)}-3=0$ and $f(\mathrm x)=2x^{(1)}+x^{(2)}-5=0$, are two of them. For either line we find $f(\mathrm x_1)>0$, $f(\mathrm x_2)>0$, $f(\mathrm x_3)<0$, so the sign of $f(\mathrm x)$ can be used to decide which class a new example $\mathrm x$ belongs to.


$$f(\mathrm x)=\mathbb I(\omega_0+\mathrm w^{\mathrm T}\mathrm x)\tag{1}$$
where $\mathbb I$ is the indicator function, defined as
$$\mathbb I(x)=\begin{cases} -1,\quad x<0\\ +1,\quad x>0 \end{cases}\tag{2}$$
From (1) and (2), in the example above we have $f(\mathrm x_1)=f(\mathrm x_2)=1>0$ and $f(\mathrm x_3)=-1<0$. Note: the indicator function takes the values $-1,+1$ only to distinguish positive from negative examples; other values would work as well.
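
As a small illustration of (1) and (2), the sketch below evaluates the perceptron output for the three training points on the line $x^{(1)}+x^{(2)}-3=0$. The helper name `predict` is hypothetical, the coordinates are those used in the appendix code, and mapping a score of exactly $0$ to $-1$ is an assumption, since (2) leaves that case undefined.

```python
import numpy as np

def predict(w0, w, x):
    """Return +1 or -1 from the sign of w0 + w^T x, following eqs. (1)-(2)."""
    score = w0 + np.dot(w, x)
    # Assumption: eq. (2) does not define the value at 0; we return -1 there.
    return 1 if score > 0 else -1

# The line x^(1) + x^(2) - 3 = 0 corresponds to w0 = -3, w = (1, 1).
w0, w = -3.0, np.array([1.0, 1.0])

# Training points as used in the appendix code.
points = {"x1": (3, 3), "x2": (4, 3), "x3": (1, 1)}
labels = {name: predict(w0, w, np.array(p)) for name, p in points.items()}
print(labels)
```

The two positive examples receive $+1$ and the negative example receives $-1$, matching the signs stated above.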


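A classifier is good if it classifies the training examples correctly, so a natural error function is the number of misclassified examples. That function, however, is not continuous in the parameters, which makes it hard to optimize. Instead, we use the distance from the misclassified points to the separating hyperplane as the error function. Recall from school geometry that the distance from a point $(x,y)$ to a line $ax+by+c=0$ is $\lvert ax+by+c\rvert/\sqrt{a^2+b^2}$. Similarly, the distance from a point $\mathrm x$ to the hyperplane $\omega_0+\mathrm w^{\mathrm T}\mathrm x=0$ is
$$d=\frac{\lvert\omega_0+\mathrm w^{\mathrm T}\mathrm x\rvert}{\lvert\mathrm w\rvert}\tag{3}$$
where $\lvert\mathrm w\rvert$ denotes the Euclidean norm of $\mathrm w$.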
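
To make the distance formula concrete, here is a minimal sketch that computes the point-to-hyperplane distance, taking the absolute value of the numerator so the result is non-negative. The function name `distance_to_hyperplane` is hypothetical; the point $(3,3)$ and the line $x^{(1)}+x^{(2)}-3=0$ come from the running example.

```python
import numpy as np

def distance_to_hyperplane(w0, w, x):
    """Distance from point x to the hyperplane w0 + w^T x = 0."""
    return abs(w0 + np.dot(w, x)) / np.linalg.norm(w)

# Distance from x1 = (3, 3) to the line x^(1) + x^(2) - 3 = 0
d = distance_to_hyperplane(-3.0, np.array([1.0, 1.0]), np.array([3.0, 3.0]))
print(d)  # |-3 + 3 + 3| / sqrt(2) = 3 / sqrt(2)
```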

For a misclassified example $\mathrm x_i$, the predicted output is $\hat y_i=f(\mathrm x_i)$. Suppose the prediction is negative, i.e., $\hat y_i=-1$. Because the example is misclassified, its true label is $y_i=+1$, while the input to the indicator function satisfies $\omega_0+\mathrm w^{\mathrm T}\mathrm x_i<0$. By (3), the distance from a misclassified $\mathrm x_i$ to the separating hyperplane is therefore
$$d(\mathrm x_i)=-\frac{\omega_0+\mathrm w^{\mathrm T}\mathrm x_i}{\lvert\mathrm w\rvert} y_i\tag{4}$$
Note: (4) is the form that (3) takes for a misclassified point. In that case the input $\omega_0+\mathrm w^{\mathrm T}\mathrm x_i$ and the true label $y_i$ have opposite signs, so the leading minus sign makes the distance non-negative. For the set $\mathcal D_{error}$ of all misclassified examples, the error function can be written as
$$E=\sum\limits_{\mathrm x_i\in\mathcal D_{error}}{d(\mathrm x_i)}=-\sum\limits_{i=1}^{\lvert\mathcal D_{error}\rvert}\frac{\omega_0+\mathrm w^{\mathrm T}\mathrm x_i}{\lvert\mathrm w\rvert}y_i\tag{5}$$
To make minimizing the loss $E$ by gradient descent more convenient, we can impose the constraint $\lvert\mathrm w\rvert=1$, which turns (5) into
$$E=\sum\limits_{\mathrm x_i\in\mathcal D_{error}}{d(\mathrm x_i)}=-\sum\limits_{i=1}^{\lvert\mathcal D_{error}\rvert}(\omega_0+\mathrm w^{\mathrm T}\mathrm x_i)y_i\tag{6}$$
It turns out, however, that the constraint $\lvert\mathrm w\rvert=1$ has no effect on the final result (the optimal separating hyperplane). Take the example above: for the training set $\mathrm x_1,\mathrm x_2,\mathrm x_3$, suppose the optimal separating hyperplane is $f(\mathrm x)=x^{(1)}+x^{(2)}-3=0$, with $\lvert\mathrm w\rvert=\sqrt{2}$. With the constraint $\lvert\mathrm w\rvert=1$ we obtain the same hyperplane, only with normalized parameters: $f(\mathrm x)/\sqrt{2}=x^{(1)}/\sqrt{2}+x^{(2)}/\sqrt{2}-3/\sqrt{2}=0$, with $\lvert\mathrm w\rvert=1$. We can therefore drop the constraint $\lvert\mathrm w\rvert=1$ and minimize (6) directly.
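
The scale-invariance argument can be checked numerically. The sketch below, written for illustration only, divides the parameters of $x^{(1)}+x^{(2)}-3=0$ by $\lvert\mathrm w\rvert$ and confirms that the predicted signs of the three training points (coordinates as in the appendix code) do not change. All variable names are assumptions of this sketch.

```python
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])  # x1, x2, x3
w0, w = -3.0, np.array([1.0, 1.0])                  # x^(1) + x^(2) - 3 = 0

norm = np.linalg.norm(w)           # sqrt(2)
w0_n, w_n = w0 / norm, w / norm    # normalized parameters with |w_n| = 1

signs = np.sign(w0 + X @ w)        # classification with the original parameters
signs_n = np.sign(w0_n + X @ w_n)  # classification after normalization
print(signs, signs_n)
```

Dividing by a positive constant rescales every score by the same factor, so the zero set (the line itself) and all signs are unchanged.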



To find the optimal separating hyperplane $\hat y=f(\mathrm x)$, we minimize the error function $E$ to obtain the best parameters $\bar{\mathrm w}=\{\omega_0,\mathrm w\}$, using gradient descent. The partial derivatives of $E$ with respect to $\omega_0$ and $\omega_j$ are
$$\frac{\partial E}{\partial\omega_0}=-\sum\limits_{i=1}^{\lvert\mathcal D_{error}\rvert}{y}_i\tag{7}$$

$$\frac{\partial E}{\partial\omega_j}=-\sum\limits_{i=1}^{\lvert\mathcal D_{error}\rvert}x_i^{(j)}{y}_i\tag{8}$$
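
As a sanity check on (7) and (8), the following sketch compares the analytic gradient with a central finite difference of the loss (6). It is an illustration under stated assumptions: the parameter value $\bar{\mathrm w}=(1,3,3)$ is an arbitrary test point at which only $\mathrm x_3$ is misclassified, the data are the three points from the appendix code, and the function names `loss` and `grad` are hypothetical.

```python
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
Xb = np.hstack([np.ones((3, 1)), X])  # prepend x_0 = 1, so w_bar = (w0, w1, w2)

def loss(w_bar):
    """Eq. (6): minus the sum of y_i (w0 + w^T x_i) over misclassified points."""
    margins = y * (Xb @ w_bar)
    return -margins[margins <= 0].sum()

def grad(w_bar):
    """Eqs. (7)-(8): minus the sum of y_i * (1, x_i) over misclassified points."""
    mis = y * (Xb @ w_bar) <= 0
    return -(Xb[mis] * y[mis][:, None]).sum(axis=0)

w_bar = np.array([1.0, 3.0, 3.0])  # assumed test point; only x3 is misclassified
analytic = grad(w_bar)

h = 1e-6  # small step, so the misclassified set stays the same
numeric = np.array([(loss(w_bar + h * e) - loss(w_bar - h * e)) / (2 * h)
                    for e in np.eye(3)])
print(analytic, numeric)
```

The comparison is only meaningful where a small perturbation does not change which points are misclassified, since $E$ is piecewise linear in the parameters.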

Because the partial derivatives depend only on the misclassified examples, we can use stochastic gradient descent, i.e., update the parameters with a single misclassified training example (e.g., $\mathrm x_i$) at a time:
$$\omega_0^{t+1}=\omega_0^t+\eta^ty_i\tag{9}$$

$$\omega_j^{t+1}=\omega_j^{t}+\eta^tx_i^{(j)}y_i\tag{10}$$
where $\eta^t$ is the learning rate at iteration $t$.

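Intuition behind (9) and (10): for each misclassified point, we adjust $\omega_0,\mathrm w$ so that the separating hyperplane moves toward that point, which reduces the distance from the point to the decision boundary, i.e., the error.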
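
One way to check this intuition is to substitute the update into the quantity $y_i(\omega_0+\mathrm w^{\mathrm T}\mathrm x_i)$, which is zero or negative for a misclassified point. The short derivation below is a worked step added for illustration; it uses only (9), (10) and $y_i^2=1$:

$$\begin{aligned} y_i\left(\omega_0^{t+1}+(\mathrm w^{t+1})^{\mathrm T}\mathrm x_i\right) &= y_i\left(\omega_0^{t}+\eta^t y_i+(\mathrm w^{t}+\eta^t y_i\mathrm x_i)^{\mathrm T}\mathrm x_i\right)\\ &= y_i\left(\omega_0^{t}+(\mathrm w^{t})^{\mathrm T}\mathrm x_i\right)+\eta^t\left(1+\lvert\mathrm x_i\rvert^2\right) \end{aligned}$$

Each update therefore raises this quantity by $\eta^t(1+\lvert\mathrm x_i\rvert^2)>0$, which pushes $\mathrm x_i$ toward the correct side of the hyperplane. A single update does not guarantee that $\mathrm x_i$ ends up correctly classified, which is why the same point may be selected several times.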


We use the example above again to walk through the algorithm step by step. The example has only 2 features $x^{(1)},x^{(2)}$, so the separating hyperplane we seek is the line $f(\mathrm x)=\omega_0+\omega_1x^{(1)}+\omega_2x^{(2)}=0$. The stochastic gradient algorithm proceeds as follows:

  1. Initialize the parameters: $\omega_0^0=\omega_1^0=\omega_2^0=0,\eta^t=\eta=1$;

  2. Iterate:

    • Given the current separating line, find a training example that is misclassified (i.e., $f(\mathrm x_i)y_i\le 0$);

    • For instance, in the first iteration $\mathrm x_1$ is misclassified. Substituting $\mathrm x_1$ into (9) and (10) gives:
      $$\omega_0^{1}=\omega_0^0+\eta y_1=0+1\times 1=1\tag{11}$$

      $$\omega_1^{1}=\omega_1^{0}+\eta x_1^{(1)}y_1=0+1\times 3\times 1=3\tag{12}$$

      $$\omega_2^{1}=\omega_2^{0}+\eta x_1^{(2)}y_1=0+1\times 3\times 1=3\tag{13}$$

      The separating line at this point is $1+3x^{(1)}+3x^{(2)}=0$.

  3. Repeat step 2 until no misclassified example remains in the training set.

For this example, the procedure ends with the separating line
$$-3+x^{(1)}+x^{(2)}=0\tag{14}$$
and the corresponding perceptron model
$$f(\mathrm x)=\mathbb I(-3+x^{(1)}+x^{(2)})\tag{15}$$
Note: many separating hyperplanes exist; which one the algorithm finds depends on the initial values and on which misclassified example is selected in step 2. For example, $-5+x^{(1)}+2x^{(2)}=0$ is also a separating hyperplane.
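
The steps above, and the dependence on the selection rule mentioned in the note, can be sketched compactly as follows. This is an illustrative version rather than the appendix implementation: the function name `train_perceptron`, the `pick` parameter (choose the first or the last misclassified index) and the `max_updates` safeguard are assumptions of this sketch. The data are the three training points from the appendix code.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, pick="first", max_updates=1000):
    """Stochastic perceptron updates, eqs. (9)-(10), starting from w_bar = 0."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend x_0 = 1
    w_bar = np.zeros(Xb.shape[1])              # (w0, w1, ..., wM)
    history = []                               # parameters after each update
    for _ in range(max_updates):
        # Indices with y_i * f(x_i) <= 0 are misclassified
        mis = np.flatnonzero(y * (Xb @ w_bar) <= 0)
        if mis.size == 0:
            break                              # every point is classified correctly
        i = mis[0] if pick == "first" else mis[-1]
        w_bar = w_bar + eta * y[i] * Xb[i]
        history.append(w_bar.copy())
    return w_bar, history

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

w_first, hist_first = train_perceptron(X, y, pick="first")
w_last, hist_last = train_perceptron(X, y, pick="last")
print(w_first, len(hist_first))
print(w_last, len(hist_last))
```

Selecting the first misclassified point reproduces the line (14) after 7 updates, while selecting the last one stops after 4 updates at a different separating line, $-2+x^{(1)}=0$.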



  • A simple example

    In the small example above, the inputs are $\{\mathrm x_1=(3,3),\mathrm x_2=(4,3),\mathrm x_3=(1,1)\}$ with classes $\{y_1=+1,y_2=+1,y_3=-1\}$. We take a new test example $\mathrm x=(4,4)$ whose true class is $y=+1$. Applying the stochastic gradient algorithm described above gives the results shown in Figure 2 (the Python source code is given in the appendix):


Figure 2 shows the value of $\bar{\mathrm w}$ in each iteration. After 7 iterations, all training examples are classified correctly and the error is 0. The separating hyperplane is then the line $-3+x^{(1)}+x^{(2)}=0$.

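  • Iris dataset
    The iris dataset contains 150 examples with 4 features, divided into 3 classes. We use only the first 2 features, so that the classification results can be shown in the two-dimensional Figures 3 and 4. We also merge class 2 and class 3 into a single class, which turns the task into a binary classification problem.

Figures 3 and 4 show that we find a separating line $99-62.6x^{(1)}+79.5x^{(2)}=0$ that classifies the iris dataset correctly.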



# -*- coding: utf-8 -*-
# @Time : 2020/5/3 14:48
# @Author : tengweitw

import numpy as np
import matplotlib.pyplot as plt
# Set the format of labels
def LabelFormat(plt):
    ax = plt.gca()
    labels = ax.get_xticklabels() + ax.get_yticklabels()
    # Use Times New Roman for all tick labels
    for label in labels:
        label.set_fontname('Times New Roman')
    font = {'family': 'Times New Roman',
            'weight': 'normal',
            'size': 16,
            }
    return font

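# Coordinates of the three training examples x1=(3,3), x2=(4,3), x3=(1,1)
x = [3, 4, 1]
y = [3, 3, 1]
c = [r'$\mathrm{x}_1$',r'$\mathrm{x}_2$',r'$\mathrm{x}_3$']

# NOTE: the plotting statements of the original script are missing at this
# point. The lines below are a reconstruction (assumption) based on the
# description of Figure 1 and on the annotations that follow.
plt.figure()
# Positive examples x1, x2 as red squares; negative example x3 as a blue circle
plt.plot(x[:2], y[:2], 'rs', markersize=8)
plt.plot(x[2:], y[2:], 'bo', markersize=8)
# The two separating lines x1 + x2 - 3 = 0 and 2*x1 + x2 - 5 = 0
x1_line = np.linspace(0, 5, 100)
plt.plot(x1_line, 3 - x1_line, 'k-')
plt.plot(x1_line, 5 - 2 * x1_line, 'k--')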
for i in range(0, len(x)):
    plt.annotate(c[i], xy=(x[i], y[i]), xytext=(x[i] + 0.05, y[i] + 0.05),fontsize=16)
plt.annotate('$2x^{(1)}+x^{(2)}-5=0$', xy=(1, 3), xycoords='data',
             xytext=(0, 60), textcoords='offset points', color='g', fontsize=16, arrowprops=dict(arrowstyle="->",
             connectionstyle="arc,rad=0", color='k'))
plt.annotate('$x^{(1)}+x^{(2)}-3=0$', xy=(2.5, 0.5), xycoords='data',
             xytext=(30, 30), textcoords='offset points', color='g', fontsize=16, arrowprops=dict(arrowstyle="->",
             connectionstyle="arc,rad=0", color='k'))

# Set the labels
font = LabelFormat(plt)
plt.xlabel('$x^{(1)}$', font)
plt.ylabel('$x^{(2)}$', font)
plt.show()



# -*- coding: utf-8 -*-
# @Time : 2020/5/5 11:40
# @Author : tengweitw

import numpy as np

def Perceptron_gradient_descend(train_data, train_target, test_data):
    # learning rate
    eta = 1
    M = np.size(train_data, 1)  # number of features
    N = np.size(train_data, 0)  # number of instances
    w_bar = np.zeros((M + 1, 1))  # initialization

    # the 1st column is 1, i.e., x_0=1
    temp = np.ones([N, 1])
    # X is an N*(1+M)-dim matrix
    X = np.concatenate((temp, train_data), axis=1)
    train_target = np.array(train_target).reshape(N, 1)

    iter = 0
    num_iter = 10
    while iter < num_iter:
        print('The %s-th iteration:' % (iter + 1))
        # Compute f(x_i)y_i and find a wrongly-classified instance
        # (the missing lines are restored to match the Iris script in this appendix)
        z = np.matmul(X, w_bar)
        fxy = z * train_target
        index_instance = np.argwhere(fxy <= 0)

        if index_instance.size:
            # Get the first instance, you can also pick the instance randomly
            index_x_selected = index_instance[0][0]
        else:
            print('There is no instance that is classified by mistake.\n')
            break
        # update w according to eqs. (9) and (10)
        x_selected = X[index_x_selected]
        w_bar = w_bar + eta * x_selected.reshape(M + 1, 1) * train_target[index_x_selected]
        print('w_bar =', np.transpose(w_bar))

        iter += 1

    # Predicting, let x0=1 to be multiplied by \omega_0
    x0 = np.ones((np.size(test_data, 0), 1))
    test_data1 = np.concatenate((x0, test_data), axis=1)
    y_predict_test_temp = np.matmul(test_data1, w_bar)
    # Note that there is only one test point here; otherwise changes are needed
    if y_predict_test_temp[0, 0] > 0:
        y_predict_test = 1
    else:
        y_predict_test = -1

    return y_predict_test, w_bar

# x1 x2 x3
data = [[3,3],[4,3],[1,1]]
# The labels
label = [1,1,-1]

# testing points [4,4]
test_data = np.array([4,4]).reshape(1,2)
test_target = [1]

train_data = data
train_target = label

y_predict_test,w_bar=Perceptron_gradient_descend(train_data, train_target, test_data)
print('The point x={} whose true class is {}, is grouped as class {}.'.format(test_data,test_target,y_predict_test))


# -*- coding: utf-8 -*-
# @Time : 2020/5/5 11:42
# @Author : tengweitw

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets

# Create color maps for three types of labels
cmap_light = ListedColormap(['tomato', 'limegreen', 'cornflowerblue'])

# Set the format of labels
def LabelFormat(plt):
    ax = plt.gca()
    labels = ax.get_xticklabels() + ax.get_yticklabels()
    # Use Times New Roman for all tick labels
    for label in labels:
        label.set_fontname('Times New Roman')
    font = {'family': 'Times New Roman',
            'weight': 'normal',
            'size': 16,
            }
    return font

# Plot the training points: red for class +1, blue for class -1
def PlotTrainPoint(train_data, train_target):
    for i in range(0, len(train_target)):
        if train_target[i] == 1:
            plt.plot(train_data[i][0], train_data[i][1], 'rs', markersize=6, markerfacecolor="r")
        else:
            plt.plot(train_data[i][0], train_data[i][1], 'bs', markersize=6, markerfacecolor="b")

# Plot the testing points as hollow markers, colored by predicted class
def PlotTestPoint(test_data, y_predict_test):
    for i in range(0, len(y_predict_test)):
        if y_predict_test[i] == 1:
            plt.plot(test_data[i][0], test_data[i][1], 'rs', markerfacecolor='none', markersize=6)
        else:
            plt.plot(test_data[i][0], test_data[i][1], 'bs', markersize=6, markerfacecolor="none")

# Plot the super plane
def Plot_segment_plane(w_bar):
    x0 = 1
    x1 = np.linspace(4, 8, 100)
    x2 = -(w_bar[0] * x0 + w_bar[1] * x1) / w_bar[2]
    plt.plot(x1, x2, 'k-')

def Perceptron_stochastic_gradient_descend(train_data, train_target, test_data):
    # learning rate
    eta = 1
    M = np.size(train_data, 1)  # number of features
    N = np.size(train_data, 0)  # number of instances
    w_bar = np.zeros((M + 1, 1))  # initialization

    # the 1st column is 1 i.e., x_0=1
    temp = np.ones([N, 1])
    # X is a N*(1+M)-dim matrix
    X = np.concatenate((temp, train_data), axis=1)
    train_target = np.array(train_target).reshape(N, 1)

    iter = 0
    num_iter = 10000

    while iter < num_iter:
        # print('The %s-th iteration:'%(iter+1))
        # Compute f(x_i)y_i and find a wrongly-classified instance
        z = np.matmul(X, w_bar)
        fxy = z * train_target
        index_instance = np.argwhere(fxy <= 0)

        if index_instance.size:
            # Get the first instance, you can also pick the instance randomly
            index_x_selected = index_instance[0][0]
        else:
            # No misclassified instance is left: report the parameters and stop
            print('There is no instance that is classified by mistake.\n')
            print('The derived parameters w=', np.transpose(w_bar))
            break
        x_selected = X[index_x_selected]
        # update w according to eqs. (9) and (10)
        w_bar = w_bar + eta * x_selected.reshape(M + 1, 1) * train_target[index_x_selected]

        iter += 1

    # Predicting, let x0=1 to be multiplied by \omega_0
    x0 = np.ones((np.size(test_data, 0), 1))
    test_data1 = np.concatenate((x0, test_data), axis=1)
    y_predict_test = np.matmul(test_data1, w_bar)
    # Map each score to a class label: +1 if positive, -1 otherwise
    for i in range(len(y_predict_test)):
        if y_predict_test[i] > 0:
            y_predict_test[i] = 1
        else:
            y_predict_test[i] = -1

    return y_predict_test, w_bar

# import dataset of iris
iris = datasets.load_iris()

# The first two-dim feature for simplicity
data = iris.data[:, :2]
# Group 1 (labeled 0 initially) is labeled as +1
label = iris.target + 1

# Group 2 and 3 as one group, and label them as -1
label[50:] = -1

# Choose the 25,75,125th instance as testing points
test_data = [data[25, :], data[75, :], data[125, :]]
test_target = label[[25, 75, 125]]

data = np.delete(data, [25, 75, 125], axis=0)
label = np.delete(label, [25, 75, 125], axis=0)

train_data = data
train_target = label

y_predict_test, w_bar = Perceptron_stochastic_gradient_descend(train_data, train_target, test_data)

print('The point x={} \n whose true class is {}, is grouped as class {}.'.format(test_data, test_target, np.transpose(y_predict_test)))

PlotTrainPoint(train_data, train_target)
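PlotTestPoint(test_data, y_predict_test)
# Draw the learned separating line (this call appears to be missing from the original listing)
Plot_segment_plane(w_bar)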
plt.annotate('$99-62.6x^{(1)}+79.5x^{(2)}=0$', xy=(7.5, 4.7), xycoords='data',
             xytext=(-300, 20), textcoords='offset points', color='g', fontsize=16, arrowprops=dict(arrowstyle="->",
             connectionstyle="arc,rad=0", color='k'))
font = LabelFormat(plt)
plt.xlabel('$x^{(1)}$', font)
plt.ylabel('$x^{(2)}$', font)
plt.show()

