These are my study notes on Sebastian Raschka's *Python Machine Learning*, kept here for easy reference.
Chapter 2, Training Machine Learning Algorithms for Classification, covers:
- building an intuition for machine learning algorithms
- using pandas, NumPy, and matplotlib to read in, process, and visualize data
- implementing linear classification algorithms in Python
1. Artificial neurons - a brief glimpse into the early history of machine learning
In 1943, Warren McCulloch and Walter Pitts published the concept of a simplified brain cell, the so-called McCulloch-Pitts (MCP) neuron (W. S. McCulloch and W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity, 1943).
McCulloch and Pitts described such a nerve cell as a simple logic gate with binary outputs; multiple signals arrive at the dendrites, are then integrated into the cell body, and, if the accumulated signal exceeds a certain threshold, an output signal is generated that will be passed on by the axon.
Building on the MCP neuron model, Frank Rosenblatt published the perceptron learning rule (F. Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton, Cornell Aeronautical Laboratory, 1957). Rosenblatt's algorithm automatically learns the optimal weight coefficients, which are then multiplied by the input features to decide the class label of a new sample.
For a binary classification task with two classes, 1 (positive) and -1 (negative), we define an activation function $\phi(z)$ that takes a linear combination of the input values $x$ and a weight vector $w$, where $z$ is the so-called net input:

$$z = w_1 x_1 + \dots + w_m x_m = w^T x$$

The activation function is a unit step function, also known as the Heaviside step function:

$$\phi(z) = \begin{cases} 1 & \text{if } z \ge \theta \\ -1 & \text{otherwise} \end{cases}$$

For simplicity, the threshold $\theta$ can be moved to the left side of the inequality and absorbed as a bias weight $w_0 = -\theta$ with $x_0 = 1$, so that $\phi(z) = 1$ if $z \ge 0$ and $-1$ otherwise; this is how the bias term w_[0] appears in the code below.
The idea behind the MCP neuron and Rosenblatt's thresholded perceptron model is to use a reductionist approach to mimic how a single neuron in the brain works. The perceptron rule can be summarized by the following steps:
- Initialize the weights to 0 or small random numbers.
- For each training sample $x^{(i)}$, compute the output value $\hat{y}^{(i)}$ and update the weights.
The output value is the class label predicted by the unit step function, and the weight update is $w_j := w_j + \Delta w_j$ with $\Delta w_j = \eta \, (y^{(i)} - \hat{y}^{(i)}) \, x_j^{(i)}$, where $\eta$ is the learning rate (a constant between 0.0 and 1.0), $y^{(i)}$ is the true class label of the $i$-th training sample, and $\hat{y}^{(i)}$ is the predicted class label.
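To make the update rule concrete, here is a minimal sketch of a single weight update on one made-up, misclassified training sample; the variable names are illustrative and not from the book:

import numpy as np

eta = 0.1                        # learning rate
w = np.zeros(1 + 2)              # bias w[0] plus one weight per feature
xi = np.array([2.0, 3.0])        # one training sample with 2 features (made up)
target = -1                      # its true class label y^(i)

prediction = np.where(np.dot(xi, w[1:]) + w[0] >= 0.0, 1, -1)  # unit step
update = eta * (target - prediction)   # eta * (y - y_hat), here -0.2
w[1:] += update * xi                   # move the weights away from the misclassified sample
w[0] += update                         # bias update (x_0 = 1)

Because the sample is misclassified (the prediction is 1 but the target is -1), the update is nonzero and the decision boundary moves; for a correctly classified sample the update would be zero.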
The figure in the book illustrates the general concept of the perceptron: the inputs of a sample x are combined with the weights w into the net input, passed through the unit step activation function, and the resulting output is used both to predict the class label and to update the weights during learning.
2. Implementing a perceptron learning algorithm in Python
import numpy as np
class Perceptron(object):
"""Perceptron classifier.
Parameters
------------
eta : float Learning rate (between 0.0 and 1.0)
n_iter : int Passes over the training dataset.
Attributes
-----------
w_ : 1d-array Weights after fitting.
errors_ : list Number of misclassifications in every epoch.
"""
def __init__(self, eta=0.01, n_iter=10):
self.eta = eta
self.n_iter = n_iter
def fit(self, X, y):
"""Fit training data.
Parameters
----------
X : {array-like}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and
n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
Returns
-------
self : object
"""
self.w_ = np.zeros(1 + X.shape[1])
self.errors_ = []
for _ in range(self.n_iter):
errors = 0
for xi, target in zip(X, y):
update = self.eta * (target - self.predict(xi))
self.w_[1:] += update * xi
self.w_[0] += update
errors += int(update != 0.0)
self.errors_.append(errors)
return self
def net_input(self, X):
"""Calculate net input"""
return np.dot(X, self.w_[1:]) + self.w_[0]
def predict(self, X):
"""Return class label after unit step"""
return np.where(self.net_input(X) >= 0.0, 1, -1)
3. Training a perceptron model on the Iris dataset
The Iris dataset contains 150 iris flower samples from three species: Setosa, Versicolor, and Virginica. Each sample has four features (sepal length, sepal width, petal length, petal width) plus the class label. Below we only use the sepal length and petal length of the two species Setosa and Versicolor.
First, use pandas to load the dataset directly from the UCI Machine Learning Repository and inspect the last few rows to check that it loaded correctly:
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
'machine-learning-databases/iris/iris.data', header=None)
df.tail()
Next, extract the first 100 samples, which correspond to Setosa and Versicolor, select the sepal length and petal length features, assign class label 1 to Versicolor and -1 to Setosa, and visualize the data:
import matplotlib.pyplot as plt
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)
X = df.iloc[0:100, [0, 2]].values
plt.scatter(X[:50, 0], X[:50, 1],
color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],
color='blue', marker='x', label='versicolor')
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')
plt.show()
Next, train the perceptron on the Iris data and plot the number of misclassifications for each epoch:
ppn = Perceptron(eta=0.1, n_iter=10)
ppn.fit(X, y)
plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Number of misclassifications')
plt.show()
The plot shows that the perceptron converges after the sixth epoch and classifies the training samples perfectly. Now let's visualize the decision boundary:
from matplotlib.colors import ListedColormap
def plot_decision_regions(X, y, classifier, resolution=0.02):
# setup marker generator and color map
markers = ('s', 'x', 'o', '^', 'v')
colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
cmap = ListedColormap(colors[:len(np.unique(y))])
# plot the decision surface
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
np.arange(x2_min, x2_max, resolution))
Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
Z = Z.reshape(xx1.shape)
plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
# plot class samples
for idx, cl in enumerate(np.unique(y)):
plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8, c=cmap(idx),
marker=markers[idx], label=cl)
plot_decision_regions(X, y, classifier=ppn)
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')
plt.show()
4. Adaptive linear neurons and the convergence of learning
The ADAptive LInear NEuron (Adaline) was published by Bernard Widrow. Adaline illustrates the key concept of defining and minimizing a cost function, and its weight updates are based on a linear activation function rather than the unit step function.
The figure in the book compares Adaline with the perceptron: the continuous output of the linear activation function is used to compute the model error and to update the weights, while the unit step function is only applied afterwards (as a quantizer) to predict the binary class label.
5. Minimizing cost functions with gradient descent
A key ingredient of supervised machine learning is an objective function that is optimized during training; this objective function is often a cost function that we want to minimize. In Adaline, the cost function $J$ for learning the weights is the sum of squared errors (SSE) between the computed outputs and the true class labels:

$$J(w) = \frac{1}{2} \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right)^2$$
We use gradient descent to find the weights that minimize this cost function. Gradient descent can be pictured as climbing down a hill until a local or global cost minimum is reached. In each iteration we take a step in the opposite direction of the gradient, and the step size is determined by the learning rate together with the slope of the gradient.
We update the weights by taking a step away from the gradient $\nabla J(w)$ of the cost function: $w := w + \Delta w$.
The weight change $\Delta w$ is defined as the negative gradient multiplied by the learning rate $\eta$: $\Delta w = -\eta \nabla J(w)$.
To compute the gradient of the cost function, we need the partial derivative of $J$ with respect to each weight $w_j$: $\frac{\partial J}{\partial w_j} = -\sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x_j^{(i)}$, and therefore $\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x_j^{(i)}$.
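As a quick check, the partial derivative follows from the chain rule, using the fact that Adaline's activation is simply the identity, $\phi(z^{(i)}) = z^{(i)} = \sum_k w_k x_k^{(i)}$ (a sketch of the derivation, which the book presents as a side note):

$$\frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \, \frac{1}{2} \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right)^2 = \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \sum_k w_k x_k^{(i)} \right) = -\sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x_j^{(i)}$$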
6. Implementing an Adaptive Linear Neuron in Python
class AdalineGD(object):
"""ADAptive LInear NEuron classifier.
Parameters
------------
eta : float Learning rate (between 0.0 and 1.0)
n_iter : int
Passes over the training dataset.
Attributes
-----------
w_ : 1d-array
Weights after fitting.
cost_ : list Sum-of-squares cost function value in every epoch.
"""
def __init__(self, eta=0.01, n_iter=50):
self.eta = eta
self.n_iter = n_iter
def fit(self, X, y):
""" Fit training data.
Parameters
----------
X : {array-like}, shape = [n_samples, n_features]
Training vectors,
where n_samples is the number of samples and
n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
Returns
-------
self : object
"""
self.w_ = np.zeros(1 + X.shape[1])
self.cost_ = []
for i in range(self.n_iter):
output = self.net_input(X)
errors = (y - output)
self.w_[1:] += self.eta * X.T.dot(errors)
self.w_[0] += self.eta * errors.sum()
cost = (errors**2).sum() / 2.0
self.cost_.append(cost)
return self
def net_input(self, X):
"""Calculate net input"""
return np.dot(X, self.w_[1:]) + self.w_[0]
def activation(self, X):
"""Compute linear activation"""
return self.net_input(X)
def predict(self, X):
"""Return class label after unit step"""
return np.where(self.activation(X) >= 0.0, 1, -1)
Compare the results for two different learning rates:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y)
ax[0].plot(range(1, len(ada1.cost_) + 1), np.log10(ada1.cost_), marker='o')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('log(Sum-squared-error)')
ax[0].set_title('Adaline - Learning rate 0.01')
ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y)
ax[1].plot(range(1, len(ada2.cost_) + 1), ada2.cost_, marker='o')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Sum-squared-error')
ax[1].set_title('Adaline - Learning rate 0.0001')
plt.show()
The two plots show that if the learning rate is too large (eta = 0.01), the cost increases with every epoch because gradient descent overshoots the minimum, while a very small learning rate (eta = 0.0001) requires a large number of epochs to converge.
Gradient descent benefits from feature scaling, so here we standardize the data, which gives each feature the properties of a standard normal distribution: $x'_j = \frac{x_j - \mu_j}{\sigma_j}$, where $\mu_j$ and $\sigma_j$ are the sample mean and standard deviation of feature $j$. We first standardize the data and then train Adaline again:
X_std = np.copy(X)
X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()
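As a cross-check (purely an aside, since this chapter does not use scikit-learn), the same standardization can be obtained with scikit-learn's StandardScaler; X_std_sk below should match X_std up to floating-point precision:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_std_sk = sc.fit_transform(X)   # column-wise (x - mean) / std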
ada = AdalineGD(n_iter=15, eta=0.01)
ada.fit(X_std, y)
plot_decision_regions(X_std, y, classifier=ada)
plt.title('Adaline - Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')
plt.show()
plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Sum-squared-error')
plt.show()
7. Large scale machine learning and stochastic gradient descent
For large-scale machine learning problems we can use stochastic gradient descent (also called iterative or on-line gradient descent).
Instead of updating the weights based on the sum of the accumulated errors over all samples, we update the weights incrementally for each training sample: $\Delta w = \eta \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x^{(i)}$
Stochastic gradient descent typically converges faster because the weights are updated more frequently. Since each gradient is computed from a single training sample, the error surface is noisier than in batch gradient descent. In addition, stochastic gradient descent can be used for online learning, where the model is updated on the fly as new training data arrives. To obtain good results it is important to present the training data in random order, which is why the implementation below shuffles the training set in every epoch.
The Python code is as follows:
from numpy.random import seed
class AdalineSGD(object):
"""ADAptive LInear NEuron classifier.
Parameters
------------
eta : float
Learning rate (between 0.0 and 1.0)
n_iter : int
Passes over the training dataset.
Attributes
-----------
w_ : 1d-array
Weights after fitting.
cost_ : list
Average cost per epoch, averaged over all training samples.
shuffle : bool (default: True)
Shuffles training data every epoch
if True to prevent cycles.
random_state : int (default: None)
Set random state for shuffling
and initializing the weights.
"""
def __init__(self, eta=0.01, n_iter=10, shuffle=True, random_state=None):
self.eta = eta
self.n_iter = n_iter
self.w_initialized = False
self.shuffle = shuffle
if random_state:
seed(random_state)
def fit(self, X, y):
""" Fit training data.
Parameters
----------
X : {array-like}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and
n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
Returns
-------
self : object
"""
self._initialize_weights(X.shape[1])
self.cost_ = []
for i in range(self.n_iter):
if self.shuffle:
X, y = self._shuffle(X, y)
cost = []
for xi, target in zip(X, y):
cost.append(self._update_weights(xi, target))
avg_cost = sum(cost)/len(y)
self.cost_.append(avg_cost)
return self
def partial_fit(self, X, y):
"""Fit training data without reinitializing the weights"""
if not self.w_initialized:
self._initialize_weights(X.shape[1])
if y.ravel().shape[0] > 1:
for xi, target in zip(X, y):
self._update_weights(xi, target)
else:
self._update_weights(X, y)
return self
def _shuffle(self, X, y):
"""Shuffle training data"""
r = np.random.permutation(len(y))
return X[r], y[r]
def _initialize_weights(self, m):
"""Initialize weights to zeros"""
self.w_ = np.zeros(1 + m)
self.w_initialized = True
def _update_weights(self, xi, target):
"""Apply Adaline learning rule to update the weights"""
output = self.net_input(xi)
error = (target - output)
self.w_[1:] += self.eta * xi.dot(error)
self.w_[0] += self.eta * error
cost = 0.5 * error**2
return cost
def net_input(self, X):
"""Calculate net input"""
return np.dot(X, self.w_[1:]) + self.w_[0]
def activation(self, X):
"""Compute linear activation"""
return self.net_input(X)
def predict(self, X):
"""Return class label after unit step"""
return np.where(self.activation(X) >= 0.0, 1, -1)
_shuffle: np.random.permutation generates a random permutation of the unique integers from 0 to len(y) - 1 (0 to 99 for our 100 training samples); these numbers are then used as indices to shuffle the feature matrix and the class label vector in unison.
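A minimal sketch of what _shuffle does, on a tiny made-up array:

import numpy as np

X_toy = np.array([[1, 10], [2, 20], [3, 30]])
y_toy = np.array([-1, 1, -1])

r = np.random.permutation(len(y_toy))        # e.g. array([2, 0, 1])
X_shuffled, y_shuffled = X_toy[r], y_toy[r]  # rows and labels stay aligned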
Train AdalineSGD on the standardized data and plot the training results:
ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
ada.fit(X_std, y)
plot_decision_regions(X_std, y, classifier=ada)
plt.title('Adaline - Stochastic Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')
plt.show()
plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Average Cost')
plt.show()
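For online learning with streaming data, partial_fit can then be called on individual new samples without reinitializing the learned weights; for example (reusing the first training sample here purely to demonstrate the call):

ada.partial_fit(X_std[0, :], y[0])   # one incremental weight update; existing weights are kept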