Python ML Study Notes, Ch. 3: Modeling Class Probabilities with Logistic Regression

01加加龙

Article link: https://blog.csdn.net/Yogurtboyyy/article/details/127447686

Logistic regression is a classification model, not a regression model.

I. A few definitions to know:

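1. Odds ratio (the odds in favor of an event): $\frac{p}{1-p}$, where $p$ is the probability of the positive event. The positive event is the event we want to predict, for example that a patient has a certain disease, or that someone wins ten million in the lottery.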

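2. Log-odds (the logit function): $logit(p) = \log \frac{p}{1-p}$. The logit function takes input values between 0 and 1 and maps them onto the entire real number range, so it can be used to express a linear relationship between the feature values and the log-odds: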

$logit(p(y=1|x)) = w_0x_0 + w_1x_1 + \dots + w_mx_m = \sum_{i=0}^{m}w_ix_i = w^Tx$

Here $p(y=1|x)$ is the conditional probability that a particular sample belongs to class 1 given its features $x$.

3. Sigmoid function: $\phi(z) = \frac{1}{1 + e^{-z}}$, which is the inverse of the logit function. Here $z = w^Tx$ is the net input.
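The three definitions above can be checked numerically. The following is a minimal sketch using only the standard library; the function names `odds`, `logit`, and `sigmoid` and the sample probabilities are chosen here for illustration and are not from the original post.

```python
import math


def odds(p):
    """Odds in favor of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds: maps a probability in (0, 1) onto the whole real line."""
    return math.log(odds(p))


def sigmoid(z):
    """Logistic sigmoid: maps any real z back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for p in (0.1, 0.5, 0.8):
    z = logit(p)
    # sigmoid(logit(p)) recovers p up to floating-point rounding
    print(f"p={p:.1f}  odds={odds(p):.3f}  logit={z:+.3f}  "
          f"sigmoid(logit)={sigmoid(z):.3f}")
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0, which is why the decision boundary of logistic regression sits at a net input of 0.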

The sigmoid function has the following shape:

II. Learning the weights of the logistic cost function:

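First, define the likelihood L that we want to maximize when building a logistic regression model, assuming that the samples in the dataset are independent of one another. The formula is: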

$L(w) = P(y|x;w) = \prod_{i=1}^{n}P(y^{(i)}|x^{(i)};w) = \prod_{i=1}^{n}(\phi(z^{(i)}))^{y^{(i)}}(1-\phi(z^{(i)}))^{1-y^{(i)}}$

In practice, we maximize the natural logarithm of this equation, which is called the log-likelihood function:

$l(w) = \log L(w) = \sum_{i=1}^{n} [y^{(i)}\log(\phi(z^{(i)})) + (1-y^{(i)})\log(1-\phi(z^{(i)}))]$

Negating the log-likelihood gives a cost function J, $J(w) = -l(w)$, which can be minimized with gradient descent:

$J(w) = \sum_{i=1}^{n}[-y^{(i)}\log(\phi(z^{(i)}))-(1-y^{(i)})\log(1-\phi(z^{(i)}))]$
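To see what gradient descent actually computes here, the partial derivative of J can be worked out as follows. This is a sketch of the standard derivation, not taken from the original post; it uses $\phi'(z) = \phi(z)(1-\phi(z))$ and $\partial z^{(i)} / \partial w_j = x_j^{(i)}$, and the symbol $\eta$ for the learning rate is an assumption of this sketch.

```latex
\begin{aligned}
\frac{\partial J}{\partial w_j}
  &= \sum_{i=1}^{n}\left[-\frac{y^{(i)}}{\phi(z^{(i)})}
     + \frac{1-y^{(i)}}{1-\phi(z^{(i)})}\right]
     \phi(z^{(i)})\bigl(1-\phi(z^{(i)})\bigr)\,x_j^{(i)} \\
  &= \sum_{i=1}^{n}\left[-y^{(i)}\bigl(1-\phi(z^{(i)})\bigr)
     + \bigl(1-y^{(i)}\bigr)\phi(z^{(i)})\right]x_j^{(i)} \\
  &= -\sum_{i=1}^{n}\bigl(y^{(i)}-\phi(z^{(i)})\bigr)\,x_j^{(i)}
\end{aligned}
\qquad\Longrightarrow\qquad
w_j := w_j + \eta\sum_{i=1}^{n}\bigl(y^{(i)}-\phi(z^{(i)})\bigr)\,x_j^{(i)}
```

The update has the same form as the Adaline rule, with the sigmoid output $\phi(z)$ in place of the linear activation.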

To better understand this cost function, let us compute the cost of a single training example:

$J(\phi(z),y;w) = -y\log(\phi(z))-(1-y)\log(1-\phi(z))$

From the equation we can see that the first term is zero if y=0, and the second term is zero if y=1:

$J(\phi(z),y;w) = \begin{cases} -\log(\phi(z)) & \text{if } y = 1 \\ -\log(1-\phi(z)) & \text{if } y = 0 \end{cases}$

The following short code plots the cost of classifying a single example for different values of $\phi(z)$:

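import matplotlib.pyplot as plt
import numpy as np


def sigmoid(z):
    # logistic sigmoid, the inverse of the logit function
    return 1.0 / (1.0 + np.exp(-z))


def cost_1(z):
    # cost of a single example when the true label is y = 1
    return - np.log(sigmoid(z))


def cost_0(z):
    # cost of a single example when the true label is y = 0
    return - np.log(1 - sigmoid(z))

z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z)

c1 = [cost_1(x) for x in z]
plt.plot(phi_z, c1, label='J(w) if y=1')

c0 = [cost_0(x) for x in z]
plt.plot(phi_z, c0, linestyle='--', label='J(w) if y=0')

plt.ylim(0.0, 5.1)
plt.xlim([0, 1])
# raw string so that the backslash in \phi is not treated as an escape
plt.xlabel(r'$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
#plt.savefig('images/03_04.png', dpi=300)
plt.show()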

The resulting plot:

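Conclusion: if the true class is 1, the cost grows as the predicted probability $\phi(z)$ approaches 0; if the true class is 0, the cost grows as $\phi(z)$ approaches 1. In both cases the cost approaches 0 when the prediction is correct, and wrong predictions are penalized with an increasingly large cost.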
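A few concrete values make the conclusion tangible. This is a minimal sketch; the helper name `example_cost` and the chosen probabilities are illustrative assumptions, not part of the original post.

```python
import math


def example_cost(phi, y):
    """Logistic cost of one example with predicted probability phi and label y."""
    return -math.log(phi) if y == 1 else -math.log(1.0 - phi)


for phi in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"phi={phi:<5} cost if y=1: {example_cost(phi, 1):6.3f}   "
          f"cost if y=0: {example_cost(phi, 0):6.3f}")
```

The two columns mirror each other: a confident prediction costs almost nothing when it is right and a lot when it is wrong.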

III. Converting the Adaline implementation into a logistic regression algorithm

Replace the Adaline cost function J implemented in Chapter 2 with the new cost function:

$J(w) = \sum_{i=1}^{n}[-y^{(i)}\log(\phi(z^{(i)}))-(1-y^{(i)})\log(1-\phi(z^{(i)}))]$

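While classifying the training samples, this formula is used to compute the cost in each epoch. In addition, the linear activation function must be replaced with the sigmoid activation, and the threshold function must be changed to return the class labels 0 and 1 instead of -1 and 1. Making these three changes to the Adaline code yields the following logistic regression implementation: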

import numpy as np


class LogisticRegressionGD(object):
    """Logistic Regression Classifier using gradient descent.

    Parameters
    ------------
    eta : float
      Learning rate (between 0.0 and 1.0)
    n_iter : int
      Passes over the training dataset.
    random_state : int
      Random number generator seed for random weight
      initialization.


    Attributes
    -----------
    w_ : 1d-array
      Weights after fitting.
    cost_ : list
      Logistic cost function value in each epoch.

    """
    def __init__(self, eta=0.05, n_iter=100, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
          Training vectors, where n_samples is the number of samples and
          n_features is the number of features.
        y : array-like, shape = [n_samples]
          Target values.

        Returns
        -------
        self : object

        """
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            net_input = self.net_input(X)
            output = self.activation(net_input)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            
            # Step 1: compute the logistic cost now instead of
            # the sum-of-squared-errors cost used in Adaline
            cost = -y.dot(np.log(output)) - ((1 - y).dot(np.log(1 - output)))
            self.cost_.append(cost)
        return self
    
    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, z):
        """Compute logistic sigmoid activation"""
        # Step 2: use the sigmoid activation instead of the linear activation
        return 1. / (1. + np.exp(-np.clip(z, -250, 250)))

    def predict(self, X):
        """Return class label after unit step"""
        # Step 3: return class labels 0 and 1 instead of -1 and 1
        return np.where(self.net_input(X) >= 0.0, 1, 0)
        # equivalent to:
        # return np.where(self.activation(self.net_input(X)) >= 0.5, 1, 0)
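The weight update in `fit` relies on the gradient of the logistic cost being $-\sum_i (y^{(i)} - \phi(z^{(i)}))x_j^{(i)}$. As a sanity check, the sketch below compares that analytic gradient with a central finite-difference approximation. It is standalone and uses only the standard library; the data, the weights, and the function names are hypothetical values picked for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, X, y):
    """Logistic cost J(w); w[0] is the bias unit, as in the class above."""
    J = 0.0
    for xi, yi in zip(X, y):
        phi = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
        J += -yi * math.log(phi) - (1 - yi) * math.log(1.0 - phi)
    return J


def analytic_gradient(w, X, y):
    """dJ/dw_j = -sum_i (y_i - phi_i) * x_ij, with x_i0 = 1 for the bias."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        phi = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
        error = yi - phi
        grad[0] -= error
        for j, xj in enumerate(xi, start=1):
            grad[j] -= error * xj
    return grad


def numeric_gradient(w, X, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad.append((cost(w_plus, X, y) - cost(w_minus, X, y)) / (2 * eps))
    return grad


# Hypothetical toy data: four samples with two features each
X = [[0.5, 1.2], [-1.0, 0.3], [1.5, -0.7], [-0.2, -1.1]]
y = [1, 0, 1, 0]
w = [0.1, -0.3, 0.2]

print(analytic_gradient(w, X, y))
print(numeric_gradient(w, X, y))
```

The two printed vectors should agree to several decimal places, which supports reusing the Adaline update rule with `errors = y - output`.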

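Note: this logistic regression implementation only works for binary classification, so the following uses only the Iris-setosa and Iris-versicolor flowers (classes 0 and 1) for verification.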

The code is as follows:

X_train_01_subset = X_train[(y_train == 0) | (y_train == 1)]
y_train_01_subset = y_train[(y_train == 0) | (y_train == 1)]

lrgd = LogisticRegressionGD(eta=0.05, n_iter=1000, random_state=1)
lrgd.fit(X_train_01_subset,
         y_train_01_subset)

plot_decision_regions(X=X_train_01_subset, 
                      y=y_train_01_subset,
                      classifier=lrgd)

plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')

plt.tight_layout()
#plt.savefig('images/03_05.png', dpi=300)
plt.show()

The decision regions plot is shown below:

IV. Training a logistic regression model with scikit-learn

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=100.0, random_state=1)
lr.fit(X_train_std, y_train)

plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
#plt.savefig('images/03_06.png', dpi=300)
plt.show()

The decision regions plot is shown below:

V. Tackling overfitting via regularization

What is overfitting? And what is underfitting?

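Overfitting means that a model performs well on the training data but does not generalize to unseen or test data (high variance). Underfitting means that the model is not complex enough to capture the patterns in the training data, so it also performs poorly on unseen data (high bias).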

The following figures illustrate overfitting and underfitting:

What is regularization?

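Regularization is a very effective method for handling collinearity (high correlation among features), filtering out noise from the data, and ultimately preventing overfitting. The idea behind regularization is to introduce additional information (bias) to penalize extreme parameter (weight) values.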

The most common form of regularization is the so-called L2 regularization, defined as follows:

$\frac{\lambda}{2}\left\| w \right\|^2 = \frac{\lambda}{2} \sum_{j=1}^{m} w_j^2$

Here $\lambda$ is the so-called regularization parameter.

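The cost function for logistic regression can be regularized by adding a simple regularization term, which shrinks the weights during model training: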

$J(w) = \sum_{i=1}^{n}[-y^{(i)}\log(\phi(z^{(i)}))-(1-y^{(i)})\log(1-\phi(z^{(i)}))] + \frac{\lambda}{2}\left\| w \right\|^2$
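The regularized cost can be evaluated directly to see how the penalty grows with $\lambda$. The sketch below is a minimal, standard-library illustration; the weights, the data, and the function name `regularized_cost` are hypothetical. It keeps the bias unit separate and leaves it unpenalized, matching the sum over $j = 1, \dots, m$ in the L2 term.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def regularized_cost(w, b, X, y, lam):
    """Logistic cost plus (lam / 2) * ||w||^2; the bias b is not penalized."""
    data_term = 0.0
    for xi, yi in zip(X, y):
        phi = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
        data_term += -yi * math.log(phi) - (1 - yi) * math.log(1.0 - phi)
    penalty = 0.5 * lam * sum(wj ** 2 for wj in w)
    return data_term + penalty


# Hypothetical weights and data, chosen only for illustration
X = [[0.5, 1.2], [-1.0, 0.3], [1.5, -0.7], [-0.2, -1.1]]
y = [1, 0, 1, 0]
w, b = [2.0, -1.0], 0.1

for lam in (0.0, 0.1, 1.0, 10.0):
    print(f"lambda={lam:<5} J={regularized_cost(w, b, X, y, lam):.4f}")
```

For fixed weights the data term does not change, so the cost rises linearly with $\lambda$; during training this pressure is what pushes the optimizer toward smaller weights.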

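The parameter C follows a convention from support vector machines. C is inversely proportional to $\lambda$, so decreasing C increases the regularization strength. This can be visualized by plotting the L2 regularization path for two weight coefficients, with the following code: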

weights, params = [], []
for c in np.arange(-5, 5):
    lr = LogisticRegression(C=10.**c, random_state=1)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)

weights = np.array(weights)
plt.plot(params, weights[:, 0],
         label='petal length')
plt.plot(params, weights[:, 1], linestyle='--',
         label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
#plt.savefig('images/03_08.png', dpi=300)
plt.show()

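As the figure shows, decreasing the inverse-regularization parameter C increases the regularization strength, and the weight coefficients shrink.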
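The same shrinkage can be reproduced without scikit-learn on a tiny problem. The sketch below fits a one-feature logistic regression by gradient descent for several values of C, assuming the relation $\lambda = 1/C$. The toy data, the function name `fit_l2_logistic`, the learning rate, and the iteration count are all assumptions of this illustration, not values from the original post.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l2_logistic(xs, ys, lam, eta=0.01, n_iter=5000):
    """Gradient descent on the L2-regularized logistic cost (one feature)."""
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        # The data term contributes sum(error * x); the penalty adds -lam * w
        w += eta * (sum(e * x for e, x in zip(errors, xs)) - lam * w)
        b += eta * sum(errors)  # the bias unit is not regularized
    return w, b


# Hypothetical, non-separable toy data so the weight stays finite
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

Cs = [100.0, 10.0, 1.0, 0.1, 0.01]
weights = [fit_l2_logistic(xs, ys, lam=1.0 / C)[0] for C in Cs]
for C, w in zip(Cs, weights):
    print(f"C={C:<6} lambda={1.0 / C:<6} w={w:.4f}")
```

Reading the output from top to bottom, the fitted weight moves toward zero as C decreases, mirroring the regularization path in the plot.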