Introduction to Machine Learning 07: Logistic Regression

1 What Is Logistic Regression?

1.1 Introduction

Logistic regression is generally used for classification, and in its basic form it solves only binary (two-class) classification problems.

It relates a sample's features to the probability that the sample belongs to the positive class; since that probability is a single real number, the model can still be treated as a regression problem.

In (multiple) linear regression, $\hat{y} = f(x) = \theta^\intercal \cdot x_b$, where $\theta$ is the coefficient vector and $x_b$ is the feature vector augmented with a constant term $x_0 \equiv 1$. The range of $\hat{y}$ is $(-\infty, \infty)$, whereas a probability $\hat{p}$ must lie in $[0, 1]$, so the Sigmoid function is used to map the value of $y$ into the range of $p$.

Sigmoid function: $\sigma(t) = \frac{1}{1 + e^{-t}}$, with $\begin{cases} t > 0, & p > 0.5 \\ t < 0, & p < 0.5 \end{cases}$

Function plot:

[Figure: the sigmoid curve, an S-shaped function rising monotonically from 0 to 1 and crossing p = 0.5 at t = 0]
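The curve above can be reproduced with a minimal sketch (my illustration, not part of the original post) using NumPy and Matplotlib:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    """Sigmoid: maps any real t to a value in (0, 1)."""
    return 1. / (1. + np.exp(-t))

t = np.linspace(-10, 10, 500)
plt.plot(t, sigmoid(t))
plt.axhline(0.5, color='gray', linestyle='--')  # p = 0.5 corresponds to t = 0
plt.axvline(0.0, color='gray', linestyle='--')
plt.xlabel('t')
plt.ylabel('sigma(t)')
plt.show()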

Let $\hat{p} = \sigma(\theta^\intercal \cdot x_b) = \frac{1}{1 + e^{-\theta^\intercal \cdot x_b}}$. The final classification is $\hat{y} = \begin{cases} 1, & \hat{p} \ge 0.5 \\ 0, & \hat{p} < 0.5 \end{cases}$

Problem:

Given a training set $X$ with labels $y$, how do we find the parameters $\theta$ so that this model reproduces the classification output $y$ of $X$ as accurately as possible?

2 The Loss Function of Logistic Regression

$$
cost=\begin{cases} -\log(\hat{p}) & \text{if } y=1 \\ -\log(1-\hat{p}) & \text{if } y=0 \end{cases}
\;\Rightarrow\;
cost=-y\log(\hat{p})-(1-y)\log(1-\hat{p})
$$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)})+(1-y^{(i)})\log(1-\hat{p}^{(i)})\right],
\quad
\hat{p}^{(i)}=\sigma(X_b^{(i)}\theta)=\frac{1}{1+e^{-X_b^{(i)}\theta}}
$$
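To build intuition for this cost, here is a small sketch (my illustration, not part of the original module) that evaluates the single-sample cost for a few predicted probabilities; when $y=1$ the cost blows up as $\hat{p} \to 0$, and when $y=0$ it blows up as $\hat{p} \to 1$:

import numpy as np

def sample_cost(y, p_hat):
    """Cross-entropy cost of a single sample."""
    return -y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)

for p_hat in [0.01, 0.5, 0.99]:
    # cost when the true label is 1 vs. when it is 0
    print(p_hat, sample_cost(1, p_hat), sample_cost(0, p_hat))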

3 Using Gradient Descent to Find the $\theta$ That Minimizes $J(\theta)$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)})+(1-y^{(i)})\log(1-\hat{p}^{(i)})\right],
\quad
\hat{p}^{(i)}=\sigma(X_b^{(i)}\theta)=\frac{1}{1+e^{-X_b^{(i)}\theta}}
$$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\sigma(X_b^{(i)}\theta))+(1-y^{(i)})\log(1-\sigma(X_b^{(i)}\theta))\right]
$$

First, the derivatives of the logarithmic terms with respect to $t$:

$$
\begin{aligned}
\sigma(t)&=\frac{1}{1+e^{-t}}=(1+e^{-t})^{-1}, & \sigma'(t)&=(1+e^{-t})^{-2}e^{-t} \\
[\log\sigma(t)]'&=\frac{\sigma'(t)}{\sigma(t)}=\frac{(1+e^{-t})^{-2}e^{-t}}{(1+e^{-t})^{-1}}=\frac{e^{-t}}{1+e^{-t}}=\frac{1+e^{-t}-1}{1+e^{-t}}=1-\sigma(t) \\
[\log(1-\sigma(t))]'&=\frac{-\sigma'(t)}{1-\sigma(t)}=-\frac{(1+e^{-t})^{-2}e^{-t}}{\frac{e^{-t}}{1+e^{-t}}}=-\frac{1+e^{-t}}{e^{-t}}(1+e^{-t})^{-2}e^{-t}=-(1+e^{-t})^{-1}=-\sigma(t)
\end{aligned}
$$

Applying the chain rule to each term of $J(\theta)$, with $t = X_b^{(i)}\theta$ and $\partial t/\partial\theta_j = X_j^{(i)}$:

$$
\begin{cases}
\dfrac{\partial}{\partial\theta_j}\left[y^{(i)}\log(\sigma(X_b^{(i)}\theta))\right]=y^{(i)}\left(1-\sigma(X_b^{(i)}\theta)\right)X_j^{(i)} \\[1ex]
\dfrac{\partial}{\partial\theta_j}\left[(1-y^{(i)})\log(1-\sigma(X_b^{(i)}\theta))\right]=(1-y^{(i)})\left(-\sigma(X_b^{(i)}\theta)\right)X_j^{(i)}
\end{cases}
\;\Rightarrow\;\text{(adding the two terms)}\;
\left[y^{(i)}-\sigma(X_b^{(i)}\theta)\right]X_j^{(i)}
$$

$$
\Rightarrow\;\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(\sigma(X_b^{(i)}\theta)-y^{(i)}\right)X_j^{(i)}
$$

$$
\Rightarrow\;\nabla J(\theta)=
\begin{pmatrix} \partial J/\partial\theta_0 \\ \partial J/\partial\theta_1 \\ \vdots \\ \partial J/\partial\theta_n \end{pmatrix}
=\frac{1}{m}
\begin{pmatrix}
\sum_{i=1}^{m}\left(\sigma(X_b^{(i)}\theta)-y^{(i)}\right)X_0^{(i)} \\
\sum_{i=1}^{m}\left(\sigma(X_b^{(i)}\theta)-y^{(i)}\right)X_1^{(i)} \\
\vdots \\
\sum_{i=1}^{m}\left(\sigma(X_b^{(i)}\theta)-y^{(i)}\right)X_n^{(i)}
\end{pmatrix}
=\frac{1}{m}
\begin{pmatrix}
\sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_0^{(i)} \\
\sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_1^{(i)} \\
\vdots \\
\sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_n^{(i)}
\end{pmatrix}
=\frac{1}{m}X_b^\intercal\left[\sigma(X_b\theta)-y\right]
$$

(Compare with the gradient of linear regression, which has the same form with $\hat{y}^{(i)}$ in place of $\hat{p}^{(i)}$.)
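The final line is the vectorized gradient, $\nabla J(\theta) = \frac{1}{m} X_b^\intercal[\sigma(X_b\theta) - y]$, which is exactly what the implementation below computes. As a quick sanity check of the derivation (a sketch I am adding, not part of the original post), the analytic gradient can be compared against a finite-difference approximation on random data:

import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

def J(theta, X_b, y):
    """Cross-entropy loss."""
    p_hat = sigmoid(X_b.dot(theta))
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def dJ(theta, X_b, y):
    """Analytic gradient: (1/m) * X_b^T (sigma(X_b theta) - y)."""
    return X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / len(X_b)

def dJ_numeric(theta, X_b, y, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.empty_like(theta)
    for j in range(len(theta)):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[j] += eps
        theta_minus[j] -= eps
        grad[j] = (J(theta_plus, X_b, y) - J(theta_minus, X_b, y)) / (2 * eps)
    return grad

np.random.seed(0)
X_b = np.hstack([np.ones((20, 1)), np.random.randn(20, 3)])  # 20 samples: intercept + 3 features
y = np.random.randint(0, 2, 20)
theta = np.random.randn(4)
print(np.allclose(dJ(theta, X_b, y), dJ_numeric(theta, X_b, y)))  # should print True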

4 Implementing Logistic Regression in Python

The logistic regression module:

import numpy as np
from sklearn.metrics import accuracy_score

class LogisticRegression:

    def __init__(self):
        """初始化Logistic Regression模型"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def _sigmoid(self, t):
        return 1./(1.+np.exp(-t))

    def fit(self, X_train, y_train, eta=0.01, n_iters=1e4):
        """根据训练数据集X_train, y_train, 使用梯度下降法训练Logistic Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        def J(theta, X_b, y):
            """目标函数"""
            y_hat = self._sigmoid(X_b.dot(theta))
            try:
                return -np.sum(y*np.log(y_hat)+(1-y)*np.log(1-y_hat)) / len(y)
            except Exception:  # if the loss cannot be evaluated, treat it as +inf
                return float('inf')

        def dJ(theta, X_b, y):
            """梯度"""
            return X_b.T.dot(self._sigmoid(X_b.dot(theta)) - y) / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):

            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break

                cur_iter += 1

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(
            X_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict_proba(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果概率向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return self._sigmoid(X_b.dot(self._theta))

    def predict(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        proba = self.predict_proba(X_predict)
        return np.array(proba >= 0.5, dtype='int')

    def score(self, X_test, y_test):
        """根据测试数据集 X_test 和 y_test 确定当前模型的准确度"""

        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "LogisticRegression()"

Using the module above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Prepare the data
iris = datasets.load_iris()

X = iris.data
y = iris.target

# Keep only the first two classes (and the first two features)
X = X[y < 2, :2]
y = y[y < 2]

X.shape  # (100, 2)
# Plot the two classes
plt.scatter(X[y == 0, 0], X[y == 0, 1], color='r')
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='b')
plt.show()

[Figure: scatter plot of the two iris classes in the plane of the first two features]

from sklearn.model_selection import train_test_split
import LogisticRegression
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression.LogisticRegression()
log_reg.fit(X_train, y_train)
# Accuracy of the model on the test set
log_reg.score(X_test, y_test)  # 1.0
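As an optional cross-check (my addition, not part of the original post), scikit-learn's own logistic regression can be fit on the same split. Its accuracy should also be very high on this easily separable subset, although its coefficients will differ somewhat because it applies L2 regularization by default:

from sklearn.linear_model import LogisticRegression as SkLogisticRegression

sk_log_reg = SkLogisticRegression()
sk_log_reg.fit(X_train, y_train)
print(sk_log_reg.score(X_test, y_test))         # accuracy on the same test set
print(sk_log_reg.coef_, sk_log_reg.intercept_)  # learned parameters, for comparison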

5 The Decision Boundary

$$
\left\{
\begin{matrix}
\hat{p}=\sigma(\theta^\intercal\cdot x_b)=\dfrac{1}{1+e^{-\theta^\intercal\cdot x_b}}, \quad
\hat{y}=\begin{cases} 1, & \hat{p}\ge 0.5 \\ 0, & \hat{p}<0.5 \end{cases} \\[2ex]
p=\sigma(t)=\dfrac{1}{1+e^{-t}} \;\Rightarrow\; \begin{matrix} t>0, & p>0.5 \\ t<0, & p<0.5 \end{matrix}
\end{matrix}
\right.
\;\Rightarrow\;
\hat{y}=\begin{cases} 1, & \hat{p}\ge 0.5, & \theta^\intercal\cdot x_b\ge 0 \\ 0, & \hat{p}<0.5, & \theta^\intercal\cdot x_b<0 \end{cases}
\;\Rightarrow\;
\theta^\intercal\cdot x_b=0
$$

The decision boundary is therefore: $\theta^\intercal \cdot x_b = 0$

If $X$ has two features: $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \;\Rightarrow\; x_2 = \dfrac{-\theta_0 - \theta_1 x_1}{\theta_2}$

Plotting the decision boundary:

def x2(x1):
    """x2 on the decision boundary as a function of x1."""
    return (-log_reg.coef_[0] * x1 - log_reg.intercept_) / log_reg.coef_[1]

x1_plot = np.linspace(4, 8, 1000)
x2_plot = x2(x1_plot)

plt.scatter(X[y == 0, 0], X[y == 0, 1], color='r')
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='b')
plt.plot(x1_plot, x2_plot)
plt.show()

[Figure: the two classes with the linear decision boundary drawn between them]
