Introduction to Machine Learning 07: Logistic Regression

1 What Is Logistic Regression?

1.1 Introduction

Logistic regression is generally used for classification, and in its basic form it solves only binary (two-class) classification problems.

It relates a sample's features to the probability that the sample belongs to the positive class; since that probability is a single real number, the model can still be treated as a regression problem.

In (multiple) linear regression, $\hat{y} = f(x) = \theta^\intercal \cdot x_b$, where $\theta$ is the coefficient vector and $x_b$ is the feature vector augmented with a constant term $x_0 \equiv 1$. The range of $\hat{y}$ is $(-\infty, \infty)$, whereas a probability $\hat{p}$ must lie in $[0, 1]$, so the Sigmoid function is used to map the value of $y$ into the range of $p$.

Sigmoid function: $\sigma(t) = \frac{1}{1 + e^{-t}}$, with $\begin{cases} t > 0, & p > 0.5 \\ t < 0, & p < 0.5 \end{cases}$

Function plot:

[Figure: the sigmoid curve, an S-shaped function rising monotonically from 0 to 1 and crossing p = 0.5 at t = 0]
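The curve above can be reproduced with a minimal sketch (my illustration, not part of the original post) using NumPy and Matplotlib:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    """Sigmoid: maps any real t to a value in (0, 1)."""
    return 1. / (1. + np.exp(-t))

t = np.linspace(-10, 10, 500)
plt.plot(t, sigmoid(t))
plt.axhline(0.5, color='gray', linestyle='--')  # p = 0.5 corresponds to t = 0
plt.axvline(0.0, color='gray', linestyle='--')
plt.xlabel('t')
plt.ylabel('sigma(t)')
plt.show()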

Let $\hat{p} = \sigma(\theta^\intercal \cdot x_b) = \frac{1}{1 + e^{-\theta^\intercal \cdot x_b}}$. The final classification is $\hat{y} = \begin{cases} 1, & \hat{p} \ge 0.5 \\ 0, & \hat{p} < 0.5 \end{cases}$

Problem:

Given a training set $X$ with labels $y$, how do we find the parameters $\theta$ so that this model reproduces the classification output $y$ of $X$ as accurately as possible?

2 The Loss Function of Logistic Regression

$$
cost=\begin{cases} -\log(\hat{p}) & \text{if } y=1 \\ -\log(1-\hat{p}) & \text{if } y=0 \end{cases}
\;\Rightarrow\;
cost=-y\log(\hat{p})-(1-y)\log(1-\hat{p})
$$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)})+(1-y^{(i)})\log(1-\hat{p}^{(i)})\right],
\quad
\hat{p}^{(i)}=\sigma(X_b^{(i)}\theta)=\frac{1}{1+e^{-X_b^{(i)}\theta}}
$$
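To build intuition for this cost, here is a small sketch (my illustration, not part of the original module) that evaluates the single-sample cost for a few predicted probabilities; when $y=1$ the cost blows up as $\hat{p} \to 0$, and when $y=0$ it blows up as $\hat{p} \to 1$:

import numpy as np

def sample_cost(y, p_hat):
    """Cross-entropy cost of a single sample."""
    return -y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)

for p_hat in [0.01, 0.5, 0.99]:
    # cost when the true label is 1 vs. when it is 0
    print(p_hat, sample_cost(1, p_hat), sample_cost(0, p_hat))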

3 Using Gradient Descent to Find the $\theta$ That Minimizes $J(\theta)$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)})+(1-y^{(i)})\log(1-\hat{p}^{(i)})\right],
\quad
\hat{p}^{(i)}=\sigma(X_b^{(i)}\theta)=\frac{1}{1+e^{-X_b^{(i)}\theta}}
$$

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\sigma(X_b^{(i)}\theta))+(1-y^{(i)})\log(1-\sigma(X_b^{(i)}\theta))\right]
$$

First, the derivatives of the logarithmic terms with respect to $t$:

$$
\begin{aligned}
\sigma(t)&=\frac{1}{1+e^{-t}}=(1+e^{-t})^{-1}, & \sigma'(t)&=(1+e^{-t})^{-2}e^{-t} \\
[\log\sigma(t)]'&=\frac{\sigma'(t)}{\sigma(t)}=\frac{(1+e^{-t})^{-2}e^{-t}}{(1+e^{-t})^{-1}}=\frac{e^{-t}}{1+e^{-t}}=\frac{1+e^{-t}-1}{1+e^{-t}}=1-\sigma(t) \\
[\log(1-\sigma(t))]'&=\frac{-\sigma'(t)}{1-\sigma(t)}=-\frac{(1+e^{-t})^{-2}e^{-t}}{\frac{e^{-t}}{1+e^{-t}}}=-\frac{1+e^{-t}}{e^{-t}}(1+e^{-t})^{-2}e^{-t}=-(1+e^{-t})^{-1}=-\sigma(t)
\end{aligned}
$$

Applying the chain rule to each term of $J(\theta)$, with $t = X_b^{(i)}\theta$ and $\partial t/\partial\theta_j = X_j^{(i)}$:

$$
\begin{cases}
\dfrac{\partial}{\partial\theta_j}\left[y^{(i)}\log(\sigma(X_b^{(i)}\theta))\right]=y^{(i)}\left(1-\sigma(X_b^{(i)}\theta)\right)X_j^{(i)} \\[1ex]
\dfrac{\partial}{\partial\theta_j}\left[(1-y^{(i)})\log(1-\sigma(X_b^{(i)}\theta))\right]=(1-y^{(i)})\left(-\sigma(X_b^{(i)}\theta)\right)X_j^{(i)}
\end{cases}
\;\Rightarrow\;\text{(adding the two terms)}\;
\left[y^{(i)}-\sigma(X_b^{(i)}\theta)\right]X_j^{(i)}
$$

$$
\Rightarrow\;\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(\sigma(X_b^{(i)}\theta)-y^{(i)}\right)X_j^{(i)}
$$

$$
\Rightarrow\;\nabla J(\theta)=
\begin{pmatrix} \partial J/\partial\theta_0 \\ \partial J/\partial\theta_1 \\ \vdots \\ \partial J/\partial\theta_n \end{pmatrix}
=\frac{1}{m}
\begin{pmatrix}
\sum_{i=1}^{m}\left(\sigma(X_b^{(i)}\theta)-y^{(i)}\right)X_0^{(i)} \\
\sum_{i=1}^{m}\left(\sigma(X_b^{(i)}\theta)-y^{(i)}\right)X_1^{(i)} \\
\vdots \\
\sum_{i=1}^{m}\left(\sigma(X_b^{(i)}\theta)-y^{(i)}\right)X_n^{(i)}
\end{pmatrix}
=\frac{1}{m}
\begin{pmatrix}
\sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_0^{(i)} \\
\sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_1^{(i)} \\
\vdots \\
\sum_{i=1}^{m}\left(\hat{p}^{(i)}-y^{(i)}\right)X_n^{(i)}
\end{pmatrix}
=\frac{1}{m}X_b^\intercal\left[\sigma(X_b\theta)-y\right]
$$

(Compare with the gradient of linear regression, which has the same form with $\hat{y}^{(i)}$ in place of $\hat{p}^{(i)}$.)
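The final line is the vectorized gradient, $\nabla J(\theta) = \frac{1}{m} X_b^\intercal[\sigma(X_b\theta) - y]$, which is exactly what the implementation below computes. As a quick sanity check of the derivation (a sketch I am adding, not part of the original post), the analytic gradient can be compared against a finite-difference approximation on random data:

import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

def J(theta, X_b, y):
    """Cross-entropy loss."""
    p_hat = sigmoid(X_b.dot(theta))
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def dJ(theta, X_b, y):
    """Analytic gradient: (1/m) * X_b^T (sigma(X_b theta) - y)."""
    return X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / len(X_b)

def dJ_numeric(theta, X_b, y, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.empty_like(theta)
    for j in range(len(theta)):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[j] += eps
        theta_minus[j] -= eps
        grad[j] = (J(theta_plus, X_b, y) - J(theta_minus, X_b, y)) / (2 * eps)
    return grad

np.random.seed(0)
X_b = np.hstack([np.ones((20, 1)), np.random.randn(20, 3)])  # 20 samples: intercept + 3 features
y = np.random.randint(0, 2, 20)
theta = np.random.randn(4)
print(np.allclose(dJ(theta, X_b, y), dJ_numeric(theta, X_b, y)))  # should print True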

4 Implementing Logistic Regression in Python

The logistic regression module:

import numpy as np
from sklearn.metrics import accuracy_score

class LogisticRegression:

    def __init__(self):
        """初始化Logistic Regression模型"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def _sigmoid(self, t):
        return 1./(1.+np.exp(-t))

    def fit(self, X_train, y_train, eta=0.01, n_iters=1e4):
        """根据训练数据集X_train, y_train, 使用梯度下降法训练Logistic Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        def J(theta, X_b, y):
            """目标函数"""
            y_hat = self._sigmoid(X_b.dot(theta))
            try:
                return -np.sum(y*np.log(y_hat)+(1-y)*np.log(1-y_hat)) / len(y)
            except Exception:  # if the loss cannot be evaluated, treat it as +inf
                return float('inf')

        def dJ(theta, X_b, y):
            """梯度"""
            return X_b.T.dot(self._sigmoid(X_b.dot(theta)) - y) / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):

            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break

                cur_iter += 1

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(
            X_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict_proba(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果概率向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return self._sigmoid(X_b.dot(self._theta))

    def predict(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        proba = self.predict_proba(X_predict)
        return np.array(proba >= 0.5, dtype='int')

    def score(self, X_test, y_test):
        """根据测试数据集 X_test 和 y_test 确定当前模型的准确度"""

        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "LogisticRegression()"

Using the module above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Prepare the data
iris = datasets.load_iris()

X = iris.data
y = iris.target

# Keep only the first two classes (and the first two features)
X = X[y < 2, :2]
y = y[y < 2]

X.shape  # (100, 2)
# Plot the two classes
plt.scatter(X[y == 0, 0], X[y == 0, 1], color='r')
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='b')
plt.show()

[Figure: scatter plot of the two iris classes in the plane of the first two features]

from sklearn.model_selection import train_test_split
import LogisticRegression
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression.LogisticRegression()
log_reg.fit(X_train, y_train)
# Accuracy of the model on the test set
log_reg.score(X_test, y_test)  # 1.0
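As an optional cross-check (my addition, not part of the original post), scikit-learn's own logistic regression can be fit on the same split. Its accuracy should also be very high on this easily separable subset, although its coefficients will differ somewhat because it applies L2 regularization by default:

from sklearn.linear_model import LogisticRegression as SkLogisticRegression

sk_log_reg = SkLogisticRegression()
sk_log_reg.fit(X_train, y_train)
print(sk_log_reg.score(X_test, y_test))         # accuracy on the same test set
print(sk_log_reg.coef_, sk_log_reg.intercept_)  # learned parameters, for comparison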

5 The Decision Boundary

$$
\left\{
\begin{matrix}
\hat{p}=\sigma(\theta^\intercal\cdot x_b)=\dfrac{1}{1+e^{-\theta^\intercal\cdot x_b}}, \quad
\hat{y}=\begin{cases} 1, & \hat{p}\ge 0.5 \\ 0, & \hat{p}<0.5 \end{cases} \\[2ex]
p=\sigma(t)=\dfrac{1}{1+e^{-t}} \;\Rightarrow\; \begin{matrix} t>0, & p>0.5 \\ t<0, & p<0.5 \end{matrix}
\end{matrix}
\right.
\;\Rightarrow\;
\hat{y}=\begin{cases} 1, & \hat{p}\ge 0.5, & \theta^\intercal\cdot x_b\ge 0 \\ 0, & \hat{p}<0.5, & \theta^\intercal\cdot x_b<0 \end{cases}
\;\Rightarrow\;
\theta^\intercal\cdot x_b=0
$$

The decision boundary is therefore: $\theta^\intercal \cdot x_b = 0$

If $X$ has two features: $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \;\Rightarrow\; x_2 = \dfrac{-\theta_0 - \theta_1 x_1}{\theta_2}$

Plotting the decision boundary:

def x2(x1):
    """x2 on the decision boundary as a function of x1."""
    return (-log_reg.coef_[0] * x1 - log_reg.intercept_) / log_reg.coef_[1]

x1_plot = np.linspace(4, 8, 1000)
x2_plot = x2(x1_plot)

plt.scatter(X[y == 0, 0], X[y == 0, 1], color='r')
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='b')
plt.plot(x1_plot, x2_plot)
plt.show()

[Figure: the two classes with the linear decision boundary drawn between them]
