机器学习基础-7.逻辑回归

最新推荐文章于 2023-11-10 11:08:58 发布

闲人勿-

最新推荐文章于 2023-11-10 11:08:58 发布

阅读量791

点赞数

分类专栏：机器学习文章标签：逻辑回归

本文链接：https://blog.csdn.net/qq_25560849/article/details/80626799

版权

机器学习专栏收录该内容

12 篇文章 6 订阅

订阅专栏

一、逻辑回归

1.概念

逻辑回归虽然名字上是回归，但是实际上解决的是分类问题。它将样本的特征和样本发生的概率联系起来，概率是个数，如果不进行最后的分类则可以视为回归。一般会设定一个阈值，如果该概率大于这个阈值就会分为一类，如果小于该阈值就会分到另外一类，所以说逻辑回归实际上解决的是分类问题。逻辑回归只能解决二分类问题，如果要解决多分类，则需要进行一定的改进。

之前使用回归算法时，求得的y值的值域为负无穷到正无穷，但是概率的大小为0到1，因此需要找到一个函数将上面的式子进行变换。这里使用的是sigmoid函数。

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    return 1 /(1+np.exp(-t))
x = np.linspace(-10,10,500)
y = sigmoid(x)
plt.plot(x,y)

从图中可以看出，当t>0，p>0.5，当t<0，p<0.5，分界点可以设为0，因此有如下的分类方式。

现在的问题就是，已知一个样本集，如何获得参数θ，使得最终的分类结果最接近实际的分类，也就是如何找到一个损失函数。

2.求解思路

因为是使用p作为分类指标，容易理解，如果实际上的样本是属于1分类，则p的值越小损失函数cost越大，如果实际上的样本是属于0分类，则p的值越大损失函数cost越大。

对此可以使用log函数来计算，如下所示。该函数可以很好地符合上面cost函数的要求。

因此对于m个样本，损失函数可以写成。

该函数是个凸函数，现在的目标就是求得该损失函数的最小值，虽然无法直接求解，但是可以采用梯度下降法求解出极值点。接下来的难度就是求解出该损失函数的梯度。

因此代入上面的损失函数中，括号左边的式子求梯度为。

现在求括号右边的梯度。

代入上面的损失函数中，括号右边的求梯度为。

将两个梯度代入，进行化简后，得到θj的梯度式子。

将该式子代入总的损失函数中，求得梯度式子如下。

对比之前学习的线性回归，可以得到该梯度的向量化形式。

3.代码实现

import numpy as np
from .metrics import accuracy_score

class LogisticRegression:

    def __init__(self):
        """初始化Linear Regression模型"""
        self.coef_ = None
        self.intercept_ = None
        self._theta = None
    def _sigmoid(self,t):
          return 1./(1.+ np.exp(-t))

    def fit(self, X_train, y_train, eta=0.01, n_iters=1e4):
        """根据训练数据集X_train, y_train, 使用梯度下降法训练Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        def J(theta, X_b, y):
            y_hat = self._sigmoid(X_b.dot(theta))
            try:
                return -np.sum(y*np.log(y_hat)+(1-y)*np.log(1-y_hat)) / len(y)
            except:
                return float('inf')

        def dJ(theta, X_b, y):
            return X_b.T.dot(self._sigmoid(X_b.dot(theta)) - y) / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):

            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break

                cur_iter += 1

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict_proba(self, X_predict):
        """给定待预测数据集X_predict，返回表示X_predict的结果向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return self._sigmoid(X_b.dot(self._theta))


    def predict(self, X_predict):
        """给定待预测数据集X_predict，返回表示X_predict的结果向量"""
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        proba = self.predict_proba(X_predict)
        # true为1，false为0
        return np.array(proba >=0.5,dtype='int')

    def score(self, X_test, y_test):
        """根据测试数据集 X_test 和 y_test 确定当前模型的准确度"""

        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "LogisticRegression()"

代码应用，这里使用sklearn的iris数据集。

import numpy as np
import matplotlib.pyplot as plt  
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y= iris.target

X = X[y<2,:2]#这里只拿出0,1样本，并且只取2个特征值
y = y[y<2]

plt.scatter(X[y==0,0],X[y==0,1],color='red')
plt.scatter(X[y==1,0],X[y==1,1],color='blue')

from ml.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,seed=666)
from ml.LogisticRegression import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(x_train,y_train)
log_reg.score(x_test,y_test) #值为1
log_reg.predict(x_test) #可以看到分类的0,1结果

4.决策边界

在上面的逻辑回归中，划分0,1的参考概率是0.5，当p>0.5就分类为1，当p<0.5就分类为0。再往回推导，得到决策边界。

def x2(x1):
    return (-log_reg.coef_[0]*x1-log_reg.intercept_)/log_reg.coef_[1]
x1 = np.linspace(4,8,100)
y1 = x2(x1)
plt.scatter(X[y==0,0],X[y==0,1],color='red')
plt.scatter(X[y==1,0],X[y==1,1],color='blue')
plt.plot(x1,y1)

画出边界线如下所示。这里只取了2个特征，所以决策边界是一条直线，在更高维度的情况下，将是一个面。为了展示，通常会取出2个特征在平面上进行绘制。

在这里，由于iris数据分布可以用直线区分，实际上大多数只能用不规则的边界对样本进行区分，如下图。

def plot_decision_boundary(model,axis): #该模块可以画出不规则的边界，当做工具用即可
    x0,x1 = np.meshgrid(
        np.linspace(axis[0],axis[1],int((axis[1]-axis[0])*100)).reshape(-1,1),
        np.linspace(axis[2],axis[3],int((axis[3]-axis[2])*100)).reshape(-1,1)
    )
    x_new = np.c_[x0.ravel(),x1.ravel()]
    y_predict = model.predict(x_new)
    zz = y_predict.reshape(x0.shape)
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A','#FFF59D','#90CAF9'])
    plt.contourf(x0,x1,zz,linewidth =5,cmap=custom_cmap)

plot_decision_boundary(log_reg,axis=[4,7.5,1.5,4.5])
plt.scatter(X[y==0,0],X[y==0,1],color='red')
plt.scatter(X[y==1,0],X[y==1,1],color='blue')

如果采用knn算法，对iris数据进行分类，并画出决策边界，如下图所示。

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(iris.data[:,:2],iris.target) #这里将所有样本进行训练，为了可视化，只取2个特征。target的取值为3
plot_decision_boundary(knn_clf,axis=[4,8,1.5,4.5])
plt.scatter(iris.data[iris.target==0,0],iris.data[iris.target==0,1])
plt.scatter(iris.data[iris.target==1,0],iris.data[iris.target==1,1])
plt.scatter(iris.data[iris.target==2,0],iris.data[iris.target==2,1])

可以看到决策边界和不规则，产生了过拟合的现象，这个时候需要适当调整k值。将k设置为50，如下图所示。

二、在逻辑回归中使用多项式特征

1.思想

import numpy as np
import matplotlib.pyplot as plt  
np.random.seed(666)
x = np.random.normal(0,1,size=(200,2))

y = np.array(x[:,0]**2+x[:,1]**2<1.5,dtype='int')
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='blue')

如上图所示，模拟生成一组数据。现在首先采用原始的逻辑回归进行分类，并绘制出决策边界。

from ml.LogisticRegression import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(x,y)
log_reg.score(x,y) #0.65,分数很低
plot_decision_boundary(log_reg,[-4,4,-3,3])
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='blue')

直观上看，如果能够用圆形而非直线去拟合，将会有更好的效果。

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomialLogisticRegression(degree):#采用管道将步骤进行封装
    return Pipeline([
       ('poly',PolynomialFeatures(degree=degree)),
        ('std_scaler',StandardScaler()),
        ('logistic',LogisticRegression())     
    ])

plr = PolynomialLogisticRegression(degree=2)
plr.fit(x,y)
plot_decision_boundary(plr,[-4,4,-4,4])
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='blue')

2.sklearn中的逻辑回归

添加正则化，解决模型的过拟合问题。与之前博客中介绍的正则化不同的是，在sklearn中，正则化的形式如下。在创建LogisticRegression的时候，可以引入这个参数c。

import numpy as np
import matplotlib.pyplot as plt  
np.random.seed(666)
x = np.random.normal(0,1,size=(200,2))
y = np.array(x[:,0]**2+x[:,1]<1.5,dtype='int')
for _ in range(20):
    y[np.random.randint(200)] =1#强制将20个数变成1分类，也就是模拟引入了噪音
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='blue')

from ml.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,seed=666)
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(x_train,y_train)

可以看出默认使用L2正则项。

plot_decision_boundary(log_reg,axis=[-4,4,-4,4])
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='blue')

现在尝试使用多项式。

def PolynomialLogisticRegression(degree):
    return Pipeline([
       ('poly',PolynomialFeatures(degree=degree)),
        ('std_scaler',StandardScaler()),
        ('logistic',LogisticRegression())     
    ])

plr = PolynomialLogisticRegression(degree=2)
plr.fit(x,y)
plot_decision_boundary(plr,[-4,4,-4,4])
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='blue')

改变上面degree的值为20，再绘制出图像。可以看出开始出现过拟合现象。

现在引入参数C，并且将正则项由l2改为l1。

def PolynomialLogisticRegression(degree,C,penalty='l2'):#默认是l2
    return Pipeline([
       ('poly',PolynomialFeatures(degree=degree)),
        ('std_scaler',StandardScaler()),
        ('logistic',LogisticRegression(C=C,penalty=penalty))     
    ])

plr = PolynomialLogisticRegression(degree=20,C=0.1,penalty='l1')#传入c和penalty值
plr.fit(x,y)
plot_decision_boundary(plr,[-4,4,-4,4])
plt.scatter(x[y==0,0],x[y==0,1],color='red')
plt.scatter(x[y==1,0],x[y==1,1],color='blue')

三、多分类

1.OvR

如图所示，假设有四种分类，ovr（one vs rest）的思想是选取其中一个类别，其余的三类总体作为另外一个类别，使模型退化到二分类问题，并计算出概率值。如果有n个分类，就要进行n次分类，将最大值对应的分类作为最终的分类结果。

2.OvO

相比较ovr，ovo的思想是每次只拿出2个分类进行比较，计算出哪个概率比较大，概率大的一方就视为赢一局。全部比较之后，选择赢局最多的分类。这种方式相比较ovr，更耗时间，但是精度提高了。

3.代码实现

import numpy as np
import matplotlib.pyplot as plt  
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:,:2] #依然先用两个特征，后面将会使用全部的特征
y = iris.target
from ml.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,seed=666)

from sklearn.linear_model import LogisticRegression #默认multi_class='ovr'
log_reg = LogisticRegression()
log_reg.fit(x_train,y_train)
plot_decision_boundary(log_reg,axis=[4,8.5,1.5,4.5])

plt.scatter(x[y==0,0],x[y==0,1])
plt.scatter(x[y==1,0],x[y==1,1])
plt.scatter(x[y==2,0],x[y==2,1])

接下来采用ovo方式。对比ovr可以发现分值有所提高。

log_reg2 = LogisticRegression(multi_class='multinomial',solver='newton-cg')#要使用ovo必须传入这两个参数
log_reg2.fit(x_train,y_train)
plot_decision_boundary(log_reg2,axis=[4,8.5,1.5,4.5])

plt.scatter(x[y==0,0],x[y==0,1])
plt.scatter(x[y==1,0],x[y==1,1])
plt.scatter(x[y==2,0],x[y==2,1])

4.ovo and ovr

在sklearn中封装了ovo和ovr，使得所有二分类问题，可以通过该对象进行多分类问题。

from sklearn.multiclass import OneVsRestClassifier
ovr = OneVsRestClassifier(log_reg)
ovr.fit(x_train,y_train)
ovr.score(x_test,y_test)

from sklearn.multiclass import OneVsOneClassifier
ovr = OneVsOneClassifier(log_reg2)
ovr.fit(x_train,y_train)
ovr.score(x_test,y_test)

闲人勿-

关注

0
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
机器学习基础-7.逻辑回归

一、逻辑回归1.概念逻辑回归虽然名字上是回归，但是实际上解决的是分类问题。它将样本的特征和样本发生的概率联系起来，概率是个数，如果不进行最后的分类则可以视为回归。一般会设定一个阈值，如果该概率大于这个阈值就会分为一类，如果小于该阈值就会分到另外一类，所以说逻辑回归实际上解决的是分类问题。逻辑回归只能解决二分类问题，如果要解决多分类，则需要进行一定的改进。之前使用回归算法时，求得的y...
复制链接

扫一扫

专栏目录