Implementing Logistic Regression in Python

Latest recommended article published on 2023-11-14 10:47:48

weixin_39992665


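Logistic regression is a common classification algorithm, usually applied to binary classification problems. Its name contains the word "regression" even though it is used for classification, because it borrows the regression idea: it first computes a predicted value the way a regression would, and then converts that value into the probability of belonging to a given class. The value produced by the regression step ranges from negative infinity to positive infinity, whereas a probability P ranges from 0 to 1. The sigmoid function performs this conversion. Its expression is:

σ(t) = 1 / (1 + e^(-t))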

Let's first take a quick look at the graph of this function:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

x_plot = np.linspace(-10., 10., 5000)
y_plot = sigmoid(x_plot)
plt.plot(x_plot, y_plot)
plt.show()

[Figure: graph of the sigmoid function for t from -10 to 10]

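The plot shows that the sigmoid function maps the values in (-10, 10) into the interval (0, 1). In fact, it maps every value in (−∞, +∞) into (0, 1). So to solve a classification problem with logistic regression, we first obtain a predicted value in (−∞, +∞) using the regression idea, then use the sigmoid function to turn it into a probability in (0, 1), and finally classify the sample according to that probability. For a binary problem, sigmoid(t) can be read as the probability that the sample whose predicted value is t belongs to class 1. When sigmoid(t) is greater than 0.5, the sample is assigned to class 1; when sigmoid(t) is less than 0.5, it is assigned to class 0.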
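The score-to-probability-to-label chain described here can be sketched in a few lines. This is only an illustration: the helper name `classify` and the example scores are made up for this sketch and are not part of the article's code.

```python
import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

def classify(scores, threshold=0.5):
    """Map raw regression scores to class labels via the sigmoid (hypothetical helper)."""
    probabilities = sigmoid(np.asarray(scores, dtype=float))
    # Probability above the threshold -> class 1, otherwise class 0
    return (probabilities > threshold).astype(int)

# Made-up regression outputs, two negative and two positive
scores = np.array([-3.0, -0.5, 0.5, 3.0])
print(sigmoid(scores))   # estimated probabilities of class 1
print(classify(scores))  # [0 0 1 1]
```

Because sigmoid(0) is exactly 0.5, comparing the probability with 0.5 gives the same labels as comparing the raw score with 0.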

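With the sigmoid function, the logistic-regression classification problem becomes a regression-style optimization problem. Again we minimize a loss function with gradient descent. The mathematical derivation is omitted here; the results are:

J(θ) = -(y·ln(σ(Xθ)) + (1-y)·ln(1 - σ(Xθ))) / m

where σ is the sigmoid function, applied element-wise, and m is the number of samples.

▽J(θ) = (X.T(σ(Xθ) - y)) / m

where X.T is the transpose of the matrix X.

With the loss function J(θ) and its gradient with respect to θ, we can implement logistic regression with gradient descent.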
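Since the derivation is skipped, one way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the loss. The sketch below does that on random data. It uses the standard cross-entropy loss, with ln(1 - σ(Xθ)) in the second term. The names `loss`, `gradient`, `numerical_gradient` and all `*_demo` variables are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

def loss(X, y, theta):
    # Cross-entropy loss averaged over the m samples
    p = sigmoid(X.dot(theta))
    return -(y.dot(np.log(p)) + (1 - y).dot(np.log(1 - p))) / len(y)

def gradient(X, y, theta):
    # Closed-form gradient: X.T (sigmoid(X theta) - y) / m
    return X.T.dot(sigmoid(X.dot(theta)) - y) / len(y)

def numerical_gradient(X, y, theta, h=1e-6):
    # Central differences, one coordinate of theta at a time
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (loss(X, y, theta + step) - loss(X, y, theta - step)) / (2 * h)
    return grad

# Random demo data: 20 samples, an intercept column plus 2 features
rng = np.random.default_rng(0)
X_demo = np.hstack((np.ones((20, 1)), rng.normal(size=(20, 2))))
y_demo = (rng.random(20) > 0.5).astype(float)
theta_demo = rng.normal(size=3)

max_diff = np.max(np.abs(gradient(X_demo, y_demo, theta_demo)
                         - numerical_gradient(X_demo, y_demo, theta_demo)))
print(max_diff)  # should be tiny if the formula matches the loss
```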

The example data is the iris dataset that ships with sklearn.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets

# The iris dataset contains three species, so drop the samples labeled 2 to get a binary problem
iris = datasets.load_iris()
x = iris.data
y = iris.target
X = x[y < 2, :]
y = y[y < 2]

# Prepend a column of ones (the intercept term), then split into training and validation sets
X = np.hstack((np.ones((X.shape[0], 1)), X))
X_train, X_test, y_train, y_test = train_test_split(X, y)
initial_theta = np.zeros(X.shape[1])

Define the sigmoid function:

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

Define the loss function:

def j(X, y, theta):
    # Cross-entropy loss; the second term is ln(1 - sigmoid(X theta))
    p = sigmoid(X.dot(theta))
    return -(y.dot(np.log(p)) + (1 - y).dot(np.log(1 - p))) / len(y)
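One practical caveat: a loss written with `np.log` of the sigmoid output can produce `nan` or `inf` when the sigmoid saturates at 0 or 1, because it then evaluates log(0). A possible workaround, sketched below, rewrites -ln(σ(z)) as ln(1 + e^(-z)) and -ln(1 - σ(z)) as ln(1 + e^z), and evaluates both with `np.logaddexp`. The function name `stable_loss` and the demo arrays are assumptions for this illustration, not part of the article's code.

```python
import numpy as np

def stable_loss(X, y, theta):
    # -ln(sigmoid(z))     = ln(1 + exp(-z)) = logaddexp(0, -z)
    # -ln(1 - sigmoid(z)) = ln(1 + exp(z))  = logaddexp(0,  z)
    z = X.dot(theta)
    return np.mean(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))

# Tiny made-up dataset: intercept column plus one feature
X_demo = np.array([[1., 0.5], [1., -1.0], [1., 2.0]])
y_demo = np.array([1., 0., 1.])

small_theta = np.array([0.1, 0.2])
large_theta = np.array([0., 1000.])   # pushes the sigmoid to exactly 0 or 1

print(stable_loss(X_demo, y_demo, small_theta))
print(stable_loss(X_demo, y_demo, large_theta))  # stays finite
```

For moderate values of θ this agrees with the direct formula; the difference only shows up when the scores Xθ become very large in magnitude.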

Define the gradient of the loss function:

def dj(X, y, theta):
    return (X.T.dot(sigmoid(X.dot(theta)) - y)) / len(y)

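The gradient descent procedure:

def gd(X, y, theta=initial_theta, eta=0.1, n_iters=1e4, epsilon=1e-8):
    # eta: learning rate; n_iters: iteration cap; epsilon: loss-change threshold for stopping
    cur_iters = 0
    while cur_iters < n_iters:
        next_theta = theta - eta * dj(X, y, theta)
        # Stop once a step no longer changes the loss noticeably
        if abs(j(X, y, theta) - j(X, y, next_theta)) < epsilon:
            break
        else:
            theta = next_theta
        cur_iters += 1
    return theta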

Running gradient descent gives us an optimal theta vector:

best_theta = gd(X_train, y_train)
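To see the descent at work without the iris data, the following sketch runs a fixed number of gradient steps on a small made-up dataset and records the loss after each step. Everything here (`gd_with_history`, the toy arrays, the fixed iteration count in place of the epsilon test) is an assumption for illustration, not the article's implementation.

```python
import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

def loss(X, y, theta):
    p = sigmoid(X.dot(theta))
    return -(y.dot(np.log(p)) + (1 - y).dot(np.log(1 - p))) / len(y)

def gradient(X, y, theta):
    return X.T.dot(sigmoid(X.dot(theta)) - y) / len(y)

def gd_with_history(X, y, eta=0.1, n_iters=1000):
    # Plain gradient descent from theta = 0, keeping the loss after every step
    theta = np.zeros(X.shape[1])
    history = [loss(X, y, theta)]
    for _ in range(n_iters):
        theta = theta - eta * gradient(X, y, theta)
        history.append(loss(X, y, theta))
    return theta, history

# Toy data with one feature; the classes overlap, so the minimum is finite
x_feature = np.array([-2., -1., -0.5, 0.5, 0.3, 1., 2., -0.2])
y_toy = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
X_toy = np.hstack((np.ones((len(x_feature), 1)), x_feature.reshape(-1, 1)))

theta_toy, history = gd_with_history(X_toy, y_toy)
print(history[0], history[-1])  # the loss starts at ln(2) and decreases
```

With theta starting at zero every predicted probability is 0.5, so the first recorded loss is ln(2); watching the history fall from there is a quick sanity check on the learning rate.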

Now check the model's accuracy on the training and validation sets. First define a prediction function:

def predict(X, best_theta=best_theta):
    # X.dot(theta) > 0 is equivalent to sigmoid(X.dot(theta)) > 0.5
    temp = X.dot(best_theta)
    y_predict = np.array(temp > 0, dtype=int)
    return y_predict

y_train_predict = predict(X_train, best_theta)
train_score = np.sum(y_train_predict == y_train) / len(y_train)
y_test_predict = predict(X_test, best_theta)
test_score = np.sum(y_test_predict == y_test) / len(y_test)
print('Accuracy on the training set: {}, accuracy on the validation set: {}'.format(train_score, test_score))

The result:

Accuracy on the training set: 1.0, accuracy on the validation set: 1.0

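As shown, after training on 75 samples, the model predicts the remaining 25 samples with 100% accuracy. Real applications rarely reach accuracy this high; it happens here because the two retained iris classes are easy to separate and there are four features to work with. Even so, the example is a good way to learn the idea behind logistic regression.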
