Implementing Binomial Logistic Regression in Python

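Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Article link: https://blog.csdn.net/qq_41080850/article/details/85793054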

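Note: The training dataset ex2data1.txt used below comes from Andrew Ng's machine learning open course. It contains each student's scores on two exams and whether the student was admitted.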

Code:

%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Load the data:
data = pd.read_csv('ex2data1.txt',names=['Exam1','Exam2','Admitted'])
data.head() # show the first five rows of data

# Plot the admission outcome against the two exam scores:
fig,axes = plt.subplots()
sns.scatterplot(x='Exam1',y='Exam2',hue='Admitted',s=100,style='Admitted',data=data,ax=axes)
axes.set_title("Student's admission situation")
fig.savefig('Admitting.png')

# Define the sigmoid function:
def sigmoid(z):
    return 1/(1 + np.exp(-z))
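
A side note on numerical behavior: the formula above is correct, but for strongly negative inputs `np.exp(-z)` overflows and NumPy emits a RuntimeWarning. The sketch below compares it with `scipy.special.expit`, SciPy's built-in logistic function. The test values in `z` are made up for illustration and do not come from the dataset.

```python
import numpy as np
from scipy.special import expit  # SciPy's built-in logistic sigmoid


def sigmoid(z):
    # Same formula as in the article
    return 1 / (1 + np.exp(-z))


# Illustrative inputs (assumed), including two extreme values
z = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])

# exp(1000) overflows to inf; silence the warning for this comparison
with np.errstate(over='ignore'):
    naive = sigmoid(z)

stable = expit(z)  # same function, evaluated without overflow warnings

print(naive)
print(stable)
```

Both versions return matching values here, so swapping in `expit` is optional and only removes the warning.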

# Plot the sigmoid function:
x = np.arange(-10,10,0.1)
fig, axes = plt.subplots()
axes.plot(x, sigmoid(x), 'r')
axes.set_title('sigmoid function')
fig.savefig('sigmoid.png')

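# Data preprocessing:

# Insert a column of ones into data (the intercept term x_0)
data.insert(0,'Ones',1)

# Extract the features X and the labels y
X = data.iloc[:,:-1]   # X is the first three columns of data (Ones, Exam1, Exam2); the index is not included
y = data.iloc[:,-1]    # y is the last column of data (Admitted)

# Convert X and y to NumPy arrays
X = X.values         # X is a 2-D array
y = y.values         # y is a 1-D array
theta = np.zeros(3)  # theta is a 1-D array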


# Define the cost function (mean cross-entropy loss):
def cost(theta,X,y):
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))
# X @ theta is the product of a 2-D array and a 1-D array; it is equivalent to np.dot(X,theta)
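
If the optimizer visits large parameter values, `sigmoid(X @ theta)` can round to exactly 0 or 1, and the cost above then evaluates `log(0)`. A minimal sketch of an algebraically equivalent form, using the identity cost = mean(log(1 + e^z) - y*z) with z = X @ theta, is shown below. The names `cost_stable` and `cost_naive` and the toy data are assumptions made for this illustration; they are not part of the original code.

```python
import numpy as np


def cost_naive(theta, X, y):
    # Same expression as the article's cost function
    h = 1 / (1 + np.exp(-(X @ theta)))
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))


def cost_stable(theta, X, y):
    # log(1 + exp(z)) is computed as logaddexp(0, z), so log(0) is never evaluated
    z = X @ theta
    return np.mean(np.logaddexp(0, z) - y * z)


# Made-up toy data (not ex2data1.txt); the first column is the intercept term
X_toy = np.array([[1.0, 0.5, 1.5],
                  [1.0, 2.0, 0.3],
                  [1.0, 1.2, 2.2],
                  [1.0, 3.0, 0.1]])
y_toy = np.array([0, 1, 0, 1])

theta_small = np.array([0.1, -0.2, 0.3])
theta_large = np.array([0.0, 400.0, -400.0])  # saturates the sigmoid in float64

# For moderate parameters the two forms agree
print(cost_naive(theta_small, X_toy, y_toy), cost_stable(theta_small, X_toy, y_toy))

# For extreme parameters the naive form hits log(0); the stable form stays finite
with np.errstate(divide='ignore', invalid='ignore', over='ignore'):
    naive_large = cost_naive(theta_large, X_toy, y_toy)
stable_large = cost_stable(theta_large, X_toy, y_toy)
print(naive_large, stable_large)
```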


# Define the gradient function (vectorized):
def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)
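
Before handing an analytic gradient to an optimizer, it can be worth comparing it against a finite-difference approximation. The sketch below does this with `scipy.optimize.check_grad`, which returns the 2-norm of the difference between the two. The random toy data and the starting point `theta0` are assumptions for illustration only; `cost` and `gradient` repeat the article's definitions so that the block runs on its own.

```python
import numpy as np
from scipy.optimize import check_grad


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))


def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)


# Random toy data (assumed, not ex2data1.txt): intercept column plus two features
rng = np.random.default_rng(0)
X_toy = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y_toy = rng.integers(0, 2, size=20)
theta0 = np.array([0.1, -0.3, 0.2])

# Norm of (analytic gradient - finite-difference gradient) at theta0
err = check_grad(cost, gradient, theta0, X_toy, y_toy)
print(err)
```

A value close to zero suggests the two gradients agree; a large value usually points to a sign or scaling mistake in `gradient`.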


# Solve for the theta parameters of the logistic regression model by calling scipy.optimize.minimize:
import scipy.optimize as opt
res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
# The fitted parameters are stored in res.x; its value is array([-25.1574502 , 0.20620065, 0.20144018])


# Define the prediction function:
def predict(theta, X):
    probability = sigmoid(X @ theta)
    return [1 if x >= 0.5 else 0 for x in probability]


# Compute the prediction accuracy on the training set:
theta_min = res.x
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
accuracy = 100 * sum(correct) / len(correct)  # percentage of correct predictions
print('accuracy = {0}%'.format(accuracy))
# the accuracy comes out to 89%
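
The accuracy can also be computed without Python loops by comparing arrays element-wise. The sketch below is one way to do that, under the assumption that `predict` returns a NumPy array instead of a list. The parameters `theta_toy` and the four toy samples are made up for illustration.

```python
import numpy as np


def predict(theta, X):
    # Probabilities >= 0.5 map to class 1; returns a 0/1 integer array
    probability = 1 / (1 + np.exp(-(X @ theta)))
    return (probability >= 0.5).astype(int)


# Made-up parameters and samples (not fitted to ex2data1.txt)
theta_toy = np.array([-3.0, 1.0, 1.0])
X_toy = np.array([[1.0, 1.0, 1.0],   # z = -1.0, predicted 0
                  [1.0, 2.0, 2.0],   # z =  1.0, predicted 1
                  [1.0, 0.5, 1.0],   # z = -1.5, predicted 0
                  [1.0, 3.0, 1.0]])  # z =  1.0, predicted 1
y_toy = np.array([0, 1, 1, 1])

predictions = predict(theta_toy, X_toy)
# The mean of a boolean array is the fraction of True values
accuracy = 100 * np.mean(predictions == y_toy)
print('accuracy = {0}%'.format(accuracy))  # 3 of 4 correct -> 75.0%
```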


# Plot the decision boundary:
# The decision boundary satisfies theta_0*x_0 + theta_1*x_1 + theta_2*x_2 = 0, with x_0 = 1
# Solving for x_2 gives x_2 = new_theta[0] + new_theta[1]*x_1
new_theta = -(res.x / res.x[2])
x_line = np.arange(100,step=0.1)
y_line = new_theta[0] + new_theta[1]*x_line  # simplified decision boundary equation

fig,axes = plt.subplots()
sns.scatterplot(x='Exam1',y='Exam2',hue='Admitted',s=100,style='Admitted',data=data,ax=axes)
axes.set_title("Student's admission situation")

axes.plot(x_line,y_line,'black')
fig.savefig('Decision_boundary.png')
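
As a quick sanity check on the boundary equation, every point on the line should get a predicted probability of 0.5. The sketch below verifies this using the theta values reported earlier in the article. The helper name `boundary_exam2` and the sample Exam1 scores are assumptions made for illustration.

```python
import numpy as np

# Fitted parameters as reported in the article (res.x)
theta = np.array([-25.1574502, 0.20620065, 0.20144018])


def boundary_exam2(exam1, theta):
    # Solve theta_0 + theta_1*x_1 + theta_2*x_2 = 0 for x_2
    return -(theta[0] + theta[1] * exam1) / theta[2]


# Illustrative Exam1 scores (assumed)
exam1 = np.array([30.0, 50.0, 70.0, 90.0])
exam2 = boundary_exam2(exam1, theta)

# Same result via the article's normalized parameters
new_theta = -(theta / theta[2])
exam2_alt = new_theta[0] + new_theta[1] * exam1

# Evaluate the model on the boundary points; each probability should be about 0.5
X_line = np.column_stack([np.ones_like(exam1), exam1, exam2])
prob = 1 / (1 + np.exp(-(X_line @ theta)))

print(exam2)
print(prob)
```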

The code above produces the scatter plot with the decision boundary overlaid, saved as Decision_boundary.png.

References:

Andrew Ng's machine learning open course

Li Hang, Statistical Learning Methods (《统计学习方法》)

https://chenrudan.github.io/blog/2016/01/09/logisticregression.html (highly recommended)

https://blog.csdn.net/han_xiaoyang/article/details/49123419

https://blog.csdn.net/lilyth_lilyth/article/details/10032993

http://seaborn.pydata.org/generated/seaborn.scatterplot.html?highlight=seaborn%20scatterplot#seaborn.scatterplot

https://www.jianshu.com/p/482abac8798c
