Logistic Regression
Problem description: the features are a student's scores on two exams, and the label is whether the student was admitted to university.
Notes:
- Here a scipy library routine carries out the actual gradient-descent iterations, so there is no need to set a step size or an iteration count by hand; however, how the cost is computed and how the gradient is obtained must be passed to scipy as functions.
- numpy is strict about array shapes when performing matrix operations. The program calls reshape() several times to turn data shaped like (100,) into (100, 1) so that matrix multiplication gives the correct result; see the short sketch after these notes.
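For instance, a minimal sketch of the shape issue (illustrative only, not part of the exercise code):

import numpy as np

theta = np.zeros(3)     # shape (3,)
X = np.ones((100, 3))   # pretend design matrix: 100 samples, 3 features

print((X @ theta).shape)                # (100,)  -- a flat vector
print((X @ theta.reshape(3, 1)).shape)  # (100, 1) -- a column, matching y

Subtracting a (100,) result from a (100, 1) label array would silently broadcast to (100, 100) instead of raising an error, which is exactly the bug the reshape calls prevent.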
First, load and visualize the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt

def loadData(filename):
    return pd.read_csv(filename, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])

def showData(data):
    positive = data[data['Admitted'].isin([1])]
    negative = data[data['Admitted'].isin([0])]
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
    ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
    ax.legend()
    ax.set_xlabel('Exam 1 Score')
    ax.set_ylabel('Exam 2 Score')
    plt.show()

data = loadData('ex2data1.txt')
print(data.head())
print(data.describe())
showData(data)
      Exam 1     Exam 2  Admitted
0  34.623660  78.024693         0
1  30.286711  43.894998         0
2  35.847409  72.902198         0
3  60.182599  86.308552         1
4  79.032736  75.344376         1
           Exam 1      Exam 2    Admitted
count  100.000000  100.000000  100.000000
mean    65.644274   66.221998    0.600000
std     19.458222   18.582783    0.492366
min     30.058822   30.603263    0.000000
25%     50.919511   48.179205    0.000000
50%     67.032988   67.682381    1.000000
75%     80.212529   79.360605    1.000000
max     99.827858   98.869436    1.000000
Next, preprocess the data (converting X, y, and theta to numpy arrays):
def initData(data):
    # Prepend a column of ones so theta[0] acts as the intercept term
    data.insert(0, 'Ones', 1)
    cols = data.shape[1]
    X = data.iloc[:, 0: cols - 1]
    y = data.iloc[:, cols - 1: cols]
    X = np.array(X.values)
    y = np.array(y.values)
    theta = np.zeros(3)
    return X, y, theta
data = loadData('ex2data1.txt')
X, y, theta = initData(data)
print(X.shape, theta.shape, y.shape)
(100, 3) (3,) (100, 1)
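A side note: initData mutates its argument via data.insert, and pandas raises a ValueError if the 'Ones' column already exists, which is why the script reloads the data before every call. A guarded variant (a sketch; initDataSafe is not part of the original exercise) would be:

def initDataSafe(data):
    # DataFrame.insert raises ValueError on a duplicate column, so guard it
    if 'Ones' not in data.columns:
        data.insert(0, 'Ones', 1)
    X = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1:].values)
    theta = np.zeros(X.shape[1])
    return X, y, theta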
Compute the cost according to the following formula, where h_\theta(x) = \mathrm{sigmoid}(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} is the hypothesis:

J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_\theta(x^{(i)}) - \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]
# Helper: the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Compute the cost
def cost(theta, X, y):
    # Number of parameters in theta
    n = len(theta)
    first = -y * np.log(sigmoid(X @ theta.reshape(n, 1)))
    second = -(1 - y) * np.log(1 - sigmoid(X @ theta.reshape(n, 1)))
    return np.sum(first + second) / len(X)
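A quick sanity check before optimizing: with theta initialized to all zeros, sigmoid(X @ theta) is 0.5 everywhere, so the cost must come out to ln(2) ≈ 0.693 regardless of the data:

X, y, theta = initData(loadData('ex2data1.txt'))
print(cost(theta, X, y))  # expect about 0.6931 (= ln 2), since sigmoid(0) = 0.5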
Each component of the gradient is given by the following formula (note that the function below only evaluates the gradient; the actual descent iterations are handled by scipy):

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
# Evaluate the gradient (a single gradient computation, not a full descent)
def gradient(theta, X, y):
    # Number of parameters in theta
    n = len(theta)
    # Number of training examples
    m = len(y)
    grad = np.zeros(n)
    error = sigmoid(X @ theta.reshape(n, 1)) - y
    for i in range(n):
        term = error * X[:, i].reshape(m, 1)
        grad[i] = np.sum(term) / m
    return grad
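The per-parameter loop can also be collapsed into a single matrix product. A vectorized sketch that is mathematically equivalent to the function above (gradientVectorized is not the original code):

def gradientVectorized(theta, X, y):
    m, n = X.shape
    # (m, 1) column of residuals h(x) - y
    error = sigmoid(X @ theta.reshape(n, 1)) - y
    # X.T @ error sums error * X[:, i] over all samples for every i at once
    return (X.T @ error).flatten() / m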
Finally, call scipy's TNC (truncated Newton) routine to find the optimal parameters:
data = loadData('ex2data1.txt')
X, y, theta = initData(data)
# Use SciPy's truncated Newton (TNC) solver to find the optimal parameters
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
print(result)
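fmin_tnc returns a tuple (theta, nfeval, rc): the optimized parameters, the number of function evaluations, and a return code. To check how well the fitted parameters separate the two classes, here is a sketch of a prediction step (the predict helper below is an illustration, not part of the code above):

def predict(theta, X):
    # Classify as admitted (1) when the predicted probability reaches 0.5
    probability = sigmoid(X @ theta.reshape(len(theta), 1))
    return (probability >= 0.5).astype(int)

theta_opt = result[0]
accuracy = np.mean(predict(theta_opt, X) == y)
print('train accuracy: {:.0%}'.format(accuracy))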