机器学习——逻辑回归
(一)逻辑回归原理推导
从二元的分类问题开始讨论,将因变量(dependent variable)可能属于的两个类分别称为负向类(negative class)和正向类(positive class),则因变量 y ∈ 0 , 1 y\in 0,1 y∈0,1,其中 0 表示负向类,1 表示正向类。
逻辑回归中选择对数几率函数(logistic function)作为激活函数,对数几率函数是Sigmoid函数(形状为S的函数)的重要代表。
g
(
z
)
=
1
1
+
e
−
z
g(z)=\frac{1}{1+e^{-z}}
g(z)=1+e−z1
自变量取值为任意实数,值域[0,1]
解释:将任意的输入映射到[0,1]区间,在线性回归中可以得到一个预测值,再将该值映射到Sigmoid函数中,这样就完成了由值到概率的转换,也就是分类任务。
预测函数:
h θ ( x ) = g ( θ T x ) = 1 1 + e − θ T x h_{\theta}(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta ^{T}x}} hθ(x)=g(θTx)=1+e−θTx1
其中 θ 0 + θ 1 x 1 + . . . + θ n x n = ∑ i = 1 n θ i x i = θ T x \theta _{0}+\theta _{1}x_{1}+...+\theta _{n}x_{n}=\sum_{i=1}^{n}\theta _{i}x_{i}=\theta ^{T}x θ0+θ1x1+...+θnxn=∑i=1nθixi=θTx
在逻辑回归中,预测:
当
h
θ
(
x
)
h_{\theta}(x)
hθ(x) >= 0.5时,预测 ? = 1。
当
h
θ
(
x
)
h_{\theta}(x)
hθ(x) < 0.5时,预测 ? = 0 。
根据上面绘制出的 S 形函数图像,我们知道当 :
z
z
z = 0 时
g
(
z
)
g(z)
g(z) = 0.5
z
z
z> 0 时
g
(
z
)
g(z)
g(z)> 0.5
z
z
z< 0 时
g
(
z
)
g(z)
g(z) < 0.5
又
z
=
θ
T
x
z=\theta^{T}x
z=θTx ,即:
θ
T
x
\theta^{T}x
θTx >= 0 时,预测 ? = 1
θ
T
x
\theta^{T}x
θTx < 0 时,预测 ? = 0
判定边界(对不同类别的数据分割的边界,边界的两旁应该是不同类别的数据),现在假设我们有一个模型:
并且参数? 是向量[-3 1 1]。 则当
−
3
+
x
1
+
x
2
≥
0
-3+x_{1}+x_{2}\geq 0
−3+x1+x2≥0,即
x
1
+
x
2
≥
3
x_{1}+x_{2}\geq 3
x1+x2≥3时,模型将预测 ? =1。 我们可以绘制直线
x
1
+
x
2
≥
3
x_{1}+x_{2}\geq 3
x1+x2≥3,这条线便是我们模型的分界线,将预测为 1 的区域和预测为 0 的区域分隔开。
假使我们的数据呈现这样的分布情况:
因为需要用曲线才能分隔 ? = 0 的区域和 ? = 1 的区域,我们需要二次方特征:
h
θ
(
x
)
=
g
(
θ
0
+
θ
1
x
1
+
θ
2
x
2
+
θ
3
x
1
2
+
θ
4
x
2
2
)
h_{\theta}(x)=g(\theta _{0}+\theta _{1}x_{1}+\theta _{2}x_{2}+\theta _{3}x_{1}^{2}+\theta _{4}x_{2}^{2})
hθ(x)=g(θ0+θ1x1+θ2x2+θ3x12+θ4x22)是[-1 0 0 1 1],则我们得到的判定边界恰好是圆点在原点且半径为 1 的圆形。
理论上说,只要 h θ ( x ) h_{\theta}(x) hθ(x)设计足够合理,准确的说是 g ( θ T x ) g(\theta^{T}x) g(θTx)中 θ T x \theta^{T}x θTx足够复杂,我们能在不同的情形下,拟合出不同的判定边界,从而把不同的样本点分隔开来。
分类任务:
P ( y = 1 ∣ x ; θ ) = h θ ( x ) P(y=1|x;\theta)=h_{\theta}(x) P(y=1∣x;θ)=hθ(x)
P ( y = 0 ∣ x ; θ ) = 1 − h θ ( x ) P(y=0|x;\theta)=1-h_{\theta}(x) P(y=0∣x;θ)=1−hθ(x)
整合:
P ( y ∣ x ; θ ) = h θ ( x ) y ( 1 − h θ ( x ) ) 1 − y P(y|x;\theta)=h_{\theta}(x)^{y}(1-h_{\theta}(x))^{1-y} P(y∣x;θ)=hθ(x)y(1−hθ(x))1−y
解释:对于二分类任务(0,1),整合后
- y取0只保留 ( 1 − h θ ( x ) ) 1 − y (1-h_{\theta}(x))^{1-y} (1−hθ(x))1−y
- y取1只保留 h θ ( x ) y h_{\theta}(x)^{y} hθ(x)y
似然函数:
L
(
θ
)
=
∏
i
=
1
m
P
(
y
i
∣
x
i
;
θ
)
=
∏
i
=
1
m
(
h
θ
(
x
i
)
y
i
)
(
1
−
h
θ
(
x
i
)
)
1
−
y
i
L(\theta)=\prod_{i=1}^{m}P(y_{i}|x_{i};\theta)=\prod_{i=1}^{m}(h_{\theta}(x_{i})^{y_{i}})(1-h_{\theta}(x_{i}))^{1-y_{i}}
L(θ)=∏i=1mP(yi∣xi;θ)=∏i=1m(hθ(xi)yi)(1−hθ(xi))1−yi
对数似然 :
l ( θ ) = l o g L ( θ ) = ∑ i = 1 m ( y i l o g h θ ( x i ) + ( 1 − y i ) l o g ( 1 − h θ ( x i ) ) ) l(\theta)=logL(\theta)=\sum_{i=1}^{m}(y_{i}logh_{\theta}(x_{i})+(1-y_{i})log(1-h_{\theta}(x_{i}))) l(θ)=logL(θ)=∑i=1m(yiloghθ(xi)+(1−yi)log(1−hθ(xi)))
此时应用梯度上升求最大值,引入 J ( θ ) = − 1 m l ( θ ) J(\theta)=-\frac{1}{m}l(\theta) J(θ)=−m1l(θ)转换为梯度下降任务。梯度下降算法是调整参数θ使得代价函数J(θ)取得最小值的最基本方法之一。
∂ J ( θ ) ∂ θ j = 1 m ∑ i = 1 m ( h θ ( x ( i ) ) − y ( i ) ) x j ( i ) \frac{\partial J(\theta)}{\partial\theta_{j}}=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)} ∂θj∂J(θ)=m1∑i=1m(hθ(x(i))−y(i))xj(i)
参数更新:
θ
j
:
=
θ
j
−
1
m
∑
i
=
1
m
(
h
θ
(
x
(
i
)
)
−
y
(
i
)
)
x
j
(
i
)
\theta_{j}:=\theta_{j}-\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}
θj:=θj−m1∑i=1m(hθ(x(i))−y(i))xj(i)
(二)逻辑回归代码推导
逻辑回归算法,实现对数据的预处理操作,梯度下降优化,损失与预测:
import numpy as np
from scipy.optimize import minimize
from utils.features import prepare_for_training
from utils.hypothesis import sigmoid
class LogisticRegression:
#初始化函数
def __init__(self,data,labels,polynomial_degree = 0,sinusoid_degree = 0,normalize_data=False):
"""
1.对数据进行预处理操作
2.先得到所有的特征个数
3.初始化参数矩阵
"""
(data_processed,
features_mean,
features_deviation) = prepare_for_training(data, polynomial_degree, sinusoid_degree,normalize_data=False)
self.data = data_processed
self.labels = labels
self.unique_labels = np.unique(labels) #不同的标签数量
self.features_mean = features_mean
self.features_deviation = features_deviation
self.polynomial_degree = polynomial_degree
self.sinusoid_degree = sinusoid_degree
self.normalize_data = normalize_data
num_features = self.data.shape[1]
num_unique_labels = np.unique(labels).shape[0]
self.theta = np.zeros((num_unique_labels,num_features)) #1.要做多少类别 2.特征数量
#训练函数
def train(self,max_iterations=1000):
cost_histories = [] #损失值
num_features = self.data.shape[1] #特征个数
#训练多个二分类模型
for label_index,unique_label in enumerate(self.unique_labels):
current_initial_theta = np.copy(self.theta[label_index].reshape(num_features,1))
current_lables = (self.labels == unique_label).astype(float) #判断标签是不是跟训练类别一致
(current_theta,cost_history) = LogisticRegression.gradient_descent(self.data,current_lables,current_initial_theta,max_iterations) #梯度下降
self.theta[label_index] = current_theta.T
cost_histories.append(cost_history)
return self.theta,cost_histories
@staticmethod
#梯度下降
def gradient_descent(data,labels,current_initial_theta,max_iterations):
cost_history = []
num_features = data.shape[1]
result = minimize(
#要优化的目标:
lambda current_theta:LogisticRegression.cost_function(data,labels,current_theta.reshape(num_features,1)),
#初始化的权重参数
current_initial_theta,
#选择优化策略
method = 'CG',
# 梯度下降迭代计算公式
jac = lambda current_theta:LogisticRegression.gradient_step(data,labels,current_theta.reshape(num_features,1)),
# 记录结果
callback = lambda current_theta:cost_history.append(LogisticRegression.cost_function(data,labels,current_theta.reshape((num_features,1)))),
# 迭代次数
options={'maxiter': max_iterations}
)
if not result.success:
raise ArithmeticError('Can not minimize cost function'+result.message)
optimized_theta = result.x.reshape(num_features,1)
return optimized_theta,cost_history
@staticmethod
#计算损失
def cost_function(data,labels,theat):
num_examples = data.shape[0] #总共数据量
predictions = LogisticRegression.hypothesis(data,theat)
y_is_set_cost = np.dot(labels[labels == 1].T,np.log(predictions[labels == 1])) #属于当前类别的总的损失
y_is_not_set_cost = np.dot(1-labels[labels == 0].T,np.log(1-predictions[labels == 0])) #不是当前类别的总的损失
cost = (-1/num_examples)*(y_is_set_cost+y_is_not_set_cost) #总损失
return cost
@staticmethod
#预测函数
def hypothesis(data,theat):
predictions = sigmoid(np.dot(data,theat))
return predictions
@staticmethod
def gradient_step(data,labels,theta):
num_examples = labels.shape[0]
predictions = LogisticRegression.hypothesis(data,theta)
label_diff = predictions- labels
gradients = (1/num_examples)*np.dot(data.T,label_diff)
return gradients.T.flatten()
def predict(self,data):
num_examples = data.shape[0]
data_processed = prepare_for_training(data, self.polynomial_degree, self.sinusoid_degree,self.normalize_data)[0] #数据预处理
prob = LogisticRegression.hypothesis(data_processed,self.theta.T) #得到预测完概率值
max_prob_index = np.argmax(prob, axis=1) #算出三个里面最大的概率值索引
class_prediction = np.empty(max_prob_index.shape,dtype=object)
for index,label in enumerate(self.unique_labels): #枚举每一个类别
class_prediction[max_prob_index == index] = label #所有类别实际的名字
return class_prediction.reshape((num_examples,1))
导入数据,画图展示:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from logistic_regression import LogisticRegression
data = pd.read_csv('../data/iris.csv')
iris_types = ['SETOSA','VERSICOLOR','VIRGINICA'] #三种不同类别的花
x_axis = 'petal_length'
y_axis = 'petal_width'
for iris_type in iris_types:
plt.scatter(data[x_axis][data['class']==iris_type],
data[y_axis][data['class']==iris_type],
label = iris_type
)
plt.show()
接下来训练多分类模型:
num_examples = data.shape[0]
x_train = data[[x_axis,y_axis]].values.reshape((num_examples,2))
y_train = data['class'].values.reshape((num_examples,1))
max_iterations = 1000 #最大迭代次数
polynomial_degree = 0
sinusoid_degree = 0
logistic_regression = LogisticRegression(x_train,y_train,polynomial_degree,sinusoid_degree)
thetas,cost_histories = logistic_regression.train(max_iterations) #开始训练
labels = logistic_regression.unique_labels
plt.plot(range(len(cost_histories[0])),cost_histories[0],label = labels[0])
plt.plot(range(len(cost_histories[1])),cost_histories[1],label = labels[1])
plt.plot(range(len(cost_histories[2])),cost_histories[2],label = labels[2])
plt.show()
不同的线代表梯度下降的结果,橙色,绿色,蓝色,分别代表做了三个二分类,在做三个二分类时,损失值如图变化。
绘制决策边界
#预测结果
y_train_prections = logistic_regression.predict(x_train)
#计算准确率
precision = np.sum(y_train_prections == y_train)/y_train.shape[0] * 100
print ('precision:',precision)
x_min = np.min(x_train[:,0])
x_max = np.max(x_train[:,0])
y_min = np.min(x_train[:,1])
y_max = np.max(x_train[:,1])
samples= 150 #样本个数
#取数据
X = np.linspace(x_min,x_max,samples)
Y = np.linspace(y_min,y_max,samples)
Z_SETOSA = np.zeros((samples,samples)) #初始化【150 150】
Z_VERSICOLOR = np.zeros((samples,samples))
Z_VIRGINICA = np.zeros((samples,samples))
for x_index,x in enumerate(X):
for y_index,y in enumerate(Y):
data = np.array([[x,y]])
prediction = logistic_regression.predict(data)[0][0]
if prediction == 'SETOSA':
Z_SETOSA[x_index][y_index] =1
elif prediction == 'VERSICOLOR':
Z_VERSICOLOR[x_index][y_index] =1
elif prediction == 'VIRGINICA':
Z_VIRGINICA[x_index][y_index] =1
#对于每一个类别分别展示
for iris_type in iris_types:
plt.scatter(
x_train[(y_train == iris_type).flatten(),0],
x_train[(y_train == iris_type).flatten(),1],
label = iris_type
)
plt.contour(X,Y,Z_SETOSA)
plt.contour(X,Y,Z_VERSICOLOR)
plt.contour(X,Y,Z_VIRGINICA)
plt.show()
非线性决策边界举例:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from logistic_regression import LogisticRegression
data = pd.read_csv('../data/microchips-tests.csv')
# 类别标签
validities = [0, 1]
# 选择两个特征
x_axis = 'param_1'
y_axis = 'param_2'
# 散点图
for validity in validities:
plt.scatter(
data[x_axis][data['validity'] == validity],
data[y_axis][data['validity'] == validity],
label=validity
)
plt.xlabel(x_axis)
plt.ylabel(y_axis)
plt.title('Microchips Tests')
plt.legend()
plt.show()
num_examples = data.shape[0]
x_train = data[[x_axis, y_axis]].values.reshape((num_examples, 2))
y_train = data['validity'].values.reshape((num_examples, 1))
# 训练参数
max_iterations = 100000
regularization_param = 0
polynomial_degree = 5
sinusoid_degree = 0
# 逻辑回归
logistic_regression = LogisticRegression(x_train, y_train, polynomial_degree, sinusoid_degree)
# 训练
(thetas, costs) = logistic_regression.train(max_iterations)
columns = []
for theta_index in range(0, thetas.shape[1]):
columns.append('Theta ' + str(theta_index));
# 训练结果
labels = logistic_regression.unique_labels
plt.plot(range(len(costs[0])), costs[0], label=labels[0])
plt.plot(range(len(costs[1])), costs[1], label=labels[1])
plt.xlabel('Gradient Steps')
plt.ylabel('Cost')
plt.legend()
plt.show()
# 预测
y_train_predictions = logistic_regression.predict(x_train)
# 准确率
precision = np.sum(y_train_predictions == y_train) / y_train.shape[0] * 100
print('Training Precision: {:5.4f}%'.format(precision))
num_examples = x_train.shape[0]
samples = 150
x_min = np.min(x_train[:, 0])
x_max = np.max(x_train[:, 0])
y_min = np.min(x_train[:, 1])
y_max = np.max(x_train[:, 1])
X = np.linspace(x_min, x_max, samples)
Y = np.linspace(y_min, y_max, samples)
Z = np.zeros((samples, samples))
# 结果展示
for x_index, x in enumerate(X):
for y_index, y in enumerate(Y):
data = np.array([[x, y]])
Z[x_index][y_index] = logistic_regression.predict(data)[0][0]
positives = (y_train == 1).flatten()
negatives = (y_train == 0).flatten()
plt.scatter(x_train[negatives, 0], x_train[negatives, 1], label='0')
plt.scatter(x_train[positives, 0], x_train[positives, 1], label='1')
plt.contour(X, Y, Z)
plt.xlabel('param_1')
plt.ylabel('param_2')
plt.title('Microchips Tests')
plt.legend()
plt.show()
可以看到逻辑回归不仅能解决线性问题,非线性问题也能解决。
(三)逻辑回归实验分析
在线性回归问题中,得到了具体的回归值,逻辑回归中借助sigmoid函数完成了数值映射,通过概率值比较来完成分类任务。
画出sigmoid函数
import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
t = np.linspace(-10, 10, 100)
sig = 1 / (1 + np.exp(-t))
plt.figure(figsize=(9, 3))
plt.plot([-10, 10], [0, 0], "k-")
plt.plot([-10, 10], [0.5, 0.5], "k:")
plt.plot([-10, 10], [1, 1], "k:")
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \frac{1}{1 + e^{-t}}$")
plt.xlabel("t")
plt.legend(loc="upper left", fontsize=20)
plt.axis([-10, 10, -0.1, 1.1])
plt.title('Logistic function')
plt.show()
from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys()) #查看中有哪些属性
[‘data’, ‘target’, ‘target_names’, ‘DESCR’, ‘feature_names’, ‘filename’]
X = iris['data'][:,3:]
y = (iris['target'] == 2).astype(np.int) #把该标签设置为1,其他标签设置为0
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
from sklearn.linear_model import LogisticRegression
log_res = LogisticRegression()
log_res.fit(X,y) #训练模型
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=‘warn’,
n_jobs=None, penalty=‘l2’, random_state=None, solver=‘warn’,
tol=0.0001, verbose=0, warm_start=False)
#测试数据
X_new = np.linspace(0,3,1000).reshape(-1,1) #在0到3之间取1000个数,reshape成一维的,第一个-1,第二个1
y_proba = log_res.predict_proba(X_new) #用刚才创建的模型得到预测概率值
y_proba
array([[0.98554411, 0.01445589],
[0.98543168, 0.01456832],
[0.98531838, 0.01468162],
…,
[0.02618938, 0.97381062],
[0.02598963, 0.97401037],
[0.02579136, 0.97420864]])
左边表示属于当前类别,右边表示不属于当前类别 。随着输入特征数值的变化,结果概率值也会随之变化。
plt.figure(figsize=(12,5))
decision_boundary = X_new[y_proba[:,1]>=0.5][0]
plt.plot([decision_boundary,decision_boundary],[-1,2],'k:',linewidth = 2) #中间边界
plt.plot(X_new,y_proba[:,1],'g-',label = 'Iris-Virginica')
plt.plot(X_new,y_proba[:,0],'b--',label = 'Not Iris-Virginica')
plt.arrow(decision_boundary,0.08,-0.3,0,head_width = 0.05,head_length=0.1,fc='b',ec='b') #蓝色箭头
plt.arrow(decision_boundary,0.92,0.3,0,head_width = 0.05,head_length=0.1,fc='g',ec='g') #绿色箭头
plt.text(decision_boundary+0.02,0.15,'Decision Boundary',fontsize = 16,color = 'k',ha='center') #中间字体
plt.xlabel('Peta width(cm)',fontsize = 16)
plt.ylabel('y_proba',fontsize = 16)
plt.axis([0,3,-0.02,1.02])
plt.legend(loc = 'center left',fontsize = 16)
可以看到,随着花瓣宽度的增加,是Iris-Virginica花的可能性越来越大,不是Iris-Virginica花的可能性越来越小,当花瓣宽度在1.6的时候,很难确定究竟是什么种类。
X = iris['data'][:,(2,3)] #选择所有样本的第二维,第三维
y = (iris['target']==2).astype(np.int) #一个类别为1,其他类别为0
from sklearn.linear_model import LogisticRegression
log_res = LogisticRegression(C = 10000) #构建逻辑回归模型,C越大,正则化惩罚力度越小
log_res.fit(X,y) #训练
LogisticRegression(C=10000, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=‘warn’,
n_jobs=None, penalty=‘l2’, random_state=None, solver=‘warn’,
tol=0.0001, verbose=0, warm_start=False)
X[:,0].min(),X[:,0].max() #第一个维度最小值,最大值
(1.0, 6.9)
X[:,1].min(),X[:,1].max()
(0.1, 2.5)
构建坐标数据:
x0,x1 = np.meshgrid(np.linspace(2.9,7,500).reshape(-1,1),np.linspace(0.8,2.7,200).reshape(-1,1))
整合坐标点,得到所有测试输入数据坐标点:
X_new = np.c_[x0.ravel(),x1.ravel()]
X_new #测试数据
array([[2.9 , 0.8 ],
[2.90821643, 0.8 ],
[2.91643287, 0.8 ],
…,
[6.98356713, 2.7 ],
[6.99178357, 2.7 ],
[7. , 2.7 ]])
X_new.shape
(100000, 2)
预测,得到所有点的概率值:
y_proba = log_res.predict_proba(X_new) #用训练好的模型得概率值
x0.shape
(200, 500)
x1.shape
(200, 500)
绘制等高线,完成决策边界:
plt.figure(figsize=(10,4))
plt.plot(X[y==0,0],X[y==0,1],'bs') #第一类点蓝色
plt.plot(X[y==1,0],X[y==1,1],'g^') #第二类点绿色
zz = y_proba[:,1].reshape(x0.shape) #概率值,1代表属于1这个类别
contour = plt.contour(x0,x1,zz,cmap=plt.cm.brg) #等高线值
plt.clabel(contour,inline = 1)
plt.axis([2.9,7,0.8,2.7])
plt.text(3.5,1.5,'NOT Vir',fontsize = 16,color = 'b')
plt.text(6.5,2.3,'Vir',fontsize = 16,color = 'g')
zz设置的是属于类别1(绿色三角)的概率,离绿色三角越近,等高线值越大,随着等高线往下走,属于绿色类别的概率越来越小。
多类别分类:
softmax计算概率:
P
k
^
=
σ
(
s
(
x
)
)
k
=
e
x
p
(
s
k
(
x
)
)
∑
j
=
1
k
e
x
p
(
s
j
(
x
)
)
\hat{P_{k}}=\sigma(s(x))_{k}=\frac{exp(s_{k}(x))}{\sum_{j=1}^{k}exp(s_{j}(x))}
Pk^=σ(s(x))k=∑j=1kexp(sj(x))exp(sk(x))
损失函数(交叉熵):
J
(
θ
)
=
−
1
m
∑
i
=
1
m
∑
k
=
1
k
y
k
(
i
)
l
o
g
(
P
k
(
i
)
^
)
J(\theta )=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{k}y_{k}^{(i)}log(\hat{P_{k}^{(i)}})
J(θ)=−m1i=1∑mk=1∑kyk(i)log(Pk(i)^)
X = iris['data'][:,(2,3)]
y = iris['target']
softmax_reg = LogisticRegression(multi_class = 'multinomial',solver='lbfgs') #逻辑回归多分类问题
softmax_reg.fit(X,y) #训练操作
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=‘multinomial’,
n_jobs=None, penalty=‘l2’, random_state=None, solver=‘lbfgs’,
tol=0.0001, verbose=0, warm_start=False)
softmax_reg.predict([[5,2]])
array([2]) #属于第三个类别
softmax_reg.predict_proba([[5,2]])
array([[2.43559894e-04, 2.14859516e-01, 7.84896924e-01]]) #三个类别的概率值
x0, x1 = np.meshgrid(
np.linspace(0, 8, 500).reshape(-1, 1),
np.linspace(0, 3.5, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)
zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)
plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris-Virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris-Versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris-Setosa")
from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()
比较三种不同停止策略对逻辑回归的影响:
STOP_ITER = 0 #根据迭代次数
STOP_COST = 1 #根据损失值目标函数变化
STOP_GRAD = 2 #根据梯度
def stopCriterion(type, value, threshold):
#设定三种不同的停止策略
if type == STOP_ITER: return value > threshold
elif type == STOP_COST: return abs(value[-1]-value[-2]) < threshold
elif type == STOP_GRAD: return np.linalg.norm(value) < threshold
#计算梯度
def gradient(X, y, theta):
grad = np.zeros(theta.shape)
error = (model(X, theta)- y).ravel()
for j in range(len(theta.ravel())): #for each parmeter
term = np.multiply(error, X[:,j])
grad[0, j] = np.sum(term) / len(X)
return grad
import numpy.random
#洗牌
def shuffleData(data):
np.random.shuffle(data)
cols = data.shape[1]
X = data[:, 0:cols-1]
y = data[:, cols-1:]
return X, y
import time
def descent(data, theta, batchSize, stopType, thresh, alpha):
#梯度下降求解
init_time = time.time()
i = 0 # 迭代次数
k = 0 # batch
X, y = shuffleData(data)
grad = np.zeros(theta.shape) # 计算的梯度
costs = [cost(X, y, theta)] # 损失值
while True:
grad = gradient(X[k:k+batchSize], y[k:k+batchSize], theta) #有三种不同的梯度下降方法,所以X[k:k+batchSize],是动态的,根据传入数据,决定是什么梯度下降
k += batchSize #取batch数量个数据
if k >= n:
k = 0
X, y = shuffleData(data) #重新洗牌
theta = theta - alpha*grad #参数更新
costs.append(cost(X, y, theta)) # 计算新的损失
i += 1
if stopType == STOP_ITER: value = i
elif stopType == STOP_COST: value = costs
elif stopType == STOP_GRAD: value = grad
if stopCriterion(stopType, value, thresh): break
return theta, i-1, costs, grad, time.time() - init_time
def runExpe(data, theta, batchSize, stopType, thresh, alpha):
#执行梯度下降得到最终结果
theta, iter, costs, grad, dur = descent(data, theta, batchSize, stopType, thresh, alpha)
#根据传入参数,显示是什么类型的梯度下降以及停止策略
name = "Original" if (data[:,1]>2).sum() > 1 else "Scaled"
name += " data - learning rate: {} - ".format(alpha)
if batchSize==n: strDescType = "Gradient"
elif batchSize==1: strDescType = "Stochastic"
else: strDescType = "Mini-batch ({})".format(batchSize)
name += strDescType + " descent - Stop: "
if stopType == STOP_ITER: strStop = "{} iterations".format(thresh)
elif stopType == STOP_COST: strStop = "costs change < {}".format(thresh)
else: strStop = "gradient norm < {}".format(thresh)
name += strStop
#画图展示结果
print ("***{}\nTheta: {} - Iter: {} - Last cost: {:03.2f} - Duration: {:03.2f}s".format(
name, theta, iter, costs[-1], dur))
fig, ax = plt.subplots(figsize=(12,4))
ax.plot(np.arange(len(costs)), costs, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title(name.upper() + ' - Error vs. Iteration')
return theta
准备工作做完,下面比较不同的停止策略。
1.根据迭代次数停止
#选择的梯度下降方法是基于所有样本的
n=100
runExpe(orig_data, theta, n, STOP_ITER, thresh=5000, alpha=0.000001) #迭代5000次,学习率0.000001
2.根据损失值停止
runExpe(orig_data, theta, n, STOP_COST, thresh=0.000001, alpha=0.001)
迭代次数差不多需要110000次,所以同样对于全部样本,所需时间也多一些,但是cost值从0.63下降到0.37左右,结果比根据迭代次数停止效果好。
3.根据梯度变化停止
runExpe(orig_data, theta, n, STOP_GRAD, thresh=0.05, alpha=0.001)
迭代40000次结果,显然没有迭代110000次效果好,可以看到最影响结果变化的就是迭代次数,通常希望迭代次数能够大一些。
对比不同的梯度下降方法
1.随机梯度下降
runExpe(orig_data, theta, 1, STOP_ITER, thresh=5000, alpha=0.001)
损失值忽上忽下,很不稳定,再来试试把学习率调小一些。
runExpe(orig_data, theta, 1, STOP_ITER, thresh=15000, alpha=0.000002)
现在接近收敛,但是结果也不太理想。
2.小批量梯度下降
runExpe(orig_data, theta, 16, STOP_ITER, thresh=15000, alpha=0.001)
浮动仍然比较大,我们来尝试下对数据进行标准化将数据按其属性(按列进行)减去其均值,然后除以其方差。最后得到的结果是,对每个属性/每列来说所有数据都聚集在0附近,方差值为1。
from sklearn import preprocessing as pp
scaled_data = orig_data.copy()
scaled_data[:, 1:3] = pp.scale(orig_data[:, 1:3])
runExpe(scaled_data, theta, 16, STOP_GRAD, thresh=0.002*2, alpha=0.001)
效果比较好,原来不收敛,现在收敛,cost也降到了0.2147左右, 所以对数据做预处理是非常重要的,当数据出现浮动,先在数据层面入手,看下对数据做一系列处理后,能不能使结果变好,先改数据再改模型。
3.批量梯度下降
runExpe(scaled_data, theta, n, STOP_GRAD, thresh=0.02, alpha=0.001)
批量梯度下降需要更多的迭代次数,才能达到小批量梯度下降差不多相同的cost。综合比较用小批量梯度下降比较合适。
案例:交易数据异常检测
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
data = pd.read_csv("creditcard.csv") #导入信用卡数据
data.head() #显示前五行数据
v1,…v28代表28个特征,里面有一个class列,认为class等于0的为没有发生问题的,class等于1的为异常,因为这是一个银行信用卡数据,所以大部分应该都是正常的,通常可以把这份数据分为两部分,一部分是x,也就是输入的特征数,另一部分当做y,也可以是这份数据的label数据,那么这个class就是这个y,0是正常的,1是异常的。接下来对于这些特征用逻辑回归建模。
count_classes = pd.value_counts(data['Class'], sort = True).sort_index() #算class这一列有几个不同的属性
count_classes.plot(kind = 'bar') #画条形图
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
可以看到样本分布规则,正常样本非常多,异常样本很少。对于这种样本极度不规则的现象,有两种方案,过采样(对1号样本进行样本生成策略,生成20多万个数据跟0号样本一样多)和下采样(让0和1的数据一样少,在0里面选择和1一样少的数据再重新组合到一起)。下面采用下采样方案。
同时观察到Amount列数据浮动比较大,但是在做机器学习建模的时候,要保证特征之间分布差异是差不多的,机器学习算法可能会有误区,一些数值比较大的数据重要程度比较大,所以为了保证每个特征的重要程度是一样的,需要对Amount数据归一化或者标准化。
from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].reshape(-1, 1)) #转换Amount为-1到1附近
data = data.drop(['Time','Amount'],axis=1) 删除这两列
data.head()
特征数据已经处理完毕。下面采取下采样:
X = data.ix[:, data.columns != 'Class'] #特征数据,把所有样本都取进来,不包括‘class’列
y = data.ix[:, data.columns == 'Class'] #‘class列的所有数据’
number_records_fraud = len(data[data.Class == 1]) #Class == 1的样本个数
fraud_indices = np.array(data[data.Class == 1].index) #得到Class == 1.的index拿出来
normal_indices = data[data.Class == 0].index
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False) #在0样本中随机选择number_records_fraud个样本,使两种样本一样少
random_normal_indices = np.array(random_normal_indices)
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])
under_sample_data = data.iloc[under_sample_indices,:]
X_undersample = under_sample_data.ix[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.ix[:, under_sample_data.columns == 'Class']
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))
下采样后一共有984个数据,一半正样本,一半负样本。把样本不均衡的数据变成了均衡的数据。下面交叉验证(线性回归已经讲过原理)。
from sklearn.cross_validation import train_test_split
# 原始数据集切分,(测试需要用)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))
# 下采样数据集切分
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample
,y_undersample ,test_size = 0.3 ,random_state = 0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))
#Recall = TP/(TP+FN)
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report
def printing_Kfold_scores(x_train_data,y_train_data):
fold = KFold(len(y_train_data),5,shuffle=False) #交叉验证,把原始数据切分成五份
# Different C parameters
c_param_range = [0.01,0.1,1,10,100]
results_table = pd.DataFrame(index = range(len(c_param_range),2), columns = ['C_parameter','Mean recall score'])
results_table['C_parameter'] = c_param_range
# the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
j = 0
for c_param in c_param_range:
print('-------------------------------------------')
print('C parameter: ', c_param)
print('-------------------------------------------')
print('')
recall_accs = []
for iteration, indices in enumerate(fold,start=1):
# 建立逻辑回归模型,c_param是惩罚力度
lr = LogisticRegression(C = c_param, penalty = 'l1')
#用建立的lr模型.fit训练数据
lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
# 使用训练数据中的测试指标预测值
y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
recall_accs.append(recall_acc)
print('Iteration ', iteration,': recall score = ', recall_acc)
# 得到平均recall_accs
results_table.ix[j,'Mean recall score'] = np.mean(recall_accs)
j += 1
print('')
print('Mean recall score ', np.mean(recall_accs))
print('')
best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
print('*********************************************************************************')
print('Best model to choose from cross validation is with C parameter = ', best_c)
print('*********************************************************************************')
return best_c
best_c = printing_Kfold_scores(X_train,y_train)
对于每个参数,都有对应的五次交叉验证结果以及平均值,每次交叉验证的结果值都不一样,可能会相差10个百分点,第一个参数的平均recall值最高,所以把0.01当做默认参数。
def plot_confusion_matrix(cm, classes,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
绘制混淆矩阵
"""
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=0)
plt.yticks(tick_marks, classes)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
import itertools
lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)
# 计算混淆矩阵
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# 绘制混淆矩阵
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
上图为下采样数据下的混淆矩阵图。可以观察到真实是0的样本预测也是0有129个,真实是1预测也是1的样本有137个。下面还要在原始数据上测试。
lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)
# 计算混淆矩阵
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# 绘制混淆矩阵
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
Recall metric in the testing dataset: 0.918367346939
上图为在原始数据的混淆矩阵,有8581个样本来没有异常却被检测为异常,对于下采样的问题就出来了,虽然recall值能达到符合样本的趋势,但是误杀比较严重。
下面试试直接用原始数据集,不用上采样也不用下采样。
lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X_train,y_train.values.ravel())
y_pred_undersample = lr.predict(X_test.values)
cnf_matrix = confusion_matrix(y_test,y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
Recall metric in the testing dataset: 0.619047619048
可以看到recall值并不如下采样效果好,但是并没有那么多误杀了。
下面人为控制模型评估标准,设置不同阈值:
lr = LogisticRegression(C = 0.01, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
plt.figure(figsize=(10,10))
j = 1
for i in thresholds:
y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
plt.subplot(3,3,j)
j += 1
# 计算混淆矩阵
cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
class_names = [0,1]
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Threshold >= %s'%i)
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.986394557823
Recall metric in the testing dataset: 0.931972789116
Recall metric in the testing dataset: 0.884353741497
Recall metric in the testing dataset: 0.836734693878
Recall metric in the testing dataset: 0.748299319728
Recall metric in the testing dataset: 0.571428571429
观察到,随着阈值增加,recall值不断下降。当thresholds等于0.1时,所有的异常样本都检测出来了,但是所有的样本也都预测成了异常样本,虽然recall值很高,但是精度很低,所以评估不能从单方面评估。当thresholds大于等于0.4开始发生了变化,误杀减小了,检测不到的异常点也越来越多,当thresholds等于0.9,要求太严格了,没有误杀,但是有63个异常点没检测出来,所以可以根据实际需求选择thresholds值。(过采样方案就~~此处省略1000字,自己想象^^,过程类似)
(四)逻辑回归作业
1.建立一个逻辑回归模型来预测一个学生是否能进入大学。假设你是一所大学的行政管理人员,你想根据两门考试的结果,来决定每个申请人是否被录取。你有以前申请人的历史数据,可以将其用作逻辑回归训练集。对于每一个训练样本,你有申请人两次测评的分数以及录取的结果。
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head() #默认展示前五个
data.describe() #基本统计量
创建两个分数的散点图:
positive = data[data.admitted.isin(['1'])] # 1
negetive = data[data.admitted.isin(['0'])] # 0
fig, ax = plt.subplots(figsize=(6,5))
ax.scatter(positive['exam1'], positive['exam2'], c='b', label='Admitted')
ax.scatter(negetive['exam1'], negetive['exam2'], s=50, c='r', marker='x', label='Not Admitted')
# 设置图例显示在图的上方
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width , box.height* 0.8])
ax.legend(loc='center left', bbox_to_anchor=(0.2, 1.12),ncol=3)
# 设置横纵坐标名
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
plt.show()
def sigmoid(z):
return 1 / (1 + np.exp(- z))
x1 = np.arange(-10, 10, 0.1)
plt.plot(x1, sigmoid(x1), c='r')
plt.show() #画出 logistic回归的假设函数
def cost(theta, X, y):
first = (-y) * np.log(sigmoid(X @ theta))
second = (1 - y)*np.log(1 - sigmoid(X @ theta))
return np.mean(first - second)
# 添加一列,方便矩阵乘法计算
if 'Ones' not in data.columns:
data.insert(0, 'Ones', 1)
# 获取训练集数据
X = data.iloc[:, :-1].as_matrix() # Convert the frame to its Numpy-array representation.
y = data.iloc[:, -1].as_matrix() # Return is NOT a Numpy-matrix, rather, a Numpy-array.
theta = np.zeros(X.shape[1])
X.shape #检查矩阵的维度
(100, 3)
theta.shape
(3,)
y.shape
(100,)
cost(theta, X, y)
0.6931471805599453
def gradient(theta, X, y):
return (X.T @ (sigmoid(X @ theta) - y))/len(X)
gradient(theta, X, y) #批量梯度下降
array([ -0.1 , -12.00921659, -11.26284221])
import scipy.optimize as opt #行速度通常远远超过梯度下降
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
result
(array([-25.16131853, 0.20623159, 0.20147149]), 36, 0)
res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='TNC', jac=gradient)
res
fun: 0.203497701589475
jac: array([9.14875519e-09, 9.99037356e-08, 4.79345707e-07])
message: ‘Local minimum reached (|pg| ~= 0)’
nfev: 36
nit: 17
status: 0
success: True
x: array([-25.16131853, 0.20623159, 0.20147149])
cost(result[0], X, y)
0.203497701589475
def predict(theta, X):
probability = sigmoid(X@theta)
return [1 if x >= 0.5 else 0 for x in probability] # return a list
#学习好了参数θ后,用这个模型预测某个学生是否能被录取。
final_theta = result[0]
predictions = predict(final_theta, X)
correct = [1 if a==b else 0 for (a, b) in zip(predictions, y)]
accuracy = sum(correct) / len(X)
accuracy
0.89 #预测精度达到了89%
#用skearn中的方法来检验
from sklearn.metrics import classification_report
print(classification_report(predictions, y))
precision recall f1-score support
0 0.85 0.87 0.86 39
1 0.92 0.90 0.91 61
avg / total 0.89 0.89 0.89 100
x1 = np.arange(130, step=0.1)
x2 = -(final_theta[0] + x1*final_theta[1]) / final_theta[2]
#绘制决策边界
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(positive['exam1'], positive['exam2'], c='b', label='Admitted')
ax.scatter(negetive['exam1'], negetive['exam2'], s=50, c='r', marker='x', label='Not Admitted')
ax.plot(x1, x2)
ax.set_xlim(0, 130)
ax.set_ylim(0, 130)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_title('Decision Boundary')
plt.show()
2.设想你是工厂的生产主管,你有一些芯片在两次测试中的测试结果。对于这两次测试,你想决定是否芯片要被接受或抛弃。
data2 = pd.read_csv('ex2data2.txt', names=['Test 1', 'Test 2', 'Accepted'])
data2.head()
def plot_data():
positive = data2[data2['Accepted'].isin([1])]
negative = data2[data2['Accepted'].isin([0])]
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(positive['Test 1'], positive['Test 2'], s=50, c='b', marker='o', label='Accepted')
ax.scatter(negative['Test 1'], negative['Test 2'], s=50, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
plot_data()
def feature_mapping(x1, x2, power):
data = {}
for i in np.arange(power + 1):
for p in np.arange(i + 1):
data["f{}{}".format(i - p, p)] = np.power(x1, i - p) * np.power(x2, p)
# data = {"f{}{}".format(i - p, p): np.power(x1, i - p) * np.power(x2, p)
# for i in np.arange(power + 1)
# for p in np.arange(i + 1)
# }
return pd.DataFrame(data)
x1 = data2['Test 1'].as_matrix()
x2 = data2['Test 2'].as_matrix()
_data2 = feature_mapping(x1, x2, power=6)
_data2.head()
在这个高维特征向量上训练的logistic回归分类器会有一个更复杂的决策边界,所以在二维图中绘制时,会出现非线性。
虽然特征映射允许我们构建一个更有表现力的分类器,但它也更容易过拟合。接下来实现正则化的logistic回归来拟合数据,并且可以看到正则化如何帮助解决过拟合的问题。
# 这里因为做特征映射的时候已经添加了偏置项,所以不用手动添加了。
X = _data2.as_matrix()
y = data2['Accepted'].as_matrix()
theta = np.zeros(X.shape[1])
X.shape, y.shape, theta.shape
((118, 28), (118,), (28,))
def costReg(theta, X, y, l=1):
# 不惩罚第一项
_theta = theta[1: ]
reg = (l / (2 * len(X))) *(_theta @ _theta) # _theta@_theta == inner product
return cost(theta, X, y) + reg
costReg(theta, X, y, l=1)
0.6931471805599454
def gradientReg(theta, X, y, l=1):
reg = (1 / len(X)) * theta
reg[0] = 0
return gradient(theta, X, y) + reg
gradientReg(theta, X, y, 1)
array([8.47457627e-03, 1.87880932e-02, 7.77711864e-05, 5.03446395e-02,
1.15013308e-02, 3.76648474e-02, 1.83559872e-02, 7.32393391e-03,
8.19244468e-03, 2.34764889e-02, 3.93486234e-02, 2.23923907e-03,
1.28600503e-02, 3.09593720e-03, 3.93028171e-02, 1.99707467e-02,
4.32983232e-03, 3.38643902e-03, 5.83822078e-03, 4.47629067e-03,
3.10079849e-02, 3.10312442e-02, 1.09740238e-03, 6.31570797e-03,
4.08503006e-04, 7.26504316e-03, 1.37646175e-03, 3.87936363e-02])
result2 = opt.fmin_tnc(func=costReg, x0=theta, fprime=gradientReg, args=(X, y, 2))
result2
(array([ 1.02252845, 0.56283947, 1.13465378, -1.78530023, -0.66539149,
-1.01863396, 0.13957 , -0.29358944, -0.30102267, -0.08324513,
-1.27206004, -0.06137383, -0.53996526, -0.17881809, -0.9419886 ,
-0.14054855, -0.17736663, -0.0769737 , -0.22918951, -0.21349652,
-0.37205413, -0.86417673, 0.00890083, -0.26795962, -0.00362248,
-0.28315235, -0.07321588, -0.75992674]), 57, 4)
from sklearn import linear_model#调用sklearn的线性回归包
model = linear_model.LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y.ravel())
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=‘warn’,
n_jobs=None, penalty=‘l2’, random_state=None, solver=‘warn’,
tol=0.0001, verbose=0, warm_start=False)
model.score(X, y)
0.8305084745762712
评估:
final_theta = result2[0]
predictions = predict(final_theta, X)
correct = [1 if a==b else 0 for (a, b) in zip(predictions, y)]
accuracy = sum(correct) / len(correct)
accuracy
0.8050847457627118
print(classification_report(y, predictions))
precision recall f1-score support
0 0.85 0.75 0.80 60
1 0.77 0.86 0.81 58
micro avg 0.81 0.81 0.81 118
macro avg 0.81 0.81 0.80
weighted avg 0.81 0.81 0.80 118
绘制决策边界:
x = np.linspace(-1, 1.5, 250)
xx, yy = np.meshgrid(x, x)
z = feature_mapping(xx.ravel(), yy.ravel(), 6).as_matrix()
z = z @ final_theta
z = z.reshape(xx.shape)
plot_data()
plt.contour(xx, yy, z, 0)
plt.ylim(-.8, 1.2)