Machine Learning: Logistic Regression

(1) Derivation of the Logistic Regression Principle

We start from the binary classification problem. The two classes that the dependent variable may belong to are called the negative class and the positive class, so $y \in \{0,1\}$, where 0 denotes the negative class and 1 the positive class.

Logistic regression uses the logistic function as its activation function; the logistic function is the best-known member of the family of sigmoid (S-shaped) functions.

$g(z)=\frac{1}{1+e^{-z}}$

The input can be any real number, and the output lies in the interval (0, 1).

Interpretation: the sigmoid maps any input into (0, 1). Linear regression gives us a raw prediction value; feeding that value into the sigmoid converts it into a probability, which is what turns the regression output into a classification.
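A minimal sketch of this mapping in plain NumPy (the function name is mine, not from the implementation shown later):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# very negative scores map near 0, very positive scores near 1, and 0 maps to exactly 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approximately [4.5e-05, 0.5, 1.0]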

Prediction function (hypothesis)

$h_{\theta}(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$

where $\theta_{0}+\theta_{1}x_{1}+\dots+\theta_{n}x_{n}=\sum_{i=0}^{n}\theta_{i}x_{i}=\theta^{T}x$ (with $x_{0}=1$).

In logistic regression we predict:

when $h_{\theta}(x) \geq 0.5$, predict $y = 1$;
when $h_{\theta}(x) < 0.5$, predict $y = 0$.

From the S-shaped curve plotted above, we know that:
$z = 0$ gives $g(z) = 0.5$;
$z > 0$ gives $g(z) > 0.5$;
$z < 0$ gives $g(z) < 0.5$.

Since $z=\theta^{T}x$, it follows that:
when $\theta^{T}x \geq 0$, predict $y = 1$;
when $\theta^{T}x < 0$, predict $y = 0$.

Decision boundary (the boundary that separates data of different classes; the two sides of the boundary should contain data of different classes). Suppose now that we have the model $h_{\theta}(x)=g(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2})$,

and that the parameter vector $\theta$ is $[-3, 1, 1]$. Then when $-3+x_{1}+x_{2}\geq 0$, i.e. $x_{1}+x_{2}\geq 3$, the model predicts $y = 1$. We can draw the line $x_{1}+x_{2}=3$; this line is the model's decision boundary, separating the region predicted as 1 from the region predicted as 0.
(figure: the line $x_{1}+x_{2}=3$ separating the two predicted regions)

Suppose instead that our data is distributed like this:
(figure: data that requires a non-linear decision boundary)
Because a curve is needed to separate the $y=0$ region from the $y=1$ region, we need quadratic features: $h_{\theta}(x)=g(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{1}^{2}+\theta_{4}x_{2}^{2})$. If the parameter vector is $[-1, 0, 0, 1, 1]$, the resulting decision boundary is exactly the circle of radius 1 centered at the origin.
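A quick numeric check of the two boundaries above (a sketch with hand-picked points; the helper names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# linear boundary: theta = [-3, 1, 1], predict 1 when -3 + x1 + x2 >= 0
theta_line = np.array([-3.0, 1.0, 1.0])
print(sigmoid(theta_line @ np.array([1.0, 1.0, 1.0])) >= 0.5)  # False: below the line x1 + x2 = 3
print(sigmoid(theta_line @ np.array([1.0, 2.0, 2.0])) >= 0.5)  # True: above the line

# circular boundary: theta = [-1, 0, 0, 1, 1], predict 1 when -1 + x1^2 + x2^2 >= 0
theta_circle = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])
x1, x2 = 0.5, 0.5
features = np.array([1.0, x1, x2, x1**2, x2**2])
print(sigmoid(theta_circle @ features) >= 0.5)  # False: (0.5, 0.5) lies inside the unit circle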

In theory, as long as $h_{\theta}(x)$ is designed well enough, or more precisely as long as the $\theta^{T}x$ inside $g(\theta^{T}x)$ is complex enough, we can fit different decision boundaries for different situations and separate the different classes of samples.

Classification task

$P(y=1|x;\theta)=h_{\theta}(x)$

$P(y=0|x;\theta)=1-h_{\theta}(x)$

Combining the two cases:

$P(y|x;\theta)=h_{\theta}(x)^{y}\left(1-h_{\theta}(x)\right)^{1-y}$

Interpretation: for the binary task (labels 0 and 1), after combining:

  1. when y = 0, only the factor $(1-h_{\theta}(x))^{1-y} = 1-h_{\theta}(x)$ remains;
  2. when y = 1, only the factor $h_{\theta}(x)^{y} = h_{\theta}(x)$ remains.

Likelihood:
$L(\theta)=\prod_{i=1}^{m}P(y_{i}|x_{i};\theta)=\prod_{i=1}^{m}h_{\theta}(x_{i})^{y_{i}}\left(1-h_{\theta}(x_{i})\right)^{1-y_{i}}$

Log-likelihood:

$l(\theta)=\log L(\theta)=\sum_{i=1}^{m}\left(y_{i}\log h_{\theta}(x_{i})+(1-y_{i})\log(1-h_{\theta}(x_{i}))\right)$

Here we would maximize $l(\theta)$ with gradient ascent; introducing $J(\theta)=-\frac{1}{m}l(\theta)$ turns it into a gradient descent problem. Gradient descent, which adjusts the parameters $\theta$ so that the cost $J(\theta)$ reaches its minimum, is one of the most basic methods for this.

$\frac{\partial J(\theta)}{\partial\theta_{j}}=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)}$

Parameter update (with learning rate $\alpha$):
$\theta_{j}:=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_{j}^{(i)}$
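A minimal NumPy sketch of these formulas on toy data (names and data are mine; this is not the class implementation shown in the next section):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum((h - y) * x_j)
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# toy data: 6 one-feature samples with a bias column already added
X = np.array([[1., 0.], [1., 1.], [1., 1.5], [1., 2.], [1., 2.5], [1., 3.]])
y = np.array([0., 0., 1., 0., 1., 1.])
theta = np.zeros(2)
alpha = 0.1
for _ in range(1000):                 # plain batch gradient descent
    theta -= alpha * gradient(theta, X, y)
print(theta, cost(theta, X, y))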

(2) Logistic Regression in Code

The logistic regression algorithm: preprocess the data, optimize with gradient descent, compute the loss, and make predictions:

import numpy as np
from scipy.optimize import minimize
from utils.features import prepare_for_training
from utils.hypothesis import sigmoid

class LogisticRegression:
    #constructor
    def __init__(self,data,labels,polynomial_degree = 0,sinusoid_degree = 0,normalize_data=False):
        """
        1. Preprocess the data
        2. Get the number of features
        3. Initialize the parameter matrix
        """
        (data_processed,
         features_mean, 
         features_deviation)  = prepare_for_training(data, polynomial_degree, sinusoid_degree, normalize_data=normalize_data)
         
        self.data = data_processed
        self.labels = labels
        self.unique_labels = np.unique(labels)    #the distinct class labels
        self.features_mean = features_mean
        self.features_deviation = features_deviation
        self.polynomial_degree = polynomial_degree
        self.sinusoid_degree = sinusoid_degree
        self.normalize_data = normalize_data
        
        num_features = self.data.shape[1]
        num_unique_labels = np.unique(labels).shape[0]
        self.theta = np.zeros((num_unique_labels,num_features)) #one row of parameters per class, one column per feature

    #training
    def train(self,max_iterations=1000):
        cost_histories = []   #loss history of each binary classifier
        num_features = self.data.shape[1]   #number of features
        #train one binary (one-vs-rest) model per class
        for label_index,unique_label in enumerate(self.unique_labels):
            current_initial_theta = np.copy(self.theta[label_index].reshape(num_features,1))
            current_lables = (self.labels == unique_label).astype(float)   #1 if the label matches the current class, else 0
            (current_theta,cost_history) = LogisticRegression.gradient_descent(self.data,current_lables,current_initial_theta,max_iterations)  #gradient descent
            self.theta[label_index] = current_theta.T
            cost_histories.append(cost_history)
            
        return self.theta,cost_histories
            
    @staticmethod
    #gradient descent via scipy.optimize.minimize
    def gradient_descent(data,labels,current_initial_theta,max_iterations):
        cost_history = []
        num_features = data.shape[1]
        result = minimize(
            #objective to minimize:
            lambda current_theta:LogisticRegression.cost_function(data,labels,current_theta.reshape(num_features,1)),
            #initial parameters
            current_initial_theta,
            #optimization method
            method = 'CG',
            #gradient of the objective
            jac = lambda current_theta:LogisticRegression.gradient_step(data,labels,current_theta.reshape(num_features,1)),
            #record the loss after every iteration
            callback = lambda current_theta:cost_history.append(LogisticRegression.cost_function(data,labels,current_theta.reshape((num_features,1)))),
            #maximum number of iterations
            options={'maxiter': max_iterations}                                               
            )
        if not result.success:
            raise ArithmeticError('Can not minimize cost function'+result.message)
        optimized_theta = result.x.reshape(num_features,1)
        return optimized_theta,cost_history
        
    @staticmethod
    #compute the loss
    def cost_function(data,labels,theta):    
        num_examples = data.shape[0]   #number of examples
        predictions = LogisticRegression.hypothesis(data,theta)
        y_is_set_cost = np.dot(labels[labels == 1].T,np.log(predictions[labels == 1]))  #log-likelihood of the samples that belong to the current class
        y_is_not_set_cost = np.dot(1-labels[labels == 0].T,np.log(1-predictions[labels == 0]))   #log-likelihood of the samples that do not
        cost = (-1/num_examples)*(y_is_set_cost+y_is_not_set_cost)   #average negative log-likelihood
        return cost
    @staticmethod
    #hypothesis: sigmoid of the linear score
    def hypothesis(data,theta):
        
        predictions = sigmoid(np.dot(data,theta))
        
        return  predictions
    
    @staticmethod     
    def gradient_step(data,labels,theta):
        num_examples = labels.shape[0]
        predictions = LogisticRegression.hypothesis(data,theta)
        label_diff = predictions- labels
        gradients = (1/num_examples)*np.dot(data.T,label_diff)
        
        return gradients.T.flatten()
    
    def predict(self,data):
        num_examples = data.shape[0]
        data_processed = prepare_for_training(data, self.polynomial_degree, self.sinusoid_degree,self.normalize_data)[0]  #preprocess the data the same way as in training
        prob = LogisticRegression.hypothesis(data_processed,self.theta.T)   #predicted probability for each class
        max_prob_index = np.argmax(prob, axis=1)   #index of the highest probability for each sample
        class_prediction = np.empty(max_prob_index.shape,dtype=object)
        for index,label in enumerate(self.unique_labels):    #iterate over the classes
            class_prediction[max_prob_index == index] = label   #map the index back to the actual class label
        return class_prediction.reshape((num_examples,1))

Load the data and plot it:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from logistic_regression import LogisticRegression


data = pd.read_csv('../data/iris.csv')
iris_types = ['SETOSA','VERSICOLOR','VIRGINICA']   #the three iris species

x_axis = 'petal_length'
y_axis = 'petal_width'

for iris_type in iris_types:
    plt.scatter(data[x_axis][data['class']==iris_type],
                data[y_axis][data['class']==iris_type],
                label = iris_type
                )
plt.show()

Next, train the multi-class model:

num_examples = data.shape[0]
x_train = data[[x_axis,y_axis]].values.reshape((num_examples,2))
y_train = data['class'].values.reshape((num_examples,1))

max_iterations = 1000  #maximum number of iterations
polynomial_degree = 0
sinusoid_degree = 0


logistic_regression = LogisticRegression(x_train,y_train,polynomial_degree,sinusoid_degree)
thetas,cost_histories = logistic_regression.train(max_iterations)  #start training
labels = logistic_regression.unique_labels

plt.plot(range(len(cost_histories[0])),cost_histories[0],label = labels[0])
plt.plot(range(len(cost_histories[1])),cost_histories[1],label = labels[1])
plt.plot(range(len(cost_histories[2])),cost_histories[2],label = labels[2])
plt.show()

Each curve corresponds to one gradient descent run: the orange, green, and blue lines are the three one-vs-rest binary classifiers, and the plot shows how the loss of each one changes during training.

Plot the decision boundary

#predictions on the training set
y_train_prections = logistic_regression.predict(x_train)
#compute the accuracy
precision = np.sum(y_train_prections == y_train)/y_train.shape[0] * 100
print ('precision:',precision)

x_min = np.min(x_train[:,0])
x_max = np.max(x_train[:,0])
y_min = np.min(x_train[:,1])
y_max = np.max(x_train[:,1])
samples= 150  #number of grid points per axis

#build the grid coordinates
X = np.linspace(x_min,x_max,samples)
Y = np.linspace(y_min,y_max,samples)

Z_SETOSA = np.zeros((samples,samples))  #initialize a 150 x 150 grid
Z_VERSICOLOR = np.zeros((samples,samples))
Z_VIRGINICA = np.zeros((samples,samples))

for x_index,x in enumerate(X):
    for y_index,y in enumerate(Y):
        data = np.array([[x,y]])
        prediction = logistic_regression.predict(data)[0][0]
        if prediction == 'SETOSA':
            Z_SETOSA[x_index][y_index] =1
        elif prediction == 'VERSICOLOR':
            Z_VERSICOLOR[x_index][y_index] =1
        elif prediction == 'VIRGINICA':
            Z_VIRGINICA[x_index][y_index] =1

#plot each class separately
for iris_type in iris_types:
    plt.scatter(
        x_train[(y_train == iris_type).flatten(),0],
        x_train[(y_train == iris_type).flatten(),1],
        label = iris_type
                )

plt.contour(X,Y,Z_SETOSA)
plt.contour(X,Y,Z_VERSICOLOR)
plt.contour(X,Y,Z_VIRGINICA)
plt.show()

An example with a non-linear decision boundary:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

from logistic_regression import LogisticRegression

data = pd.read_csv('../data/microchips-tests.csv')

# class labels
validities = [0, 1]

# choose two features
x_axis = 'param_1'
y_axis = 'param_2'

# scatter plot
for validity in validities:
    plt.scatter(
        data[x_axis][data['validity'] == validity],
        data[y_axis][data['validity'] == validity],
        label=validity
    )
  
plt.xlabel(x_axis)
plt.ylabel(y_axis)
plt.title('Microchips Tests')
plt.legend()
plt.show()


num_examples = data.shape[0]
x_train = data[[x_axis, y_axis]].values.reshape((num_examples, 2))
y_train = data['validity'].values.reshape((num_examples, 1))

# training parameters
max_iterations = 100000  
regularization_param = 0  
polynomial_degree = 5  
sinusoid_degree = 0  
# logistic regression model
logistic_regression = LogisticRegression(x_train, y_train, polynomial_degree, sinusoid_degree)

# train
(thetas, costs) = logistic_regression.train(max_iterations)

columns = []
for theta_index in range(0, thetas.shape[1]):
    columns.append('Theta ' + str(theta_index));

# training results
labels = logistic_regression.unique_labels

plt.plot(range(len(costs[0])), costs[0], label=labels[0])
plt.plot(range(len(costs[1])), costs[1], label=labels[1])

plt.xlabel('Gradient Steps')
plt.ylabel('Cost')
plt.legend()
plt.show()

# predictions
y_train_predictions = logistic_regression.predict(x_train)

# accuracy
precision = np.sum(y_train_predictions == y_train) / y_train.shape[0] * 100

print('Training Precision: {:5.4f}%'.format(precision))


num_examples = x_train.shape[0]
samples = 150
x_min = np.min(x_train[:, 0])
x_max = np.max(x_train[:, 0])

y_min = np.min(x_train[:, 1])
y_max = np.max(x_train[:, 1])

X = np.linspace(x_min, x_max, samples)
Y = np.linspace(y_min, y_max, samples)
Z = np.zeros((samples, samples))

# visualize the predictions on a grid
for x_index, x in enumerate(X):
    for y_index, y in enumerate(Y):
        data = np.array([[x, y]])
        Z[x_index][y_index] = logistic_regression.predict(data)[0][0]

positives = (y_train == 1).flatten()
negatives = (y_train == 0).flatten()

plt.scatter(x_train[negatives, 0], x_train[negatives, 1], label='0')
plt.scatter(x_train[positives, 0], x_train[positives, 1], label='1')

plt.contour(X, Y, Z)

plt.xlabel('param_1')
plt.ylabel('param_2')
plt.title('Microchips Tests')
plt.legend()

plt.show()

As we can see, logistic regression can handle not only linearly separable problems but also non-linear ones.

(3) Logistic Regression: Experimental Analysis

Linear regression gives a concrete numeric prediction; logistic regression maps that value through the sigmoid function and completes the classification task by comparing probability values.

Plot the sigmoid function

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

t = np.linspace(-10, 10, 100)
sig = 1 / (1 + np.exp(-t))
plt.figure(figsize=(9, 3))
plt.plot([-10, 10], [0, 0], "k-")
plt.plot([-10, 10], [0.5, 0.5], "k:")
plt.plot([-10, 10], [1, 1], "k:")
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \frac{1}{1 + e^{-t}}$")
plt.xlabel("t")
plt.legend(loc="upper left", fontsize=20)
plt.axis([-10, 10, -0.1, 1.1])
plt.title('Logistic function')
plt.show()
from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())  #inspect the attributes of the dataset

[‘data’, ‘target’, ‘target_names’, ‘DESCR’, ‘feature_names’, ‘filename’]

X = iris['data'][:,3:]
y = (iris['target'] == 2).astype(np.int)     #label Iris-Virginica as 1 and everything else as 0
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

from sklearn.linear_model import LogisticRegression
log_res = LogisticRegression()
log_res.fit(X,y)    #fit the model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=‘warn’,
n_jobs=None, penalty=‘l2’, random_state=None, solver=‘warn’,
tol=0.0001, verbose=0, warm_start=False)

#test data
X_new = np.linspace(0,3,1000).reshape(-1,1)  #1000 evenly spaced values between 0 and 3, reshaped into a single feature column
y_proba = log_res.predict_proba(X_new)   #predicted probabilities from the fitted model
y_proba

array([[0.98554411, 0.01445589],
[0.98543168, 0.01456832],
[0.98531838, 0.01468162],
…,
[0.02618938, 0.97381062],
[0.02598963, 0.97401037],
[0.02579136, 0.97420864]])

Each row gives two probabilities: the left column is the probability of not belonging to the class (label 0) and the right column is the probability of belonging to it (label 1). As the input feature changes, the probabilities change accordingly.
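A quick check of the column order (the columns of predict_proba follow log_res.classes_):

print(log_res.classes_)               # [0 1]: column 0 is P(not Virginica), column 1 is P(Virginica)
print(y_proba[0], y_proba[0].sum())   # each row of probabilities sums to 1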

plt.figure(figsize=(12,5))
decision_boundary = X_new[y_proba[:,1]>=0.5][0]
plt.plot([decision_boundary,decision_boundary],[-1,2],'k:',linewidth = 2)   #vertical line at the decision boundary
plt.plot(X_new,y_proba[:,1],'g-',label = 'Iris-Virginica')
plt.plot(X_new,y_proba[:,0],'b--',label = 'Not Iris-Virginica')
plt.arrow(decision_boundary,0.08,-0.3,0,head_width = 0.05,head_length=0.1,fc='b',ec='b')   #blue arrow
plt.arrow(decision_boundary,0.92,0.3,0,head_width = 0.05,head_length=0.1,fc='g',ec='g')   #green arrow
plt.text(decision_boundary+0.02,0.15,'Decision Boundary',fontsize = 16,color = 'k',ha='center')   #boundary label
plt.xlabel('Peta width(cm)',fontsize = 16)
plt.ylabel('y_proba',fontsize = 16)
plt.axis([0,3,-0.02,1.02])
plt.legend(loc = 'center left',fontsize = 16)

As the petal width increases, the probability that the flower is Iris-Virginica rises and the probability that it is not falls; around a petal width of 1.6 cm it is hard to tell which class it is.
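That boundary can also be read off the fitted parameters; a small sketch (for a single feature, the predicted probability crosses 0.5 where θ0 + θ1·x = 0):

boundary = -log_res.intercept_[0] / log_res.coef_[0][0]
print(boundary)                          # roughly 1.6 (cm of petal width)
print(log_res.predict([[1.5], [1.7]]))   # [0 1]: just below vs. just above the boundary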

X = iris['data'][:,(2,3)]  #take petal length and petal width (feature columns 2 and 3)
y = (iris['target']==2).astype(np.int)   #Iris-Virginica as 1, other classes as 0
from sklearn.linear_model import LogisticRegression
log_res = LogisticRegression(C = 10000)  #build the model; the larger C is, the weaker the regularization penalty
log_res.fit(X,y)    #fit

LogisticRegression(C=10000, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=‘warn’,
n_jobs=None, penalty=‘l2’, random_state=None, solver=‘warn’,
tol=0.0001, verbose=0, warm_start=False)

X[:,0].min(),X[:,0].max()   #min and max of the first feature

(1.0, 6.9)

X[:,1].min(),X[:,1].max()

(0.1, 2.5)

Build the grid of coordinates:

x0,x1 = np.meshgrid(np.linspace(2.9,7,500).reshape(-1,1),np.linspace(0.8,2.7,200).reshape(-1,1))

Stack the grid points to obtain all the test input coordinates:

X_new = np.c_[x0.ravel(),x1.ravel()]
X_new   #test data

array([[2.9 , 0.8 ],
[2.90821643, 0.8 ],
[2.91643287, 0.8 ],
…,
[6.98356713, 2.7 ],
[6.99178357, 2.7 ],
[7. , 2.7 ]])

X_new.shape

(100000, 2)

Predict to get the probability at every grid point:

y_proba = log_res.predict_proba(X_new)  #probabilities from the trained model
x0.shape

(200, 500)

x1.shape

(200, 500)

Draw contour lines to show the decision boundary:

plt.figure(figsize=(10,4))
plt.plot(X[y==0,0],X[y==0,1],'bs')  #class-0 points as blue squares
plt.plot(X[y==1,0],X[y==1,1],'g^')  #class-1 points as green triangles

zz = y_proba[:,1].reshape(x0.shape)   #probability of belonging to class 1
contour = plt.contour(x0,x1,zz,cmap=plt.cm.brg)  #contour lines of the probability
plt.clabel(contour,inline = 1)  
plt.axis([2.9,7,0.8,2.7])
plt.text(3.5,1.5,'NOT Vir',fontsize = 16,color = 'b')
plt.text(6.5,2.3,'Vir',fontsize = 16,color = 'g')

zz holds the probability of belonging to class 1 (the green triangles): the closer a point is to the green triangles, the larger the contour value, and moving down the contours the probability of the green class decreases.

Multi-class classification:

Softmax probabilities:
$\hat{P}_{k}=\sigma(s(x))_{k}=\frac{\exp(s_{k}(x))}{\sum_{j=1}^{K}\exp(s_{j}(x))}$

Loss function (cross-entropy):
$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y_{k}^{(i)}\log\left(\hat{P}_{k}^{(i)}\right)$
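A minimal NumPy sketch of these two formulas (the class scores s(x) are taken as given; the names are mine):

import numpy as np

def softmax(scores):
    # subtract the row max for numerical stability; rows are samples, columns are classes
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y_onehot):
    # J = -(1/m) * sum_i sum_k y_k * log(p_k)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

scores = np.array([[2.0, 1.0, 0.1]])          # one sample, three class scores
probs = softmax(scores)
print(probs)                                   # approximately [[0.66, 0.24, 0.10]]
print(cross_entropy(probs, np.array([[1, 0, 0]])))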

X = iris['data'][:,(2,3)]
y = iris['target']
softmax_reg = LogisticRegression(multi_class = 'multinomial',solver='lbfgs')    #multinomial (softmax) logistic regression
softmax_reg.fit(X,y)     #fit

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=‘multinomial’,
n_jobs=None, penalty=‘l2’, random_state=None, solver=‘lbfgs’,
tol=0.0001, verbose=0, warm_start=False)

softmax_reg.predict([[5,2]])

array([2]) # belongs to the third class

softmax_reg.predict_proba([[5,2]])

array([[2.43559894e-04, 2.14859516e-01, 7.84896924e-01]]) # probabilities of the three classes

x0, x1 = np.meshgrid(
        np.linspace(0, 8, 500).reshape(-1, 1),
        np.linspace(0, 3.5, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]


y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris-Virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris-Versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris-Setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()

Compare the effect of three different stopping criteria on logistic regression:

STOP_ITER = 0   #stop after a fixed number of iterations
STOP_COST = 1   #stop when the change in the loss is small
STOP_GRAD = 2   #stop when the gradient norm is small

def stopCriterion(type, value, threshold):
    #the three stopping criteria
    if type == STOP_ITER:        return value > threshold
    elif type == STOP_COST:      return abs(value[-1]-value[-2]) < threshold
    elif type == STOP_GRAD:      return np.linalg.norm(value) < threshold

#compute the gradient
def gradient(X, y, theta):
    grad = np.zeros(theta.shape)
    error = (model(X, theta)- y).ravel()
    for j in range(len(theta.ravel())): #for each parameter
        term = np.multiply(error, X[:,j])
        grad[0, j] = np.sum(term) / len(X)
    
    return grad
import numpy.random
#shuffle the data
def shuffleData(data):
    np.random.shuffle(data)
    cols = data.shape[1]
    X = data[:, 0:cols-1]
    y = data[:, cols-1:]
    return X, y
import time

def descent(data, theta, batchSize, stopType, thresh, alpha):
    #gradient descent solver
    
    init_time = time.time()
    i = 0 # iteration counter
    k = 0 # position within the current batch
    X, y = shuffleData(data)
    grad = np.zeros(theta.shape) # gradient
    costs = [cost(X, y, theta)] # loss history

    
    while True:
        grad = gradient(X[k:k+batchSize], y[k:k+batchSize], theta)      #X[k:k+batchSize] makes this generic: batchSize determines whether it is full-batch, stochastic, or mini-batch gradient descent
        k += batchSize #move on to the next batch
        if k >= n: 
            k = 0 
            X, y = shuffleData(data) #reshuffle
        theta = theta - alpha*grad  #parameter update
        costs.append(cost(X, y, theta)) # compute the new loss
        i += 1 

        if stopType == STOP_ITER:       value = i
        elif stopType == STOP_COST:     value = costs
        elif stopType == STOP_GRAD:     value = grad
        if stopCriterion(stopType, value, thresh): break
    
    return theta, i-1, costs, grad, time.time() - init_time
def runExpe(data, theta, batchSize, stopType, thresh, alpha):
    #run gradient descent and report the result
    theta, iter, costs, grad, dur = descent(data, theta, batchSize, stopType, thresh, alpha)
    #build a descriptive name from the arguments: which gradient descent variant and which stopping criterion
    name = "Original" if (data[:,1]>2).sum() > 1 else "Scaled"
    name += " data - learning rate: {} - ".format(alpha)
    if batchSize==n: strDescType = "Gradient"
    elif batchSize==1:  strDescType = "Stochastic"
    else: strDescType = "Mini-batch ({})".format(batchSize)
    name += strDescType + " descent - Stop: "
    if stopType == STOP_ITER: strStop = "{} iterations".format(thresh)
    elif stopType == STOP_COST: strStop = "costs change < {}".format(thresh)
    else: strStop = "gradient norm < {}".format(thresh)
    name += strStop
    #plot the loss curve
    print ("***{}\nTheta: {} - Iter: {} - Last cost: {:03.2f} - Duration: {:03.2f}s".format(
        name, theta, iter, costs[-1], dur))
    fig, ax = plt.subplots(figsize=(12,4))
    ax.plot(np.arange(len(costs)), costs, 'r')
    ax.set_xlabel('Iterations')
    ax.set_ylabel('Cost')
    ax.set_title(name.upper() + ' - Error vs. Iteration')
    return theta

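The snippets above rely on a few objects defined earlier in the original notebook but not shown here: the prediction function model, the loss cost, the design matrix orig_data, and the initial theta. A plausible minimal version, assuming a data file of two exam scores and an admission label (the file name and column names are hypothetical):

import numpy as np
import pandas as pd

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(X, theta):
    # hypothesis h_theta(x) = g(X theta^T); theta is a 1 x n row vector here
    return sigmoid(np.dot(X, theta.T))

def cost(X, y, theta):
    # average negative log-likelihood
    left = np.multiply(-y, np.log(model(X, theta)))
    right = np.multiply(1 - y, np.log(1 - model(X, theta)))
    return np.sum(left - right) / len(X)

# hypothetical data loading: two exam scores, an admission label, plus a bias column
pdData = pd.read_csv('LogiReg_data.txt', header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
pdData.insert(0, 'Ones', 1)
orig_data = pdData.values
theta = np.zeros([1, 3])
n = orig_data.shape[0]   # dataset size, used as the full-batch size below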
With the preparation done, let's compare the different stopping criteria.
1. Stop after a fixed number of iterations

#full-batch gradient descent (use all samples)
n=100
runExpe(orig_data, theta, n, STOP_ITER, thresh=5000, alpha=0.000001)   #5000 iterations, learning rate 0.000001

2. Stop when the change in the loss is small

runExpe(orig_data, theta, n, STOP_COST, thresh=0.000001, alpha=0.001)

This takes roughly 110,000 iterations, so with the full batch it also takes more time, but the cost drops from about 0.63 to about 0.37, a better result than stopping after a fixed iteration count.

3. Stop when the gradient becomes small

runExpe(orig_data, theta, n, STOP_GRAD, thresh=0.05, alpha=0.001)

This stops after about 40,000 iterations, clearly not as good as the 110,000-iteration run. The number of iterations has the biggest influence on the result, so we usually want it to be fairly large.

Comparing different gradient descent variants

1. Stochastic gradient descent

runExpe(orig_data, theta, 1, STOP_ITER, thresh=5000, alpha=0.001)

The loss jumps up and down and is very unstable; let's try a smaller learning rate.

runExpe(orig_data, theta, 1, STOP_ITER, thresh=15000, alpha=0.000002)

It is now close to converging, but the result is still not ideal.

2. Mini-batch gradient descent

runExpe(orig_data, theta, 16, STOP_ITER, thresh=15000, alpha=0.001)

The loss still fluctuates quite a bit. Let's try standardizing the data: for each feature (column), subtract its mean and divide by its standard deviation, so that every column is centered around 0 with unit variance.

from sklearn import preprocessing as pp

scaled_data = orig_data.copy()
scaled_data[:, 1:3] = pp.scale(orig_data[:, 1:3])
runExpe(scaled_data, theta, 16, STOP_GRAD, thresh=0.002*2, alpha=0.001)

This works much better: what previously did not converge now converges, and the cost drops to around 0.2147. Preprocessing the data is very important. When the loss fluctuates, start from the data side: see whether a round of preprocessing improves the result, and only change the model afterwards.

3. Batch gradient descent

runExpe(scaled_data, theta, n, STOP_GRAD, thresh=0.02, alpha=0.001)

Batch gradient descent needs more iterations to reach roughly the same cost as mini-batch gradient descent. Overall, mini-batch gradient descent is the most suitable choice.

Case study: detecting anomalous credit card transactions

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
data = pd.read_csv("creditcard.csv")  #导入信用卡数据
data.head()  #显示前五行数据

Columns V1, …, V28 are 28 features, and there is also a Class column: Class = 0 means the transaction is normal and Class = 1 means it is anomalous. Since this is bank credit card data, the vast majority of transactions should be normal. The data can be split into two parts: the features, which become X, and the label, which becomes y; here the Class column is that label, with 0 normal and 1 anomalous. We then build a logistic regression model on these features.

count_classes = pd.value_counts(data['Class'], sort = True).sort_index()   #count how many samples fall into each Class value
count_classes.plot(kind = 'bar')    #bar chart of the class counts
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")

The class distribution is extremely skewed: there are very many normal samples and very few anomalous ones. For such imbalanced data there are two common strategies: oversampling (generate synthetic class-1 samples until there are as many as the 200,000+ class-0 samples) and undersampling (make classes 0 and 1 equally small by picking from class 0 only as many records as there are in class 1 and combining them). We use undersampling below.

We also notice that the Amount column varies over a wide range. When building machine learning models we want the features to be on comparable scales, otherwise the algorithm may mistakenly treat features with large numeric values as more important. To give every feature equal footing, the Amount column needs to be normalized or standardized.

from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))       #standardize Amount so its values are centered around 0
data = data.drop(['Time','Amount'],axis=1)  #drop the original Time and Amount columns
data.head()

The feature data is ready. Now perform the undersampling:

X = data.ix[:, data.columns != 'Class']    #feature columns: every column except 'Class'
y = data.ix[:, data.columns == 'Class']   #the 'Class' column (the label)

number_records_fraud = len(data[data.Class == 1]) #number of samples with Class == 1
fraud_indices = np.array(data[data.Class == 1].index)    #indices of the Class == 1 samples

normal_indices = data[data.Class == 0].index

random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)   #randomly pick number_records_fraud samples from class 0 so that both classes become equally small
random_normal_indices = np.array(random_normal_indices)

under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

under_sample_data = data.iloc[under_sample_indices,:]

X_undersample = under_sample_data.ix[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.ix[:, under_sample_data.columns == 'Class']

print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

After undersampling there are 984 records in total, half normal and half fraudulent: the imbalanced data has become balanced. Next we split the data into training and test sets (cross-validation was already covered in the linear regression section).

from sklearn.cross_validation import train_test_split

# split the original dataset (its test part is needed later)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# split the undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample,
                                                                                                     y_undersample,
                                                                                                     test_size = 0.3,
                                                                                                     random_state = 0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))
#Recall = TP/(TP+FN)
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score   
from sklearn.metrics import confusion_matrix,recall_score,classification_report 
def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(len(y_train_data),5,shuffle=False)   #5-fold cross-validation on the training data

    # Different C parameters
    c_param_range = [0.01,0.1,1,10,100]

    results_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter','Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold,start=1):

            # build the logistic regression model; c_param is the inverse regularization strength
            lr = LogisticRegression(C = c_param, penalty = 'l1')

            # fit on the training folds
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())

            # predict on the held-out validation fold
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)

         
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration,': recall score = ', recall_acc)

        # average recall over the five folds
        results_table.ix[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']  # cast to float so idxmax works on the object-dtype column
    
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    
    return best_c
best_c = printing_Kfold_scores(X_train,y_train)

For each candidate C value we get five cross-validation recall scores and their mean; the individual folds can differ quite a bit, sometimes by ten percentage points. The first value has the highest mean recall, so C = 0.01 is taken as the parameter to use.

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
import itertools
lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# compute the confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

The figure above is the confusion matrix on the undersampled test set: 129 samples that are truly 0 are also predicted as 0, and 137 samples that are truly 1 are predicted as 1. We still need to test on the original data.

lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# compute the confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# plot the confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Recall metric in the testing dataset: 0.918367346939

The figure above is the confusion matrix on the original test data. There are 8,581 samples that are not anomalous but are flagged as anomalous, which exposes the weakness of undersampling: the recall matches what we want, but the number of false positives is severe.
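To quantify those false alarms, precision can be read from the same confusion matrix (a small sketch, reusing cnf_matrix from above):

# precision = TP / (TP + FP): of everything flagged as anomalous, how much really is anomalous
tp = cnf_matrix[1, 1]
fp = cnf_matrix[0, 1]
print("Precision metric in the testing dataset: ", tp / (tp + fp))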

Next, let's try using the original dataset directly, with neither oversampling nor undersampling.

lr = LogisticRegression(C = best_c, penalty = 'l1')
lr.fit(X_train,y_train.values.ravel())
y_pred_undersample = lr.predict(X_test.values)

cnf_matrix = confusion_matrix(y_test,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Recall metric in the testing dataset: 0.619047619048

The recall is not as good as with undersampling, but there are far fewer false positives.

Now we take control of the decision rule ourselves and try different classification thresholds:

lr = LogisticRegression(C = 0.01, penalty = 'l1')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    
    plt.subplot(3,3,j)
    j += 1
    
    # compute the confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold >= %s'%i) 

Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.986394557823
Recall metric in the testing dataset: 0.931972789116
Recall metric in the testing dataset: 0.884353741497
Recall metric in the testing dataset: 0.836734693878
Recall metric in the testing dataset: 0.748299319728
Recall metric in the testing dataset: 0.571428571429

As the threshold increases, recall keeps dropping. With a threshold of 0.1 every anomalous sample is caught, but almost every sample is also predicted as anomalous: recall is very high while precision is very low, so a model should not be judged on a single metric. From a threshold of 0.4 onward the behavior changes: false positives decrease, but more and more anomalies go undetected. At a threshold of 0.9 the rule is too strict: there are no false positives, but 63 anomalies are missed. The threshold can therefore be chosen according to the actual business requirement. (The oversampling variant is omitted here; the process is similar.)
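A sketch of checking precision alongside recall for each threshold (reusing thresholds, y_pred_undersample_proba, and y_test_undersample from the code above):

from sklearn.metrics import precision_score, recall_score

y_true = y_test_undersample.values.ravel()
for t in thresholds:
    y_pred_t = (y_pred_undersample_proba[:, 1] > t).astype(int)
    print('threshold %.1f: recall %.3f, precision %.3f'
          % (t, recall_score(y_true, y_pred_t), precision_score(y_true, y_pred_t)))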

(4) Logistic Regression Exercises

1. Build a logistic regression model to predict whether a student gets admitted to a university. Suppose you are a university administrator and want to decide each applicant's admission based on the results of two exams. You have historical data from previous applicants that can be used as a training set; for each training example you have the applicant's scores on the two exams and the admission outcome.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head()  #show the first five rows by default
data.describe()   #basic summary statistics

Create a scatter plot of the two exam scores:

positive = data[data.admitted.isin(['1'])]  # 1
negetive = data[data.admitted.isin(['0'])]  # 0

fig, ax = plt.subplots(figsize=(6,5))
ax.scatter(positive['exam1'], positive['exam2'], c='b', label='Admitted')
ax.scatter(negetive['exam1'], negetive['exam2'], s=50, c='r', marker='x', label='Not Admitted')
# place the legend above the plot
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width , box.height* 0.8])
ax.legend(loc='center left', bbox_to_anchor=(0.2, 1.12),ncol=3)
# axis labels
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
plt.show()
def sigmoid(z):
    return 1 / (1 + np.exp(- z))
x1 = np.arange(-10, 10, 0.1)
plt.plot(x1, sigmoid(x1), c='r')
plt.show()   #plot the sigmoid used as the logistic regression hypothesis
def cost(theta, X, y):
    first = (-y) * np.log(sigmoid(X @ theta))
    second = (1 - y)*np.log(1 - sigmoid(X @ theta))
    return np.mean(first - second)

# add a column of ones so the intercept can be handled by matrix multiplication
if 'Ones' not in data.columns:
    data.insert(0, 'Ones', 1)

# build the training matrices
X = data.iloc[:, :-1].as_matrix()  # Convert the frame to its Numpy-array representation.
y = data.iloc[:, -1].as_matrix()  # Return is NOT a Numpy-matrix, rather, a Numpy-array.

theta = np.zeros(X.shape[1])
X.shape    #check the matrix dimensions

(100, 3)

theta.shape

(3,)

y.shape

(100,)

cost(theta, X, y)

0.6931471805599453

def gradient(theta, X, y):
    return (X.T @ (sigmoid(X @ theta) - y))/len(X)  
gradient(theta, X, y)      #one batch gradient evaluation at the initial theta

array([ -0.1 , -12.00921659, -11.26284221])

import scipy.optimize as opt   #these optimizers usually run far faster than hand-written gradient descent
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
result

(array([-25.16131853, 0.20623159, 0.20147149]), 36, 0)

res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='TNC', jac=gradient)
res

fun: 0.203497701589475
jac: array([9.14875519e-09, 9.99037356e-08, 4.79345707e-07])
message: ‘Local minimum reached (|pg| ~= 0)’
nfev: 36
nit: 17
status: 0
success: True
x: array([-25.16131853, 0.20623159, 0.20147149])

cost(result[0], X, y)

0.203497701589475

def predict(theta, X):
    probability = sigmoid(X@theta)
    return [1 if x >= 0.5 else 0 for x in probability]  # return a list
#with the learned parameters θ, use the model to predict whether a student is admitted
final_theta = result[0]
predictions = predict(final_theta, X)
correct = [1 if a==b else 0 for (a, b) in zip(predictions, y)]
accuracy = sum(correct) / len(X)
accuracy

0.89  # the prediction accuracy reaches 89%

#verify with sklearn's classification report
from sklearn.metrics import classification_report
print(classification_report(predictions, y))

             precision    recall  f1-score   support

          0       0.85      0.87      0.86        39
          1       0.92      0.90      0.91        61

avg / total       0.89      0.89      0.89       100

x1 = np.arange(130, step=0.1)
x2 = -(final_theta[0] + x1*final_theta[1]) / final_theta[2]
#plot the decision boundary
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(positive['exam1'], positive['exam2'], c='b', label='Admitted')
ax.scatter(negetive['exam1'], negetive['exam2'], s=50, c='r', marker='x', label='Not Admitted')
ax.plot(x1, x2)
ax.set_xlim(0, 130)
ax.set_ylim(0, 130)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_title('Decision Boundary')
plt.show()

2. Suppose you are a production manager at a factory and you have the results of two tests performed on a batch of microchips. Based on these two tests, you want to decide whether each chip should be accepted or rejected.

data2 = pd.read_csv('ex2data2.txt', names=['Test 1', 'Test 2', 'Accepted'])
data2.head()
def plot_data():
    positive = data2[data2['Accepted'].isin([1])]
    negative = data2[data2['Accepted'].isin([0])]

    fig, ax = plt.subplots(figsize=(8,5))
    ax.scatter(positive['Test 1'], positive['Test 2'], s=50, c='b', marker='o', label='Accepted')
    ax.scatter(negative['Test 1'], negative['Test 2'], s=50, c='r', marker='x', label='Rejected')
    ax.legend()
    ax.set_xlabel('Test 1 Score')
    ax.set_ylabel('Test 2 Score')
    
plot_data()	
def feature_mapping(x1, x2, power):
    data = {}
    for i in np.arange(power + 1):
        for p in np.arange(i + 1):
            data["f{}{}".format(i - p, p)] = np.power(x1, i - p) * np.power(x2, p)

#     data = {"f{}{}".format(i - p, p): np.power(x1, i - p) * np.power(x2, p)
#                 for i in np.arange(power + 1)
#                 for p in np.arange(i + 1)
#             }
    return pd.DataFrame(data)
x1 = data2['Test 1'].as_matrix()
x2 = data2['Test 2'].as_matrix()
_data2 = feature_mapping(x1, x2, power=6)
_data2.head()

A logistic regression classifier trained on this higher-dimensional feature vector has a more complex decision boundary, which appears non-linear when drawn in the original two-dimensional plot.

Although feature mapping allows us to build a more expressive classifier, it is also more prone to overfitting. Next we implement regularized logistic regression to fit the data and see how regularization helps with the overfitting problem.
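For reference, the regularized cost implemented as costReg below adds an L2 penalty that leaves the bias term $\theta_0$ unpenalized ($\lambda$ is the regularization strength, the argument l in the code):

$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log\left(1-h_{\theta}(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}$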

# the feature mapping already produced the bias column (f00 = 1), so there is no need to add it manually
X = _data2.as_matrix()  
y = data2['Accepted'].as_matrix()
theta = np.zeros(X.shape[1])
X.shape, y.shape, theta.shape 

((118, 28), (118,), (28,))

def costReg(theta, X, y, l=1):
    # do not penalize the bias term theta_0
    _theta = theta[1: ]
    reg = (l / (2 * len(X))) *(_theta @ _theta)  # _theta@_theta == inner product
    
    return cost(theta, X, y) + reg
costReg(theta, X, y, l=1) 

0.6931471805599454

def gradientReg(theta, X, y, l=1):
    reg = (l / len(X)) * theta   # regularization term (lambda/m) * theta
    reg[0] = 0  # the bias term is not regularized
    return gradient(theta, X, y) + reg
gradientReg(theta, X, y, 1)

array([8.47457627e-03, 1.87880932e-02, 7.77711864e-05, 5.03446395e-02,
1.15013308e-02, 3.76648474e-02, 1.83559872e-02, 7.32393391e-03,
8.19244468e-03, 2.34764889e-02, 3.93486234e-02, 2.23923907e-03,
1.28600503e-02, 3.09593720e-03, 3.93028171e-02, 1.99707467e-02,
4.32983232e-03, 3.38643902e-03, 5.83822078e-03, 4.47629067e-03,
3.10079849e-02, 3.10312442e-02, 1.09740238e-03, 6.31570797e-03,
4.08503006e-04, 7.26504316e-03, 1.37646175e-03, 3.87936363e-02])

result2 = opt.fmin_tnc(func=costReg, x0=theta, fprime=gradientReg, args=(X, y, 2))
result2

(array([ 1.02252845, 0.56283947, 1.13465378, -1.78530023, -0.66539149,
-1.01863396, 0.13957 , -0.29358944, -0.30102267, -0.08324513,
-1.27206004, -0.06137383, -0.53996526, -0.17881809, -0.9419886 ,
-0.14054855, -0.17736663, -0.0769737 , -0.22918951, -0.21349652,
-0.37205413, -0.86417673, 0.00890083, -0.26795962, -0.00362248,
-0.28315235, -0.07321588, -0.75992674]), 57, 4)

from sklearn import linear_model  #compare with sklearn's built-in logistic regression
model = linear_model.LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y.ravel())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=‘warn’,
n_jobs=None, penalty=‘l2’, random_state=None, solver=‘warn’,
tol=0.0001, verbose=0, warm_start=False)

model.score(X, y)

0.8305084745762712

Evaluation:

final_theta = result2[0]
predictions = predict(final_theta, X)
correct = [1 if a==b else 0 for (a, b) in zip(predictions, y)]
accuracy = sum(correct) / len(correct)
accuracy

0.8050847457627118

print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.85      0.75      0.80        60
           1       0.77      0.86      0.81        58

   micro avg       0.81      0.81      0.81       118
   macro avg       0.81      0.81      0.80       118
weighted avg       0.81      0.81      0.80       118

Plot the decision boundary:

x = np.linspace(-1, 1.5, 250)
xx, yy = np.meshgrid(x, x)

z = feature_mapping(xx.ravel(), yy.ravel(), 6).as_matrix()
z = z @ final_theta
z = z.reshape(xx.shape)

plot_data()
plt.contour(xx, yy, z, 0)
plt.ylim(-.8, 1.2)