Machine Learning - Logistic Regression

Section I: Brief Glimpse Into Logistic Regression

Logistic regression is a classification model that is easy to implement yet performs very well on linearly separable classes, and it is one of the most widely used classification algorithms in industry. Like the perceptron and Adaline, logistic regression is a linear model for binary classification that can also be extended to multiclass classification, for example via the one-vs-rest (OvR) technique.
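As a quick illustration of the OvR idea (a minimal sketch, not part of the original walkthrough; it assumes only scikit-learn), OneVsRestClassifier fits one binary logistic regression per class and predicts with the most confident one:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X,y=load_iris(return_X_y=True)

#Fit one binary classifier per class; prediction picks the most confident one
ovr=OneVsRestClassifier(LogisticRegression())
ovr.fit(X,y)
print(len(ovr.estimators_))  #3: one binary logistic regression per Iris class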

Section II: Model Logistic Regression Via a Self-Coded Implementation and Sklearn

Step 1: Logistic sigmoid function

import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'family': 'Times New Roman',
        'weight': 'light'}
plt.rc("font", **font)

def sigmoid(z):
    #Logistic sigmoid: maps any real z into the open interval (0, 1)
    return 1.0/(1.0+np.exp(-z))

z=np.arange(-7,7,0.1)
phi_z=sigmoid(z)
plt.plot(z,phi_z)
plt.axvline(0.0,color='k')
plt.ylim(-0.1,1.1)
plt.xlabel('z')
plt.ylabel(r'$\phi (z)$')
plt.yticks([0.0,0.5,1.0])
ax=plt.gca()
ax.yaxis.grid(True)
plt.savefig('./fig1.png')
plt.show()

[Figure 1: the logistic sigmoid function φ(z)]
As the figure shows, the sigmoid function accepts any real number from negative to positive infinity and converts it into a value between 0 and 1. The curve crosses the y-axis at (0, 0.5).
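A quick numeric check of these properties, using the sigmoid defined above:

print(sigmoid(np.array([-100.0,0.0,100.0])))
#Approximately [0, 0.5, 1]: large negative z maps near 0, z=0 maps to exactly 0.5,
#and large positive z maps near 1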

Step 2: Logistic cost function formulated via the maximum log-likelihood function

import matplotlib.pyplot as plt
import numpy as np
#sigmoid as defined in Step 1, packaged here in a local module
from LogisticRegression.sigmoid import sigmoid

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'family': 'Times New Roman',
        'weight': 'light'}
plt.rc("font", **font)

#Logistic regression cost function
#For a single training sample:
#J(phi(z), y; w) = -y*log(phi(z)) - (1-y)*log(1-phi(z))

def cost_1(z):
    return -np.log(sigmoid(z))

def cost_0(z):
    return -np.log(1-sigmoid(z))

z=np.arange(-10,10,0.1)
phi_z=sigmoid(z)

c1=[cost_1(x) for x in z]
plt.plot(phi_z,c1,label='J(w) if y=1')
c0=[cost_0(x) for x in z]
plt.plot(phi_z,c0,linestyle='--',label='J(w) if y=0')
plt.ylim(0.0,5.1)
plt.xlim([0,1])
plt.xlabel(r'$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='upper left')
plt.savefig('./fig2.png')
plt.show()

[Figure 2: the cost J(w) as a function of φ(z) for y=1 (solid) and y=0 (dashed)]
Summary
Two observations follow from the figure. First, when the predicted class agrees with the true class, the cost approaches 0. Second, when the prediction is completely at odds with the true class, the cost function penalizes it much more heavily. Interestingly, the sigmoid output is a value in (0, 1) and can be interpreted as the probability that the sample belongs to the positive class. As before, the model can also be extended to multiclass applications via the OvR technique.
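A small worked example (continuing from the functions above, with z values chosen only for illustration) makes this asymmetry concrete:

#Penalty for a true label y=1 at two predicted probabilities
z_good=2.2    #phi(z) is about 0.90, close to the true label
z_bad=-2.2    #phi(z) is about 0.10, far from the true label
print(cost_1(z_good))   #about 0.11: small penalty for a confident correct prediction
print(cost_1(z_bad))    #about 2.30: far larger penalty for a confident wrong one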

Step 3: Logistic regression implementation

Part 1: Logistic Regression implementation

import numpy as np

class LogisticRegressionGD(object):
    """Logistic regression classifier trained with batch gradient descent."""
    def __init__(self,eta=0.05,n_iter=100,random_state=1):
        self.eta=eta                    #learning rate
        self.n_iter=n_iter              #number of passes over the training set
        self.random_state=random_state  #seed for reproducible weight initialization

    def fit(self,X,y):
        rgen=np.random.RandomState(self.random_state)
        #Initialize the weights to small random numbers; w_[0] is the bias unit
        self.w_=rgen.normal(loc=0.0,scale=0.01,
                            size=1+X.shape[1])
        self.cost_=[]

        for i in range(self.n_iter):
            net_input=self.net_input(X)
            output=self.activation(net_input)
            errors=(y-output)
            #Gradient descent step on the logistic (negative log-likelihood) cost
            self.w_[1:]+=self.eta*X.T.dot(errors)
            self.w_[0]+=self.eta*errors.sum()

            #J(w)=-y*log(phi(z))-(1-y)*log(1-phi(z)), summed over all samples
            cost=(-y.dot(np.log(output))-(1-y).dot(np.log(1-output)))
            self.cost_.append(cost)
        return self

    def net_input(self,X):
        return np.dot(X,self.w_[1:])+self.w_[0]

    def activation(self,z):
        #Logistic sigmoid; z is clipped to avoid overflow in np.exp
        return 1.0/(1.0+np.exp(-np.clip(z,-250,250)))

    def predict(self,X):
        #Thresholding z at 0 is equivalent to thresholding phi(z) at 0.5
        return np.where(self.net_input(X)>=0.0,1,0)
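Note that predict thresholds the net input z at 0 rather than the sigmoid output: since phi(0)=0.5 and the sigmoid is monotonically increasing, z>=0 is exactly equivalent to phi(z)>=0.5, so the activation can be skipped at prediction time. A minimal check of this equivalence (the weights below are hypothetical, chosen only for illustration):

model=LogisticRegressionGD()
model.w_=np.array([0.0,1.0,-1.0])  #hypothetical weights: bias, then one weight per feature
X_demo=np.array([[2.0,1.0],[1.0,2.0]])
z=model.net_input(X_demo)
print((z>=0.0)==(model.activation(z)>=0.5))  #[ True  True]: the two rules always agree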

Part 2: Usage

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
#Local modules: the LogisticRegressionGD class from Part 1 and a decision-region plotting helper
from LogisticRegression import logistic_regression
from LogisticRegression.visualize import plot_decision_regions

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'family': 'Times New Roman',
        'weight': 'light'}
plt.rc("font", **font)

##Section 1: Load data and split it into train/test dataset
iris=datasets.load_iris()
X=iris.data[:,[2,3]]
y=iris.target

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)
X_train_01_subset=X_train[(y_train==0)|(y_train==1)]
y_train_01_subset=y_train[(y_train==0)|(y_train==1)]

lrgd=logistic_regression.LogisticRegressionGD(eta=0.05,n_iter=1000,
                                              random_state=1)
lrgd.fit(X_train_01_subset,y_train_01_subset)

plot_decision_regions(X=X_train_01_subset,
                      y=y_train_01_subset,
                      classifier=lrgd)
plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.savefig('./fig3.png')
plt.show()

[Figure 3: decision regions of the self-coded LogisticRegressionGD on the two-class Iris subset]
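As a quick sanity check (not in the original), training accuracy on this two-class subset can be computed directly; Setosa and Versicolor are linearly separable in these two features, so it should reach 1.0:

import numpy as np

print(np.mean(lrgd.predict(X_train_01_subset)==y_train_01_subset))  #expected: 1.0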

Step 4: Train a Logistic Regression model with Sklearn

import matplotlib.pyplot as plt
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
#Local plotting helper that additionally highlights the test samples via test_idx
from LogisticRegression.visualize_test_idx import plot_decision_regions

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'family': 'Times New Roman',
        'weight': 'light'}
plt.rc("font", **font)

##Section 1: Load data and split it into train/test dataset
iris=datasets.load_iris()
X=iris.data[:,[2,3]]
y=iris.target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)

#Section 2: Standardize the features
sc=StandardScaler()
sc.fit(X_train)
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)

#Section 3: Train Logistic Regression model
lr=LogisticRegression(C=100,random_state=1)
lr.fit(X_train_std,y_train)
X_combined_std=np.vstack((X_train_std,X_test_std))
y_combined=np.hstack((y_train,y_test))

plot_decision_regions(X=X_combined_std,
                      y=y_combined,
                      classifier=lr,
                      test_idx=range(105,150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.savefig('./fig4.png')
plt.show()

[Figure 4: decision regions of sklearn's LogisticRegression on the standardized Iris data, with test samples highlighted]
Summary
As the figure shows, sklearn's LogisticRegression model separates the three Iris classes effectively.

To make the usage of the model's predict and predict_proba methods easier to follow, an example is given below:

#Section 4: Predict the probability and class type
print("The probability belonging to each class: \n",lr.predict_proba(X_test_std[:3]))
print("Class type: \n",lr.predict_proba(X_test_std[:3,:]).argmax(axis=1))
print("Class type: \n",lr.predict(X_test_std[:3,:]))

The output is as follows:

The probability belonging to each class: 
 [[3.17983737e-08 1.44886616e-01 8.55113353e-01]
 [8.33962295e-01 1.66037705e-01 4.55557009e-12]
 [8.48762934e-01 1.51237066e-01 4.63166788e-13]]
Class type: 
 [2 0 0]
Class type: 
 [2 0 0]
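Each row of predict_proba is a probability distribution over the three classes and therefore sums to 1; continuing from the session above:

print(lr.predict_proba(X_test_std[:3,:]).sum(axis=1))  #[1. 1. 1.]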
Section III: Tackle Overfitting Via Regularization

Overfitting is a common problem in machine learning: a model that learns the training data well can still generalize poorly to unseen test data. Concretely, "overfitting" corresponds to high variance, while "underfitting" corresponds to high bias, i.e. the model is not expressive enough to capture the information hidden in the training data.
Overfitting can be curbed with L1 or L2 regularization. L1 regularization induces sparsity, driving many weights to exactly zero and reacting less strongly to large values, whereas L2 regularization yields dense solutions and is more sensitive to large weights, penalizing them more heavily.
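Before the full regularization-path example below, here is a minimal sketch of the L1/L2 contrast (assuming scikit-learn; penalty='l1' needs a compatible solver such as 'liblinear', and the C value is illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X,y=load_iris(return_X_y=True)
X_std=StandardScaler().fit_transform(X)

#Strong regularization (small C) makes the contrast visible
l1=LogisticRegression(penalty='l1',solver='liblinear',C=0.1).fit(X_std,y)
l2=LogisticRegression(penalty='l2',solver='liblinear',C=0.1).fit(X_std,y)

#L1 typically drives some coefficients to exactly zero; L2 keeps them small but nonzero
print('L1 zero coefficients:',(l1.coef_==0).sum())
print('L2 zero coefficients:',(l2.coef_==0).sum())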

import matplotlib.pyplot as plt
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'family': 'Times New Roman',
        'weight': 'light'}
plt.rc("font", **font)

##Section 1: Load data and split it into train/test dataset
iris=datasets.load_iris()
X=iris.data[:,[2,3]]
y=iris.target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)

#Section 2: Standardize the features
sc=StandardScaler()
sc.fit(X_train)
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)

#Section 3: The effect of regularization strength on the weight parameters
weights,params=[],[]
for c in np.arange(-5,5):
    lr=LogisticRegression(C=10.**c,random_state=1)
    lr.fit(X_train_std,y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)

weights=np.array(weights)
plt.plot(params,weights[:,0],label='petal length')
plt.plot(params,weights[:,1],label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
plt.savefig('./fig5.png')
plt.show()

Note that the weights collected here are those of class 1 (lr.coef_[1]): in lr.coef_, each row corresponds to one class and each column to one feature, so only the weight parameters of that single class are gathered.

lr.coef_
Out[3]: 
array([[-4.55059393e-04, -4.37654048e-04],
       [ 9.45879351e-05,  5.76462665e-05],
       [ 3.60471456e-04,  3.80007780e-04]])

[Figure 5: weight coefficients of class 1 versus the regularization parameter C (log scale)]
As the figure shows, the smaller the parameter C, the more the weight coefficients shrink toward zero; conversely, as C grows the weights grow large, which can hurt generalization. C is the inverse of the regularization parameter λ, so a smaller C means a stronger regularization penalty.
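To make the shrinkage concrete, the following continues from the session above (the two C values are illustrative):

for c in (1e-3,1e3):
    lr=LogisticRegression(C=c,random_state=1)
    lr.fit(X_train_std,y_train)
    #With C=1e-3 (strong regularization) the coefficients sit near zero;
    #with C=1e3 (weak regularization) they are orders of magnitude larger
    print('C=%g, max |coef|: %.4f'%(c,np.abs(lr.coef_).max()))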

References
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.
