Notes | Statistical Learning Methods: Logistic Regression and Maximum Entropy

Introduction

Logistic regression (LR), despite being called regression, is an algorithm for classification problems.
The LR model builds on a linear model: a sigmoid activation function squashes the linear output into the interval $[0, 1]$, giving it a probabilistic interpretation. At its core it is still a linear model and is relatively simple to implement. The LR unit is also a basic building block of deep learning.

The General Regression Model

Regression model:

$$
f(x) = \frac{1}{1+e^{-w \cdot x}}
$$

where $w \cdot x$ is the linear function

$$
w \cdot x = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n, \qquad x_0 = 1
$$
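As a quick illustration, here is a minimal NumPy sketch of this prediction function; the weight and feature values are made up, and the bias is handled by prepending $x_0 = 1$ as in the formula above:

```python
import numpy as np

def sigmoid(z):
    # squash a real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict_proba(w, x):
    # P(y=1 | x) for the linear score w.x, with x_0 = 1 folded in for the bias w_0
    x = np.concatenate(([1.0], x))
    return sigmoid(np.dot(w, x))

# toy values for illustration only
w = np.array([-1.0, 2.0, 0.5])   # w_0 (bias), w_1, w_2
x = np.array([0.3, 1.2])
print(lr_predict_proba(w, x))    # a probability in (0, 1)
```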

The Logistic Distribution

The logistic distribution is defined by the following distribution and density functions.

Distribution function

$$
F(x)=P(X \leq x)=\frac{1}{1+e^{-(x-\mu)/\gamma}}
$$

Density function

$$
f(x)=F'(x)=\frac{e^{-(x-\mu)/\gamma}}{\gamma\left(1+e^{-(x-\mu)/\gamma}\right)^{2}}
$$

where $\mu$ is a location parameter and $\gamma > 0$ is a shape parameter.

Plot:

{% asset_img 1.jpg %}

The distribution function is symmetric about the point $(\mu, \frac{1}{2})$, i.e.

$$
F(-x+\mu)-\frac{1}{2}=-F(x+\mu)+\frac{1}{2}
$$
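The sketch below (with arbitrarily chosen $\mu$ and $\gamma$) evaluates these two functions and checks the symmetry property numerically:

```python
import numpy as np

mu, gamma = 2.0, 1.5  # arbitrary location and shape parameters

def F(x):
    # logistic distribution function
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def f(x):
    # logistic density function
    e = np.exp(-(x - mu) / gamma)
    return e / (gamma * (1.0 + e) ** 2)

x = np.linspace(-5.0, 5.0, 11)
# symmetry about (mu, 1/2): F(-x + mu) - 1/2 == -(F(x + mu) - 1/2)
assert np.allclose(F(-x + mu) - 0.5, -(F(x + mu) - 0.5))

# the density integrates to roughly 1 over a wide grid (crude Riemann sum)
grid = np.linspace(mu - 60.0, mu + 60.0, 120001)
print(np.sum(f(grid)) * (grid[1] - grid[0]))  # ~ 1.0
```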

The Logistic Regression Model

The logistic regression model is the classification model defined by the conditional probability distributions below. It can be used for binary or multi-class classification.

$$
P(Y=k \mid x)=\frac{\exp(w_k \cdot x)}{1+\sum_{k=1}^{K-1} \exp(w_k \cdot x)}, \quad k=1,2,\cdots,K-1
$$

$$
P(Y=K \mid x)=\frac{1}{1+\sum_{k=1}^{K-1} \exp(w_k \cdot x)}
$$
Here $x$ is the input feature vector and $w_k$ are the feature weight vectors.

The logistic regression model is derived from the logistic distribution, whose distribution function $F(x)$ is S-shaped. It is a log-odds (logit) model: the log-odds of the output is a linear function of the input.
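A small sketch of the $K$-class formulas above; the weight matrix `W` stacks the $K-1$ vectors $w_k$, and all numbers are made up for illustration:

```python
import numpy as np

def multiclass_lr_proba(W, x):
    # class probabilities P(Y=k | x) for k = 1..K,
    # given the K-1 weight vectors w_k (rows of W) and x with the bias folded in
    scores = np.exp(W @ x)                 # exp(w_k . x), k = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom  # last entry is class K

W = np.array([[0.2, -0.5, 1.0],
              [1.5,  0.3, -0.7]])          # K - 1 = 2 weight vectors, so K = 3
x = np.array([1.0, 0.4, -1.2])             # x_0 = 1 for the bias term
p = multiclass_lr_proba(W, x)
print(p, p.sum())                          # the probabilities sum to 1
```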

The Maximum Entropy Model

The maximum entropy model is the classification model defined by the conditional probability distribution below. It can also be used for binary or multi-class classification.

$$
P_w(y \mid x)=\frac{1}{Z_w(x)} \exp\left(\sum_{i=1}^{n} w_i f_i(x, y)\right)
$$

$$
Z_w(x)=\sum_{y} \exp\left(\sum_{i=1}^{n} w_i f_i(x, y)\right)
$$

where $Z_w(x)$ is the normalization factor, $f_i$ are the feature functions, and $w_i$ are the feature weights.
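A toy sketch of these two formulas with hypothetical indicator features keyed by `(x_value, y)`; Example 2 below implements a full maximum entropy classifier along the same lines:

```python
import math

# hypothetical feature weights: f_i(x, y) = 1 when the pair (x_value, y) is present
weights = {('sunny', 'no'): 0.8, ('sunny', 'yes'): -0.2, ('hot', 'no'): 0.4}
labels = ['yes', 'no']

def score(x_values, y):
    # sum_i w_i * f_i(x, y) over the active indicator features
    return sum(weights.get((v, y), 0.0) for v in x_values)

def p_w(y, x_values):
    # P_w(y | x) = exp(score(x, y)) / Z_w(x)
    zx = sum(math.exp(score(x_values, yy)) for yy in labels)  # Z_w(x)
    return math.exp(score(x_values, y)) / zx

x = ['sunny', 'hot']
print({y: p_w(y, x) for y in labels})  # conditional distribution over the labels
```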

The maximum entropy model can be derived from the maximum entropy principle, a criterion for learning or estimating probability models. The principle states that, among all candidate probability models (distributions) satisfying the constraints, the model with the largest entropy is the best one. For example, when the only constraint is normalization, the uniform distribution has maximum entropy.

Applying the maximum entropy principle to learning a classification model gives the following constrained optimization problem:

$$
\min \; -H(P)=\sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)
$$

$$
\text{s.t.} \quad E_P(f_i)-E_{\tilde{P}}(f_i)=0, \quad i=1,2,\cdots,n
$$

$$
\sum_{y} P(y \mid x)=1
$$

The maximum entropy model is obtained by solving the dual of this optimization problem.

Summary

Both the logistic regression model and the maximum entropy model are log-linear models.

Both models are usually learned by maximum likelihood estimation or regularized maximum likelihood estimation. Learning can be formalized as an unconstrained optimization problem, which can be solved by improved iterative scaling (IIS), gradient descent, or quasi-Newton methods.
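For instance, for the binary logistic regression model with $P(Y=1 \mid x) = \pi(x)$, the log-likelihood maximized during training is

$$
L(w)=\sum_{i=1}^{N}\left[y_i (w \cdot x_i)-\log\left(1+\exp(w \cdot x_i)\right)\right]
$$

and its per-sample gradient $(y_i - \pi(x_i))\,x_i$ is exactly the update used in the stochastic gradient loop of Example 1 below.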

Example 1: Logistic Regression

Binary classification on the iris dataset (the first 100 samples, i.e. two classes, using the first two features).

from math import exp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


# data
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, [0,1,-1]])
    # print(data)
    return data[:,:2], data[:,-1]

X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

class LogisticRegressionClassifier:
    def __init__(self, max_iter=200, learning_rate=0.01):
        self.max_iter = max_iter
        self.learning_rate = learning_rate

    def sigmoid(self, x):
        return 1 / (1 + exp(-x))

    def data_matrix(self, X):
        data_mat = []
        for d in X:
            data_mat.append([1.0, *d])
        return data_mat

    def fit(self, X, y):
        # label = np.mat(y)
        data_mat = self.data_matrix(X)  # m*n
        self.weights = np.zeros((len(data_mat[0]), 1), dtype=np.float32)

        for iter_ in range(self.max_iter):
            for i in range(len(X)):
                # stochastic gradient ascent on the log-likelihood, one sample at a time
                result = self.sigmoid(np.dot(data_mat[i], self.weights))
                error = y[i] - result
                self.weights += self.learning_rate * error * np.transpose(
                    [data_mat[i]])
        print('LogisticRegression Model(learning_rate={},max_iter={})'.format(
            self.learning_rate, self.max_iter))

    # def f(self, x):
    #     return -(self.weights[0] + self.weights[1] * x) / self.weights[2]

    def score(self, X_test, y_test):
        right = 0
        X_test = self.data_matrix(X_test)
        for x, y in zip(X_test, y_test):
            result = np.dot(x, self.weights)
            if (result > 0 and y == 1) or (result < 0 and y == 0):
                right += 1
        return right / len(X_test)


lr_clf = LogisticRegressionClassifier()
lr_clf.fit(X_train, y_train)

lr_clf.score(X_test, y_test)

x_points = np.arange(4, 8)
y_ = -(lr_clf.weights[1]*x_points + lr_clf.weights[0])/lr_clf.weights[2]
plt.plot(x_points, y_)

#lr_clf.show_graph()
plt.scatter(X[:50,0],X[:50,1], label='0')
plt.scatter(X[50:,0],X[50:,1], label='1')
plt.legend()

Resulting plot:

{%asset_img 2.jpg%}

Example 2: Maximum Entropy Model

import math
from copy import deepcopy


class MaxEntropy:
    def __init__(self, EPS=0.005):
        self._samples = []
        self._Y = set()  # set of distinct labels (deduplicated y values)
        self._numXY = {}  # key: (x, y) pair, value: number of occurrences
        self._N = 0  # number of samples
        self._Ep_ = []  # empirical expectations of the feature functions
        self._xyID = {}  # key: (x, y) pair, value: feature id
        self._n = 0  # number of distinct (x, y) feature pairs
        self._C = 0  # maximum number of features in any sample
        self._IDxy = {}  # key: feature id, value: corresponding (x, y) pair
        self._w = []
        self._EPS = EPS  # convergence threshold
        self._lastw = []  # parameter values from the previous iteration

    def loadData(self, dataset):
        self._samples = deepcopy(dataset)
        for items in self._samples:
            y = items[0]
            X = items[1:]
            self._Y.add(y)  # adding a label that already exists is a no-op
            for x in X:
                if (x, y) in self._numXY:
                    self._numXY[(x, y)] += 1
                else:
                    self._numXY[(x, y)] = 1

        self._N = len(self._samples)
        self._n = len(self._numXY)
        self._C = max([len(sample) - 1 for sample in self._samples])
        self._w = [0] * self._n
        self._lastw = self._w[:]

        self._Ep_ = [0] * self._n
        for i, xy in enumerate(self._numXY):  # empirical expectation of each feature function fi
            self._Ep_[i] = self._numXY[xy] / self._N
            self._xyID[xy] = i
            self._IDxy[i] = xy

    def _Zx(self, X):  # compute the normalization factor Z_w(x) for input X
        zx = 0
        for y in self._Y:
            ss = 0
            for x in X:
                if (x, y) in self._numXY:
                    ss += self._w[self._xyID[(x, y)]]
            zx += math.exp(ss)
        return zx

    def _model_pyx(self, y, X):  # compute the model probability P(y|x)
        zx = self._Zx(X)
        ss = 0
        for x in X:
            if (x, y) in self._numXY:
                ss += self._w[self._xyID[(x, y)]]
        pyx = math.exp(ss) / zx
        return pyx

    def _model_ep(self, index):  # expectation of feature function fi under the model
        x, y = self._IDxy[index]
        ep = 0
        for sample in self._samples:
            if x not in sample:
                continue
            pyx = self._model_pyx(y, sample)
            ep += pyx / self._N
        return ep

    def _convergence(self):  # check whether every parameter has converged
        for last, now in zip(self._lastw, self._w):
            if abs(last - now) >= self._EPS:
                return False
        return True

    def predict(self, X):  # predict the conditional distribution P(y|x) for input X
        Z = self._Zx(X)
        result = {}
        for y in self._Y:
            ss = 0
            for x in X:
                if (x, y) in self._numXY:
                    ss += self._w[self._xyID[(x, y)]]
            pyx = math.exp(ss) / Z
            result[y] = pyx
        return result

    def train(self, maxiter=1000):  # train on the loaded dataset
        for loop in range(maxiter):  # maximum number of iterations
            print("iter:%d" % loop)
            self._lastw = self._w[:]
            for i in range(self._n):
                ep = self._model_ep(i)  # model expectation of the i-th feature
                self._w[i] += math.log(self._Ep_[i] / ep) / self._C  # iterative scaling update
            print("w:", self._w)
            if self._convergence():  # stop once all parameters have converged
                break

dataset = [['no', 'sunny', 'hot', 'high', 'FALSE'],
           ['no', 'sunny', 'hot', 'high', 'TRUE'],
           ['yes', 'overcast', 'hot', 'high', 'FALSE'],
           ['yes', 'rainy', 'mild', 'high', 'FALSE'],
           ['yes', 'rainy', 'cool', 'normal', 'FALSE'],
           ['no', 'rainy', 'cool', 'normal', 'TRUE'],
           ['yes', 'overcast', 'cool', 'normal', 'TRUE'],
           ['no', 'sunny', 'mild', 'high', 'FALSE'],
           ['yes', 'sunny', 'cool', 'normal', 'FALSE'],
           ['yes', 'rainy', 'mild', 'normal', 'FALSE'],
           ['yes', 'sunny', 'mild', 'normal', 'TRUE'],
           ['yes', 'overcast', 'mild', 'high', 'TRUE'],
           ['yes', 'overcast', 'hot', 'normal', 'FALSE'],
           ['no', 'rainy', 'mild', 'high', 'TRUE']]
maxent = MaxEntropy()
x = ['overcast', 'mild', 'high', 'FALSE']

maxent.loadData(dataset)
maxent.train()

print('predict:', maxent.predict(x))

Partial output:
iter:0
w: [0.0455803891984887, -0.002832177999673058, 0.031103560672370825, -0.1772024616282862, -0.0037548445453157455, 0.16394435955437575, -0.02051493923938058, -0.049675901430111545, 0.08288783767234777, 0.030474400362443962, 0.05913652210443954, 0.08028783103573349, 0.1047516055195683, -0.017733409097415182, -0.12279936099838235, -0.2525211841208849, -0.033080678592754015, -0.06511302013721994, -0.08720030253991244]
iter:1
w: [0.11525071899801315, 0.019484939219927316, 0.07502777039579785, -0.29094979172869884, 0.023544184009850026, 0.2833018051925922, -0.04928887087664562, -0.101950931659509, 0.12655289130431963, 0.016078718904129236, 0.09710585487843026, 0.10327329399123442, 0.16183727320804359, 0.013224083490515591, -0.17018583153306513, -0.44038644519804815, -0.07026660158873668, -0.11606564516054546, -0.1711390483931799]
...
Final output:
predict: {'no': 2.819781341881656e-06, 'yes': 0.9999971802186581}

The DFP Algorithm for Maximum Entropy

Step 1:
The maximum entropy model is:

$$
\begin{array}{cl}
\max & H(P)=-\sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x) \\
\text{s.t.} & E_P(f_i)-E_{\tilde{P}}(f_i)=0, \quad i=1,2,\cdots,n \\
& \sum_y P(y \mid x)=1
\end{array}
$$
Introducing Lagrange multipliers $w_0, w_1, \cdots, w_n$, define the Lagrangian:

$$
L(P, w)=\sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)+w_0\left(1-\sum_y P(y \mid x)\right)+\sum_{i=1}^{n} w_i\left(\sum_{x, y} \tilde{P}(x, y) f_i(x, y)-\sum_{x, y} \tilde{P}(x) P(y \mid x) f_i(x, y)\right)
$$

The primal optimization problem is:

$$
\min_{P \in \mathcal{C}} \max_{w} L(P, w)
$$

and its dual problem is:

$$
\max_{w} \min_{P \in \mathcal{C}} L(P, w)
$$

$$
\Psi(w) = \min_{P \in \mathcal{C}} L(P, w) = L(P_w, w)
$$

$\Psi(w)$ is called the dual function, and its minimizer is denoted

$$
P_w = \arg\min_{P \in \mathcal{C}} L(P, w) = P_w(y \mid x)
$$
L ( P , w ) L(P,w) L(P,w) P ( y ∣ x ) P(y|x) P(yx)的偏导数,并令偏导数等于0,解得:
P w ( y ∣ x ) = 1 Z w ( x ) exp ⁡ ( ∑ i = 1 n w i f i ( x , y ) ) P_w(y | x)=\frac{1}{Z_w(x)} \exp \left(\sum_{i=1}^n w_i f_i (x, y)\right) Pw(yx)=Zw(x)1exp(i=1nwifi(x,y))
其中:
Z w ( x ) = ∑ y exp ⁡ ( ∑ i = 1 n w i f i ( x , y ) ) Z_w(x)=\sum_y \exp \left(\sum_{i=1}^n w_i f_i(x, y)\right) Zw(x)=yexp(i=1nwifi(x,y))
The dual problem $\max_w \Psi(w)$ is therefore equivalent to minimizing, over $w \in \mathbf{R}^n$, the objective function

$$
\varphi(w)=-\Psi(w)=\sum_{x} \tilde{P}(x) \log \sum_{y} \exp\left(\sum_{i=1}^{n} w_i f_i(x, y)\right)-\sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} w_i f_i(x, y)
$$

Step 2:
The DFP update formula for $G_{k+1}$ is:

$$
G_{k+1}=G_k+\frac{\delta_k \delta_k^T}{\delta_k^T y_k}-\frac{G_k y_k y_k^T G_k}{y_k^T G_k y_k}
$$

where $\delta_k = w^{(k+1)} - w^{(k)}$ and $y_k = g_{k+1} - g_k$.
The DFP algorithm for the maximum entropy model (a minimal code sketch follows the steps below):
Input: objective function $\varphi(w)$, gradient $g(w) = \nabla \varphi(w)$, precision requirement $\varepsilon$;
Output: the minimizer $w^*$ of $\varphi(w)$.
(1) Choose an initial point $w^{(0)}$, take $G_0$ to be a positive definite symmetric matrix, and set $k=0$;
(2) Compute $g_k=g(w^{(k)})$. If $\|g_k\| < \varepsilon$, stop and return the approximate solution $w^*=w^{(k)}$; otherwise go to (3);
(3) Set $p_k=-G_k g_k$;
(4) Line search: find $\lambda_k$ such that $\varphi(w^{(k)}+\lambda_k p_k)=\min_{\lambda \geqslant 0} \varphi(w^{(k)}+\lambda p_k)$;
(5) Set $w^{(k+1)}=w^{(k)}+\lambda_k p_k$;
(6) Compute $g_{k+1}=g(w^{(k+1)})$. If $\|g_{k+1}\| < \varepsilon$, stop and return the approximate solution $w^*=w^{(k+1)}$; otherwise compute $G_{k+1}$ using the update formula above;
(7) Set $k=k+1$ and go to (3).
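Below is a minimal, generic sketch of this DFP procedure in NumPy. The quadratic test objective, the crude backtracking line search, and the tolerances are illustrative assumptions; for the maximum entropy model, $\varphi(w)$ and its gradient would be supplied as `func` and `grad`:

```python
import numpy as np

def dfp(func, grad, w0, eps=1e-6, max_iter=100):
    # quasi-Newton minimization with the DFP inverse-Hessian update
    w = np.asarray(w0, dtype=float)
    G = np.eye(len(w))                      # (1) G_0: identity (positive definite, symmetric)
    g = grad(w)                             # (2) initial gradient
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:         # stopping test
            break
        p = -G @ g                          # (3) search direction
        lam = 1.0                           # (4) crude backtracking line search
        while func(w + lam * p) > func(w) and lam > 1e-10:
            lam *= 0.5
        w_new = w + lam * p                 # (5) update the iterate
        g_new = grad(w_new)                 # (6) new gradient
        delta, y = w_new - w, g_new - g
        if abs(delta @ y) > 1e-12 and abs(y @ G @ y) > 1e-12:
            # DFP update: G_{k+1} = G_k + d d^T / (d^T y) - G y y^T G / (y^T G y)
            G = G + np.outer(delta, delta) / (delta @ y) \
                  - (G @ np.outer(y, y) @ G) / (y @ G @ y)
        w, g = w_new, g_new                 # (7) next iteration
    return w

# toy quadratic objective with minimum at (1, -2)
func = lambda w: (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2
grad = lambda w: np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)])
print(dfp(func, grad, np.array([10.0, 10.0])))  # approximately [1, -2]
```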
