Reproducing a Classic: Chapter 6 of 《统计学习方法》 (Statistical Learning Methods), Logistic Regression

湾区人工智能

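This post reproduces in code the book 《统计学习方法》 (Statistical Learning Methods)^[1] by Li Hang.

Author: Huang Haiguang (黄海广)^[2]

Note: all of the code can be downloaded from github^[3].

I will publish the code in installments on the WeChat public account "机器学习初学者"; stay tuned.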

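Code contents

Chapter 1 Introduction to Statistical Learning Methods
Chapter 2 Perceptron
Chapter 3 k-Nearest Neighbors
Chapter 4 Naive Bayes
Chapter 5 Decision Trees
Chapter 6 Logistic Regression
Chapter 7 Support Vector Machines
Chapter 8 Boosting
Chapter 9 The EM Algorithm and Its Extensions
Chapter 10 Hidden Markov Models
Chapter 11 Conditional Random Fields
Chapter 12 Summary of Supervised Learning Methods

Code references: wzyonggege^[4], WenDesi^[5], 火烫火烫的^[6]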

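Chapter 6 Logistic Regression

Logistic regression (LR) is a classic classification method.

1. The logistic regression model is a classification model represented by the following conditional probability distribution. It can be used for binary or multi-class classification.

$$P(Y=k \mid x)=\frac{\exp(w_k \cdot x)}{1+\sum_{k=1}^{K-1}\exp(w_k \cdot x)}, \quad k=1,2,\cdots,K-1$$

$$P(Y=K \mid x)=\frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k \cdot x)}$$

Here, $x$ is the input feature and $w$ is the feature weight (one weight vector $w_k$ per class).

The logistic regression model derives from the logistic distribution, whose distribution function $F(x)$ is an S-shaped function. The logistic regression model is a log-odds model of the output, expressed as a linear function of the input.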
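As a small illustration of the log-odds statement, the sketch below checks numerically that the logit of the sigmoid output equals the linear function of the input. The weights `w`, bias `b`, and input `x` are hypothetical values chosen only for this example.

```python
import math

def sigmoid(z):
    """Logistic distribution function with location 0 and scale 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logit: log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# hypothetical weights, bias and input, for illustration only
w = [0.5, -1.2]
b = 0.3
x = [2.0, 1.0]

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear function of the input
p = sigmoid(z)                                # P(Y=1 | x) in the binary case

# the two printed values agree up to floating-point error
print(z, log_odds(p))
```

In other words, the model is linear in the log-odds even though it is nonlinear in the probability itself.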

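2. The maximum entropy model is a classification model represented by the following conditional probability distribution. It can also be used for binary or multi-class classification.

$$P_w(y \mid x)=\frac{1}{Z_w(x)}\exp\left(\sum_{i=1}^{n} w_i f_i(x,y)\right)$$

$$Z_w(x)=\sum_{y}\exp\left(\sum_{i=1}^{n} w_i f_i(x,y)\right)$$

where $Z_w(x)$ is the normalization factor, $f_i$ is a feature function, and $w_i$ is the weight of that feature.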
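The following is a minimal sketch of how such a conditional distribution can be evaluated, assuming indicator feature functions that fire when a feature value co-occurs with a label. The helper name `maxent_prob` and the weights are hypothetical and serve only as an illustration.

```python
import math

def maxent_prob(x, labels, weights):
    """Return P_w(y|x) for every label y.

    Assumes indicator features f(v, y) = 1 when value v appears in x together
    with label y. `weights` maps (value, label) -> w_i; missing pairs count as 0.
    """
    scores = {y: math.exp(sum(weights.get((v, y), 0.0) for v in x))
              for y in labels}
    z = sum(scores.values())  # normalization factor Z_w(x)
    return {y: s / z for y, s in scores.items()}

# hypothetical weights, for illustration only
weights = {("sunny", "no"): 1.0, ("sunny", "yes"): -0.5, ("hot", "no"): 0.2}
probs = maxent_prob(["sunny", "hot"], ["yes", "no"], weights)
print(probs)
```

Dividing by the normalization factor is what makes the values over all labels sum to one.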

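3. The maximum entropy model can be derived from the maximum entropy principle, which is a criterion for learning or estimating probability models. The principle holds that, among all possible probability models (distributions), the one with the largest entropy is the best.

Applying the maximum entropy principle to the learning of a classification model gives the following constrained optimization problem:

$$\min_{P} \; -H(P)=\sum_{x,y}\tilde{P}(x)P(y \mid x)\log P(y \mid x)$$

$$\text{s.t.} \quad E_P(f_i)-E_{\tilde{P}}(f_i)=0, \quad i=1,2,\cdots,n$$

$$\sum_{y}P(y \mid x)=1$$

Solving the dual of this optimization problem yields the maximum entropy model.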

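4. The logistic regression model and the maximum entropy model are both log-linear models.

5. Both models are usually learned by maximum likelihood estimation, or by regularized maximum likelihood estimation. Learning can then be formalized as an unconstrained optimization problem. Algorithms for solving it include improved iterative scaling, gradient descent, and quasi-Newton methods.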
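To make the "unconstrained optimization" view concrete, here is a minimal sketch of batch gradient descent on the negative log-likelihood of binary logistic regression. The tiny data set, the step size 0.1, and the iteration count are assumptions picked for illustration, not values from the book.

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood: sum of log(1 + exp(z_i)) - y_i * z_i."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)  # logaddexp avoids overflow

def gradient(w, X, y):
    """Gradient of the negative log-likelihood with respect to w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y)

# hypothetical data; the first column is the constant 1 for the bias term
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
losses = []
for _ in range(500):
    losses.append(neg_log_likelihood(w, X, y))
    w -= 0.1 * gradient(w, X, y)  # plain gradient descent step
losses.append(neg_log_likelihood(w, X, y))

print(losses[0], losses[-1])
```

The objective is convex, so with a small enough step size the loss decreases at every iteration.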

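Regression model:

$$f(x)=\frac{1}{1+e^{-wx}}$$

where $wx$ is the linear function:

$$wx=w_0 x_0+w_1 x_1+w_2 x_2+\cdots+w_n x_n, \quad (x_0=1)$$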

from math import exp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# data
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, [0,1,-1]])
    # print(data)
    return data[:,:2], data[:,-1]

X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

class LogisticRegressionClassifier:
    def __init__(self, max_iter=200, learning_rate=0.01):
        self.max_iter = max_iter
        self.learning_rate = learning_rate

    def sigmoid(self, x):
        return 1 / (1 + exp(-x))

    def data_matrix(self, X):
        # prepend a constant 1.0 to every sample so that weights[0] is the bias
        data_mat = []
        for d in X:
            data_mat.append([1.0, *d])
        return data_mat

    def fit(self, X, y):
        data_mat = self.data_matrix(X)  # m*n
        self.weights = np.zeros((len(data_mat[0]), 1), dtype=np.float32)

        for iter_ in range(self.max_iter):
            for i in range(len(X)):
                # np.dot returns a 1-element array; .item() extracts the scalar
                # that math.exp expects
                z = np.dot(data_mat[i], self.weights).item()
                result = self.sigmoid(z)
                error = y[i] - result
                # per-sample update along the gradient of the log-likelihood
                self.weights += self.learning_rate * error * np.transpose(
                    [data_mat[i]])
        print('LogisticRegression Model(learning_rate={},max_iter={})'.format(
            self.learning_rate, self.max_iter))

    def score(self, X_test, y_test):
        right = 0
        X_test = self.data_matrix(X_test)
        for x, label in zip(X_test, y_test):
            result = np.dot(x, self.weights).item()
            if (result > 0 and label == 1) or (result < 0 and label == 0):
                right += 1
        return right / len(X_test)

lr_clf = LogisticRegressionClassifier()
lr_clf.fit(X_train, y_train)

LogisticRegression Model(learning_rate=0.01,max_iter=200)
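The update inside `fit` can be read as one step of stochastic gradient ascent on the log-likelihood. The derivation below is a sketch of that reading; the symbols $\eta$ (for `learning_rate`) and $\sigma$ (for `sigmoid`) are notation introduced here, and $x_i$ is the sample with the constant 1 prepended.

```latex
\begin{aligned}
L(w) &= \sum_{i=1}^{N}\Bigl[\, y_i\,(w\cdot x_i) - \log\bigl(1+\exp(w\cdot x_i)\bigr) \Bigr] \\
\frac{\partial L(w)}{\partial w} &= \sum_{i=1}^{N}\bigl(y_i - \sigma(w\cdot x_i)\bigr)\,x_i,
\qquad \sigma(z)=\frac{1}{1+e^{-z}} \\
w &\leftarrow w + \eta\,\bigl(y_i - \sigma(w\cdot x_i)\bigr)\,x_i
\end{aligned}
```

The last line uses the gradient contribution of a single sample, which corresponds to `error = y[i] - result` followed by the in-place update of `self.weights`.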

lr_clf.score(X_test, y_test)

1.0

x_points = np.arange(4, 8)
y_ = -(lr_clf.weights[1]*x_points + lr_clf.weights[0])/lr_clf.weights[2]
plt.plot(x_points, y_)


#lr_clf.show_graph()
plt.scatter(X[:50,0],X[:50,1], label='0')
plt.scatter(X[50:,0],X[50:,1], label='1')
plt.legend()
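The plotted line is the decision boundary, i.e. the set of points where the linear function equals zero and the predicted probability is 0.5. A minimal sketch of that relationship follows; the helper `boundary_x2` and the weight values are hypothetical, not the fitted ones.

```python
import math

def boundary_x2(x1, w0, w1, w2):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2 (requires w2 != 0)."""
    return -(w0 + w1 * x1) / w2

# hypothetical weights, for illustration only
w0, w1, w2 = -0.5, 2.0, -3.0

on_boundary = []
for x1 in (4.0, 5.0, 6.0, 7.0):
    x2 = boundary_x2(x1, w0, w1, w2)
    z = w0 + w1 * x1 + w2 * x2
    p = 1.0 / (1.0 + math.exp(-z))  # predicted probability at (x1, x2)
    on_boundary.append((x1, x2, p))

print(on_boundary)  # every p equals 0.5 up to rounding
```

Points on one side of the line get a positive linear score and are classified as label 1; points on the other side are classified as label 0.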

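scikit-learn example

sklearn.linear_model.LogisticRegression

The solver parameter determines how the logistic regression loss function is optimized. This post covers four of the available algorithms:

a) liblinear: implemented with the open-source liblinear library, which uses coordinate descent internally to optimize the loss function iteratively.
b) lbfgs: a quasi-Newton method. It optimizes the loss function iteratively using an approximation to the second-derivative (Hessian) matrix built from gradient information, rather than the exact Hessian.
c) newton-cg: also a member of the Newton family. It uses the second-derivative (Hessian) matrix of the loss function to optimize it iteratively.
d) sag: Stochastic Average Gradient descent, a variant of gradient descent. Unlike ordinary gradient descent, each iteration uses only part of the samples to compute the gradient, which makes it suitable for large data sets.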
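As a rough sketch of how the choice of solver is passed in, the snippet below fits the same two-feature iris subset with each of the four solvers. The variable names with the `_demo` suffix, `random_state=0`, and `max_iter=1000` are assumptions made for this example; accuracy values will depend on the split and the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
# first 100 samples (two classes), sepal length and sepal width only
X_demo, y_demo = iris.data[:100, :2], iris.target[:100]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

scores = {}
for solver in ["liblinear", "lbfgs", "newton-cg", "sag"]:
    demo_clf = LogisticRegression(solver=solver, max_iter=1000)
    demo_clf.fit(X_tr, y_tr)
    scores[solver] = demo_clf.score(X_te, y_te)

print(scores)
```

On a problem this small the solvers should behave almost identically; the differences matter mainly for large or high-dimensional data.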

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=200)

clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

clf.score(X_test, y_test)

1.0

print(clf.coef_, clf.intercept_)

[[ 1.94562393 -3.20898537]] [-0.49595725]

x_points = np.arange(4, 8)
y_ = -(clf.coef_[0][0]*x_points + clf.intercept_)/clf.coef_[0][1]
plt.plot(x_points, y_)


# use the plain 'o' marker; the color is set once through the keyword argument
plt.plot(X[:50, 0], X[:50, 1], 'o', color='blue', label='0')
plt.plot(X[50:, 0], X[50:, 1], 'o', color='orange', label='1')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend()

Maximum entropy model

import math
from copy import deepcopy

class MaxEntropy:
    def __init__(self, EPS=0.005):
        self._samples = []
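        self._Y = set()  # set of labels, i.e. y with duplicates removed
        self._numXY = {}  # key is (x, y), value is the number of occurrences
        self._N = 0  # number of samples
        self._Ep_ = []  # feature expectations under the empirical distribution
        self._xyID = {}  # key is (x, y), value is its id
        self._n = 0  # number of distinct (x, y) feature pairs
        self._C = 0  # maximum number of features in one sample
        self._IDxy = {}  # key is the id, value is the corresponding (x, y)
        self._w = []
        self._EPS = EPS  # convergence threshold
        self._lastw = []  # weights from the previous iteration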


    def loadData(self, dataset):
        self._samples = deepcopy(dataset)
        for items in self._samples:
            y = items[0]
            X = items[1:]
            self._Y.add(y)  # a set ignores y if it is already present
            for x in X:
                if (x, y) in self._numXY:
                    self._numXY[(x, y)] += 1
                else:
                    self._numXY[(x, y)] = 1


        self._N = len(self._samples)
        self._n = len(self._numXY)
        self._C = max([len(sample) - 1 for sample in self._samples])
        self._w = [0] * self._n
        self._lastw = self._w[:]


        self._Ep_ = [0] * self._n
        for i, xy in enumerate(self._numXY):  # expectation of feature function f_i under the empirical distribution
            self._Ep_[i] = self._numXY[xy] / self._N
            self._xyID[xy] = i
            self._IDxy[i] = xy


    def _Zx(self, X):  # compute the normalization factor Z(x)
        zx = 0
        for y in self._Y:
            ss = 0
            for x in X:
                if (x, y) in self._numXY:
                    ss += self._w[self._xyID[(x, y)]]
            zx += math.exp(ss)
        return zx


    def _model_pyx(self, y, X):  # compute P(y|x)
        zx = self._Zx(X)
        ss = 0
        for x in X:
            if (x, y) in self._numXY:
                ss += self._w[self._xyID[(x, y)]]
        pyx = math.exp(ss) / zx
        return pyx


    def _model_ep(self, index):  # expectation of feature function f_i under the model
        x, y = self._IDxy[index]
        ep = 0
        for sample in self._samples:
            if x not in sample:
                continue
            pyx = self._model_pyx(y, sample)
            ep += pyx / self._N
        return ep


    def _convergence(self):  # check whether every weight has converged
        for last, now in zip(self._lastw, self._w):
            if abs(last - now) >= self._EPS:
                return False
        return True


    def predict(self, X):  # compute the predicted probability of each label
        Z = self._Zx(X)
        result = {}
        for y in self._Y:
            ss = 0
            for x in X:
                if (x, y) in self._numXY:
                    ss += self._w[self._xyID[(x, y)]]
            pyx = math.exp(ss) / Z
            result[y] = pyx
        return result


    def train(self, maxiter=1000):  # fit the weights on the loaded data
        for loop in range(maxiter):  # upper bound on the number of iterations
            print("iter:%d" % loop)
            self._lastw = self._w[:]
            for i in range(self._n):
                ep = self._model_ep(i)  # model expectation of the i-th feature
                self._w[i] += math.log(self._Ep_[i] / ep) / self._C  # update the weight
            print("w:", self._w)
            if self._convergence():  # stop once converged
                break

dataset = [['no', 'sunny', 'hot', 'high', 'FALSE'],
           ['no', 'sunny', 'hot', 'high', 'TRUE'],
           ['yes', 'overcast', 'hot', 'high', 'FALSE'],
           ['yes', 'rainy', 'mild', 'high', 'FALSE'],
           ['yes', 'rainy', 'cool', 'normal', 'FALSE'],
           ['no', 'rainy', 'cool', 'normal', 'TRUE'],
           ['yes', 'overcast', 'cool', 'normal', 'TRUE'],
           ['no', 'sunny', 'mild', 'high', 'FALSE'],
           ['yes', 'sunny', 'cool', 'normal', 'FALSE'],
           ['yes', 'rainy', 'mild', 'normal', 'FALSE'],
           ['yes', 'sunny', 'mild', 'normal', 'TRUE'],
           ['yes', 'overcast', 'mild', 'high', 'TRUE'],
           ['yes', 'overcast', 'hot', 'normal', 'FALSE'],
           ['no', 'rainy', 'mild', 'high', 'TRUE']]

maxent = MaxEntropy()
x = ['overcast', 'mild', 'high', 'FALSE']

maxent.loadData(dataset)
maxent.train()

iter:0
w: [0.0455803891984887, -0.002832177999673058, 0.031103560672370825, -0.1772024616282862, -0.0037548445453157455, 0.16394435955437575, -0.02051493923938058, -0.049675901430111545, 0.08288783767234777, 0.030474400362443962, 0.05913652210443954, 0.08028783103573349, 0.1047516055195683, -0.017733409097415182, -0.12279936099838235, -0.2525211841208849, -0.033080678592754015, -0.06511302013721994, -0.08720030253991244]
iter:1
w: [0.11525071899801315, 0.019484939219927316, 0.07502777039579785, -0.29094979172869884, 0.023544184009850026, 0.2833018051925922, -0.04928887087664562, -0.101950931659509, 0.12655289130431963, 0.016078718904129236, 0.09710585487843026, 0.10327329399123442, 0.16183727320804359, 0.013224083490515591, -0.17018583153306513, -0.44038644519804815, -0.07026660158873668, -0.11606564516054546, -0.1711390483931799]
iter:2
w: [0.18178907332733973, 0.04233703122822168, 0.11301330241050131, -0.37456674484068975, 0.05599764270990431, 0.38356978711239126, -0.07488546168160945, -0.14671211613144097, 0.15633348706002106, -0.011836411721359321, 0.12895826039781944, 0.10572969681821211, 0.19953102749655352, 0.06399991656546679, -0.17475388854415905, -0.5893308194447993, -0.10405912653008922, -0.16350962040062977, -0.24701967386590512]
iter:3
w: [0.2402117261976856, 0.06087651054892573, 0.14300856884173724, -0.44265412294427664, 0.08623192206158618, 0.47264512563925376, -0.09600090083002198, -0.18353847640798293, 0.17967535014110475, -0.04398112111909075, 0.15854994616895085, 0.09937760679990165, 0.22754399461146121, 0.12138068016302067, -0.15616500410638443, -0.7136213594089919, -0.13342640817803014, -0.2097936229338585, -0.3153356710047331]
iter:4
w: [0.2914313208012359, 0.07547654306538813, 0.16668283431764536, -0.5013655616789854, 0.1129176109082406, 0.553725824276617, -0.11340104779016447, -0.214026170852028, 0.19932565541497924, -0.07698174342694904, 0.18676347888513212, 0.08897527479225055, 0.250034281885875, 0.17966909953761648, -0.12561912266368833, -0.8214131440732644, -0.15887039192864807, -0.255021849396353, -0.3775163032854775]
iter:5
w: [0.3371038340609469, 0.08719816942080917, 0.1858885244336221, -0.5536101929687616, 0.13629778855333752, 0.6284587190599515, -0.12800357294309486, -0.23983404211792342, 0.21652676634149073, -0.10944257416223822, 0.2137132093479417, 0.07676820672685672, 0.2690824414813502, 0.2363909590693551, -0.0894456215757756, -0.9176374337279947, -0.18113135827470755, -0.298867529863144, -0.43486330681003177]
iter:6
w: [0.3785824456804424, 0.09688384478129128, 0.2020182323803342, -0.6009874705178111, 0.15692184161636785, 0.6978719259357552, -0.14051015547758733, -0.26225964542470354, 0.231946295562788, -0.14075188805495795, 0.23936253047337575, 0.06390380813674021, 0.2858409112894462, 0.290497131241793, -0.05118245076440544, -1.0054122371529666, -0.20087035680546067, -0.34104258966535955, -0.4883751534969831]
... (intermediate iterations omitted)
iter:663
w: [3.806361507565719, 0.0348973837073587, 1.6391762776402004, -4.46082036700038, 1.7872898160522181, 5.305910631880809, -0.13401635325297073, -2.2528324581617647, 1.4833115301839292, -1.8899383652170454, 1.9323695880561387, -1.2622764904730739, 1.7249196963071136, 2.966398532640618, 3.904166955381073, -9.515244625579237, -1.8726512915652174, -3.4821197858946427, -5.634828605832783]
iter:664
w: [3.8083642640626554, 0.03486819339595951, 1.6400224976589866, -4.463151671894514, 1.7883062251202617, 5.308526768308639, -0.13398764643967714, -2.2539799445450406, 1.4840784189709668, -1.890906591367886, 1.933249316738729, -1.2629454476069037, 1.7257519419059324, 2.967849703391228, 3.9061632698216244, -9.520241584621713, -1.8736788731126397, -3.483844660866203, -5.637874599559359]

print('predict:', maxent.predict(x))

predict: {'no': 2.819781341881656e-06, 'yes': 0.9999971802186581}
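The weight update in `train` matches the closed-form step that iterative scaling reduces to when every $(x, y)$ pair activates the same number $M$ of features. The block below is a sketch of that correspondence under this constant-count assumption; $\delta_i$ and $M$ are notation introduced here.

```latex
\begin{aligned}
\delta_i &= \frac{1}{M}\,\log\frac{E_{\tilde{P}}(f_i)}{E_{P}(f_i)},
\qquad w_i \leftarrow w_i + \delta_i \\
E_{\tilde{P}}(f_i) &= \sum_{x,y}\tilde{P}(x,y)\,f_i(x,y) \\
E_{P}(f_i) &= \sum_{x,y}\tilde{P}(x)\,P_w(y \mid x)\,f_i(x,y)
\end{aligned}
```

In the code, `self._Ep_[i]` plays the role of $E_{\tilde{P}}(f_i)$, `self._model_ep(i)` plays the role of $E_{P}(f_i)$, and `self._C` stands in for $M$ (4 for this data set, since every sample has four feature values). Because some value and label pairs never occur in the training data, the count is not exactly constant here, so the update is best read as an approximation of that step.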

References

[1] 《统计学习方法》 (Statistical Learning Methods): https://baike.baidu.com/item/统计学习方法/10430179
[2] 黄海广 (Huang Haiguang): https://github.com/fengdu78
[3] github: https://github.com/fengdu78/lihang-code
[4] wzyonggege: https://github.com/wzyonggege/statistical-learning-method
[5] WenDesi: https://github.com/WenDesi/lihang_book_algorithm
[6] 火烫火烫的: https://blog.csdn.net/tudaodiaozhale


