Machine Learning: Training a Classification Model with LogisticRegression

鱼鱼～

Category: Machine Learning. Tags: python, machine learning, artificial intelligence, algorithms

Article link: https://blog.csdn.net/weixin_44730000/article/details/113346547

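1. LogisticRegression (logistic regression)

Logistic regression is a generalized linear model. It passes a linear combination of the features through the logistic (sigmoid) function to estimate the probability that a sample belongs to a class; the higher the probability, the more likely the sample belongs to that class.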
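The probability computation described above can be sketched in a few lines of NumPy. The weights `w`, the bias `b` and the sample `x` are hypothetical values chosen only for illustration; in practice they are learned from the training data.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias and sample, chosen only for illustration.
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b   # linear score
p = sigmoid(z)         # estimated probability of the positive class
label = int(p >= 0.5)  # predict class 1 when the probability is at least 0.5
print(p, label)
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual decision threshold.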

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', dual=False, C=1.0, n_jobs=1, random_state=20)

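Its main parameters are:

penalty: the penalty (regularization) term, str, commonly 'l1' or 'l2', default 'l2'. It specifies the norm used in the penalty. The newton-cg, sag and lbfgs solvers support only the L2 norm. L1 regularization corresponds to assuming that the model parameters follow a Laplace distribution, L2 to assuming a Gaussian distribution. The penalty constrains the parameters so that the model is less prone to overfitting. Regularization does not guarantee a better model in every case, but in theory it should yield a model that generalizes better.
dual: dual or primal formulation, bool, default False. The dual formulation is only implemented for the L2 penalty with the liblinear solver. When the number of samples is larger than the number of features, dual should usually be False.
C: the inverse of the regularization strength λ, float, default 1.0. It must be a positive float. As in an SVM, smaller values mean stronger regularization.
random_state: random seed, int, optional, default None. It only has an effect for solvers that shuffle the data, such as sag and liblinear.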
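warm_start: warm start, bool, default False. If True, the next call to fit reuses the solution of the previous call as its initialization instead of starting from scratch. (Nothing is "appended" as with tree ensembles; only the starting point of the optimization changes.)
n_jobs: number of parallel jobs, int, default 1 in the version used here (newer releases default to None, which also means one core). With 1 the program runs on one CPU core, with 2 on two cores, and with -1 on all available cores.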
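As a rough illustration of the effect of C, the sketch below fits two models on synthetic data and compares the size of their coefficients. The dataset, the two C values and `max_iter=1000` are assumptions made for this example and do not come from the original post; the penalty is left at its default (L2).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data, used only to keep the example self-contained.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Smaller C means stronger regularization, so the coefficients shrink.
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X_train, y_train)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X_train, y_train)

print('||coef||, C=100 :', np.linalg.norm(weak.coef_))
print('||coef||, C=0.01:', np.linalg.norm(strong.coef_))

# predict_proba returns one probability per class; each row sums to 1.
proba = strong.predict_proba(X_test[:3])
print(proba)
print(strong.predict(X_test[:3]))
```

The class returned by `predict` is the one with the highest probability in the corresponding row of `predict_proba`.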

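2. Functions used

2.1. Linear Discriminant Analysis (LDA)

LDA is based on the Fisher criterion: samples of the same class should be clustered as tightly as possible, while samples of different classes should be spread as far apart as possible. In other words, it looks for a projection with small within-class scatter and large between-class scatter.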

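from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# n_components must not exceed min(n_classes - 1, n_features),
# so 200 is only valid for a dataset with more than 200 classes.
Lda = LinearDiscriminantAnalysis(n_components=200)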

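Its main parameters are:

n_components: the number of dimensions to reduce to when LDA is used for dimensionality reduction. It must be an integer between 1 and n_classes - 1 inclusive (and no larger than the number of features). If LDA is not used for dimensionality reduction, the default None can be kept.
solver: the method used to compute the LDA projection. The options are singular value decomposition 'svd', least squares 'lsqr' and eigenvalue decomposition 'eigen'. In general svd is recommended when the number of features is very large, and eigen when it is small. Note that with svd the shrinkage parameter cannot be used for regularization. The default is 'svd'.
priors: the class prior probabilities. They can be set when LDA is used as a classifier to weight the classes differently, which affects the resulting model. They usually do not matter for dimensionality reduction.

Purpose: in a machine learning experiment not every dimension is useful. If the dimensions that strongly influence the result can be extracted and the useless ones discarded, prediction accuracy can improve and prediction time can decrease.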
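A minimal sketch of LDA used for dimensionality reduction, assuming the iris dataset bundled with scikit-learn as a stand-in for real data. Iris has 3 classes, so n_components can be at most 2.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# With 3 classes, n_components can be at most 3 - 1 = 2.
lda = LinearDiscriminantAnalysis(n_components=2, solver='svd')
X_reduced = lda.fit_transform(X, y)  # LDA is supervised: it needs the labels

print(X.shape, '->', X_reduced.shape)
print('training accuracy:', lda.score(X, y))
```

Unlike PCA, the projection is fitted with the class labels, which is why `fit_transform` takes both `X` and `y`.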

2.2 Data preprocessing: StandardScaler

from sklearn.preprocessing import StandardScaler

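Purpose: remove the mean and scale to unit variance. This is done for each feature dimension, not for each sample.

Standardization (StandardScaler) gives every feature a mean of 0 and a standard deviation of 1. It shifts and rescales the data but does not change the shape of its distribution. The transformation is:

$$ z = \frac{x - \mu}{\sigma} $$

where $\mu$ and $\sigma$ are the mean and the standard deviation of the feature, estimated on the training data.

Standardization speeds up the convergence of gradient descent;
standardization may also improve accuracy.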
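The behaviour described above can be checked on a tiny example. The numbers below are made up for illustration; the point is that the statistics are computed per column on the training data and then reused for new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 samples, 2 features on very different scales.
X_train = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 500.0]])
X_new = np.array([[2.0, 100.0]])

scaler = StandardScaler().fit(X_train)  # learns the per-feature mean and std
X_train_std = scaler.transform(X_train)

print(scaler.mean_)   # per-feature mean
print(scaler.scale_)  # per-feature standard deviation
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))

# New data is transformed with the statistics learned from the training set.
print(scaler.transform(X_new))
```

Fitting the scaler only on the training set, and applying the same fitted scaler to test or prediction data, keeps information from the test set out of the model.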

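2.3 K-fold cross-validation

K-fold cross-validation first splits the whole dataset into K parts.
1. Use the first K-1 parts for training and the last part for testing, and record the test result.
2. Use the first K-2 parts plus the last part for training and part K-1 for testing, and record the test result.
3. Continue in the same way until every part has been used for testing exactly once.
4. Average all the test results.
The result is a more reliable estimate than that of a single train/test split.
(Figure: schematic of K-fold cross-validation.)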
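The four steps above can be sketched with an explicit loop. The iris dataset, `KFold` with shuffling and `max_iter=1000` are assumptions made for this example, not part of the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Split the data into K = 5 parts; shuffle because iris is sorted by class.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on K-1 parts, test on the remaining part.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(fold_scores)
print('mean accuracy:', np.mean(fold_scores))  # step 4: average the results
```

A fresh model is created in every iteration so that no fold benefits from what was learned on another one.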

Import the module:

from sklearn.model_selection import cross_val_score
# Signature, shown for reference (estimator and X are placeholders).
cross_val_score(
    estimator,
    X, y=None,
    scoring=None,
    cv=None,
    n_jobs=1,
    verbose=0,
    fit_params=None,
    pre_dispatch='2*n_jobs'
)

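Its commonly used parameters are:

estimator: the model to evaluate.
X: the feature data.
y: the target labels.
cv: the number of folds, i.e. k.
scoring: the scoring metric. The default None uses the estimator's own score method (accuracy for classifiers). Options include 'accuracy', 'f1', 'precision', 'recall', 'roc_auc' and 'neg_log_loss'.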
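A short usage sketch of cross_val_score, again assuming the iris dataset as stand-in data. Because iris has three classes, the plain 'f1' scorer (which targets binary problems) is replaced by its macro-averaged variant 'f1_macro'.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One score per fold; the estimator is cloned and refitted for every fold.
acc = cross_val_score(model, X, y, cv=5, scoring='accuracy')
f1 = cross_val_score(model, X, y, cv=5, scoring='f1_macro')

print('accuracy per fold:', acc)
print('mean accuracy    :', acc.mean())
print('mean macro F1    :', f1.mean())
```

The estimator passed in is not fitted by `cross_val_score`; call `fit` separately to obtain a model for later use.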

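3. Model training

3.1 Dataset

The data of each class is stored in its own folder, one folder per class.

3.2 Training code

The model is saved once training finishes.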

from collections import Counter
import os

import cv2
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # only for the optional LDA step
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler


# Load the data: every sub-folder of img_path holds the images of one class.
def load_data(img_path):
    x, y = [], []
    assert os.path.exists(img_path), 'data path does not exist!'
    # Sort the folder names so that the class indices are reproducible.
    for cls_index, cls_name in enumerate(sorted(os.listdir(img_path))):
        cls_path = os.path.join(img_path, cls_name)
        for img_name in sorted(os.listdir(cls_path)):
            image = cv2.imread(os.path.join(cls_path, img_name))
            if image is None:  # skip files that OpenCV cannot read
                continue
            # Flatten the image into one feature vector.
            # All images must have the same size.
            x.append(image.reshape(-1))
            y.append(cls_index)
    return np.array(x, np.float64), np.array(y, np.int64)


# Print the accuracy of every class on the test set.
def evaluate(model, x_test, y_test):
    prediction = model.predict(x_test)
    print("-" * 100)
    print("starting eval!")
    totals = Counter(y_test.tolist())
    errors = Counter(y_test[y_test != prediction].tolist())
    for k in sorted(totals):
        acc = 1 - errors[k] / totals[k]
        print('class', k, 'eval acc: ', acc)


def train(data, save_model_path):
    x, y = data
    os.makedirs(save_model_path, exist_ok=True)

    # Preprocessing: split into a training set and a test set.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

    # Preprocessing: remove the mean and scale to unit variance,
    # per feature dimension rather than per sample.
    S = StandardScaler()
    S.fit(x_train)
    joblib.dump(S, os.path.join(save_model_path, 'S.model'))
    x_train = S.transform(x_train)
    x_test = S.transform(x_test)

    # # Optional: feature extraction / dimensionality reduction.
    # # n_components must not exceed min(n_classes - 1, n_features).
    # Lda = LinearDiscriminantAnalysis(n_components=2600)
    # Lda.fit(x_train, y_train)
    # print("lda_scores: ", Lda.score(x_test, y_test))
    # joblib.dump(Lda, os.path.join(save_model_path, 'Lda.model'))
    # x_train = Lda.transform(x_train)
    # x_test = Lda.transform(x_test)

    print('start to train!')
    # Logistic regression model
    Log = LogisticRegression(penalty='l2', dual=False, warm_start=True)
    Log.fit(x_train, y_train)

    # K-fold cross-validation on the training set. cross_val_score refits a
    # clone of the model on every fold, so the test set stays untouched.
    scores = cross_val_score(Log, x_train, y_train, cv=5, scoring='accuracy')
    print('k_mean_scores: ', scores.mean())
    print('Training complete!')

    # Evaluate on the test set and save the model.
    evaluate(Log, x_test, y_test)
    joblib.dump(Log, os.path.join(save_model_path, 'Log.model'))
    print("The model is saved!")


if __name__ == '__main__':
    img_path = r'D:\PY_scipty\sklearn\data\aw'
    save_model_path = r'./MODEL'
    data = load_data(img_path)
    train(data, save_model_path)
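One possible variation, not taken from the original post, is to bundle the scaler and the classifier into a single scikit-learn Pipeline. Cross-validation then refits the scaler inside every fold, and only one file has to be saved and loaded. The sketch below uses synthetic data in place of the flattened images; the file name `pipeline.model` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flattened image vectors (3 classes).
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scaler and classifier are fitted together as one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
pipe.fit(X_train, y_train)
print('cv accuracy  :', scores.mean())
print('test accuracy:', pipe.score(X_test, y_test))

# Save and reload the whole pipeline as a single file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline.model')  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same = bool((restored.predict(X_test) == pipe.predict(X_test)).all())
print('restored model gives identical predictions:', same)
```

With this layout the prediction side only needs to call `predict` on the restored pipeline, since scaling is applied automatically.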

4. Model prediction

4.1 Prediction code

Predicting a single image

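import os

import cv2
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23


def model_predict(Model_Log, Model_S, Img_path):
    img = cv2.imread(Img_path)
    assert img is not None, 'cannot read image: ' + Img_path
    # Flatten to a single row. The image must have the same size
    # as the images used for training.
    features = img.reshape(1, -1).astype('float64')
    features = Model_S.transform(features)  # scaler fitted during training
    predict = Model_Log.predict(features)
    print(predict)
    return predict

if __name__ == '__main__':
    Imgs_path = r'test1.jpg'
    model_path = './MODEL'
    # Load the classifier and the scaler saved by the training script.
    Model_Log = joblib.load(os.path.join(model_path, 'Log.model'))
    Model_S = joblib.load(os.path.join(model_path, 'S.model'))
    model_predict(Model_Log, Model_S, Imgs_path)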

Reference blogger: https://blog.csdn.net/weixin_44791964