Machine Learning Day 03

世事如棋901

Last modified: 2022-10-06 00:08:02

Tags: machine learning, python, artificial intelligence

First published: 2022-10-05 23:53:35

Article link: https://blog.csdn.net/qq_56045855/article/details/127178615

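5. Regression and Clustering Algorithms

5.1 Linear Regression

Definition:

Linear regression is an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables and a dependent variable (that is, between the feature values and the target value).

The familiar linear model, in which every independent variable appears to the first power:
$H(w)=w_1x_1+w_2x_2+...+w_nx_n+b=W^TX$

$W=\begin{pmatrix}b\\w_1\\\vdots\\w_n\end{pmatrix},X= \begin{pmatrix}1\\x_1\\\vdots\\x_n\end{pmatrix}$

Another kind of linear model, which is nonlinear in the independent variable but still linear in the parameters:
$H(w)=w_1x_1+w_2x_1^2+...+w_nx_1^n+b$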
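A small sketch of why the second model still counts as linear: once the powers of x are treated as separate feature columns, the weights can be found by ordinary least squares. The data and the names `features` and `weights` are made up for illustration and do not come from the original notes.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2      # made-up target, nonlinear in x

# Columns 1, x, x^2: the model is nonlinear in x but linear in the weights.
features = np.column_stack([np.ones_like(x), x, x ** 2])
weights, *_ = np.linalg.lstsq(features, y, rcond=None)
print(weights)                         # order: [b, w_1, w_2]
```

Because the target here is noise-free, the fitted weights should recover the coefficients used to generate it.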

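Loss function (cost):

The gap between the predicted values and the true values, most commonly measured by least squares (the sum of squared errors).

Optimization algorithms:

Normal equation (computationally expensive, suitable when the dataset is small):

$W=(X^TX)^{-1}X^TY$, where each row of the matrix $X$ is one sample (with a leading 1 for the bias) and $Y$ is the vector of target values, so that $Y=XW$.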
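A minimal NumPy sketch of the normal equation, assuming synthetic noise-free data. The names `X_b` and `true_w` are illustrative only; `X_b` holds one sample per row with a leading column of ones for the bias.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 4 + 3*x1 - 2*x2, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_w = np.array([4.0, 3.0, -2.0])   # [b, w1, w2]

# Prepend a column of ones so the bias b is learned as the first weight.
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
y = X_b @ true_w

# Normal equation: W = (X^T X)^{-1} X^T Y.
# Solving the linear system avoids forming the inverse explicitly.
W = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print(W)
```

Solving the system with `np.linalg.solve` is a design choice for numerical stability; it computes the same W as the formula with the explicit inverse.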

Gradient descent (used when the dataset is large):

$W_1:=W_1-\alpha{\partial \space Cost(W_0+W_1x_1)\over \partial W_1}\\W_0:=W_0-\alpha{\partial \space Cost(W_0+W_1x_1)\over \partial W_0}$

$\alpha$ is called the learning rate.
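The update rule can be sketched in NumPy as batch gradient descent on a half mean-squared-error cost. This is an illustration under assumptions: the function name `gradient_descent`, the cost scaling and the synthetic data are not from the original notes.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for y ~ w0 + X @ w with cost = MSE / 2."""
    m, n = X.shape
    w0, w = 0.0, np.zeros(n)
    for _ in range(n_iters):
        error = (w0 + X @ w) - y      # prediction minus truth, shape (m,)
        grad_w0 = error.mean()        # d cost / d w0
        grad_w = X.T @ error / m      # d cost / d w
        w0 -= alpha * grad_w0         # w0 := w0 - alpha * gradient
        w -= alpha * grad_w           # w  := w  - alpha * gradient
    return w0, w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 1.0 + 2.0 * X[:, 0]               # synthetic target: w0 = 1, w1 = 2
w0, w = gradient_descent(X, y)
print(w0, w)
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more iterations; 0.1 works here only because the synthetic feature has roughly unit variance.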

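Linear regression API:

# Optimized with the normal equation
sklearn.linear_model.LinearRegression(fit_intercept=True)
# fit_intercept: whether to fit the bias (intercept)
# LinearRegression.coef_: regression coefficients
# LinearRegression.intercept_: bias

# Optimized with stochastic gradient descent
sklearn.linear_model.SGDRegressor(loss="squared_error", fit_intercept=True, learning_rate="invscaling", eta0=0.01)
# loss: loss type. "squared_error" is ordinary least squares (named "squared_loss" before scikit-learn 1.0)
# fit_intercept: whether to fit the bias (intercept)
# learning_rate: learning rate schedule
#     "constant":   eta = eta0
#     "optimal":    eta = 1.0 / (alpha * (t + t0))
#     "invscaling": eta = eta0 / pow(t, power_t)  [default]
# eta0: initial learning rate
# SGDRegressor.coef_: regression coefficients
# SGDRegressor.intercept_: bias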

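Regression performance evaluation:

Mean squared error (MSE)
$MSE={1\over m}\sum_{i=1}^m (y_i -\hat y_i)^2$, where $y_i$ is the true value, $\hat y_i$ is the predicted value, and $m$ is the number of samples.

sklearn.metrics.mean_squared_error(y_true,y_pred)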
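To make the formula concrete, here is a hand-written version in NumPy. The helper name `mse` and the numbers are made up for illustration; on the same inputs it should agree with `sklearn.metrics.mean_squared_error`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared (true - predicted) differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up numbers for illustration only.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(mse(y_true, y_pred))  # squared errors are 1, 0 and 4, averaged over 3 samples
```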

Case study: Boston house price prediction

# -*- coding: UTF-8 -*-
# Note: load_boston was removed in scikit-learn 1.2, so this script requires an older version.

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error


def boston_1():
    """Linear regression optimized with the normal equation."""
    # Load the data
    boston = load_boston()

    # Split the data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=100)

    # Feature engineering: standardize using statistics from the training set only
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)

    # Model evaluation
    print("Regression coefficients:", estimator.coef_, '\n')
    print("Bias:", estimator.intercept_, '\n')
    y_predict = estimator.predict(x_test)
    mse = mean_squared_error(y_test, y_predict)
    print("Mean squared error:", mse)


def boston_2():
    """Linear regression optimized with stochastic gradient descent."""
    # Load the data
    boston = load_boston()

    # Split the data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=100)

    # Feature engineering: standardize using statistics from the training set only
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # Estimator
    estimator = SGDRegressor(learning_rate="invscaling", eta0=0.01)
    estimator.fit(x_train, y_train)

    # Model evaluation
    print("Regression coefficients:", estimator.coef_, '\n')
    print("Bias:", estimator.intercept_, '\n')
    y_predict = estimator.predict(x_test)
    mse = mean_squared_error(y_test, y_predict)
    print("Mean squared error:", mse)

if __name__ == '__main__':
    boston_1()
    boston_2()

Output:

Regression coefficients: [-0.60292601  1.04914911 -0.13037299  0.63411901 -1.57254519  2.73708926
 -0.37092604 -2.99837179  2.54865538 -2.20887515 -1.94391032  0.95278425
 -3.28572799] 

Bias: 22.69973614775727 

Mean squared error: 27.173144173043656
Regression coefficients: [-0.50148422  0.91289931 -0.3896733   0.7036103  -1.39714531  2.82462609
 -0.40509405 -2.92375102  1.82430464 -1.42591033 -1.8905937   0.95238311
 -3.22595256] 

Bias: [22.70096789] 

Mean squared error: 27.611320173209425

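Extension: gradient descent variants

GD (gradient descent): computes the gradient over all training samples at every step.

SGD (stochastic gradient descent): updates the weights using one randomly chosen sample at a time.

SAG (stochastic average gradient): keeps an average of previously computed per-sample gradients and uses it for each update.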

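5.2 Underfitting and Overfitting

Underfitting: the model learns too few characteristics of the data and performs poorly even on the training set. Remedy: add more features.

Overfitting: the model fits the training set too closely and generalizes poorly to new data. Remedy: regularization.

L2 regularization (Ridge regression):

$J(W)={1\over 2m}\sum_{i=1}^m(h_W(x_i)-y_i)^2+\lambda \sum_{j=1}^nW_j^2$, where $m$ is the number of samples and $n$ is the number of features.
Loss function = original loss function + penalty term

L1 regularization (LASSO regression): drives some of the weights $W_j$ exactly to 0, which removes the influence of the corresponding features.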
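The difference between the two penalties can be seen on synthetic data where only two of five features matter. This is a sketch under assumptions: the data, the value `alpha=0.5` and the variable names are chosen for illustration and are not from the original notes.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise features.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Exact zeros (Lasso):", int(np.sum(lasso.coef_ == 0)))
print("Exact zeros (Ridge):", int(np.sum(ridge.coef_ == 0)))
```

With the L1 penalty the weights of the noise features are set exactly to zero, while the L2 penalty only shrinks them towards zero.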

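5.3 Ridge Regression

Ridge regression is linear regression with L2 regularization.

API:

sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, solver="auto", normalize=False)
# alpha: regularization strength, i.e. λ
# solver: optimization method; "auto" chooses one based on the data
# normalize: whether to normalize the features before fitting (deprecated in scikit-learn 1.0 and removed in 1.2; use StandardScaler instead)
# Ridge.coef_: regression coefficients
# Ridge.intercept_: bias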
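As a cross-check of what the estimator computes, the sketch below compares a closed-form ridge solution with `Ridge` on made-up data, with no intercept to keep the algebra short. Note that scikit-learn minimizes `||y - Xw||^2 + alpha * ||w||^2` without a `1/(2m)` factor, so its `alpha` matches the λ of the formula in section 5.2 only up to that scaling.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 0.5
# Closed-form ridge solution without an intercept: (X^T X + alpha*I) w = X^T y.
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# scikit-learn minimizes ||y - Xw||^2 + alpha * ||w||^2, so the two should agree.
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
print(w_closed)
print(w_sklearn)
```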

Case study: Boston house price prediction (ridge regression)

# Note: load_boston was removed in scikit-learn 1.2, so this script requires an older version.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge

def boston_3():
    """Ridge regression (linear regression with an L2 penalty)."""
    # Load the data
    boston = load_boston()
    # Split the data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=100)

    # Feature engineering: standardize using statistics from the training set only
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # Estimator
    estimator = Ridge(alpha=0.5)
    estimator.fit(x_train, y_train)

    # Model evaluation
    print("Regression coefficients:", estimator.coef_, '\n')
    print("Bias:", estimator.intercept_, '\n')
    y_predict = estimator.predict(x_test)
    mse = mean_squared_error(y_test, y_predict)
    print("Mean squared error:", mse, '\n')


if __name__ == '__main__':
    boston_3()

Output:

Regression coefficients: [-0.59823625  1.04053133 -0.14493391  0.63710781 -1.55860538  2.74130502
 -0.37204575 -2.98251376  2.50332384 -2.16480946 -1.93972975  0.9522699
 -3.27862245] 

Bias: 22.69973614775727 

Mean squared error: 27.197433238049836 

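5.4 Classification Algorithm: Logistic Regression (Binary Classification)

The input to logistic regression is the output of a linear regression, that is:
$\theta ^Tx=H(w)=w_1x_1+w_2x_2+...+w_nx_n+b=W^TX$
Activation function: the sigmoid function
$g(\theta ^Tx)={1\over 1+e^{-\theta ^Tx}}={1\over 1+e^{-H(w)}}=g(H(w))$
The regression result is fed into the sigmoid function, whose output is a probability value in the interval (0, 1). The default threshold is 0.5: outputs above the threshold are assigned to the positive class.

Log-likelihood loss for a single sample:
${Cost}(h,y)=\begin{cases}-\log(h), & \text{when } y=1\\ -\log(1-h), & \text{when } y=0\end{cases}$
where $h$ is the predicted value and $y$ is the true value.
Log-likelihood loss function over all $m$ samples:
${CostFunction}=\sum_{i=1}^m -y_i\log(h_i)-(1-y_i)\log(1-h_i)$, where $h_i$ is the predicted value for sample $i$.
Optimization method: gradient descent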
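The sigmoid and the log-likelihood loss can be written directly in NumPy. This is a sketch with made-up scores and labels; the helper names `sigmoid` and `log_loss_sum` are illustrative and not part of scikit-learn.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_sum(y, h):
    """Sum over samples of -y*log(h) - (1-y)*log(1-h)."""
    y = np.asarray(y, dtype=float)
    h = np.asarray(h, dtype=float)
    return float(np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)))

z = np.array([-2.0, 0.0, 3.0])     # hypothetical linear-regression outputs
h = sigmoid(z)                     # predicted probabilities of class 1
y = np.array([0, 1, 1])            # hypothetical true labels
print(h)
print((h >= 0.5).astype(int))      # apply a 0.5 threshold
print(log_loss_sum(y, h))
```

The loss grows as the predicted probability moves away from the true label, which is what gradient descent then reduces.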

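API:

sklearn.linear_model.LogisticRegression(solver='liblinear',penalty='l2',C=1.0)
# solver: optimization algorithm
# penalty: type of regularization
# C: inverse of the regularization strength (smaller values mean stronger regularization)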

Case study: classifying tumors as benign or malignant

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load the data (the file has no header row, so column names are supplied)
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
column_name = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape','Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
data = pd.read_csv(path, names=column_name)

# Data cleaning: missing values are marked with "?"; replace them with NaN and drop those rows
data = data.replace(to_replace="?", value=np.nan)
data.dropna(inplace=True)

# Select feature values and target values (Class: 2 = benign, 4 = malignant)
x = data.iloc[:, 1:-1]
y = data["Class"]

# Split the dataset (no random_state, so the split and the results vary between runs)
x_train, x_test, y_train, y_test = train_test_split(x, y)

# Feature engineering: standardize using statistics from the training set only
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# Logistic regression
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

# Model parameters
print("Regression coefficients:", estimator.coef_, '\n')
print("Bias:", estimator.intercept_, '\n')

# Prediction
y_predict = estimator.predict(x_test)
print(y_predict, '\n')
print("Prediction results:\n", y_predict == y_test)

# Model accuracy
score = estimator.score(x_test, y_test)
print("Model score:", score, '\n')

Output:

Regression coefficients: [[1.34557955 0.12081454 0.3220905  1.65048536 0.31212786 1.65357286
  1.05789333 1.10955067 1.0743157 ]] 

Bias: [-0.70803147] 

[2 2 2 2 4 2 2 2 4 2 2 4 4 2 4 4 2 4 2 2 4 2 2 4 4 2 4 4 2 2 2 2 2 2 4 2 2
 4 2 4 4 2 2 2 2 2 4 4 4 2 4 4 4 4 4 4 2 4 2 2 4 4 4 4 2 2 4 2 2 2 2 4 2 2
 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 2 2 2 2 2 4 2 2 4 2 4 2 4 4 2 2 4 4 2
 2 2 2 4 4 2 4 2 4 2 4 4 2 4 2 4 2 2 2 2 2 2 4 4 2 2 2 4 4 4 2 4 2 4 2 2 4
 4 4 4 4 4 2 2 2 4 2 2 2 2 2 4 2 4 2 2 2 2 4 2] 

Prediction results:
 615    True
144    True
127    True
547    True
270    True
       ... 
524    True
150    True
444    True
440    True
83     True
Name: Class, Length: 171, dtype: bool
Model score: 0.935672514619883

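5.5 Evaluation Methods for Binary Classification

1. Precision, recall and F1-score

P, precision: among the samples predicted as positive, the proportion that are truly positive. Higher is better.

R, recall: among the samples that are truly positive, the proportion predicted as positive. Higher is better.

F1-score: 2×P×R/(P+R)

API:

sklearn.metrics.classification_report(y_true, y_pred, labels=[], target_names=None)
# y_true: true target values
# y_pred: predicted target values
# labels: the class labels (numbers) to include in the report
# target_names: display names for those classes
# Returns a text report with the precision, recall and F1-score of each class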
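The three metrics can be computed by hand from the counts of true positives, false positives and false negatives. The sketch below uses made-up labels in the style of the cancer case (2 = benign, 4 = malignant); the helper `precision_recall_f1` is illustrative and not part of scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 2 = benign, 4 = malignant (treated as the positive class).
y_true = [4, 4, 4, 2, 2, 2, 2, 4]
y_pred = [4, 4, 2, 2, 2, 4, 2, 4]
p, r, f1 = precision_recall_f1(y_true, y_pred, positive=4)
print(p, r, f1)
```

Calling `classification_report(y_true, y_pred, labels=[2, 4], target_names=["benign", "malignant"])` on the same lists should show matching values in the malignant row. In a medical setting recall for the malignant class usually matters most, since a missed case is costlier than a false alarm.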

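2. ROC curve and AUC

TPR: the recall, TP/(TP+FN). Among all samples whose true class is 1, the proportion predicted as 1.
FPR: FP/(FP+TN). Among all samples whose true class is 0, the proportion predicted as 1.

ROC curve: FPR on the horizontal axis, TPR on the vertical axis

AUC: the area under the ROC curve

API:

sklearn.metrics.roc_auc_score(y_true,y_score)
# Computes the area under the ROC curve, i.e. the AUC
# y_true: true class of each sample, marked 0 for negative and 1 for positive
# y_score: prediction scores; can be the classifier's predicted labels, probability estimates of the positive class, or confidence values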
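Since the cancer labels are 2 and 4 rather than 0 and 1, they can be relabeled before calling `roc_auc_score`. The sketch below uses made-up labels and hypothetical probabilities for the malignant class; in practice the scores would come from something like `estimator.predict_proba(x_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels in the style of the cancer case: 2 = benign, 4 = malignant.
y_test = np.array([2, 4, 2, 4, 2, 4])
# Hypothetical predicted probabilities of the malignant class.
y_score = np.array([0.1, 0.9, 0.4, 0.35, 0.2, 0.8])

# Relabel so that malignant (4) is the positive class 1 and benign (2) is 0.
y_true = np.where(y_test > 3, 1, 0)
auc = roc_auc_score(y_true, y_score)
print(auc)
```

The AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, so 0.5 corresponds to random guessing and 1.0 to a perfect ranking.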

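Saving and Loading Models

API:

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
Save: joblib.dump(rf,'test.pkl'), where rf is a fitted estimator
Load: estimator=joblib.load('test.pkl')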
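A round trip through `joblib.dump` and `joblib.load` might look like the sketch below. The tiny dataset is made up, and a temporary directory is used so that no file is left behind; both are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a tiny model on made-up data: y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

print(restored.predict([[4.0]]))
```

A saved model should be loaded with the same scikit-learn version that produced it, because the pickle format is not guaranteed to be compatible across versions.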

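Unsupervised Learning

Unsupervised learning: the data has no target values.

k-means clustering algorithm

1. Randomly choose K points in the feature space as the initial cluster centers.
2. For every other point, compute the distance to each of the K centers and label the point with the nearest center.
3. Once all points are labeled, recompute the center of each cluster as the mean of its points.
4. If the new centers are the same as the old ones, stop; otherwise go back to step 2.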
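The four steps above can be sketched in plain NumPy. This is a teaching sketch, not how scikit-learn implements it: the function `kmeans` is hypothetical, the initial centers are drawn from the data points themselves (one common way of picking random points in the feature space), and the data is synthetic.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means following the four steps (random initial centers)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```

Because the result depends on the random initial centers, practical implementations usually run several initializations or use a smarter scheme such as k-means++.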

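API:

sklearn.cluster.KMeans(n_clusters=8,init='k-means++')
# n_clusters: number of initial centers, i.e. the number of clusters
# init: initialization method, default 'k-means++'
# labels_: cluster label assigned to each sample, accessed as an attribute of the fitted estimator

Case study: k-means clustering of Instacart Market users (using the data from the dimensionality reduction case)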

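Evaluating a clustering model:

Informally, a good clustering maximizes the distance between clusters and minimizes the distance within each cluster: "high cohesion, low coupling".

Silhouette coefficient:
$sc_i={b_i-a_i\over \max(b_i,a_i)},\quad -1\le sc_i\le 1$
where $a_i$ is the mean distance from sample $i$ to the other samples in its own cluster and $b_i$ is the smallest mean distance from sample $i$ to the samples of any other cluster. Values close to 1 indicate a good clustering.
API:

sklearn.metrics.silhouette_score(X,labels)
# X: feature values
# labels: the cluster labels assigned to the samples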
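Putting `KMeans` and `silhouette_score` together might look like the following sketch. The two synthetic blobs stand in for real data such as the Instacart features, and `n_init=10` and `random_state=0` are choices made here for reproducibility, not values from the original notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

estimator = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = estimator.fit_predict(X)    # same values as estimator.labels_

score = silhouette_score(X, labels)  # mean silhouette coefficient over all samples
print(score)
```

Comparing the score across several values of `n_clusters` is one simple way to choose the number of clusters.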
