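5. Regression and Clustering Algorithms
5.1 Linear Regression
Definition:
An analysis method that uses a regression equation (function) to model the relationship between one or more independent variables and a dependent variable (that is, between the features and the target).
The familiar linear model, in which every independent variable appears to the first power: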
$$H(w)=w_1x_1+w_2x_2+\dots+w_nx_n+b=W^TX$$
where
$$W=\begin{pmatrix}b\\w_1\\\vdots\\w_n\end{pmatrix},\quad X=\begin{pmatrix}1\\x_1\\\vdots\\x_n\end{pmatrix}$$
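To make the vector form concrete, here is a minimal sketch showing that folding the bias b into W and a constant 1 into X gives the same value as the explicit sum. The weights and the sample are made-up values chosen only for illustration.

```python
# Hypothetical weights and one sample; the numbers are illustrative only.
w = [0.5, -1.0, 2.0]   # w_1 .. w_n
b = 3.0                # bias
x = [2.0, 1.0, 0.5]    # x_1 .. x_n

# Explicit form: w_1*x_1 + ... + w_n*x_n + b
h_explicit = sum(wi * xi for wi, xi in zip(w, x)) + b

# Vector form: prepend b to W and a constant 1 to X, then take W^T X
W = [b] + w
X = [1.0] + x
h_vector = sum(Wi * Xi for Wi, Xi in zip(W, X))

print(h_explicit, h_vector)
```

Both expressions evaluate the same model; the vector form is simply more convenient once many samples are stacked into a matrix.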
Another kind of linear model, in which every parameter appears to the first power (the model is linear in the parameters even though it is not linear in the feature):
$$H(w)=w_1x_1+w_2x_1^2+\dots+w_nx_1^n+b$$
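One way to read the second model is as an ordinary linear model applied to the expanded features x_1, x_1^2, ..., x_1^n. The sketch below assumes a hypothetical helper `polynomial_features` and made-up weights; it is an illustration, not part of scikit-learn.

```python
def polynomial_features(x1, degree):
    """Return [x1, x1^2, ..., x1^degree] for a single scalar feature."""
    return [x1 ** k for k in range(1, degree + 1)]

# Hypothetical parameters w_1..w_3 and bias b
w = [1.0, -2.0, 0.5]
b = 4.0

x1 = 2.0
features = polynomial_features(x1, degree=3)

# The model is still a weighted sum, so it stays linear in w
h = sum(wk * fk for wk, fk in zip(w, features)) + b
print(features, h)
```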
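Loss function (cost):
The gap between the predicted values and the true values, most commonly measured by least squares (the sum of squared errors).
Optimization algorithms:
- Normal equation (a direct solve that becomes expensive as the number of features grows, so it suits small datasets):
$$W=(X^TX)^{-1}X^TY$$
Here each row of X is one sample with a leading 1 and Y is the vector of target values, so the model for all samples reads Y = XW.
- Gradient descent (preferred for large datasets):
$$W_1:=W_1-\alpha\frac{\partial\,Cost(W_0+W_1x_1)}{\partial W_1},\qquad W_0:=W_0-\alpha\frac{\partial\,Cost(W_0+W_1x_1)}{\partial W_0}$$
α is called the learning rate.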
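The two optimizers can be compared on a toy problem. The sketch below assumes noise-free data generated from y = 2x + 1, a cost of (1/2m) times the sum of squared errors, and a hand-picked learning rate of 0.05; all of these are illustrative choices rather than values from the text.

```python
# Toy data from y = 2x + 1 (hypothetical, no noise)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
m = len(xs)

# Normal equation for one feature plus bias: solve (X^T X) W = X^T Y,
# where each row of X is (1, x). The 2x2 system is solved by hand.
s_x = sum(xs)
s_xx = sum(x * x for x in xs)
s_y = sum(ys)
s_xy = sum(x * y for x, y in zip(xs, ys))
det = m * s_xx - s_x * s_x
w0_ne = (s_xx * s_y - s_x * s_xy) / det   # bias
w1_ne = (m * s_xy - s_x * s_y) / det      # slope

# Batch gradient descent on Cost = (1/2m) * sum((w0 + w1*x - y)^2)
w0, w1 = 0.0, 0.0
alpha = 0.05   # learning rate
for _ in range(5000):
    errors = [w0 + w1 * x - y for x, y in zip(xs, ys)]
    grad_w0 = sum(errors) / m
    grad_w1 = sum(e * x for e, x in zip(errors, xs)) / m
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print("normal equation :", w0_ne, w1_ne)
print("gradient descent:", round(w0, 4), round(w1, 4))
```

The normal equation reaches the answer in one step, while gradient descent approaches it iteratively; with a learning rate that is too large the iteration would diverge instead.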
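Linear regression API:
# Optimized with the normal equation
sklearn.linear_model.LinearRegression(fit_intercept=True)
# fit_intercept: whether to fit the bias (intercept)
# LinearRegression.coef_: regression coefficients
# LinearRegression.intercept_: bias
# Optimized with gradient descent
sklearn.linear_model.SGDRegressor(loss="squared_error", fit_intercept=True, learning_rate="invscaling", eta0=0.01)
# loss: loss type. "squared_error" is ordinary least squares (it was named "squared_loss" before scikit-learn 1.0)
# fit_intercept: whether to fit the bias (intercept)
# learning_rate: learning-rate schedule
#   "constant": eta = eta0
#   "optimal": eta = 1.0 / (alpha * (t + t0))
#   "invscaling": eta = eta0 / pow(t, power_t) [default]
# SGDRegressor.coef_: regression coefficients
# SGDRegressor.intercept_: bias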
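The learning-rate schedules can be written out as plain functions of the step counter t. This is a sketch of the formulas only: eta0=0.01, power_t=0.25 and alpha=0.0001 mirror the SGDRegressor defaults, while t0=1.0 is a placeholder assumption (scikit-learn derives t0 from an internal heuristic).

```python
def eta_constant(t, eta0=0.01):
    """'constant': the step size never changes."""
    return eta0

def eta_invscaling(t, eta0=0.01, power_t=0.25):
    """'invscaling': the step size shrinks as eta0 / t^power_t."""
    return eta0 / (t ** power_t)

def eta_optimal(t, alpha=0.0001, t0=1.0):
    """'optimal': 1 / (alpha * (t + t0)); t0 here is a placeholder value."""
    return 1.0 / (alpha * (t + t0))

for t in (1, 16, 256):
    print(t, eta_constant(t), eta_invscaling(t))
```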
Evaluating regression performance:
Mean squared error (MSE):
$$MSE={1\over m}\sum_{i=1}^m (y_i-\hat y_i)^2$$
where y_i is the true value and \hat y_i is the predicted value for sample i.
sklearn.metrics.mean_squared_error(y_true,y_pred)
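The formula is short enough to compute by hand. The sketch below uses a hypothetical helper name and made-up values to show what the metric measures; it is not the scikit-learn implementation.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    m = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / m

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Squared errors are 1, 0, 4 and 1, so the mean is 6 / 4
mse = mean_squared_error_manual(y_true, y_pred)
print(mse)
```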
Case study: Boston house price prediction
# -*- coding: UTF-8 -*-
# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error


def boston_1():
    """Linear regression solved with the normal equation."""
    # Load the data
    boston = load_boston()
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=100)
    # Feature engineering: standardize with statistics from the training set only
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # Model evaluation
    print("Regression coefficients:", estimator.coef_, '\n')
    print("Bias:", estimator.intercept_, '\n')
    y_predict = estimator.predict(x_test)
    mse = mean_squared_error(y_test, y_predict)
    print("Mean squared error:", mse)


def boston_2():
    """Linear regression solved with stochastic gradient descent."""
    # Load the data
    boston = load_boston()
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=100)
    # Feature engineering: standardize with statistics from the training set only
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # Estimator
    estimator = SGDRegressor(learning_rate="invscaling", eta0=0.01)
    estimator.fit(x_train, y_train)
    # Model evaluation
    print("Regression coefficients:", estimator.coef_, '\n')
    print("Bias:", estimator.intercept_, '\n')
    y_predict = estimator.predict(x_test)
    mse = mean_squared_error(y_test, y_predict)
    print("Mean squared error:", mse)


if __name__ == '__main__':
    boston_1()
    boston_2()
Output:
Regression coefficients: [-0.60292601 1.04914911 -0.13037299 0.63411901 -1.57254519 2.73708926
-0.37092604 -2.99837179 2.54865538 -2.20887515 -1.94391032 0.95278425
-3.28572799]
Bias: 22.69973614775727
Mean squared error: 27.173144173043656
Regression coefficients: [-0.50148422 0.91289931 -0.3896733 0.7036103 -1.39714531 2.82462609
-0.40509405 -2.92375102 1.82430464 -1.42591033 -1.8905937 0.95238311
-3.22595256]
Bias: [22.70096789]
Mean squared error: 27.611320173209425
Process finished with exit code 0
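Extension: gradient descent variants
GD: (batch) gradient descent; every update uses the gradient computed over the whole training set.
SGD: stochastic gradient descent; every update uses a single randomly chosen sample.
SAG: stochastic average gradient; every update refreshes the gradient of one sample and steps along the average of the stored per-sample gradients.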
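5.2 Underfitting and Overfitting
Underfitting: the model is too simple for the data; the remedy is to add features.
Overfitting: the model fits the training data too closely; the remedy is regularization.
L2 regularization (Ridge regression):
$$J(W)={1\over 2m}\sum_{i=1}^m(h_W(x_i)-y_i)^2+\lambda\sum_{j=1}^nW_j^2$$
Loss function = original loss function + penalty term, where m is the number of samples, n is the number of features and λ sets the strength of the penalty.
L1 regularization (LASSO regression): drives some of the W values to exactly 0, which removes the influence of the corresponding features.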
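To see how the penalty term changes the loss, the L2-regularized cost J(W) can be evaluated directly. This is a minimal sketch with a hypothetical function name and made-up data; the bias b is left out of the penalty, which is a common convention and an assumption here.

```python
def ridge_cost(w, b, X, y, lam):
    """J(W) = (1/2m) * sum((h(x_i) - y_i)^2) + lam * sum(w_j^2)."""
    m = len(X)
    squared_error = 0.0
    for xi, yi in zip(X, y):
        h = sum(wj * xj for wj, xj in zip(w, xi)) + b
        squared_error += (h - yi) ** 2
    penalty = lam * sum(wj ** 2 for wj in w)   # bias is not penalized (assumption)
    return squared_error / (2 * m) + penalty

# Hypothetical data: three samples with two features each
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 7.0]
w = [2.0, 1.0]
b = 0.0

print(ridge_cost(w, b, X, y, lam=0.0))   # original loss only
print(ridge_cost(w, b, X, y, lam=0.1))   # original loss + penalty
```

With the same weights, a larger λ gives a larger cost, so the optimizer is pushed toward smaller weights.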
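5.3 Ridge Regression
Linear regression with L2 regularization.
API:
sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, solver="auto", normalize=False)
# alpha: regularization strength, i.e. λ
# solver: optimizer; "auto" chooses one based on the data
# normalize: whether to normalize the features before fitting (deprecated in scikit-learn 1.0 and removed in 1.2; use StandardScaler instead)
# Ridge.coef_: regression coefficients
# Ridge.intercept_: bias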
Case study: Boston house price prediction (ridge regression)
# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge


def boston_3():
    """Ridge regression on the Boston housing data."""
    # Load the data
    boston = load_boston()
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=100)
    # Feature engineering: standardize with statistics from the training set only
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # Estimator
    estimator = Ridge(alpha=0.5)
    estimator.fit(x_train, y_train)
    # Model evaluation
    print("Regression coefficients:", estimator.coef_, '\n')
    print("Bias:", estimator.intercept_, '\n')
    y_predict = estimator.predict(x_test)
    mse = mean_squared_error(y_test, y_predict)
    print("Mean squared error:", mse, '\n')


if __name__ == '__main__':
    boston_3()
Output:
Regression coefficients: [-0.59823625 1.04053133 -0.14493391 0.63710781 -1.55860538 2.74130502
-0.37204575 -2.98251376 2.50332384 -2.16480946 -1.93972975 0.9522699
-3.27862245]
Bias: 22.69973614775727
Mean squared error: 27.197433238049836
Process finished with exit code 0
5.4 Classification Algorithm: Logistic Regression (Binary Classification)
The input to logistic regression is the output of a linear regression, that is:
$$\theta^Tx=H(w)=w_1x_1+w_2x_2+\dots+w_nx_n+b=W^TX$$
Activation function: the sigmoid function
$$g(\theta^Tx)={1\over 1+e^{-\theta^Tx}}={1\over 1+e^{-H(w)}}=g(H(w))$$
The regression result is fed into the sigmoid function, which outputs a probability between 0 and 1; the default decision threshold is 0.5.
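The mapping from a linear score to a class label can be sketched in a few lines. The function names and input values are illustrative; treating a probability of exactly 0.5 as the positive class is an assumption of this sketch.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Return 1 when the probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 4), predict_label(z))
```

A score of 0 sits exactly on the decision boundary, because sigmoid(0) = 0.5.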
Log-likelihood loss (for a single sample):
$$Cost(h,y)=\begin{cases}-\log(h), & \text{when } y=1\\ -\log(1-h), & \text{when } y=0\end{cases}$$
where h is the predicted value and y is the true value.
Log-likelihood loss function (over all m samples):
$$CostFunction=\sum_{i=1}^m Cost(h_i,y_i)=\sum_{i=1}^m\left[-y_i\log(h_i)-(1-y_i)\log(1-h_i)\right]$$
where h_i = g(\theta^Tx_i) is the predicted probability for sample i.
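A small numeric check shows why this loss suits classification: confident correct predictions cost little, and confident wrong ones cost a lot. The helper name and the probabilities below are made up for illustration.

```python
import math

def log_loss_sum(y_true, h):
    """Sum of -y*log(h) - (1-y)*log(1-h) over all samples."""
    total = 0.0
    for yi, hi in zip(y_true, h):
        total += -yi * math.log(hi) - (1 - yi) * math.log(1 - hi)
    return total

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the labels
confident_wrong = [0.1, 0.9, 0.2, 0.8]   # probabilities far from the labels

loss_right = log_loss_sum(y_true, confident_right)
loss_wrong = log_loss_sum(y_true, confident_wrong)
print(round(loss_right, 4), round(loss_wrong, 4))
```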
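Optimization method: gradient descent
API:
sklearn.linear_model.LogisticRegression(solver='liblinear',penalty='l2',C=1.0)
# solver: optimization algorithm
# penalty: type of regularization
# C: inverse of the regularization strength; smaller values mean stronger regularization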
Case study: cancer classification
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load the data
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
column_name = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
data = pd.read_csv(path, names=column_name)
# Clean the data: missing values are marked with "?"
data = data.replace(to_replace="?", value=np.nan)
data.dropna(inplace=True)
# Select the features and the target (Class: 2 = benign, 4 = malignant)
x = data.iloc[:, 1:-1]
y = data["Class"]
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y)
# Feature engineering: standardize with statistics from the training set only
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
# Logistic regression
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
# Inspect the fitted model
print("Regression coefficients:", estimator.coef_, '\n')
print("Bias:", estimator.intercept_, '\n')
# Predict
y_predict = estimator.predict(x_test)
print(y_predict, '\n')
print("Prediction results:\n", y_predict == y_test)
# Accuracy on the test set
score = estimator.score(x_test, y_test)
print("Model score:", score, '\n')
Output:
Regression coefficients: [[1.34557955 0.12081454 0.3220905 1.65048536 0.31212786 1.65357286
1.05789333 1.10955067 1.0743157 ]]
Bias: [-0.70803147]
[2 2 2 2 4 2 2 2 4 2 2 4 4 2 4 4 2 4 2 2 4 2 2 4 4 2 4 4 2 2 2 2 2 2 4 2 2
4 2 4 4 2 2 2 2 2 4 4 4 2 4 4 4 4 4 4 2 4 2 2 4 4 4 4 2 2 4 2 2 2 2 4 2 2
4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 2 2 2 2 2 4 2 2 4 2 4 2 4 4 2 2 4 4 2
2 2 2 4 4 2 4 2 4 2 4 4 2 4 2 4 2 2 2 2 2 2 4 4 2 2 2 4 4 4 2 4 2 4 2 2 4
4 4 4 4 4 2 2 2 4 2 2 2 2 2 4 2 4 2 2 2 2 4 2]
Prediction results:
615 True
144 True
127 True
547 True
270 True
...
524 True
150 True
444 True
440 True
83 True
Name: Class, Length: 171, dtype: bool
Model score: 0.935672514619883
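5.5 Evaluation Methods for Binary Classification
1. Precision, recall and F1-score
P (precision): of the samples predicted positive, the fraction that are truly positive. Higher is better.
R (recall): of the samples that are truly positive, the fraction predicted positive. Higher is better.
F1-score: 2×P×R/(P+R)
API:
sklearn.metrics.classification_report(y_true,y_pred,labels=[],target_names=None)
# y_true: true target values
# y_pred: predicted target values
# labels: the class labels (numbers) to include in the report
# target_names: display names for those classes
# Returns a text report with the precision, recall and F1-score of each class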
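The three metrics follow directly from counting true positives, false positives and false negatives. The sketch below uses a hypothetical helper and made-up predictions; the labels follow the cancer case study, where 2 is benign and 4 is malignant, and malignant is treated as the positive class.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up labels: 2 = benign, 4 = malignant (positive class)
y_true = [4, 4, 4, 2, 2, 2, 2, 4]
y_pred = [4, 4, 2, 2, 2, 4, 2, 4]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=4)
print(precision, recall, f1)
```

In a cancer screening setting recall usually matters most, since a missed malignant case (a false negative) is the costlier error.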
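2. ROC curve and AUC
- TPR: the recall, TP/(TP+FN); of all samples whose true label is 1, the fraction predicted as 1
- FPR: FP/(FP+TN); of all samples whose true label is 0, the fraction predicted as 1
ROC curve: FPR on the horizontal axis, TPR on the vertical axis
AUC: the area under the ROC curve
API:
sklearn.metrics.roc_auc_score(y_true,y_score)
# Computes the area under the ROC curve, i.e. the AUC
# y_true: true class of each sample, coded as 0 (negative) and 1 (positive)
# y_score: prediction scores; these can be the classifier's predicted labels, estimated probabilities of the positive class, or confidence values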
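The AUC also has a pairwise reading: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way instead of integrating the ROC curve; the function name and the scores are illustrative.

```python
def roc_auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Three of the four (positive, negative) pairs are ordered correctly
auc = roc_auc_pairwise(y_true, y_score)
print(auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which makes the metric useful when the classes are imbalanced.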
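Saving and loading models
API:
import joblib  # older code used `from sklearn.externals import joblib`, which was removed in scikit-learn 0.23
Save: joblib.dump(estimator, 'test.pkl')
Load: estimator = joblib.load('test.pkl')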
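Unsupervised Learning
Unsupervised learning: the data has no target values.
k-means clustering algorithm
- Randomly choose K points in the feature space as the initial cluster centers
- For every other point, compute its distance to each of the K centers and label it with the nearest center
- For each labeled cluster, recompute the center as the mean of its points
- If the new centers equal the previous ones, stop; otherwise go back to the second step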
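The four steps above can be sketched in plain Python. This is an illustrative implementation, not scikit-learn's: it draws the initial centers from the data points themselves (an assumption; the steps only say "points in the feature space"), uses squared Euclidean distance, and keeps the old center if a cluster ends up empty.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 1: random initial centers
    labels = []
    for _ in range(max_iter):
        # step 2: label each point with the index of its nearest center
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # step 3: recompute each center as the mean of its points
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:                    # empty cluster: keep old center
                new_centers.append(centers[c])
                continue
            dim = len(members[0])
            new_centers.append(tuple(sum(p[d] for p in members) / len(members)
                                     for d in range(dim)))
        # step 4: stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Two well-separated groups of made-up 2-D points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

On data this cleanly separated any initialization converges to the same grouping; on real data the result depends on the starting centers, which is the motivation for the k-means++ initialization.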
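API:
sklearn.cluster.KMeans(n_clusters=8,init='k-means++')
# n_clusters: number of clusters, i.e. the number of initial centers
# init: initialization method, 'k-means++' by default
# labels_: the cluster label assigned to each sample; an attribute read from the fitted estimator
Case study: k-means clustering of Instacart Market users (using the data from the dimensionality reduction case study)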
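Evaluating a clustering model:
Put simply, a good model maximizes the distance between clusters and minimizes the distance within each cluster: "high cohesion, low coupling".
Silhouette coefficient:
$$sc_i=\frac{b_i-a_i}{\max(b_i,a_i)},\qquad -1\le sc_i\le 1$$
where a_i is the mean distance from sample i to the other samples in its own cluster, and b_i is the smallest mean distance from sample i to the samples of any other cluster. Values close to 1 indicate good clustering; values close to -1 indicate poor clustering.
API:
sklearn.metrics.silhouette_score(X,labels)
# X: feature values
# labels: the cluster labels assigned to the samples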
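The silhouette coefficient can be computed by hand for a small dataset. The sketch below assumes Euclidean distance, at least two clusters, and a score of 0 for a cluster with a single sample; the function names and the points are made up for illustration.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_samples_manual(points, labels):
    """Return sc_i = (b_i - a_i) / max(a_i, b_i) for every sample."""
    scores = []
    clusters = set(labels)
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a_i: mean distance to the other samples in the same cluster
        same = [euclidean(p, q)
                for j, (q, lab_q) in enumerate(zip(points, labels))
                if lab_q == lab and j != i]
        if not same:                 # single-sample cluster: score 0
            scores.append(0.0)
            continue
        a_i = sum(same) / len(same)
        # b_i: smallest mean distance to the samples of another cluster
        b_i = min(
            sum(euclidean(p, q) for q, lab_q in zip(points, labels)
                if lab_q == other) / labels.count(other)
            for other in clusters if other != lab
        )
        scores.append((b_i - a_i) / max(a_i, b_i))
    return scores

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_samples_manual(points, [0, 0, 1, 1])   # natural grouping
bad = silhouette_samples_manual(points, [0, 1, 0, 1])    # mixed-up grouping

mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(round(mean_good, 4), round(mean_bad, 4))
```

The natural grouping scores close to 1 and the mixed-up grouping scores below 0, which matches the "high cohesion, low coupling" intuition.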