[Machine Learning][Theory][Practice] Regression Algorithms

1 Regression Algorithm Concepts

  • Regression is a supervised learning algorithm.
  • Regression is one of the most commonly used machine learning algorithms. It models the relationship between the explanatory variables (features X) and the observed values (target Y). From a machine learning point of view, we build a model (a function) that maps attributes (X) to labels (Y); during training, the algorithm searches for a function $h: \mathbb{R}^d \to \mathbb{R}$ that best fits this relationship.
  • The final output of a regression model is a continuous value; the input (the attribute values) is a d-dimensional attribute/numeric vector.

2 Linear Regression

$$h_\theta(x)=\theta_0+\theta_1x_1+\dots+\theta_nx_n=\theta_0x_0+\theta_1x_1+\dots+\theta_nx_n=\sum_{i=0}^n\theta_ix_i=\theta^Tx \qquad (x_0=1)$$
The ultimate goal is to compute the values of θ and to select the optimal θ that defines the final model.

2.1 Linear Regression, Maximum Likelihood Estimation, and Least Squares

$$y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}$$
The errors $\varepsilon^{(i)}\ (1\leq i\leq m)$ are independent and identically distributed, following a Gaussian distribution with mean 0 and some fixed variance $\sigma^2$. Rationale: the central limit theorem.
In real problems, many random phenomena can be viewed as the combined effect of numerous independent factors, and such quantities tend to follow a normal distribution.

2.1.1 The Likelihood Function

$$y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}$$
$$p(\varepsilon^{(i)})=\frac{1}{\sigma \sqrt{2\pi}}\exp\left\{-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\right\}$$
$$p(y^{(i)}\mid x^{(i)};\theta)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right\}$$
$$L(\theta)=\prod_{i=1}^m p(y^{(i)}\mid x^{(i)};\theta)=\prod_{i=1}^m\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right\}$$
Log-likelihood, objective function, and least squares:
$$\ell(\theta)=\log L(\theta) = \sum_{i=1}^m\log \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right\} = m\log \frac{1}{\sigma\sqrt{2\pi}}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2$$
Maximizing the log-likelihood is therefore equivalent to minimizing the squared-error loss:
$$loss(y_j,\hat y_j)=J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2$$

2.1.2 The Least-Squares Closed-Form Solution

Closed-form solution:
$$\theta = (X^TX)^{-1}X^TY$$
Least squares requires the matrix $X^T X$ to be invertible. To guard against non-invertibility (and against overfitting), a small disturbance term can be added so that the resulting matrix is always invertible:
$$\theta = (X^TX+\lambda I)^{-1}X^Ty$$
The difficulty of solving least squares directly: computing the matrix inverse is expensive.
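As a minimal NumPy sketch (with synthetic data, so the variable names and values here are purely illustrative), the closed form and its regularized variant look like this; in practice np.linalg.solve or np.linalg.lstsq is preferred over forming the inverse explicitly:

import numpy as np

# Synthetic data: design matrix X (m x n) with a leading bias column, targets y (m,)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=100)

# Plain least squares: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
# Regularized version: theta = (X^T X + lambda*I)^{-1} X^T y
lam = 0.1
theta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y
print(theta, theta_ridge)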

2.1.3 Ordinary Least Squares: A Linear Regression Example

We have a batch of data describing household electric power usage. We fit a predictive model to it to obtain, for example, the relationship between time of day and power, and between power and current.
Data source: Individual household electric power consumption Data Set
Suggestion: use LinearRegression from sklearn.linear_model in Python's scikit-learn.
The code is as follows:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pandas import DataFrame 
import time
import pandas as pd
path = 'datas/household_power_consumption_1000.txt'
df = pd.read_csv(path,sep=';',low_memory=False)
df.head()
         Date      Time  Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3
0  16/12/2006  17:24:00                4.216                  0.418   234.84              18.4             0.0             1.0            17.0
1  16/12/2006  17:25:00                5.360                  0.436   233.63              23.0             0.0             1.0            16.0
2  16/12/2006  17:26:00                5.374                  0.498   233.29              23.0             0.0             2.0            17.0
3  16/12/2006  17:27:00                5.388                  0.502   233.74              23.0             0.0             1.0            17.0
4  16/12/2006  17:28:00                3.666                  0.528   235.68              15.8             0.0             1.0            17.0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
Date                     1000 non-null object
Time                     1000 non-null object
Global_active_power      1000 non-null float64
Global_reactive_power    1000 non-null float64
Voltage                  1000 non-null float64
Global_intensity         1000 non-null float64
Sub_metering_1           1000 non-null float64
Sub_metering_2           1000 non-null float64
Sub_metering_3           1000 non-null float64
dtypes: float64(7), object(2)
memory usage: 70.4+ KB
# Filter out abnormal data (the raw file marks missing values with '?')
new_df = df.replace('?',np.nan)
datas = new_df.dropna(axis=0,how='any')
datas.describe()
       Global_active_power  Global_reactive_power    Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3
count          1000.000000            1000.000000 1000.00000       1000.000000          1000.0     1000.000000     1000.000000
mean              2.418772               0.089232  240.03579         10.351000             0.0        2.749000        5.756000
std               1.239979               0.088088    4.08442          5.122214             0.0        8.104053        8.066941
min               0.206000               0.000000  230.98000          0.800000             0.0        0.000000        0.000000
25%               1.806000               0.000000  236.94000          8.400000             0.0        0.000000        0.000000
50%               2.414000               0.072000  240.65000         10.000000             0.0        0.000000        0.000000
75%               3.308000               0.126000  243.29500         14.000000             0.0        1.000000       17.000000
max               7.706000               0.528000  249.37000         33.200000             0.0       38.000000       19.000000
# Inspect column types
datas.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 9 columns):
Date                     1000 non-null object
Time                     1000 non-null object
Global_active_power      1000 non-null float64
Global_reactive_power    1000 non-null float64
Voltage                  1000 non-null float64
Global_intensity         1000 non-null float64
Sub_metering_1           1000 non-null float64
Sub_metering_2           1000 non-null float64
Sub_metering_3           1000 non-null float64
dtypes: float64(7), object(2)
memory usage: 78.1+ KB
## Helper to parse the date/time strings
def date_format(dt):
    # dt is a Series/tuple; dt[0] is the date, dt[1] is the time
    import time
    t = time.strptime(' '.join(dt), '%d/%m/%Y %H:%M:%S')
    return (t.tm_year, t.tm_mon, t.tm_mday, t.tm_hour, t.tm_min, t.tm_sec)
## Goal: model the relationship between time and power. Feature: time; target: power.
# Extract X and Y, converting the timestamps into numeric features
X = datas.iloc[:,0:2]
X = X.apply(lambda x: pd.Series(date_format(x)), axis=1)
Y = datas['Global_active_power']
## Split the dataset into training and test sets
# X: feature matrix (usually a DataFrame)
# Y: labels corresponding to the features (usually a Series)
# test_size: fraction of the data used for the test set, a float in (0, 1)
# random_state: seed for the random splitter; fixing it (an int) makes every split reproducible
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
X_train.describe()

            0      1           2           3           4      5
count   800.0  800.0  800.000000  800.000000  800.000000  800.0
mean   2006.0   12.0   16.598750   10.755000   29.723750    0.0
std       0.0    0.0    0.490458    8.068386   17.266517    0.0
min    2006.0   12.0   16.000000    0.000000    0.000000    0.0
25%    2006.0   12.0   16.000000    4.000000   15.000000    0.0
50%    2006.0   12.0   17.000000    8.000000   30.000000    0.0
75%    2006.0   12.0   17.000000   19.000000   45.000000    0.0
max    2006.0   12.0   17.000000   23.000000   59.000000    0.0
## Standardize the data
# StandardScaler: transforms each feature to zero mean and unit standard deviation
# In scikit-learn, an API whose name contains fit performs model training
# An API whose name contains transform converts/maps the data
# An API whose name contains predict produces predictions as output
# An API whose name combines fit and transform does both (fit first, then transform)
ss = StandardScaler()
X_train = ss.fit_transform(X_train) # fit on the training set, then transform it
X_test = ss.transform(X_test) ## apply the transformation fitted on the training data to the test set

print(X_train)

[[ 0.          0.          0.81862454 -0.83774203  1.23299681  0.        ]
 [ 0.          0.          0.81862454 -1.20979622  0.82733427  0.        ]
 [ 0.          0.         -1.22156123  1.39458314  1.52275577  0.        ]
 ...
 [ 0.          0.          0.81862454 -0.96176009  1.3489004   0.        ]
 [ 0.          0.          0.81862454 -1.08577816  0.76938248  0.        ]
 [ 0.          0.          0.81862454 -0.83774203  1.05914144  0.        ]]
pd.DataFrame(X_train).describe()

           0      1             2             3             4      5
count  800.0  800.0  8.000000e+02  8.000000e+02  8.000000e+02  800.0
mean     0.0    0.0  2.050582e-15 -5.107026e-17  5.329071e-17    0.0
std      0.0    0.0  1.000626e+00  1.000626e+00  1.000626e+00    0.0
min      0.0    0.0 -1.221561e+00 -1.333814e+00 -1.722545e+00    0.0
25%      0.0    0.0 -1.221561e+00 -8.377420e-01 -8.532677e-01    0.0
50%      0.0    0.0  8.186245e-01 -3.416698e-01  1.600918e-02    0.0
75%      0.0    0.0  8.186245e-01  1.022529e+00  8.852861e-01    0.0
max      0.0    0.0  8.186245e-01  1.518601e+00  1.696611e+00    0.0
# Train the model
lr = LinearRegression(fit_intercept=True)
lr.fit(X_train,Y_train)
# Validate the model
y_predict = lr.predict(X_test)
print("R2 on training set:",lr.score(X_train, Y_train))
print("R2 on test set:",lr.score(X_test, Y_test))
# Mean squared error; higher values mean a worse fit
mse = np.average((y_predict-Y_test)**2)
rmse = np.sqrt(mse)
print(rmse)

R2 on training set: 0.24409311805909026
R2 on test set: 0.1255162851373588
1.1640923459736248
# Print the fitted model parameters
print("Model coefficients θ:",end="")
print(lr.coef_)
print("Model intercept:", end='')
print(lr.intercept_)

Model coefficients θ:[ 0.00000000e+00  2.77555756e-16 -1.41588166e+00 -9.34953243e-01
 -1.02140756e-01  0.00000000e+00]
Model intercept:2.4454375000000024
## Save/persist the model
# One deployment option is to export the trained model; another is to export the predictions directly
# The model is usually written to a file on disk
import joblib
# The directory the model file is written to must already exist
joblib.dump(ss,'data_ss.model')
joblib.dump(lr,'data_lr.model')

['data_lr.model']
# Load the models
ss3 = joblib.load('data_ss.model')
lr3 = joblib.load('data_lr.model')

date1 = [[2019,11,11,10,7,0]]
data1 = ss3.transform(date1)
print(data1)
lr3.predict(data1)

[[ 13.          -1.         -11.42249008  -0.09363364  -1.31688203
    0.        ]]
array([18.84038214])
## Plot predicted vs. actual values
t = np.arange(len(X_test))
plt.figure(facecolor="w")
plt.plot(t,Y_test,'r-',linewidth=2,label='RealValue')
plt.plot(t,y_predict,'g-',linewidth=2,label='Predict')
plt.legend(loc='upper left')
plt.title('Relationship between time and power by Linear regression',fontsize=20)
plt.grid(b=True)
plt.show()

[Figure: predicted vs. real power values over time (output_17_0.png)]

## Relationship between power and current
X = datas.iloc[:,2:4]
X

     Global_active_power  Global_reactive_power
0                  4.216                  0.418
1                  5.360                  0.436
2                  5.374                  0.498
3                  5.388                  0.502
4                  3.666                  0.528
..                   ...                    ...
995                2.296                  0.054
996                2.292                  0.054
997                0.370                  0.000
998                0.472                  0.000
999                3.054                  0.060

1000 rows × 2 columns

Y2 = datas.iloc[:,5]
Y2

0      18.4
1      23.0
2      23.0
3      23.0
4      15.8
       ... 
995     9.6
996     9.6
997     2.4
998     2.4
999    13.4
Name: Global_intensity, Length: 1000, dtype: float64
## Split the data
X2_train,X2_test,Y2_train,Y2_test = train_test_split(X, Y2, test_size=0.2, random_state=0)

## Standardize the data
scaler2 = StandardScaler()
X2_train = scaler2.fit_transform(X2_train) # fit and transform the training set
X2_test = scaler2.transform(X2_test) ## apply the transformation fitted on the training data to the test set

## Train the model
lr2 = LinearRegression()
lr2.fit(X2_train, Y2_train) ## fit the model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
# Predict
Y2_predict = lr2.predict(X2_test)

# Evaluate the model
print("Current model score:",lr2.score(X2_test,Y2_test))
print("Current model coefficients:",lr2.coef_)

Current model score: 0.9920420609708968
Current model coefficients: [5.07744316 0.07191391]
## Plot the results
#### power vs. current
tn = np.arange(len(X2_test))
plt.figure(facecolor='w')
plt.plot(tn,Y2_test,'r-',linewidth=2,label='RealValue')
plt.plot(tn,Y2_predict,'g-',linewidth=2,label='PredictValue')
plt.legend(loc = 'upper left')
plt.title('Relationship between Power and current predicted by Linear regression',fontsize=20)
plt.grid(True)
plt.show()

[Figure: power vs. current prediction (output_25_0.png)]

2.2 Objective (Loss) Functions

  • 0-1 loss:
    $$J(\theta)=\begin{cases} 1, & Y\neq f(X)\\ 0, & Y=f(X) \end{cases}$$
  • Perceptron loss:
    $$J(\theta)=\begin{cases} 1, & |Y-f(X)|>t\\ 0, & |Y-f(X)|\leq t \end{cases}$$
  • Squared loss:

$$J(\theta)=\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

  • Absolute loss:
    $$J(\theta)=\sum_{i=1}^m\left|h_\theta(x^{(i)})-y^{(i)}\right|$$

  • Log loss (with the conventional minus sign, so it is minimized):
    $$J(\theta)=-\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)})$$

2.3 Overfitting in Linear Regression

Objective function:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
To prevent overfitting, i.e., to keep the θ values from becoming too large or too small in the sample space, a squared penalty can be added to the objective function:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 + \lambda\sum_{j=1}^n\theta_j^2$$
Regularization term (norm):
$$\lambda \sum_{j=1}^n\theta_j^2$$
This particular regularizer is called the L2-norm.

2.3.1 Overfitting and Regularization

L2-norm:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2, \qquad \lambda>0$$
L1-norm:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n|\theta_j|, \qquad \lambda>0$$
Ridge regression:
Linear regression with the L2 penalty is called Ridge regression:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2, \qquad \lambda>0$$
LASSO regression:
Linear regression with the L1 penalty is called LASSO regression (Least Absolute Shrinkage and Selection Operator):
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n|\theta_j|, \qquad \lambda>0$$
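A quick scikit-learn sketch of both penalties (reusing the X_train/Y_train split from the household-power example above; alpha plays the role of λ in the formulas, and the values here are illustrative):

from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(X_train, Y_train)  # L2 penalty
lasso = Lasso(alpha=0.1).fit(X_train, Y_train)  # L1 penalty
print(ridge.coef_)
print(lasso.coef_)  # typically contains exact zeros (a sparse solution)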

2.4 Ridge (L2-norm) vs. LASSO (L1-norm)

  • With the L2-norm, the parameters in every dimension are shrunk within a ball, so no parameter can be driven exactly to 0 and the solution is never sparse. In practice, the data dimensions contain noise and redundancy; a sparse solution keeps the useful dimensions and removes the redundant ones, improving the accuracy and robustness of the regression (less overfitting). The L1-norm can deliver this sparsity.
  • Ridge models tend to be more accurate, robust, and stable; LASSO models are faster to solve.
  • If you want both stability and solving speed, use Elastic Net.

2.5 Elastic Net

A linear regression model that uses both the L1 and L2 penalties is called the Elastic Net:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\left\{p\sum_{j=1}^n|\theta_j|+(1-p)\sum_{j=1}^n\theta_j^2\right\}, \qquad \lambda >0,\ p\in[0,1]$$
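In scikit-learn this corresponds to ElasticNet, where l1_ratio plays the role of p; a hedged sketch with illustrative values, again reusing the earlier X_train/Y_train:

from sklearn.linear_model import ElasticNet

# l1_ratio = 1.0 recovers LASSO, l1_ratio = 0.0 recovers Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, Y_train)
print(enet.coef_)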

3 Evaluating Model Performance

$$MSE = \frac{1}{m}\sum_{i=1}^m(y_i-\hat y_i)^2$$
$$RMSE = \sqrt{MSE}=\sqrt{\frac{1}{m}\sum_{i=1}^m(y_i-\hat y_i)^2}$$
$$R^2 = 1-\frac{RSS}{TSS}=1-\frac{\sum_{i=1}^m(y_i-\hat y_i)^2}{\sum_{i=1}^m(y_i-\bar y)^2}, \qquad \bar y = \frac{1}{m}\sum_{i=1}^m y_i$$

  • MSE: mean squared error; the closer to 0, the better the model fits the training data.
  • RMSE: the square root of MSE; interpreted the same way.
  • R²: ranges over (−∞, 1]; larger values mean a better fit, with 1 being optimal. It can be negative when the model predicts essentially random values; if the model always predicts the sample mean, R² is 0.
  • TSS: Total Sum of Squares, measuring the spread among the samples; it is m times the (biased) sample variance.
  • RSS: Residual Sum of Squares, measuring the discrepancy between predictions and samples; it is m times the MSE (a NumPy sketch of these metrics follows).
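These metrics are a few lines of NumPy (sklearn.metrics offers equivalent mean_squared_error and r2_score functions); a minimal sketch:

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def r2(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares (m * MSE)
    tss = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - rss / tss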

4 Hyperparameter Tuning

  • In practice, for models such as linear regression we need values for θ, λ, and p. Solving for θ is the model fitting itself and generally needs no developer involvement (the algorithm handles it); what actually needs tuning are λ and p. This process is called hyperparameter tuning.
  • Cross-validation: split the training data into several folds, holding one fold out for validation to pick the best hyperparameters λ and p; e.g., ten-fold or five-fold cross-validation (scikit-learn's default). A sketch with GridSearchCV follows.
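A minimal sketch of tuning λ by cross-validation with GridSearchCV (Ridge's alpha corresponds to λ; the grid and fold count are illustrative, reusing the earlier X_train/Y_train):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(Ridge(),
                    param_grid={'alpha': np.logspace(-3, 2, 10)},  # candidate lambdas
                    cv=5)                                          # five-fold CV
grid.fit(X_train, Y_train)
print(grid.best_params_)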

5 Gradient Descent

Solving the objective for θ:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
Initialize θ (randomly, or simply to 0).
Iterate along the negative gradient; each update of θ makes J(θ) smaller:
$$\theta = \theta-\alpha \frac{\partial J(\theta)}{\partial\theta}, \qquad \alpha:\ \text{learning rate (step size)}$$

5.1 The Gradient Direction

$$\frac{\partial}{\partial \theta_j}J(\theta)=\frac{\partial}{\partial \theta_j}\frac{1}{2}(h_\theta(x)-y)^2 = 2\cdot\frac{1}{2}(h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}(h_\theta(x)-y) = (h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^n\theta_ix_i-y\right) = (h_\theta(x)-y)\,x_j$$

5.2 Batch Gradient Descent (BGD)

For a single sample, $\frac{\partial}{\partial \theta_j}J(\theta)=(h_\theta(x)-y)x_j$; summing over all m samples,
$$\frac{\partial J(\theta)}{\partial\theta_j}=\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$
so the update rule is
$$\theta_j= \theta_j+\alpha\sum_{i=1}^m\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
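A minimal NumPy sketch of the batch update above (X is assumed to carry a bias column; the learning rate and iteration count are illustrative):

import numpy as np

def bgd(X, y, alpha=0.001, n_iter=1000):
    """Batch gradient descent for the least-squares objective."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)  # full gradient over all m samples
        theta -= alpha * grad         # step along the negative gradient
    return theta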

5.3 Stochastic Gradient Descent (SGD)

$$\frac{\partial}{\partial \theta_j}J(\theta)=(h_\theta(x)-y)x_j$$
for i = 1 to m {
$$\theta_j = \theta_j+\alpha\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
}

5.4 BGD vs. SGD

  • SGD is faster than BGD (it needs fewer passes over the data).
  • In some situations (multiple local optima exist / J(θ) is not quadratic), SGD may jump out of small local optima, so it is not necessarily worse than BGD.
  • BGD is guaranteed to reach a local optimum (for linear regression, the global optimum); because of its randomness, SGD may end up with a worse result than BGD.
  • Note: prefer SGD.

5.5 Mini-Batch Gradient Descent (MBGD)

If we want the training process to be fast while still keeping the final parameters accurate, Mini-batch Gradient Descent (MBGD) is designed for exactly that. Instead of updating after every single sample, MBGD uses the average gradient of b samples (b is typically 10) as the update direction:
for i = 1, 11, 21, ..., m−9 {
$$\theta_j =\theta_j+\alpha \sum_{k=i}^{i+9}\left(y^{(k)}-h_\theta(x^{(k)})\right)x_j^{(k)}$$
}
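A hedged NumPy sketch of the mini-batch loop (shuffling each epoch; b=1 recovers SGD and b=m recovers BGD; all constants are illustrative):

import numpy as np

def mbgd(X, y, alpha=0.001, b=10, n_epochs=50, seed=0):
    """Mini-batch gradient descent for least squares."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)               # reshuffle every epoch
        for start in range(0, m, b):
            batch = idx[start:start + b]
            grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
            theta -= alpha * grad              # average gradient of b samples
    return theta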

5.6 Summary

Because gradient descent moves along the negative gradient, the value it converges to may only be a local optimum, so some tuning strategies are usually needed:

  • Learning rate: too large, and each update changes too much, possibly jumping over the optimum; too small, and each update changes too little, making convergence painfully slow.
  • Initial parameter values: different initializations can reach different minima, since gradient descent finds local optima. It is therefore common to run the algorithm from several different initial values and keep the result with the smallest loss.
  • Standardization: features with different value ranges can cause the parameters to converge at different speeds. Standardizing the features reduces this effect.

5.6.1 Differences among BGD, SGD, and MBGD

  • With m samples, one pass over the data updates the parameters once in BGD, m times in SGD, and m/n times in MBGD (n being the batch size); SGD updates fastest.
  • SGD updates the parameters on every sample, so an anomalous sample can push an update in the wrong direction. As a result, SGD does not converge exactly; it oscillates around the optimum.
  • Because SGD updates once per sample, it is particularly suitable for very large datasets and for online machine learning (Online ML).

6 Extending Linear Regression

  • Linear regression is linear with respect to θ; with respect to the samples themselves, the relationship may be nonlinear.
  • That is, the learned function f: x → y may be nonlinear in x, e.g., a curve — see the sketch below.
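A minimal sketch of this idea with scikit-learn: expand the features polynomially, then fit an ordinary linear model (the degree here is illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Still linear in theta, but nonlinear in x thanks to the feature expansion
model = Pipeline([
    ('poly', PolynomialFeatures(degree=3)),
    ('linear', LinearRegression()),
])
# model.fit(x.reshape(-1, 1), y) would now fit a cubic curve to 1-D data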

7 Linear Regression Summary

  • Models: linear regression (Linear), Ridge regression, LASSO regression, Elastic Net
  • Regularization: L1-norm, L2-norm
  • Loss/objective function: $J(\theta)=\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \to \min_\theta J(\theta)$
  • Ways to solve for θ: least squares (direct computation; the objective must be the squared-error loss) and gradient descent (BGD/SGD/MBGD)

8 Locally Weighted Regression

8.1 Locally Weighted Regression: Loss Function

  • Ordinary linear regression loss:
    $$J(\theta)=\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
  • Locally weighted regression loss:
    $$J(\theta)=\sum_{i=1}^m w^{(i)}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

8.2 Locally Weighted Regression: Choosing the Weights

$w^{(i)}$ is a weight assigned to each point in the dataset based on its distance to the point being predicted: the farther a point is from the query point, the smaller its weight. A common choice is
$$w^{(i)}=\exp\left\{-\frac{(x^{(i)}-\bar x)^2}{2k^2}\right\}$$
This is an exponential decay function, where k is the bandwidth parameter controlling how quickly the weights fall off with distance.
Note: this technique is essentially about similarity between samples; the topic returns with SVMs (kernel functions).
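A minimal NumPy sketch of prediction at a single query point, assuming X carries a bias column; it solves the weighted normal equation θ = (XᵀWX)⁻¹XᵀWy implied by the weighted loss above:

import numpy as np

def lwr_predict(x_query, X, y, k=1.0):
    """Locally weighted regression prediction for one query point x_query."""
    # Exponentially decaying weights, as in the formula above
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * k ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y  # weighted normal equation
    return x_query @ theta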

9 Comprehensive Regression Case (2): Boston Housing Price Prediction

Build a housing price prediction model on the Boston housing data using two algorithms, LASSO regression and Ridge regression. For each, find the best configuration among polynomial degrees 1/2/3, and compare the two algorithms. Also use LASSO for feature selection. (The dataset is easy to find online.)

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
def notEmpty(s):
    return s != ''
## Load the data
names = ['CRIM','ZN', 'INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
path = "boston_housing.data"
## The file's column layout is not uniform, so read each line as a single field first, then process it row by row
fd = pd.read_csv(path,header=None)
fd.head()
                                               0
0  0.00632 18.00 2.310 0 0.5380 6.5750 65...
1  0.02731 0.00 7.070 0 0.4690 6.4210 78...
2  0.02729 0.00 7.070 0 0.4690 7.1850 61...
3  0.03237 0.00 2.180 0 0.4580 6.9980 45...
4  0.06905 0.00 2.180 0 0.4580 7.1470 54...
## Load the data
names = ['CRIM','ZN', 'INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
path = "boston_housing.data"
## The file's column layout is not uniform, so read each line as a single field first, then process it row by row
fd = pd.read_csv(path,header=None)
# print (fd.shape)
data = np.empty((len(fd), 14))
for i, d in enumerate(fd.values): # enumerate yields the index i and the element d

    d = map(float, filter(notEmpty, d[0].split(' '))) # filter takes a function and a list,
    
    # keeping only the items for which the function returns True
    data[i] = list(d)
    
## Split the data
x, y = np.split(data, (13,), axis=1)
print (x[0:5])
y = y.ravel() # flatten to 1-D
print (y[0:5])
ly=len(y)
print(y.shape)
print ("Number of samples: %d, number of features: %d" % x.shape)
print ("Number of target values: %d" % y.shape[0])
[[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
  6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02
  4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
  7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
  9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
  4.0300e+00]
 [3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00
  4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9463e+02
  2.9400e+00]
 [6.9050e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 7.1470e+00
  5.4200e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9690e+02
  5.3300e+00]]
[24.  21.6 34.7 33.4 36.2]
(506,)
Number of samples: 506, number of features: 13
Number of target values: 506
## Pipelines are commonly combined with grid search for tuning
models = [
    Pipeline([
            ('ss', StandardScaler()),
            ('poly', PolynomialFeatures()),
            ('linear', RidgeCV(alphas=np.logspace(-3,1,20)))
        ]),
    Pipeline([
            ('ss', StandardScaler()),
            ('poly', PolynomialFeatures()),
            ('linear', LassoCV(alphas=np.logspace(-3,1,20)))
        ])
] 

# Parameter grid: keys are parameter names, values are candidate lists
parameters = {
    "poly__degree": [3,2,1], 
    "poly__interaction_only": [True, False], # True: interaction terms only, no pure powers such as X1*X1
    "poly__include_bias": [True, False], # True: include the degree-0 (constant) feature as the model intercept
    "linear__fit_intercept": [True, False]
}

# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
## Compare the Lasso and Ridge models and plot the results
titles = ['Ridge', 'Lasso']
colors = ['g-', 'b-']
plt.figure(figsize=(16,8), facecolor='w')
ln_x_test = range(len(x_test))

plt.plot(ln_x_test, y_test, 'r-', lw=2, label=u'Real value')
for t in range(2):
    # Build the model and set its parameters
    # GridSearchCV: cross-validation to pick the best parameter values
    # first argument: the model to tune
    # param_grid: the parameter grid, a dict; cv: number of cross-validation folds
    model = GridSearchCV(models[t], param_grid=parameters,cv=5, n_jobs=1) # five-fold cross-validation
    # Train with grid search
    model.fit(x_train, y_train)
    # Retrieve the best parameters
    print ("%s best parameters:" % titles[t],model.best_params_)
    print ("%s R value=%.3f" % (titles[t], model.best_score_))
    # Predict
    y_predict = model.predict(x_test)
    # Plot
    plt.plot(ln_x_test, y_predict, colors[t], lw = t + 3, label=u'%s estimate, $R^2$=%.3f' % (titles[t],model.best_score_))
# Show the figure
plt.legend(loc = 'upper left')
plt.grid(True)
plt.title(u"Boston housing price prediction")
plt.show()
plt.show()
Ridge best parameters: {'linear__fit_intercept': True, 'poly__degree': 2, 'poly__include_bias': True, 'poly__interaction_only': True}
Ridge R value=0.874
Lasso best parameters: {'linear__fit_intercept': False, 'poly__degree': 3, 'poly__include_bias': True, 'poly__interaction_only': True}
Lasso R value=0.857
## Train a single Lasso model for feature selection (degree-1 features) <using the best degree-1 parameters found above>
model = Pipeline([
            ('ss', StandardScaler()),
            ('poly', PolynomialFeatures(degree=1, include_bias=False, interaction_only=True)),
            ('linear', LassoCV(alphas=np.logspace(-3,1,20), fit_intercept=False))
        ])
# Train the model
model.fit(x_train, y_train)


9.1 Model Evaluation

9.1.1 Output

print ("Coefficients:", list(zip(names,model.get_params('linear')['linear'].coef_)))
print ("Intercept:", model.get_params('linear')['linear'].intercept_)


    Coefficients: [('CRIM', -0.0), ('ZN', 0.0), ('INDUS', -0.0), ('CHAS', 0.0), ('NOX', -0.0), ('RM', 2.290127774392436), ('AGE', -0.0), ('DIS', 0.0), ('RAD', -0.0), ('TAX', -0.0), ('PTRATIO', -1.5607620229769363), ('B', 0.0), ('LSTAT', -3.523934059194476)]
    Intercept: 0.0

In practice, when a linear model's coefficient is close to 0, the corresponding feature carries little decision-relevant information for the model, so it can be dropped. When pruning by hand, a common rule of thumb is to drop features whose coefficients are below 1e-4, as sketched below.
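As a sketch of that rule of thumb applied to the model above (reusing model and names from this case):

coef = model.get_params('linear')['linear'].coef_
keep = [name for name, c in zip(names, coef) if abs(c) > 1e-4]
print("Selected features:", keep)  # for the run above: RM, PTRATIO, LSTAT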

10 Comprehensive Regression Case (3): Wine Quality Prediction

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
path1 = "winequality-red.csv"
df1 = pd.read_csv(path1,sep=";")
df1.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
df1['type'] = 1
path2 = "winequality-white.csv"
df2 = pd.read_csv(path2,sep=";")
df2["type"] = 2
df2.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  type
0            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45      8.8        6     2
1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49      9.5        6     2
2            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44     10.1        6     2
3            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6     2
4            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6     2
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
type                    4898 non-null int64
dtypes: float64(11), int64(2)
memory usage: 497.6 KB
df = pd.concat([df1,df2],axis=0)
df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  type
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5     1
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5     1
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5     1
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6     1
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5     1
names = ["fixed acidity","volatile acidity","citric acid",
         "residual sugar","chlorides","free sulfur dioxide",
         "total sulfur dioxide","density","pH","sulphates",
         "alcohol", "type"]
quality = "quality"
names1 = []
for i in list(df):
    names1.append(i)
names1
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality',
 'type']
new_df = df.replace("?",np.nan)
datas = new_df.dropna(how='any')
X = datas[names]
Y = datas[quality]
Y.ravel()
array([5, 5, 5, ..., 6, 7, 6])
models = [
    Pipeline([
        ("Poly",PolynomialFeatures()),
        ("Linear",LinearRegression())
    ]),
    Pipeline([
        ("Poly",PolynomialFeatures()),
        ("Linear",RidgeCV(alphas = np.logspace(-4,2,20)))
    ]),
    Pipeline([
        ("Poly",PolynomialFeatures()),
        ("Linear",LassoCV(alphas=np.logspace(-4,2,20)))
    ]),
    Pipeline([
        ("Poly",PolynomialFeatures()),
        ("Linear",ElasticNetCV(alphas=np.logspace(-4,2,20),l1_ratio=np.linspace(0,1,5)))
    ])
]
plt.figure(figsize=(16,8),facecolor='w')
titles = "Line predict","Ridge predict","Lasso predict","ElasticNet predict"
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.01,random_state=0)
ln_x_test = range(len(X_test))
d_pool = np.arange(1,4,1)
m = len(d_pool)
clrs = []
for c in np.linspace(5570560,255,m):
    clrs.append('#%06x'% int(c))
    
for t in range(4):
    plt.subplot(2,2,t+1)
    model = models[t]
    plt.plot(ln_x_test,Y_test,c='r',lw=2,alpha=0.75,zorder=10,label='Real value')
    for i,d in enumerate(d_pool):
        model.set_params(Poly__degree=d)
        model.fit(X_train,Y_train)
        Y_pre = model.predict(X_test)
        R = model.score(X_train,Y_train)
        lin = model.get_params('Linear')['Linear']
        plt.plot(ln_x_test,Y_pre,c=clrs[i],lw=2,alpha=0.75,zorder=i,label='%d Predict Value,$R^2$=%.3f' % (d,R))
        
    plt.legend(loc='upper left')
    plt.grid(True)
    plt.title(titles[t],fontsize=18)
    plt.xlabel('X',fontsize=16)
    plt.ylabel('Y',fontsize=16)
plt.suptitle('wine quality predict',fontsize=22)
plt.show()


11 Logistic Regression

The logistic/sigmoid function:
$$p = h_\theta(x)=g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}$$
$$g(z)=\frac{1}{1+e^{-z}}$$
$$g'(z)=\left(\frac{1}{1+e^{-z}}\right)'=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=\frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right)=g(z)(1-g(z))$$

11.1 Logistic Regression and Its Likelihood Function

Assume:
$$P(y=1\mid x;\theta)=h_\theta(x)$$
$$P(y=0\mid x;\theta) = 1- h_\theta(x)$$
$$P(y\mid x;\theta) = (h_\theta(x))^y(1-h_\theta(x))^{1-y}$$
Likelihood:
$$L(\theta) = p(\vec y\mid X;\theta) =\prod_{i=1}^m p(y^{(i)}\mid x^{(i)};\theta)=\prod_{i=1}^m(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}$$
Log-likelihood:
$$\ell(\theta) = \log L(\theta)=\sum_{i=1}^m\left(y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta (x^{(i)}))\right)$$

11.2 The Gradient of the Log-Likelihood

$$\frac{\partial\ell(\theta)}{\partial \theta_j}=\sum_{i=1}^m\left(y^{(i)}-g(\theta^Tx^{(i)})\right)x_j^{(i)}$$

11.3 Solving for θ

The θ parameters of logistic regression are found by gradient ascent on the log-likelihood (analogous to gradient descent); the batch and stochastic updates are:
$$\theta_j = \theta_j+\alpha\sum_{i=1}^m\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
$$\theta_j = \theta_j+\alpha\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
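A minimal NumPy sketch of the stochastic update above (gradient ascent on the log-likelihood; X is assumed to carry a bias column and y to be 0/1; the constants are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, alpha=0.1, n_epochs=100, seed=0):
    """Stochastic gradient ascent for logistic regression."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):  # one sample at a time
            theta += alpha * (y[i] - sigmoid(X[i] @ theta)) * X[i]
    return theta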

11.4 Maximum Likelihood and the Logistic Loss Function

$$L(\theta) = \prod_{i=1}^m p(y^{(i)}\mid x^{(i)};\theta)=\prod_{i=1}^m p_i^{y^{(i)}}(1-p_i)^{1-y^{(i)}}, \qquad p_i = h_\theta(x^{(i)})=\frac{1}{1+e^{-\theta^Tx^{(i)}}}$$
$$\ell(\theta)=\ln L(\theta) = \sum_{i=1}^m\ln \left[p_i^{y^{(i)}}(1-p_i)^{1-y^{(i)}}\right]$$
$$loss=-\ell(\theta)=\sum_{i=1}^m\left[-y^{(i)}\ln(h_\theta (x^{(i)}))-(1-y^{(i)})\ln(1-h_\theta(x^{(i)}))\right]$$

11.5 Logistic Case (1): Breast Cancer Classification

Predict breast cancer from pathology data (class 4 = malignant, class 2 = benign), building the model with logistic regression.

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LogisticRegressionCV,LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
path = "breast-cancer-wisconsin.data"
names = ['id','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape',
         'Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei',
        'Bland Chromatin','Normal Nucleoli','Mitoses','Class']
df = pd.read_csv(path,header=None,names=names)
datas = df.replace('?',np.nan).dropna(how='any')
datas.head()
        id  Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class
0  1000025                5                        1                         1                  1                            2            1                3                1        1      2
1  1002945                5                        4                         4                  5                            7           10                3                2        1      2
2  1015425                3                        1                         1                  1                            2            2                3                1        1      2
3  1016277                6                        8                         8                  1                            3            4                3                7        1      2
4  1017023                4                        1                         1                  3                            2            1                3                1        1      2
X = datas[names[1:10]]
Y = datas[names[10]]
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.1,random_state=0)
ss =StandardScaler()
X_train = ss.fit_transform(X_train)
# 3. Build and train the model
## penalty: regularization to combat overfitting, l1 or l2
## solver: the optimization method
### with penalty='l1' the only valid solver is liblinear (coordinate descent);
### lbfgs and newton-cg rely on a second-order Taylor expansion of the objective
### with penalty='l2' the solver can be lbfgs (quasi-Newton), newton-cg (a Newton variant), or sag (mini-batch)
# lbfgs works well below ~10000 dimensions; above that newton-cg tends to do better; on GPUs, lbfgs and newton-cg are both faster than sag
## multi_class: classification strategy, ovr (default) or multinomial; identical for binary problems, different for multi-class ones
### ovr: one-vs-rest — treat the multi-class problem as a series of binary problems, iterating over the classes
### multinomial: many-vs-many (MVM), i.e. Softmax-style classification
## class_weight: per-class weights

### Note: logistic regression is a classification algorithm and cannot be used for regression (the y passed to the model must be int, not float)
lr = LogisticRegressionCV(multi_class='ovr',fit_intercept=True, Cs=np.logspace(-2, 2, 20), cv=2, penalty='l2', solver='lbfgs', tol=0.01)
re=lr.fit(X_train, Y_train)
r = re.score(X_train,Y_train)
print ("R value (accuracy):", r)
print ("Sparse feature ratio: %.2f%%" % (np.mean(lr.coef_.ravel() == 0) * 100))
print ("Coefficients:",re.coef_)
print ("Intercept:",re.intercept_)
print(re.predict_proba(X_test)) # probabilities from the sigmoid; note that strictly X_test should be standardized first (it is only transformed further below)
R value (accuracy): 0.9706840390879479
Sparse feature ratio: 0.00%
Coefficients: [[1.3926311  0.17397478 0.65749877 0.8929026  0.36507062 1.36092964
  0.91444624 0.63198866 0.75459326]]
Intercept: [-1.02717163]
[[6.61838068e-06 9.99993382e-01]
 [3.78575185e-05 9.99962142e-01]
 [2.44249065e-15 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [1.52850624e-03 9.98471494e-01]
 [6.67061684e-05 9.99933294e-01]
 [6.75536843e-07 9.99999324e-01]
 [0.00000000e+00 1.00000000e+00]
 [2.43117004e-05 9.99975688e-01]
 [6.13092842e-04 9.99386907e-01]
 [0.00000000e+00 1.00000000e+00]
 [2.00330728e-06 9.99997997e-01]
 [0.00000000e+00 1.00000000e+00]
 [3.78575185e-05 9.99962142e-01]
 [4.65824155e-08 9.99999953e-01]
 [5.47788703e-10 9.99999999e-01]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [6.27260778e-07 9.99999373e-01]
 [3.78575185e-05 9.99962142e-01]
 [3.85098865e-06 9.99996149e-01]
 [1.80189197e-12 1.00000000e+00]
 [9.44640398e-05 9.99905536e-01]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [4.11688915e-06 9.99995883e-01]
 [1.85886872e-05 9.99981411e-01]
 [5.83016713e-06 9.99994170e-01]
 [0.00000000e+00 1.00000000e+00]
 [1.52850624e-03 9.98471494e-01]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [1.51713085e-05 9.99984829e-01]
 [2.34685008e-05 9.99976531e-01]
 [1.51713085e-05 9.99984829e-01]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [2.34685008e-05 9.99976531e-01]
 [0.00000000e+00 1.00000000e+00]
 [9.97563915e-07 9.99999002e-01]
 [1.70686321e-07 9.99999829e-01]
 [1.38382134e-04 9.99861618e-01]
 [1.36080718e-04 9.99863919e-01]
 [1.52850624e-03 9.98471494e-01]
 [1.68154251e-05 9.99983185e-01]
 [6.66097483e-04 9.99333903e-01]
 [0.00000000e+00 1.00000000e+00]
 [9.77502258e-07 9.99999022e-01]
 [5.83016713e-06 9.99994170e-01]
 [0.00000000e+00 1.00000000e+00]
 [4.09496721e-06 9.99995905e-01]
 [0.00000000e+00 1.00000000e+00]
 [1.37819117e-06 9.99998622e-01]
 [6.27260778e-07 9.99999373e-01]
 [4.52734741e-07 9.99999547e-01]
 [0.00000000e+00 1.00000000e+00]
 [8.88178420e-16 1.00000000e+00]
 [1.06976766e-08 9.99999989e-01]
 [0.00000000e+00 1.00000000e+00]
 [2.45780192e-04 9.99754220e-01]
 [3.92389040e-04 9.99607611e-01]
 [6.10681985e-05 9.99938932e-01]
 [9.44640398e-05 9.99905536e-01]
 [1.51713085e-05 9.99984829e-01]
 [2.45780192e-04 9.99754220e-01]
 [2.45780192e-04 9.99754220e-01]
 [1.51713085e-05 9.99984829e-01]
 [0.00000000e+00 1.00000000e+00]]
X_test = ss.transform(X_test)
Y_predict = re.predict(X_test)
x_len = range(len(X_test))
plt.figure(figsize=(14,7),facecolor='w')
plt.ylim(0,6)
plt.plot(x_len,Y_test,'ro',markersize=8,zorder=3,label='real value')
plt.plot(x_len,Y_predict,'go',markersize=14,zorder=2,label='predcit value,$R^2$=%.3f' % re.score(X_test,Y_test))
plt.legend(loc='upper left')
plt.xlabel('data num',fontsize=18)
plt.ylabel('cancer catelog',fontsize=18)
plt.title('Classification of data by Logistic regression algorithm',fontsize=20)
plt.show()


12 Softmax Regression

  • Softmax regression generalizes logistic regression to K-class problems; class k has a parameter vector $\theta_k$, and together the vectors form a matrix $\theta_{k\times n}$.
  • The softmax function maps an arbitrary K-dimensional real vector to another K-dimensional vector whose entries all lie in (0, 1).
  • The softmax class-probability function (see the sketch below):
    $$p(y=k\mid x;\theta)=\frac{e^{\theta_k^{T}x}}{\sum_{l=1}^K e^{\theta_l^{T}x}},\qquad k=1,2,...,K$$
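A minimal NumPy sketch of the softmax mapping itself (subtracting the maximum score first is a standard numerical-stability trick, not part of the formula):

import numpy as np

def softmax(scores):
    """Map K real scores theta_k^T x to K probabilities in (0, 1) summing to 1."""
    scores = scores - np.max(scores)  # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # e.g. [0.659 0.242 0.099]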

12.1 Softmax Case (1): Wine Quality Classification

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import label_binarize
from sklearn import metrics
path1 = "winequality-red.csv"
df1 = pd.read_csv(path1,sep=";")
df1["type"]=1

path2 = "winequality-white.csv"
df2 = pd.read_csv(path2,sep=";")
df2["type"]=2

df = pd.concat([df1,df2],axis=0)

names = ["fixed acidity","volatile acidity","citric acid",
         "residual sugar","chlorides","free sulfur dioxide",
         "total sulfur dioxide","density","pH","sulphates",
         "alcohol", "type"]
quality = "quality"
df.head(5)
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  type
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5     1
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5     1
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5     1
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6     1
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5     1
new_df = df.replace('?',np.nan)
datas = new_df.dropna(how='any')
print ("Original rows: %d; rows after removing anomalies: %d; anomalous rows: %d" % (len(df), len(datas), len(df) - len(datas)))
X = datas[names]
Y = datas[quality]
Original rows: 6497; rows after removing anomalies: 6497; anomalous rows: 0
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25,random_state=0)
print ("Training rows: %d; number of features: %d; test rows: %d" % (X_train.shape[0], X_train.shape[1], X_test.shape[0]))
Training rows: 4872; number of features: 12; test rows: 1625
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
Y_train.value_counts()

6    2132
5    1606
7     805
4     161
8     146
3      20
9       2
Name: quality, dtype: int64
lr = LogisticRegressionCV(fit_intercept=True, Cs=np.logspace(-5, 1, 100), 
                          multi_class='multinomial', penalty='l2', solver='lbfgs')
lr.fit(X_train, Y_train)

LogisticRegressionCV(Cs=array([1.00000000e-05, 1.14975700e-05, 1.32194115e-05, 1.51991108e-05,
       1.74752840e-05, 2.00923300e-05, 2.31012970e-05, 2.65608778e-05,
       3.05385551e-05, 3.51119173e-05, 4.03701726e-05, 4.64158883e-05,
       5.33669923e-05, 6.13590727e-05, 7.05480231e-05, 8.11130831e-05,
       9.32603347e-05, 1.07226722e-04, 1.23284674e-04, 1.41747416e-04,
       1.62975083e-04, 1.87...
       3.76493581e+00, 4.32876128e+00, 4.97702356e+00, 5.72236766e+00,
       6.57933225e+00, 7.56463328e+00, 8.69749003e+00, 1.00000000e+01]),
                     class_weight=None, cv='warn', dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='multinomial', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)
r = lr.score(X_train, Y_train)
print("R value:", r)
print("Sparse feature ratio: %.2f%%" % (np.mean(lr.coef_.ravel() == 0) * 100))
print("Coefficients:",lr.coef_)
print("Intercept:",lr.intercept_)
print("Probabilities:", lr.predict_proba(X_test)) # class probabilities; note that strictly X_test should be standardized first (it is only transformed further below)

R value: 0.5447454844006568
Sparse feature ratio: 0.00%
Coefficients: [[ 2.33356993e-01  3.96147394e-01 -1.10711540e-01 -1.05855564e-01
   2.61801820e-01  3.56963863e-01  4.81906313e-02  1.94290542e-02
   2.25156551e-02 -1.94888814e-01 -8.98233422e-02  2.92866349e-03]
 [-1.33155433e-02  6.38796366e-01 -1.83921908e-02 -2.75146255e-01
   1.28019817e-01 -6.41981628e-01  6.48908149e-02  1.48959271e-01
   1.64274546e-02 -1.48473096e-01 -4.64261683e-01  6.99712789e-01]
 [-2.68229854e-01  2.62994302e-01  3.93147113e-02 -3.70401547e-01
   5.55961741e-02 -1.48023941e-01  2.95386175e-01  3.18642259e-01
  -2.07857133e-01 -1.78047228e-01 -8.24149481e-01 -2.39410311e-01]
 [-1.80030744e-01 -3.53604839e-01 -4.18188438e-02 -2.46998610e-02
   7.18370492e-04  5.92537487e-02 -7.33975434e-02  1.36161552e-01
  -1.09165220e-01  6.09787017e-02  1.61650854e-02 -2.23098277e-01]
 [ 1.52461941e-01 -6.36661173e-01 -4.56938032e-02  3.70505346e-01
  -2.27488717e-01  1.54955369e-01 -2.11312147e-01 -3.30908669e-01
   1.19653456e-01  3.13549620e-01  5.31754130e-01 -2.75147048e-01]
 [ 8.92749538e-03 -2.72849255e-01  8.55984157e-02  3.81082429e-01
  -1.71045984e-01  2.24782325e-01 -1.25358442e-01 -2.75522717e-01
   1.14952529e-01  1.95264228e-01  7.56680057e-01 -6.66454923e-03]
 [ 6.68297119e-02 -3.48227951e-02  9.17032513e-02  2.45154516e-02
  -4.76014817e-02 -5.94973687e-03  1.60051072e-03 -1.67607497e-02
   4.34732578e-02 -4.83834115e-02  7.36352343e-02  4.16787321e-02]]
Intercept: [-1.97619117 -0.06391344  2.29175069  2.85891982  1.44293004 -0.40871583
 -4.1447801 ]
Probabilities: [[6.96038503e-06 1.34095016e-14 9.99993040e-01 ... 7.85580648e-17
  8.48744768e-12 6.17261950e-12]
 [8.28176713e-01 2.60641243e-14 1.71744247e-01 ... 3.15689985e-09
  7.89359197e-05 7.49868933e-08]
 [3.60210355e-04 8.25475749e-29 9.99639790e-01 ... 4.25877211e-24
  2.22228431e-15 8.64138861e-17]
 ...
 [6.29409692e-09 9.60374419e-19 9.99999994e-01 ... 1.69431710e-24
  1.15901186e-17 3.45570149e-16]
 [2.61326510e-03 2.64497823e-11 9.97386416e-01 ... 4.48053758e-11
  3.01796409e-07 1.52629118e-08]
 [2.26350417e-02 4.02079320e-26 9.77364958e-01 ... 3.02503793e-20
  2.52993652e-12 2.03862697e-14]]
print("概率:", lr.predict_proba(X_test).shape) # 获取sigmoid函数返回的概率值

概率: (1625, 7)
X_test = ss.transform(X_test)

Y_predict = lr.predict(X_test)

x_len = range(len(X_test))
plt.figure(figsize=(14,7),facecolor='w')
plt.ylim(-1,11)
plt.plot(x_len,Y_test,'ro',markersize=8,zorder=3,label='real value')
plt.plot(x_len,Y_predict,'go',markersize=12,zorder=2,label='predict value')
plt.legend(loc = 'upper left')
plt.xlabel('data num')
plt.ylabel('wine quality')
plt.title('wine predict')
plt.show()


13 Comprehensive Classification Case (1): Credit Approval

Classify users for credit approval based on credit data, building models with both the logistic regression and KNN algorithms, and compare the performance of the two.

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings

import sklearn
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
pd.read_csv("crx.data",header=None)
    0      1       2  3  4   5   6     7  8  9  10 11 12     13   14 15
0   b  30.83   0.000  u  g   w   v  1.25  t  t   1  f  g  00202    0  +
1   a  58.67   4.460  u  g   q   h  3.04  t  t   6  f  g  00043  560  +
2   a  24.50   0.500  u  g   q   h  1.50  t  f   0  f  g  00280  824  +
3   b  27.83   1.540  u  g   w   v  3.75  t  t   5  t  g  00100    3  +
4   b  20.17   5.625  u  g   w   v  1.71  t  f   0  f  s  00120    0  +
..  ..    ...     ... .. ..  ..  ..   ... .. ..  .. .. ..    ...  ... ..
685 b  21.08  10.085  y  p   e   h  1.25  f  f   0  f  g  00260    0  -
686 a  22.67   0.750  u  g   c   v  2.00  f  t   2  t  g  00200  394  -
687 a  25.25  13.500  y  p  ff  ff  2.00  f  t   1  t  g  00200    1  -
688 b  17.92   0.205  u  g  aa   v  0.04  f  f   0  f  g  00280  750  -
689 b  35.00   3.375  u  g   c   h  8.29  f  f   0  t  g  00000    0  -

690 rows × 16 columns

path = "crx.data"
names = ['A1','A2','A3','A4','A5','A6','A7','A8',
         'A9','A10','A11','A12','A13','A14','A15','A16']
df = pd.read_csv(path, header=None, names=names)
print ("数据条数:", len(df))

# 2. 异常数据过滤
df = df.replace("?", np.nan).dropna(how='any')
print ("过滤后数据条数:", len(df))

df.head(5)
数据条数: 690
过滤后数据条数: 653
A1A2A3A4A5A6A7A8A9A10A11A12A13A14A15A16
0b30.830.000ugwv1.25tt1fg002020+
1a58.674.460ugqh3.04tt6fg00043560+
2a24.500.500ugqh1.50tf0fg00280824+
3b27.831.540ugwv3.75tt5tg001003+
4b20.175.625ugwv1.71tf0fs001200+
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 689
Data columns (total 16 columns):
A1     653 non-null object
A2     653 non-null object
A3     653 non-null float64
A4     653 non-null object
A5     653 non-null object
A6     653 non-null object
A7     653 non-null object
A8     653 non-null float64
A9     653 non-null object
A10    653 non-null object
A11    653 non-null int64
A12    653 non-null object
A13    653 non-null object
A14    653 non-null object
A15    653 non-null int64
A16    653 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.7+ KB
df.A16.value_counts()

-    357
+    296
Name: A16, dtype: int64
# A hand-rolled one-hot (dummy) encoding: turn the value v into a vector/list
def parse(v, l):
    # v is the string value to encode
    # l is the list of categories, one of which is v
    return [1 if i == v else 0 for i in l]
# Process one record
def parseRecord(record):
    result = []
    ## Convert the discrete fields into numeric form
    a1 = record['A1']
    for i in parse(a1, ('a', 'b')):
        result.append(i)
    
    result.append(float(record['A2']))
    result.append(float(record['A3']))
    
    # One-hot encode A4; what was one column in the DataFrame now takes four columns
    a4 = record['A4']
    for i in parse(a4, ('u', 'y', 'l', 't')):
        result.append(i)
    
    a5 = record['A5']
    for i in parse(a5, ('g', 'p', 'gg')):
        result.append(i)
    
    a6 = record['A6']
    for i in parse(a6, ('c', 'd', 'cc', 'i', 'j', 'k', 'm', 'r', 'q', 'w', 'x', 'e', 'aa', 'ff')):
        result.append(i)
    
    a7 = record['A7']
    for i in parse(a7, ('v', 'h', 'bb', 'j', 'n', 'z', 'dd', 'ff', 'o')):
        result.append(i)
    
    result.append(float(record['A8']))
    
    a9 = record['A9']
    for i in parse(a9, ('t', 'f')):
        result.append(i)
        
    a10 = record['A10']
    for i in parse(a10, ('t', 'f')):
        result.append(i)
    
    result.append(float(record['A11']))
    
    a12 = record['A12']
    for i in parse(a12, ('t', 'f')):
        result.append(i)
        
    a13 = record['A13']
    for i in parse(a13, ('g', 'p', 's')):
        result.append(i)
    
    result.append(float(record['A14']))
    result.append(float(record['A15']))
    
    a16 = record['A16']
    if a16 == '+':
        result.append(1)
    else:
        result.append(0)
        
    return result

print(parse('v', ['v', 'y', 'l']))
print(parse('y', ['v', 'y', 'l']))
print(parse('l', ['v', 'y', 'l']))

[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
### Feature engineering (convert the data to numeric form)
new_names =  ['A1_0', 'A1_1',
              'A2','A3',
              'A4_0','A4_1','A4_2','A4_3', # A4 is one-hot encoded, so one original column becomes four
              'A5_0', 'A5_1', 'A5_2', 
              'A6_0', 'A6_1', 'A6_2', 'A6_3', 'A6_4', 'A6_5', 'A6_6', 'A6_7', 'A6_8', 'A6_9', 'A6_10', 'A6_11', 'A6_12', 'A6_13', 
              'A7_0', 'A7_1', 'A7_2', 'A7_3', 'A7_4', 'A7_5', 'A7_6', 'A7_7', 'A7_8', 
              'A8',
              'A9_0', 'A9_1' ,
              'A10_0', 'A10_1',
              'A11',
              'A12_0', 'A12_1',
              'A13_0', 'A13_1', 'A13_2',
              'A14','A15','A16']
datas = df.apply(lambda x: pd.Series(parseRecord(x), index = new_names), axis=1)
names = new_names

## Preview the processed data
datas.head(5)

   A1_0  A1_1     A2     A3  A4_0  A4_1  A4_2  A4_3  A5_0  A5_1  ...  A10_1  A11  A12_0  A12_1  A13_0  A13_1  A13_2    A14    A15  A16
0   0.0   1.0  30.83  0.000   1.0   0.0   0.0   0.0   1.0   0.0  ...    0.0  1.0    0.0    1.0    1.0    0.0    0.0  202.0    0.0  1.0
1   1.0   0.0  58.67  4.460   1.0   0.0   0.0   0.0   1.0   0.0  ...    0.0  6.0    0.0    1.0    1.0    0.0    0.0   43.0  560.0  1.0
2   1.0   0.0  24.50  0.500   1.0   0.0   0.0   0.0   1.0   0.0  ...    1.0  0.0    0.0    1.0    1.0    0.0    0.0  280.0  824.0  1.0
3   0.0   1.0  27.83  1.540   1.0   0.0   0.0   0.0   1.0   0.0  ...    0.0  5.0    1.0    0.0    1.0    0.0    0.0  100.0    3.0  1.0
4   0.0   1.0  20.17  5.625   1.0   0.0   0.0   0.0   1.0   0.0  ...    1.0  0.0    0.0    1.0    0.0    0.0    1.0  120.0    0.0  1.0

5 rows × 48 columns

datas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 689
Data columns (total 48 columns):
A1_0     653 non-null float64
A1_1     653 non-null float64
A2       653 non-null float64
A3       653 non-null float64
A4_0     653 non-null float64
A4_1     653 non-null float64
A4_2     653 non-null float64
A4_3     653 non-null float64
A5_0     653 non-null float64
A5_1     653 non-null float64
A5_2     653 non-null float64
A6_0     653 non-null float64
A6_1     653 non-null float64
A6_2     653 non-null float64
A6_3     653 non-null float64
A6_4     653 non-null float64
A6_5     653 non-null float64
A6_6     653 non-null float64
A6_7     653 non-null float64
A6_8     653 non-null float64
A6_9     653 non-null float64
A6_10    653 non-null float64
A6_11    653 non-null float64
A6_12    653 non-null float64
A6_13    653 non-null float64
A7_0     653 non-null float64
A7_1     653 non-null float64
A7_2     653 non-null float64
A7_3     653 non-null float64
A7_4     653 non-null float64
A7_5     653 non-null float64
A7_6     653 non-null float64
A7_7     653 non-null float64
A7_8     653 non-null float64
A8       653 non-null float64
A9_0     653 non-null float64
A9_1     653 non-null float64
A10_0    653 non-null float64
A10_1    653 non-null float64
A11      653 non-null float64
A12_0    653 non-null float64
A12_1    653 non-null float64
A13_0    653 non-null float64
A13_1    653 non-null float64
A13_2    653 non-null float64
A14      653 non-null float64
A15      653 non-null float64
A16      653 non-null float64
dtypes: float64(48)
memory usage: 250.0 KB
## Split the data
X = datas[names[0:-1]]
Y = datas[names[-1]]

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.1,random_state=0)

X_train.describe().T

       count        mean          std    min     25%     50%      75%        max
A1_0   587.0    0.315162     0.464977   0.00   0.000    0.00    1.000       1.00
A1_1   587.0    0.684838     0.464977   0.00   0.000    1.00    1.000       1.00
A2     587.0   31.685417    11.883506  13.75  22.625   28.67   38.290      76.75
A3     587.0    4.909319     5.073588   0.00   1.040    3.00    7.520      28.00
A4_0   587.0    0.761499     0.426530   0.00   1.000    1.00    1.000       1.00
A4_1   587.0    0.235094     0.424419   0.00   0.000    0.00    0.000       1.00
A4_2   587.0    0.003407     0.058321   0.00   0.000    0.00    0.000       1.00
A4_3   587.0    0.000000     0.000000   0.00   0.000    0.00    0.000       0.00
A5_0   587.0    0.761499     0.426530   0.00   1.000    1.00    1.000       1.00
A5_1   587.0    0.235094     0.424419   0.00   0.000    0.00    0.000       1.00
A5_2   587.0    0.003407     0.058321   0.00   0.000    0.00    0.000       1.00
A6_0   587.0    0.211244     0.408539   0.00   0.000    0.00    0.000       1.00
A6_1   587.0    0.037479     0.190094   0.00   0.000    0.00    0.000       1.00
A6_2   587.0    0.061329     0.240137   0.00   0.000    0.00    0.000       1.00
A6_3   587.0    0.085179     0.279386   0.00   0.000    0.00    0.000       1.00
A6_4   587.0    0.015332     0.122975   0.00   0.000    0.00    0.000       1.00
A6_5   587.0    0.073254     0.260775   0.00   0.000    0.00    0.000       1.00
A6_6   587.0    0.059625     0.236993   0.00   0.000    0.00    0.000       1.00
A6_7   587.0    0.005111     0.071367   0.00   0.000    0.00    0.000       1.00
A6_8   587.0    0.120954     0.326352   0.00   0.000    0.00    0.000       1.00
A6_9   587.0    0.097104     0.296352   0.00   0.000    0.00    0.000       1.00
A6_10  587.0    0.049404     0.216894   0.00   0.000    0.00    0.000       1.00
A6_11  587.0    0.035775     0.185887   0.00   0.000    0.00    0.000       1.00
A6_12  587.0    0.078365     0.268974   0.00   0.000    0.00    0.000       1.00
A6_13  587.0    0.069847     0.255106   0.00   0.000    0.00    0.000       1.00
A7_0   587.0    0.587734     0.492662   0.00   0.000    1.00    1.000       1.00
A7_1   587.0    0.207836     0.406105   0.00   0.000    0.00    0.000       1.00
A7_2   587.0    0.081772     0.274250   0.00   0.000    0.00    0.000       1.00
A7_3   587.0    0.011925     0.108641   0.00   0.000    0.00    0.000       1.00
A7_4   587.0    0.006814     0.082337   0.00   0.000    0.00    0.000       1.00
A7_5   587.0    0.013629     0.116042   0.00   0.000    0.00    0.000       1.00
A7_6   587.0    0.010221     0.100669   0.00   0.000    0.00    0.000       1.00
A7_7   587.0    0.076661     0.266280   0.00   0.000    0.00    0.000       1.00
A7_8   587.0    0.003407     0.058321   0.00   0.000    0.00    0.000       1.00
A8     587.0    2.221882     3.304041   0.00   0.210    1.00    2.605      28.50
A9_0   587.0    0.538330     0.498954   0.00   0.000    1.00    1.000       1.00
A9_1   587.0    0.461670     0.498954   0.00   0.000    0.00    1.000       1.00
A10_0  587.0    0.449744     0.497892   0.00   0.000    0.00    1.000       1.00
A10_1  587.0    0.550256     0.497892   0.00   0.000    1.00    1.000       1.00
A11    587.0    2.562181     5.056756   0.00   0.000    0.00    3.000      67.00
A12_0  587.0    0.465077     0.499204   0.00   0.000    0.00    1.000       1.00
A12_1  587.0    0.534923     0.499204   0.00   0.000    1.00    1.000       1.00
A13_0  587.0    0.913118     0.281903   0.00   1.000    1.00    1.000       1.00
A13_1  587.0    0.003407     0.058321   0.00   0.000    0.00    0.000       1.00
A13_2  587.0    0.083475     0.276835   0.00   0.000    0.00    0.000       1.00
A14    587.0  178.855196   171.400688   0.00  66.000  152.00  264.000    2000.00
A15    587.0  943.959114  5081.188098   0.00   0.000    5.00  397.000  100000.00
lr = LogisticRegressionCV(Cs=np.logspace(-4,1,50),fit_intercept=True,penalty='l2',solver='lbfgs',tol=0.01,multi_class='ovr')
lr.fit(X_train,Y_train)

LogisticRegressionCV(Cs=array([1.00000000e-04, 1.26485522e-04, 1.59985872e-04, 2.02358965e-04,
       2.55954792e-04, 3.23745754e-04, 4.09491506e-04, 5.17947468e-04,
       6.55128557e-04, 8.28642773e-04, 1.04811313e-03, 1.32571137e-03,
       1.67683294e-03, 2.12095089e-03, 2.68269580e-03, 3.39322177e-03,
       4.29193426e-03, 5.42867544e-03, 6.86648845e-03, 8.68511374e-03,
       1.09854114e-02, 1.38...
       1.20679264e+00, 1.52641797e+00, 1.93069773e+00, 2.44205309e+00,
       3.08884360e+00, 3.90693994e+00, 4.94171336e+00, 6.25055193e+00,
       7.90604321e+00, 1.00000000e+01]),
                     class_weight=None, cv='warn', dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='ovr', n_jobs=None, penalty='l2',
                     random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.01, verbose=0)
lr_r = lr.score(X_train,Y_train)
print ("Logistic R value (accuracy on the training set):", lr_r)
print ("Logistic sparse feature ratio: %.2f%%" % (np.mean(lr.coef_.ravel() == 0) * 100))
print ("Logistic coefficients:",lr.coef_)
print ("Logistic intercept:",lr.intercept_)

Logistic R value (accuracy on the training set): 0.8722316865417377
Logistic sparse feature ratio: 2.13%
Logistic coefficients: [[-3.16666246e-02  6.81060212e-02 -4.79275864e-03 -7.91375832e-03
   1.01962993e-01 -1.85791599e-01  1.20268003e-01  0.00000000e+00
   1.01962993e-01 -1.85791599e-01  1.20268003e-01 -9.73878221e-03
  -2.01150844e-02  4.10359964e-01 -4.58283536e-01 -2.05522876e-02
  -2.67931426e-01 -1.11132872e-01  3.47130384e-03  9.24560737e-02
   1.91389650e-01  6.42832964e-01  9.25454935e-02 -9.20792566e-02
  -4.16782808e-01  6.55077570e-02  3.38771742e-01 -1.89955583e-01
   1.09471200e-01  1.15000661e-01 -1.18558814e-01  2.30930774e-02
  -2.97691604e-01 -9.19903959e-03  1.32470292e-01  1.41022382e+00
  -1.37378443e+00  1.93156745e-01 -1.56717348e-01  1.36228784e-01
  -3.26934489e-02  6.91328454e-02 -3.00204676e-02 -1.47415133e-02
   8.12013774e-02 -1.42065091e-03  5.01349345e-04]]
Logistic intercept: [-1.01504439]
lr_y_predict = lr.predict(X_test)
lr_y_predict

array([1., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1., 0., 0., 1., 0., 0., 1.,
       1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
       1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1.,
       0., 1., 0., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 1.])
y1 = lr.predict_proba(X_test)
y1

array([[1.00172592e-01, 8.99827408e-01],
       [2.10731730e-01, 7.89268270e-01],
       [1.02564277e-01, 8.97435723e-01],
       [9.18579612e-01, 8.14203879e-02],
       [4.96460800e-01, 5.03539200e-01],
       [1.57562852e-12, 1.00000000e+00],
       [9.19055111e-01, 8.09448889e-02],
       [5.76602518e-02, 9.42339748e-01],
       [9.25514497e-01, 7.44855027e-02],
       [3.07565128e-01, 6.92434872e-01],
       [3.04840001e-06, 9.99996952e-01],
       [9.37140018e-01, 6.28599821e-02],
       [9.46845548e-01, 5.31544517e-02],
       [2.89763929e-02, 9.71023607e-01],
       [5.90593530e-01, 4.09406470e-01],
       [9.47922855e-01, 5.20771450e-02],
       [9.85641175e-03, 9.90143588e-01],
       [4.49637006e-01, 5.50362994e-01],
       [9.75285287e-01, 2.47147128e-02],
       [2.35426074e-01, 7.64573926e-01],
       [9.83683270e-01, 1.63167301e-02],
       [9.58942383e-01, 4.10576167e-02],
       [9.41157486e-01, 5.88425136e-02],
       [8.67416400e-01, 1.32583600e-01],
       [8.63673114e-01, 1.36326886e-01],
       [8.09135599e-01, 1.90864401e-01],
       [9.18740616e-01, 8.12593843e-02],
       [5.07988826e-01, 4.92011174e-01],
       [1.52565645e-01, 8.47434355e-01],
       [8.86917247e-01, 1.13082753e-01],
       [7.68921398e-01, 2.31078602e-01],
       [1.71031614e-02, 9.82896839e-01],
       [8.28998790e-01, 1.71001210e-01],
       [9.38043147e-01, 6.19568528e-02],
       [5.92562341e-02, 9.40743766e-01],
       [9.14115478e-01, 8.58845224e-02],
       [8.71766217e-01, 1.28233783e-01],
       [9.79206513e-01, 2.07934869e-02],
       [9.66223761e-01, 3.37762392e-02],
       [9.37688450e-01, 6.23115502e-02],
       [1.61634863e-01, 8.38365137e-01],
       [5.49906429e-01, 4.50093571e-01],
       [9.30588983e-01, 6.94110169e-02],
       [9.40655454e-01, 5.93445455e-02],
       [9.15741340e-01, 8.42586599e-02],
       [9.37482034e-01, 6.25179665e-02],
       [3.73482331e-01, 6.26517669e-01],
       [9.67523269e-01, 3.24767314e-02],
       [6.22550724e-01, 3.77449276e-01],
       [9.21861425e-02, 9.07813858e-01],
       [2.36736330e-02, 9.76326367e-01],
       [8.93707865e-01, 1.06292135e-01],
       [1.15353240e-01, 8.84646760e-01],
       [7.08249977e-01, 2.91750023e-01],
       [5.81239535e-01, 4.18760465e-01],
       [3.03352701e-02, 9.69664730e-01],
       [9.61968655e-01, 3.80313451e-02],
       [9.51893915e-02, 9.04810608e-01],
       [9.17046779e-02, 9.08295322e-01],
       [5.44721836e-02, 9.45527816e-01],
       [9.49642697e-01, 5.03573032e-02],
       [3.77682261e-01, 6.22317739e-01],
       [9.57133060e-01, 4.28669400e-02],
       [2.48916456e-03, 9.97510835e-01],
       [8.98503538e-01, 1.01496462e-01],
       [3.21273555e-01, 6.78726445e-01]])
## KNN
knn = KNeighborsClassifier(n_neighbors=20,algorithm='kd_tree',weights='distance')
knn.fit(X_train,Y_train)

KNeighborsClassifier(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=20, p=2,
                     weights='distance')
knn_r = knn.score(X_train,Y_train)
print("KNN algorithm R: %.2f"%knn_r)

KNN algorithm R: 1.00
knn_y_predict = knn.predict(X_test)
knn_r_test = knn.score(X_test,Y_test)
print("KNN R value on the test set (accuracy): %.2f" % knn_r_test)

KNN R value on the test set (accuracy): 0.71
x_len = range(len(X_test))
plt.figure(figsize=(14,7),facecolor='w')
plt.ylim(-0.1,1.1)
plt.plot(x_len, Y_test, 'ro',markersize = 6, zorder=3, label=u'real value')
plt.plot(x_len, lr_y_predict, 'go', markersize = 10, zorder=2, label=u'Logis predict,$R^2$=%.3f' % lr.score(X_test, Y_test))
plt.plot(x_len, knn_y_predict, 'yo', markersize = 16, zorder=1, label=u'KNN predict,$R^2$=%.3f' % knn.score(X_test, Y_test))
plt.legend(loc = 'center right')
plt.xlabel(u'data num', fontsize=18)
plt.ylabel(u'isApproval(0 N,1 Y)', fontsize=18)
plt.title(u'Logistic and KNN ', fontsize=20)
plt.show()

[Figure: Logistic vs. KNN predictions on the test set (output_18_1.png)]

14 Summary

  • Linear models are generally used for regression problems; Logistic and Softmax models are generally used for classification.
  • θ is mostly solved for by gradient descent, a key tool for parameter optimization; SGD in particular suits online learning and can escape shallow local minima.
  • Logistic/Softmax regression is among the most important practical methods for solving classification problems.
  • Generalized linear models do not require the samples to be normally distributed, only to follow a distribution from the exponential family (binomial, Poisson, Bernoulli, exponential, etc.); their explanatory variables may be continuous or discrete.