[Machine Learning][Theory][Practice] Regression Algorithms

1 Regression Algorithm Concepts

  • Regression is a supervised learning algorithm.
  • Regression is one of the most commonly used machine learning algorithms. It models the relationship between the explanatory variables (features X) and the observed values (target Y). From a machine learning point of view, we build a model (a function) that maps attributes (X) to labels (Y); during training, the algorithm searches for a function $h: \mathbb{R}^d \to \mathbb{R}$ that best fits this relationship.
  • The final output of a regression model is a continuous value; the input (the attribute values) is a d-dimensional attribute/numeric vector.

2 Linear Regression

$$h_\theta(x)=\theta_0+\theta_1x_1+\dots+\theta_nx_n=\theta_0x_0+\theta_1x_1+\dots+\theta_nx_n=\sum_{i=0}^n\theta_ix_i=\theta^Tx \qquad (x_0=1)$$
The ultimate goal is to compute the values of θ and to select the optimal θ that defines the final model.

2.1 Linear Regression, Maximum Likelihood Estimation, and Least Squares

$$y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}$$
The errors $\varepsilon^{(i)}\ (1\leq i\leq m)$ are independent and identically distributed, following a Gaussian distribution with mean 0 and some fixed variance $\sigma^2$. Rationale: the central limit theorem.
In real problems, many random phenomena can be viewed as the combined effect of numerous independent factors, and such quantities tend to follow a normal distribution.

2.1.1 The Likelihood Function

$$y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}$$
$$p(\varepsilon^{(i)})=\frac{1}{\sigma \sqrt{2\pi}}\exp\left\{-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\right\}$$
$$p(y^{(i)}\mid x^{(i)};\theta)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right\}$$
$$L(\theta)=\prod_{i=1}^m p(y^{(i)}\mid x^{(i)};\theta)=\prod_{i=1}^m\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right\}$$
Log-likelihood, objective function, and least squares:
$$\ell(\theta)=\log L(\theta) = \sum_{i=1}^m\log \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right\} = m\log \frac{1}{\sigma\sqrt{2\pi}}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2$$
Maximizing the log-likelihood is therefore equivalent to minimizing the squared-error loss:
$$loss(y_j,\hat y_j)=J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2$$

2.1.2 The Least-Squares Closed-Form Solution

Closed-form solution:
$$\theta = (X^TX)^{-1}X^TY$$
Least squares requires the matrix $X^T X$ to be invertible. To guard against non-invertibility (and against overfitting), a small disturbance term can be added so that the resulting matrix is always invertible:
$$\theta = (X^TX+\lambda I)^{-1}X^Ty$$
The difficulty of solving least squares directly: computing the matrix inverse is expensive.
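As a minimal NumPy sketch (with synthetic data, so the variable names and values here are purely illustrative), the closed form and its regularized variant look like this; in practice np.linalg.solve or np.linalg.lstsq is preferred over forming the inverse explicitly:

import numpy as np

# Synthetic data: design matrix X (m x n) with a leading bias column, targets y (m,)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=100)

# Plain least squares: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
# Regularized version: theta = (X^T X + lambda*I)^{-1} X^T y
lam = 0.1
theta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y
print(theta, theta_ridge)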

2.1.3 Ordinary Least Squares: A Linear Regression Example

We have a batch of data describing household electric power usage. We fit a predictive model to it to obtain, for example, the relationship between time of day and power, and between power and current.
Data source: Individual household electric power consumption Data Set
Suggestion: use LinearRegression from sklearn.linear_model in Python's scikit-learn.
The code is as follows:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pandas import DataFrame 
import time
import pandas as pd
path = 'datas/household_power_consumption_1000.txt'
df = pd.read_csv(path,sep=';',low_memory=False)
df.head()
         Date      Time  Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3
0  16/12/2006  17:24:00                4.216                  0.418   234.84              18.4             0.0             1.0            17.0
1  16/12/2006  17:25:00                5.360                  0.436   233.63              23.0             0.0             1.0            16.0
2  16/12/2006  17:26:00                5.374                  0.498   233.29              23.0             0.0             2.0            17.0
3  16/12/2006  17:27:00                5.388                  0.502   233.74              23.0             0.0             1.0            17.0
4  16/12/2006  17:28:00                3.666                  0.528   235.68              15.8             0.0             1.0            17.0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
Date                     1000 non-null object
Time                     1000 non-null object
Global_active_power      1000 non-null float64
Global_reactive_power    1000 non-null float64
Voltage                  1000 non-null float64
Global_intensity         1000 non-null float64
Sub_metering_1           1000 non-null float64
Sub_metering_2           1000 non-null float64
Sub_metering_3           1000 non-null float64
dtypes: float64(7), object(2)
memory usage: 70.4+ KB
# Filter out abnormal data (the raw file marks missing values with '?')
new_df = df.replace('?',np.nan)
datas = new_df.dropna(axis=0,how='any')
datas.describe()
       Global_active_power  Global_reactive_power    Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3
count          1000.000000            1000.000000 1000.00000       1000.000000          1000.0     1000.000000     1000.000000
mean              2.418772               0.089232  240.03579         10.351000             0.0        2.749000        5.756000
std               1.239979               0.088088    4.08442          5.122214             0.0        8.104053        8.066941
min               0.206000               0.000000  230.98000          0.800000             0.0        0.000000        0.000000
25%               1.806000               0.000000  236.94000          8.400000             0.0        0.000000        0.000000
50%               2.414000               0.072000  240.65000         10.000000             0.0        0.000000        0.000000
75%               3.308000               0.126000  243.29500         14.000000             0.0        1.000000       17.000000
max               7.706000               0.528000  249.37000         33.200000             0.0       38.000000       19.000000
# Inspect column types
datas.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 9 columns):
Date                     1000 non-null object
Time                     1000 non-null object
Global_active_power      1000 non-null float64
Global_reactive_power    1000 non-null float64
Voltage                  1000 non-null float64
Global_intensity         1000 non-null float64
Sub_metering_1           1000 non-null float64
Sub_metering_2           1000 non-null float64
Sub_metering_3           1000 non-null float64
dtypes: float64(7), object(2)
memory usage: 78.1+ KB
## Helper to parse the date/time strings
def date_format(dt):
    # dt is a Series/tuple; dt[0] is the date, dt[1] is the time
    import time
    t = time.strptime(' '.join(dt), '%d/%m/%Y %H:%M:%S')
    return (t.tm_year, t.tm_mon, t.tm_mday, t.tm_hour, t.tm_min, t.tm_sec)
## Goal: model the relationship between time and power. Feature: time; target: power.
# Extract X and Y, converting the timestamps into numeric features
X = datas.iloc[:,0:2]
X = X.apply(lambda x: pd.Series(date_format(x)), axis=1)
Y = datas['Global_active_power']
## Split the dataset into training and test sets
# X: feature matrix (usually a DataFrame)
# Y: labels corresponding to the features (usually a Series)
# test_size: fraction of the data used for the test set, a float in (0, 1)
# random_state: seed for the random splitter; fixing it (an int) makes every split reproducible
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
X_train.describe()

            0      1           2           3           4      5
count   800.0  800.0  800.000000  800.000000  800.000000  800.0
mean   2006.0   12.0   16.598750   10.755000   29.723750    0.0
std       0.0    0.0    0.490458    8.068386   17.266517    0.0
min    2006.0   12.0   16.000000    0.000000    0.000000    0.0
25%    2006.0   12.0   16.000000    4.000000   15.000000    0.0
50%    2006.0   12.0   17.000000    8.000000   30.000000    0.0
75%    2006.0   12.0   17.000000   19.000000   45.000000    0.0
max    2006.0   12.0   17.000000   23.000000   59.000000    0.0
## Standardize the data
# StandardScaler: transforms each feature to zero mean and unit standard deviation
# In scikit-learn, an API whose name contains fit performs model training
# An API whose name contains transform converts/maps the data
# An API whose name contains predict produces predictions as output
# An API whose name combines fit and transform does both (fit first, then transform)
ss = StandardScaler()
X_train = ss.fit_transform(X_train) # fit on the training set, then transform it
X_test = ss.transform(X_test) ## apply the transformation fitted on the training data to the test set

print(X_train)

[[ 0.          0.          0.81862454 -0.83774203  1.23299681  0.        ]
 [ 0.          0.          0.81862454 -1.20979622  0.82733427  0.        ]
 [ 0.          0.         -1.22156123  1.39458314  1.52275577  0.        ]
 ...
 [ 0.          0.          0.81862454 -0.96176009  1.3489004   0.        ]
 [ 0.          0.          0.81862454 -1.08577816  0.76938248  0.        ]
 [ 0.          0.          0.81862454 -0.83774203  1.05914144  0.        ]]
pd.DataFrame(X_train).describe()

           0      1             2             3             4      5
count  800.0  800.0  8.000000e+02  8.000000e+02  8.000000e+02  800.0
mean     0.0    0.0  2.050582e-15 -5.107026e-17  5.329071e-17    0.0
std      0.0    0.0  1.000626e+00  1.000626e+00  1.000626e+00    0.0
min      0.0    0.0 -1.221561e+00 -1.333814e+00 -1.722545e+00    0.0
25%      0.0    0.0 -1.221561e+00 -8.377420e-01 -8.532677e-01    0.0
50%      0.0    0.0  8.186245e-01 -3.416698e-01  1.600918e-02    0.0
75%      0.0    0.0  8.186245e-01  1.022529e+00  8.852861e-01    0.0
max      0.0    0.0  8.186245e-01  1.518601e+00  1.696611e+00    0.0
# Train the model
lr = LinearRegression(fit_intercept=True)
lr.fit(X_train,Y_train)
# Validate the model
y_predict = lr.predict(X_test)
print("R2 on training set:",lr.score(X_train, Y_train))
print("R2 on test set:",lr.score(X_test, Y_test))
# Mean squared error; higher values mean a worse fit
mse = np.average((y_predict-Y_test)**2)
rmse = np.sqrt(mse)
print(rmse)

R2 on training set: 0.24409311805909026
R2 on test set: 0.1255162851373588
1.1640923459736248
# Print the fitted model parameters
print("Model coefficients θ:",end="")
print(lr.coef_)
print("Model intercept:", end='')
print(lr.intercept_)

Model coefficients θ:[ 0.00000000e+00  2.77555756e-16 -1.41588166e+00 -9.34953243e-01
 -1.02140756e-01  0.00000000e+00]
Model intercept:2.4454375000000024
## Save/persist the model
# One deployment option is to export the trained model; another is to export the predictions directly
# The model is usually written to a file on disk
import joblib
# The directory the model file is written to must already exist
joblib.dump(ss,'data_ss.model')
joblib.dump(lr,'data_lr.model')

['data_lr.model']
# Load the models
ss3 = joblib.load('data_ss.model')
lr3 = joblib.load('data_lr.model')

date1 = [[2019,11,11,10,7,0]]
data1 = ss3.transform(date1)
print(data1)
lr3.predict(data1)

[[ 13.          -1.         -11.42249008  -0.09363364  -1.31688203
    0.        ]]
array([18.84038214])
## Plot predicted vs. actual values
t = np.arange(len(X_test))
plt.figure(facecolor="w")
plt.plot(t,Y_test,'r-',linewidth=2,label='RealValue')
plt.plot(t,y_predict,'g-',linewidth=2,label='Predict')
plt.legend(loc='upper left')
plt.title('Relationship between time and power by Linear regression',fontsize=20)
plt.grid(b=True)
plt.show()

[Figure: predicted vs. real power values over time (output_17_0.png)]

## Relationship between power and current
X = datas.iloc[:,2:4]
X

     Global_active_power  Global_reactive_power
0                  4.216                  0.418
1                  5.360                  0.436
2                  5.374                  0.498
3                  5.388                  0.502
4                  3.666                  0.528
..                   ...                    ...
995                2.296                  0.054
996                2.292                  0.054
997                0.370                  0.000
998                0.472                  0.000
999                3.054                  0.060

1000 rows × 2 columns

Y2 = datas.iloc[:,5]
Y2

0      18.4
1      23.0
2      23.0
3      23.0
4      15.8
       ... 
995     9.6
996     9.6
997     2.4
998     2.4
999    13.4
Name: Global_intensity, Length: 1000, dtype: float64
## Split the data
X2_train,X2_test,Y2_train,Y2_test = train_test_split(X, Y2, test_size=0.2, random_state=0)

## Standardize the data
scaler2 = StandardScaler()
X2_train = scaler2.fit_transform(X2_train) # fit and transform the training set
X2_test = scaler2.transform(X2_test) ## apply the transformation fitted on the training data to the test set

## Train the model
lr2 = LinearRegression()
lr2.fit(X2_train, Y2_train) ## fit the model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
# Predict
Y2_predict = lr2.predict(X2_test)

# Evaluate the model
print("Current model score:",lr2.score(X2_test,Y2_test))
print("Current model coefficients:",lr2.coef_)

Current model score: 0.9920420609708968
Current model coefficients: [5.07744316 0.07191391]
## Plot the results
#### power vs. current
tn = np.arange(len(X2_test))
plt.figure(facecolor='w')
plt.plot(tn,Y2_test,'r-',linewidth=2,label='RealValue')
plt.plot(tn,Y2_predict,'g-',linewidth=2,label='PredictValue')
plt.legend(loc = 'upper left')
plt.title('Relationship between Power and current predicted by Linear regression',fontsize=20)
plt.grid(True)
plt.show()

[Figure: power vs. current prediction (output_25_0.png)]

2.2 Objective (Loss) Functions

  • 0-1 loss:
    $$J(\theta)=\begin{cases} 1, & Y\neq f(X)\\ 0, & Y=f(X) \end{cases}$$
  • Perceptron loss:
    $$J(\theta)=\begin{cases} 1, & |Y-f(X)|>t\\ 0, & |Y-f(X)|\leq t \end{cases}$$
  • Squared loss:

$$J(\theta)=\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

  • Absolute loss:
    $$J(\theta)=\sum_{i=1}^m\left|h_\theta(x^{(i)})-y^{(i)}\right|$$

  • Log loss (with the conventional minus sign, so it is minimized):
    $$J(\theta)=-\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)})$$

2.3 Overfitting in Linear Regression

Objective function:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
To prevent overfitting, i.e., to keep the θ values from becoming too large or too small in the sample space, a squared penalty can be added to the objective function:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 + \lambda\sum_{j=1}^n\theta_j^2$$
Regularization term (norm):
$$\lambda \sum_{j=1}^n\theta_j^2$$
This particular regularizer is called the L2-norm.

2.3.1 Overfitting and Regularization

L2-norm:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2, \qquad \lambda>0$$
L1-norm:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n|\theta_j|, \qquad \lambda>0$$
Ridge regression:
Linear regression with the L2 penalty is called Ridge regression:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2, \qquad \lambda>0$$
LASSO regression:
Linear regression with the L1 penalty is called LASSO regression (Least Absolute Shrinkage and Selection Operator):
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n|\theta_j|, \qquad \lambda>0$$
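A quick scikit-learn sketch of both penalties (reusing the X_train/Y_train split from the household-power example above; alpha plays the role of λ in the formulas, and the values here are illustrative):

from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(X_train, Y_train)  # L2 penalty
lasso = Lasso(alpha=0.1).fit(X_train, Y_train)  # L1 penalty
print(ridge.coef_)
print(lasso.coef_)  # typically contains exact zeros (a sparse solution)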

2.4 Ridge (L2-norm) vs. LASSO (L1-norm)

  • With the L2-norm, the parameters in every dimension are shrunk within a ball, so no parameter can be driven exactly to 0 and the solution is never sparse. In practice, the data dimensions contain noise and redundancy; a sparse solution keeps the useful dimensions and removes the redundant ones, improving the accuracy and robustness of the regression (less overfitting). The L1-norm can deliver this sparsity.
  • Ridge models tend to be more accurate, robust, and stable; LASSO models are faster to solve.
  • If you want both stability and solving speed, use Elastic Net.

2.5 Elastic Net

A linear regression model that uses both the L1 and L2 penalties is called the Elastic Net:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\left\{p\sum_{j=1}^n|\theta_j|+(1-p)\sum_{j=1}^n\theta_j^2\right\}, \qquad \lambda >0,\ p\in[0,1]$$
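In scikit-learn this corresponds to ElasticNet, where l1_ratio plays the role of p; a hedged sketch with illustrative values, again reusing the earlier X_train/Y_train:

from sklearn.linear_model import ElasticNet

# l1_ratio = 1.0 recovers LASSO, l1_ratio = 0.0 recovers Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, Y_train)
print(enet.coef_)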

3 Evaluating Model Performance

$$MSE = \frac{1}{m}\sum_{i=1}^m(y_i-\hat y_i)^2$$
$$RMSE = \sqrt{MSE}=\sqrt{\frac{1}{m}\sum_{i=1}^m(y_i-\hat y_i)^2}$$
$$R^2 = 1-\frac{RSS}{TSS}=1-\frac{\sum_{i=1}^m(y_i-\hat y_i)^2}{\sum_{i=1}^m(y_i-\bar y)^2}, \qquad \bar y = \frac{1}{m}\sum_{i=1}^m y_i$$

  • MSE: mean squared error; the closer to 0, the better the model fits the training data.
  • RMSE: the square root of MSE; interpreted the same way.
  • R²: ranges over (−∞, 1]; larger values mean a better fit, with 1 being optimal. It can be negative when the model predicts essentially random values; if the model always predicts the sample mean, R² is 0.
  • TSS: Total Sum of Squares, measuring the spread among the samples; it is m times the (biased) sample variance.
  • RSS: Residual Sum of Squares, measuring the discrepancy between predictions and samples; it is m times the MSE (a NumPy sketch of these metrics follows).
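These metrics are a few lines of NumPy (sklearn.metrics offers equivalent mean_squared_error and r2_score functions); a minimal sketch:

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def r2(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares (m * MSE)
    tss = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - rss / tss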

4 Hyperparameter Tuning

  • In practice, for models such as linear regression we need values for θ, λ, and p. Solving for θ is the model fitting itself and generally needs no developer involvement (the algorithm handles it); what actually needs tuning are λ and p. This process is called hyperparameter tuning.
  • Cross-validation: split the training data into several folds, holding one fold out for validation to pick the best hyperparameters λ and p; e.g., ten-fold or five-fold cross-validation (scikit-learn's default). A sketch with GridSearchCV follows.
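A minimal sketch of tuning λ by cross-validation with GridSearchCV (Ridge's alpha corresponds to λ; the grid and fold count are illustrative, reusing the earlier X_train/Y_train):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(Ridge(),
                    param_grid={'alpha': np.logspace(-3, 2, 10)},  # candidate lambdas
                    cv=5)                                          # five-fold CV
grid.fit(X_train, Y_train)
print(grid.best_params_)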

5 Gradient Descent

Solving the objective for θ:
$$J(\theta)=\frac{1}{2}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
Initialize θ (randomly, or simply to 0).
Iterate along the negative gradient; each update of θ makes J(θ) smaller:
$$\theta = \theta-\alpha \frac{\partial J(\theta)}{\partial\theta}, \qquad \alpha:\ \text{learning rate (step size)}$$

5.1 The Gradient Direction

$$\frac{\partial}{\partial \theta_j}J(\theta)=\frac{\partial}{\partial \theta_j}\frac{1}{2}(h_\theta(x)-y)^2 = 2\cdot\frac{1}{2}(h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}(h_\theta(x)-y) = (h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^n\theta_ix_i-y\right) = (h_\theta(x)-y)\,x_j$$

5.2 Batch Gradient Descent (BGD)

For a single sample, $\frac{\partial}{\partial \theta_j}J(\theta)=(h_\theta(x)-y)x_j$; summing over all m samples,
$$\frac{\partial J(\theta)}{\partial\theta_j}=\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$
so the update rule is
$$\theta_j= \theta_j+\alpha\sum_{i=1}^m\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
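A minimal NumPy sketch of the batch update above (X is assumed to carry a bias column; the learning rate and iteration count are illustrative):

import numpy as np

def bgd(X, y, alpha=0.001, n_iter=1000):
    """Batch gradient descent for the least-squares objective."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)  # full gradient over all m samples
        theta -= alpha * grad         # step along the negative gradient
    return theta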

5.3 Stochastic Gradient Descent (SGD)

$$\frac{\partial}{\partial \theta_j}J(\theta)=(h_\theta(x)-y)x_j$$
for i = 1 to m {
$$\theta_j = \theta_j+\alpha\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
}

5.4 BGD vs. SGD

  • SGD is faster than BGD (it needs fewer passes over the data).
  • In some situations (multiple local optima exist / J(θ) is not quadratic), SGD may jump out of small local optima, so it is not necessarily worse than BGD.
  • BGD is guaranteed to reach a local optimum (for linear regression, the global optimum); because of its randomness, SGD may end up with a worse result than BGD.
  • Note: prefer SGD.

5.5 Mini-Batch Gradient Descent (MBGD)

If we want the training process to be fast while still keeping the final parameters accurate, Mini-batch Gradient Descent (MBGD) is designed for exactly that. Instead of updating after every single sample, MBGD uses the average gradient of b samples (b is typically 10) as the update direction:
for i = 1, 11, 21, ..., m−9 {
$$\theta_j =\theta_j+\alpha \sum_{k=i}^{i+9}\left(y^{(k)}-h_\theta(x^{(k)})\right)x_j^{(k)}$$
}
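A hedged NumPy sketch of the mini-batch loop (shuffling each epoch; b=1 recovers SGD and b=m recovers BGD; all constants are illustrative):

import numpy as np

def mbgd(X, y, alpha=0.001, b=10, n_epochs=50, seed=0):
    """Mini-batch gradient descent for least squares."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)               # reshuffle every epoch
        for start in range(0, m, b):
            batch = idx[start:start + b]
            grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
            theta -= alpha * grad              # average gradient of b samples
    return theta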

5.6 Summary

Because gradient descent moves along the negative gradient, the value it converges to may only be a local optimum, so some tuning strategies are usually needed:

  • Learning rate: too large, and each update changes too much, possibly jumping over the optimum; too small, and each update changes too little, making convergence painfully slow.
  • Initial parameter values: different initializations can reach different minima, since gradient descent finds local optima. It is therefore common to run the algorithm from several different initial values and keep the result with the smallest loss.
  • Standardization: features with different value ranges can cause the parameters to converge at different speeds. Standardizing the features reduces this effect.

5.6.1 Differences among BGD, SGD, and MBGD

  • With m samples, one pass over the data updates the parameters once in BGD, m times in SGD, and m/n times in MBGD (n being the batch size); SGD updates fastest.
  • SGD updates the parameters on every sample, so an anomalous sample can push an update in the wrong direction. As a result, SGD does not converge exactly; it oscillates around the optimum.
  • Because SGD updates once per sample, it is particularly suitable for very large datasets and for online machine learning (Online ML).

6 Extending Linear Regression

  • Linear regression is linear with respect to θ; with respect to the samples themselves, the relationship may be nonlinear.
  • That is, the learned function f: x → y may be nonlinear in x, e.g., a curve — see the sketch below.
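A minimal sketch of this idea with scikit-learn: expand the features polynomially, then fit an ordinary linear model (the degree here is illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Still linear in theta, but nonlinear in x thanks to the feature expansion
model = Pipeline([
    ('poly', PolynomialFeatures(degree=3)),
    ('linear', LinearRegression()),
])
# model.fit(x.reshape(-1, 1), y) would now fit a cubic curve to 1-D data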

7 Linear Regression Summary

  • Models: linear regression (Linear), Ridge regression, LASSO regression, Elastic Net
  • Regularization: L1-norm, L2-norm
  • Loss/objective function: $J(\theta)=\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \to \min_\theta J(\theta)$
  • Ways to solve for θ: least squares (direct computation; the objective must be the squared-error loss) and gradient descent (BGD/SGD/MBGD)

8 Locally Weighted Regression

8.1 Locally Weighted Regression: Loss Function

  • Ordinary linear regression loss:
    $$J(\theta)=\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
  • Locally weighted regression loss:
    $$J(\theta)=\sum_{i=1}^m w^{(i)}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

8.2 Locally Weighted Regression: Choosing the Weights

$w^{(i)}$ is a weight assigned to each point in the dataset based on its distance to the point being predicted: the farther a point is from the query point, the smaller its weight. A common choice is
$$w^{(i)}=\exp\left\{-\frac{(x^{(i)}-\bar x)^2}{2k^2}\right\}$$
This is an exponential decay function, where k is the bandwidth parameter controlling how quickly the weights fall off with distance.
Note: this technique is essentially about similarity between samples; the topic returns with SVMs (kernel functions).
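A minimal NumPy sketch of prediction at a single query point, assuming X carries a bias column; it solves the weighted normal equation θ = (XᵀWX)⁻¹XᵀWy implied by the weighted loss above:

import numpy as np

def lwr_predict(x_query, X, y, k=1.0):
    """Locally weighted regression prediction for one query point x_query."""
    # Exponentially decaying weights, as in the formula above
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * k ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y  # weighted normal equation
    return x_query @ theta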

9 Comprehensive Regression Case (2): Boston Housing Price Prediction

Build a housing price prediction model on the Boston housing data using two algorithms, LASSO regression and Ridge regression. For each, find the best configuration among polynomial degrees 1/2/3, and compare the two algorithms. Also use LASSO for feature selection. (The dataset is easy to find online.)

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
def notEmpty(s):
    return s != ''
## Load the data
names = ['CRIM','ZN', 'INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
path = "boston_housing.data"
## The file's column layout is not uniform, so read each line as a single field first, then process it row by row
fd = pd.read_csv(path,header=None)
fd.head()
                                               0
0  0.00632 18.00 2.310 0 0.5380 6.5750 65...
1  0.02731 0.00 7.070 0 0.4690 6.4210 78...
2  0.02729 0.00 7.070 0 0.4690 7.1850 61...
3  0.03237 0.00 2.180 0 0.4580 6.9980 45...
4  0.06905 0.00 2.180 0 0.4580 7.1470 54...
## Load the data
names = ['CRIM','ZN', 'INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
path = "boston_housing.data"
## The file's column layout is not uniform, so read each line as a single field first, then process it row by row
fd = pd.read_csv(path,header=None)
# print (fd.shape)
data = np.empty((len(fd), 14))
for i, d in enumerate(fd.values): # enumerate yields the index i and the element d

    d = map(float, filter(notEmpty, d[0].split(' '))) # filter takes a function and a list,
    
    # keeping only the items for which the function returns True
    data[i] = list(d)
    
## Split the data
x, y = np.split(data, (13,), axis=1)
print (x[0:5])
y = y.ravel() # flatten to 1-D
print (y[0:5])
ly=len(y)
print(y.shape)
print ("Number of samples: %d, number of features: %d" % x.shape)
print ("Number of target values: %d" % y.shape[0])
[[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
  6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02
  4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
  7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
  9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
  4.0300e+00]
 [3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00
  4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9463e+02
  2.9400e+00]
 [6.9050e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 7.1470e+00
  5.4200e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9690e+02
  5.3300e+00]]
[24.  21.6 34.7 33.4 36.2]
(506,)
Number of samples: 506, number of features: 13
Number of target values: 506
## Pipelines are commonly combined with grid search for tuning
models = [
    Pipeline([
            ('ss', StandardScaler()),
            ('poly', PolynomialFeatures()),
            ('linear', RidgeCV(alphas=np.logspace(-3,1,20)))
        ]),
    Pipeline([
            ('ss', StandardScaler()),
            ('poly', PolynomialFeatures()),
            ('linear', LassoCV(alphas=np.logspace(-3,1,20)))
        ])
] 

# Parameter grid: keys are parameter names, values are candidate lists
parameters = {
    "poly__degree": [3,2,1], 
    "poly__interaction_only": [True, False], # True: interaction terms only, no pure powers such as X1*X1
    "poly__include_bias": [True, False], # True: include the degree-0 (constant) feature as the model intercept
    "linear__fit_intercept": [True, False]
}

# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
## Compare the Lasso and Ridge models and plot the results
titles = ['Ridge', 'Lasso']
colors = ['g-', 'b-']
plt.figure(figsize=(16,8), facecolor='w')
ln_x_test = range(len(x_test))

plt.plot(ln_x_test, y_test, 'r-', lw=2, label=u'Real value')
for t in range(2):
    # Build the model and set its parameters
    # GridSearchCV: cross-validation to pick the best parameter values
    # first argument: the model to tune
    # param_grid: the parameter grid, a dict; cv: number of cross-validation folds
    model = GridSearchCV(models[t], param_grid=parameters,cv=5, n_jobs=1) # five-fold cross-validation
    # Train with grid search
    model.fit(x_train, y_train)
    # Retrieve the best parameters
    print ("%s best parameters:" % titles[t],model.best_params_)
    print ("%s R value=%.3f" % (titles[t], model.best_score_))
    # Predict
    y_predict = model.predict(x_test)
    # Plot
    plt.plot(ln_x_test, y_predict, colors[t], lw = t + 3, label=u'%s estimate, $R^2$=%.3f' % (titles[t],model.best_score_))
# Show the figure
plt.legend(loc = 'upper left')
plt.grid(True)
plt.title(u"Boston housing price prediction")
plt.show()
plt.show()
Ridge best parameters: {'linear__fit_intercept': True, 'poly__degree': 2, 'poly__include_bias': True, 'poly__interaction_only': True}
Ridge R value=0.874
Lasso best parameters: {'linear__fit_intercept': False, 'poly__degree': 3, 'poly__include_bias': True, 'poly__interaction_only': True}
Lasso R value=0.857
## Train a single Lasso model for feature selection (degree-1 features) <using the best degree-1 parameters found above>
model = Pipeline([
            ('ss', StandardScaler()),
            ('poly', PolynomialFeatures(degree=1, include_bias=False, interaction_only=True)),
            ('linear', LassoCV(alphas=np.logspace(-3,1,20), fit_intercept=False))
        ])
# Train the model
model.fit(x_train, y_train)


9.1 Model Evaluation

9.1.1 Output

print ("Coefficients:", list(zip(names,model.get_params('linear')['linear'].coef_)))
print ("Intercept:", model.get_params('linear')['linear'].intercept_)


    Coefficients: [('CRIM', -0.0), ('ZN', 0.0), ('INDUS', -0.0), ('CHAS', 0.0), ('NOX', -0.0), ('RM', 2.290127774392436), ('AGE', -0.0), ('DIS', 0.0), ('RAD', -0.0), ('TAX', -0.0), ('PTRATIO', -1.5607620229769363), ('B', 0.0), ('LSTAT', -3.523934059194476)]
    Intercept: 0.0

In practice, when a linear model's coefficient is close to 0, the corresponding feature carries little decision-relevant information for the model, so it can be dropped. When pruning by hand, a common rule of thumb is to drop features whose coefficients are below 1e-4, as sketched below.
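As a sketch of that rule of thumb applied to the model above (reusing model and names from this case):

coef = model.get_params('linear')['linear'].coef_
keep = [name for name, c in zip(names, coef) if abs(c) > 1e-4]
print("Selected features:", keep)  # for the run above: RM, PTRATIO, LSTAT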

10 Comprehensive Regression Case (3): Wine Quality Prediction

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
path1 = "winequality-red.csv"
df1 = pd.read_csv(path1,sep=";")
df1.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
df1['type'] = 1
path2 = "winequality-white.csv"
df2 = pd.read_csv(path2,sep=";")
df2["type"] = 2
df2.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  type
0            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45      8.8        6     2
1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49      9.5        6     2
2            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44     10.1        6     2
3            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6     2
4            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6     2
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
type                    4898 non-null int64
dtypes: float64(11), int64(2)
memory usage: 497.6 KB
df = pd.concat([df1,df2],axis=0)
df.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  type
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5     1
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5     1
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5     1
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6     1
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5     1
names = ["fixed acidity","volatile acidity","citric acid",
         "residual sugar","chlorides","free sulfur dioxide",
         "total sulfur dioxide","density","pH","sulphates",
         "alcohol", "type"]
quality = "quality"
names1 = []
for i in list(df):
    names1.append(i)
names1
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality',
 'type']
new_df = df.replace("?",np.nan)
datas = new_df.dropna(how='any')
X = datas[names]
Y = datas[quality]
Y.ravel()
array([5, 5, 5, ..., 6, 7, 6])
models = [
    Pipeline([
        ("Poly",PolynomialFeatures()),
        ("Linear",LinearRegression())
    ]),
    Pipeline([
        ("Poly",PolynomialFeatures()),
        ("Linear",RidgeCV(alphas = np.logspace(-4,2,20)))
    ]),
    Pipeline([
        ("Poly",PolynomialFeatures()),
        ("Linear",LassoCV(alphas=np.logspace(-4,2,20)))
    ]),
    Pipeline([
        ("Poly",PolynomialFeatures()),
        ("Linear",ElasticNetCV(alphas=np.logspace(-4,2,20),l1_ratio=np.linspace(0,1,5)))
    ])
]
plt.figure(figsize=(16,8),facecolor='w')
titles = "Line predict","Ridge predict","Lasso predict","ElasticNet predict"
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.01,random_state=0)
ln_x_test = range(len(X_test))
d_pool = np.arange(1,4,1)
m = len(d_pool)
clrs = []
for c in np.linspace(5570560,255,m):
    clrs.append('#%06x'% int(c))
    
for t in range(4):
    plt.subplot(2,2,t+1)
    model = models[t]
    plt.plot(ln_x_test,Y_test,c='r',lw=2,alpha=0.75,zorder=10,label='Real value')
    for i,d in enumerate(d_pool):
        model.set_params(Poly__degree=d)
        model.fit(X_train,Y_train)
        Y_pre = model.predict(X_test)
        R = model.score(X_train,Y_train)
        lin = model.get_params('Linear')['Linear']
        plt.plot(ln_x_test,Y_pre,c=clrs[i],lw=2,alpha=0.75,zorder=i,label='%d Predict Value,$R^2$=%.3f' % (d,R))
        
    plt.legend(loc='upper left')
    plt.grid(True)
    plt.title(titles[t],fontsize=18)
    plt.xlabel('X',fontsize=16)
    plt.ylabel('Y',fontsize=16)
plt.suptitle('wine quality predict',fontsize=22)
plt.show()


11 Logistic Regression

The logistic/sigmoid function:
$$p = h_\theta(x)=g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}$$
$$g(z)=\frac{1}{1+e^{-z}}$$
$$g'(z)=\left(\frac{1}{1+e^{-z}}\right)'=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=\frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right)=g(z)(1-g(z))$$

11.1 Logistic Regression and Its Likelihood Function

Assume:
$$P(y=1\mid x;\theta)=h_\theta(x)$$
$$P(y=0\mid x;\theta) = 1- h_\theta(x)$$
$$P(y\mid x;\theta) = (h_\theta(x))^y(1-h_\theta(x))^{1-y}$$
Likelihood:
$$L(\theta) = p(\vec y\mid X;\theta) =\prod_{i=1}^m p(y^{(i)}\mid x^{(i)};\theta)=\prod_{i=1}^m(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}$$
Log-likelihood:
$$\ell(\theta) = \log L(\theta)=\sum_{i=1}^m\left(y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta (x^{(i)}))\right)$$

11.2 The Gradient of the Log-Likelihood

$$\frac{\partial\ell(\theta)}{\partial \theta_j}=\sum_{i=1}^m\left(y^{(i)}-g(\theta^Tx^{(i)})\right)x_j^{(i)}$$

11.3 Solving for θ

The θ parameters of logistic regression are found by gradient ascent on the log-likelihood (analogous to gradient descent); the batch and stochastic updates are:
$$\theta_j = \theta_j+\alpha\sum_{i=1}^m\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
$$\theta_j = \theta_j+\alpha\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
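A minimal NumPy sketch of the stochastic update above (gradient ascent on the log-likelihood; X is assumed to carry a bias column and y to be 0/1; the constants are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, alpha=0.1, n_epochs=100, seed=0):
    """Stochastic gradient ascent for logistic regression."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):  # one sample at a time
            theta += alpha * (y[i] - sigmoid(X[i] @ theta)) * X[i]
    return theta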

11.4 Maximum Likelihood and the Logistic Loss Function

$$L(\theta) = \prod_{i=1}^m p(y^{(i)}\mid x^{(i)};\theta)=\prod_{i=1}^m p_i^{y^{(i)}}(1-p_i)^{1-y^{(i)}}, \qquad p_i = h_\theta(x^{(i)})=\frac{1}{1+e^{-\theta^Tx^{(i)}}}$$
$$\ell(\theta)=\ln L(\theta) = \sum_{i=1}^m\ln \left[p_i^{y^{(i)}}(1-p_i)^{1-y^{(i)}}\right]$$
$$loss=-\ell(\theta)=\sum_{i=1}^m\left[-y^{(i)}\ln(h_\theta (x^{(i)}))-(1-y^{(i)})\ln(1-h_\theta(x^{(i)}))\right]$$

11.5 Logistic Case (1): Breast Cancer Classification

Predict breast cancer from pathology data (class 4 = malignant, class 2 = benign), building the model with logistic regression.

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LogisticRegressionCV,LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
path = "breast-cancer-wisconsin.data"
names = ['id','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape',
         'Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei',
        'Bland Chromatin','Normal Nucleoli','Mitoses','Class']
df = pd.read_csv(path,header=None,names=names)
datas = df.replace('?',np.nan).dropna(how='any')
datas.head()
        id  Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class
0  1000025                5                        1                         1                  1                            2            1                3                1        1      2
1  1002945                5                        4                         4                  5                            7           10                3                2        1      2
2  1015425                3                        1                         1                  1                            2            2                3                1        1      2
3  1016277                6                        8                         8                  1                            3            4                3                7        1      2
4  1017023                4                        1                         1                  3                            2            1                3                1        1      2
X = datas[names[1:10]]
Y = datas[names[10]]
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.1,random_state=0)
ss =StandardScaler()
X_train = ss.fit_transform(X_train)
# 3. Build and train the model
## penalty: regularization to combat overfitting, l1 or l2
## solver: the optimization method
### with penalty='l1' the only valid solver is liblinear (coordinate descent);
### lbfgs and newton-cg rely on a second-order Taylor expansion of the objective
### with penalty='l2' the solver can be lbfgs (quasi-Newton), newton-cg (a Newton variant), or sag (mini-batch)
# lbfgs works well below ~10000 dimensions; above that newton-cg tends to do better; on GPUs, lbfgs and newton-cg are both faster than sag
## multi_class: classification strategy, ovr (default) or multinomial; identical for binary problems, different for multi-class ones
### ovr: one-vs-rest — treat the multi-class problem as a series of binary problems, iterating over the classes
### multinomial: many-vs-many (MVM), i.e. Softmax-style classification
## class_weight: per-class weights

### Note: logistic regression is a classification algorithm and cannot be used for regression (the y passed to the model must be int, not float)
lr = LogisticRegressionCV(multi_class='ovr',fit_intercept=True, Cs=np.logspace(-2, 2, 20), cv=2, penalty='l2', solver='lbfgs', tol=0.01)
re=lr.fit(X_train, Y_train)
r = re.score(X_train,Y_train)
print ("R value (accuracy):", r)
print ("Sparse feature ratio: %.2f%%" % (np.mean(lr.coef_.ravel() == 0) * 100))
print ("Coefficients:",re.coef_)
print ("Intercept:",re.intercept_)
print(re.predict_proba(X_test)) # probabilities from the sigmoid; note that strictly X_test should be standardized first (it is only transformed further below)
R value (accuracy): 0.9706840390879479
Sparse feature ratio: 0.00%
Coefficients: [[1.3926311  0.17397478 0.65749877 0.8929026  0.36507062 1.36092964
  0.91444624 0.63198866 0.75459326]]
Intercept: [-1.02717163]
[[6.61838068e-06 9.99993382e-01]
 [3.78575185e-05 9.99962142e-01]
 [2.44249065e-15 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [1.52850624e-03 9.98471494e-01]
 [6.67061684e-05 9.99933294e-01]
 [6.75536843e-07 9.99999324e-01]
 [0.00000000e+00 1.00000000e+00]
 [2.43117004e-05 9.99975688e-01]
 [6.13092842e-04 9.99386907e-01]
 [0.00000000e+00 1.00000000e+00]
 [2.00330728e-06 9.99997997e-01]
 [0.00000000e+00 1.00000000e+00]
 [3.78575185e-05 9.99962142e-01]
 [4.65824155e-08 9.99999953e-01]
 [5.47788703e-10 9.99999999e-01]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [6.27260778e-07 9.99999373e-01]
 [3.78575185e-05 9.99962142e-01]
 [3.85098865e-06 9.99996149e-01]
 [1.80189197e-12 1.00000000e+00]
 [9.44640398e-05 9.99905536e-01]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [4.11688915e-06 9.99995883e-01]
 [1.85886872e-05 9.99981411e-01]
 [5.83016713e-06 9.99994170e-01]
 [0.00000000e+00 1.00000000e+00]
 [1.52850624e-03 9.98471494e-01]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [1.51713085e-05 9.99984829e-01]
 [2.34685008e-05 9.99976531e-01]
 [1.51713085e-05 9.99984829e-01]
 [0.00000000e+00 1.00000000e+00]
 [0.00000000e+00 1.00000000e+00]
 [2.34685008e-05 9.99976531e-01]
 [0.00000000e+00 1.00000000e+00]
 [9.97563915e-07 9.99999002e-01]
 [1.70686321e-07 9.99999829e-01]
 [1.38382134e-04 9.99861618e-01]
 [1.36080718e-04 9.99863919e-01]
 [1.52850624e-03 9.98471494e-01]
 [1.68154251e-05 9.99983185e-01]
 [6.66097483e-04 9.99333903e-01]
 [0.00000000e+00 1.00000000e+00]
 [9.77502258e-07 9.99999022e-01]
 [5.83016713e-06 9.99994170e-01]
 [0.00000000e+00 1.00000000e+00]
 [4.09496721e-06 9.99995905e-01]
 [0.00000000e+00 1.00000000e+00]
 [1.37819117e-06 9.99998622e-01]
 [6.27260778e-07 9.99999373e-01]
 [4.52734741e-07 9.99999547e-01]
 [0.00000000e+00 1.00000000e+00]
 [8.88178420e-16 1.00000000e+00]
 [1.06976766e-08 9.99999989e-01]
 [0.00000000e+00 1.00000000e+00]
 [2.45780192e-04 9.99754220e-01]
 [3.92389040e-04 9.99607611e-01]
 [6.10681985e-05 9.99938932e-01]
 [9.44640398e-05 9.99905536e-01]
 [1.51713085e-05 9.99984829e-01]
 [2.45780192e-04 9.99754220e-01]
 [2.45780192e-04 9.99754220e-01]
 [1.51713085e-05 9.99984829e-01]
 [0.00000000e+00 1.00000000e+00]]
X_test = ss.transform(X_test)
Y_predict = re.predict(X_test)
x_len = range(len(X_test))
plt.figure(figsize=(14,7),facecolor='w')
plt.ylim(0,6)
plt.plot(x_len,Y_test,'ro',markersize=8,zorder=3,label='real value')
plt.plot(x_len,Y_predict,'go',markersize=14,zorder=2,label='predcit value,$R^2$=%.3f' % re.score(X_test,Y_test))
plt.legend(loc='upper left')
plt.xlabel('data num',fontsize=18)
plt.ylabel('cancer catelog',fontsize=18)
plt.title('Classification of data by Logistic regression algorithm',fontsize=20)
plt.show()


12 Softmax Regression

  • Softmax regression generalizes logistic regression to K-class problems; class k has a parameter vector $\theta_k$, and together the vectors form a matrix $\theta_{k\times n}$.
  • The softmax function maps an arbitrary K-dimensional real vector to another K-dimensional vector whose entries all lie in (0, 1).
  • The softmax class-probability function (see the sketch below):
    $$p(y=k\mid x;\theta)=\frac{e^{\theta_k^{T}x}}{\sum_{l=1}^K e^{\theta_l^{T}x}},\qquad k=1,2,...,K$$
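A minimal NumPy sketch of the softmax mapping itself (subtracting the maximum score first is a standard numerical-stability trick, not part of the formula):

import numpy as np

def softmax(scores):
    """Map K real scores theta_k^T x to K probabilities in (0, 1) summing to 1."""
    scores = scores - np.max(scores)  # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # e.g. [0.659 0.242 0.099]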

12.1 Softmax Case (1): Wine Quality Classification

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings
import sklearn
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import label_binarize
from sklearn import metrics
path1 = "winequality-red.csv"
df1 = pd.read_csv(path1,sep=";")
df1["type"]=1

path2 = "winequality-white.csv"
df2 = pd.read_csv(path2,sep=";")
df2["type"]=2

df = pd.concat([df1,df2],axis=0)

names = ["fixed acidity","volatile acidity","citric acid",
         "residual sugar","chlorides","free sulfur dioxide",
         "total sulfur dioxide","density","pH","sulphates",
         "alcohol", "type"]
quality = "quality"
df.head(5)
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  type
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5     1
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5     1
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5     1
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6     1
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5     1
new_df = df.replace('?',np.nan)
datas = new_df.dropna(how='any')
print ("Original rows: %d; rows after removing anomalies: %d; anomalous rows: %d" % (len(df), len(datas), len(df) - len(datas)))
X = datas[names]
Y = datas[quality]
Original rows: 6497; rows after removing anomalies: 6497; anomalous rows: 0
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25,random_state=0)
print ("Training rows: %d; number of features: %d; test rows: %d" % (X_train.shape[0], X_train.shape[1], X_test.shape[0]))
Training rows: 4872; number of features: 12; test rows: 1625
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
Y_train.value_counts()

6    2132
5    1606
7     805
4     161
8     146
3      20
9       2
Name: quality, dtype: int64
lr = LogisticRegressionCV(fit_intercept=True, Cs=np.logspace(-5, 1, 100), 
                          multi_class='multinomial', penalty='l2', solver='lbfgs')
lr.fit(X_train, Y_train)

LogisticRegressionCV(Cs=array([1.00000000e-05, 1.14975700e-05, 1.32194115e-05, 1.51991108e-05,
       1.74752840e-05, 2.00923300e-05, 2.31012970e-05, 2.65608778e-05,
       3.05385551e-05, 3.51119173e-05, 4.03701726e-05, 4.64158883e-05,
       5.33669923e-05, 6.13590727e-05, 7.05480231e-05, 8.11130831e-05,
       9.32603347e-05, 1.07226722e-04, 1.23284674e-04, 1.41747416e-04,
       1.62975083e-04, 1.87...
       3.76493581e+00, 4.32876128e+00, 4.97702356e+00, 5.72236766e+00,
       6.57933225e+00, 7.56463328e+00, 8.69749003e+00, 1.00000000e+01]),
                     class_weight=None, cv='warn', dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='multinomial', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)
r = lr.score(X_train, Y_train)
print("R value:", r)
print("Sparse feature ratio: %.2f%%" % (np.mean(lr.coef_.ravel() == 0) * 100))
print("Coefficients:",lr.coef_)
print("Intercept:",lr.intercept_)
print("Probabilities:", lr.predict_proba(X_test)) # class probabilities; note that strictly X_test should be standardized first (it is only transformed further below)

R value: 0.5447454844006568
Sparse feature ratio: 0.00%
Coefficients: [[ 2.33356993e-01  3.96147394e-01 -1.10711540e-01 -1.05855564e-01
   2.61801820e-01  3.56963863e-01  4.81906313e-02  1.94290542e-02
   2.25156551e-02 -1.94888814e-01 -8.98233422e-02  2.92866349e-03]
 [-1.33155433e-02  6.38796366e-01 -1.83921908e-02 -2.75146255e-01
   1.28019817e-01 -6.41981628e-01  6.48908149e-02  1.48959271e-01
   1.64274546e-02 -1.48473096e-01 -4.64261683e-01  6.99712789e-01]
 [-2.68229854e-01  2.62994302e-01  3.93147113e-02 -3.70401547e-01
   5.55961741e-02 -1.48023941e-01  2.95386175e-01  3.18642259e-01
  -2.07857133e-01 -1.78047228e-01 -8.24149481e-01 -2.39410311e-01]
 [-1.80030744e-01 -3.53604839e-01 -4.18188438e-02 -2.46998610e-02
   7.18370492e-04  5.92537487e-02 -7.33975434e-02  1.36161552e-01
  -1.09165220e-01  6.09787017e-02  1.61650854e-02 -2.23098277e-01]
 [ 1.52461941e-01 -6.36661173e-01 -4.56938032e-02  3.70505346e-01
  -2.27488717e-01  1.54955369e-01 -2.11312147e-01 -3.30908669e-01
   1.19653456e-01  3.13549620e-01  5.31754130e-01 -2.75147048e-01]
 [ 8.92749538e-03 -2.72849255e-01  8.55984157e-02  3.81082429e-01
  -1.71045984e-01  2.24782325e-01 -1.25358442e-01 -2.75522717e-01
   1.14952529e-01  1.95264228e-01  7.56680057e-01 -6.66454923e-03]
 [ 6.68297119e-02 -3.48227951e-02  9.17032513e-02  2.45154516e-02
  -4.76014817e-02 -5.94973687e-03  1.60051072e-03 -1.67607497e-02
   4.34732578e-02 -4.83834115e-02  7.36352343e-02  4.16787321e-02]]
Intercept: [-1.97619117 -0.06391344  2.29175069  2.85891982  1.44293004 -0.40871583
 -4.1447801 ]
Probabilities: [[6.96038503e-06 1.34095016e-14 9.99993040e-01 ... 7.85580648e-17
  8.48744768e-12 6.17261950e-12]
 [8.28176713e-01 2.60641243e-14 1.71744247e-01 ... 3.15689985e-09
  7.89359197e-05 7.49868933e-08]
 [3.60210355e-04 8.25475749e-29 9.99639790e-01 ... 4.25877211e-24
  2.22228431e-15 8.64138861e-17]
 ...
 [6.29409692e-09 9.60374419e-19 9.99999994e-01 ... 1.69431710e-24
  1.15901186e-17 3.45570149e-16]
 [2.61326510e-03 2.64497823e-11 9.97386416e-01 ... 4.48053758e-11
  3.01796409e-07 1.52629118e-08]
 [2.26350417e-02 4.02079320e-26 9.77364958e-01 ... 3.02503793e-20
  2.52993652e-12 2.03862697e-14]]
print("概率:", lr.predict_proba(X_test).shape) # 获取sigmoid函数返回的概率值

概率: (1625, 7)
X_test = ss.transform(X_test)

Y_predict = lr.predict(X_test)

x_len = range(len(X_test))
plt.figure(figsize=(14,7),facecolor='w')
plt.ylim(-1,11)
plt.plot(x_len,Y_test,'ro',markersize=8,zorder=3,label='real value')
plt.plot(x_len,Y_predict,'go',markersize=12,zorder=2,label='predict value')
plt.legend(loc = 'upper left')
plt.xlabel('data num')
plt.ylabel('wine quality')
plt.title('wine predict')
plt.show()


13 Comprehensive Classification Case (1): Credit Approval

Classify users for credit approval based on credit data, building models with both the logistic regression and KNN algorithms, and compare the performance of the two.

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import warnings

import sklearn
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
pd.read_csv("crx.data",header=None)
    0      1       2  3  4   5   6     7  8  9  10 11 12     13   14 15
0   b  30.83   0.000  u  g   w   v  1.25  t  t   1  f  g  00202    0  +
1   a  58.67   4.460  u  g   q   h  3.04  t  t   6  f  g  00043  560  +
2   a  24.50   0.500  u  g   q   h  1.50  t  f   0  f  g  00280  824  +
3   b  27.83   1.540  u  g   w   v  3.75  t  t   5  t  g  00100    3  +
4   b  20.17   5.625  u  g   w   v  1.71  t  f   0  f  s  00120    0  +
..  ..    ...     ... .. ..  ..  ..   ... .. ..  .. .. ..    ...  ... ..
685 b  21.08  10.085  y  p   e   h  1.25  f  f   0  f  g  00260    0  -
686 a  22.67   0.750  u  g   c   v  2.00  f  t   2  t  g  00200  394  -
687 a  25.25  13.500  y  p  ff  ff  2.00  f  t   1  t  g  00200    1  -
688 b  17.92   0.205  u  g  aa   v  0.04  f  f   0  f  g  00280  750  -
689 b  35.00   3.375  u  g   c   h  8.29  f  f   0  t  g  00000    0  -

690 rows × 16 columns

path = "crx.data"
names = ['A1','A2','A3','A4','A5','A6','A7','A8',
         'A9','A10','A11','A12','A13','A14','A15','A16']
df = pd.read_csv(path, header=None, names=names)
print ("数据条数:", len(df))

# 2. 异常数据过滤
df = df.replace("?", np.nan).dropna(how='any')
print ("过滤后数据条数:", len(df))

df.head(5)
数据条数: 690
过滤后数据条数: 653
A1A2A3A4A5A6A7A8A9A10A11A12A13A14A15A16
0b30.830.000ugwv1.25tt1fg002020+
1a58.674.460ugqh3.04tt6fg00043560+
2a24.500.500ugqh1.50tf0fg00280824+
3b27.831.540ugwv3.75tt5tg001003+
4b20.175.625ugwv1.71tf0fs001200+
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 689
Data columns (total 16 columns):
A1     653 non-null object
A2     653 non-null object
A3     653 non-null float64
A4     653 non-null object
A5     653 non-null object
A6     653 non-null object
A7     653 non-null object
A8     653 non-null float64
A9     653 non-null object
A10    653 non-null object
A11    653 non-null int64
A12    653 non-null object
A13    653 non-null object
A14    653 non-null object
A15    653 non-null int64
A16    653 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.7+ KB
df.A16.value_counts()

-    357
+    296
Name: A16, dtype: int64
# A hand-rolled one-hot (dummy) encoding: turn the value v into a vector/list
def parse(v, l):
    # v is the string value to encode
    # l is the list of categories, one of which is v
    return [1 if i == v else 0 for i in l]
# Process one record
def parseRecord(record):
    result = []
    ## Convert the discrete fields into numeric form
    a1 = record['A1']
    for i in parse(a1, ('a', 'b')):
        result.append(i)
    
    result.append(float(record['A2']))
    result.append(float(record['A3']))
    
    # One-hot encode A4; what was one column in the DataFrame now takes four columns
    a4 = record['A4']
    for i in parse(a4, ('u', 'y', 'l', 't')):
        result.append(i)
    
    a5 = record['A5']
    for i in parse(a5, ('g', 'p', 'gg')):
        result.append(i)
    
    a6 = record['A6']
    for i in parse(a6, ('c', 'd', 'cc', 'i', 'j', 'k', 'm', 'r', 'q', 'w', 'x', 'e', 'aa', 'ff')):
        result.append(i)
    
    a7 = record['A7']
    for i in parse(a7, ('v', 'h', 'bb', 'j', 'n', 'z', 'dd', 'ff', 'o')):
        result.append(i)
    
    result.append(float(record['A8']))
    
    a9 = record['A9']
    for i in parse(a9, ('t', 'f')):
        result.append(i)
        
    a10 = record['A10']
    for i in parse(a10, ('t', 'f')):
        result.append(i)
    
    result.append(float(record['A11']))
    
    a12 = record['A12']
    for i in parse(a12, ('t', 'f')):
        result.append(i)
        
    a13 = record['A13']
    for i in parse(a13, ('g', 'p', 's')):
        result.append(i)
    
    result.append(float(record['A14']))
    result.append(float(record['A15']))
    
    a16 = record['A16']
    if a16 == '+':
        result.append(1)
    else:
        result.append(0)
        
    return result

print(parse('v', ['v', 'y', 'l']))
print(parse('y', ['v', 'y', 'l']))
print(parse('l', ['v', 'y', 'l']))

[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
### Feature engineering (convert the data to numeric form)
new_names =  ['A1_0', 'A1_1',
              'A2','A3',
              'A4_0','A4_1','A4_2','A4_3', # A4 is one-hot encoded, so one original column becomes four
              'A5_0', 'A5_1', 'A5_2', 
              'A6_0', 'A6_1', 'A6_2', 'A6_3', 'A6_4', 'A6_5', 'A6_6', 'A6_7', 'A6_8', 'A6_9', 'A6_10', 'A6_11', 'A6_12', 'A6_13', 
              'A7_0', 'A7_1', 'A7_2', 'A7_3', 'A7_4', 'A7_5', 'A7_6', 'A7_7', 'A7_8', 
              'A8',
              'A9_0', 'A9_1' ,
              'A10_0', 'A10_1',
              'A11',
              'A12_0', 'A12_1',
              'A13_0', 'A13_1', 'A13_2',
              'A14','A15','A16']
datas = df.apply(lambda x: pd.Series(parseRecord(x), index = new_names), axis=1)
names = new_names

## Preview the processed data
datas.head(5)

   A1_0  A1_1     A2     A3  A4_0  A4_1  A4_2  A4_3  A5_0  A5_1  ...  A10_1  A11  A12_0  A12_1  A13_0  A13_1  A13_2    A14    A15  A16
0   0.0   1.0  30.83  0.000   1.0   0.0   0.0   0.0   1.0   0.0  ...    0.0  1.0    0.0    1.0    1.0    0.0    0.0  202.0    0.0  1.0
1   1.0   0.0  58.67  4.460   1.0   0.0   0.0   0.0   1.0   0.0  ...    0.0  6.0    0.0    1.0    1.0    0.0    0.0   43.0  560.0  1.0
2   1.0   0.0  24.50  0.500   1.0   0.0   0.0   0.0   1.0   0.0  ...    1.0  0.0    0.0    1.0    1.0    0.0    0.0  280.0  824.0  1.0
3   0.0   1.0  27.83  1.540   1.0   0.0   0.0   0.0   1.0   0.0  ...    0.0  5.0    1.0    0.0    1.0    0.0    0.0  100.0    3.0  1.0
4   0.0   1.0  20.17  5.625   1.0   0.0   0.0   0.0   1.0   0.0  ...    1.0  0.0    0.0    1.0    0.0    0.0    1.0  120.0    0.0  1.0

5 rows × 48 columns

datas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 689
Data columns (total 48 columns):
A1_0     653 non-null float64
A1_1     653 non-null float64
A2       653 non-null float64
A3       653 non-null float64
A4_0     653 non-null float64
A4_1     653 non-null float64
A4_2     653 non-null float64
A4_3     653 non-null float64
A5_0     653 non-null float64
A5_1     653 non-null float64
A5_2     653 non-null float64
A6_0     653 non-null float64
A6_1     653 non-null float64
A6_2     653 non-null float64
A6_3     653 non-null float64
A6_4     653 non-null float64
A6_5     653 non-null float64
A6_6     653 non-null float64
A6_7     653 non-null float64
A6_8     653 non-null float64
A6_9     653 non-null float64
A6_10    653 non-null float64
A6_11    653 non-null float64
A6_12    653 non-null float64
A6_13    653 non-null float64
A7_0     653 non-null float64
A7_1     653 non-null float64
A7_2     653 non-null float64
A7_3     653 non-null float64
A7_4     653 non-null float64
A7_5     653 non-null float64
A7_6     653 non-null float64
A7_7     653 non-null float64
A7_8     653 non-null float64
A8       653 non-null float64
A9_0     653 non-null float64
A9_1     653 non-null float64
A10_0    653 non-null float64
A10_1    653 non-null float64
A11      653 non-null float64
A12_0    653 non-null float64
A12_1    653 non-null float64
A13_0    653 non-null float64
A13_1    653 non-null float64
A13_2    653 non-null float64
A14      653 non-null float64
A15      653 non-null float64
A16      653 non-null float64
dtypes: float64(48)
memory usage: 250.0 KB
## Split the data
X = datas[names[0:-1]]
Y = datas[names[-1]]

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.1,random_state=0)

X_train.describe().T

       count        mean          std    min     25%     50%      75%        max
A1_0   587.0    0.315162     0.464977   0.00   0.000    0.00    1.000       1.00
A1_1   587.0    0.684838     0.464977   0.00   0.000    1.00    1.000       1.00
A2     587.0   31.685417    11.883506  13.75  22.625   28.67   38.290      76.75
A3     587.0    4.909319     5.073588   0.00   1.040    3.00    7.520      28.00
A4_0   587.0    0.761499     0.426530   0.00   1.000    1.00    1.000       1.00
A4_1   587.0    0.235094     0.424419   0.00   0.000    0.00    0.000       1.00
A4_2   587.0    0.003407     0.058321   0.00   0.000    0.00    0.000       1.00
A4_3   587.0    0.000000     0.000000   0.00   0.000    0.00    0.000       0.00
A5_0   587.0    0.761499     0.426530   0.00   1.000    1.00    1.000       1.00
A5_1   587.0    0.235094     0.424419   0.00   0.000    0.00    0.000       1.00
A5_2   587.0    0.003407     0.058321   0.00   0.000    0.00    0.000       1.00
A6_0   587.0    0.211244     0.408539   0.00   0.000    0.00    0.000       1.00
A6_1   587.0    0.037479     0.190094   0.00   0.000    0.00    0.000       1.00
A6_2   587.0    0.061329     0.240137   0.00   0.000    0.00    0.000       1.00
A6_3   587.0    0.085179     0.279386   0.00   0.000    0.00    0.000       1.00
A6_4   587.0    0.015332     0.122975   0.00   0.000    0.00    0.000       1.00
A6_5   587.0    0.073254     0.260775   0.00   0.000    0.00    0.000       1.00
A6_6   587.0    0.059625     0.236993   0.00   0.000    0.00    0.000       1.00
A6_7   587.0    0.005111     0.071367   0.00   0.000    0.00    0.000       1.00
A6_8   587.0    0.120954     0.326352   0.00   0.000    0.00    0.000       1.00
A6_9   587.0    0.097104     0.296352   0.00   0.000    0.00    0.000       1.00
A6_10  587.0    0.049404     0.216894   0.00   0.000    0.00    0.000       1.00
A6_11  587.0    0.035775     0.185887   0.00   0.000    0.00    0.000       1.00
A6_12  587.0    0.078365     0.268974   0.00   0.000    0.00    0.000       1.00
A6_13  587.0    0.069847     0.255106   0.00   0.000    0.00    0.000       1.00
A7_0   587.0    0.587734     0.492662   0.00   0.000    1.00    1.000       1.00
A7_1   587.0    0.207836     0.406105   0.00   0.000    0.00    0.000       1.00
A7_2   587.0    0.081772     0.274250   0.00   0.000    0.00    0.000       1.00
A7_3   587.0    0.011925     0.108641   0.00   0.000    0.00    0.000       1.00
A7_4   587.0    0.006814     0.082337   0.00   0.000    0.00    0.000       1.00
A7_5   587.0    0.013629     0.116042   0.00   0.000    0.00    0.000       1.00
A7_6   587.0    0.010221     0.100669   0.00   0.000    0.00    0.000       1.00
A7_7   587.0    0.076661     0.266280   0.00   0.000    0.00    0.000       1.00
A7_8   587.0    0.003407     0.058321   0.00   0.000    0.00    0.000       1.00
A8     587.0    2.221882     3.304041   0.00   0.210    1.00    2.605      28.50
A9_0   587.0    0.538330     0.498954   0.00   0.000    1.00    1.000       1.00
A9_1   587.0    0.461670     0.498954   0.00   0.000    0.00    1.000       1.00
A10_0  587.0    0.449744     0.497892   0.00   0.000    0.00    1.000       1.00
A10_1  587.0    0.550256     0.497892   0.00   0.000    1.00    1.000       1.00
A11    587.0    2.562181     5.056756   0.00   0.000    0.00    3.000      67.00
A12_0  587.0    0.465077     0.499204   0.00   0.000    0.00    1.000       1.00
A12_1  587.0    0.534923     0.499204   0.00   0.000    1.00    1.000       1.00
A13_0  587.0    0.913118     0.281903   0.00   1.000    1.00    1.000       1.00
A13_1  587.0    0.003407     0.058321   0.00   0.000    0.00    0.000       1.00
A13_2  587.0    0.083475     0.276835   0.00   0.000    0.00    0.000       1.00
A14    587.0  178.855196   171.400688   0.00  66.000  152.00  264.000    2000.00
A15    587.0  943.959114  5081.188098   0.00   0.000    5.00  397.000  100000.00
lr = LogisticRegressionCV(Cs=np.logspace(-4,1,50),fit_intercept=True,penalty='l2',solver='lbfgs',tol=0.01,multi_class='ovr')
lr.fit(X_train,Y_train)

LogisticRegressionCV(Cs=array([1.00000000e-04, 1.26485522e-04, 1.59985872e-04, 2.02358965e-04,
       2.55954792e-04, 3.23745754e-04, 4.09491506e-04, 5.17947468e-04,
       6.55128557e-04, 8.28642773e-04, 1.04811313e-03, 1.32571137e-03,
       1.67683294e-03, 2.12095089e-03, 2.68269580e-03, 3.39322177e-03,
       4.29193426e-03, 5.42867544e-03, 6.86648845e-03, 8.68511374e-03,
       1.09854114e-02, 1.38...
       1.20679264e+00, 1.52641797e+00, 1.93069773e+00, 2.44205309e+00,
       3.08884360e+00, 3.90693994e+00, 4.94171336e+00, 6.25055193e+00,
       7.90604321e+00, 1.00000000e+01]),
                     class_weight=None, cv='warn', dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='ovr', n_jobs=None, penalty='l2',
                     random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.01, verbose=0)
lr_r = lr.score(X_train,Y_train)
print ("Logistic R value (accuracy on the training set):", lr_r)
print ("Logistic sparse feature ratio: %.2f%%" % (np.mean(lr.coef_.ravel() == 0) * 100))
print ("Logistic coefficients:",lr.coef_)
print ("Logistic intercept:",lr.intercept_)

Logistic R value (accuracy on the training set): 0.8722316865417377
Logistic sparse feature ratio: 2.13%
Logistic coefficients: [[-3.16666246e-02  6.81060212e-02 -4.79275864e-03 -7.91375832e-03
   1.01962993e-01 -1.85791599e-01  1.20268003e-01  0.00000000e+00
   1.01962993e-01 -1.85791599e-01  1.20268003e-01 -9.73878221e-03
  -2.01150844e-02  4.10359964e-01 -4.58283536e-01 -2.05522876e-02
  -2.67931426e-01 -1.11132872e-01  3.47130384e-03  9.24560737e-02
   1.91389650e-01  6.42832964e-01  9.25454935e-02 -9.20792566e-02
  -4.16782808e-01  6.55077570e-02  3.38771742e-01 -1.89955583e-01
   1.09471200e-01  1.15000661e-01 -1.18558814e-01  2.30930774e-02
  -2.97691604e-01 -9.19903959e-03  1.32470292e-01  1.41022382e+00
  -1.37378443e+00  1.93156745e-01 -1.56717348e-01  1.36228784e-01
  -3.26934489e-02  6.91328454e-02 -3.00204676e-02 -1.47415133e-02
   8.12013774e-02 -1.42065091e-03  5.01349345e-04]]
Logistic intercept: [-1.01504439]
lr_y_predict = lr.predict(X_test)
lr_y_predict

array([1., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1., 0., 0., 1., 0., 0., 1.,
       1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
       1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1.,
       0., 1., 0., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 1.])
y1 = lr.predict_proba(X_test)
y1

array([[1.00172592e-01, 8.99827408e-01],
       [2.10731730e-01, 7.89268270e-01],
       [1.02564277e-01, 8.97435723e-01],
       [9.18579612e-01, 8.14203879e-02],
       [4.96460800e-01, 5.03539200e-01],
       [1.57562852e-12, 1.00000000e+00],
       [9.19055111e-01, 8.09448889e-02],
       [5.76602518e-02, 9.42339748e-01],
       [9.25514497e-01, 7.44855027e-02],
       [3.07565128e-01, 6.92434872e-01],
       [3.04840001e-06, 9.99996952e-01],
       [9.37140018e-01, 6.28599821e-02],
       [9.46845548e-01, 5.31544517e-02],
       [2.89763929e-02, 9.71023607e-01],
       [5.90593530e-01, 4.09406470e-01],
       [9.47922855e-01, 5.20771450e-02],
       [9.85641175e-03, 9.90143588e-01],
       [4.49637006e-01, 5.50362994e-01],
       [9.75285287e-01, 2.47147128e-02],
       [2.35426074e-01, 7.64573926e-01],
       [9.83683270e-01, 1.63167301e-02],
       [9.58942383e-01, 4.10576167e-02],
       [9.41157486e-01, 5.88425136e-02],
       [8.67416400e-01, 1.32583600e-01],
       [8.63673114e-01, 1.36326886e-01],
       [8.09135599e-01, 1.90864401e-01],
       [9.18740616e-01, 8.12593843e-02],
       [5.07988826e-01, 4.92011174e-01],
       [1.52565645e-01, 8.47434355e-01],
       [8.86917247e-01, 1.13082753e-01],
       [7.68921398e-01, 2.31078602e-01],
       [1.71031614e-02, 9.82896839e-01],
       [8.28998790e-01, 1.71001210e-01],
       [9.38043147e-01, 6.19568528e-02],
       [5.92562341e-02, 9.40743766e-01],
       [9.14115478e-01, 8.58845224e-02],
       [8.71766217e-01, 1.28233783e-01],
       [9.79206513e-01, 2.07934869e-02],
       [9.66223761e-01, 3.37762392e-02],
       [9.37688450e-01, 6.23115502e-02],
       [1.61634863e-01, 8.38365137e-01],
       [5.49906429e-01, 4.50093571e-01],
       [9.30588983e-01, 6.94110169e-02],
       [9.40655454e-01, 5.93445455e-02],
       [9.15741340e-01, 8.42586599e-02],
       [9.37482034e-01, 6.25179665e-02],
       [3.73482331e-01, 6.26517669e-01],
       [9.67523269e-01, 3.24767314e-02],
       [6.22550724e-01, 3.77449276e-01],
       [9.21861425e-02, 9.07813858e-01],
       [2.36736330e-02, 9.76326367e-01],
       [8.93707865e-01, 1.06292135e-01],
       [1.15353240e-01, 8.84646760e-01],
       [7.08249977e-01, 2.91750023e-01],
       [5.81239535e-01, 4.18760465e-01],
       [3.03352701e-02, 9.69664730e-01],
       [9.61968655e-01, 3.80313451e-02],
       [9.51893915e-02, 9.04810608e-01],
       [9.17046779e-02, 9.08295322e-01],
       [5.44721836e-02, 9.45527816e-01],
       [9.49642697e-01, 5.03573032e-02],
       [3.77682261e-01, 6.22317739e-01],
       [9.57133060e-01, 4.28669400e-02],
       [2.48916456e-03, 9.97510835e-01],
       [8.98503538e-01, 1.01496462e-01],
       [3.21273555e-01, 6.78726445e-01]])
## KNN
knn = KNeighborsClassifier(n_neighbors=20,algorithm='kd_tree',weights='distance')
knn.fit(X_train,Y_train)

KNeighborsClassifier(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=20, p=2,
                     weights='distance')
knn_r = knn.score(X_train,Y_train)
print("KNN algorithm R: %.2f"%knn_r)

KNN algorithm R: 1.00
knn_y_predict = knn.predict(X_test)
knn_r_test = knn.score(X_test,Y_test)
print("KNN R value on the test set (accuracy): %.2f" % knn_r_test)

KNN R value on the test set (accuracy): 0.71
x_len = range(len(X_test))
plt.figure(figsize=(14,7),facecolor='w')
plt.ylim(-0.1,1.1)
plt.plot(x_len, Y_test, 'ro',markersize = 6, zorder=3, label=u'real value')
plt.plot(x_len, lr_y_predict, 'go', markersize = 10, zorder=2, label=u'Logis predict,$R^2$=%.3f' % lr.score(X_test, Y_test))
plt.plot(x_len, knn_y_predict, 'yo', markersize = 16, zorder=1, label=u'KNN predict,$R^2$=%.3f' % knn.score(X_test, Y_test))
plt.legend(loc = 'center right')
plt.xlabel(u'data num', fontsize=18)
plt.ylabel(u'isApproval(0 N,1 Y)', fontsize=18)
plt.title(u'Logistic and KNN ', fontsize=20)
plt.show()

[Figure: Logistic vs. KNN predictions on the test set (output_18_1.png)]

14 Summary

  • Linear models are generally used for regression problems; Logistic and Softmax models are generally used for classification.
  • θ is mostly solved for by gradient descent, a key tool for parameter optimization; SGD in particular suits online learning and can escape shallow local minima.
  • Logistic/Softmax regression is among the most important practical methods for solving classification problems.
  • Generalized linear models do not require the samples to be normally distributed, only to follow a distribution from the exponential family (binomial, Poisson, Bernoulli, exponential, etc.); their explanatory variables may be continuous or discrete.