Machine Learning: Regression Algorithms

Latest recommended article published on 2024-08-13 09:18:02

HP2LF

Tags: machine learning, regression, artificial intelligence

Article link: https://blog.csdn.net/HP2LF/article/details/137695062


Linear Regression

Regression is a supervised learning algorithm.

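• Regression is a widely used machine learning algorithm for modeling the relationship between explanatory variables (the independent variables X) and observed values (the dependent variable Y). From a machine learning point of view, it builds a model (a function) that maps attributes (X) to labels (Y); during learning, the algorithm searches for the function whose parameters fit that relationship best.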

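• The output of a regression model (function) is a continuous value; the input (the attribute values) is an n-dimensional attribute/numeric vector.
$h_\theta(x) = \theta_0 + \theta_1x_1 + \cdots + \theta_nx_n\\ = \theta_0 \cdot 1 + \theta_1x_1 + \cdots + \theta_nx_n\\ = \theta_0x_0 + \theta_1x_1 + \cdots + \theta_nx_n \quad (x_0 = 1)\\ = \sum_{i=0}^{n}\theta_ix_i = \theta^Tx$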
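• Linear regression assumes a linear relationship in the data, that is, the relationship between the feature attributes X and the target attribute Y is linear.

• The model that linear regression looks for is one where the training data is spread fairly evenly on both sides of the fitted line or plane.

• In linear regression, the optimal model is the one that minimizes the total distance from all samples (the training data) to the fitted line or plane.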

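Least Squares

In other words, the linear regression model is optimal when the differences between predicted and actual values are minimized over all samples. Because these differences can be positive or negative, we minimize their squares instead. This gives the following objective function:
$J(\theta)=\frac{1}{2}\sum_{i=1}^m(\varepsilon^{(i)})^2=\frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$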
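As a small illustration of this objective, the sketch below evaluates J(θ) for two candidate parameter vectors. The function name `squared_error_cost` and the toy data are assumptions made for this example; they do not come from the original post.

```python
import numpy as np


def squared_error_cost(theta, X, y):
    """Return J(theta) = 1/2 * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return 0.5 * float(residuals @ residuals)


# Toy data lying exactly on y = 1 + 2x; the first column is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

cost_exact = squared_error_cost(np.array([1.0, 2.0]), X, y)  # the true parameters
cost_off = squared_error_cost(np.array([0.0, 2.0]), X, y)    # intercept is off by 1
print(cost_exact, cost_off)
```

The parameters that reproduce the data give a cost of zero, while shifting the intercept by 1 leaves a residual of -1 on each of the three samples, so the cost rises to 1.5.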

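Linear Regression, Maximum Likelihood Estimation, and Least Squares

The errors are independent and identically distributed, following a Gaussian distribution with mean 0 and some fixed variance.

Reason: the central limit theorem.

In practice, many random phenomena can be viewed as the combined effect of many independent factors, and they tend to follow a normal distribution.
$y^{(i)}=\theta^Tx^{(i)}+\varepsilon^{(i)}\\ p(\varepsilon^{(i)})=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\right)\\ p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)\\ L(\theta)=\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)\\ =\prod_{i=1}^{m}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$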

Log-Likelihood, Objective Function, and Least Squares
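The step from the likelihood to the least-squares objective can be sketched as follows. This is a reconstruction of the standard derivation under the Gaussian-noise assumption stated earlier, not a copy of the author's original figure; the symbol ℓ(θ) for the log-likelihood is introduced here for illustration.

```latex
\begin{aligned}
\ell(\theta) &= \log L(\theta)
  = \sum_{i=1}^{m} \log\left[\frac{1}{\sigma\sqrt{2\pi}}
    \exp\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right)\right] \\
&= m \log\frac{1}{\sigma\sqrt{2\pi}}
  - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)}-\theta^T x^{(i)}\right)^2
\end{aligned}
```

The first term does not depend on θ and σ² is a positive constant, so maximizing ℓ(θ) is the same as minimizing $J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$, which is the least-squares objective.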


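Optimal Parameters for Least Squares

• Closed-form solution
$\theta=(X^TX)^{-1}X^TY$
• Least squares requires the matrix $X^TX$ to be invertible. To guard against a non-invertible matrix or overfitting, a small extra term can be added so that the resulting matrix is invertible:
$\theta=(X^TX+\lambda I)^{-1}X^TY$
• The difficulty of solving least squares directly: computing the matrix inverse is expensive.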
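The regularized closed-form solution can be sketched in NumPy as below. The helper name `ridge_normal_equation` and the toy data are assumptions for illustration. The sketch calls `np.linalg.solve` rather than forming the inverse explicitly, a common choice that is swapped in here and is not something the original post prescribes.

```python
import numpy as np


def ridge_normal_equation(X, y, lam=0.0):
    """Solve (X^T X + lam * I) theta = X^T y without forming an explicit inverse."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


# Two identical columns make X^T X singular, so the unregularized formula breaks down.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# A small lam makes the system solvable; the weight is split evenly across both columns.
theta = ridge_normal_equation(X, y, lam=1e-3)
print(theta)
```

With `lam=0.0` the same call would raise a singular-matrix error or return numerically meaningless values for this data, which is the situation the λI term is meant to avoid.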

Objective Function (loss/cost function)

• 0-1 loss
$J(\theta)=\begin{cases}1, & Y \neq f(X)\\[2ex] 0, & Y = f(X)\end{cases}$
• Perceptron loss
$J(\theta)=\begin{cases}1, & |Y-f(X)|>t\\[2ex] 0, & |Y-f(X)|\leq t\end{cases}$
• Sum-of-squares loss
$J(\theta)=\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$
• Absolute-value loss
$J(\theta)=\sum_{i=1}^{m}|h_\theta(x^{(i)})-y^{(i)}|$
• Log loss
$J(\theta)=-\sum_{i=1}^{m}y^{(i)}\log h_\theta(x^{(i)})$
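To make the definitions concrete, the sketch below evaluates each loss on four samples. The labels, the model outputs, the 0.5 decision threshold, and the tolerance `t` are all made-up values for illustration; the 0-1 and perceptron losses are summed over the samples here so that every loss yields a single number.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 1.0])   # hypothetical labels
y_prob = np.array([0.9, 0.2, 0.6, 0.4])   # hypothetical model outputs h_theta(x)
y_label = (y_prob >= 0.5).astype(float)   # hard predictions f(X), threshold assumed
t = 0.5                                   # tolerance used by the perceptron loss

zero_one = int(np.sum(y_label != y_true))             # count of misclassified samples
perceptron = int(np.sum(np.abs(y_prob - y_true) > t)) # count of errors larger than t
squared = float(np.sum((y_prob - y_true) ** 2))       # sum-of-squares loss
absolute = float(np.sum(np.abs(y_prob - y_true)))     # absolute-value loss
log_loss = float(-np.sum(y_true * np.log(y_prob)))    # log loss as defined above

print(zero_one, perceptron, squared, absolute, log_loss)
```

Only the last sample is misclassified (0.4 falls below the threshold while the label is 1), so both counting losses equal 1, while the squared and absolute losses accumulate the size of every error.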

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


# 1. Build the data: X1 is the single feature, Y is the target
X1 = np.array([10, 15, 20, 30, 50, 60, 60, 70]).reshape((-1, 1))
Y = np.array([0.8, 1.0, 1.8, 2.0, 3.2, 3.0, 3.1, 3.5]).reshape((-1, 1))

# Prepend a column of ones for the intercept term
# (to fit without an intercept, use X = X1 instead)
X = np.column_stack((np.ones_like(X1), X1))

# 2. Solve for theta with the closed-form formula theta = (X^T X)^{-1} X^T Y.
# Plain ndarrays and the @ operator are used instead of np.mat,
# which was removed in NumPy 2.0.
theta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(theta)

# 3. Compute the predictions from the fitted theta
predict_y = X @ theta
print(predict_y)

# Check MSE and R^2 on the training data
mse = mean_squared_error(Y, predict_y)
print('MSE', mse)
r2 = r2_score(Y, predict_y)
print('r^2', r2)

# 4. Predict an unseen sample with the fitted parameters.
# The leading 1 is the intercept input; the row vector must multiply theta from the left.
x_test = np.array([[1, 55]])
y_test_hat = x_test @ theta
print('price', y_test_hat)

# 5. Plot actual versus predicted values
plt.plot(X1, Y, 'bo', label='actual')
plt.plot(X1, predict_y, 'r--o', label='predicted')
plt.legend(loc='lower right')
plt.show()



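### 1. Load the data

### 2. Clean the data

### 3. Extract the feature attributes X and the target attribute Y

### 4. Split the data (into a training set and a test set)

### 5. Feature engineering: normalization, standardization, text processing

### 6. Build the model

### 7. Train the model

### 8. Evaluate the model (if the results are poor, go back to step 2 and iterate until the requirements are met)

### 9. Save/persist the model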
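The nine steps above can be sketched end to end with scikit-learn. This is a minimal sketch under stated assumptions: synthetic data from `make_regression` stands in for a real, already cleaned dataset, and the model is saved to a temporary directory under the made-up file name `linear.joblib`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-3: load the data and separate features X from target Y
# (synthetic data is an assumption; no cleaning is needed for it).
X, Y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# Step 4: split into training and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, train_size=0.7, random_state=10)

# Steps 5-7: standardize the features, then build and train the model.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(x_train, y_train)

# Step 8: evaluate on held-out data (score returns R^2 for regressors).
test_r2 = model.score(x_test, y_test)
print('test R^2:', test_r2)

# Step 9: persist the fitted pipeline and load it back.
model_path = os.path.join(tempfile.mkdtemp(), 'linear.joblib')
joblib.dump(model, model_path)
restored = joblib.load(model_path)
```

Putting the scaler and the regressor in one pipeline means the scaler is fitted on the training split only and is saved together with the model, so the restored object can be applied directly to raw features.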

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.use('TkAgg')  # interactive backend; remove this line on a headless machine
from sklearn.metrics import mean_squared_error, r2_score
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib

flag = True  # whether to fit an intercept term
# 1. Build the data: each row is [area, number of rooms]
X1 = np.array([
    [10, 1],
    [15, 1],
    [20, 1],
    [30, 1],
    [50, 2],
    [60, 1],
    [60, 2],
    [70, 2]]).reshape((-1, 2))
Y = np.array([0.8, 1.0, 1.8, 2.0, 3.2, 3.0, 3.1, 3.5]).reshape((-1, 1))

# Append a column of ones for the intercept term
if flag:
    X = np.column_stack((X1, np.ones(shape=(X1.shape[0], 1))))
else:
    X = X1

# 2. Solve for theta with the closed-form formula.
# Plain ndarrays and @ replace np.mat, which was removed in NumPy 2.0.
theta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(theta)


# 3. Compute the predictions from the fitted theta
predict_y = X @ theta
print(predict_y)
mse = mean_squared_error(y_true=Y, y_pred=predict_y)
r2 = r2_score(y_true=Y, y_pred=predict_y)
print('mse:', mse)
print('r2:', r2)


# 4. Predict an unseen sample with the fitted parameters
if flag:
    x = np.array([[55.0, 2.0, 1.0]])
else:
    x = np.array([[55.0, 2.0]])
predict_y = x @ theta
print("Predicted price for an area of 55 square meters and 2 rooms: {}".format(predict_y))
print(theta)

# 5. Visualize the samples and the fitted plane in 3D
fig = plt.figure(facecolor='w')
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X1[:, 0], X1[:, 1], Y.ravel(), s=40, c='r', depthshade=False)

x1 = np.arange(0, 100)
x2 = np.arange(0, 4)
x1, x2 = np.meshgrid(x1, x2)

def predict(x1, x2, theta, base=False):
    # theta has shape (n, 1); flatten it to read the scalar coefficients
    t = np.ravel(theta)
    y_ = x1 * t[0] + x2 * t[1]
    if base:
        y_ = y_ + t[2]
    return y_

# Evaluate the plane on the whole grid at once (vectorized)
z = predict(x1, x2, theta, base=flag)
print(z.shape)
ax.plot_surface(x1, x2, z, rstride=1, cstride=1, cmap=plt.cm.jet)  # draw the fitted plane with the jet colormap
ax.set_title('House rental price prediction')
plt.show()

import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
import joblib
from sklearn.model_selection import train_test_split

class Linear():
    def __init__(self, use_b=True):
        self.use_b = use_b   # whether to fit an intercept term
        self.theta = None    # feature weights
        self.theta0 = 0      # intercept

    def train(self, X, Y):
        # Work with plain ndarrays; np.mat was removed in NumPy 2.0
        X = np.asarray(X, dtype=float)
        Y = np.asarray(Y, dtype=float).reshape((-1, 1))
        if self.use_b:
            X = np.column_stack((np.ones((X.shape[0], 1)), X))

        # Closed-form solution theta = (X^T X)^{-1} X^T Y
        theta = np.linalg.inv(X.T @ X) @ X.T @ Y
        if self.use_b:
            self.theta0 = theta[0]
            self.theta = theta[1:]
        else:
            self.theta = theta
            self.theta0 = 0

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        predict_y = X @ self.theta + self.theta0
        return predict_y

    def score(self, X, Y):
        # R^2 of the predictions on (X, Y)
        y_pred = self.predict(X)
        Y = np.asarray(Y)
        y_pred = np.asarray(y_pred)
        return r2_score(y_true=Y, y_pred=y_pred)
    def save(self):
        pass
    def load(self):
        pass

if __name__ == '__main__':
    # X1 = np.array([10, 15, 20, 30, 50, 60, 60, 70]).reshape((-1, 1))
    # Y = np.array([0.8, 1.0, 1.8, 2.0, 3.2, 3.0, 3.1, 3.5]).reshape((-1, 1))
    # linear = Linear(use_b=True)
    # linear.train(X1, Y)
    # x_test = [[55]]
    # y_test_hat = linear.predict(x_test)
    # print(y_test_hat)
    # print(linear.theta)
    # print(linear.theta0)
    #

    # Load the data (whitespace-separated, no header row)
    data = pd.read_csv('./data/boston_housing.data', sep=r'\s+', header=None)
    # print(data.shape)
    # print(data.info())
    # Data preprocessing
    #
    # Extract the feature attributes X and the target attribute Y
    X = data.iloc[:, :-1]
    Y = data.iloc[:, -1]

    x_train, x_test, y_train, y_tst = train_test_split(X, Y, train_size=0.7, random_state=10)
    # print(x_train.shape)
    linear = Linear(use_b=True)
    linear.train(x_train, y_train)
    print(linear.score(x_train, y_train))
    print(linear.score(x_test, y_tst))

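Overfitting in Linear Regression

Objective function:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$
To prevent overfitting, that is, to keep the θ values from growing too large, a sum-of-squares penalty can be added to the objective function:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^{n}\theta_j^2$
• Regularization term (norm) / penalty term: the term added here is called the L2-norm.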

Overfitting and Regularization

L2-norm:

A linear regression model with L2 regularization is called Ridge regression.
$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^{n}\theta_j^2 \quad\quad \lambda>0$

L1-norm:

A linear regression model with L1 regularization is called LASSO regression (Least Absolute Shrinkage and Selection Operator).
$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^{n}|\theta_j| \quad\quad \lambda>0$

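Comparing Ridge (L2-norm) and LASSO (L1-norm)

• With the L2-norm, the parameters of all dimensions are shrunk inside a circle, so no parameter is driven exactly to 0 and the solution is not sparse. In practice, data dimensions contain noise and redundancy; a sparse solution picks out the useful dimensions and reduces redundancy, which improves the accuracy and robustness of later predictions (less overfitting). The L1-norm can produce such a sparse solution.

• Ridge models offer high accuracy, robustness, and stability; LASSO models are faster to solve (the redundant features have already been removed).

• If both stability and solving speed matter, use Elastic Net.

Elastic Net

A linear regression model that uses both L1 and L2 regularization is called the Elastic Net algorithm.
$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\left(p\sum_{j=1}^{n}\theta_j^2+(1-p)\sum_{j=1}^{n}|\theta_j|\right) \quad\quad \lambda>0,\ p\in[0,1]$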
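The sparsity difference between the three penalties can be observed with scikit-learn. The sketch below is illustrative only: the synthetic data (3 informative features out of 20) and the `alpha` values are assumptions. Note that scikit-learn scales its objectives differently from the formulas above, and that `l1_ratio` in `ElasticNet` weights the L1 part, so it plays roughly the role of 1 - p rather than p.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data: only 3 of the 20 features actually influence y.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

# Count how many coefficients each model sets exactly to zero.
for name, model in [('Ridge', ridge), ('LASSO', lasso), ('Elastic Net', enet)]:
    n_zero = int(np.sum(model.coef_ == 0))
    print(f'{name}: {n_zero} of {model.coef_.size} coefficients are exactly zero')
```

Ridge keeps every coefficient non-zero (only shrunk), while LASSO sets some of the coefficients of uninformative features exactly to zero, which is the sparse behavior described in the comparison above.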

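Evaluating Model Performance

• MSE: the mean of the squared errors; the closer to 0, the better the model fits the training data.

• RMSE: the square root of MSE; it serves the same purpose as MSE.

• R2: ranges over (negative infinity, 1]; a larger value means the model fits the training data better, and the optimum is 1. It can be negative when the model predicts random values; if the prediction is always the sample mean, R2 is 0.

• TSS: Total Sum of Squares, which measures the variation among the samples; it is m times the biased sample variance.

• RSS: Residual Sum of Squares, which measures the difference between predicted values and sample values; it is m times the MSE.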
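The relationships among these metrics can be checked numerically. In the sketch below the predictions `y_pred` are made-up values for illustration; the hand-computed results are compared with `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.8, 1.0, 1.8, 2.0, 3.2, 3.0, 3.1, 3.5])
y_pred = np.array([1.0, 1.2, 1.5, 2.1, 3.0, 3.1, 3.1, 3.4])  # hypothetical predictions

m = y_true.size
rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
mse = rss / m                                 # RSS is m times the MSE
rmse = np.sqrt(mse)
r2 = 1.0 - rss / tss

# Always predicting the sample mean gives RSS == TSS, hence R^2 = 0.
r2_mean_baseline = r2_score(y_true, np.full(m, y_true.mean()))

print(mse, rmse, r2, r2_mean_baseline)
```

The identity R2 = 1 - RSS/TSS also explains the negative values: whenever a model's RSS exceeds the TSS, it does worse than the constant mean predictor.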

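Hyperparameter Tuning

• In practice, for models such as linear regression we need values for θ, λ, and p. Solving for θ is the model fitting itself and usually needs no involvement from the developer (the algorithm is already implemented); what mainly needs to be determined is λ and p, and this process is called tuning (hyperparameter tuning).

• Cross-validation: split the training data into several folds, hold out one fold at a time for validation, and use the results to pick the best hyperparameters λ and p. Examples are ten-fold and five-fold cross-validation (five-fold is the default in scikit-learn).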
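One way to run such a search is scikit-learn's `GridSearchCV`. The sketch below is a minimal example under assumptions: the data is synthetic, and the candidate grids for `alpha` (the role of λ) and `l1_ratio` (the L1/L2 mixing weight, comparable to 1 - p) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=150, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

# Candidate hyperparameter values (assumed for this example).
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0],
    'l1_ratio': [0.1, 0.5, 0.9],
}

# Five-fold cross-validation over every combination, scored by R^2.
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=5, scoring='r2')
search.fit(X, y)

print('best hyperparameters:', search.best_params_)
print('mean cross-validated R^2:', search.best_score_)
```

After fitting, `search.best_estimator_` is the model refitted on all the training data with the selected hyperparameters, so it can be used for prediction directly.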

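Linear Regression Summary

• Models: linear regression (Linear), Ridge regression, LASSO regression, Elastic Net

• Regularization: L1-norm, L2-norm

• Loss function / objective function:
$J(\theta)=\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\quad\quad\rightarrow\quad\min_\theta J(\theta)$
• Ways to solve for θ: least squares (direct computation; the objective is the sum-of-squares loss) and gradient descent (BGD/SGD/MBGD)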
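As an illustration of the second solving method, the sketch below runs batch gradient descent (BGD) on the one-feature dataset used earlier and compares the result with the closed-form least-squares solution. The learning rate, the iteration count, and the feature standardization are assumptions chosen so that the iteration converges quickly.

```python
import numpy as np

x = np.array([10, 15, 20, 30, 50, 60, 60, 70], dtype=float)
y = np.array([0.8, 1.0, 1.8, 2.0, 3.2, 3.0, 3.1, 3.5])

# Standardize the feature so that one learning rate suits both parameters.
x_std = (x - x.mean()) / x.std()
X = np.column_stack((np.ones_like(x_std), x_std))  # first column is x_0 = 1
m = len(y)

theta_gd = np.zeros(2)
learning_rate = 0.1
for _ in range(1000):
    # Gradient of the sum-of-squares objective, divided by m to keep the step size stable.
    gradient = X.T @ (X @ theta_gd - y) / m
    theta_gd -= learning_rate * gradient

# Closed-form least-squares solution on the same standardized features.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_gd, theta_closed)
```

BGD uses all m samples for every update; SGD and MBGD follow the same update rule but estimate the gradient from a single sample or a small batch, respectively.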

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.use('TkAgg')  # interactive backend; remove this line on a headless machine
# Load the data (whitespace-separated, no header row)
data = pd.read_csv('./data/boston_housing.data', sep=r'\s+', header=None)

# Data preprocessing

# Extract the feature attributes X and the target attribute Y
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]
# Y.shape is (506,)
# X.shape is (506, 13)
# print(Y.shape)
# print(X.shape)

# Split into training and test sets (train_size=0.3 keeps 30% of the rows for training)
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.3, random_state=10)

# Build the model
# fit_intercept controls whether an intercept term is fitted
linear = LinearRegression(fit_intercept=True)

# Train the model
linear.fit(x_train, y_train)
print('parameters:\n', linear.coef_)  # feature weights
print('bias:', linear.intercept_)  # intercept
print('train score:', linear.score(x_train, y_train))
print('test score:', linear.score(x_test, y_test))
yPredict = linear.predict(x_test)
y_train_predict = linear.predict(x_train)

# Plot true versus predicted values
plt.figure(num='train')
plt.plot(range(len(x_train)), y_train, 'r', label='true')
plt.plot(range(len(x_train)), y_train_predict, 'g', label='predict')
plt.legend(loc='upper right')
# plt.show()

plt.figure(num='test')
plt.plot(range(len(x_test)), y_test, 'r', label='true')
plt.plot(range(len(x_test)), yPredict, 'g', label='predict')
plt.legend(loc='upper right')
plt.show()

import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.use('TkAgg')  # interactive backend; remove this line on a headless machine

# Load the dataset (whitespace-separated, no header row)
data = pd.read_csv('./data/boston_housing.data', sep=r'\s+', header=None)

# Data preprocessing

# Extract the feature attributes X and the target attribute Y
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]

# Split into training and test sets (train_size=0.3 keeps 30% of the rows for training)
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.3, random_state=10)

# Feature engineering
# Polynomial expansion
'''
PolynomialFeatures: polynomial feature expansion
degree=2: the degree of the expansion
interaction_only=False: whether to keep only the interaction terms
include_bias=True: whether to add a bias column
'''
poly = PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)
"""
fit
fit_transform ==> fit + transform
transform
"""
# Keep the transformed arrays; the original discarded the return values,
# so the model was trained on the unexpanded features.
x_train = poly.fit_transform(x_train)
x_test = poly.transform(x_test)

# Build the model
# LinearRegression
# Ridge
# Lasso
# linear = LinearRegression(fit_intercept=True)
ridge = Ridge(alpha=100, fit_intercept=True)
# lasso = Lasso(alpha=0.1, fit_intercept=True)

# Train the model
# linear.fit(x_train, y_train)
ridge.fit(x_train, y_train)
# lasso.fit(x_train, y_train)

# print('linear parameters:', linear.coef_)
# print('linear bias:', linear.intercept_)
print('Ridge parameters:', ridge.coef_)
print('Ridge bias:', ridge.intercept_)
# print('lasso parameters:', lasso.coef_)
# print('lasso bias:', lasso.intercept_)
print("-" * 100)

# Predict
# yTrain_linear_pedict = linear.predict(x_train)
yTrain_ridge_pedict = ridge.predict(x_train)
# yTTrain_lasso_pedict = lasso.predict(x_train)
# y_linear_pedict = linear.predict(x_test)
y_ridge_pedict = ridge.predict(x_test)
# y_lasso_pedict = lasso.predict(x_test)
#

# Score (R^2)
# print('linear train score:', linear.score(x_train, y_train))
# print('linear test score:', linear.score(x_test, y_test))
print('ridge train score:', ridge.score(x_train, y_train))
print('ridge test score:', ridge.score(x_test, y_test))
# print('lasso train score:', lasso.score(x_train, y_train))
# print('lasso test score:', lasso.score(x_test, y_test))


# Plot true versus predicted values
plt.figure(num='train')
plt.plot(range(len(x_train)), y_train, 'r', label='true')
plt.plot(range(len(x_train)), yTrain_ridge_pedict, 'g', label='predict')
plt.legend(loc='upper right')
plt.title('train')

plt.figure(num="test")
plt.plot(range(len(x_test)), y_test, 'r', label='true')
plt.plot(range(len(x_test)), y_ridge_pedict, 'g', label='predict')
plt.legend(loc='upper right')
plt.title("test")
plt.show()


# Save the model (the target directory must already exist)
joblib.dump(ridge, '.model/ridge.m')
