Machine Learning Notes (1): Linear Regression


Linear Regression

I. Problem overview
The goal of regression is to build a regression equation for predicting a target value; solving the regression means finding that equation's regression coefficients. Prediction itself is then very simple: multiply each input by its regression coefficient and sum the products.
The simplest definition of regression: given a point set D, fit a function to the points so that the error between the point set and the fitted function is minimized. If the fitted function is a straight line, this is called linear regression; if the curve is quadratic, it is called quadratic regression.
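As a concrete illustration (with made-up numbers, not from the text), a prediction is just the dot product of the coefficient vector and the input vector:

```python
import numpy as np

# Hypothetical coefficients W = (b; w1; w2) and one input sample
W = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 0.5, 0.5])   # x0 = 1 carries the intercept b
y_hat = float(np.dot(W, x))     # 1*1 + 2*0.5 + 3*0.5 = 3.5
```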

II. Solving linear regression

  1. Ordinary least squares
    1.1 Characteristics
    The Normal Equation method, also called ordinary least squares (OLS), has this characteristic: given an input matrix X, if the inverse of X.T*X exists and can be computed, the coefficients can be solved for directly. The underlying theory is also very simple: since we are minimizing the sum of squared errors, setting its derivative to zero yields the regression coefficients.
    The matrix X is an (m, n+1) matrix (m is the number of samples, n the number of features per sample), and y is an (m, 1) column vector.
    The formula above involves inverting X.T*X, so the equation only applies when that inverse exists. The inverse may not exist, however; the "ridge regression" section below discusses how to handle that case.
    1.2 Derivation
    (see Zhou Zhihua's textbook for details)
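The result of that derivation, W = (X.T*X)^(-1) * X.T * y, can be sanity-checked numerically. The snippet below (standalone, using plain NumPy arrays and synthetic data rather than the implementation that follows) compares the normal-equation solution against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # x0 = 1 column
y = X @ np.array([1.0, 2.0, -3.0]) + 0.01 * rng.normal(size=20)

# Normal equation: W = (X^T X)^{-1} X^T y
W = np.linalg.inv(X.T @ X) @ X.T @ y

# NumPy's least-squares solver should agree
W_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
```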
    1.3 实现
# -*- coding:utf-8 -*-
"""
Linear regression via ordinary least squares
by LeYuan
"""
from numpy import *
import matplotlib.pyplot as plt
from sklearn import datasets


# Ordinary least squares
def ordinary_least_square(X, Y):
    """
    Given a dataset D = {(X1,y1),(X2,y2),...,(Xm,ym)} with m samples, where
    Xi = (xi1;xi2;...;xid) has d features, fit the model f(Xi) = w*Xi + b,
    w = (w1;w2;...;wd). For convenience we append x0 = 1 to every sample and
    solve for W = (b;w1;w2;...;wd).
    :param X: m x d matrix
    :param Y: m x 1 matrix
    :return: regression coefficients W = (b;w1;w2;...;wd)
    """
    X0 = mat(ones((X.shape[0], 1)))
    X = hstack((X0, X))  # extend x0=1 for each sample
    xTx = X.T * X
    if linalg.det(xTx) == 0.0:     # zero determinant: singular matrix, no inverse
        print("This matrix is singular, cannot do inverse")
    else:
        return xTx.I * (X.T*Y)     # W = (b;w1;...;wd); xTx.I is the matrix inverse


def predict(X, W):
    """
    Predict targets for new data.
    :param X: m x d matrix
    :param W: regression coefficients W = (b;w1;w2;...;wd)
    :return: the prediction matrix
    """
    X0 = mat(ones((X.shape[0], 1)))
    X = hstack((X0, X))  # extend x0=1 for each sample
    Y = X * W
    return Y


def errors_compute(X, W, Y):
    """
    Compute the mean squared error.
    :param X: m x d matrix
    :param W: regression coefficients W = (b;w1;w2;...;wd)
    :param Y: m x 1 matrix of true values
    :return: the error
    """
    y_predict = predict(X, W)
    total_error = mean(multiply(y_predict - Y, y_predict - Y))
    return total_error


def test_regulation():
    # load the diabetes dataset
    diabetes = datasets.load_diabetes()
    # use multiple features
    diabetes_X = diabetes.data[:, 0:5]

    # split the data into training/testing sets
    diabetes_X_train = mat(diabetes_X[:-20])
    diabetes_X_test = mat(diabetes_X[-20:])

    # split the targets into training/testing sets
    diabetes_y_train = mat(diabetes.target[:-20]).T
    diabetes_y_test = mat(diabetes.target[-20:]).T

    W = ordinary_least_square(diabetes_X_train, diabetes_y_train)

    # the coefficients
    print('Coefficients: \n', W)
    # the mean squared error
    yerror = predict(diabetes_X_train, W) - diabetes_y_train
    print("Mean squared error: %.2f" % mean(multiply(yerror, yerror)))

    # plot predicted vs. true values, with the line y = x for reference
    y_test_array = diabetes_y_test.A.reshape(diabetes_y_test.shape[0],)
    y_predict_matrix = predict(diabetes_X_test, W)
    y_predict_array = y_predict_matrix.A.reshape(y_predict_matrix.shape[0],)
    plt.scatter(y_test_array, y_predict_array)
    plt.plot(y_test_array, y_test_array)

    plt.xlabel('true value')
    plt.ylabel('predict value')
    plt.title('ordinary_least_square')
    plt.show()


if __name__ == "__main__":
    test_regulation()
  2. Gradient descent
    2.1 Characteristics
    For details, see:
    Linear regression and gradient descent: http://blog.csdn.net/xiazdong/article/details/7950084
    Multivariate linear regression and gradient descent: http://m.blog.csdn.net/article/details?id=51169525
    2.2 Algorithm
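A minimal sketch of one batch gradient-descent step for the squared-error cost, using plain NumPy arrays (the helper name `gd_step` is mine; the full implementation below does the same update with `np.matrix`):

```python
import numpy as np

def gd_step(D, Y, W, a):
    """One batch gradient-descent step on the mean squared error.
    D: m x (n+1) design matrix with the x0 = 1 column included,
    Y: m x 1 targets, W: (n+1) x 1 weights, a: learning rate."""
    m = D.shape[0]
    grad = D.T @ (D @ W - Y) / m   # average gradient over all m samples
    return W - a * grad
```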
    2.3 Implementation
# -*- coding:utf-8 -*-
"""
Multivariate linear regression via gradient descent
Approach based on: http://m.blog.csdn.net/article/details?id=51169525
by LeYuan
"""
from numpy import *
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt


def gradient_descent(feature, Y, W, a, tolerance, iterations_num):
    """
    Given the feature matrix, learn the parameter vector w = (w0;w1;w2;...;wn)
    for the model hw(x) = w0*x0 + w1*x1 + ... + wn*xn = w.T*X.
    :param feature: feature matrix, m x n (m samples, n features)
    :param Y: label vector Y = (y1;y2;...;ym), m x 1
    :param W: initial parameters, (n+1) x 1
    :param a: learning rate (step size)
    :param tolerance: stop early once the error drops below this bound
    :param iterations_num: maximum number of iterations
    :return: w = (w0;w1;w2;...;wn), (n+1) x 1
    """
    # feature -> D: extend x0=1 for each sample
    x0 = mat(ones((feature.shape[0], 1)))
    D = hstack((x0, feature))     # m x (n+1)

    errors = inf
    for i in range(iterations_num):
        y_predict = D * W
        # mean squared error of the current model
        errors = np.mean(multiply(y_predict - Y, y_predict - Y))
        if errors < tolerance:    # reached the tolerance: stop updating
            break
        # batch gradient over all samples
        derive = ((y_predict - Y).T * D) / float(D.shape[0])    # 1 x (n+1)
        W = W - a * derive.T
    print("The iteration_num:", iterations_num, " The errors:", errors)
    return W


def predict(feature, W):
    """
    Predict targets for new data.
    :param feature: feature matrix
    :param W: parameter vector
    :return: the prediction vector
    """
    # feature -> D: extend x0=1 for each sample
    x0 = mat(ones((feature.shape[0], 1)))
    D = hstack((x0, feature))  # m x (n+1)
    y_predict = D * W
    return y_predict


def test_regulation():
    # load the diabetes dataset
    diabetes = datasets.load_diabetes()
    # use multiple features
    diabetes_X = diabetes.data[:, 0:5]

    # split the data into training/testing sets
    diabetes_X_train = mat(diabetes_X[:-20])
    diabetes_X_test = mat(diabetes_X[-20:])

    # split the targets into training/testing sets
    diabetes_y_train = mat(diabetes.target[:-20]).T
    diabetes_y_test = mat(diabetes.target[-20:]).T

    # initialize W = (0;0;0;...;0)
    W = np.zeros([diabetes_X_train.shape[1] + 1, 1])
    a = 0.32
    tolerance = 2200
    iterations_num = 5000
    W = gradient_descent(diabetes_X_train, diabetes_y_train, W, a, tolerance, iterations_num)
    print("W=", W)
    # test results
    y_predict_matrix = predict(diabetes_X_test, W)
    print("y_predict_matrix", y_predict_matrix)

    # plot predicted vs. true values, with the line y = x for reference
    y_test_array = diabetes_y_test.A.reshape(diabetes_y_test.shape[0],)
    y_predict_array = y_predict_matrix.A.reshape(y_predict_matrix.shape[0],)
    plt.scatter(y_test_array, y_predict_array)
    plt.plot(y_test_array, y_test_array)

    plt.xlabel('true value')
    plt.ylabel('predict value')
    plt.title('gradient_descent')
    plt.show()


if __name__ == "__main__":
    test_regulation()

  3. Locally weighted linear regression
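The original notes leave this section empty, so here is only a rough sketch of the idea (my own minimal version with a Gaussian kernel; the function name is mine): each query point gets its own weighted least-squares fit, with nearby training samples weighted more heavily.

```python
import numpy as np

def lwlr_predict(x_query, X, y, k=1.0):
    """Locally weighted linear regression prediction at one query point.
    x_query and the rows of X already include the x0 = 1 bias entry;
    k controls how quickly the Gaussian weights fall off with distance."""
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * k ** 2))  # kernel weights
    Wmat = np.diag(w)
    # weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.inv(X.T @ Wmat @ X) @ X.T @ Wmat @ y
    return float(x_query @ theta)
```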
  4. Ridge regression
    4.1 Characteristics
    4.2 Algorithm and idea
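The closed form is the normal equation with a penalty added to the diagonal, W = (X.T*X + λI)^(-1) * X.T * y. A minimal standalone sketch with NumPy arrays (the function name is mine; as in the implementation below, the bias entry is left unpenalized):

```python
import numpy as np

def ridge_solve(X, y, lam):
    """Ridge closed form: W = (X^T X + lam*I)^{-1} X^T y.
    X already carries the x0 = 1 column; I[0, 0] = 0 so the
    intercept is not shrunk toward zero."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0
    return np.linalg.inv(X.T @ X + lam * I) @ X.T @ y
```

With lam = 0 this reduces to ordinary least squares; increasing lam shrinks the non-intercept coefficients and keeps the matrix invertible even when X.T*X is singular.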
    4.3 Implementation
"""
岭回归实现线性回归
by LeYuan
"""

from numpy import *
import matplotlib.pyplot as plt
from sklearn import datasets


def ridge_regression(X, Y, lam):
    """
    例如:给定数据集D={(X1,y1),(X2,y2),(X3,y3),...,(Xm,ym)}, m为样本大小,其中Xi=(xi1;xi2;...xid),d为特征个数,欲
         求模型f(Xi)=w*Xi+b , w=(w1;w2;...wd)
          为了方便计算,我们扩展 x0=1, W=(b;w1;w2;...;wd),则
    :param X: mxd矩阵
    :param Y:  mx1矩阵
    :param lam: 得到lambda参数
    :return: W=(b;w1;w2;...;wd),
    """
    X, Y = featurescaling(X, Y)

    # extend x0=1 for each exmple
    x0 = mat(ones((X.shape[0], 1)))
    X = hstack((x0, X))

    xTx = X.T * X

    # 产生对角矩阵
    I = eye(X.shape[1])
    I[0][0] = 0  # w0 has no punish factor
    denom = xTx + I * lam

    if linalg.det(denom) == 0:
        print("this matrix is singular,  cannot do inverse")
        return
    else:
        W = denom.I * X.T * Y
        return W


def featurescaling(X, Y):
    """
    feature scaling : Mean Nomalization ,即(x-mean(x))/(max-min)
    :param X: mxd矩阵
    :param Y: mx1矩阵
    :return: X ,Y
    """
    # feature scaling
    yMean = mean(Y, 0)
    Y = (Y - yMean)/(amax(Y, 0)-amin(Y, 0))
    xMean = mean(X, 0)  # calc mean the substract it off
    xMax = amax(X, 0)
    xMin = amin(X, 0)
    X = (X - xMean)/(xMax - xMin)
    return X, Y


def predict(X, W):
    """
    预测数据
    :param X: X为mxd维度矩阵 matrix
    :param W: 回归系数 W=(b;w1;w2;...;wd)
    :return: the predict matrix
    """
    X0 = mat(ones((X.shape[0], 1)))
    X = hstack((X0, X))  # extend x0=1 for each sample
    Y = X*W
    return Y


def errors_compute(X, W, Y):
    """
     compute the errors
    :param X: X为mxd维度矩阵 matrix
    :param W: 回归系数 W=(b;w1;w2;...;wd)
    :param Y:  Y为mx1维度矩阵  matrix ,the real value matrix
    :return: the errors
    """
    y_predict = predict(X, W)
    total_error = np.mean(multiply(y_predict-Y, y_predict-Y))
    return total_error


def test_regulation():
    # load the diabetes dataset
    diabetes = datasets.load_diabetes()
    # use multiple feature
    diabetes_X = diabetes.data[:, 0:5]

    # Split the data into training/testing sets
    diabetes_X_train = mat(diabetes_X[:-20])
    diabetes_X_test = mat(diabetes_X[-20:])

    # Split the targets into training/testing sets
    diabetes_y_train = mat(diabetes.target[:-20]).T
    diabetes_y_test = mat(diabetes.target[-20:]).T

    # print((diabetes_X_test, diabetes_y_test))
    # print(featurescaling(diabetes_X_test, diabetes_y_test))
    diabetes_X_test, diabetes_y_test = featurescaling(diabetes_X_test, diabetes_y_test)

    # set lam
    lam = exp(-4)

    W = ridge_regression(diabetes_X_train, diabetes_y_train, lam)

    # the coeffcients
    print('Coefficients: \n', W)

    # the mean squared error
    diabetes_X_train, diabetes_y_train = featurescaling(diabetes_X_train, diabetes_y_train)
    yerror = predict(diabetes_X_train, W)-diabetes_y_train
    print("Mean squared error: %.2f" % mean(multiply(yerror, yerror)))

    # plot outputs
    y_test_array = diabetes_y_test.A.reshape(diabetes_y_test.shape[0],)
    y_predict_martrix = predict(diabetes_X_test, W)
    y_predict_array = y_predict_martrix.A.reshape(y_predict_martrix.shape[0],)

    plt.scatter(y_test_array, y_predict_array)
    plt.plot(y_test_array, y_test_array)

    plt.xlabel('true value')
    plt.ylabel('predict value')
    plt.title('ordinary_least_square')
    plt.show()


if __name__ == "__main__":
    test_regulation()

III. Application and model tuning
Linear regression is a suitable predictive model whenever a value must be predicted from a combination of features (e.g. house prices or vegetable prices) and the relationship between the prediction and the features is linear. Once a model has been learned, it needs ongoing tuning and correction in use. For linear regression, the best model strikes a balance between prediction bias and model variance (high bias means underfitting; high variance means overfitting). Tuning options for a linear regression model include:

  • Get more training samples - addresses high variance

  • Try a smaller set of features - addresses high variance

  • Try obtaining additional features - addresses high bias

  • Try adding polynomial combination features - addresses high bias

  • Try decreasing λ - addresses high bias

  • Try increasing λ - addresses high variance
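The underfit/overfit distinction behind this list can be made concrete with a small synthetic experiment (entirely illustrative numbers, not from the text): a low-degree polynomial underfits noisy sine data, while a high-degree one drives the training error down without helping on held-out points.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)
x_tr, y_tr = x[::2], y[::2]     # training half
x_te, y_te = x[1::2], y[1::2]   # held-out half

def poly_mse(deg):
    """Train/test mean squared error of a degree-`deg` polynomial fit."""
    coef = np.polyfit(x_tr, y_tr, deg)
    mse_tr = float(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    mse_te = float(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    return mse_tr, mse_te

underfit = poly_mse(1)   # high bias: both errors stay high
overfit = poly_mse(12)   # high variance: train error collapses, test error does not
```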


References:

  1. Classic machine learning algorithms explained with Python implementations - Linear Regression: http://blog.csdn.net/suipingsp/article/details/42101139
  2. Dataaspirant: http://dataaspirant.com/2017/02/15/simple-linear-regression-python-without-any-machine-learning-libraries/
  3. Machine learning theory and practice (8): regression: http://blog.csdn.net/pi9nc/article/details/26616063
  4. Ridge regression explained and implemented: http://blog.csdn.net/u014303046/article/details/53053019?locationNum=2&fps=1
  5. Machine learning notes on CSDN (very detailed): http://blog.csdn.net/artprog/article/details/51104192