预测数值型数据-回归 from机器学习实战U8

最新推荐文章于 2022-09-16 15:33:43 发布

菜逼庄水水

最新推荐文章于 2022-09-16 15:33:43 发布

阅读量269

点赞数

分类专栏： Python

本文链接：https://blog.csdn.net/qq_33604355/article/details/89005853

版权

Python 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

本文介绍了回归分析的基础，包括线性回归、局部加权线性回归和岭回归。讨论了如何处理自变量间的复共线性问题，以及如何通过向前逐步递归简化模型。此外，还探讨了偏差和方差在模型复杂度中的影响，并给出了预测乐高玩具价格的案例，但由于API关闭未完成实践。

摘要由CSDN通过智能技术生成

回归方程（直线）应该比较简单了，这里就简单记录一下
前面理由梯度上升（随机梯度）方法处理过，这里利用回归方程计算
相对计算量会小很多

平方误差 =（y-y预测）^2
= (yi-Xw)^2
最小，对w求偏导
w =（X.T * X).IX.T*y

import numpy as np
import matplotlib.pyplot as plt


def loaddata(filename):
    fr = open(filename)
    m = len(fr.readline().strip().split('\t')) - 1

    dataset = []
    labelset = []
    for line in fr.readlines():
        curline = line.split('\t')
        arrline = []
        for i in range(m):
            arrline.append(float(curline[i]))

        dataset.append(arrline)
        labelset.append(float(curline[-1]))

    return dataset, labelset


def plotlogisitc(dataset, labelset, w1):
    '''
    画图，核心是先画散点图吧
    '''
    x = np.mat(dataset)
    y = np.mat(labelset)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(x[:, 1].flatten().A[0], y.T[:, 0].flatten().A[0])

    # 再画折线图
    xcopy = x.copy()
    xcopy.sort(0)
    ycopy = xcopy * w1
    ax.plot(xcopy[:, 1], ycopy, c='red')

    # 再画局部加权的图 k=1.0
    yhat = Regres2_2(xcopy, dataset, labelset)
    yh =np.mat(yhat)
    ax.plot(xcopy[:,1].flatten().A[0], yh.T[:,0].flatten().A[0],c='green')

    


    #k-100
    yhat2 = Regres2_2(xcopy, dataset, labelset,100)
    yh2 = np.mat(yhat2)
    ax.plot(xcopy[:,1].flatten().A[0], yh2.T[:, 0].flatten().A[0], c='black')
    plt.show()
    return


def standRegres(dataset, labelset):
    '''
    标准线性回归的计算
    '''
    x = np.mat(dataset)
    y = np.mat(labelset).T

    xTx = x.T * x
    # 计算行列式，判断是否可逆
    if np.linalg.det(xTx) == 0:
        print("行列式不可逆\n")
        return

    w1 = xTx.I * (x.T * y)
    return w1


def Regres2_1(temp, dataset, labelset, k=1.0):
    # 局部加权回归
    # w(i,i)=exp(xi-x)/-2k**2
    # 返回的是某个点的预测值
    x = np.mat(dataset)
    y = np.mat(labelset).T
    m = np.shape(x)[0]
    w2 = np.eye(m)
    # 初始化权值
    for i in range(m):
        diffmat = temp - x[i]
        w2[i][i] = np.exp(diffmat * diffmat.T / (-2.0) * k ** 2)

    xTx = x.T * (w2 * x)
    if np.linalg.det(xTx) == 0:
        print("行列式不可逆\n")
        return

    ws = xTx.I * (x.T * (w2 * y))
    return temp * ws


def Regres2_2(testdata, dataset, labelset, k=1.0):
    m = np.shape(testdata)[0]
    yhat = np.zeros(m)
    for i in range(m):
        yhat[i] = Regres2_1(testdata[i], dataset, labelset,k)
    return yhat


if __name__ == '__main__':
    dataset, labelset = loaddata('C:\\Users\\SSY\\Desktop\\ttttt\\Ch08\\ex0.txt')

    w1 = standRegres(dataset, labelset)
    plotlogisitc(dataset, labelset, w1)

在这里插入图片描述
该方法可能会出现欠拟合问题，采用局部加权线性回归
使用核* 对测试点附近的点给予更高的权重

w(i,i)=exp(|xi-x|/-2*k^2)

w =(X.TWX).IX.TW*y

代码也在上面体现了
上图的图像是K =100的时候的图像。

岭回归
实质上是一种改良的最小二乘估计法，通过放弃最小二乘法的无偏性，以损失部分信息、降低精度为代价获得回归系数更为符合实际、更可靠的回归方法，对病态数据的拟合要强于最小二乘法。
在这里插入图片描述
实际计算中可选非常多的 k值，做出一个岭迹图，看看这个图在取哪个值的时候变稳定了，那就确定k值了。

函数：var

>>> help(numpy.var)
Help on function var in module numpy:

var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
    Compute the variance along the specified axis.
    
    Returns the variance of the array elements, a measure of the spread of a
    distribution.  The variance is computed for the flattened array by
    default, otherwise over the specified axis.

这里查询了numpy.var的用法，该函数返回平均方差，第二个参数是默认为None，意思为计算所有值的方差，而axis= 0，则按列计算方差，返回一个行向量;axis =1，按行计算方差，返回一个列向量

numpy.mean()和他用法相同，显然返回的是平均值

岭回归的代码如下：

'''
    岭回归
'''
def ridgeRegres(x, y, k=0.2):
    m = np.shape(x)[1]
    xTx = x.T * x + k * np.eye(m)
    if np.linalg.det(xTx)==0:
        print ("行列式为0...")
        return

    ws = xTx.I * (x.T * y)

    return ws

#初始化标准化
def ridgeTest(xarr, yarr):

    xmat=np.mat(xarr)
    ymat=np.mat(yarr).T
    ymean=np.mean(ymat,0)
    ymat =ymat-ymean

    xmean=np.mean(xmat,0)

    xvar =np.var(xmat,0)
    xmat =(xmat-xmean)/xvar

    #测试寻找最好的k的次数
    number =30
    ans= np.zeros((number,np.shape(xmat)[1]))
    for i in range(number):
        ws =ridgeRegres(xmat, ymat, np.exp(i-10))

        ans[i,:]= ws.T
    return ans

第二个函数主要作用是数据标准化，具体做法是所有特征都减去各自均值并除以方差
绘图如下：

>>> import matplotlib.pyplot as plt
>>> fig =plt.figure()
>>> ax =fig.add_subplot(111)
>>> ax.plot(ridgeweights)
[<matplotlib.lines.Line2D object at 0x000001D1A2B79828>, <matplotlib.lines.Line2D object at 0x000001D1A2B79978>, <matplotlib.lines.Line2D object at 0x000001D1A2B79AC8>, <matplotlib.lines.Line2D object at 0x000001D1A2B79C18>, <matplotlib.lines.Line2D object at 0x000001D1A2B79D68>, <matplotlib.lines.Line2D object at 0x000001D1A2B79EB8>, <matplotlib.lines.Line2D object at 0x000001D1A2B83048>, <matplotlib.lines.Line2D object at 0x000001D1A2B83198>]
>>> plt.show()

岭迹图，横坐标单位是log(k)
[岭迹图，横坐标单位是log(k-10)
搜索岭回归的相关知识：
问题：在自变量之间存在复共线性（eg x1=kx3,k为一个常系数）时，普通回归分析估计的方差就很大，估计值不稳定
岭回归思路：
如果自变量之间存在复共线性时，|X’X|约等于0，我们加上一个正常数矩阵kI，使得其行列式不为0
考虑到量纲问题，先对数据标准化处理
k=0时，退化为普通的最小二乘法；k值不同，得到的岭回归估计，也就是前文提到的w向量不同，上图，就是一张岭迹图
岭迹图分析用来了解自变量间的作用和自变量之间相互关系

k值选择原理：
（1）各回归系数的岭估计基本稳定；
（2）用最小二乘估计时符号不合理的回归系数，其岭估计的符号变得合理；
（3）回归系数没有不合乎经济意义的绝对值；
（4）残差平方和增大不太多。
前面那个栗子用的数据是鲍鱼年龄的栗子，似乎普通最小二乘法的结果还不错，可能可以删除几个一直特别小的量（影响效果小）

向前逐步递归

思路：贪心法，每次迭代，修改一个系数（修改一个固定的步长），使误差减少做多

def stagewise(xarr,yarr,eps=0.01,numit=100):
    """
    #向前逐步回归法//贪心
    :param xarr: 数据
    :param yarr: 值
    :param eps: 每次迭代调整的步长
    :param numit:
    :return:
    """
    xmat =np.mat(xarr)
    ymat =np.mat(yarr).T


    ymat =ymat - np.mean(ymat,0)

    xmat =regularize(xmat)

    m,n =np.shape(xmat)

    returnmat=np.zeros((numit,n))
    ws=np.zeros((n,1))
    wstest =ws.copy() #深拷贝
    wsmax =ws.copy()

    for i in range(numit):
        print (ws.T)
        lowesterror =float("inf")
        for j in range(n):
            #-1/1里面选一个，相当于做了+/-各一次
            for sign in [-1,1]:
                wstest =ws.copy()
                wstest[j]+=eps*sign
                ytest=xmat*wstest
                rsse=rsserror(ymat.A,ytest.A)
                if rsse<lowesterror:
                    lowesterror=rsse
                    wsmax =wstest
        ws =wsmax.copy()
        returnmat[i,:]=ws.T
    return returnmat

def regularize(xMat):
    inMat = xMat.copy()
    inMeans = np.mean(inMat,0)
    inVar = np.var(inMat,0)
    inMat = (inMat - inMeans)/inVar
    return inMat

注意两点
1.书中并没有给regularize（）定义，但这是一个规范化的过程，前面岭回归的时候就有类似处理方法，只是这里包装成函数了
2.inf表示最大值，应该用
lowesterror =float(“inf”)
lowesterror=inf是不能得到正确结果的

我感觉这个的主要问题在于限制步长实际上就限制了精度
咸鱼步长=0.001，迭代20000次跑了一下

>>> returnmat
array([[ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   ,  0.   ],
       ...,
       [ 0.041, -0.01 ,  0.119, ..., -0.967, -0.106,  0.185],
       [ 0.042, -0.01 ,  0.119, ..., -0.967, -0.106,  0.185],
       [ 0.041, -0.01 ,  0.119, ..., -0.967, -0.106,  0.185]])

在这里插入图片描述

大概在5000次左右就达到了平衡
在这里插入图片描述
这个方法的好处在于，帮助人们对现有的方法改进。例如发现某个值的系数接近于0，就可以舍弃（不再继续收集相关的信息）

偏差和方差

预测误差and来源：
1.复杂过程的简化
2.无法理解数据真实形成过程
3.测量过程本身也会存在误差
随着模型复杂度提高，预测误差会降低，但方差可能会先降低后升高（后面是过度拟合了）

前文提到的岭方法/向前逐步回归法，将一些系数缩减很小/甚至为0，简化了模型，但增大了模型偏差；8个特征值减少2个，模型更容易理解，降低了预测误差；减少更多特征值，就很可能丢失数据，预测误差变大

实例：预测乐高玩具价格

emmm、、
发现API关闭了，暂时就不做这个项目了，读完书中的代码吧。

主要新的内容是通过交叉验证测试岭回归（确定最佳回归系数，对系数进行缩减），代码抄过来

def crossValidation(xArr,yArr,numVal=10):
    m = len(yArr)                           
    indexList = range(m)
    errorMat = zeros((numVal,30))#create error mat 30columns numVal rows
    for i in range(numVal):
        trainX=[]; trainY=[]
        testX = []; testY = []
        random.shuffle(indexList)
        for j in range(m):#create training set based on first 90% of values in indexList
            if j < m*0.9: 
                trainX.append(xArr[indexList[j]])
                trainY.append(yArr[indexList[j]])
            else:
                testX.append(xArr[indexList[j]])
                testY.append(yArr[indexList[j]])
        wMat = ridgeTest(trainX,trainY)    #get 30 weight vectors from ridge
        for k in range(30):#loop over all of the ridge estimates
            matTestX = mat(testX); matTrainX=mat(trainX)
            meanTrain = mean(matTrainX,0)
            varTrain = var(matTrainX,0)
            matTestX = (matTestX-meanTrain)/varTrain #regularize test with training params
            yEst = matTestX * mat(wMat[k,:]).T + mean(trainY)#test ridge results and store
            errorMat[i,k]=rssError(yEst.T.A,array(testY))
            #print errorMat[i,k]
    meanErrors = mean(errorMat,0)#calc avg performance of the different ridge weight vectors
    minMean = float(min(meanErrors))
    bestWeights = wMat[nonzero(meanErrors==minMean)]
    #can unregularize to get model
    #when we regularized we wrote Xreg = (x-meanX)/var(x)
    #we can now write in terms of x not Xreg:  x*w/var(x) - meanX/var(x) +meanY
    xMat = mat(xArr); yMat=mat(yArr).T
    meanX = mean(xMat,0); varX = var(xMat,0)
    unReg = bestWeights/varX
    print "the best model from Ridge Regression is:\n",unReg
    print "with constant term: ",-1*sum(multiply(meanX,unReg)) + mean(yMat)

知识点有：
1.np.random.shuffle()
对元素进行混洗，实现数据集/测试集随机抽取的作用

>>> help(numpy.random.shuffle)
Help on built-in function shuffle:

shuffle(...) method of mtrand.RandomState instance
    shuffle(x)
    
    Modify a sequence in-place by shuffling its contents.
    Examples
    --------
    >>> arr = np.arange(10)
    >>> np.random.shuffle(arr)
    >>> arr
    [1 7 5 2 9 4 3 6 0 8]

通过这种方法，代码中随机选择了90%训练集，10%测试集
2.ridgeTest()函数在之前岭回归中写过，返回的是系数k取值为log(k-10) …0<=k<30的情况下，各个特征值的系数（30*特征值个数的矩阵）
然后分别用这30组回归系数，带入测试集中，和正确结果比较，计算误差，选择误差最小的一组的k值和回归系数作为结果

3.这里还是没有利用岭回归来减少特征的栗子，也是因为这个实例用只有4个特征值太少了。通过比较岭迹图，对于系数极小的特征值，其影响是不重要的，可以舍弃。