This chapter covers:
- Linear regression
- Locally weighted linear regression
- Ridge regression and stagewise linear regression
- Predicting the age of an abalone and an antique's selling price
8.1 Finding best-fit lines with linear regression
Linear regression:
Pros: Easy to interpret results, computationally inexpensive
Cons: Poorly models nonlinear data
Works with: Numeric values, nominal values
Our goal when using regression is to predict a numeric target value. One way to do this is to write an equation for the target value in terms of the inputs, for example

    y = 0.0015 * x1 + 0.99 * x2

This is known as a regression equation, and the values 0.0015 and 0.99 are the regression weights. The process of finding these regression weights is called regression (more precisely, linear regression). Linear regression means you add up the inputs multiplied by some constants to get the output. Nonlinear regression means the output may be a function of the inputs multiplied together, such as y = x1 * x2, which can't be written as a weighted sum.
General approach to regression
1. Collect: Any method.
2. Prepare: We’ll need numeric values for regression. Nominal values should be mapped to binary values.
3. Analyze: It’s helpful to visualize 2D plots. We can also visualize the regression weights if we apply shrinkage methods.
4. Train: Find the regression weights.
5. Test: We can measure the R², or correlation between the predicted and actual values, to measure the success of our models.
6. Use: With regression, we can forecast a numeric value for a number of inputs. This is an improvement over classification because we’re predicting a continuous value rather than a discrete category.
The squared error is

    $\sum_{i=1}^{m} (y_i - x_i^T w)^2$

In matrix notation this can also be written as $(y - Xw)^T (y - Xw)$. Taking the derivative with respect to $w$ gives $-2X^T(y - Xw)$; setting this equal to zero and solving for $w$ yields

    $\hat{w} = (X^T X)^{-1} X^T y$

The hat on $w$ indicates that this is our best estimate of the true $w$ from the data. Note that the solution requires $X^T X$ to be invertible, which the code below checks before taking the inverse.
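As a quick sanity check of this formula, here is a minimal sketch on invented numbers (the data is made up for illustration; the first column of X is the constant term):

import numpy as np

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.0]])  # x0 = 1 is the constant term
y = np.array([1.2, 2.1, 3.2, 3.6])
w = np.linalg.inv(X.T @ X) @ X.T @ y   # w-hat = (X^T X)^{-1} X^T y
print(w)                               # agrees with np.linalg.lstsq(X, y, rcond=None)[0]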
'''
Author: Maxwell Pan
Date: 2022-04-26 10:46:59
LastEditTime: 2022-04-26 12:10:55
FilePath: \cp08\regression.py
Description: Finding best-fit with linear regression
Software:VSCode,env:
'''
import numpy as np
import matplotlib.pyplot as plt
# this function opens a text file with tab-delimited values
# and assumes the last value is the target value
def loadDataSet(fileName):
    numFeat = len(open(fileName).readline().split('\t')) - 1
    dataMat = []
    labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(curLine[-1]))
    return dataMat, labelMat
# This function computes the best-fit line.
def standRegres(xArr, yArr):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws
import numpy as np
import matplotlib.pyplot as plt
import regression
%matplotlib inline
xArr, yArr = regression.loadDataSet('ex0.txt')
ws = regression.standRegres(xArr, yArr)
ws
# The variable ws now holds our weights: the first weight multiplies the constant term X0, and the second multiplies our input variable X1.
xMat = np.mat(xArr)
yMat = np.mat(yArr)
yHat = xMat*ws
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:,1].flatten().A[0], yMat.T[:,0].flatten().A[0])
# These commands create the figure and plot the original data. To plot the best-fit line we've calculated, sort the points first:
xCopy=xMat.copy()
xCopy.sort(0)
yHat=xCopy*ws
ax.plot(xCopy[:,1],yHat)
plt.show()
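To check how well the fit matches the data (the Test step above), we can compute the correlation between the predicted and actual values with NumPy's corrcoef:

yHat = xMat * ws
np.corrcoef(yHat.T, yMat)  # 2x2 matrix; the off-diagonal entry is the correlation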
8.2 Locally weighted linear regression
One problem with linear regression is that it tends to underfit the data: it gives the lowest mean-squared error among unbiased estimators, but with an underfit model we aren't getting the best predictions. There are a number of ways to reduce this mean-squared error by adding some bias into our estimator.
One way to reduce the mean-squared error is a technique known as locally weighted linear regression (LWLR). In LWLR we give a weight to each data point near the point of interest and then find a least-squares fit on the weighted data, solving

    $\hat{w} = (X^T W X)^{-1} X^T W y$

where W is a matrix that's used to weight the data points.
The most common kernel to use is a Gaussian. The kernel assigns each point a weight given by

    $w(i,i) = \exp\left( \frac{\lVert x^{(i)} - x \rVert^2}{-2k^2} \right)$

where x is the point where we want a prediction and k is a user-defined constant that controls how quickly the weights fall off with distance.
# Locally weighted linear regression function
def lwlr(testPoint, xArr, yArr, k=1.0):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    m = np.shape(xMat)[0]
    weights = np.mat(np.eye(m))  # create diagonal weight matrix
    for j in range(m):
        # Populate weights with exponentially decaying values
        diffMat = testPoint - xMat[j, :]
        weights[j, j] = np.exp(diffMat * diffMat.T / (-2.0 * k**2))
    xTx = xMat.T * (weights * xMat)
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))
    return testPoint * ws
def lwlrTest(testArr, xArr, yArr, k=1.0):
    # Calls lwlr for every point in testArr; useful for trying different k values
    m = np.shape(testArr)[0]
    yHat = np.zeros(m)
    for i in range(m):
        yHat[i] = lwlr(testArr[i], xArr, yArr, k)
    return yHat
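To try LWLR out, a quick session (assuming the ex0.txt dataset used earlier; xArr[0] is just an arbitrary test point):

import regression
xArr, yArr = regression.loadDataSet('ex0.txt')
regression.lwlr(xArr[0], xArr, yArr, 1.0)           # estimate for a single point, k = 1.0
yHat = regression.lwlrTest(xArr, xArr, yArr, 0.01)  # smaller k follows the data more closely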
8.3 Example: predicting the age of an abalone
In the data folder there is some data from the UCI Machine Learning Repository describing the age of a shellfish called abalone; the age of an abalone can be found by counting the rings in its shell.
Add the following function to regression.py (the file already contains loadDataSet, standRegres, lwlr, and lwlrTest from the previous sections):
# Returns the sum of squared errors between the actual and predicted values
def rssError(yArr, yHatArr):
    return ((yArr - yHatArr)**2).sum()
import imp
import regression
import matplotlib.pyplot as plt
imp.reload(regression)
abX,abY=regression.loadDataSet('abalone.txt')
yHat01=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],0.1)
yHat1=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],1)
yHat10=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],10)
regression.rssError(abY[0:99],yHat01.T)
regression.rssError(abY[0:99],yHat1.T)
regression.rssError(abY[0:99],yHat10.T)
yHat01=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],0.1)
regression.rssError(abY[100:199],yHat01.T)
yHat1=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],1)
regression.rssError(abY[100:199],yHat1.T)
yHat10=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],10)
regression.rssError(abY[100:199],yHat10.T)
ws = regression.standRegres(abX[0:99],abY[0:99])
yHat=np.mat(abX[100:199])*ws
regression.rssError(abY[100:199],yHat.T.A)
This example showed how one method, locally weighted linear regression, can be used to build a model that may be better at forecasting than regular regression.
8.4 Shrinking coefficients to understand our data
In this section we'll look at two shrinkage methods: ridge regression, and forward stagewise regression, which is an easy way to approximate the lasso.
8.4.1 Ridge regression
Ridge regression adds an additional matrix $\lambda I$ to the matrix $X^T X$ so that the sum is non-singular and we can take the inverse of the whole thing: $X^T X + \lambda I$. Here $I$ is an m×m identity matrix and $\lambda$ is a user-defined scalar value, which we'll discuss shortly. The formula for estimating our coefficients is now

    $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$

Shrinkage methods allow us to throw out unimportant parameters so that we can get a better feel for, and human understanding of, the data. Shrinkage can also give us a better prediction value than plain linear regression.
# Ridge regression
def ridgeRegres(xMat, yMat, lam=0.2):
    xTx = xMat.T * xMat
    denom = xTx + np.eye(np.shape(xMat)[1]) * lam  # X^T X + lambda*I
    if np.linalg.det(denom) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = denom.I * (xMat.T * yMat)
    return ws
def ridgeTest(xArr, yArr):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    yMean = np.mean(yMat, 0)
    yMat = yMat - yMean            # center the target
    xMeans = np.mean(xMat, 0)
    xVar = np.var(xMat, 0)
    xMat = (xMat - xMeans) / xVar  # standardize the features
    numTestPts = 30
    wMat = np.zeros((numTestPts, np.shape(xMat)[1]))
    for i in range(numTestPts):
        # Try 30 values of lambda, spaced exponentially
        ws = ridgeRegres(xMat, yMat, np.exp(i - 10))
        wMat[i, :] = ws.T
    return wMat
import regression
import imp
imp.reload(regression)
abX,abY=regression.loadDataSet('abalone.txt')
ridgeWeights=regression.ridgeTest(abX,abY)
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ridgeWeights)
plt.show()
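This plot shows the ridge coefficients as a function of the loop index i, which corresponds to λ = exp(i − 10). At the far left, where λ is tiny, the coefficients match ordinary linear regression; at the far right they are all shrunk toward 0. Somewhere in between is a λ that predicts best, which is found with cross-validation.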
8.4.2 The lasso
It can be shown that ridge regression is equivalent to ordinary least squares under the constraint $\sum_{k=1}^{n} w_k^2 \le \lambda$. This means that the sum of the squares of all our weights has to be less than or equal to λ. There's another shrinkage technique called the lasso, which imposes a different constraint on the weights: $\sum_{k=1}^{n} |w_k| \le \lambda$. With λ small enough, some of the weights are forced exactly to zero, which makes the model easier to interpret, but this constraint also makes the solution much harder to compute.
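If you want to see the lasso's effect without implementing a solver, here is a minimal sketch using scikit-learn's Lasso (an assumption for illustration; scikit-learn is not used elsewhere in these notes, and the data below is invented):

from sklearn.linear_model import Lasso
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 8))                            # toy feature matrix
true_w = np.array([4.0, 0, 0, 2.0, 0, 0, 0, 1.5])   # most true weights are zero
y = X @ true_w + 0.1 * rng.standard_normal(100)
model = Lasso(alpha=0.05).fit(X, y)                 # larger alpha forces more weights to exactly 0
print(model.coef_)                                  # several coefficients come out exactly zero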
8.4.3 Forward stagewise regression
There's an easier algorithm than the lasso that gives close results: forward stagewise regression. This is a greedy algorithm: at each step it makes the change to one weight that reduces the error the most at that step. The pseudocode is sketched below.
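Pseudocode for stagewise regression (following the book's description):

    Regularize the data to have 0 mean and unit variance
    For every iteration:
        Set lowestError to +infinity
        For every feature:
            For increasing and decreasing:
                Change one coefficient by eps to get a new W
                Calculate the error with the new W
                If the error is lower than lowestError: set Wbest to the current W
        Set W to Wbest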
Add the following code to regression.py (which already contains all the functions from the previous listings):
# Forward stagewise linear regression
def regularize(xMat):  # regularize by columns
    inMat = xMat.copy()
    inMeans = np.mean(inMat, 0)  # calc mean, then subtract it off
    inVar = np.var(inMat, 0)     # calc variance of Xi, then divide by it
    inMat = (inMat - inMeans) / inVar
    return inMat
def stageWise(xArr, yArr, eps=0.01, numIt=100):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    yMean = np.mean(yMat, 0)
    yMat = yMat - yMean
    xMat = regularize(xMat)
    m, n = np.shape(xMat)
    returnMat = np.zeros((numIt, n))  # record of the weights at each iteration
    ws = np.zeros((n, 1))
    wsTest = ws.copy()
    wsMax = ws.copy()
    for i in range(numIt):
        print(ws.T)
        lowestError = np.inf
        for j in range(n):
            for sign in [-1, 1]:
                # Try moving weight j up or down by eps
                wsTest = ws.copy()
                wsTest[j] += eps * sign
                yTest = xMat * wsTest
                rssE = rssError(yMat.A, yTest.A)
                if rssE < lowestError:
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()
        returnMat[i, :] = ws.T
    return returnMat
import regression
import imp
import numpy as np
imp.reload(regression)
xArr,yArr=regression.loadDataSet('abalone.txt')
regression.stageWise(xArr,yArr,0.01,200)
regression.stageWise(xArr,yArr,0.001,5000)
xMat=np.mat(xArr)
yMat=np.mat(yArr).T
xMat=regression.regularize(xMat)
yM = np.mean(yMat,0)
yMat=yMat-yM
weights=regression.standRegres(xMat,yMat.T)
weights.T
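After enough iterations with a small step size (eps=0.001, 5000 iterations), the stagewise weights end up close to the least-squares weights computed by standRegres on the same standardized data, which is a useful sanity check of the algorithm.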
8.5 The bias/variance tradeoff
8.6 Example: forecasting the price of LEGO sets
Example: using regression to predict the price of a LEGO set
1. Collect: Collect from Google Shopping API.
2. Prepare: Extract price data from the returned JSON.
3. Analyze: Visually inspect the data.
4. Train: We’ll build different models with stagewise linear regression and straightforward linear regression.
5. Test: We’ll use cross-validation to test the different models to see which one performs the best (a minimal sketch of the idea follows this list).
6. Use: The resulting model will be the object of this exercise.
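Since the Google Shopping API used by the book is no longer available, here is a minimal sketch of cross-validated ridge regression using the functions already in regression.py, applied to any tab-delimited dataset (abalone.txt is used here as a stand-in; the dataset choice is an assumption for illustration):

import numpy as np
import regression

def crossValidate(xArr, yArr, numVal=10):
    # For each of numVal random 90/10 train/test splits, fit ridge weights
    # for 30 values of lambda and record the test error of each.
    m = len(yArr)
    indexList = list(range(m))
    errorMat = np.zeros((numVal, 30))  # ridgeTest tries 30 lambda values
    for i in range(numVal):
        np.random.shuffle(indexList)
        trainIdx = indexList[:int(m * 0.9)]
        testIdx = indexList[int(m * 0.9):]
        trainX = [xArr[k] for k in trainIdx]
        trainY = [yArr[k] for k in trainIdx]
        testX = [xArr[k] for k in testIdx]
        testY = [yArr[k] for k in testIdx]
        wMat = regression.ridgeTest(trainX, trainY)  # 30 rows of weights
        # ridgeTest standardized its training data, so transform the test data
        # with the training mean/variance before predicting
        meanTrain = np.mean(np.mat(trainX), 0)
        varTrain = np.var(np.mat(trainX), 0)
        testXmat = (np.mat(testX) - meanTrain) / varTrain
        for j in range(30):
            yEst = testXmat * np.mat(wMat[j, :]).T + np.mean(trainY)
            errorMat[i, j] = regression.rssError(yEst.T.A, np.array(testY))
    return np.mean(errorMat, 0)  # average test error for each lambda

xArr, yArr = regression.loadDataSet('abalone.txt')
print(crossValidate(xArr, yArr))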
Regression is the process of predicting a target value, similar to classification. The difference between regression and classification is that the variable forecasted in regression is continuous, whereas in classification it's discrete. Regression is one of the most useful tools in statistics. Minimizing the sum-of-squares error is used to find the best weights for the input features in a regression equation.