Predicting numeric values: regression

This chapter covers:

  • Linear regression
  • Locally weighted linear regression
  • Ridge regression and stagewise linear regression
  • Predicting the age of an abalone and the selling price of antique toys

8.1 Finding best-fit lines with linear regression

Linear regression:

Pros: Easy to interpret results, computationally inexpensive

Cons: Poorly models nonlinear data

Works with: Numeric values, nominal values

Our goal when using regression is to predict a numeric target value.

regression equation:

HorsePower = 0.0015*annualSalary - 0.99*hoursListeningToPublicRadio

regression weights:

0.0015 and -0.99

regression (linear regression):

The process of finding these regression weights.

Linear regression means:

you add up the inputs multiplied by some constants to get the output.

Nonlinear regression means:

the output may be a function of the inputs multiplied together, as in:

HorsePower = 0.0015*annualSalary/hoursListeningToPublicRadio

General approach to regression

1. Collect: Any method.

2. Prepare: We’ll need numeric values for regression. Nominal values should be mapped to binary values.

3. Analyze: It’s helpful to visualize the data in 2D plots. We can also visualize the regression weights if we apply shrinkage methods.

4. Train: Find the regression weights.

5. Test: We can measure R2, the correlation between the predicted values and the actual data, to measure the success of our models.

6. Use: With regression, we can forecast a numeric value for a number of inputs. This is an improvement over classification because we’re predicting a continuous value rather than a discrete category.

To find the best w we minimize the squared error

\sum_{i=1}^{m}\left(y_{i}-x_{i}^{T}w\right)^{2}

In matrix notation this can also be written as

(y-Xw)^{T}(y-Xw)

Taking the derivative with respect to w gives

X^{T}(y-Xw)

Setting this equal to zero and solving for w yields

\widehat{w} = (X^{T}X)^{-1}X^{T}y

'''
Author: Maxwell Pan
Date: 2022-04-26 10:46:59
LastEditTime: 2022-04-26 12:10:55
FilePath: \cp08\regression.py
Description: Finding best-fit with linear regression
Software:VSCode,env:
'''
import numpy as np
import matplotlib.pyplot as plt
# this function opens a text file with tab-delimited values
#  and assumes the last value is the target value
def loadDataSet(fileName):
    numFeat = len(open(fileName).readline().split('\t')) - 1
    dataMat = []
    labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(curLine[-1]))
    return dataMat,labelMat
# this function computes the best-fit line via the normal equations
def standRegres(xArr,yArr):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    xTx = xMat.T*xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws
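
For reference (this helper is not part of the book's regression.py), the same least-squares fit can be computed with plain ndarrays instead of np.matrix; np.linalg.lstsq also copes better with a nearly singular X^T X. A minimal sketch:

def standRegresArr(xArr, yArr):
    X = np.asarray(xArr, dtype=float)
    y = np.asarray(yArr, dtype=float)
    # lstsq minimizes ||Xw - y||^2 directly, without forming (X^T X)^{-1}
    w, residuals, rank, singularValues = np.linalg.lstsq(X, y, rcond=None)
    return w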

import regression
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

xArr,yArr = regression.loadDataSet('ex0.txt')   # assuming ex0.txt, the two-column sample data from the book's source code
ws = regression.standRegres(xArr,yArr)
ws
# ws now holds our weights: the first one multiplies the constant term, the second one multiplies our input variable X1.
xMat = np.mat(xArr)
yMat = np.mat(yArr)
yHat = xMat*ws
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:,1].flatten().A[0], yMat.T[:,0].flatten().A[0])
# These commands create the figure and plot the original data. To plot the best-fit line, sort the points first so the line is drawn in order:
xCopy=xMat.copy()
xCopy.sort(0)
yHat=xCopy*ws
ax.plot(xCopy[:,1],yHat)
plt.show()
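
As a quick check of success (step 5 of the general approach), we can compute the correlation between the predicted and actual values. A minimal sketch, reusing xMat, yMat, and ws from the session above; the predictions are recomputed on the unsorted data so the rows line up with yMat:

yHat = xMat*ws
np.corrcoef(yHat.T, yMat)   # the off-diagonal entries are the correlation between yHat and y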

 

8.2 Locally weighted linear regression

One problem with linear regression is that it tends to underfit the data. It gives us the lowest mean-squared error among unbiased estimators, but with an underfit model we aren't getting the best predictions. There are a number of ways to reduce this mean-squared error by adding some bias into our estimator.

One way to reduce the mean-squared error is a technique known as locally weighted linear regression (LWLR). In LWLR we give a weight to the data points near the point where we want a prediction, and then solve a weighted least-squares problem. The solution is

\widehat{w}=(X^{T}WX)^{-1}X^{T}Wy

where W is a matrix that's used to weight the data points.

The most common kernel to use is a Gaussian kernel, which assigns each training point a weight given by

w(i,i) = \exp\left(\frac{\left| x^{(i)}-x \right|^{2}}{-2k^{2}}\right)

where x is the point at which we want a prediction and k controls how quickly the weights fall off with distance.

 

# Locally weighted linear regression function
def lwlr(testPoint,xArr,yArr,k=1.0):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    m = np.shape(xMat)[0]
    weights = np.mat(np.eye(m))   # identity matrix; its diagonal will hold the weights
    for j in range(m):
        # Populate weights with exponentially decaying values
        diffMat = testPoint - xMat[j,:]
        weights[j,j] = np.exp(diffMat*diffMat.T/(-2.0*k**2))
    xTx = xMat.T * (weights * xMat)
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))
    return testPoint * ws

def lwlrTest(testArr,xArr,yArr,k=1.0):
    m = np.shape(testArr)[0]
    yHat = np.zeros(m)
    for i in range(m):
        yHat[i] = lwlr(testArr[i],xArr,yArr,k)
    return yHat
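
A quick way to see the effect of the smoothing parameter k is to evaluate the fit at every training point and plot the fitted curve over the raw data. A minimal sketch, assuming the functions above are available and that xArr, yArr hold the same two-column data used in section 8.1 (constant term in column 0, feature in column 1):

dataArr = np.array(xArr)
srtInd = dataArr[:,1].argsort()               # plotting order along the feature axis
yHat = lwlrTest(xArr, xArr, yArr, k=0.01)     # smaller k lets the fit follow the data more closely
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(dataArr[srtInd,1], yHat[srtInd])      # LWLR fit
ax.scatter(dataArr[:,1], yArr, s=2, c='red')  # raw data
plt.show()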

 

8.3 Example: predicting the age of an abalone

In the data folder there is some data from the UCI data repository describing the age of a shellfish called abalone.

For this example we add an error-measurement function to regression.py (loadDataSet, standRegres, lwlr, and lwlrTest are unchanged from the listings above):

# sum of squared errors between the actual and predicted values
def rssError(yArr,yHatArr):
    return ((yArr-yHatArr)**2).sum()

 

import imp
import regression
import matplotlib.pyplot as plt
imp.reload(regression)
abX,abY=regression.loadDataSet('abalone.txt')
yHat01=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],0.1)
yHat1=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],1)
yHat10=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],10)
regression.rssError(abY[0:99],yHat01.T)

regression.rssError(abY[0:99],yHat1.T)

regression.rssError(abY[0:99],yHat10.T)

yHat01=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],0.1)
regression.rssError(abY[100:199],yHat01.T)

yHat1=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],1)
regression.rssError(abY[100:199],yHat1.T)

yHat10=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],10)
regression.rssError(abY[100:199],yHat10.T)

ws = regression.standRegres(abX[0:99],abY[0:99])
yHat=np.mat(abX[100:199])*ws
regression.rssError(abY[100:199],yHat.T.A)

This example showed how one method—locally weighted linear regression—can
be used to build a model that may be better at forecasting than regular regression.

8.4 Shrinking coefficients to understand our data

Ridge regression is the first of two shrinkage methods we'll look at in this section.

Forward stagewise regression is the second; it is an easy way to approximate the lasso.

8.4.1 Ridge regression

Ridge regression adds an additional matrix \lambda I to the matrix X^TX so that the sum is non-singular and we can invert the whole thing: X^TX+\lambda I. Here I is an identity matrix of the same size as X^TX, and \lambda is a user-defined scalar value, which we'll discuss shortly.

The formula for estimating our coefficients is now:

\widehat{w} = (X^TX+\lambda I)^{-1}X^Ty

Shrinkage methods allow us to throw out unimportant parameters so that we can get a better feel for and understanding of the data.

Shrinkage can give us a better prediction value than linear regression.

# Ridge regression

def ridgeRegres(xMat,yMat, lam=0.2):
    xTx = xMat.T*xMat
    denom = xTx + np.eye(np.shape(xMat)[1])*lam   # X^T X + lambda * I
    if np.linalg.det(denom) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = denom.I * (xMat.T*yMat)
    return ws

def ridgeTest(xArr,yArr):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    yMean = np.mean(yMat,0)
    yMat = yMat - yMean                # center the target values
    xMeans = np.mean(xMat,0)
    xVar = np.var(xMat,0)
    xMat = (xMat - xMeans) / xVar      # center the features and divide by their variance
    numTestPts = 30
    wMat = np.zeros((numTestPts,np.shape(xMat)[1]))
    for i in range(numTestPts):
        ws = ridgeRegres(xMat,yMat,np.exp(i-10))   # lambda increases exponentially from e^-10
        wMat[i,:] = ws.T
    return wMat

 

import regression

import imp

imp.reload(regression)

abX,abY=regression.loadDataSet('abalone.txt')

ridgeWeights=regression.ridgeTest(abX,abY)

import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ridgeWeights)
plt.show()
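
The 30 rows of ridgeWeights correspond to \lambda = e^{i-10} for i = 0, ..., 29, so the x-axis of the plot above is really log(\lambda) shifted by 10. A minimal variant that labels the axis explicitly (assuming the same ridgeWeights as above):

import numpy as np
logLambdas = np.arange(30) - 10           # ridgeTest uses lambda = exp(i - 10)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(logLambdas, ridgeWeights)         # one curve per regression coefficient
ax.set_xlabel('log(lambda)')
ax.set_ylabel('coefficient value')
plt.show()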

8.4.2 The lasso

Ridge regression can also be viewed as ordinary least squares with a constraint on the weights:

\sum_{k=1}^{n}{w_{k}}^{2} \leq \lambda

This means that the sum of the squares of all our weights has to be less than or equal to \lambda.

There's another shrinkage technique called the lasso. The lasso imposes a different constraint on the weights:

\sum_{k=1}^{n}\left| w_{k} \right| \leq \lambda

The only difference is that absolute values replace the squares, but the consequences are significant: with \lambda small enough, some weights are forced all the way to zero, which makes the model easier to interpret, at the cost of a much harder optimization problem.

8.4.3 Forward stagewise regression

There's an easier algorithm than the lasso that gives comparable results: forward stagewise regression. It is a greedy algorithm: starting with all weights at zero, at each step it makes the single change to one weight that reduces the error the most, as sketched below.
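
In pseudocode (a sketch that mirrors the stageWise function listed below):

Regularize the data (subtract the column means, divide by the column variances)
For every iteration:
    Set lowestError to +infinity
    For every feature:
        For increasing and decreasing that feature's weight by eps:
            Compute the squared error with this trial weight vector
            If the error is lower than lowestError: remember this weight vector
    Set W to the best weight vector found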

 

 

For forward stagewise regression, two more functions are added to regression.py (loadDataSet, standRegres, lwlr, lwlrTest, rssError, ridgeRegres, and ridgeTest are unchanged from the listings above):

# Forward stagewise linear regression
def regularize(xMat):#regularize by columns
    inMat = xMat.copy()
    inMeans = np.mean(inMat,0)   #calc mean then subtract it off
    inVar = np.var(inMat,0)      #calc variance of Xi then divide by it
    inMat = (inMat - inMeans)/inVar
    return inMat

    
def stageWise(xArr,yArr,eps=0.01,numIt=100):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    yMean = np.mean(yMat,0)
    yMat = yMat - yMean
    xMat = regularize(xMat)
    m,n = np.shape(xMat)
    returnMat = np.zeros((numIt,n))   # record the weights at every iteration
    ws = np.zeros((n,1))
    wsTest = ws.copy()
    wsMax = ws.copy()
    for i in range(numIt):
        print(ws.T)
        lowestError = np.inf
        for j in range(n):
            for sign in [-1,1]:
                wsTest = ws.copy()
                wsTest[j] += eps*sign
                yTest = xMat*wsTest
                rssE = rssError(yMat.A,yTest.A)
                if rssE < lowestError:
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()
        returnMat[i,:]=ws.T
    return returnMat

import regression
import imp
import numpy as np

imp.reload(regression)
xArr,yArr=regression.loadDataSet('abalone.txt')
regression.stageWise(xArr,yArr,0.01,200)

regression.stageWise(xArr,yArr,0.001,5000)


xMat=np.mat(xArr)
yMat=np.mat(yArr).T
xMat=regression.regularize(xMat)
yM = np.mean(yMat,0)
yMat=yMat-yM
weights=regression.standRegres(xMat,yMat.T)
weights.T

8.5 The bias/variance tradeoff

8.6 Example: forecasting the price of LEGO sets

Example: using regression to predict the price of a LEGO set

1. Collect: Collect from Google Shopping API.

2. Prepare: Extract price data from the returned JSON.

3. Analyze: Visually inspect the data.

4. Train: We’ll build different models with stagewise linear regression and straightforward linear regression.

5. Test: We’ll use cross-validation to test the different models to see which one performs best (a sketch of this step follows the list).

6. Use: The resulting model will be the object of this exercise.
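
The Google Shopping data itself is not reproduced here, but step 5 can be sketched with the functions already defined in regression.py. The following is a minimal sketch, not the book's exact cross-validation code: it repeatedly splits any (xArr, yArr) pair, such as the abalone data above, into 90% training and 10% test data, fits the 30 ridge models produced by ridgeTest, and reports the average held-out error for each \lambda.

import random

def crossValidateRidge(xArr, yArr, numVal=10):
    # numVal random 90/10 train/test splits; returns the mean test error
    # for each of the 30 lambda values tried by ridgeTest
    m = len(yArr)
    indexList = list(range(m))
    errorMat = np.zeros((numVal, 30))
    for i in range(numVal):
        random.shuffle(indexList)
        trainIdx, testIdx = indexList[:int(m*0.9)], indexList[int(m*0.9):]
        trainX = [xArr[j] for j in trainIdx]; trainY = [yArr[j] for j in trainIdx]
        testX  = [xArr[j] for j in testIdx];  testY  = [yArr[j] for j in testIdx]
        wMat = ridgeTest(trainX, trainY)               # 30 rows of ridge weights
        # ridgeTest centers y and rescales X, so apply the same transform to the test data
        matTrainX, matTestX = np.mat(trainX), np.mat(testX)
        meanTrain, varTrain = np.mean(matTrainX, 0), np.var(matTrainX, 0)
        matTestX = (matTestX - meanTrain) / varTrain
        for k in range(30):
            yEst = matTestX * np.mat(wMat[k, :]).T + np.mean(trainY)
            errorMat[i, k] = rssError(np.array(testY), yEst.T.A)
    return np.mean(errorMat, 0)                        # average error per lambda

For example, crossValidateRidge(abX, abY) on the abalone data returns 30 averaged errors; the index of the smallest one identifies the \lambda that generalizes best.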

Regression, like classification, is the process of predicting a target value. The difference between regression and classification is that the variable forecasted in regression is continuous, whereas it's discrete in classification. Regression is one of the most useful tools in statistics. Minimizing the sum-of-squares error is used to find the best weights for the input features in a regression equation.
