This chapter covers:
- Linear regression
- Locally weighted linear regression
- Ridge regression and stagewise linear regression
- Predicting the age of an abalone and an antique's selling price
8.1 Finding best-fit lines with linear regression
Linear regression:
Pros: Easy to interpret results, computationally inexpensive
Cons: Poorly models nonlinear data
Works with: Numeric values, nominal values
Our goal when using regression is to predict a numeric target value. One way to do this is to write an equation for the target value in terms of the inputs, for example

    y = 0.0015 * x1 + 0.99 * x2

This is known as a regression equation, and the values 0.0015 and 0.99 are the regression weights. The process of finding these regression weights is called regression (more precisely, linear regression). Linear regression means you add up the inputs multiplied by some constants to get the output. Nonlinear regression means the output may be a function of the inputs multiplied together, such as y = x1 * x2, which can't be written as a weighted sum.
General approach to regression
1. Collect: Any method.
2. Prepare: We’ll need numeric values for regression. Nominal values should be mapped to binary values.
3. Analyze: It’s helpful to visualize 2D plots. We can also visualize the regression weights if we apply shrinkage methods.
4. Train: Find the regression weights.
5. Test: We can measure the R², or correlation between the predicted and actual values, to measure the success of our models.
6. Use: With regression, we can forecast a numeric value for a number of inputs. This is an improvement over classification because we’re predicting a continuous value rather than a discrete category.
The squared error is

    $\sum_{i=1}^{m} (y_i - x_i^T w)^2$

In matrix notation this can also be written as $(y - Xw)^T (y - Xw)$. Taking the derivative with respect to $w$ gives $-2X^T(y - Xw)$; setting this equal to zero and solving for $w$ yields

    $\hat{w} = (X^T X)^{-1} X^T y$

The hat on $w$ indicates that this is our best estimate of the true $w$ from the data. Note that the solution requires $X^T X$ to be invertible, which the code below checks before taking the inverse.
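As a quick sanity check of this formula, here is a minimal sketch on invented numbers (the data is made up for illustration; the first column of X is the constant term):

import numpy as np

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.0]])  # x0 = 1 is the constant term
y = np.array([1.2, 2.1, 3.2, 3.6])
w = np.linalg.inv(X.T @ X) @ X.T @ y   # w-hat = (X^T X)^{-1} X^T y
print(w)                               # agrees with np.linalg.lstsq(X, y, rcond=None)[0]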
'''
Author: Maxwell Pan
Date: 2022-04-26 10:46:59
LastEditTime: 2022-04-26 12:10:55
FilePath: \cp08\regression.py
Description: Finding best-fit with linear regression
Software:VSCode,env:
'''
import numpy as np
import matplotlib.pyplot as plt
# this function opens a text file with tab-delimited values
# and assumes the last value is the target value
def loadDataSet(fileName):
    numFeat = len(open(fileName).readline().split('\t')) - 1
    dataMat = []
    labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(curLine[-1]))
    return dataMat, labelMat
# This function computes the best-fit line.
def standRegres(xArr, yArr):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws
import numpy as np
import matplotlib.pyplot as plt
import regression
%matplotlib inline
xArr, yArr = regression.loadDataSet('ex0.txt')
ws = regression.standRegres(xArr, yArr)
ws
# The variable ws now holds our weights: the first weight multiplies the constant term X0, and the second multiplies our input variable X1.
xMat = np.mat(xArr)
yMat = np.mat(yArr)
yHat = xMat*ws
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:,1].flatten().A[0], yMat.T[:,0].flatten().A[0])
# These commands create the figure and plot the original data. To plot the best-fit line we've calculated, sort the points first:
xCopy=xMat.copy()
xCopy.sort(0)
yHat=xCopy*ws
ax.plot(xCopy[:,1],yHat)
plt.show()
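To check how well the fit matches the data (the Test step above), we can compute the correlation between the predicted and actual values with NumPy's corrcoef:

yHat = xMat * ws
np.corrcoef(yHat.T, yMat)  # 2x2 matrix; the off-diagonal entry is the correlation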
8.2 Locally weighted linear regression
One problem with linear regression is that it tends to underfit the data: it gives the lowest mean-squared error among unbiased estimators, but with an underfit model we aren't getting the best predictions. There are a number of ways to reduce this mean-squared error by adding some bias into our estimator.
One way to reduce the mean-squared error is a technique known as locally weighted linear regression (LWLR). In LWLR we give a weight to each data point near the point of interest and then find a least-squares fit on the weighted data, solving

    $\hat{w} = (X^T W X)^{-1} X^T W y$

where W is a matrix that's used to weight the data points.
The most common kernel to use is a Gaussian. The kernel assigns each point a weight given by

    $w(i,i) = \exp\left( \frac{\lVert x^{(i)} - x \rVert^2}{-2k^2} \right)$

where x is the point where we want a prediction and k is a user-defined constant that controls how quickly the weights fall off with distance.
# Locally weighted linear regression function
def lwlr(testPoint, xArr, yArr, k=1.0):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    m = np.shape(xMat)[0]
    weights = np.mat(np.eye(m))  # create diagonal weight matrix
    for j in range(m):
        # Populate weights with exponentially decaying values
        diffMat = testPoint - xMat[j, :]
        weights[j, j] = np.exp(diffMat * diffMat.T / (-2.0 * k**2))
    xTx = xMat.T * (weights * xMat)
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))
    return testPoint * ws
def lwlrTest(testArr, xArr, yArr, k=1.0):
    # Calls lwlr for every point in testArr; useful for trying different k values
    m = np.shape(testArr)[0]
    yHat = np.zeros(m)
    for i in range(m):
        yHat[i] = lwlr(testArr[i], xArr, yArr, k)
    return yHat
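To try LWLR out, a quick session (assuming the ex0.txt dataset used earlier; xArr[0] is just an arbitrary test point):

import regression
xArr, yArr = regression.loadDataSet('ex0.txt')
regression.lwlr(xArr[0], xArr, yArr, 1.0)           # estimate for a single point, k = 1.0
yHat = regression.lwlrTest(xArr, xArr, yArr, 0.01)  # smaller k follows the data more closely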
8.3 Example: predicting the age of an abalone
In the data folder there is some data from the UCI Machine Learning Repository describing the age of a shellfish called abalone; the age of an abalone can be found by counting the rings in its shell.
Add the following function to regression.py (the file already contains loadDataSet, standRegres, lwlr, and lwlrTest from the previous sections):
# Returns the sum of squared errors between the actual and predicted values
def rssError(yArr, yHatArr):
    return ((yArr - yHatArr)**2).sum()
import imp
import regression
import matplotlib.pyplot as plt
imp.reload(regression)
abX,abY=regression.loadDataSet('abalone.txt')
yHat01=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],0.1)
yHat1=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],1)
yHat10=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],10)
regression.rssError(abY[0:99],yHat01.T)
regression.rssError(abY[0:99],yHat1.T)
regression.rssError(abY[0:99],yHat10.T)
yHat01=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],0.1)
regression.rssError(abY[100:199],yHat01.T)
yHat1=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],1)
regression.rssError(abY[100:199],yHat1.T)
yHat10=regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],10)
regression.rssError(abY[100:199],yHat10.T)
ws = regression.standRegres(abX[0:99],abY[0:99])
yHat=np.mat(abX[100:199])*ws
regression.rssError(abY[100:199],yHat.T.A)
This example showed how one method, locally weighted linear regression, can be used to build a model that may be better at forecasting than regular regression.
8.4 Shrinking coefficients to understand our data
In this section we'll look at two shrinkage methods: ridge regression, and forward stagewise regression, which is an easy way to approximate the lasso.
8.4.1 Ridge regression
Ridge regression adds an additional matrix $\lambda I$ to the matrix $X^T X$ so that the sum is non-singular and we can take the inverse of the whole thing: $X^T X + \lambda I$. Here $I$ is an m×m identity matrix and $\lambda$ is a user-defined scalar value, which we'll discuss shortly. The formula for estimating our coefficients is now

    $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$

Shrinkage methods allow us to throw out unimportant parameters so that we can get a better feel for, and human understanding of, the data. Shrinkage can also give us a better prediction value than plain linear regression.
# Ridge regression
def ridgeRegres(xMat, yMat, lam=0.2):
    xTx = xMat.T * xMat
    denom = xTx + np.eye(np.shape(xMat)[1]) * lam  # X^T X + lambda*I
    if np.linalg.det(denom) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = denom.I * (xMat.T * yMat)
    return ws
def ridgeTest(xArr, yArr):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    yMean = np.mean(yMat, 0)
    yMat = yMat - yMean            # center the target
    xMeans = np.mean(xMat, 0)
    xVar = np.var(xMat, 0)
    xMat = (xMat - xMeans) / xVar  # standardize the features
    numTestPts = 30
    wMat = np.zeros((numTestPts, np.shape(xMat)[1]))
    for i in range(numTestPts):
        # Try 30 values of lambda, spaced exponentially
        ws = ridgeRegres(xMat, yMat, np.exp(i - 10))
        wMat[i, :] = ws.T
    return wMat
import regression
import imp
imp.reload(regression)
abX,abY=regression.loadDataSet('abalone.txt')
ridgeWeights=regression.ridgeTest(abX,abY)
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ridgeWeights)
plt.show()
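This plot shows the ridge coefficients as a function of the loop index i, which corresponds to λ = exp(i − 10). At the far left, where λ is tiny, the coefficients match ordinary linear regression; at the far right they are all shrunk toward 0. Somewhere in between is a λ that predicts best, which is found with cross-validation.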
8.4.2 The lasso
It can be shown that ridge regression is equivalent to ordinary least squares under the constraint $\sum_{k=1}^{n} w_k^2 \le \lambda$. This means that the sum of the squares of all our weights has to be less than or equal to λ. There's another shrinkage technique called the lasso, which imposes a different constraint on the weights: $\sum_{k=1}^{n} |w_k| \le \lambda$. With λ small enough, some of the weights are forced exactly to zero, which makes the model easier to interpret, but this constraint also makes the solution much harder to compute.
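If you want to see the lasso's effect without implementing a solver, here is a minimal sketch using scikit-learn's Lasso (an assumption for illustration; scikit-learn is not used elsewhere in these notes, and the data below is invented):

from sklearn.linear_model import Lasso
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 8))                            # toy feature matrix
true_w = np.array([4.0, 0, 0, 2.0, 0, 0, 0, 1.5])   # most true weights are zero
y = X @ true_w + 0.1 * rng.standard_normal(100)
model = Lasso(alpha=0.05).fit(X, y)                 # larger alpha forces more weights to exactly 0
print(model.coef_)                                  # several coefficients come out exactly zero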
8.4.3 Forward stagewise regression
There's an easier algorithm than the lasso that gives close results: forward stagewise regression. This is a greedy algorithm: at each step it makes the change to one weight that reduces the error the most at that step. The pseudocode is sketched below.
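Pseudocode for stagewise regression (following the book's description):

    Regularize the data to have 0 mean and unit variance
    For every iteration:
        Set lowestError to +infinity
        For every feature:
            For increasing and decreasing:
                Change one coefficient by eps to get a new W
                Calculate the error with the new W
                If the error is lower than lowestError: set Wbest to the current W
        Set W to Wbest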
Add the following code to regression.py (which already contains all the functions from the previous listings):
# Forward stagewise linear regression
def regularize(xMat):  # regularize by columns
    inMat = xMat.copy()
    inMeans = np.mean(inMat, 0)  # calc mean, then subtract it off
    inVar = np.var(inMat, 0)     # calc variance of Xi, then divide by it
    inMat = (inMat - inMeans) / inVar
    return inMat
def stageWise(xArr, yArr, eps=0.01, numIt=100):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T
    yMean = np.mean(yMat, 0)
    yMat = yMat - yMean
    xMat = regularize(xMat)
    m, n = np.shape(xMat)
    returnMat = np.zeros((numIt, n))  # record of the weights at each iteration
    ws = np.zeros((n, 1))
    wsTest = ws.copy()
    wsMax = ws.copy()
    for i in range(numIt):
        print(ws.T)
        lowestError = np.inf
        for j in range(n):
            for sign in [-1, 1]:
                # Try moving weight j up or down by eps
                wsTest = ws.copy()
                wsTest[j] += eps * sign
                yTest = xMat * wsTest
                rssE = rssError(yMat.A, yTest.A)
                if rssE < lowestError:
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()
        returnMat[i, :] = ws.T
    return returnMat
import regression
import imp
import numpy as np
imp.reload(regression)
xArr,yArr=regression.loadDataSet('abalone.txt')
regression.stageWise(xArr,yArr,0.01,200)
regression.stageWise(xArr,yArr,0.001,5000)
xMat=np.mat(xArr)
yMat=np.mat(yArr).T
xMat=regression.regularize(xMat)
yM = np.mean(yMat,0)
yMat=yMat-yM
weights=regression.standRegres(xMat,yMat.T)
weights.T
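After enough iterations with a small step size (eps=0.001, 5000 iterations), the stagewise weights end up close to the least-squares weights computed by standRegres on the same standardized data, which is a useful sanity check of the algorithm.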
8.5 The bias/variance tradeoff
8.6 Example: forecasting the price of LEGO sets
Example: using regression to predict the price of a LEGO set
1. Collect: Collect from Google Shopping API.
2. Prepare: Extract price data from the returned JSON.
3. Analyze: Visually inspect the data.
4. Train: We’ll build different models with stagewise linear regression and straightforward linear regression.
5. Test: We’ll use cross-validation to test the different models to see which one performs the best (a minimal sketch of the idea follows this list).
6. Use: The resulting model will be the object of this exercise.
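Since the Google Shopping API used by the book is no longer available, here is a minimal sketch of cross-validated ridge regression using the functions already in regression.py, applied to any tab-delimited dataset (abalone.txt is used here as a stand-in; the dataset choice is an assumption for illustration):

import numpy as np
import regression

def crossValidate(xArr, yArr, numVal=10):
    # For each of numVal random 90/10 train/test splits, fit ridge weights
    # for 30 values of lambda and record the test error of each.
    m = len(yArr)
    indexList = list(range(m))
    errorMat = np.zeros((numVal, 30))  # ridgeTest tries 30 lambda values
    for i in range(numVal):
        np.random.shuffle(indexList)
        trainIdx = indexList[:int(m * 0.9)]
        testIdx = indexList[int(m * 0.9):]
        trainX = [xArr[k] for k in trainIdx]
        trainY = [yArr[k] for k in trainIdx]
        testX = [xArr[k] for k in testIdx]
        testY = [yArr[k] for k in testIdx]
        wMat = regression.ridgeTest(trainX, trainY)  # 30 rows of weights
        # ridgeTest standardized its training data, so transform the test data
        # with the training mean/variance before predicting
        meanTrain = np.mean(np.mat(trainX), 0)
        varTrain = np.var(np.mat(trainX), 0)
        testXmat = (np.mat(testX) - meanTrain) / varTrain
        for j in range(30):
            yEst = testXmat * np.mat(wMat[j, :]).T + np.mean(trainY)
            errorMat[i, j] = regression.rssError(yEst.T.A, np.array(testY))
    return np.mean(errorMat, 0)  # average test error for each lambda

xArr, yArr = regression.loadDataSet('abalone.txt')
print(crossValidate(xArr, yArr))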
Regression is the process of predicting a target value, similar to classification. The difference between regression and classification is that the variable forecasted in regression is continuous, whereas in classification it's discrete. Regression is one of the most useful tools in statistics. Minimizing the sum-of-squares error is used to find the best weights for the input features in a regression equation.