An ordinary linear fit cannot predict all values well, because it tends to underfit — for example, when the data set follows a bell-shaped curve. A high-degree polynomial fit, on the other hand, can pass through every training point, yet it predicts new samples poorly, because it overfits and no longer reflects the true model underlying the data.
Today we introduce a non-parametric learning method called locally weighted regression (LWR). Why is LWR called a non-parametric method? A parametric learning method works like this: after training on all the data, it produces a fixed set of parameters, and new samples are then predicted from those parameters alone; the training data is no longer needed, and the parameter values are fixed. A non-parametric learning method, by contrast, re-trains on the data every time it predicts a new sample and obtains a new set of parameter values. In other words, every prediction depends on the whole training set, so the fitted parameters differ from query to query.
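To make the contrast concrete, here is a minimal sketch of my own (not part of the original post): a parametric model solves for theta once and can then discard the data, while a non-parametric model such as LWR must keep the data and re-solve for each query point.

import numpy as np

# Toy data: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), np.arange(50.0)])   # bias column + feature
y = 1.0 + 2.0 * np.arange(50.0) + rng.normal(0, 3, 50)

# Parametric: fit theta once via the normal equation, then predict
# any number of new points without ever touching X or y again.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.array([1.0, 10.0]) @ theta)        # predict at x = 10

# Non-parametric (LWR): every query re-weights the training set and
# re-solves, so X and y must be kept around for the lifetime of the model.
def lwr_point(query, X, y, c=2.0):
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2.0 * c ** 2))
    W = np.diag(w)
    theta_q = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return query @ theta_q

print(lwr_point(np.array([1.0, 10.0]), X, y))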
Next, let's go over the principle of locally weighted regression.
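Given a query point x, LWR assigns each training sample x^{(i)} a weight through a Gaussian kernel:

    w^{(i)} = \exp\left( -\frac{\lVert x^{(i)} - x \rVert^2}{2c^2} \right)

Samples close to x receive weights near 1 and distant samples receive weights near 0; the bandwidth c controls how quickly the weights decay. The parameters theta are then chosen to minimize the locally weighted cost

    J(\theta) = \sum_i w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

which, with W = diag(w^{(1)}, ..., w^{(n)}) and X the design matrix, has the closed-form solution

    \theta = (X^T W X)^{-1} X^T W y

and the prediction at the query point is \theta^T x. Because theta depends on x through the weights, a fresh fit is required for every new sample — exactly the non-parametric behavior described above, and exactly what the code below implements.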
With the principle above in hand, let's put it into practice with Python code, as follows:
# Python 3.5.3, by 蔡军生
# http://edu.csdn.net/course/detail/2592
# Locally weighted regression
import numpy as np
import random
import matplotlib.pyplot as plt
def gaussian_kernel(x, x0, c, a=1.0):
    """
    Gaussian kernel.

    :Parameters:
      - `x`: nearby data point we are looking at.
      - `x0`: data point we are trying to estimate.
      - `c`, `a`: kernel parameters.
    """
    # Squared Euclidean distance between x and x0
    diff = x - x0
    dot_product = diff * diff.T
    # Cast the 1x1 matrix product to a scalar so the caller can store
    # it directly on the diagonal of the weight matrix.
    return a * np.exp(float(dot_product) / (-2.0 * c ** 2))
def get_weights(training_inputs, datapoint, c=1.0):
    """
    Calculate the weight matrix for a given data point and training data.

    :Parameters:
      - `training_inputs`: training data set the weights should be assigned to.
      - `datapoint`: data point we are trying to predict.
      - `c`: kernel function parameter.

    :Returns:
      NxN weight matrix, where N is the size of `training_inputs`.
    """
    x = np.mat(training_inputs)
    n_rows = x.shape[0]
    # Create a diagonal weight matrix from the identity matrix
    weights = np.mat(np.eye(n_rows))
    for i in range(n_rows):
        weights[i, i] = gaussian_kernel(datapoint, x[i], c)
    return weights
def lwr_predict(training_inputs, training_outputs, datapoint, c=1.0):
    """
    Predict a data point by fitting a local regression.

    :Parameters:
      - `training_inputs`: training input data.
      - `training_outputs`: training outputs.
      - `datapoint`: data point we want to predict.
      - `c`: kernel parameter.

    :Returns:
      Estimated value at `datapoint`.
    """
    weights = get_weights(training_inputs, datapoint, c=c)
    x = np.mat(training_inputs)
    y = np.mat(training_outputs).T
    # Weighted normal equation: theta = (X^T W X)^-1 X^T W y
    xtwx = x.T * (weights * x)
    betas = xtwx.I * (x.T * (weights * y))
    return datapoint * betas
def genData(numPoints, bias, variance):
    x = np.zeros(shape=(numPoints, 2))
    y = np.zeros(shape=numPoints)
    # Generate points scattered around a straight line
    for i in range(0, numPoints):
        # Bias (intercept) column
        x[i][0] = 1
        x[i][1] = i
        # Target value; note `variance` actually acts as the slope here,
        # and the uniform term adds the noise
        y[i] = bias + i * variance + random.uniform(0, 1) * 20
    return x, y
# Generate the data
a1, a2 = genData(100, 10, 0.6)
a3 = []
# Predict every training point with LWR
for i in a1:
    pred = lwr_predict(a1, a2, i, 1)
    a3.append(float(pred))
plt.plot(a1[:, 1], a2, "x")
plt.plot(a1[:, 1], a3, "r-")
plt.show()
The bandwidth c controls the smoothness of the fit: a small c lets the curve follow local fluctuations closely, while a large c weights distant points more heavily and pulls the fit toward an ordinary straight line.
Result plot with c = 1.0:
Result plot with c = 2.0:
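To see the effect of the bandwidth directly, a small sweep over c can be run with the functions defined above. This is a sketch of mine rather than part of the original post; it reuses genData and lwr_predict from the listing.

# Sweep several bandwidths and overlay the resulting LWR curves.
a1, a2 = genData(100, 10, 0.6)
plt.plot(a1[:, 1], a2, "x", label="data")
for c in (0.5, 1.0, 2.0, 5.0):
    preds = [float(lwr_predict(a1, a2, point, c)) for point in a1]
    plt.plot(a1[:, 1], preds, label="c = %.1f" % c)
plt.legend()
plt.show()

Keep in mind that each call to lwr_predict rebuilds the full NxN weight matrix and solves the weighted normal equation from scratch, so the sweep costs O(N^2) kernel evaluations per curve — the price of the non-parametric approach.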