Linear Regression
The Concept of Linear Regression
The formal definition of "regression" is as follows: let $D=\{(X^1, Y^1), (X^2, Y^2), \dots, (X^m, Y^m)\}$ be a dataset of $m$ training samples, where $X^i=(X_1^i, X_2^i, \dots, X_n^i)^T$ is the input feature vector of the $i$-th training sample and $Y^i \in \mathbb{R}$ is its real-valued output. The core task of regression is this: given a dataset $D$ of input/output pairs, construct a model $T$ (usually written $f(x)$) that fits the relationship between the inputs and outputs in $D$ as closely as possible. The model can then be applied to a new input $X_{new}$ to produce the prediction $f(X_{new})$.
Simply put, linear regression assumes a linear relationship between an input variable ($x$) and a single output variable ($y$). The model can be written as $y = w_1 x + w_0$.
The difference between a value $y_{hat}$ predicted by the fitted regression line and the true value $y_{true}$ is called the residual. Our goal is to find a fitted line that minimizes the total error $\sum_{i=0}^{n}|y_{true}-y_{hat}|$. The method for finding such a line is the method of least squares.
$$H_{min} = \sum(y_i-(y_i)_{hat})^2 = \sum(y_i - w_1 x_i - w_0)^2$$

To minimize $H$, we take the partial derivatives with respect to $w_1$ and $w_0$, which yields:
$$w_1=\frac{\sum{(x_i-\bar x)(y_i-\bar y)}}{\sum{(x_i-\bar x)^2}}, \qquad w_0=\bar y-w_1 \bar x$$
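For completeness, here is a sketch of the standard derivation behind these formulas (this step is not spelled out in the original): setting both partial derivatives of $H$ to zero gives

```latex
\frac{\partial H}{\partial w_0} = -2\sum_i (y_i - w_1 x_i - w_0) = 0
\;\Longrightarrow\; w_0 = \bar{y} - w_1 \bar{x}

\frac{\partial H}{\partial w_1} = -2\sum_i x_i (y_i - w_1 x_i - w_0) = 0
\;\Longrightarrow\; w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
```

Substituting the first result into the second equation is what produces the centered form of $w_1$.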
Implementing Linear Regression in Python
Computing the mean and variance
Mean function:
# Compute the mean
def mean(values):
    return sum(values) / float(len(values))
Variance function:
# Compute the variance (here: the sum of squared deviations from the mean)
def variance(values, mean):
    return sum([(x - mean) ** 2 for x in values])
Compute the mean and variance:
# Compute the mean and variance
dataset = [[1.2, 1.1], [2.4, 3.5], [4.1, 3.2], [3.4, 2.8], [5, 5.4]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]
mean_x, mean_y = mean(x), mean(y)
var_x, var_y = variance(x, mean_x), variance(y, mean_y)
print('x statistics: mean = %.3f variance = %.3f' % (mean_x, var_x))
print('y statistics: mean = %.3f variance = %.3f' % (mean_y, var_y))
Output:
x statistics: mean = 3.220 variance = 8.728
y statistics: mean = 3.200 variance = 9.500
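As a quick sanity check (not in the original post), these numbers can be cross-checked against Python's standard `statistics` module. Note that the `variance()` defined above returns the *sum* of squared deviations, so it equals `statistics.pvariance` multiplied by the number of samples:

```python
import statistics

x = [1.2, 2.4, 4.1, 3.4, 5]
y = [1.1, 3.5, 3.2, 2.8, 5.4]

print(round(statistics.mean(x), 3))                # 3.22
print(round(statistics.pvariance(x) * len(x), 3))  # 8.728
print(round(statistics.mean(y), 3))                # 3.2
print(round(statistics.pvariance(y) * len(y), 3))  # 9.5
```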
To obtain $w_1$ we also need to compute the covariance:
# Compute the covariance
def covariance(x, x_mean, y, y_mean):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - x_mean) * (y[i] - y_mean)
    return covar

covar = covariance(x, mean_x, y, mean_y)
print('Covariance = %.3f' % covar)
Output:
Covariance = 7.840
Compute the regression coefficients:
# Compute the regression coefficients
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean(x), mean(y)
    w1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    w0 = y_mean - w1 * x_mean
    return w0, w1
# Obtain the regression coefficients
w0, w1 = coefficients(dataset)
print('Regression coefficients: w0 = %.3f, w1 = %.3f' % (w0, w1))
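Plugging the statistics printed earlier into the closed-form formulas gives a quick numeric check of what `coefficients` should return:

```python
# Closed-form solution using the previously printed statistics
covar, var_x = 7.840, 8.728
mean_x, mean_y = 3.220, 3.200
w1 = covar / var_x
w0 = mean_y - w1 * mean_x
print('w0 = %.3f, w1 = %.3f' % (w0, w1))  # w0 = 0.308, w1 = 0.898
```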
Prediction:
# Make predictions on the test set
def simple_linear_regression(train, test):
    predict = list()  # build an empty list
    w0, w1 = coefficients(train)  # get the regression coefficients from the training set
    for row in test:  # read each x from the test set
        y_model = w1 * row[0] + w0  # predict y with the model
        predict.append(y_model)  # record each predicted y
    return predict
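To illustrate the shape of the prediction loop in isolation (using hypothetical coefficients, not ones fitted from the data above):

```python
# Hypothetical coefficients, for illustration only
w0, w1 = 0.5, 2.0
test = [[1.0], [2.0], [3.0]]  # rows whose first element is x
predict = [w1 * row[0] + w0 for row in test]
print(predict)  # [2.5, 4.5, 6.5]
```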
Next we need a function that measures how far the predictions deviate from the actual values:

$$RMSE = \sqrt{\frac{\sum{(y_{model} - y_{actual})^2}}{n}}$$
from math import sqrt

# Compute the root mean squared error (RMSE)
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)
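A tiny hand-checkable example of the RMSE metric (the function is repeated here so the snippet is self-contained; the input values are made up):

```python
from math import sqrt

def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += (predicted[i] - actual[i]) ** 2
    return sqrt(sum_error / float(len(actual)))

# Errors are 0, 0 and 2, so RMSE = sqrt(4/3) ≈ 1.155
print(round(rmse_metric([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 3))
```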
Finally, we also wrap the coordination work into a function:
# Prepare the data and coordinate the algorithm evaluation
def evaluate_algorithm(dataset, algorithm):
    test_set = list()
    for row in dataset:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(dataset, test_set)
    for i in range(len(dataset)):
        print('x : %.3f y : %.3f' % (dataset[i][0], predicted[i]))
    actual = [row[-1] for row in dataset]
    rmse = rmse_metric(actual, predicted)
    return rmse
Complete code:
from math import sqrt

# Compute the mean
def mean(values):
    return sum(values) / float(len(values))

# Compute the variance (sum of squared deviations from the mean)
def variance(values, mean):
    return sum([(x - mean) ** 2 for x in values])

# Compute the mean and variance
dataset = [[1.2, 1.1], [2.4, 3.5], [4.1, 3.2], [3.4, 2.8], [5, 5.4]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]
mean_x, mean_y = mean(x), mean(y)
var_x, var_y = variance(x, mean_x), variance(y, mean_y)
print('x statistics: mean = %.3f variance = %.3f' % (mean_x, var_x))
print('y statistics: mean = %.3f variance = %.3f' % (mean_y, var_y))

# Compute the covariance
def covariance(x, x_mean, y, y_mean):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - x_mean) * (y[i] - y_mean)
    return covar

covar = covariance(x, mean_x, y, mean_y)
print('Covariance = %.3f' % covar)

# Compute the regression coefficients
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean(x), mean(y)
    w1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Obtain the regression coefficients
w0, w1 = coefficients(dataset)
print('Regression coefficients: w0 = %.3f, w1 = %.3f' % (w0, w1))

# Make predictions on the test set
def simple_linear_regression(train, test):
    predict = list()  # build an empty list
    w0, w1 = coefficients(train)  # get the regression coefficients from the training set
    for row in test:  # read each x from the test set
        y_model = w1 * row[0] + w0  # predict y with the model
        predict.append(y_model)  # record each predicted y
    return predict

# Compute the root mean squared error (RMSE)
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

# Prepare the data and coordinate the algorithm evaluation
def evaluate_algorithm(dataset, algorithm):
    test_set = list()
    for row in dataset:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(dataset, test_set)
    for i in range(len(dataset)):
        print('x : %.3f y : %.3f' % (dataset[i][0], predicted[i]))
    actual = [row[-1] for row in dataset]
    rmse = rmse_metric(actual, predicted)
    return rmse

rmse = evaluate_algorithm(dataset, simple_linear_regression)
print('RMSE: %.3f' % (rmse))
Graphical representation:
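The original figure is not reproduced here. A minimal sketch of how the data and the fitted line could be plotted (assuming matplotlib is installed; the backend choice and output filename are arbitrary) might look like this:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Same dataset and closed-form fit as above
dataset = [[1.2, 1.1], [2.4, 3.5], [4.1, 3.2], [3.4, 2.8], [5, 5.4]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
w1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
w0 = mean_y - w1 * mean_x

plt.scatter(x, y, label='data')
plt.plot([min(x), max(x)], [w1 * min(x) + w0, w1 * max(x) + w0],
         color='red', label='fitted line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.savefig('regression.png')  # file name is arbitrary
```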