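Lab tasks
(1) Describe the principle of linear regression and derive the key formulas
(2) Process the Salary data and document the experimental steps clearly
Salary_Data.csv
(3) Analyze the regression error, and plot the linear regression fit and the error curve (root mean squared error)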
Question 1
General form of a linear model:
$$f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
Vector form:
$$f(x) = w^{T} x + b$$
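As a small illustration of the vector form, the sketch below evaluates $f(x) = w^{T}x + b$ with NumPy. The function name `linear_model` and the toy values of `w`, `b`, and `x` are assumptions made for this example only; they do not come from the Salary data.

```python
import numpy as np

def linear_model(x, w, b):
    """Evaluate f(x) = w^T x + b for one sample or for a batch of samples (one per row)."""
    return np.asarray(x, dtype=float) @ np.asarray(w, dtype=float) + b

# Toy values, chosen only for illustration
w = np.array([2.0, -1.0, 0.5])
b = 1.0

x_single = np.array([1.0, 2.0, 4.0])
y_single = linear_model(x_single, w, b)   # 2*1 - 1*2 + 0.5*4 + 1 = 3.0

X_batch = np.array([[1.0, 2.0, 4.0],
                    [0.0, 0.0, 0.0]])
y_batch = linear_model(X_batch, w, b)     # one prediction per row: [3.0, 1.0]
```

Writing the model as a single matrix product means the same function handles one sample and a whole batch.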
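Advantages of linear models:
- Simple in form and easy to model
- Interpretable
- A foundation for nonlinear models: hierarchical structure or high-dimensional mappings can be introduced on top of a linear model
Goal of linear regression with a single attribute: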
$$f(x_i) = w x_i + b \quad \text{such that} \quad f(x_i) \simeq y_i$$
Parameter/model estimation: the least squares method
$$\begin{aligned} \left(w^{*}, b^{*}\right) &= \underset{(w, b)}{\arg\min} \sum_{i=1}^{m} \left(f\left(x_{i}\right) - y_{i}\right)^{2} \\ &= \underset{(w, b)}{\arg\min} \sum_{i=1}^{m} \left(y_{i} - w x_{i} - b\right)^{2} \end{aligned}$$
Minimize the squared-error objective (the mean squared error up to a constant factor of $1/m$):
$$E_{(w, b)} = \sum_{i=1}^{m} \left(y_{i} - w x_{i} - b\right)^{2}$$
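To make the objective concrete, here is a minimal sketch that evaluates $E_{(w, b)}$ for a given line. The helper name `squared_error` and the three toy points are assumptions for illustration; they are not part of the original experiment.

```python
import numpy as np

def squared_error(w, b, x, y):
    """Return E(w, b) = sum_i (y_i - w * x_i - b)^2."""
    residuals = np.asarray(y, dtype=float) - w * np.asarray(x, dtype=float) - b
    return float(np.sum(residuals ** 2))

# Toy data lying exactly on y = 2x + 1 (illustrative, not the Salary data)
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

e_exact = squared_error(2.0, 1.0, x, y)   # 0.0, the line passes through every point
e_zero = squared_error(0.0, 0.0, x, y)    # 3^2 + 5^2 + 7^2 = 83.0
```

Least squares looks for the pair $(w, b)$ that makes this value as small as possible.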
Differentiate with respect to $w$ and $b$, respectively:
$$\frac{\partial E_{(w, b)}}{\partial w} = 2\left(w \sum_{i=1}^{m} x_{i}^{2} - \sum_{i=1}^{m} \left(y_{i} - b\right) x_{i}\right)$$
$$\frac{\partial E_{(w, b)}}{\partial b} = 2\left(m b - \sum_{i=1}^{m} \left(y_{i} - w x_{i}\right)\right)$$
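One way to sanity-check the two partial derivatives is to compare them with central finite differences. The sketch below does this on toy data; the function names, the step size `h`, and the sample values are all assumptions for illustration.

```python
import numpy as np

def squared_error(w, b, x, y):
    """E(w, b) = sum_i (y_i - w * x_i - b)^2."""
    return float(np.sum((y - w * x - b) ** 2))

def analytic_gradient(w, b, x, y):
    """Partial derivatives of E with respect to w and b, as derived in the text."""
    m = len(x)
    grad_w = 2.0 * (w * np.sum(x ** 2) - np.sum((y - b) * x))
    grad_b = 2.0 * (m * b - np.sum(y - w * x))
    return float(grad_w), float(grad_b)

def numeric_gradient(w, b, x, y, h=1e-4):
    """Central finite-difference approximation of the same two derivatives."""
    grad_w = (squared_error(w + h, b, x, y) - squared_error(w - h, b, x, y)) / (2 * h)
    grad_b = (squared_error(w, b + h, x, y) - squared_error(w, b - h, x, y)) / (2 * h)
    return grad_w, grad_b

# Toy data (illustrative only)
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

g_analytic = analytic_gradient(0.5, -1.0, x, y)   # (-66.0, -30.0)
g_numeric = numeric_gradient(0.5, -1.0, x, y)     # agrees up to rounding error
```

Because $E$ is quadratic in $w$ and $b$, central differences match the analytic values almost exactly here.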
Setting both derivatives to zero yields the closed-form solution:
$$w = \frac{\sum_{i=1}^{m} y_{i}\left(x_{i} - \bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2} - \frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}}, \qquad b = \frac{1}{m} \sum_{i=1}^{m} \left(y_{i} - w x_{i}\right)$$
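The closed-form solution translates almost directly into NumPy. The following is a sketch under stated assumptions: `fit_closed_form` is a name made up for this example, the four data points are toy values, and `np.polyfit` with degree 1 is used only as an independent reference fit.

```python
import numpy as np

def fit_closed_form(x, y):
    """Least-squares slope w and intercept b from the closed-form expressions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(x)
    x_bar = x.mean()
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    b = np.mean(y - w * x)
    return float(w), float(b)

# Toy data (illustrative only, not the Salary data)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

w_hat, b_hat = fit_closed_form(x, y)

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
slope_ref, intercept_ref = np.polyfit(x, y, 1)
```

If the derivation is right, `w_hat` and `b_hat` should agree with `slope_ref` and `intercept_ref` up to floating-point error.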
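Question 2
- Load the data and initialize $w$ and $b$, both starting at 0
import numpy as np
import matplotlib.pyplot as plt
# skip_header=1 assumes the CSV starts with a header row; remove it if your copy has none
points = np.genfromtxt("Salary_Data.csv", delimiter=",", skip_header=1)
x_mean = 0
y_mean = 0
w = 0
b = 0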
- Compute the means of the horizontal coordinate $x$ and the vertical coordinate $y$, denoted $x\_mean$ and $y\_mean$
for i in range(0, len(points)):
    x_mean += points[i, 0]
    y_mean += points[i, 1]
x_mean /= len(points)
y_mean /= len(points)
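- Use the formulas derived above to compute the least-squares $w$ and $b$
for i in range(0, len(points)):
    # w accumulates the numerator and b the denominator of the slope formula
    w += (points[i, 0] - x_mean) * (points[i, 1] - y_mean)
    b += (points[i, 0] - x_mean) ** 2
W = w / b                # slope
B = y_mean - W * x_mean  # intercept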
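The code above uses the centered form of the slope rather than the expression written in the derivation. A short check, with $\bar{x}$ and $\bar{y}$ denoting the sample means of $x$ and $y$, suggests the two forms agree, since $\sum_{i=1}^{m}(x_i - \bar{x}) = 0$:

$$
\begin{aligned}
\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y}) &= \sum_{i=1}^{m} y_i (x_i - \bar{x}) - \bar{y} \sum_{i=1}^{m} (x_i - \bar{x}) = \sum_{i=1}^{m} y_i (x_i - \bar{x}) \\
\sum_{i=1}^{m} (x_i - \bar{x})^2 &= \sum_{i=1}^{m} x_i^2 - 2\bar{x} \sum_{i=1}^{m} x_i + m\bar{x}^2 = \sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2
\end{aligned}
$$

Dividing the first line by the second gives the same $w$ as the closed-form solution, and $b = \frac{1}{m}\sum_{i=1}^{m}(y_i - w x_i) = \bar{y} - w\bar{x}$, which is what `W` and `B` compute.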
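- Use the plt library (matplotlib.pyplot) to plot the regression result and analyze the size of the error
# compute_error(b, w, points) is the author's error helper; its definition is not given in this post
error = compute_error(B, W, points)
plt.scatter(i, error)  # i still holds the last loop index, so this draws a single point
print("Error: {0}".format(error))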
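The helper `compute_error` is called but never defined in the post. Since the task asks for the root mean squared error, one plausible version is sketched below. The argument order `(b, w, points)` is inferred from the call site, and the body is an assumption, not the author's original code.

```python
import numpy as np

def compute_error(b, w, points):
    """Root mean squared error of the line y = w * x + b.

    Hypothetical helper: column 0 of `points` is taken as x and column 1 as y.
    """
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    residuals = y - (w * x + b)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Toy points on y = 2x + 1 (illustrative only)
toy_points = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])

rmse_exact = compute_error(1.0, 2.0, toy_points)    # 0.0, perfect fit
rmse_shifted = compute_error(0.0, 2.0, toy_points)  # every residual is 1, so RMSE = 1.0
```

RMSE is expressed in the same units as $y$, which makes it easier to read than the raw sum of squares.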
Question 3
The linear regression plot is shown below:
The error scatter plot is shown below:
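The two figures could be reproduced with something like the following sketch. The function name `plot_fit_and_errors`, the two-panel layout, the generic axis labels, and the choice to show per-sample absolute residuals next to an RMSE reference line are all assumptions; the original figures may have been drawn differently.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_fit_and_errors(points, w, b):
    """Draw the fitted line over the data and the per-sample absolute errors.

    Hypothetical helper: column 0 of `points` is x, column 1 is y.
    Returns the figure and the RMSE of the fit.
    """
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    residuals = y - (w * x + b)
    rmse = float(np.sqrt(np.mean(residuals ** 2)))

    fig, (ax_fit, ax_err) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: data points and the fitted line
    order = np.argsort(x)
    ax_fit.scatter(x, y, label="data")
    ax_fit.plot(x[order], w * x[order] + b, color="red", label="fitted line")
    ax_fit.set_xlabel("x (first CSV column)")
    ax_fit.set_ylabel("y (second CSV column)")
    ax_fit.set_title("Linear regression")
    ax_fit.legend()

    # Right panel: absolute residual of each sample, with the RMSE as a reference
    ax_err.scatter(np.arange(len(x)), np.abs(residuals), label="absolute error")
    ax_err.axhline(rmse, color="red", linestyle="--", label="RMSE")
    ax_err.set_xlabel("sample index")
    ax_err.set_ylabel("error")
    ax_err.set_title("Error per sample")
    ax_err.legend()

    fig.tight_layout()
    return fig, rmse
```

With the variables from Question 2, calling `plot_fit_and_errors(points, W, B)` and then `plt.show()` would display both panels.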