Linear Regression Overview
Overview
Principle
$X_1$ is the age feature, $X_2$ is the performance feature, and $Y$ is the income:
| Age | Performance | Income |
|---|---|---|
| 28 | 700 | 9000 |
| 45 | 800 | 8000 |
| 33 | 400 | 6000 |
| 40 | 350 | 4000 |
| 20 | 403 | 6000 |
| 39 | 250 | 8000 |
$\theta_0$ is the bias parameter, $\theta_1$ is the age parameter, and $\theta_2$ is the performance parameter; the goal is to find a line (or, in higher dimensions, a hyperplane) that fits the data:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
Expressed in matrix form, with $x_0 = 1$ prepended so the bias is absorbed into the sum ($n$ is the number of features):

$$h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$
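As a quick numerical check, a minimal numpy sketch of this hypothesis; the parameter values below are made up for illustration:

```python
import numpy as np

# Hypothetical parameters: bias, age weight, performance weight.
theta = np.array([1000.0, 50.0, 5.0])

# One sample from the table (age 28, performance 700),
# with x0 = 1 prepended so the bias is absorbed into theta^T x.
x = np.array([1.0, 28.0, 700.0])

h = theta @ x  # theta^T x
print(h)       # 1000 + 50*28 + 5*700 = 5900.0
```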
The error between the predicted value and the true value is denoted by $\epsilon$:
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
The errors $\epsilon^{(i)}$ are independent and identically distributed, following a Gaussian distribution with mean $0$ and variance $\sigma^2$. Independent: employee A and employee B are unrelated and do not influence each other. Identically distributed: A and B both work at the same company. Since $\epsilon^{(i)}$ follows a Gaussian distribution:
$$p\left(\epsilon^{(i)}\right) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\left(\epsilon^{(i)}\right)^2}{2\sigma^2}\right)$$
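A quick sanity check of this density against the library Gaussian pdf; the value of $\sigma$ and the sample errors below are arbitrary:

```python
import numpy as np
from scipy.stats import norm

sigma = 1.5                       # arbitrary standard deviation
eps = np.array([-2.0, 0.0, 1.0])  # arbitrary sample error values

# Density computed from the formula above.
p = np.exp(-eps**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Should match the Gaussian pdf with mean 0 and scale sigma.
print(np.allclose(p, norm.pdf(eps, loc=0, scale=sigma)))  # True
```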
Substituting $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$ into the density above gives:
$$p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
We want the parameters that, combined with our data, make the observed true values most probable; this is captured by the likelihood function:
$$L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)$$
To simplify the algebra, convert to the log-likelihood, which turns the product into a sum:
$$\begin{aligned} \log L(\theta) &= \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right) \\ &= \sum_{i=1}^{m} \log\left(\frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right)\right) \\ &= m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 \end{aligned}$$
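A small numerical check that the log of the product and the decomposed form on the last line agree; all values below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 6, 1.5                  # arbitrary sample count and spread
resid = rng.normal(0.0, sigma, m)  # stand-ins for y^(i) - theta^T x^(i)

# Per-sample Gaussian densities from the formula above.
dens = np.exp(-resid**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

log_prod = np.log(np.prod(dens))                    # log of the product
decomposed = (m * np.log(1 / (np.sqrt(2 * np.pi) * sigma))
              - np.sum(resid**2) / (2 * sigma**2))  # last line above
print(np.allclose(log_prod, decomposed))  # True
```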
The first term is constant in $\theta$, so maximizing $\log L(\theta)$ is equivalent to minimizing the least-squares objective:
$$\begin{aligned} J(\theta) &= \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 \\ &= \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right)^2 \end{aligned}$$
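For concreteness, a sketch evaluating $J(\theta)$ on the table above; the parameter values are the same made-up ones as before:

```python
import numpy as np

# Data from the table: columns are x0 = 1, age, performance.
X = np.array([[1, 28, 700], [1, 45, 800], [1, 33, 400],
              [1, 40, 350], [1, 20, 403], [1, 39, 250]], dtype=float)
y = np.array([9000, 8000, 6000, 4000, 6000, 8000], dtype=float)

theta = np.array([1000.0, 50.0, 5.0])   # hypothetical parameters
J = 0.5 * np.sum((y - X @ theta) ** 2)  # least-squares objective
print(J)
```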
Batch gradient descent:
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_j} &= -\sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)} \\ \theta_j' &= \theta_j + \alpha \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)} \end{aligned}$$
Batch gradient descent reliably moves toward the optimum, but every update considers all $m$ samples, so it is slow.
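A minimal vectorized sketch of one batch update; `theta`, `X`, `y`, and `alpha` are placeholder numpy arrays and values, and `X` is assumed to include the $x_0 = 1$ column:

```python
def batch_update(theta, X, y, alpha):
    """One batch gradient descent step over all m samples."""
    residual = y - X @ theta  # y^(i) - h_theta(x^(i)) for every i
    grad = -X.T @ residual    # dJ/dtheta, summed over the whole batch
    return theta - alpha * grad  # i.e. theta + alpha * X.T @ residual
```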
Stochastic gradient descent:
$$\theta_j' = \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$
Stochastic gradient descent uses a single sample per update, so iterations are fast, but each individual step does not necessarily move toward convergence.
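The corresponding single-sample step, again as a hedged sketch over placeholder numpy inputs:

```python
def sgd_update(theta, x_i, y_i, alpha):
    """One stochastic gradient descent step on a single sample."""
    residual = y_i - x_i @ theta  # scalar error on this one sample
    return theta + alpha * residual * x_i
```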
Mini-batch gradient descent:
$$\theta_j' = \theta_j + \alpha \frac{1}{10} \sum_{k=i}^{i+9} \left(y^{(k)} - h_\theta(x^{(k)})\right) x_j^{(k)}$$
Each update uses only a small batch of data, which is the practical middle ground; a sketch follows.
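A mini-batch version of the same placeholder update, using a batch size of 10 as in the formula above:

```python
def minibatch_update(theta, X, y, alpha, i, batch=10):
    """One mini-batch step over samples i .. i+batch-1."""
    Xb, yb = X[i:i + batch], y[i:i + batch]
    residual = yb - Xb @ theta  # errors on this mini-batch only
    return theta + alpha * (Xb.T @ residual) / batch
```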
- Learning rate: prefer a relatively small learning rate.
- Batch size: 32, 64, or 128 are all common choices.
- Evaluation method: $R^2$; the closer $R^2$ is to $1$, the better the model fits:

$$R^2 = 1 - \frac{\sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{m} \left(y_i - \bar{y}\right)^2}$$
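A minimal $R^2$ computation in numpy (equivalently, sklearn.metrics.r2_score):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 means a perfect fit."""
    ss_res = np.sum((y_pred - y_true) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```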
Implementation code:
```python
import numpy as np

class linear(object):
    def __init__(self):
        self.W = None
        self.b = None

    def loss(self, X, y):
        num_train = X.shape[0]
        # 2.1 Predictions under the current weights and bias
        h = X.dot(self.W) + self.b
        # 2.2 Loss value (halved mean squared error)
        loss = 0.5 * np.sum(np.square(h - y)) / num_train
        # 2.3 Gradients of the loss w.r.t. W and b
        dW = X.T.dot(h - y) / num_train
        db = np.sum(h - y) / num_train
        return loss, dW, db

    def train(self, X, y, learn_rate=0.001, iters=10000):
        num_feature = X.shape[1]
        # 1. Initialize weight and bias parameters
        self.W = np.zeros((num_feature, 1))
        self.b = 0
        loss_list = []
        for i in range(iters):
            # 2. Compute loss and gradients
            loss, dW, db = self.loss(X, y)
            loss_list.append(loss)
            # 3. Update weights and bias along the negative gradient
            self.W += -learn_rate * dW
            self.b += -learn_rate * db
            if i % 500 == 0:
                print('iters = %d, loss = %f' % (i, loss))
        return loss_list

    def predict(self, X_test):
        y_pred = X_test.dot(self.W) + self.b
        return y_pred
```
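A usage sketch on the toy table above; the features are standardized first so that gradient descent converges at these scales, and the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

# Toy data from the table: (age, performance) -> income.
X = np.array([[28, 700], [45, 800], [33, 400],
              [40, 350], [20, 403], [39, 250]], dtype=float)
y = np.array([[9000], [8000], [6000], [4000], [6000], [8000]], dtype=float)

# Standardize features so the gradient steps are well conditioned.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

model = linear()
model.train(X_std, y, learn_rate=0.1, iters=2000)
print(model.predict(X_std))  # predictions on the training set
```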