Concepts of Overfitting and Underfitting
First, let's look at an example:
(Figure source: Zhou Zhihua, *Machine Learning*, Figure 2.1)
Informally, overfitting means the model takes too much into account, trying to cover every detail and therefore making errors; underfitting means the model takes too little into account, missing many data points and therefore making errors.
Next, consider this in the context of linear regression and logistic regression. For example, overfitting and underfitting in Linear Regression and Logistic Regression look like this:
(Figure: overfitting vs. underfitting in linear and logistic regression)
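To make this concrete, here is a minimal sketch (with made-up data, not taken from the course) that fits polynomials of three different degrees to the same noisy points; the low-degree fit underfits, while the high-degree fit overfits:

```python
# Made-up data: noisy samples of a sine curve.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.shape)

xs = np.linspace(0, 1, 200)
for degree in (1, 3, 10):
    coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
    plt.plot(xs, np.polyval(coeffs, xs), label=f"degree {degree}")

plt.scatter(x, y, color="k", label="data")
plt.ylim(-2, 2)
plt.legend()
plt.show()  # degree 1 underfits, degree 3 fits well, degree 10 overfits
```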
Ways to Address Overfitting
Method 1: Add more data to the training set
Adding more data to the training set helps the trained model generalize better.
Method 2: Optimize the feature set (feature selection)
Two examples to help understand feature selection:
1. Too many features: remove the unnecessary ones
Suppose you are estimating house prices and have many attributes, including: the area of the house, the number of bathrooms, the number of floors, the size of the garden, the age of the previous owner, how many dogs the previous owner had, how many children the previous owner had…
Some of these are necessary, while others are irrelevant.
| Attribute | Necessary? |
|---|---|
| Area of the house | Necessary |
| Number of bathrooms | Necessary |
| Number of floors | Necessary |
| Size of the garden | Necessary |
| Age of the previous owner | Unnecessary |
| Number of dogs the previous owner had | Unnecessary |
| Number of children the previous owner had | Unnecessary |
It turns out that irrelevant data can cause the model to overfit, because the dataset ends up with too many features.
Removing the irrelevant data and keeping only the necessary features that actually inform your prediction helps resolve the overfitting problem (see the sketch at the end of this subsection).
2. Too few features: add necessary ones
In estimating house prices, having only a single attribute, the area of the house, is not enough either; adding the necessary attributes improves the fitted model.
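As a rough sketch of what feature selection looks like in code (all column names here are hypothetical, invented purely for illustration):

```python
# Hypothetical feature selection with pandas; column names are invented.
import pandas as pd

df = pd.DataFrame({
    "area":            [120, 85, 150],
    "bathrooms":       [2, 1, 3],
    "floors":          [2, 1, 2],
    "garden_size":     [30, 0, 50],
    "prev_owner_age":  [54, 33, 70],   # judged irrelevant to price
    "prev_owner_dogs": [1, 0, 2],      # judged irrelevant to price
    "price":           [300000, 180000, 420000],
})

necessary = ["area", "bathrooms", "floors", "garden_size"]
X = df[necessary]        # keep only the necessary features
y = df["price"]          # prediction target
print(X.shape, y.shape)  # (3, 4) (3,)
```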
Method 3: Regularization
Regularization can be understood as shrinking the coefficients of the high-order terms. For example, in the figure below, the coefficient of $x^4$ is reduced from $174$ to $0.0001$, rather than being set exactly to $0$.
(Image source: Andrew Ng's Machine Learning course, Week 3; used here for learning purposes only.)
The following content is optional.
Now that we understand the principles behind these three methods, how do we actually implement them in a program? When there are only a few coefficients, we can indeed handle them by manual filtering. But when there are many coefficients, our approach is to add a regularization term to the cost function, so that all the coefficients are shrunk:
The original cost function, without regularization:
$$J_{(w,b)} = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$
With an L2 penalty on the weights $w_j$:
$$J_{(w,b)} = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$
Optionally, the bias $b$ can be penalized as well, though in practice this makes little difference:
$$J_{(w,b)} = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 + \frac{\lambda}{2m} b^2$$
(Diagram)
Also note that the regularization coefficient $\lambda$ must be neither too large nor too small. If $\lambda$ is close to $0$, the penalty has essentially no effect, as if there were no regularization at all. If $\lambda$ is close to infinity, then minimizing $J_{(w,b)}$ is dominated by the penalty term, the gap between predictions and labels is effectively ignored, and the coefficients are heavily shrunk, so the fit degenerates toward a line parallel to the $x$-axis, i.e. $y = b$.
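To see this effect numerically, here is a small sketch assuming scikit-learn is available; Ridge's `alpha` parameter plays the role of $\lambda$ here (the exact scaling conventions differ slightly):

```python
# How the fitted coefficients shrink as the regularization strength grows.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 20)

X_poly = PolynomialFeatures(degree=6, include_bias=False).fit_transform(x)
for lam in (1e-6, 1.0, 1e4):
    model = Ridge(alpha=lam).fit(X_poly, y)
    print(f"lambda={lam:g}  max|w|={np.abs(model.coef_).max():.4f}")
# Tiny lambda leaves the coefficients large (risk of overfitting);
# huge lambda crushes them toward 0 and the fit degenerates toward a flat line.
```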
Regularized Linear Regression
First, let's recap the cost function of linear regression and the related gradient-descent steps.
Recap of Cost Function of Linear Regression
$$J_{(w,b)} = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$
$$w_j = w_j - \alpha \frac{d}{dw_j} J_{(w,b)}$$
$$b = b - \alpha \frac{d}{db} J_{(w,b)}$$
$$\frac{d}{dw_j} J_{(w,b)} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
$$\frac{d}{db} J_{(w,b)} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
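As a sketch, the four formulas above can be collapsed into a single vectorized numpy step (an illustrative rewrite, not the course's own implementation):

```python
import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One unregularized gradient-descent step for linear regression.

    X: (m, n) feature matrix, y: (m,) targets, w: (n,) weights, b: scalar bias.
    """
    m = X.shape[0]
    err = X @ w + b - y          # f_wb(x^(i)) - y^(i), shape (m,)
    dj_dw = X.T @ err / m        # (1/m) sum of err_i * x_j^(i)
    dj_db = err.mean()           # (1/m) sum of err_i
    return w - alpha * dj_dw, b - alpha * dj_db
```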
Add the regularization term
$$J_{(w,b)} = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$
$$w_j = w_j - \alpha \frac{d}{dw_j} J_{(w,b)}$$
$$b = b - \alpha \frac{d}{db} J_{(w,b)}$$
$$\frac{d}{dw_j} J_{(w,b)} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} w_j$$
$$\frac{d}{db} J_{(w,b)} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
So what does regularization do for linear regression?
Let's simplify the update rule for the weights $w$ a bit further:
$$w_j = w_j - \alpha \frac{d}{dw_j} J_{(w,b)} = w_j - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m} w_j \right)$$
That is:
$$w_j = \left(w_j - \alpha \frac{\lambda}{m} w_j\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
$$w_j = \left(1 - \alpha \frac{\lambda}{m}\right) w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
Here, $\alpha$ is the learning rate, with values in $[0, 1]$ and typically $0.01$; $\lambda$ is the regularization coefficient, typically $1$ or $10$; and $m$ is the number of training examples, a constant. Suppose $\alpha = 0.01$, $\lambda = 1$, and $m = 50$.
Then:
$$w_j = \left(1 - 0.01 \cdot \frac{1}{50}\right) w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
$$w_j = 0.9998\, w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
Comparing this with the unregularized update, the only difference is that the leading $w_j$ term is multiplied by a factor slightly less than $1$, so on every gradient step the value of $w_j$ is shrunk a little (this is why L2 regularization is also called weight decay).
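A quick check of the decay factor with the concrete numbers above:

```python
# The shrink factor (1 - alpha * lambda / m) from the derivation above.
alpha, lam, m = 0.01, 1, 50
print(1 - alpha * lam / m)  # 0.9998: w_j is scaled by this factor on every step
```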
Regularized Logistic Regression
First, let's again recap the cost function of logistic regression and the corresponding gradient-descent steps.
Recap of Cost Function of Logistic Regression
$$J_{(w,b)} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\!\left(f_{w,b}(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - f_{w,b}(x^{(i)})\right) \right]$$
$$w_j = w_j - \alpha \frac{d}{dw_j} J_{(w,b)}$$
$$b = b - \alpha \frac{d}{db} J_{(w,b)}$$
$$\frac{d}{dw_j} J_{(w,b)} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
$$\frac{d}{db} J_{(w,b)} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
Add the regularization term
$$J_{(w,b)} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\!\left(f_{w,b}(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - f_{w,b}(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$
$$w_j = w_j - \alpha \frac{d}{dw_j} J_{(w,b)}$$
$$b = b - \alpha \frac{d}{db} J_{(w,b)}$$
$$\frac{d}{dw_j} J_{(w,b)} = \frac{1}{m} \sum_{i=1}^{m} \left[\left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}\right] + \frac{\lambda}{m} w_j$$
$$\frac{d}{db} J_{(w,b)} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
Python for Regularized Linear Regression
Cost function for regularized linear regression
code:
```python
import numpy as np

def compute_cost_linear_reg(X, y, w, b, lambda_=1):
    """Regularized cost for linear regression."""
    m = X.shape[0]
    n = len(w)
    cost = 0.
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b          # model prediction for example i
        cost = cost + (f_wb_i - y[i]) ** 2    # accumulate the squared error
    cost = cost / (2 * m)                     # scalar
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j] ** 2)               # L2 penalty on the weights (b is not regularized)
    reg_cost = (lambda_ / (2 * m)) * reg_cost
    total_cost = cost + reg_cost
    return total_cost
```
Explanation: the first loop accumulates the squared-error term $\frac{1}{2m}\sum_i (f_{w,b}(x^{(i)})-y^{(i)})^2$; the second loop accumulates the penalty $\frac{\lambda}{2m}\sum_j w_j^2$ on the weights only; the total cost is their sum.
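A hypothetical usage check with small random data (the values are arbitrary):

```python
import numpy as np

np.random.seed(1)
X_tmp = np.random.rand(5, 6)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1]) - 0.5
b_tmp = 0.5
cost = compute_cost_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_=0.7)
print(f"Regularized cost: {cost}")
```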
Gradient function for regularized linear regression
code:
```python
def compute_gradient_linear_reg(X, y, w, b, lambda_):
    """Regularized gradient for linear regression."""
    m, n = X.shape                              # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]      # prediction error for example i
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_ / m) * w[j]   # add the regularization term (weights only)
    return dj_db, dj_dw
```
Explanation: the gradient is computed exactly as in unregularized linear regression, then $\frac{\lambda}{m} w_j$ is added to each weight gradient; the bias gradient dj_db is left unchanged.
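Again, a hypothetical usage check with arbitrary data:

```python
import numpy as np

np.random.seed(1)
X_tmp = np.random.rand(5, 3)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
dj_db, dj_dw = compute_gradient_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_=0.7)
print(f"dj_db: {dj_db}")
print(f"dj_dw: {dj_dw.tolist()}")
```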
Python for Regularized Logistic Regression
Cost function for regularized logistic regression
code:
```python
def sigmoid(z):
    """Logistic function; assumed helper, defined here so the block runs standalone."""
    return 1 / (1 + np.exp(-z))

def compute_cost_logistic_reg(X, y, w, b, lambda_=1):
    """Regularized cost for logistic regression."""
    m, n = X.shape
    cost = 0.
    for i in range(m):
        z_i = np.dot(X[i], w) + b
        f_wb_i = sigmoid(z_i)
        cost += -y[i] * np.log(f_wb_i) - (1 - y[i]) * np.log(1 - f_wb_i)
    cost = cost / m
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j] ** 2)               # same L2 penalty as in linear regression
    reg_cost = (lambda_ / (2 * m)) * reg_cost
    total_cost = cost + reg_cost
    return total_cost
```
Explanation: the first loop accumulates the logistic (cross-entropy) loss; the regularization term is identical to the linear-regression case and again applies only to the weights.
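A hypothetical usage check (sigmoid is defined in the block above):

```python
import numpy as np

np.random.seed(1)
X_tmp = np.random.rand(5, 6)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1]) - 0.5
b_tmp = 0.5
cost = compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_=0.7)
print(f"Regularized cost: {cost}")
```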
Gradient function for regularized logistic regression
code:
```python
def compute_gradient_logistic_reg(X, y, w, b, lambda_):
    """Regularized gradient for logistic regression."""
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.0
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)   # sigmoid of the linear model
        err_i = f_wb_i - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i, j]
        dj_db = dj_db + err_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_ / m) * w[j]   # regularize the weights only
    return dj_db, dj_dw
```
Explanation: compared with the unregularized logistic gradient, the only change is the extra $\frac{\lambda}{m} w_j$ term added to dj_dw; dj_db is unchanged.
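Finally, a minimal sketch of how these pieces combine into a training loop; the gradient_descent helper and the data here are invented for illustration:

```python
import numpy as np

def gradient_descent(X, y, w, b, alpha, num_iters, lambda_):
    """Repeatedly apply the regularized logistic gradient defined above."""
    for _ in range(num_iters):
        dj_db, dj_dw = compute_gradient_logistic_reg(X, y, w, b, lambda_)
        w = w - alpha * dj_dw   # effectively scales w_j by (1 - alpha * lambda / m) each step
        b = b - alpha * dj_db   # b is not regularized
    return w, b

# Hypothetical call with made-up data:
np.random.seed(0)
X_train = np.random.rand(20, 3)
y_train = (X_train.sum(axis=1) > 1.5).astype(float)
w_fit, b_fit = gradient_descent(X_train, y_train, np.zeros(3), 0.0,
                                alpha=0.1, num_iters=1000, lambda_=1.0)
print(w_fit, b_fit)
```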