1. Introduction
Regularization is an effective way to combat overfitting: it constrains the weight attached to a term so that the term's influence on the model shrinks, while keeping the feature itself, so the model stays complete instead of dropping features outright.
2. Regularized Cost Function
2.1 Linear Regression
After regularization, the cost function becomes:
$$J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} \left(f_{\vec{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$
where $\lambda$ is a regularization hyperparameter that constrains the $w_j$. You could also regularize $b$, but in practice usually only the $w_j$ are regularized.
When $\lambda$ is large, the penalty pushes the weights strongly toward zero (risking underfitting); when $\lambda$ is small, the penalty has little effect and overfitting can remain.
The code for the regularized cost function above is:
```python
import numpy as np

def compute_cost_linear_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
    Returns:
      total_cost (scalar): cost
    """
    m = X.shape[0]
    n = len(w)
    cost = 0.
    # left term: squared error
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b              #(n,)(n,)=scalar, see np.dot
        cost = cost + (f_wb_i - y[i])**2          #scalar
    cost = cost / (2 * m)                         #scalar
    # right term: regularization penalty
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)                     #scalar
    reg_cost = (lambda_/(2*m)) * reg_cost         #scalar
    total_cost = cost + reg_cost                  #scalar
    return total_cost                             #scalar
```
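For larger datasets the two loops can be replaced by vectorized NumPy operations. Here is a minimal sketch computing the same quantity (the name `compute_cost_linear_reg_vec` is illustrative, not from the original):

```python
import numpy as np

def compute_cost_linear_reg_vec(X, y, w, b, lambda_=1):
    """Vectorized regularized linear-regression cost (illustrative sketch)."""
    m = X.shape[0]
    err = X @ w + b - y                              # (m,) residuals, all examples at once
    cost = np.sum(err ** 2) / (2 * m)                # squared-error term
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)  # regularization penalty
    return cost + reg_cost
```

Since `X @ w + b` produces all `m` predictions in one call, no Python-level loop is needed.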
2.2 Logistic Regression
After regularization, the cost function becomes:
$$J(\vec{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\vec{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\vec{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$
where $\lambda$ plays the same role as above. Note that logistic regression uses a different loss function from linear regression.
The code for the regularized cost function above is:
```python
import numpy as np

def sigmoid(z):
    """Logistic function, defined here for completeness."""
    return 1 / (1 + np.exp(-z))

def compute_cost_logistic_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
    Returns:
      total_cost (scalar): cost
    """
    m, n = X.shape
    cost = 0.
    # left term: cross-entropy loss
    for i in range(m):
        z_i = np.dot(X[i], w) + b                                  #(n,)(n,)=scalar, see np.dot
        f_wb_i = sigmoid(z_i)                                      #scalar
        cost += -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1-f_wb_i)   #scalar
    cost = cost/m                                                  #scalar
    # right term: regularization penalty
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)                                      #scalar
    reg_cost = (lambda_/(2*m)) * reg_cost                          #scalar
    total_cost = cost + reg_cost                                   #scalar
    return total_cost                                              #scalar
```
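As with linear regression, the loops can be vectorized. A minimal sketch computing the same quantity (the name `compute_cost_logistic_reg_vec` is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_logistic_reg_vec(X, y, w, b, lambda_=1):
    """Vectorized regularized logistic-regression cost (illustrative sketch)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                                       # (m,) predictions
    cost = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))     # cross-entropy term
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)              # regularization penalty
    return cost + reg_cost
```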
3. Gradient Descent
With the cost function in hand, we use gradient descent to find the optimal parameters. How does the update rule differ from the unregularized version? 🧐 The overall form is unchanged:
$$\begin{align*} &\text{repeat until convergence:} \; \lbrace \\ & \; \; \;w_j = w_j - \alpha \frac{\partial J(\vec{w},b)}{\partial w_j} \; & \text{for } j \in [0, n-1] \\ & \; \; \; \; \;b = b - \alpha \frac{\partial J(\vec{w},b)}{\partial b} \\ &\rbrace \end{align*}$$
What changes is the partial derivatives: differentiating the added regularization term produces a $2w_j$, which, after multiplying by the constant $\frac{\lambda}{2m}$ in front, simplifies to:
$$\begin{align*} \frac{\partial J(\vec{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{\vec{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)x_{j}^{(i)} + \frac{\lambda}{m} w_j \\ \frac{\partial J(\vec{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{\vec{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right) \end{align*}$$
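Before trusting a hand-derived gradient, it helps to compare it against a central finite difference of the cost. Here is a sketch for the regularized linear-regression case (the names `cost` and `grad` are illustrative):

```python
import numpy as np

def cost(w, b, X, y, lambda_):
    """Regularized linear-regression cost."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m) + (lambda_ / (2 * m)) * np.sum(w ** 2)

def grad(w, b, X, y, lambda_):
    """Analytic gradient, matching the derivatives above."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = X.T @ err / m + (lambda_ / m) * w   # includes the (lambda/m) * w_j term
    dj_db = np.mean(err)                        # b is not regularized
    return dj_dw, dj_db

# compare the analytic gradient against a central finite difference
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.3, -0.2]); b = 0.1; lambda_ = 0.7; eps = 1e-6
dj_dw, dj_db = grad(w, b, X, y, lambda_)
for j in range(len(w)):
    wp, wm = w.copy(), w.copy()
    wp[j] += eps; wm[j] -= eps
    num = (cost(wp, b, X, y, lambda_) - cost(wm, b, X, y, lambda_)) / (2 * eps)
    assert abs(num - dj_dw[j]) < 1e-6
```

If the assertion fails, either the derivation or the implementation has a bug.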
Note that linear regression and logistic regression share the same derivative formulas; the only difference is the function $f_{\vec{w},b}(\mathbf{x}^{(i)})$:
- For linear regression, $f_{\vec{w},b}(\mathbf{x}^{(i)}) = \vec{w} \cdot \vec{x}^{(i)} + b$. The corresponding gradient code is:
```python
import numpy as np

def compute_gradient_linear_reg(X, y, w, b, lambda_):
    """
    Computes the gradient for linear regression
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape              #(number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.
    for i in range(m):
        # model function f
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    # regularization term of the gradient
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]
    return dj_db, dj_dw
```
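Plugging such gradients into the update rule from Section 3 gives a complete training loop. A minimal sketch on toy data (the function `run_gradient_descent` and the data are illustrative; the gradient is vectorized for brevity):

```python
import numpy as np

def run_gradient_descent(X, y, alpha=0.1, lambda_=0.1, iters=1000):
    """Regularized linear regression via batch gradient descent (sketch)."""
    m, n = X.shape
    w = np.zeros(n); b = 0.0
    for _ in range(iters):
        err = X @ w + b - y
        dj_dw = X.T @ err / m + (lambda_ / m) * w   # regularized weight gradient
        dj_db = np.mean(err)                        # b is not regularized
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# fit y ≈ 2x on toy data; regularization shrinks w slightly below 2
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w, b = run_gradient_descent(X, y)
```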
- For logistic regression, $f_{\vec{w},b}(\mathbf{x}^{(i)}) = g(z) = g(\vec{w} \cdot \vec{x}^{(i)} + b) = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x}^{(i)} + b)}}$. The corresponding gradient code is:
```python
import numpy as np

def sigmoid(z):
    """Logistic function, defined here for completeness."""
    return 1 / (1 + np.exp(-z))

def compute_gradient_logistic_reg(X, y, w, b, lambda_):
    """
    Computes the gradient for logistic regression
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
    Returns:
      dj_dw (ndarray Shape (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar)            : The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape
    dj_dw = np.zeros((n,))                      #(n,)
    dj_db = 0.0                                 #scalar
    for i in range(m):
        # model function f
        f_wb_i = sigmoid(np.dot(X[i], w) + b)   #scalar
        err_i = f_wb_i - y[i]                   #scalar
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i, j]   #scalar
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m                             #(n,)
    dj_db = dj_db/m                             #scalar
    # regularization term of the gradient
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]
    return dj_db, dj_dw
```
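The same training loop works for logistic regression; only the model function changes. A minimal sketch on linearly separable toy data (the function `run_logistic_gd` and the data are illustrative; the gradient is vectorized for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_logistic_gd(X, y, alpha=0.5, lambda_=0.01, iters=2000):
    """Regularized logistic regression via batch gradient descent (sketch)."""
    m, n = X.shape
    w = np.zeros(n); b = 0.0
    for _ in range(iters):
        err = sigmoid(X @ w + b) - y                # same error form, different f
        dj_dw = X.T @ err / m + (lambda_ / m) * w   # regularized weight gradient
        dj_db = np.mean(err)
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# linearly separable toy data: label 1 when x > 2.5
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = run_logistic_gd(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
```

With regularization turned on, the weights stay finite even on separable data, where the unregularized optimum would drive them to infinity.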