Python3: Implementing a Linear Regression Model and Z-score Normalization
Preface: These are rough notes I took while following Andrew Ng's machine learning course. They cover the code implementation of linear regression and of Z-score normalization, along with the main linear regression formulas. Given my limited experience there may be mistakes; corrections and suggestions are very welcome!
Univariate Linear Regression: Gradient Descent Code
The three steps of univariate gradient descent
- Compute the gradient

Apply the update rule:

$$\begin{align*} \text{repeat}&\text{ until convergence: } \lbrace \\ w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{1} \\ b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \tag{2} \\ &\rbrace \end{align*}$$

where the partial derivatives are

$$\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)} \tag{3} \\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{4} \end{align}$$
- Compute the cost

Apply the cost formula:

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{5}$$
- Gradient descent

Repeat the two steps above until convergence.
Code implementation
import math, copy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients
# Cost function
def compute_cost(x, y, w, b):
m = x.shape[0]
cost = 0
for i in range(m):
f_wb = w * x[i] + b
cost = cost + (f_wb - y[i])**2
total_cost = 1 / (2 * m) * cost
return total_cost
# Compute the partial derivatives of the cost with respect to w and b
def compute_gradient(x, y, w, b):
"""
    Computes the gradient for linear regression.
    Args:
x (ndarray (m,)): Data, m examples
y (ndarray (m,)): target values
w,b (scalar) : model parameters
    Returns:
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
dj_db (scalar): The gradient of the cost w.r.t. the parameter b
"""
# Number of training examples
m = x.shape[0]
dj_dw = 0
dj_db = 0
for i in range(m):
f_wb = w * x[i] + b
dj_dw_i = (f_wb - y[i]) * x[i]
dj_db_i = f_wb - y[i]
dj_db += dj_db_i
dj_dw += dj_dw_i
dj_dw = dj_dw / m
dj_db = dj_db / m
return dj_dw, dj_db
# Gradient descent
def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
"""
Performs gradient descent to fit w,b. Updates w,b by taking
num_iters gradient steps with learning rate alpha
    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float)     : Learning rate
      num_iters (int)   : number of iterations to run gradient descent
      cost_function     : function to call to produce cost
      gradient_function : function to call to produce gradient
    Returns:
w (scalar): Updated value of parameter after running gradient descent
b (scalar): Updated value of parameter after running gradient descent
J_history (List): History of cost values
p_history (list): History of parameters [w,b]
"""
    w = copy.deepcopy(w_in)  # avoid modifying global w_in
    # arrays to store the cost J and the parameters [w,b] at each iteration, for graphing later
J_history = []
p_history = []
b = b_in
w = w_in
for i in range(num_iters):
        # Calculate the gradient using gradient_function
dj_dw, dj_db = gradient_function(x, y, w , b)
        # Update the parameters using equations (1) and (2) above
b = b - alpha * dj_db
w = w - alpha * dj_dw
        # Save cost J at each iteration
        if i < 100000:      # prevent resource exhaustion
J_history.append( cost_function(x, y, w , b))
p_history.append([w,b])
        # Print the cost at 10 evenly spaced intervals (or every iteration if num_iters < 10)
if i% math.ceil(num_iters/10) == 0:
print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e} ",
f"w: {w: 0.3e}, b:{b: 0.5e}")
return w, b, J_history, p_history #return w and J,w history for graphing
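A quick way to sanity-check the routine is to run it on a tiny, made-up data set. The values below (two data points, learning rate, iteration count) are illustrative only and rely on the imports and functions defined above, not on the original course lab:

# hypothetical toy data: two (feature, target) pairs
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

w_init, b_init = 0.0, 0.0      # start far from the optimum
tmp_alpha = 1.0e-2             # learning rate
iterations = 10000             # number of gradient-descent steps

w_final, b_final, J_hist, p_hist = gradient_descent(
    x_train, y_train, w_init, b_init, tmp_alpha, iterations,
    compute_cost, compute_gradient)
print(f"(w, b) found by gradient descent: ({w_final:8.4f}, {b_final:8.4f})")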
Notes
- In the gradient descent update, the derivatives must be computed first and w, b updated simultaneously.
- The initial w and b do not need to be near the optimum; they are deliberately chosen far from the best values so the descent can be observed.
- The learning rate α must be chosen carefully; if it is too large, the iteration diverges.
Multivariate Linear Regression: Gradient Descent Code
The data matrix
Training examples
Suppose there are m examples and n features. The data can be stored in an (m, n) matrix $\mathbf{X}$:

$$\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\ \cdots \\ x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} \end{pmatrix}$$
where:
- $\mathbf{x}^{(i)}$ is one example: $\mathbf{x}^{(i)} = (x^{(i)}_{0}, x^{(i)}_{1}, \cdots, x^{(i)}_{n-1})$
- $x^{(i)}_{j}$ is element j of example i. The superscript is the example index; the subscript is the feature index.
The parameter vector
$\mathbf{w}$ is a vector with n elements:
- each element of $\mathbf{w}$ is associated with one feature
- it is written as a column vector

$$\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \cdots \\ w_{n-1} \end{pmatrix}$$

- $b$ is a scalar parameter

The initial value supplied for the parameters is therefore also a vector $\mathbf{w}$.
Computing the linear regression prediction $f(\mathbf{x})$
Without vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}) = w_0 x_0 + w_1 x_1 + \cdots + w_{n-1} x_{n-1} + b \tag{1}$$
With vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{2}$$

Here $\cdot$ denotes the vector dot product.
Implementation
Loop version, corresponding to equation (1):
def predict_single_loop(x, w, b):
"""
single predict using linear regression 用线性回归预测一次
Args:
x (ndarray): Shape (n,) example with multiple features(一个样本的向量)
w (ndarray): Shape (n,) model parameters(权重的向量)
b (scalar): model parameter(模型标参)
Returns:
p (scalar): prediction(函数预测值)
"""
n = x.shape[0]
p = 0
for i in range(n):
p_i = x[i] * w[i]
p = p + p_i
p = p + b
return p
Dot-product version, corresponding to equation (2):
def predict(x, w, b):
"""
single predict using linear regression
Args:
x (ndarray): Shape (n,) example with multiple features
w (ndarray): Shape (n,) model parameters
b (scalar): model parameter
Returns:
p (scalar): prediction
"""
p = np.dot(x, w) + b
return p
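A small hedged check of both prediction routines on made-up numbers (the vectors below are illustrative and assume the NumPy import from the top of the post):

# hypothetical single example with 4 features, plus made-up parameters
x_vec = np.array([2104., 5., 1., 45.])
w_vec = np.array([0.39, 18.75, -53.36, -26.42])
b_scalar = 785.18

print(predict_single_loop(x_vec, w_vec, b_scalar))   # loop version, equation (1)
print(predict(x_vec, w_vec, b_scalar))               # np.dot version, equation (2); same value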
Computing the cost function $J(\mathbf{w},b)$
The cost $J(\mathbf{w},b)$ is:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 \tag{3}$$
where, in vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{4}$$
Implementation
def compute_cost(X, y, w, b):
"""
compute cost
Args:
X (ndarray (m,n)): Data, m examples with n features(所有输入的矩阵)
y (ndarray (m,)) : target values(目标变量)
w (ndarray (n,)) : model parameters(权重向量)
b (scalar) : model parameter(标参)
Returns:
cost (scalar): cost(代价函数)
"""
m = X.shape[0]
cost = 0.0
for i in range(m):
f_wb_i = np.dot(X[i], w) + b #(n,)(n,) = scalar (see np.dot)
cost = cost + (f_wb_i - y[i])**2 #scalar
cost = cost / (2 * m) #scalar
return cost
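As a sketch, the cost can be evaluated on a tiny made-up training set (the matrix below is illustrative only; it is reused in the later sketches):

# hypothetical training set: 3 examples, 4 features each
X_train = np.array([[2104., 5., 1., 45.],
                    [1416., 3., 2., 40.],
                    [ 852., 2., 1., 35.]])
y_train = np.array([460., 232., 178.])

w_init = np.zeros(X_train.shape[1])
b_init = 0.0
print(f"Cost at initial (w, b): {compute_cost(X_train, y_train, w_init, b_init):0.2f}")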
Computing the gradient
Parameter update rule:

$$\begin{align*} \text{repeat}&\text{ until convergence: } \lbrace \\ & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \quad \text{for } j = 0 \ldots n-1 \tag{5} \\ & b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\ & \rbrace \end{align*}$$
Derivative terms:

$$\begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})\,x^{(i)}_{j} \tag{6} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7} \end{align}$$

where m is the number of examples in the data set, $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the prediction, and $y^{(i)}$ is the target value.
def compute_gradient(X, y, w, b):
"""
Computes the gradient for linear regression
Args:
X (ndarray (m,n)): Data, m examples with n features(数据矩阵)
y (ndarray (m,)) : target values(目标值)
w (ndarray (n,)) : model parameters(参数向量)
b (scalar) : model parameter(标参)
Returns:
dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
dj_db (scalar): The gradient of the cost w.r.t. the parameter b.
"""
m,n = X.shape #(number of examples, number of features)
dj_dw = np.zeros((n,))
dj_db = 0.
for i in range(m):
err = (np.dot(X[i], w) + b) - y[i]
for j in range(n):
dj_dw[j] = dj_dw[j] + err * X[i, j]
dj_db = dj_db + err
dj_dw = dj_dw / m
dj_db = dj_db / m
return dj_db, dj_dw
Implementing gradient descent
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
"""
    Performs batch gradient descent to learn w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha
Args:
X (ndarray (m,n)) : Data, m examples with n features
y (ndarray (m,)) : target values
w_in (ndarray (n,)) : initial model parameters
b_in (scalar) : initial model parameter
cost_function : function to compute cost
gradient_function : function to compute the gradient
alpha (float) : Learning rate
num_iters (int) : number of iterations to run gradient descent
    Returns:
      w (ndarray (n,)) : Updated values of parameters
      b (scalar)       : Updated value of parameter
      J_history (list) : History of cost values
"""
# An array to store cost J and w's at each iteration primarily for graphing later
J_history = []
w = copy.deepcopy(w_in) #avoid modifying global w within function
b = b_in
for i in range(num_iters):
# Calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w, b)
# Update Parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
# Save cost J at each iteration
if i<100000: # prevent resource exhaustion
J_history.append( cost_function(X, y, w, b))
        # Print the cost at 10 evenly spaced intervals (or every iteration if num_iters < 10)
if i% math.ceil(num_iters / 10) == 0:
print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f} ")
return w, b, J_history #return final w,b and J history for graphing
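A short, hypothetical run of this routine on the made-up X_train and y_train from the cost sketch above (the learning rate and iteration count are illustrative):

initial_w = np.zeros(X_train.shape[1])
initial_b = 0.0
alpha = 5.0e-7        # small, because the features are not yet scaled
iterations = 1000

w_final, b_final, J_hist = gradient_descent(
    X_train, y_train, initial_w, initial_b,
    compute_cost, compute_gradient, alpha, iterations)
print(f"w found: {w_final}, b found: {b_final:0.2f}")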
Optimizing the regression model
Feature scaling
Goal: make gradient descent run faster.
Problem: different features have different value ranges. When the scales differ greatly, gradient descent zig-zags back and forth and converges slowly.
Fix: rescale the features so they share roughly the same order of magnitude, e.g. roughly within (-1, 1).
Mixing the three methods below can introduce some variation in model accuracy; which one to use depends on the algorithm.
If the model was trained on feature-scaled data, the features used for prediction must be scaled the same way.
Simple feature scaling
Divide each feature by the maximum value of its range:

$$x_{j,\text{scaled}} = \frac{x_j}{x_{max}}$$

- $x_j$ denotes the j-th feature of the feature vector.
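A minimal sketch of this rescaling with NumPy; the function name max_scale_features is my own, not from the course lab:

def max_scale_features(X):
    """Divide each column by its maximum so the features fall roughly into (0, 1]."""
    x_max = np.max(X, axis=0)    # per-feature maximum, shape (n,)
    return X / x_max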
Mean normalization
Rescales the values so that they are distributed around the origin (in all four quadrants).
Formula:

$$x_j := \frac{x_j - \mu_j}{x_{max} - x_{min}}$$

- $x_j$ denotes the j-th feature of the feature vector.
- $\mu_j$ is the mean of that feature.
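A corresponding sketch; again, mean_normalize_features is a made-up name used only for illustration:

def mean_normalize_features(X):
    """Center each column on its mean and divide by its range (max - min)."""
    mu = np.mean(X, axis=0)                           # per-feature mean, shape (n,)
    x_range = np.max(X, axis=0) - np.min(X, axis=0)   # per-feature range, shape (n,)
    return (X - mu) / x_range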
Z-score normalization
Also places the data around the origin.
After Z-score normalization, every feature has mean 0 and standard deviation 1.
Formula:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$

- $\sigma_j$: the standard deviation of the j-th feature. (The code implementation is given in the last section of this post.)
Checking whether gradient descent has converged
Plot a learning curve
- Plot the learning curve, i.e. $J(\mathbf{w},b)$ against the iteration number.
- $J(\mathbf{w},b)$ must decrease after every iteration.
- If it ever increases, check whether:
  - $\alpha$ is too large
  - there is a bug in the code
  - the "$-$" in $w = w - \alpha\,d$ was written as "$+$"

Automatic convergence test
Let $\epsilon$ be a small positive number, e.g. $10^{-3}$. If the decrease of $J(\mathbf{w},b)$ in one iteration is smaller than $\epsilon$, the algorithm can be considered converged.
Because a good $\epsilon$ is hard to pick, convergence is usually judged by inspecting the learning curve instead.
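As a sketch, the J_hist list returned by gradient_descent above can drive both checks; has_converged is a hypothetical helper and the epsilon value is only an example:

def has_converged(J_history, epsilon=1e-3):
    """Return True if the most recent decrease in cost is smaller than epsilon."""
    if len(J_history) < 2:
        return False
    return (J_history[-2] - J_history[-1]) < epsilon

# learning curve: cost J(w,b) against iteration number
plt.plot(J_hist)
plt.xlabel("iteration")
plt.ylabel("cost J(w,b)")
plt.show()
print("converged:", has_converged(J_hist))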
Choosing a learning rate
Ruling out problems
Method: run with a very small learning rate and watch the learning curve; this helps tell a bug apart from a learning-rate problem.
Tip: start around 0.001 and increase by roughly 3x each time (0.001, 0.003, 0.01, 0.03, 0.1, 0.3) to home in on the best value.
Result: choose the largest learning rate that does not cause divergence. (A small sweep is sketched below.)
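A minimal sketch of such a sweep, reusing the multivariate gradient_descent and the made-up X_train, y_train from earlier; the features are Z-score normalized first (my own choice) so that the textbook range of learning rates is usable:

mu = np.mean(X_train, axis=0)
sigma = np.std(X_train, axis=0)
X_norm = (X_train - mu) / sigma

# try a few candidate learning rates and compare the final cost of each run
for alpha_try in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    _, _, J_hist = gradient_descent(X_norm, y_train,
                                    np.zeros(X_norm.shape[1]), 0.0,
                                    compute_cost, compute_gradient,
                                    alpha_try, 100)
    print(f"alpha = {alpha_try:5.3f}  final cost = {J_hist[-1]:0.3e}")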
Feature engineering
Concept: use domain knowledge and the existing features to create new features, usually by transforming or combining them, so that the algorithm can make accurate predictions more easily.
Example: a piece of land has the features length ($x_1$) and width ($x_2$), giving the regression model $f(x) = w_1 x_1 + w_2 x_2 + b$. Length and width can be combined into an area feature ($x_3 = x_1 x_2$), and the extended model becomes $f(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$, which usually predicts the target more accurately.
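A minimal sketch of this kind of feature construction (the array values are made up):

# hypothetical land data: columns are length (x1) and width (x2)
X_land = np.array([[20., 30.],
                   [15., 40.],
                   [25., 25.]])
area = X_land[:, 0] * X_land[:, 1]     # engineered feature x3 = x1 * x2
X_eng = np.c_[X_land, area]            # columns are now x1, x2, x3
print(X_eng)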
Polynomial regression
Applying feature engineering to the inputs yields a polynomial regression model such as $f(x) = w_1 x + w_2 x^2 + w_3 x^3 + b$, i.e. a curve rather than a straight line. In essence, the best features are the ones that are linear with respect to the target: if the target really varies with the square of the input, then plotting the target against the squared feature gives a straight line. Polynomial regression can therefore be understood as ordinary linear regression in which each higher-order term is treated as its own feature.
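A sketch of fitting such a model with the multivariate routines above: build x, x², x³ as separate columns, normalize them, and run gradient descent (the data and hyperparameters are illustrative only):

# hypothetical target that is quadratic in x
x = np.arange(0, 20, 1, dtype=float)
y = 1 + x**2

# engineer polynomial features: x, x^2, x^3 as separate columns
X_poly = np.c_[x, x**2, x**3]
# Z-score normalize so gradient descent converges with a moderate learning rate
X_poly = (X_poly - np.mean(X_poly, axis=0)) / np.std(X_poly, axis=0)

w_fit, b_fit, _ = gradient_descent(X_poly, y, np.zeros(3), 0.0,
                                   compute_cost, compute_gradient, 1e-1, 1000)
print(f"w: {w_fit}, b: {b_fit:0.2f}")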
Z-score normalization (implementation)
Formula:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$
Code implementation:
# Z-score normalization
def zscore_normalize_features(X):
    """
    Computes X, z-score normalized by column.
    Args:
      X (ndarray): Shape (m,n) input data, m examples, n features
    Returns:
      X_norm (ndarray): Shape (m,n) input normalized by column
      mu (ndarray)    : Shape (n,) mean of each feature
      sigma (ndarray) : Shape (n,) standard deviation of each feature
    """
    # find the mean of each column/feature
    mu = np.mean(X, axis=0)      # mu will have shape (n,)
    # find the standard deviation of each column/feature
    sigma = np.std(X, axis=0)    # sigma will have shape (n,)
    # element-wise: subtract that column's mu from each example, divide by that column's std
    X_norm = (X - mu) / sigma
    return (X_norm, mu, sigma)
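A quick usage sketch, applied to the made-up X_train from earlier; note that any example we later predict on must be normalized with the same mu and sigma obtained from the training data:

X_norm, mu, sigma = zscore_normalize_features(X_train)
print("mu:", mu, "sigma:", sigma)

# a new (made-up) example must be scaled with the training-set mu and sigma
x_new = np.array([1200., 3., 1., 40.])
x_new_norm = (x_new - mu) / sigma
print(x_new_norm)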