0 Preface
- Linear regression models
- The multiple-variable linear regression model in this post follows the same basic idea as the single-variable model in the previous post; only the number of features differs. It is the general form of linear regression, so it also covers the single-variable case, and vectors are introduced to simplify the computation.
1 Notation
Here is a summary of some of the notation you will encounter, updated for multiple features.
General Notation | Description | Python (if applicable) |
---|---|---|
$a$ | scalar, non bold | |
$\mathbf{a}$ | vector, bold | |
$\mathbf{A}$ | matrix, bold capital | |
Regression | | |
$\mathbf{X}$ | training example matrix | X_train |
$\mathbf{y}$ | training example targets | y_train |
$\mathbf{x}^{(i)}$, $y^{(i)}$ | $i^{th}$ training example | X[i], y[i] |
m | number of training examples | m |
n | number of features in each example | n |
$\mathbf{w}$ | parameter: weight | w |
$b$ | parameter: bias | b |
$f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ | the result of the model evaluation at $\mathbf{x}^{(i)}$ parameterized by $\mathbf{w},b$: $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$ | f_wb |
$\frac{\partial J(\mathbf{w},b)}{\partial w_j}$ | the gradient or partial derivative of cost with respect to a parameter $w_j$ | dj_dw[j] |
$\frac{\partial J(\mathbf{w},b)}{\partial b}$ | the gradient or partial derivative of cost with respect to the parameter $b$ | dj_db |
2 Matrix X containing our examples
Similar to the table above, examples are stored in a NumPy matrix X_train. Each row of the matrix represents one example. With $m$ training examples ($m$ is three in our example) and $n$ features (four in our example), $\mathbf{X}$ is a matrix with dimensions ($m$, $n$) ($m$ rows, $n$ columns).
$$\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\ \cdots \\ x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} \end{pmatrix}$$
notation:
- $\mathbf{x}^{(i)}$ is a vector containing example i: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots, x^{(i)}_{n-1})$
- $x^{(i)}_j$ is element j in example i. The superscript in parentheses indicates the example number while the subscript represents an element.
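A minimal sketch of this indexing in NumPy (the array values here are placeholders, not the course data):

```python
import numpy as np

# Hypothetical (m, n) = (3, 4) matrix: 3 examples, 4 features each
X = np.arange(12).reshape(3, 4)

x_i = X[1]       # x^(1): the second training example, shape (4,)
x_ij = X[1, 2]   # x^(1)_2: feature 2 of example 1, a scalar

print(x_i, x_ij)  # [4 5 6 7] 6
```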
3 Parameter vector w, b
- $\mathbf{w}$ is a vector with $n$ elements.
- Each element contains the parameter associated with one feature.
- Notionally, we draw this as a column vector:

$$\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \cdots \\ w_{n-1} \end{pmatrix}$$

- $b$ is a scalar parameter.
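For concreteness, a sketch of how these parameters might be initialized for n = 4 features (zero initialization is just one common choice, and is what section 7 uses):

```python
import numpy as np

n = 4             # number of features
w = np.zeros(n)   # weight vector, one entry per feature, shape (n,)
b = 0.0           # bias, a scalar

print(w.shape, b)  # (4,) 0.0
```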
4 Model Prediction With Multiple Variables
The model’s prediction with multiple variables is given by the linear model:
$$f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1 + \dots + w_{n-1}x_{n-1} + b \tag{1}$$
or in vector notation:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{2}$$

where $\cdot$ is a vector dot product.
To demonstrate the dot product, we will implement prediction using (1) and (2).
4.1 Single Prediction element by element
Computed element by element with a loop:
```python
def predict_single_loop(x, w, b):
    """
    single predict using linear regression

    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar):  model parameter

    Returns:
      p (scalar):  prediction
    """
    n = x.shape[0]
    p = 0
    for i in range(n):
        p_i = x[i] * w[i]   # contribution of feature i
        p = p + p_i
    p = p + b
    return p
```
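A quick usage sketch (the example row and parameter values below are placeholders, not trained values):

```python
import numpy as np

x_vec = np.array([2104, 5, 1, 45])       # one example with 4 features
w_tmp = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical weights
b_tmp = 10.0                             # hypothetical bias

print(predict_single_loop(x_vec, w_tmp, b_tmp))  # ~239.7
```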
4.2 Single Prediction, vector
Computed with a vector dot product (np.dot), which is more efficient than the loop version in 4.1:
```python
import numpy as np

def predict(x, w, b):
    """
    single predict using linear regression

    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar):  model parameter

    Returns:
      p (scalar):  prediction
    """
    p = np.dot(x, w) + b
    return p
```
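The two versions should agree; a sketch using the same placeholder values as above:

```python
print(predict(x_vec, w_tmp, b_tmp))  # same result as the loop version, ~239.7
print(np.isclose(predict(x_vec, w_tmp, b_tmp),
                 predict_single_loop(x_vec, w_tmp, b_tmp)))  # True
```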
5 Compute Cost With Multiple Variables
The equation for the cost function with multiple variables $J(\mathbf{w},b)$ is:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 \tag{3}$$

where:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{4}$$
In contrast to previous labs, $\mathbf{w}$ and $\mathbf{x}^{(i)}$ are vectors rather than scalars, supporting multiple features.
Below is an implementation of equations (3) and (4). Note that this uses a standard pattern for this course, where a for loop over all m examples is used.
```python
def compute_cost(X, y, w, b):
    """
    compute cost

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter

    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b       # (n,)(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2   # scalar
    cost = cost / (2 * m)                  # scalar
    return cost
```
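A usage sketch with the housing data from section 7; the zero parameters below are hypothetical starting values, chosen only to exercise the function:

```python
X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])

w_tmp = np.zeros(4)   # hypothetical starting parameters
b_tmp = 0.0

print(f"Cost at w=0, b=0: {compute_cost(X_train, y_train, w_tmp, b_tmp):0.2f}")  # 49518.00
```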
6 Gradient Descent With Multiple Variables
Gradient descent for multiple variables:

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline\; & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{5} \; & \text{for j = 0..n-1}\newline &b\ \ = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \newline \rbrace \end{align*}$$

where n is the number of features, the parameters $w_j$, $b$ are updated simultaneously, and

$$\begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{6} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7} \end{align}$$

- m is the number of training examples in the data set
- $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target value
6.1 Compute Gradient with Multiple Variables
An implementation for calculating equations (6) and (7) is below. There are many ways to implement this. In this version, there is an
- outer loop over all m examples:
  - $\frac{\partial J(\mathbf{w},b)}{\partial b}$ for the example can be computed directly and accumulated
  - in a second loop over all n features:
    - $\frac{\partial J(\mathbf{w},b)}{\partial w_j}$ is computed for each $w_j$.
```python
def compute_gradient(X, y, w, b):
    """
    Computes the gradient for linear regression

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter

    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape   # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]   # prediction error for example i
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_db, dj_dw
```
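Since there are many ways to implement this, here is a fully vectorized sketch of the same equations using matrix operations. This is an alternative for comparison, not the course's reference version:

```python
def compute_gradient_vectorized(X, y, w, b):
    """Same gradients as compute_gradient, but without explicit loops."""
    m = X.shape[0]
    err = X @ w + b - y     # (m,) vector of prediction errors
    dj_dw = X.T @ err / m   # (n,) gradient w.r.t. w, equation (6)
    dj_db = err.sum() / m   # scalar gradient w.r.t. b, equation (7)
    return dj_db, dj_dw
```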
6.2 Gradient Descent With Multiple Variables
The routine below implements equation (5) above.
```python
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """
    Performs batch gradient descent to learn w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters
      b_in (scalar)       : initial model parameter
      cost_function       : function to compute cost
      gradient_function   : function to compute the gradient
      alpha (float)       : Learning rate
      num_iters (int)     : number of iterations to run gradient descent

    Returns:
      w (ndarray (n,)) : Updated values of parameters
      b (scalar)       : Updated value of parameter
    """
    # An array to store cost J at each iteration, primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)  # avoid modifying global w within function
    b = b_in

    for i in range(num_iters):
        # Calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w, b)

        # Update parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        # Save cost J at each iteration
        if i < 100000:  # prevent resource exhaustion
            J_history.append(cost_function(X, y, w, b))

        # Print cost at intervals: 10 times, or every iteration if num_iters < 10
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f} ")

    return w, b, J_history  # return final w, b and J history for graphing
```
7 Example
7.1 Problem Statement
You will use the motivating example of housing price prediction. The training dataset contains three examples with four features (size, bedrooms, floors, and age) shown in the table below. Note that, unlike the earlier labs, size is in sqft rather than 1000 sqft. This causes an issue, which you will solve in the next lab!
Size (sqft) | Number of Bedrooms | Number of floors | Age of Home | Price (1000s dollars) |
---|---|---|---|---|
2104 | 5 | 1 | 45 | 460 |
1416 | 3 | 2 | 40 | 232 |
852 | 2 | 1 | 35 | 178 |
You will build a linear regression model using these values so you can then predict the price for other houses. For example, a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old.
7.2 Complete Example Code
```python
import copy, math
import numpy as np
import matplotlib.pyplot as plt
#plt.style.use('./deeplearning.mplstyle')
np.set_printoptions(precision=2)  # reduced display precision on numpy arrays

def compute_cost(X, y, w, b):
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b       # (n,)(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2   # scalar
    cost = cost / (2 * m)                  # scalar
    return cost

def compute_gradient(X, y, w, b):
    m, n = X.shape   # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_db, dj_dw

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    J_history = []
    w = copy.deepcopy(w_in)  # avoid modifying global w within function
    b = b_in
    for i in range(num_iters):
        dj_db, dj_dw = gradient_function(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        if i < 100000:  # prevent resource exhaustion
            J_history.append(cost_function(X, y, w, b))
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f} ")
    return w, b, J_history  # return final w, b and J history for graphing

# Training set from the problem statement above
X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])

# data is stored in numpy array/matrix
# inspect the training set and its shape
print(f"X Shape: {X_train.shape}, X Type:{type(X_train)})")
print(X_train)
print(f"y Shape: {y_train.shape}, y Type:{type(y_train)})")
print(y_train)

# b_init = 785.1811367994083
# w_init = np.array([ 0.39133535, 18.75376741, -53.36032453, -26.42131618])
# initialize parameters
# initial_w = np.zeros_like(w_init)
initial_w = np.zeros(4)
initial_b = 0.

# gradient descent settings: number of iterations and learning rate
iterations = 1000
alpha = 5.0e-7

# run gradient descent; the learned parameters are used for prediction
w_final, b_final, J_hist = gradient_descent(X_train, y_train, initial_w, initial_b,
                                            compute_cost, compute_gradient,
                                            alpha, iterations)
print(f"b,w found by gradient descent: {b_final:0.2f},{w_final} ")
m, _ = X_train.shape
for i in range(m):
    print(f"prediction: {np.dot(X_train[i], w_final) + b_final:0.2f}, target value: {y_train[i]}")

# plot cost versus iteration to watch the cost decrease
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, figsize=(12, 4))
ax1.plot(J_hist)
ax2.plot(100 + np.arange(len(J_hist[100:])), J_hist[100:])
ax1.set_title("Cost vs. iteration");  ax2.set_title("Cost vs. iteration (tail)")
ax1.set_ylabel('Cost');               ax2.set_ylabel('Cost')
ax1.set_xlabel('iteration step');     ax2.set_xlabel('iteration step')
plt.show()
```
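Section 7.1 also asks about a new house (1200 sqft, 3 bedrooms, 1 floor, 40 years old). A sketch of predicting its price with the learned parameters; as noted in 7.1, the unscaled features cause an issue, so expect a rough estimate until the next lab addresses feature scaling:

```python
x_house = np.array([1200, 3, 1, 40])           # the house described in 7.1
price = np.dot(x_house, w_final) + b_final     # apply equation (2)
print(f"predicted price: ${price*1000:0.0f}")  # price is in 1000s of dollars
```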