1. Cost Function
1.1 Introduction
The equation for cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$$
where
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$$
- $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w, b$.
- $(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the target value and the prediction.
- These differences are summed over all $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.
- The goal is to minimize the cost function to obtain the desired parameters. The cost is only a scalar measure of fit; dividing by $2m$ rather than $m$ makes the derivative cleaner (the factor of 2 cancels) and does not change where the minimum lies.

Note: in lecture, summation ranges typically run from 1 to m, while in code they run from 0 to m-1.
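As a quick worked instance of equation (1), assume a hypothetical two-example dataset $x = (1, 2)$, $y = (300, 500)$ and the trial parameters $w = 100$, $b = 100$. These numbers are chosen only for illustration and are not taken from the text above.

$$J(100,100) = \frac{1}{2 \cdot 2}\left[(100 \cdot 1 + 100 - 300)^2 + (100 \cdot 2 + 100 - 500)^2\right] = \frac{(-100)^2 + (-200)^2}{4} = 12500$$

With $w = 200$, $b = 100$ the same data are fit exactly, so both errors are zero and $J(200,100) = 0$.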
1.2 Code Implementation
The code below calculates the cost by looping over each example. In each loop:
- $f_{w,b}$, a prediction, is calculated.
- the difference between the target and the prediction is calculated and squared.
- this is added to the total cost.
def compute_cost(x, y, w, b):
    """
    Computes the cost function for linear regression.

    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters

    Returns
      total_cost (float): The cost of using w,b as the parameters for linear regression
                          to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0]

    cost_sum = 0
    for i in range(m):
        f_wb = w * x[i] + b              # prediction for example i
        cost = (f_wb - y[i]) ** 2        # squared error for example i
        cost_sum = cost_sum + cost
    total_cost = (1 / (2 * m)) * cost_sum

    return total_cost
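The loop above can also be written without an explicit Python loop. The following is a minimal sketch, assuming `x` and `y` are 1-D NumPy arrays of equal length; the name `compute_cost_vectorized` and the toy data are hypothetical and not part of the original lab.

```python
import numpy as np

def compute_cost_vectorized(x, y, w, b):
    """Vectorized equivalent of the loop-based cost in equation (1)."""
    m = x.shape[0]
    errors = w * x + b - y              # prediction error for every example at once
    return float(np.sum(errors ** 2) / (2 * m))

# Hypothetical toy data, chosen only for illustration
x_demo = np.array([1.0, 2.0])
y_demo = np.array([300.0, 500.0])
cost_demo = compute_cost_vectorized(x_demo, y_demo, w=100.0, b=100.0)
print(cost_demo)
```

Both versions compute the same quantity; the vectorized form is usually faster on large arrays because the arithmetic runs inside NumPy rather than in the Python interpreter.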
2. Gradient Descent
2.1 Introduction
For a linear model that predicts $f_{w,b}(x^{(i)})$:
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1}$$
In linear regression, you use the input training data to fit the parameters $w$, $b$ by minimizing a measure of the error between the predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. This measure is called the cost, $J(w,b)$. In training, you measure the cost over all of the training samples $x^{(i)}, y^{(i)}$:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{2}$$
Gradient descent was described as:
$$\begin{aligned} &\text{repeat until convergence: } \lbrace \\ &\quad w = w - \alpha \frac{\partial J(w,b)}{\partial w} \\ &\quad b = b - \alpha \frac{\partial J(w,b)}{\partial b} \\ &\rbrace \end{aligned} \tag{3}$$
where the parameters $w$, $b$ are updated simultaneously.
The gradient is defined as (obtained by taking the partial derivatives of the cost function):
$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4}$$
$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}$$
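For readers who want the intermediate step, the partial derivatives follow from the chain rule applied to the cost in equation (2), using $\frac{\partial f_{w,b}(x^{(i)})}{\partial w} = x^{(i)}$ and $\frac{\partial f_{w,b}(x^{(i)})}{\partial b} = 1$. This short derivation is a sketch added for clarity:

$$\begin{aligned} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{2m} \sum\limits_{i = 0}^{m-1} 2\,(f_{w,b}(x^{(i)}) - y^{(i)})\,\frac{\partial f_{w,b}(x^{(i)})}{\partial w} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)} \\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{2m} \sum\limits_{i = 0}^{m-1} 2\,(f_{w,b}(x^{(i)}) - y^{(i)})\,\frac{\partial f_{w,b}(x^{(i)})}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \end{aligned}$$

The factor of 2 from differentiating the square cancels the $\frac{1}{2}$ in the cost, which is why the cost is defined with $2m$ in the denominator.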
Here, simultaneously means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
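The difference between a simultaneous and a sequential update can be made concrete with a small sketch. It assumes a single hypothetical training example $(x, y) = (2, 3)$, a learning rate of 0.1, and a helper named `grad`; none of these come from the original lab.

```python
def grad(w, b, x=2.0, y=3.0):
    """Gradient of the squared-error cost for one hypothetical example (x, y)."""
    err = w * x + b - y
    return err * x, err                 # (dj_dw, dj_db)

alpha = 0.1
w, b = 0.0, 0.0

# Simultaneous update: both derivatives are evaluated at the old (w, b)
dj_dw, dj_db = grad(w, b)
w_sim = w - alpha * dj_dw
b_sim = b - alpha * dj_db

# Sequential update (not what equation (3) describes):
# the derivative for b is evaluated after w has already changed
w_seq = w - alpha * grad(w, b)[0]
b_seq = b - alpha * grad(w_seq, b)[1]

print(w_sim, b_sim)
print(w_seq, b_seq)
```

The two schemes agree on `w` after one step but give different values for `b`, because the sequential version mixes the old `b` with the new `w` when computing the derivative.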
2.2 Code Implementation
You will implement the gradient descent algorithm for one feature. You will need three functions:
- compute_gradient, implementing equations (4) and (5) above
- compute_cost, implementing equation (2) above (code from section 1.2)
- gradient_descent, utilizing compute_gradient and compute_cost
Conventions:
- The naming of Python variables containing partial derivatives follows this pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
- w.r.t. stands for "with respect to", as in the partial derivative of $J(w,b)$ with respect to $b$.
compute_gradient (gradient computation)
compute_gradient implements (4) and (5) above and returns $\frac{\partial J(w,b)}{\partial w}$, $\frac{\partial J(w,b)}{\partial b}$. The embedded comments describe the operations.
def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression

    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters

    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """
    # Number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0

    for i in range(m):
        f_wb = w * x[i] + b                  # prediction for example i
        dj_dw_i = (f_wb - y[i]) * x[i]       # term i of the sum in equation (4)
        dj_db_i = f_wb - y[i]                # term i of the sum in equation (5)
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    # Average the accumulated sums over the m examples
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db
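One way to gain confidence in an analytic gradient is to compare it with a central finite-difference approximation of the cost. The sketch below is self-contained and uses compact vectorized stand-ins (`cost`, `gradient`) for equations (2), (4), and (5); the helper `numerical_gradient`, the step size `eps`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def cost(x, y, w, b):
    """Cost from equation (2), vectorized."""
    return np.sum((w * x + b - y) ** 2) / (2 * x.shape[0])

def gradient(x, y, w, b):
    """Analytic gradient from equations (4) and (5), vectorized."""
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

def numerical_gradient(x, y, w, b, eps=1e-6):
    """Central finite-difference approximation of (dJ/dw, dJ/db)."""
    dj_dw = (cost(x, y, w + eps, b) - cost(x, y, w - eps, b)) / (2 * eps)
    dj_db = (cost(x, y, w, b + eps) - cost(x, y, w, b - eps)) / (2 * eps)
    return dj_dw, dj_db

# Hypothetical toy data, chosen only for illustration
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 4.1, 5.9])

analytic = gradient(x_demo, y_demo, 0.5, 0.1)
numeric = numerical_gradient(x_demo, y_demo, 0.5, 0.1)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two pairs agree to several decimal places, the analytic formulas are very likely implemented correctly. A large mismatch usually points to a sign error or a missing factor.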
Gradient Descent
Now that gradients can be computed, gradient descent, described in equation (3) above, can be implemented in gradient_descent below. The details of the implementation are described in the comments. Below, you will use this function to find optimal values of $w$ and $b$ on the training data.
import copy
import math

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float)     : Learning rate
      num_iters (int)   : number of iterations to run gradient descent
      cost_function     : function to call to produce cost
      gradient_function : function to call to produce gradient

    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (list): History of cost values
      p_history (list): History of parameters [w,b]
    """
    # Copy the initial values so the caller's w_in, b_in are never modified
    w = copy.deepcopy(w_in)
    b = copy.deepcopy(b_in)
    # Lists to store cost J and parameters at each iteration, primarily for graphing later
    J_history = []
    p_history = []

    for i in range(num_iters):
        # Calculate the gradient using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # Update both parameters using equation (3) above
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

        # Save cost J and parameters at each iteration
        if i < 100000:      # prevent resource exhaustion
            J_history.append(cost_function(x, y, w, b))  # record the cost (append adds to the end of the list)
            p_history.append([w, b])                     # record the parameter values
        # Print the cost at 10 evenly spaced intervals, or every iteration if num_iters < 10
        # {i:4} pads to width 4; 0.3e is scientific notation with three decimal places
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")

    return w, b, J_history, p_history  # return final w, b and the histories for graphing
# x_train and y_train are assumed to be 1-D NumPy arrays of training data defined earlier
# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2  # learning rate alpha
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init,
                                                    tmp_alpha, iterations,
                                                    compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")
# 8.4f: minimum field width of 8 with 4 digits after the decimal point, left-padded with spaces
# 08.4f would left-pad with zeros instead; see Python's format specification for details
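Because the training data are not shown in this section, the following self-contained sketch runs the same procedure on a hypothetical dataset that lies exactly on the line $y = 200x + 100$. The function name `fit_line` and the data are assumptions for illustration; the update rule is equation (3) with the gradients from equations (4) and (5), written in vectorized form.

```python
import numpy as np

def fit_line(x, y, alpha=1.0e-2, num_iters=10000):
    """Compact gradient descent for f(x) = w*x + b, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        err = w * x + b - y                             # prediction errors at the current (w, b)
        dj_dw, dj_db = np.mean(err * x), np.mean(err)   # equations (4) and (5)
        w, b = w - alpha * dj_dw, b - alpha * dj_db     # simultaneous update, equation (3)
    return w, b

# Hypothetical training data lying exactly on y = 200x + 100
x_demo = np.array([1.0, 2.0])
y_demo = np.array([300.0, 500.0])

w_fit, b_fit = fit_line(x_demo, y_demo)
print(f"w = {w_fit:.4f}, b = {b_fit:.4f}")
```

With these settings the fitted values should land close to $w = 200$ and $b = 100$, though not exactly on them, since gradient descent with a fixed number of steps only approaches the minimum.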