Table of Contents
- Week one
- 1.1、Supervised learning-part-01(regression)
- 1.2、Supervised learning-part-02(classification)
- 1.3、Unsupervised learning-part-01
- 1.4、Unsupervised learning-part-02
- 1.5、Jupyter notebooks
- 2.1、Linear regression model part 01 & 02
- 2.2、Cost Function Formula
- 2.3、Visualizing The Cost Function
- 2.4、Visualization Examples
- 3.1、Gradient Descent
- 3.2、Gradient Descent for linear regression
Week one
1.1、Supervised learning-part-01(regression)
Learns from being given the "right answers."
Given inputs x and their outputs y, the machine learns to predict the correct output y for a new input x.
1.2、Supervised learning-part-02(classification)
Classification predicts categories; the key question is how to find the decision boundary between them.
1.3、Unsupervised learning-part-01
We do not give the algorithm the "right answers" in advance; it must find structure in the data on its own.
eg: Clustering algorithm
1.4、Unsupervised learning-part-02
Anomaly detection
Dimensionality reduction
1.5、Jupyter notebooks
2.1、Linear regression model part 01 & 02
- ŷ (y-hat) is the model's prediction, and y is the actual output or "target" variable.
- Linear regression with one variable is called univariate linear regression.
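As a minimal sketch of this model (the toy numbers below are illustrative, not from the course labs):

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear regression model: f_wb(x) = w * x + b."""
    return w * x + b

x_train = np.array([1.0, 2.0])   # input feature for two examples
w, b = 200, 100                  # example parameter values
y_hat = predict(x_train, w, b)   # y-hat: the model's predictions
print(y_hat)                     # [300. 500.]
```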
2.2、Cost Function Formula
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 \tag{1}$$
The extra division by 2 just makes the later derivative calculations neater: differentiating the square produces a factor of 2 that cancels it.
```python
def compute_cost(x, y, w, b):
    """Computes the cost J(w,b) over all m training examples."""
    m = x.shape[0]
    cost = 0
    for i in range(m):
        f_wb = w * x[i] + b              # model prediction f_wb(x^(i))
        cost = cost + (f_wb - y[i])**2   # squared error for example i
    total_cost = 1 / (2 * m) * cost      # average and halve, per equation (1)
    return total_cost
```
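A quick sanity check of `compute_cost` on a tiny made-up dataset; with w = 200 and b = 100 the line passes through both points exactly, so the cost should be zero:

```python
import numpy as np

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
print(compute_cost(x_train, y_train, 200, 100))  # 0.0: a perfect fit
```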
In this course we first work with a simplified version that fixes b = 0, so we only need to minimize $J(w)$ over $w$.
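One way to picture the simplified problem, sketched here with made-up data generated by y = 100x: fix b = 0, scan candidate values of w, and keep the one with the lowest cost:

```python
import numpy as np

x_train = np.array([1.0, 2.0])
y_train = np.array([100.0, 200.0])    # generated by y = 100 * x, so the best w is 100

w_grid = np.arange(0, 201, 25)        # candidate values of w
costs = [compute_cost(x_train, y_train, w, 0) for w in w_grid]
print(w_grid[int(np.argmin(costs))])  # 100: the w that minimizes J(w)
```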
2.3、Visualizing The Cost Function
With both w and b free, the cost function J(w,b) is plotted as a 3D bowl-shaped surface.
2.4、Visualization Examples
Different choices of w and b correspond to different points on the cost surface, and each point corresponds to a different fit line on the data.
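A minimal sketch of how such a picture could be produced with numpy and matplotlib, reusing `compute_cost` from above (the grid ranges and data are arbitrary choices of mine):

```python
import numpy as np
import matplotlib.pyplot as plt

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

ws = np.linspace(0, 400, 100)     # candidate w values
bs = np.linspace(-100, 300, 100)  # candidate b values
J = np.array([[compute_cost(x_train, y_train, w, b) for w in ws] for b in bs])

plt.contour(ws, bs, J, levels=30) # contour view of the 3D cost bowl
plt.xlabel("w")
plt.ylabel("b")
plt.show()
```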
3.1、Gradient Descent
Intuition: stand at a point on the cost surface, look around, and take a small step in the direction of steepest descent.
For the calculation:
$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b = b - \alpha \frac{\partial J(w,b)}{\partial b}$$
- $\alpha$ is the learning rate, which is a small positive number.
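A single hand-computed update step, with made-up values standing in for the partial derivatives:

```python
alpha = 0.01             # learning rate
w, b = 3.0, 1.0          # current parameters
dj_dw, dj_db = 2.0, 0.5  # pretend partial derivatives at (w, b)

# simultaneous update: both new values are computed from the old ones
w, b = w - alpha * dj_dw, b - alpha * dj_db
print(w, b)  # 2.98 0.995
```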
3.1.1 The relation between slope and w
When the point is to the right of the minimum, the slope (for example, 2/1) is positive, so $\frac{\partial J(w,b)}{\partial w}$ is positive. The update therefore decreases $w$; that is how $w$ moves toward the correct value.
3.1.2 The learning rate
If $\alpha$ is too small, gradient descent may be very slow.
If $\alpha$ is too big, gradient descent may overshoot and fail to converge, as the lecture's illustration shows.
Note also that updating $w$ never makes the point leave the cost curve: the update only changes the horizontal coordinate, and the vertical coordinate $J(w)$ is recomputed from the new $w$.
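Both failure modes can be seen on the toy cost J(w) = w², whose derivative is 2w (this toy function is my own example, not from the course):

```python
def run_gd(alpha, steps=10, w=10.0):
    """Run gradient descent on J(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(run_gd(alpha=0.001))  # ~9.8: too small, w has barely moved
print(run_gd(alpha=1.1))    # ~61.9 in magnitude: too big, w overshoots and diverges
print(run_gd(alpha=0.1))    # ~1.07: a reasonable rate, steadily approaching the minimum at 0
```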
3.2、Gradient Descent for linear regression
Depending on where you initialize the parameters w and b, you can end up at different local minima.
But it turns out that when you use the squared error cost function with linear regression, the cost function is convex (bowl-shaped), so it has a single global minimum and no other local minima.
- $\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)}$
- $\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})$
```python
def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression
    Args:
      x (ndarray (m,)): Data, m examples
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b
    """
    # Number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0
    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw_i = (f_wb - y[i]) * x[i]
        dj_db_i = f_wb - y[i]
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_dw, dj_db
```
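One way to build confidence in these formulas is to compare the analytic gradient against a numerical finite-difference estimate; this check is my own addition, not part of the lab:

```python
import numpy as np

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
w, b, eps = 150.0, 50.0, 1e-6

dj_dw, dj_db = compute_gradient(x_train, y_train, w, b)
num_dw = (compute_cost(x_train, y_train, w + eps, b)
          - compute_cost(x_train, y_train, w - eps, b)) / (2 * eps)
num_db = (compute_cost(x_train, y_train, w, b + eps)
          - compute_cost(x_train, y_train, w, b - eps)) / (2 * eps)
print(dj_dw, num_dw)  # the pairs should agree to several decimal places
print(dj_db, num_db)
```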
The overall implementation:
```python
import copy
import math

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha
    Args:
      x (ndarray (m,))  : Data, m examples
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float)     : Learning rate
      num_iters (int)   : number of iterations to run gradient descent
      cost_function     : function to call to produce cost
      gradient_function : function to call to produce gradient
    Returns:
      w (scalar)      : Updated value of parameter after running gradient descent
      b (scalar)      : Updated value of parameter after running gradient descent
      J_history (list): History of cost values
      p_history (list): History of parameters [w,b]
    """
    w = copy.deepcopy(w_in)  # avoid modifying global w_in
    # Lists to store cost J and parameters at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    b = b_in
    w = w_in
    for i in range(num_iters):
        # Calculate the gradient using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)
        # Update parameters using the update rule above
        b = b - alpha * dj_db
        w = w - alpha * dj_dw
        # Save cost J at each iteration
        if i < 100000:  # prevent resource exhaustion
            J_history.append(cost_function(x, y, w, b))
            p_history.append([w, b])
        # Print cost at 10 evenly spaced intervals (or every iteration if num_iters < 10)
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e} ",
                  f"w: {w: 0.3e}, b: {b: 0.5e}")
    return w, b, J_history, p_history  # return w, b and the histories for graphing
```
```python
# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init, tmp_alpha,
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")
```
Finally, to be more precise, this gradient descent process is called batch gradient descent. The term "batch" refers to the fact that on every step of gradient descent, we look at all of the training examples, instead of just a subset of the training data.
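For contrast with "all of the training examples," here is a sketch of what a subset-based step would look like; subset (mini-batch) methods are not part of this week's material, so this is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m = x_train.shape[0]
idx = rng.choice(m, size=max(1, m // 2), replace=False)  # a random subset of examples

# batch gradient descent: the gradient uses all m examples
dj_dw_full, dj_db_full = compute_gradient(x_train, y_train, w_final, b_final)
# a subset-based step would use only the sampled examples instead
dj_dw_sub, dj_db_sub = compute_gradient(x_train[idx], y_train[idx], w_final, b_final)
```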