Gradient Descent and Stochastic Gradient Descent, Implemented by Hand [Practical Case Study]
This assignment makes for a good hands-on case study, so I am turning it into a blog post as a set of study notes.
If you would like more information about the assignment, or about other aspects of the algorithms, feel free to message me or add me on WeChat: 1178623893
COMP9417 - Machine Learning
Homework 1: Gradient Descent & Friends
Introduction
In this homework, you will be required to manually implement (Stochastic) Gradient Descent
in Python to learn the parameters of a linear regression model.
# Import Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load Data
df = pd.read_csv(r'real_estate(1).csv')
df
| | transactiondate | age | nearestMRT | nConvenience | latitude | longitude | price |
|---|---|---|---|---|---|---|---|
| 0 | 2012.917 | 32.0 | 84.87882 | 10.0 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2012.917 | 19.5 | 306.59470 | 9.0 | 24.98034 | 121.53951 | 42.2 |
| 2 | 2013.583 | 13.3 | 561.98450 | 5.0 | 24.98746 | 121.54391 | 47.3 |
| 3 | 2013.500 | 13.3 | 561.98450 | 5.0 | 24.98746 | 121.54391 | 54.8 |
| 4 | 2012.833 | 5.0 | 390.56840 | 5.0 | 24.97937 | 121.54245 | 43.1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 409 | 2013.000 | 13.7 | 4082.01500 | 0.0 | 24.94155 | 121.50381 | 15.4 |
| 410 | 2012.667 | 5.6 | 90.45606 | 9.0 | 24.97433 | 121.54310 | 50.0 |
| 411 | 2013.250 | 18.8 | 390.96960 | 7.0 | 24.97923 | 121.53986 | 40.6 |
| 412 | 2013.000 | 8.1 | 104.81010 | 5.0 | 24.96674 | 121.54067 | 52.5 |
| 413 | 2013.500 | 6.5 | 90.45606 | 9.0 | 24.97433 | 121.54310 | 63.9 |

414 rows × 7 columns
### Question 1. (Pre-processing)
# Q1(a) Remove any rows of the data that contain a missing (‘NA’) value. List the indices of the removed
# data points. Then, delete all features from the dataset apart from: age, nearestMRT and nConvenience.
df1 = df.dropna(axis = 0,how='any') # df1: remove 'NA'
df2 = df1.reset_index(drop = True) # Set a new index for df1
df3 = df2[['price','age','nearestMRT','nConvenience']] # Delete all features from the dataset apart from: age, nearestMRT and nConvenience.
# Q1(b) feature normalisation
df_data = (df3 - df3.min()) / (df3.max() - df3.min())
print('----------------------the cleaned data----------------------')
print(df_data)
print('-----------------the mean value over your dataset.-------------------')
print(df_data.mean())
----------------------the cleaned data----------------------
price age nearestMRT nConvenience
0 0.275705 0.730594 0.009513 1.0
1 0.314832 0.445205 0.043809 0.9
2 0.361237 0.303653 0.083315 0.5
3 0.429481 0.303653 0.083315 0.5
4 0.323021 0.114155 0.056799 0.5
.. ... ... ... ...
403 0.070974 0.312785 0.627820 0.0
404 0.385805 0.127854 0.010375 0.9
405 0.300273 0.429224 0.056861 0.7
406 0.408553 0.184932 0.012596 0.5
407 0.512284 0.148402 0.010375 0.9
[408 rows x 4 columns]
-----------------the mean value over your dataset.-------------------
price 0.277240
age 0.406079
nearestMRT 0.162643
nConvenience 0.412010
dtype: float64
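As a quick sanity check (a minimal sketch of my own, reusing the df_data frame defined above), min-max scaling should map every column onto the interval [0, 1], with each column attaining both endpoints:
# after min-max scaling, the minimum of every column should be 0 and the maximum 1
print(df_data.min())   # expected: 0.0 for each of the four columns
print(df_data.max())   # expected: 1.0 for each of the four columns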
Question 2. (Train and Test sets)
# Question 2. (Train and Test sets)
# Split the cleaned data in half: the first 204 rows form the training set, the remaining 204 the test set.
train_df = df_data.iloc[:int(.5 * len(df_data)),:]
test_df = df_data.iloc[int(.5 * len(df_data)):,:]
print('----------------------the Train Set----------------------')
print(train_df)
print('----------------------the Test Set----------------------')
print(test_df)
print('---------------------------------------------------------')
print('Print out the first and last rows of both your training and test sets.')
print('---------------------------------------------------------')
print(train_df.iloc[0,:])    # first row of the training set
print(train_df.iloc[-1,:])   # last row of the training set
print(test_df.iloc[0,:])     # first row of the test set
print(test_df.iloc[-1,:])    # last row of the test set
----------------------the Train Set----------------------
price age nearestMRT nConvenience
0 0.275705 0.730594 0.009513 1.0
1 0.314832 0.445205 0.043809 0.9
2 0.361237 0.303653 0.083315 0.5
3 0.429481 0.303653 0.083315 0.5
4 0.323021 0.114155 0.056799 0.5
.. ... ... ... ...
199 0.350318 0.356164 0.041138 0.5
200 0.172884 0.410959 0.215241 0.1
201 0.125569 0.292237 0.220637 0.3
202 0.331210 0.506849 0.055096 1.0
203 0.242038 0.878995 0.099260 0.3
[204 rows x 4 columns]
----------------------the Test Set----------------------
price age nearestMRT nConvenience
204 0.169245 0.262557 0.206780 0.1
205 0.303003 0.794521 0.023551 0.8
206 0.405823 0.118721 0.056799 0.5
207 0.326661 0.000000 0.038770 0.1
208 0.213831 0.401826 0.275697 0.2
.. ... ... ... ...
403 0.070974 0.312785 0.627820 0.0
404 0.385805 0.127854 0.010375 0.9
405 0.300273 0.429224 0.056861 0.7
406 0.408553 0.184932 0.012596 0.5
407 0.512284 0.148402 0.010375 0.9
[204 rows x 4 columns]
---------------------------------------------------------
Print out the first and last rows of both your training and test sets.
---------------------------------------------------------
price 0.275705
age 0.730594
nearestMRT 0.009513
nConvenience 1.000000
Name: 0, dtype: float64
price 0.242038
age 0.878995
nearestMRT 0.099260
nConvenience 0.300000
Name: 203, dtype: float64
price 0.169245
age 0.262557
nearestMRT 0.206780
nConvenience 0.100000
Name: 204, dtype: float64
price 0.512284
age 0.148402
nearestMRT 0.010375
nConvenience 0.900000
Name: 407, dtype: float64
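A quick sanity check of my own (not required by the assignment) that the two halves are disjoint and together cover the whole cleaned dataset:
# the 50/50 split should be disjoint and cover all 408 rows
assert len(train_df) + len(test_df) == len(df_data)
assert set(train_df.index).isdisjoint(set(test_df.index))
print(train_df.shape, test_df.shape)   # expected: (204, 4) (204, 4)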
Question 3. (Loss Function)

Consider the loss function

$$\mathcal{L}_{c}(x, y)=\sqrt{\frac{1}{c^{2}}(x-y)^{2}+1}-1$$

where $c \in \mathbb{R}$ is a hyper-parameter. Consider the (simple) linear model

$$\hat{y}^{(i)}=w_{0}+w_{1} x_{1}^{(i)}+w_{2} x_{2}^{(i)}+w_{3} x_{3}^{(i)}, \qquad i=1, \ldots, n.$$
We can write this more succinctly by letting $w=\left(w_{0}, w_{1}, w_{2}, w_{3}\right)^{T}$ and $X^{(i)}=\left(1, x_{1}^{(i)}, x_{2}^{(i)}, x_{3}^{(i)}\right)^{T}$, so that $\hat{y}^{(i)}=w^{T} X^{(i)}$. The mean loss achieved by our model $w$ on a given dataset of $n$ observations is then
$$\mathcal{L}_{c}(y, \hat{y})=\frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{c}\left(y^{(i)}, \hat{y}^{(i)}\right)=\frac{1}{n} \sum_{i=1}^{n}\left[\sqrt{\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1}-1\right]$$
Compute the following derivatives:

$$\frac{\partial \mathcal{L}_{c}\left(y^{(i)}, \hat{y}^{(i)}\right)}{\partial w_{k}}, \quad k=0,1,2,3.$$
You must show your working for full marks.
Answer to Question 3.

$$
\begin{aligned}
\frac{\partial \mathcal{L}_{c}\left(y^{(i)}, \hat{y}^{(i)}\right)}{\partial w_{k}}
&= \frac{\partial}{\partial w_k}\left(\sqrt{\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1}-1\right) \\
&= \frac{1}{2}\left(\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1\right)^{-\frac{1}{2}}
   \frac{\partial}{\partial w_k}\left(\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1\right) \\
&= \frac{1}{2}\left(\frac{1}{c^{2}}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+1\right)^{-\frac{1}{2}}
   \frac{1}{c^{2}}\cdot 2\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)\cdot\left(-x_{k}^{(i)}\right) \\
&= -\frac{\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)x_{k}^{(i)}}{\sqrt{c^{2}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+c^{4}}}
 = \frac{\left(\left\langle w, X^{(i)}\right\rangle-y^{(i)}\right)x_{k}^{(i)}}{\sqrt{c^{2}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}+c^{4}}},
\end{aligned}
$$

where the third line uses the chain rule, $\frac{\partial}{\partial w_k}\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)^{2}=2\left(y^{(i)}-\left\langle w, X^{(i)}\right\rangle\right)\cdot\left(-x_{k}^{(i)}\right)$, since $x_{k}^{(i)}$ is the coefficient of $w_k$ in $\left\langle w, X^{(i)}\right\rangle$ (with $x_{0}^{(i)}=1$), and the last step multiplies numerator and denominator by $c^{2}$.
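To gain confidence in this expression, one can compare it against a central finite-difference approximation at a random point. The snippet below is a small self-contained check of my own (not part of the assignment); the value c = 2 and the random point are arbitrary choices for illustration.
import numpy as np

def loss_single(w, X, y, c):
    # loss L_c for a single observation (X includes the leading 1 for the bias)
    return np.sqrt(((y - w @ X) ** 2) / c ** 2 + 1) - 1

def grad_single(w, X, y, c):
    # analytic gradient of L_c w.r.t. w, as derived above (note the leading minus sign)
    r = y - w @ X
    return -r * X / np.sqrt(c ** 2 * r ** 2 + c ** 4)

rng = np.random.default_rng(0)
w, X = rng.normal(size=4), rng.normal(size=4)
y, c, eps = 0.7, 2.0, 1e-6
numeric = np.array([(loss_single(w + eps * e, X, y, c)
                     - loss_single(w - eps * e, X, y, c)) / (2 * eps) for e in np.eye(4)])
print(np.allclose(numeric, grad_single(w, X, y, c)))   # expected: True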
Question 4. (Gradient Descent Pseudocode)
The pseudocode for the gradient descent update:
for i in range(max_iterations):
    params_grad = evaluate_gradient(loss_function, training_data, params)
    params = params - learning_rate * params_grad
The pseudocode for the stochastic (mini-batch) gradient descent update:
for i in range(nb_epochs):                                    # loop over epochs
    np.random.shuffle(training_data)                          # reshuffle the data each epoch
    for batch in get_batches(training_data, batch_size=50):   # iterate over mini-batches
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
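The get_batches helper used in the pseudocode above is not defined in the assignment; a minimal sketch, assuming training_data is a NumPy array with one observation per row, could look like this:
def get_batches(training_data, batch_size=50):
    # yield consecutive mini-batches of rows from the (already shuffled) data
    for start in range(0, len(training_data), batch_size):
        yield training_data[start:start + batch_size]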
Question 5. (Gradient Descent Implementation)
# Question 5. (Gradient Descent Implementation)
# Q5. (a)
# -------------------- Gradient_Descent --------------------
def loss(w, x, y):
    # mean of the loss L_c over the dataset, with the hyper-parameter fixed inside the function
    c = 0.5
    temp = pow((y - x @ w), 2)
    num = np.sqrt(temp / c / c + 1) - 1
    return np.mean(num)

def Gradient_Descent(w, x, y, eta):
    # batch gradient descent: every update uses the gradient averaged over all training rows
    max_iter = 400
    Loss = []        # mean loss after each iteration
    w_all = w        # record of the weight vector at every iteration (one column per iteration)
    for _ in range(max_iter):
        # update each of the four weights with the Question 3 derivative, averaged over the n samples
        w0 = w[0] - eta * np.mean(0.5 * (x[:, 0].reshape([-1, 1])) * (x @ w - y) / (np.sqrt(pow((x @ w - y), 2) + 4)))
        w1 = w[1] - eta * np.mean(0.5 * (x[:, 1].reshape([-1, 1])) * (x @ w - y) / (np.sqrt(pow((x @ w - y), 2) + 4)))
        w2 = w[2] - eta * np.mean(0.5 * (x[:, 2].reshape([-1, 1])) * (x @ w - y) / (np.sqrt(pow((x @ w - y), 2) + 4)))
        w3 = w[3] - eta * np.mean(0.5 * (x[:, 3].reshape([-1, 1])) * (x @ w - y) / (np.sqrt(pow((x @ w - y), 2) + 4)))
        w = np.array([w0, w1, w2, w3]).reshape([-1, 1])
        w_all = np.hstack([w_all, w])
        Loss.append(loss(w, x, y))
    return Loss, w_all
# -------------------- Main Function --------------------
df_x = train_df[['age','nearestMRT','nConvenience']]
df_y = train_df['price']
x = np.hstack([np.ones([len(train_df), 1]), df_x.values])   # prepend a column of ones for the bias term
y = df_y.values.reshape([-1, 1])
ww = np.array([1, 1, 1, 1]).reshape([-1, 1])                # initial weight vector

# Run gradient descent for each candidate step size and collect the loss curves
alphas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
num = 0
for eta in alphas:
    Loss_temp, w_all = Gradient_Descent(ww, x, y, eta)
    Loss_temp = np.array(Loss_temp).reshape([-1, 1])
    if num == 0:
        Loss_all = Loss_temp
        num = 1
    else:
        Loss_all = np.hstack([Loss_all, Loss_temp])

# Plot: one subplot per step size, showing the loss at each of the 400 iterations
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(ax.flat):
    ax.plot(Loss_all[:, i])                   # Loss_all[:, i] is the loss curve for step size alphas[i]
    ax.set_title(f"step size: {alphas[i]}")   # plot titles
plt.tight_layout()                            # plot formatting
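As an aside, the four per-coordinate updates inside Gradient_Descent can be collapsed into a single vectorized step. The helper below is my own equivalent reformulation (same update rule and same constants as the loop above), not a required part of the solution:
def gradient_step(w, x, y, eta):
    # residuals: predictions minus targets, shape (n, 1)
    err = x @ w - y
    # per-sample gradient for all four weights at once, shape (n, 4)
    grad = 0.5 * x * err / np.sqrt(err ** 2 + 4)
    # average over the n samples and take one descent step
    return w - eta * np.mean(grad, axis=0, keepdims=True).T
With this helper, the four w0 ... w3 lines in the loop reduce to w = gradient_step(w, x, y, eta).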
Q5. (b)
From your results in the previous part, choose an appropriate step size (and state your choice), and
explain why you made this choice
Answer to Question 5.(b).
After testing each of the candidate step sizes, a step size of 1 appears to be an appropriate choice.
If the step size is too large, the iterates jump around too quickly and may overshoot the optimum entirely; if it is too small, the iterates move too slowly and the algorithm does not get close to the minimum within the iteration budget. With a step size of 1, the loss reaches its lowest value in the fewest iterations.
# Q5. (c)
# Use eta = 0.3 and plot the progression of all four weights over the iterations
# -------------------- Main Function --------------------
df_x = df_data[['age','nearestMRT','nConvenience']]
df_y = df_data['price']
x = np.hstack([np.ones([len(df_data),1]),df_x.values])
y = df_y.values.reshape([-1,1])
ww = np.array([1,1,1,1]).reshape([-1,1])
eta = 0.3
Loss_temp,w_all = Gradient_Descent(ww,x,y,eta)
plt.plot(w_all.T)
plt.title('The progression of each of the four weights over the iterations.')
plt.ylabel('Weights')
plt.xlabel('Iterations')
plt.legend(['w_0','w_1','w_2','w_3'])
plt.show()
print('The final weight vector is :',w_all[:,-1]) # Print out the final weight vector.
# Finally, run your model on the train and test set, and print the achieved losses.
# Train Set
df_train_x = train_df[['age','nearestMRT','nConvenience']]
df_train_y = train_df['price']
train_x = np.hstack([np.ones([len(df_train_x),1]),df_train_x.values])
train_y = df_train_y.values.reshape([-1,1])
loss_train = loss(w_all[:,-1].reshape([-1,1]),train_x,train_y)
print('Loss for Train Set :',loss_train)
# Test Set
df_test_x = test_df[['age','nearestMRT','nConvenience']]
df_test_y = test_df['price']
test_x = np.hstack([np.ones([len(df_test_x),1]),df_test_x.values])
test_y = df_test_y.values.reshape([-1,1])
loss_test = loss(w_all[:,-1].reshape([-1,1]),test_x,test_y)
print('Loss for Test Set :',loss_test)
The final weight vector is : [ 0.04394919 -0.01285655 0.29836987 0.44517113]
Loss for Train Set : 0.030617652935386656
Loss for Test Set : 0.035002153888240746
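Once the final weight vector has been obtained, it can also be used to predict the (normalised) price of a new property, provided the new features are scaled with the same min-max statistics as above. A small usage sketch; the feature values below are made-up, purely illustrative numbers:
w_final = w_all[:, -1]                    # final weights [w0, w1, w2, w3] from the run above
x_new = np.array([1.0, 0.3, 0.05, 0.8])   # hypothetical point: bias term, then normalised age, nearestMRT, nConvenience
price_pred = w_final @ x_new              # predicted price, still on the [0, 1] normalised scale
print(price_pred)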
Question 6. (Stochastic Gradient Descent Implementation)
# Q6. (a)
# -------------------- Stochastic_Gradient_Descent --------------------
def Stochastic_Gradient_Descent(w, x, y, eta):
    def loss(w, x, y):
        # mean of the loss L_c over the full dataset (same as in Question 5)
        c = 0.5
        temp = pow((y - x @ w), 2)
        num = np.sqrt(temp / c / c + 1) - 1
        return np.mean(num)
    Loss_all = []
    data = np.hstack([y, x])     # keep the target and features together so they are shuffled consistently
    w_all = w                    # record of the weight vector after every single-sample update
    n_Epochs = 6
    for i in range(n_Epochs):
        rand_data = np.random.permutation(data)   # reshuffle the rows at the start of each epoch
        yy = rand_data[:, 0]
        xx = rand_data[:, 1:]
        for j in range(len(yy)):
            # update each weight using the gradient of the single observation j
            w0 = w[0] - eta * 0.5 * (xx[j, 0]) * (xx[j, :] @ w - yy[j]) / (np.sqrt(pow((xx[j, :] @ w - yy[j]), 2) + 4))
            w1 = w[1] - eta * 0.5 * (xx[j, 1]) * (xx[j, :] @ w - yy[j]) / (np.sqrt(pow((xx[j, :] @ w - yy[j]), 2) + 4))
            w2 = w[2] - eta * 0.5 * (xx[j, 2]) * (xx[j, :] @ w - yy[j]) / (np.sqrt(pow((xx[j, :] @ w - yy[j]), 2) + 4))
            w3 = w[3] - eta * 0.5 * (xx[j, 3]) * (xx[j, :] @ w - yy[j]) / (np.sqrt(pow((xx[j, :] @ w - yy[j]), 2) + 4))
            w = np.array([w0, w1, w2, w3]).reshape([-1, 1])
            w_all = np.hstack([w_all, w])
            Loss_all.append(loss(w, x, y))   # track the mean training loss after each update
    return w_all, Loss_all
# -------------------- Main Function --------------------
df_x = train_df[['age','nearestMRT','nConvenience']]
df_y = train_df['price']
x = np.hstack([np.ones([len(train_df),1]),df_x.values])
y = df_y.values.reshape([-1,1])
ww = np.array([1,1,1,1]).reshape([-1,1])
eta = 0.4
w_all,loss_all = Stochastic_Gradient_Descent(ww,x,y,eta)
alphas = [10,5,2, 1,0.5, 0.25,0.1, 0.05, 0.01]
num = 0
for eta in alphas:
w_temp, Loss_temp = Stochastic_Gradient_Descent(ww,x,y,eta)
Loss_temp1 = np.array(Loss_temp).reshape([-1,1])
if num == 0:
Loss_all = Loss_temp1
num = 1
else:
Loss_all = np.hstack([Loss_all,Loss_temp1])
# Plot: one subplot per step size, showing the training loss after each SGD update
fig, ax = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(ax.flat):
    ax.plot(Loss_all[:, i])                   # Loss_all[:, i] is the loss curve for step size alphas[i]
    ax.set_title(f"step size: {alphas[i]}")   # plot titles
plt.tight_layout()                            # plot formatting
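The implementation above performs a weight update for every single observation (batch size 1). For completeness, here is a hedged sketch of a mini-batch variant that keeps the same update rule and reuses the get_batches idea from Question 4; it is my own addition, not something the assignment asks for:
def minibatch_sgd(w, x, y, eta, batch_size=50, n_epochs=6):
    data = np.hstack([y, x])                           # keep targets and features together while shuffling
    for _ in range(n_epochs):
        rand_data = np.random.permutation(data)        # reshuffle the rows each epoch
        for start in range(0, len(rand_data), batch_size):
            batch = rand_data[start:start + batch_size]
            yy, xx = batch[:, :1], batch[:, 1:]        # targets (b, 1) and features (b, 4)
            err = xx @ w - yy                          # residuals for this mini-batch
            grad = 0.5 * xx * err / np.sqrt(err ** 2 + 4)
            w = w - eta * np.mean(grad, axis=0, keepdims=True).T
    return w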
Q6. (b)
From your results in the previous part, choose an appropriate step size (and state your choice), and
explain why you made this choice
Answer to Question 6.(b).
After testing each of the candidate step sizes, a step size of 0.5 appears to be an appropriate choice.
If the step size is too large, the single-sample updates jump around too much and can overshoot the optimum; if it is too small, the weights move too slowly and the loss barely decreases within the six epochs. With a step size of 0.5, the loss settles near its lowest value after the fewest updates.
# Q6. (c)
# Use eta = 0.4 and plot the progression of all four weights over the iterations
# -------------------- Main Function --------------------
df_x = train_df[['age','nearestMRT','nConvenience']]
df_y = train_df['price']
x = np.hstack([np.ones([len(train_df),1]),df_x.values])
y = df_y.values.reshape([-1,1])
ww = np.array([1,1,1,1]).reshape([-1,1])
eta = 0.4
w_all,Loss_temp = Stochastic_Gradient_Descent(ww,x,y,eta)
plt.plot(w_all.T)
plt.title('The progression of each of the four weights over the iterations.')
plt.ylabel('Weights')
plt.xlabel('Iterations')
plt.legend(['w_0','w_1','w_2','w_3'])
plt.show()
print('The final weight vector is :',w_all[:,-1]) # Print out the final weight vector.
# Finally, run your model on the train and test set, and print the achieved losses.
# Train Set
df_train_x = train_df[['age','nearestMRT','nConvenience']]
df_train_y = train_df['price']
train_x = np.hstack([np.ones([len(df_train_x),1]),df_train_x.values])
train_y = df_train_y.values.reshape([-1,1])
loss_train = loss(w_all[:,-1].reshape([-1,1]),train_x,train_y)
print('Loss for Train Set :',loss_train)
# Test Set
df_test_x = test_df[['age','nearestMRT','nConvenience']]
df_test_y = test_df['price']
test_x = np.hstack([np.ones([len(df_test_x),1]),df_test_x.values])
test_y = df_test_y.values.reshape([-1,1])
loss_test = loss(w_all[:,-1].reshape([-1,1]),test_x,test_y)
print('Loss for Test Set :',loss_test)
The final weight vector is : [ 0.27166775 -0.12645567 -0.17158891 0.20079968]
Loss for Train Set : 0.012240492047852287
Loss for Test Set : 0.01614156189305504
Question 7. Results Analysis
In a few lines, comment on your results in Questions 5 and 6.
Answer:
Both gradient descent and stochastic gradient descent performed well in these experiments: with a suitable step size, each reached a low loss within relatively few iterations. Stochastic gradient descent also finished with a noticeably lower training and test loss than gradient descent here (about 0.012 vs. 0.031 on the training set and 0.016 vs. 0.035 on the test set).
Explain the importance of the step-size in both GD and SGD.
Answer:
The step size determines how far each iteration moves along the negative gradient direction, so setting it properly matters a great deal in both GD and SGD. If the step size is too large, the iterates overshoot and may jump past, or even diverge away from, the optimum; if it is too small, progress is very slow and the algorithm may not get close to the minimum within a reasonable number of iterations.
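The effect is easy to see on a one-dimensional toy problem. The sketch below (my own illustration, unrelated to the housing data) runs gradient descent on f(w) = w**2 starting from w = 1: a large step size oscillates and diverges, a tiny one barely moves, and a moderate one converges quickly.
def gd_quadratic(eta, n_iter=20):
    w = 1.0
    for _ in range(n_iter):
        w = w - eta * 2 * w        # gradient of f(w) = w**2 is 2w
    return w

for eta in [1.1, 0.01, 0.4]:
    print(eta, gd_quadratic(eta))  # eta=1.1 blows up, eta=0.01 barely moves from 1, eta=0.4 ends close to 0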
Explain why one might have a preference for GD or SGD.
Answer:
Gradient descent and stochastic gradient descent are both important optimisation algorithms, each with its own advantages and disadvantages. Gradient descent averages the gradient over the entire dataset, so each step follows the true mean-gradient direction quite accurately, but a single update is expensive (here one GD iteration touches all 204 training rows) and, for non-convex problems, it still cannot guarantee reaching the global optimum. Stochastic gradient descent updates after every observation, so it generally converges much faster in terms of computation, but because each step is based on a single noisy gradient it does not decrease the loss as steadily as GD.
Explain why the GD paths look much smoother than the SGD paths.
Answer:
GD (batch gradient descent) combines the gradients of all of the data at every step, so consecutive updates point in very similar directions and the resulting paths look smooth. SGD instead bases each update on a single randomly chosen observation, so the gradient estimate is noisy and the paths fluctuate from one update to the next.