理论及推导
文中代码都为一元的
一、作用:通过最小化代价函数(成本函数)J(w,b)来训练w和b
→找出J(w,b)的min所对应的参数Wi和b
这里的参数可以是多个,也就是说可以有好多维
我用笔记来说明一下
这里的上标j指的是样本数,下标i指的是特征数
假设函数可以你对样本的值的预测值,损失函数则是比较预测值和真实值之间的差距。
下面解释一下特征数:
一个单特征假设函数:h(x) = θ0 + θ1x1
n个特征的话:h(x) = θ0 + θ1x1 + θ2x2 + …+θnxn
2、代数法算法过程:(有一个初始参数Ω)
(1)选择一个初始位置,计算当前位置的损失函数的梯度值,即:a/aθ J(θ0,θ1…θn)
(2)用步长(学习率)乘以当前的梯度,得到△
(3)确定是否所有的θi梯度下降的距离都小于Ω。(越接近于极小值,下降会越少)
下面引用一下一篇比较好的博客上关于梯度下降中单变量和多变量的解释:
(1)单变量:
(2)多变量
对单多变量有了基本的了解,便可以理解下面的函数了。
3、误差函数(用总样本的预测值和真实值的差值平方的平均数来表示):
4、矩阵推导
python和matlab都提供了很方便的矩阵运算方法。于是我们就可以把我们的函数进行向量化变形,利用矩阵的方法进行求解
上面就是梯度下降的矩阵求法
下面我们用代码来展示一下损失函数和假设函数:
def cal_cost(theta,X,y):
m = len(y)#y指的是真实值
predictions = X.dot(theta)#X和theta的矩阵内积,预测函数h(x)
cost = (1/2*m) * np.sum(square(predictions - y))
return cost
批量梯度下降、随机梯度下降、小批量梯度下降
以两个参数的为例
批量梯度函数:每一个样本都要用到
def gradient_descent(X,y,theta,learning_rate = 0.01,iterations = 100):
#,learning_rate是学习率,iteration是他的迭代次数
m = len(y)
cost_history = np.zeros(iterations)
theta_history = np.zeros((iterations,2))#theta是指@
for i in range(iterations):
prediction = np.dot(X,theta)#将X与theta进行矩阵乘
#
theta = theta - (1/m) * learning_rate * (X.T.dot(prediction - y))#更新公式
theta_history[i,:] = theta.T#
cost_history[i] = cal_cost(theta,X,y) # 代价函数更新
return theta,cost_history,theta_history #返回损失函数最小值时的参数theta,以及此时的损失函数的值
随机梯度函数(每一次迭代只用一个样本):
def stocashtic_gradient_descent(X,y,theta,learning_rate,iterations):
m = len(y)
cost_history = np.zeros(iterations)
cost = 0.0
for it in range(iterations):
cost = 0.0
for i in range(m):
rand_ind = np.random.randint(0,m)
X_i = X[rand_ind,:].reshape(1,X.shape[1])
y_i = y[rand_ind,:].reshape(1,1)
prediction = np.dot(X_i,theta)
theta -= (1/m) * learning_rate * (X_i.T.dot((prediction - y_i)))
cost += cal_cost(theta,X_i,y_i)
cost_history[it] = cost
return theta,cost_history
小批量梯度下降:
def minibatch_gradient_descent(X, y, theta, learning_rate=0.01, iterations=10):
m = len(y)
batch_size = 20
cost_history = np.zeros(iterations)
n_batches = int(m / batch_size)
for it in range(iterations):
cost = 0.0
indices = np.random.permutation(m)
X = X[indices]
y = y[indices]
for i in range(0, m, batch_size):
X_i = X[i: i + batch_size]
y_i = y[i: i + batch_size]
X_i = np.c_[np.ones(len(X_i)), X_i]
prediction = np.dot(X_i, theta)
theta -= (1 / m) * learning_rate * (X_i.T.dot((prediction - y_i)))
cost += cal_cost(theta, X_i, y_i)
cost_history[it] = cost
return theta, cost_history
错误
很奇怪的是,得出的theta值会很大(数据集很少的时候)
假设函数对应的迭代的时候的函数也出现错误
import numpy as np
import matplotlib.pyplot as plt
X = [[150, 200, 250, 300, 350, 400, 600]]
y = [[6450, 7450, 8450, 9450, 11450, 15450, 18450]]
X = np.array(X).T
y = np.array(y).T
X_b = np.c_[np.ones((7, 1)), X] # add x0 = 1 to each instan
def cal_cost(theta, X, y):
m = len(y)
predictions = X.dot(theta)
cost = (1 / 2 * m) * np.sum(np.square(predictions - y))
return cost
def gradient_descent(X, y, theta, learning_rate=0.01, iterations=10,epsilon=0.01):
m = len(y)
# learning_rate = 0.01
# iterations = 100
cost_history = np.zeros(iterations)
theta_history = np.zeros((iterations, 2))
for i in range(iterations):
prediction = np.dot(X, theta)
theta = theta - (1 / m) * learning_rate * (X.T.dot((prediction - y)))
theta_history[i, :] = theta.T
cost = cal_cost(theta, X, y)
cost_history[i] = cost
if(cost<epsilon):
break
return theta, cost_history, theta_history
# 从1000次迭代开始,学习率为0.01。从高斯分布的θ开始
lr =0.01
n_iter = 5
theta = np.zeros((2,1))##记得修改
epsilon = 0.01
X_b = np.c_[np.ones((len(X), 1)), X]
theta, cost_history, theta_history= gradient_descent(X_b, y, theta, lr, n_iter,epsilon)
Y_new = X_b.dot(theta)
print('Theta0: {:0.3f},\nTheta1: {:0.3f}'.format(theta[0][0],theta[1][0]))
print('Final cost/MSE: {:0.3f}'.format(cost_history[-1]))
# 绘制迭代的成本图
plt.xlabel('iterations')
plt.ylabel('J')
plt.plot(range(len(cost_history)),cost_history)
plt.show()
plt.axis([0,650,0,20000])
plt.plot(X,y,'b.',X,Y_new,'r-')
#plt.plot(X,Y_new,'r-')
plt.show()
得出的theta很大
当我们换了稍微大一点的数据集后
X = [[1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7, 3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6, 6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5]]
y = [[39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940, 91738,
98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872]](这里y乘了0.01,避免太大)
参数结果:
Theta0: 6481.917,
Theta1: 12315.527
接下来我们将三种梯度下降放到一起比较
当学习率、迭代次数以及初始theta值都一样时
得到的时间和theta值
lr =0.05
n_iter = 100
theta = np.zeros((2,1))##初始化参数,然后开始迭代
#theta = np.random.randn(2,1)
epsilon = 0.01
BGD_Time: 0.017
SGD_Time: 0.067
MSGD_Time: 0.010
BGD_Theta0: 17.551,
BGD_Theta1: 10.673
SGD_Theta0: 22.952,
SGD_Theta1: 9.736
MBGD_Theta0: 22.952,
MBGD_Theta1: 9.736
这里的sgd图很奇怪:
完整代码
import numpy as np
import matplotlib.pyplot as plt
import timeit
#X = [[150, 200, 250, 300, 350, 400, 600]]
#y = [[6450, 7450, 8450, 9450, 11450, 15450, 18450]]
X = [[1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7, 3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6, 6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5]]
y = [[39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940, 91738,
98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872]]
X = np.array(X).T
y = np.array(y).T * 0.001
#plt.plot(X,y,'b.')
X_b = np.c_[np.ones((len(X), 1)), X] # add x0 = 1 to each instan
#假设函数
def cal_cost(theta, X, y):
m = len(y)
predictions = X.dot(theta)
cost = (1 / 2 * m) * np.sum(np.square(predictions - y))
return cost
#BGD
def gradient_descent(X, y, theta, learning_rate=0.01, iterations=10,epsilon=0.01):
start_time = timeit.default_timer()
m = len(y)
cost_history = np.zeros(iterations)
theta_history = np.zeros((iterations, 2))
for i in range(iterations):
prediction = np.dot(X, theta)
theta = theta - (1 / m) * learning_rate * (X.T.dot((prediction - y)))
theta_history[i, :] = theta.T
cost = cal_cost(theta, X, y)
cost_history[i] = cost
# if(cost<epsilon):
# break
end_time = timeit.default_timer()
print('BGD_Time: {:0.3f}'.format(end_time - start_time))
return theta, cost_history, theta_history
#SGD
def stocashtic_gradient_descent(X,y,theta,learning_rate,iterations,epsilon = 0.01):
start_time = timeit.default_timer()
m = len(y)
cost_history = np.zeros(iterations)
theta_history = np.zeros((iterations,2))
cost = 0.0
for i in range(iterations):
cost = 0.0
for i in range(m):#循环的原因?
rand_ind = np.random.randint(0,m)
X_i = X[rand_ind,:].reshape(1,X.shape[1])
y_i = y[rand_ind,:].reshape(1,1)
prediction = np.dot(X_i,theta)
theta -= (1/m) * learning_rate * (X_i.T.dot((prediction - y_i)))
cost += cal_cost(theta,X_i,y_i)
theta_history[i, :] = theta.T
cost_history[i] = cost
if(cost<epsilon):
break
end_time = timeit.default_timer()
print('SGD_Time: {:0.3f}'.format(end_time - start_time))
return theta,cost_history,theta_history
#MBGD
def minibatch_gradient_descent(X, y, theta, learning_rate=0.01, iterations=10,epsilon=0.01):
start_time = timeit.default_timer()
m = len(y)
batch_size = 20
theta_history = np.zeros((iterations,2))
cost_history = np.zeros(iterations)
n_batches = int(m / batch_size)
for i in range(iterations):
cost = 0.0
indices = np.random.permutation(m)
X = X[indices]
y = y[indices]
for i in range(0, m, batch_size):
X_i = X[i: i + batch_size]
y_i = y[i: i + batch_size]
#X_i = np.c_[np.ones(len(X_i)), X_i]
prediction = np.dot(X_i, theta)
theta -= (1 / m) * learning_rate * (X_i.T.dot((prediction - y_i)))
cost += cal_cost(theta, X_i, y_i)
theta_history[i,:] = theta.T
cost_history[i] = cost
if(cost<epsilon):
break
end_time = timeit.default_timer()
print('MSGD_Time: {:0.3f}'.format(end_time - start_time))
return theta, cost_history,theta_history
# 从1000次迭代开始,学习率为0.01。从高斯分布的θ开始
lr =0.05
n_iter = 100
theta = np.zeros((2,1))##初始化参数,然后开始迭代
#theta = np.random.randn(2,1)
epsilon = 0.01
X_b = np.c_[np.ones((len(X), 1)), X]
theta_BGD, cost_history_BGD, theta_history_BGD = gradient_descent(X_b, y, theta, 0.05, n_iter,epsilon)
theta_SGD, cost_history_SGD, theta_history_SGD = stocashtic_gradient_descent(X_b,y,theta,0.5,n_iter,epsilon)
theta_MBGD, cost_history_MBGD, theta_history_MBGD = minibatch_gradient_descent(X_b, y, theta, 0.06, n_iter,epsilon)
Y_cost_BGD = X_b.dot(theta_BGD)
Y_cost_SGD = X_b.dot(theta_SGD)
Y_cost_MSGD = X_b.dot(theta_MBGD)
print('BGD_Theta0: {:0.3f},\nBGD_Theta1: {:0.3f}'.format(theta_BGD[0][0],theta_BGD[1][0]))
print('SGD_Theta0: {:0.3f},\nSGD_Theta1: {:0.3f}'.format(theta_SGD[0][0],theta_SGD[1][0]))
print('MBGD_Theta0: {:0.3f},\nMBGD_Theta1: {:0.3f}'.format(theta_MBGD[0][0],theta_MBGD[1][0]))
#print('Final cost/MSE: {:0.3f}'.format(cost_history[-1]))
# 绘制迭代的成本图
plt.xlabel('iterations')
plt.ylabel('J')
fig,ax = plt.subplots()
#ax.set(xlim=[0,100],ylim=[0,500])
ax.plot(range(len(cost_history_BGD)), cost_history_BGD, 'r.', label="BGD")
ax.plot(range(len(cost_history_BGD)), cost_history_SGD, 'y.', label="SGD")
ax.plot(range(len(cost_history_BGD)), cost_history_MBGD, 'b.', label="MGD")
ax.legend(loc="upper right", fontsize=14)
plt.show()
# 绘制迭代的成本图
fig1 = plt.figure()
#plt.set_title('迭代对比')
plt.xlabel('iterations')
plt.ylabel('cost')
xlabel = [i for i in range(50)]
plt.plot(range(len(cost_history_BGD)),cost_history_BGD,'b.',label = u"BGD")
plt.plot(range(len(cost_history_SGD)),cost_history_SGD,'r.',label = u"SGD")
plt.show()
#绘制theta0和theta1
fig2 = plt.figure()
ax1 = fig2.add_subplot(211)
ax2 = fig2.add_subplot(212)
ax1.plot(range(len(theta_history_BGD)), theta_history_BGD[:,0], 'r-', label="Theta 0")
ax1.plot(range(len(theta_history_BGD)), theta_history_BGD[:,1], 'b-', label="Theta 1")
ax1.set_title('BGD')
ax2.plot(range(len(theta_history_SGD)), theta_history_SGD[:,0], 'r-', label="Theta 0")
ax2.plot(range(len(theta_history_SGD)), theta_history_SGD[:,1], 'b-', label="Theta 0")
ax2.set_title('SGD')
plt.show()
多元
import numpy as np
import matplotlib.pyplot as plt
import timeit
X = np.array([[1,2],[2,1],[2,3],[3,5],[1,3],[4,2],[7,3],[4,5],[11,3],[8,7]])
Y = np.array([[7,8,10,14,8,13,20,16,28,26]])
Y = Y.T
#假设函数
def cal_cost(theta, X, y):
m = len(y)
predictions = X.dot(theta)
cost = (1 / 2 * m) * np.sum(np.square(predictions - y))
return cost
def gradient_descent(X, y, theta, learning_rate=0.01, iterations=10,epsilon=0.01):
start_time = timeit.default_timer()
m = len(y)
cost_history = np.zeros(iterations)
theta_history = np.zeros((iterations, 3))
for i in range(iterations):
prediction = np.dot(X, theta)
theta = theta - (1 / m) * learning_rate * (X.T.dot((prediction - y)))
theta_history[i, :] = theta.T
cost = cal_cost(theta, X, y)
cost_history[i] = cost
# if(cost<epsilon):
# break
end_time = timeit.default_timer()
print('BGD_Time: {:0.3f}'.format(end_time - start_time))
return theta, cost_history, theta_history
#SGD
def stocashtic_gradient_descent(X,y,theta,learning_rate,iterations):
start_time = timeit.default_timer()
m = len(y)
cost_history = np.zeros(iterations)
theta_history = np.zeros((iterations, 3))
cost = 0.0
for it in range(iterations):
cost = 0.0
for i in range(m):
rand_ind = np.random.randint(0,m)
X_i = X[rand_ind,:].reshape(1,X.shape[1])
y_i = y[rand_ind,:].reshape(1,1)
prediction = np.dot(X_i,theta)
theta -= (1/m) * learning_rate * (X_i.T.dot((prediction - y_i)))
cost += cal_cost(theta,X_i,y_i)
cost_history[it] = cost
theta_history[i,:] = theta.T
end_time = timeit.default_timer()
print('SGD_Time: {:0.3f}'.format(end_time - start_time))
return theta,cost_history,theta_history
# 从1000次迭代开始,学习率为0.01。从高斯分布的θ开始
lr =0.05
n_iter = 100
theta = np.zeros((3,1))##初始化参数,然后开始迭代
#theta = np.random.randn(2,1)
epsilon = 0.01
X_b = np.c_[np.ones((len(X), 1)), X]
theta_BGD, cost_history_BGD, theta_history_BGD = gradient_descent(X_b, Y, theta, 0.05, n_iter,epsilon)
theta_SGD, cost_history_SGD, theta_history_SGD = gradient_descent(X_b, Y, theta, 0.05, n_iter,epsilon)
Y_cost_BGD = X_b.dot(theta_BGD)
print('BGD_Theta0: {:0.3f},\nBGD_Theta1: {:0.3f},\nBGD_Theta2: {:0.3f}\n'.format(theta_BGD[0][0],theta_BGD[1][0],theta_BGD[2][0]))
Y_cost_SGD = X_b.dot(theta_SGD)
print('SGD_Theta0: {:0.3f},\nSGD_Theta1: {:0.3f},\nSGD_Theta2: {:0.3f}\n'.format(theta_SGD[0][0],theta_SGD[1][0],theta_SGD[2][0]))
#绘制theta0和theta1
fig2 = plt.figure()
ax1 = fig2.add_subplot(211)
ax2 = fig2.add_subplot(212)
ax1.plot(range(len(theta_history_BGD)), theta_history_BGD[:,0], 'r-', label="Theta 0")
ax1.plot(range(len(theta_history_BGD)), theta_history_BGD[:,1], 'b-', label="Theta 1")
ax1.plot(range(len(theta_history_BGD)), theta_history_BGD[:,2], 'y-', label="Theta 2")
ax1.set_title('BGD')
ax2.plot(range(len(theta_history_SGD)), theta_history_SGD[:,0], 'r-', label="Theta 0")
ax2.plot(range(len(theta_history_SGD)), theta_history_SGD[:,1], 'b-', label="Theta 1")
ax2.plot(range(len(theta_history_SGD)), theta_history_SGD[:,2], 'y-', label="Theta 2")
ax2.set_title('SGD')
plt.show()