Large scale machine learning - Stochastic gradient descent convergence

In this class, let's talk about:

  1. How to make sure that Stochastic gradient descent is converging well while it is running.
  2. How to tune the learning rate \alpha for the algorithm.
figure-1

As shown in figure-1, when we used batch gradient descent, the standard way to make sure that gradient descent was converging was to plot the optimization cost function (J_{train}\left ( \theta \right )) as a function of the number of iterations and check that it decreased on every iteration. When the training set was small, we could do that because the sum over all examples could be computed efficiently. But when you have a massive training set (e.g., 300 million examples), you don't want to pause the algorithm periodically to compute this cost function, because each evaluation requires a pass over the entire training set.

figure-2

So for Stochastic gradient descent, figure-2 shows what we can do instead to check that the algorithm is converging.

  1. Right before we train on a specific example, that is, before updating \theta using \left ( x^{(i)}, y^{(i)} \right ), compute the cost of that example, cost(\theta , (x^{(i)},y^{(i)})). The cost is computed before the update because, once \theta has been updated on an example, it will do better on that example than is representative.
  2. Then, to check for the convergence of Stochastic gradient descent, every 1000 iterations (say), plot the cost averaged over the last 1000 examples processed. This gives a running estimate of how well the algorithm is doing and lets us check whether it is converging.
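
The two steps above can be sketched as follows. This is only a minimal illustration, not the lecture's code: it assumes a one-feature linear regression hypothesis h(x) = theta0 + theta1 * x and a squared-error cost per example, and the names sgd_with_cost_monitor and make_data, the synthetic data, and the value alpha = 0.05 are all hypothetical choices made for the example.

```python
import random


def sgd_with_cost_monitor(examples, alpha=0.05, window=1000, seed=0):
    """One pass of SGD on the hypothesis h(x) = theta0 + theta1 * x.

    Returns (theta0, theta1) and a list holding one averaged cost per
    block of `window` processed examples.
    """
    data = list(examples)
    random.Random(seed).shuffle(data)  # process examples in random order
    theta0, theta1 = 0.0, 0.0
    averaged_costs = []
    cost_sum = 0.0
    for i, (x, y) in enumerate(data, start=1):
        error = theta0 + theta1 * x - y
        # Cost of this example, computed BEFORE theta is updated on it.
        # Assumed form: squared error, 0.5 * (h(x) - y)^2.
        cost_sum += 0.5 * error ** 2
        # SGD update using only this example.
        theta0 -= alpha * error
        theta1 -= alpha * error * x
        if i % window == 0:
            # One plot point per `window` examples.
            averaged_costs.append(cost_sum / window)
            cost_sum = 0.0
    return (theta0, theta1), averaged_costs


def make_data(n, seed=1):
    """Hypothetical synthetic data: y = 1 + 2x plus small Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 1.0 + 2.0 * x + rng.gauss(0.0, 0.1)))
    return data


theta, costs = sgd_with_cost_monitor(make_data(20000))
for block, cost in enumerate(costs, start=1):
    print(f"examples {block * 1000:>6}: average cost {cost:.4f}")
```

The list costs holds the values you would plot against the number of examples processed. Note that no extra pass over the training set is needed, since each cost is a by-product of the update step.
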
figure-3

Some examples are shown in figure-3. Suppose you have plotted the cost averaged over the last 1000 examples.

  • Because each point is averaged over just 1000 examples, the plot will be a little noisy, so the cost may not decrease on every single iteration. If you get a plot that looks like figure-3a, that is a pretty decent run of the algorithm: the cost has gone down and then plateaued from some point on, so the learning algorithm has probably converged. If you try a smaller learning rate \alpha, you might see something like the red line. The algorithm initially learns more slowly, so the cost goes down more slowly, but with a smaller learning rate it may eventually end up at a very slightly better solution. The reason is that Stochastic gradient descent doesn't actually converge to the global minimum; instead, the parameters oscillate around it. With a smaller learning rate, you end up with smaller oscillations.
  • If you instead average over 5000 examples, you might get a smoother curve, like the red line in figure-3b. That is the effect of increasing the number of examples you average over. The disadvantage of making this number too big is that you now get only one data point every 5000 examples, so the feedback on how well the algorithm is doing is delayed.
  • Sometimes you may end up with a plot like the blue line in figure-3c, where the cost does not seem to be decreasing at all and the algorithm appears not to be converging. But if you were to average over a larger number of examples (say 5000), you might see something like the red line, which shows that the cost is actually decreasing. The blue line was just too noisy for you to see the trend. Of course, it's also possible that the cost is still flat, like the magenta line, even when you average over more examples. In that case the algorithm is not learning much, for whatever reason, and you need to change the learning rate, the features, or something else about the algorithm.
  • Finally, if you see a curve that is increasing, as in figure-3d, this is a sign that the algorithm is diverging. The usual remedy is to use a smaller learning rate \alpha.
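
The trade-off in window size described above can be illustrated with a small sketch. It is not taken from the lecture: block_average is a hypothetical helper, and the per-example costs are synthetic (a flat level of 1.0 with heavy uniform noise) rather than the output of a real training run.

```python
import random
import statistics


def block_average(costs, window):
    """Average `costs` over consecutive blocks of `window` values.

    A trailing partial block is dropped.
    """
    n_blocks = len(costs) // window
    return [sum(costs[k * window:(k + 1) * window]) / window
            for k in range(n_blocks)]


# Hypothetical per-example costs: mean 1.0 with heavy noise, never negative.
rng = random.Random(0)
per_example_costs = [rng.uniform(0.0, 2.0) for _ in range(100000)]

avg_1000 = block_average(per_example_costs, 1000)
avg_5000 = block_average(per_example_costs, 5000)

# Fewer plot points, but each point is less noisy.
print(len(avg_1000), "points, spread", statistics.pstdev(avg_1000))
print(len(avg_5000), "points, spread", statistics.pstdev(avg_5000))
```

Averaging over 5000 examples yields one fifth as many plot points as averaging over 1000, and the points scatter less around the underlying level. That combination is the smoother but more delayed curve described for figure-3b.
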

Next, let's examine the learning rate a little more closely.

figure-4

Figure-4 reviews the Stochastic gradient descent algorithm. When you run the algorithm, it starts from some point and meanders toward the minimum. It doesn't really converge; instead it wanders around the minimum forever. So you end up with a parameter value that is hopefully close to the global minimum but not exactly at it. In most typical applications of Stochastic gradient descent, the learning rate \alpha is held constant, so you typically end up with a picture like figure-4.

If you want Stochastic gradient descent to actually converge to the global minimum, one thing you can do is slowly decrease the learning rate \alpha over time. A typical way to do this is to set the learning rate as follows:

\alpha = \frac{\text{const1}}{\text{iterationNumber} + \text{const2}}

figure-5

Here iterationNumber is the number of training examples that Stochastic gradient descent has scanned so far, and const1 and const2 are additional parameters that you might have to fiddle with a bit to get good performance. If you manage to tune them well, the picture you get looks like figure-5. At the beginning, the algorithm meanders toward the minimum. But as it gets closer, because you're decreasing the learning rate, the meanderings get smaller and smaller until it pretty much converges to the global minimum.
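
As a rough sketch of this schedule, the snippet below plugs the formula into a Stochastic gradient descent loop. It makes several assumptions that are not in the lecture: a one-feature linear regression hypothesis, synthetic data, the hypothetical function names decayed_alpha and sgd_with_decay, and the values const1 = 5 and const2 = 100, which were picked only so that the example runs sensibly.

```python
import random


def decayed_alpha(iteration_number, const1, const2):
    """Learning rate schedule: alpha = const1 / (iterationNumber + const2)."""
    return const1 / (iteration_number + const2)


def sgd_with_decay(examples, const1=5.0, const2=100.0):
    """One pass of SGD on h(x) = theta0 + theta1 * x with a decaying alpha."""
    theta0, theta1 = 0.0, 0.0
    for iteration_number, (x, y) in enumerate(examples):
        # iteration_number counts the examples scanned so far.
        alpha = decayed_alpha(iteration_number, const1, const2)
        error = theta0 + theta1 * x - y
        theta0 -= alpha * error
        theta1 -= alpha * error * x
    return theta0, theta1


# Hypothetical synthetic data: y = 1 + 2x plus small Gaussian noise.
rng = random.Random(1)
data = []
for _ in range(20000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 1.0 + 2.0 * x + rng.gauss(0.0, 0.1)))

# The learning rate at the first and at the last example.
print(decayed_alpha(0, 5.0, 100.0), decayed_alpha(19999, 5.0, 100.0))
print(sgd_with_decay(data))
```

With these values the learning rate starts at 0.05 and has shrunk by a factor of about 200 by the last example, so the late updates move \theta only slightly. Other choices of const1 and const2 may behave quite differently, which is the tuning burden this schedule introduces.
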

In practice, people tend not to do this, because you end up spending time playing with these two extra parameters, which makes the algorithm more finicky. And frankly, we're usually pretty happy with any parameter value that is pretty close to the global minimum.

<end>
