Machine Learning Algorithms: The Learning Curve
Machine Learning
The learning curve is very useful for determining how to improve the performance of an algorithm. It helps you determine whether an algorithm is suffering from bias (underfitting), from variance (overfitting), or from a bit of both.
If your machine learning algorithm is not working as expected, what should you do next? There are several options:
- Getting more training data, which is very time-consuming. It may even take months to obtain more research data.
- Getting more training features. This may also take a lot of time. But if adding some polynomial features works, that is cool.
- Selecting a smaller set of training features.
- Increasing the regularization term.
- Decreasing the regularization term.
So, which one should you try next? It is not a good idea to start trying just anything, because you may end up spending too much time on something that is not helpful. You need to detect the problem first and then take action accordingly. A learning curve helps you detect the problem easily, which saves a lot of time.
How the Learning Curve Works
The learning curve is a plot of the cost function. Plotting the cost function for the training data and the cost function for the cross-validation data in the same figure gives important insights about the algorithm. As a reminder, here is the formula for the cost function:
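$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Here $h_\theta(x^{(i)})$ is the predicted output for the i-th training example, $y^{(i)}$ is the original output, and $m$ is the number of training examples.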
In other words, it is the square of the predicted output minus the original output, divided by twice the number of training examples. To make the learning curve, we need to plot these cost functions as a function of the number of training examples (m). Instead of using all the training data, we will use only smaller subsets of the training data to train the algorithm.
Have a look at the picture below:
Here is the concept. If we train with too few examples, the algorithm will fit the training data perfectly and the cost function will return 0. The picture above shows clearly that when we train with only one, two, or three examples, the algorithm can learn those few examples very well and the training cost comes out to be zero or close to zero. But this type of model cannot perform well on other data. When you try to fit the cross-validation data with this model, the probability is very high that it will perform poorly, so the cost function for the cross-validation data will return a very high value. On the other hand, as we take more and more data to train the algorithm, it will no longer fit the training data perfectly, so the training cost will become higher. At the same time, because the algorithm is trained on a lot of data, it will perform better on the cross-validation data, and the cost function for the cross-validation data will return a lower value. Here is how to develop a learning curve.
Develop A Learning Algorithm
I will demonstrate how to draw a learning curve step by step. To draw a learning curve, we first need a machine learning algorithm. For simplicity, I will work with a linear regression algorithm. I will move a bit faster here and not explain every step, because I am assuming you know how machine learning algorithms are developed. If you need a refresher on how to develop a linear regression algorithm, please check this article first:
First, import the packages and the dataset. The dataset I am using here is taken from Andrew Ng's machine learning course on Coursera. In this dataset, the X-values and y-values are organized in separate sheets of an Excel file. The X and y values of the cross-validation data are also organized in two other sheets of the same Excel file. I provided the link to the dataset at the end of this article. Please feel free to download the dataset and practice yourself.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# The training features are stored in the 'Xval' sheet of the Excel file
file = pd.ExcelFile('dataset.xlsx')
df = pd.read_excel(file, 'Xval', header=None)
df.head()
In the same way, import the y-values for the training set:
y = pd.read_excel(file, 'yval', header=None)
y.head()
Let's develop the linear regression algorithm quickly. Define the hypothesis and the cost function.
m = len(df)

def hypothesis(theta, X):
    # Linear hypothesis: h(x) = theta[0] + theta[1] * x
    return theta[0] + theta[1]*X

def cost_calc(theta, X, y):
    # Sum of squared errors divided by twice the number of examples
    return (1/(2*len(y))) * np.sum((hypothesis(theta, X) - y)**2)
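As a quick sanity check of these definitions: with theta = [0, 0] the hypothesis predicts zero everywhere, so the cost reduces to the mean of the squared y-values divided by two. This check is optional and not part of the walkthrough:

# With theta = [0, 0] every prediction is 0, so both lines should print the same value
print(cost_calc([0, 0], df[0], y[0]))
print(np.mean(y[0]**2) / 2)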
Now, we will define gradient descent to optimize the parameters.
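For reference, the updates implemented in the code below are the standard gradient descent rules for univariate linear regression, obtained by differentiating the cost function above:

$$\theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$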
def gradient_descent(theta, X, y, epoch, alpha):
    m = len(y)  # number of training examples
    cost = []
    i = 0
    while i < epoch:
        hx = hypothesis(theta, X)
        # Gradient steps for the intercept and the slope
        theta[0] -= alpha * (np.sum(hx - y) / m)
        theta[1] -= alpha * (np.sum((hx - y) * X) / m)
        cost.append(cost_calc(theta, X, y))
        i += 1
    return theta, cost
The linear regression algorithm is done. We need a method to predict the output:
def predict(theta, X, y, epoch, alpha):
    theta, cost = gradient_descent(theta, X, y, epoch, alpha)
    return hypothesis(theta, X), cost, theta
Now, initialize the parameters as zeros and use the predict function to predict the output variable.
theta = [0,0]
y_predict, cost, theta = predict(theta, df[0], y[0], 1400, 0.001)
The updated theta values are: [10.724868115832654, 0.3294833798797125]
Now, plot the predicted output and the original output against df in the same plot:
plt.figure()
plt.scatter(df, y)
plt.scatter(df, y_predict)
Looks like the algorithm is working well.
Draw A Learning Curve
Now, we can draw a learning curve. First, let's import the X and y values for our cross-validation dataset. As I mentioned earlier, they are organized in separate Excel sheets.
file = pd.ExcelFile('dataset.xlsx')
# Cross-validation features are stored in the 'X' sheet
cross_val = pd.read_excel(file, 'X', header=None)
cross_val.head()

# Cross-validation outputs are stored in the 'y' sheet
cross_y = pd.read_excel(file, 'y', header=None)
cross_y.head()
For this purpose, I want to modify the gradient_descent function a little bit. In our previous gradient_descent function, we calculated the cost in each iteration. I did that because it is good practice in traditional machine learning algorithm development. But for the learning curve, we do not need the cost in each iteration. So, to save running time, I will exclude the cost calculation from each epoch and return only the updated parameters.
def grad_descent(theta, X, y, epoch, alpha):
    m = len(y)  # number of examples in this training subset
    i = 0
    while i < epoch:
        hx = hypothesis(theta, X)
        theta[0] -= alpha * (np.sum(hx - y) / m)
        theta[1] -= alpha * (np.sum((hx - y) * X) / m)
        i += 1
    return theta
As I discussed earlier, to develop a learning curve, we need to train the learning algorithm on different subsets of the training data. Our training dataset has 21 examples. I will train the algorithm using just one example, then two examples, then three examples, all the way up to 21 examples. So, we will train the algorithm 21 times, on 21 subsets of the training data. We will also keep track of the cost function for each subset of training data. Please have a close look at the code; it will make things clearer.
j_tr = []
theta_list = []
for i in range(1, len(df) + 1):
    # Re-initialize the parameters for each subset size
    theta = [0, 0]
    # Train on the first i training examples only
    theta_list.append(grad_descent(theta, df[0][:i], y[0][:i], 1400, 0.001))
    # Record the training cost on the same subset
    j_tr.append(cost_calc(theta, df[0][:i], y[0][:i]))
theta_list
Here are the training parameters for each subset of training data:
Here is the cost for each training subset:
Look at the cost for each subset. When there were only one or two training examples, the cost was zero or almost zero. As we kept increasing the training data, the cost also went up, which was expected. Now, use the parameters above for all the subsets of training data to calculate the cost on the cross-validation data:
j_val = []
for theta_i in theta_list:
    j_val.append(cost_calc(theta_i, cross_val[0], cross_y[0]))
j_val
In the beginning, the cost was really high because the training parameters came from too few training examples. But as the parameters improved with more training data, the cross-validation error kept going down. Let's plot the training error and the cross-validation error in the same plot:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(range(1, 22), j_tr, label='training error')
plt.scatter(range(1, 22), j_val, label='cross-validation error')
plt.xlabel('number of training examples')
plt.ylabel('cost')
plt.legend()
This is our learning curve.
Drawing Decisions From the Learning Curve
The learning curve above looks nice, and it behaves the way we expected. In the beginning, the training error was too small and the validation error was too high. Gradually, they completely converge on each other, which is perfect! But in real life it does not happen very often. Most machine learning algorithms do not work perfectly the first time. Almost all the time they suffer from some problems that we need to fix. Here I will discuss some of those issues.
We may find our learning curve looks like this:
If there is a significant gap between the training error and the validation error, that indicates a high variance problem. It can also be called an overfitting problem. Getting more training data, selecting a smaller set of features, or both may fix this problem.
If a learning curve looks like this, it means that in the beginning the training error was too small and the validation error was too high. Slowly, the training error goes higher and the validation error goes lower, but at some point they become parallel. You can see from the picture that after a point, even with more training data, the cross-validation error does not go down anymore. In this case, getting more training data will not improve the machine learning algorithm. This indicates that the learning algorithm is suffering from a high bias problem. In this case, getting more training features may help.
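If you would rather not hand-roll the subset loop from earlier, a library can produce the same diagnostic. Below is a minimal sketch using scikit-learn's learning_curve helper with a plain LinearRegression model; this is an alternative to the manual implementation in this article (and assumes scikit-learn is installed), not the method used above:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# scikit-learn expects a 2-D feature array
X = df[0].values.reshape(-1, 1)
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y[0],
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
    scoring='neg_mean_squared_error')

plt.figure()
# Scores are negated MSE, so flip the sign to plot errors
plt.scatter(train_sizes, -train_scores.mean(axis=1))
plt.scatter(train_sizes, -val_scores.mean(axis=1))

The reading is the same as above: a persistent gap between the two curves suggests high variance, while curves that plateau together at a high error suggest high bias.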
Fixing A Learning Algorithm
Assume we are implementing linear regression, but the algorithm is not working as expected. What should we do now?
First, draw a learning curve as I demonstrated here. If you detect a high variance problem, select a smaller set of features based on the importance of the features. If that helps, it will save some time. If not, try getting more training data.
If you detect a high bias problem from the learning curve, you already know that getting additional features is a possible solution. You may even try adding some polynomial features, which often helps and saves a lot of time.
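As a concrete illustration, adding a polynomial feature to the model in this article could look like the sketch below; hypothesis_poly is an illustrative name, not part of the code above, and theta would need a third parameter (with a matching third update in gradient descent):

def hypothesis_poly(theta, X):
    # The same linear hypothesis as before, plus a squared term for extra flexibility
    return theta[0] + theta[1]*X + theta[2]*X**2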
If you are implementing an algorithm with a regularization term lambda, try decreasing lambda if the algorithm is suffering from high bias, and try increasing lambda if the algorithm is suffering from a high variance problem. Here is an article that explains the relationship of the regularization term with bias and variance in detail:
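In the meantime, to make the role of lambda concrete, here is a minimal sketch of how a regularization term could be added to the cost function from earlier in this article; cost_calc_reg and lam are illustrative names, not part of the code above:

def cost_calc_reg(theta, X, y, lam):
    m = len(y)
    # Penalize a large slope; by convention the intercept theta[0] is not regularized
    penalty = (lam / (2 * m)) * (theta[1] ** 2)
    return (1 / (2 * m)) * np.sum((hypothesis(theta, X) - y) ** 2) + penalty

A larger lam pushes theta[1] toward zero and makes the model simpler (more bias, less variance); a smaller lam does the opposite.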
In the case of a neural network, we may also come across this bias or variance problem. For a high bias or underfitting problem, we need to increase the number of neurons or the number of hidden layers. To address a high variance or overfitting problem, we should decrease the number of neurons or the number of hidden layers. We can even draw a learning curve for different numbers of neurons, as sketched below.
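As a sketch of that last idea, the snippet below compares training and validation error for a few hidden-layer sizes using scikit-learn's MLPRegressor; the layer sizes, iteration count, and data split are arbitrary demonstration choices, not part of this article's implementation:

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = df[0].values.reshape(-1, 1)
X_train, X_val, y_train, y_val = train_test_split(X, y[0], test_size=0.3, random_state=0)

for size in [(2,), (10,), (50,)]:
    model = MLPRegressor(hidden_layer_sizes=size, max_iter=5000, random_state=0)
    model.fit(X_train, y_train)
    # A widening train/validation gap as size grows points to variance
    print(size,
          mean_squared_error(y_train, model.predict(X_train)),
          mean_squared_error(y_val, model.predict(X_val)))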
Thank you so much for reading this article. I hope this was helpful.
Here is the dataset used in this article: