Train/Test Split

The reason the question of splitting comes up at all is simple, quite interesting and, most importantly, essential. Every dataset contains both real patterns and random effects. Suppose we fit a model that allows us to predict values: the model is exposed to the real effects as well as the random ones. What we do not know is which points are random, and therefore any other dataset we apply our model to may contain entirely different random patterns. This may work in favor of our model's accuracy, or against it.

In other words, if we had a perfectly "real" dataset and the model were fitted appropriately, new data points free of random effects would be predicted very accurately.

But let's be honest, this is purely hypothetical. Data contains random patterns.

In the real world, datasets contain both random and real effects, so it is unlikely that any model will be 100% accurate. Furthermore, new data points will likely contain random effects of their own, so whether the model can explain them is subject to randomness, too. As a result, some random effects may be explained while others are not.

The question we want to answer is a simple one: "Is my model accurate, and how accurate is it?". Basically, there are two entry points to answering this question: do I have a single model, or are there several models whose accuracy needs to be evaluated?

Data Splitting

Knowing whether there is one model or several models to test is essential, because it determines how we split the available data. It seems quite intuitive to split the data into a training portion and a test portion, so the model can be trained on the first and then evaluated on the second. It is usually a good idea to split the data so that the model is trained on the larger portion, in order to adapt to more of the possible data constellations. This is a good procedure for the case where we are only looking at a single model.

When it comes to several models, things become a bit more complex, but only a tiny bit. Let's assume there are 10 models that may be useful for predicting the data. In this case we aim to produce comparable accuracy figures for all 10 models and pick the best one. We could split the data into two parts and repeat the process described in the previous paragraph.

This would be wrong, for a simple reason. If one model outperforms all the others, that may be because it genuinely is better. However, it could also be pure randomness, and this is exactly why we need another, independent test set. This test set contains real and random effects that are related neither to the training data nor to the validation data. Applying the selected model to this data reveals its accuracy.

To make this process a bit more tangible, here are the steps with a bit more structure and terminology:

  • Training dataset: used to train the model.
  • Validation dataset: used to test all of the candidate models. This dataset is not used when there is only one model, as there is nothing to validate!
  • Test dataset: this is what it is all about: how does the best model, selected on its validation performance, perform on a third dataset that it has never seen before?
[Image: Train/Validate/Test — Multiple Model Validation]
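The three-way split described above can be sketched in plain Python; the 70/15/15 proportions and the dataset size are illustrative assumptions, not a prescription:

```python
import random

random.seed(0)  # for reproducibility
n = 100
indices = list(range(n))
random.shuffle(indices)

# 70% train, 15% validation, 15% test (illustrative proportions).
n_train = int(0.70 * n)
n_val = int(0.15 * n)

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```

All candidate models would be fitted on `train_idx`, compared on `val_idx`, and only the winner would be evaluated once on `test_idx`.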

Please also keep in mind that data may follow a natural structure. This is something you should think through carefully, e.g. whether the data contains seasonal or similar patterns, since a purely random split can break such structure.

Another main problem is that we do not know which data points are random and which are not. For this reason, we should use a random procedure that selects data points from our dataset. There are various ways to split the data, but in general choosing random data points is a good way to start; R's sample or Python's random.sample functions can be very handy.
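As a minimal sketch of this idea with random.sample (the toy dataset and the 80/20 ratio are assumptions for illustration):

```python
import random

# Toy dataset: 10 rows of (feature, label) pairs.
data = [(x, x % 2) for x in range(10)]

# Randomly pick 80% of the row indices for training.
random.seed(42)  # for reproducibility
train_idx = set(random.sample(range(len(data)), int(0.8 * len(data))))

train = [row for i, row in enumerate(data) if i in train_idx]
test = [row for i, row in enumerate(data) if i not in train_idx]

print(len(train), len(test))  # 8 2
```

Every row ends up in exactly one of the two sets, and which rows land in the test set is decided by the random draw rather than by their position in the file.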

A brief look at the R documentation reveals example code for splitting data into train and test, which is the way to go if we only test one model. If we have several models to test, the data should instead be split three ways: a training set of around 70%, with the remainder divided into equal halves for validation and testing.

# 0.8 is the size of the training data
train_index <- sample(1:nrow(adult), 0.8 * nrow(adult))
test_index <- setdiff(1:nrow(adult), train_index)

# Build X_train, y_train, X_test, y_test
X_train <- adult[train_index, -15]
y_train <- adult[train_index, "income"]
X_test <- adult[test_index, -15]
y_test <- adult[test_index, "income"]

Alternatives to Train/Validate/Test

There are many other ways to train and validate models; cross-validation is one of them. In the cross-validation case (generally provided as a function alongside the model), a large portion of the data (maybe 70-90%) is given to the procedure, and only the remainder is kept for the final testing.

Notable cross-validation schemes are the "leave one out" and "k-fold" (sometimes referred to as n-fold) variants. The cross-validation procedure takes the first portion and performs the validation steps on it. In the case of k-fold cross-validation, this set is split into k smaller parts; k-1 of the subsets are used for training and the remaining one for validation. This step is repeated until every candidate model has an accuracy score, which allows choosing the best one.
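A minimal sketch of the k-fold index bookkeeping in plain Python (no libraries; in practice a ready-made cross-validation function such as scikit-learn's KFold would typically be used instead):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; yield (train, validate) index pairs."""
    # Distribute n indices as evenly as possible over k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        validate = indices[start:start + size]          # the held-out fold
        train = indices[:start] + indices[start + size:]  # the other k-1 folds
        yield train, validate
        start += size

# Each of the k=5 rounds trains on 4 subsets and validates on the fifth.
for train, validate in k_fold_indices(n=10, k=5):
    print(len(train), len(validate))  # 8 2 in every round
```

Averaging a model's score over the k validation folds gives the accuracy figure used to compare the candidate models.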

As in the train/validate/test procedure, the last step is testing the chosen model on the test set.

I will cover cross-validation in another post soon.

As I hopefully made clear, splitting data into meaningful portions is an absolute must. If we do not follow an orderly procedure, our model may appear better than it really is, and that is something you need to avoid as a sound data scientist.

{See you next time}

[1] Photo by invisiblepower on Unsplash. Thanks!

Translated from: https://towardsdatascience.com/train-test-split-c3eed34f763b
