Machine Learning Dataset Splitting
Dataset Splitting
Splitting data into Training, Cross Validation, and Test sets is a common best practice. It allows you to tune the parameters of an algorithm without making judgements that conform specifically to the training data.
Motivation
Dataset splitting is necessary to reduce bias toward the training data in ML algorithms. Tuning the parameters of an ML algorithm to best fit the training data commonly produces an overfit model that performs poorly on actual test data. For this reason, we split the dataset into multiple discrete subsets and tune different parameters on each.
The Training Set
The Training set is used to fit the actual model your algorithm will use when exposed to new data. This subset typically contains 60%-80% of your entire available data (depending on whether or not you use a Cross Validation set).
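As a minimal sketch of the split described above, the following uses scikit-learn's `train_test_split` to carve a toy dataset into 60% training, 20% cross validation, and 20% test sets; the exact ratio and `random_state` are illustrative choices, not requirements.

```python
# Sketch: a 60/20/20 train / cross-validation / test split with scikit-learn.
# The 60/20/20 ratio and random_state below are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy feature matrix: 50 examples, 2 features
y = np.arange(50)                  # toy labels

# First carve off 20% of the data as the held-out test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Then split the remaining 80% again, taking 25% of it for cross validation.
# 25% of 80% is 20% of the original data, leaving 60% for training.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_cv), len(X_test))  # → 30 10 10
```

Because `train_test_split` only produces two partitions per call, the three-way split is done in two passes; shuffling (the default) helps ensure each subset is representative of the whole dataset.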