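Classification and regression involve many specialized terms, each with a precise definition in machine learning. This post collects the most common ones.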
-
Sample or input: a data point that goes into the model.
-
Prediction or output: what comes out of the model.
-
Target: the true value. Ideally, it is what the model should have predicted, as determined by an external source of data.
-
Prediction error or loss value: a measure of the distance between the model's prediction and the target.
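As a minimal sketch of this idea, the snippet below measures the distance between predictions and targets with mean squared error. MSE is only one possible loss, picked here for illustration; the function name `mean_squared_error` and the sample numbers are assumptions, not part of the original text.

```python
def mean_squared_error(predictions, targets):
    """Average squared distance between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values, for illustration only.
targets = [3.0, 5.0, 2.0]
predictions = [2.0, 5.0, 4.0]

# Squared errors are 1, 0 and 4, so the loss is their average.
loss = mean_squared_error(predictions, targets)
print(loss)
```

A loss of zero would mean every prediction matches its target exactly; larger values mean the predictions are further away.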
-
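Classes: the set of possible labels to choose from in a classification problem. For example, when classifying images of cats and dogs, "dog" and "cat" are the two classes.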
-
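Label: a specific instance of a class annotation in a classification problem. For example, if image #1234 is annotated as containing the class "dog", then "dog" is a label of image #1234.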
-
Ground truth or annotations: all the targets for a dataset, typically collected by humans.
-
The ground truth is what you measured for your target variable for the training and testing examples.
Nearly all the time you can safely treat this the same as the label.
In some cases it is not precisely the same as the label. For instance, if you augment your dataset, there is a subtle difference between the ground truth (your actual measurements) and how the augmented examples relate to the labels you have assigned. However, this distinction is not usually a problem.
Ground truth can be wrong. It is a measurement, and there can be errors in it. In some ML scenarios it can also be a subjective measurement where it is difficult to define an underlying objective truth, e.g. expert opinion or analysis that you are hoping to automate. Any ML model you train will be limited by the quality of the ground truth used to train and test it, and that is part of the explanation on the Wikipedia quote. It is also why published articles about ML should include full descriptions of how the data was collected.
-
Binary classification: a classification task in which each input sample should be assigned to one of two mutually exclusive categories.
-
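Multiclass classification: a classification task in which each input sample should be assigned to one of more than two categories, for example classifying handwritten digits.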
-
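Multilabel classification: a classification task in which each input sample can be assigned multiple labels. For example, if an image may contain both a cat and a dog, it should be annotated with both the "cat" label and the "dog" label. The number of labels per image is usually variable.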
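The three classification settings differ mainly in what a target looks like. The sketch below shows one common way to encode them: a single 0/1 value for binary, a one-hot vector for multiclass, and a multi-hot vector for multilabel. The class list and the helper names `binary_target`, `one_hot` and `multi_hot` are hypothetical, chosen only for this illustration.

```python
CLASSES = ["cat", "dog", "bird"]  # hypothetical class list


def binary_target(label, positive_class="dog"):
    """Binary target: 1 for the positive class, 0 otherwise."""
    return 1 if label == positive_class else 0


def one_hot(label, classes):
    """Multiclass target: exactly one position is set to 1."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels, classes):
    """Multilabel target: one position per label present, possibly several."""
    present = set(labels)
    unknown = present - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if c in present else 0 for c in classes]


print(binary_target("dog"))                # a single 0/1 value
print(one_hot("dog", CLASSES))             # one entry set
print(multi_hot(["cat", "dog"], CLASSES))  # two entries set
```

Because a multi-hot vector has a fixed length regardless of how many labels apply, it copes naturally with a variable number of labels per image.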
-
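Scalar regression: a task in which the target is a continuous scalar value. Predicting house prices is a good example: the different target prices form a continuous space.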
-
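Vector regression: a task in which the target is a set of continuous values, for example a continuous vector. If you are regressing against multiple values (such as the coordinates of a bounding box in an image), you are doing vector regression.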
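To make the difference between scalar and vector targets concrete, here is a small sketch that computes a mean absolute error for both cases. The helper name `mean_absolute_error`, the house price and the bounding-box layout `(x_min, y_min, x_max, y_max)` are all assumptions made for this example.

```python
def mean_absolute_error(prediction, target):
    """Average absolute difference; accepts a scalar or a sequence of values."""
    # Wrap scalars in a list so both cases share the same code path.
    pred = list(prediction) if isinstance(prediction, (list, tuple)) else [prediction]
    targ = list(target) if isinstance(target, (list, tuple)) else [target]
    if len(pred) != len(targ):
        raise ValueError("prediction and target must have the same size")
    return sum(abs(p - t) for p, t in zip(pred, targ)) / len(pred)


# Scalar regression: one continuous value per sample (hypothetical house price).
price_error = mean_absolute_error(310_000.0, 300_000.0)

# Vector regression: several continuous values per sample
# (hypothetical bounding box given as x_min, y_min, x_max, y_max).
box_error = mean_absolute_error(
    [10.0, 20.0, 110.0, 220.0],
    [12.0, 20.0, 108.0, 224.0],
)

print(price_error, box_error)
```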
-
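Mini-batch or batch: a small set of samples (typically between 8 and 128) that the model processes simultaneously. The number of samples is often a power of 2, which makes memory allocation on the GPU easier. During training, a mini-batch is used to compute a single gradient-descent update of the model's weights.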
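The following sketch illustrates the "one mini-batch, one update" idea on a toy problem. It assumes a one-parameter model `y = w * x` trained with mean squared error; the helper names `iterate_minibatches` and `sgd_step`, the learning rate and the data are hypothetical and only serve the illustration.

```python
import random


def iterate_minibatches(samples, targets, batch_size, seed=0):
    """Yield (batch_samples, batch_targets) pairs covering the data once."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [samples[i] for i in batch], [targets[i] for i in batch]


def sgd_step(w, xs, ys, learning_rate):
    """One gradient-descent update of w for the model y = w * x under MSE."""
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - learning_rate * grad


# Toy data generated from y = 3x (hypothetical, for illustration only).
samples = [i / 32 for i in range(32)]
targets = [3.0 * x for x in samples]

w = 0.0
num_updates = 0
for epoch in range(50):
    # 32 samples with batch_size=8 give 4 mini-batches per epoch.
    for xs, ys in iterate_minibatches(samples, targets, batch_size=8, seed=epoch):
        w = sgd_step(w, xs, ys, learning_rate=0.5)  # one update per mini-batch
        num_updates += 1

print(w, num_updates)
```

Each pass over the data performs one weight update per mini-batch, so smaller batches mean more (and noisier) updates per epoch.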