Data labeling
1. Data labeling workflow
- Have enough data?
- Improve the labels, the data, or the model?
- Enough labels?
  semi-supervised learning
- Enough budget?
  label via crowdsourcing (when money is no object)
- Use weak labels? (Are lower-quality labels acceptable?)
  weak supervision
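
The labeling part of this checklist can be read as a small decision procedure. The sketch below covers only the three label-related questions and rests on two assumptions: each action applies when the question above it is answered "yes", and the questions are asked in the order listed. The function name, its boolean parameters, and the fallback string are hypothetical and do not come from the notes.

```python
def choose_labeling_strategy(enough_labels: bool,
                             enough_budget: bool,
                             weak_labels_ok: bool) -> str:
    """Walk the label-related questions in the order listed (assumed order)."""
    if enough_labels:
        # Enough labels to start with: exploit the unlabeled data as well.
        return "semi-supervised learning"
    if enough_budget:
        # Pay people to label the data.
        return "label via crowdsourcing"
    if weak_labels_ok:
        # Accept cheaper, noisier labels.
        return "weak supervision"
    # Fallback that is not part of the notes.
    return "no strategy from the checklist applies"


if __name__ == "__main__":
    print(choose_labeling_strategy(False, True, False))
```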
2. Semi-supervised learning (SSL)
- Focuses on the scenario where there is a small amount of labeled data along with a large amount of unlabeled data.
- Makes assumptions about the data distribution in order to use the unlabeled data:
  - Continuity assumption: examples with similar features are more likely to have the same label.
  - Cluster assumption: the data have an inherent cluster structure, and examples in the same cluster tend to have the same label.
  - Manifold assumption: the data lie on a manifold of much lower dimension than the input space.
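- Self-training
  - Self-training is an SSL method: a model trained on the labeled data predicts pseudo-labels for the unlabeled data, and the confident predictions are added to the training set before retraining.
- We can use expensive models here (a game for those with big budgets: deepen the network, ensemble models; if one model is not enough, use n of them).
  - deep neural networks, model ensembles/bagging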
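
A minimal sketch of a self-training loop, to make the idea concrete. The notes suggest expensive models such as deep networks or ensembles; here a nearest-centroid classifier on 2-D points is swapped in so the example stays short and needs only the standard library. The function names, the confidence definition (a softmax over negative distances), and the `threshold` and `max_rounds` parameters are all assumptions made for illustration.

```python
import math


def fit_centroids(points, labels):
    """Return {label: centroid} for a list of 2-D points."""
    sums = {}
    for (x, y), lab in zip(points, labels):
        sx, sy, n = sums.get(lab, (0.0, 0.0, 0))
        sums[lab] = (sx + x, sy + y, n + 1)
    return {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}


def predict_with_confidence(centroids, point):
    """Return (label, confidence) using a softmax over negative distances."""
    weights = {lab: math.exp(-math.dist(point, c)) for lab, c in centroids.items()}
    total = sum(weights.values())
    label = max(weights, key=weights.get)
    return label, weights[label] / total


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident points and refit the model.

    labeled:   list of ((x, y), label) pairs
    unlabeled: list of (x, y) points
    Returns the final centroids and the points that were never pseudo-labeled.
    """
    points = [p for p, _ in labeled]
    labels = [lab for _, lab in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(points, labels)
        remaining, added = [], 0
        for p in pool:
            lab, conf = predict_with_confidence(centroids, p)
            if conf >= threshold:
                # Confident prediction: treat it as a (pseudo-)label.
                points.append(p)
                labels.append(lab)
                added += 1
            else:
                remaining.append(p)
        pool = remaining
        if added == 0:
            break  # nothing confident left to add
    return fit_centroids(points, labels), pool


if __name__ == "__main__":
    labeled = [((0, 0), "a"), ((10, 10), "b")]
    unlabeled = [(1, 0), (0, 1), (9, 10), (10, 9), (5, 5)]
    centroids, leftover = self_train(labeled, unlabeled)
    print(centroids, leftover)
```

The threshold controls the trade-off: a low value adds more pseudo-labels but lets more wrong ones into the training set.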
3. Challenges in data labeling
- Simplify user interaction: design easy tasks, clear instructions, and a simple-to-use interface.
  - e.g., the user instructions and task used for the MIT Places365 dataset
- Cost: active learning + self-training (keeps the labeling cost in check)
  - Focuses on the same scenario as SSL, but with human intervention.
  - Uncertainty sampling chooses the example whose prediction is most uncertain and sends it to a human for labeling.
  - As with self-training, we can use expensive models.
  - Query by committee trains multiple models and performs majority voting.
- Quality control: the quality of labels produced by different labelers varies.
  - Simplest but most expensive: send the same task to multiple labelers, then determine the label by majority voting.
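
The selection and voting steps in this section can be sketched in a few lines. The code below is illustrative only, and all function names are hypothetical. It makes two assumptions: uncertainty is measured by the entropy of the predicted class probabilities (least confidence or margin would also work), and query by committee asks a human about the example on which the committee's votes disagree most. The notes themselves only say that the committee performs majority voting.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(pool_probs):
    """Uncertainty sampling: index of the example with the highest entropy."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))


def majority_vote(votes):
    """Most common label among votes from several models or several labelers.

    Ties go to the label that appears first in `votes`.
    """
    return Counter(votes).most_common(1)[0][0]


def disagreement(votes):
    """Fraction of votes that differ from the majority label (0.0 = unanimous)."""
    top_count = Counter(votes).most_common(1)[0][1]
    return 1 - top_count / len(votes)


def query_by_committee(committee_votes):
    """Index of the example whose committee votes disagree the most."""
    return max(range(len(committee_votes)),
               key=lambda i: disagreement(committee_votes[i]))


if __name__ == "__main__":
    # Predicted class probabilities of one model on three unlabeled examples.
    pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
    print(most_uncertain(pool_probs))

    # Votes of a three-model committee on two unlabeled examples.
    committee_votes = [["cat", "cat", "cat"], ["cat", "dog", "dog"]]
    print(query_by_committee(committee_votes))

    # The same task sent to three human labelers.
    print(majority_vote(["cat", "dog", "cat"]))
```

The same `majority_vote` helper serves both purposes: combining a committee's predictions and aggregating redundant human labels for quality control.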