DataPerf-Training Set Cleaning (Data-centric AI): An Overview of the Challenge and Some Participation Details

Preface

In deep learning, the incremental gains from improving models are diminishing, and more and more researchers are shifting their attention from model-centric work to data-centric work. The data-centric view emphasizes the need for systematic methods to evaluate, synthesize, clean, and annotate the data used to train and test AI models [1]. DataPerf is a recently established, fairly authoritative community and platform that maintains leaderboards for data benchmarks. Google has recently run a challenge on benchmark creation and dataset improvement through DataPerf; the first round runs from March 30 to May 26, 2023.
Challenge website (a VPN may be required in mainland China): https://sites.google.com/mlcommons.org/dataperf/home_1/training-set-cleaning-vision

Basic Rules

As a participant, you will be asked to rank the samples in the entire training set; we will then clean them one by one and evaluate the performance of the model after each fix. The earlier it reaches a high enough accuracy, the better your rank.

Similar to other DataPerf challenges, the cleaning challenge comes in two flavors: the open division and the closed division.

  • In the open division, you will submit the output of running your cleaning algorithm on a given dataset. Then we will train the model and evaluate it based on your submission.
  • In the closed division, you will submit the cleaning algorithm itself, and we will run your algorithm to generate the output on several hidden datasets. Then we evaluate your submissions.

Participation (MLCube and Dynabench)

MLCube was developed to help you get started on your local computer: it will help you download the datasets, run some baseline algorithms, evaluate your submission and the baselines, and plot the results.

Once you are satisfied with your results, you can submit them to Dynabench, the platform on which your submission is evaluated and the leaderboard for this challenge is shown.
Install MLCube: https://mlcommons.github.io/mlcube/getting-started/
Evaluation/leaderboard site: https://dynabench.org/tasks/vision-debugging

Evaluation Metric

Your submission will be evaluated based on “how many samples your submission needs to fix, to achieve a high enough accuracy”. This imitates real use cases of data cleaning algorithms, where we want to inspect as few samples as possible while keeping the data quality good enough. For example, if the accuracy of the model trained on a perfectly clean dataset is 0.9, then we define the high enough accuracy to be 0.9 * 95% = 0.855. Assume that algorithm A achieves an accuracy of 0.855 after fixing 100 samples and algorithm B achieves an accuracy of 0.855 after fixing 200 samples; then score(A) = 100/300 = 1/3 while score(B) = 2/3. In other words, the lower the score, the better the cleaning algorithm.

Since dataset-centric challenges like this are quite rare, the author's personal understanding is as follows:
Suppose the reference dataset has 300 samples and the provided model reaches an accuracy of 0.9 on the clean version. The organizers add noise to this dataset and release it to participants, who must order the IDs of the 300 samples from worst to best, so that the required accuracy (e.g., 0.9 * 95% = 0.855) is reached after fixing as few samples as possible (fix the worst 1 sample, compute the accuracy; fix the worst 2, compute the accuracy; ...). The fewer samples that need to be fixed, the lower the score and the better the cleaning algorithm.
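To make the metric concrete, here is a minimal sketch of the scoring loop described above (not the official evaluation code; the helper retrain_and_evaluate and the default constants are assumptions for illustration):

# Minimal sketch of the scoring logic (NOT the official evaluation code).
# `ranking` is the submitted list of training-sample IDs, ordered from most
# likely noisy to cleanest; `retrain_and_evaluate(fixed_ids)` is a hypothetical
# helper that fixes the labels of `fixed_ids`, retrains the model, and returns
# the test accuracy.

def cleaning_score(ranking, retrain_and_evaluate,
                   clean_accuracy=0.9, threshold_ratio=0.95):
    """Return the fraction of samples that must be fixed to reach the target accuracy."""
    target = clean_accuracy * threshold_ratio    # e.g. 0.9 * 95% = 0.855
    n = len(ranking)                             # e.g. 300 training samples
    for k in range(n + 1):
        acc = retrain_and_evaluate(ranking[:k])  # fix the first k samples, then evaluate
        if acc >= target:
            return k / n                         # e.g. 100/300 = 1/3 for algorithm A
    return 1.0                                   # target accuracy never reached

Lower is better: algorithm A from the example above would score 1/3, algorithm B 2/3.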

Setup and Installation

Installation - MLCube (mlcommons.github.io)

The author successfully installed Docker Engine, Docker Desktop, and the MLCube runner on his own machine (Ubuntu 18.04); in general, simply following the official instructions works.

Installation pages:

Dynabench

Install Docker Desktop on Ubuntu

Install Docker Engine on Ubuntu

Installation references:

Docker Series: A Complete Docker Tutorial (小夕Coding's blog, CSDN)

Installing Docker Engine on Ubuntu (小夕Coding's blog, CSDN)

Notes on the Official Docker Documentation (1): Installing, Upgrading and Uninstalling Docker Desktop for Linux (choose either this or Docker Engine; Docker Engine is recommended) (MAVER1CK's blog, CSDN)

Sample Run

MNIST - MLCube (mlcommons.github.io)
[Screenshots: running the MNIST MLCube example]

Offline evaluation (MLCube)

DataPerf - Cleaning Evaluation (google.com)
Here the author got stuck on two errors…
The first one:
[Screenshots: the first error message]
Although several sources suggest that changing the numpy version resolves this, the author tried a few times without being able to get past the error. The exact code that fails has not been located yet; the current suspicion is that the environment (e.g., Python 3.9) does not match the code here.

Next, the second one:
[Screenshots: the second error message]
Even after identifying the cause of the problem:
[Screenshot: the suspected cause]
the author was still unable to resolve it at the code level:
[Screenshot: the code in question]

Submitting Results and Approaches

There are two submission options: submitting the results, or submitting the data cleaning algorithm itself.

Detailed requirements: DataPerf - Cleaning Rules (google.com)

In the submit-results mode:

A submission example (the ranked IDs plus the elapsed time) is provided in the folder .\dataperf-vision-debugging\examples; a rough sketch of writing such a file is shown below.
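The exact schema is defined by the example files in that folder; the sketch below only illustrates the idea, and the field names "ranking" and "time_elapsed" as well as the JSON layout are assumptions rather than the repository's actual format.

# Hypothetical sketch of writing an open-division submission file.
# The real layout should be copied from ./dataperf-vision-debugging/examples;
# the keys "ranking" and "time_elapsed" below are assumptions for illustration.
import json

def write_submission(ranked_ids, elapsed_seconds, out_path="my_submission.json"):
    payload = {
        "ranking": list(ranked_ids),      # training-sample IDs, most suspicious first
        "time_elapsed": elapsed_seconds,  # runtime of the cleaning algorithm in seconds
    }
    with open(out_path, "w") as f:
        json.dump(payload, f, indent=2)

# Example: a trivial identity-order ranking over the 300 training samples.
# write_submission(list(range(300)), elapsed_seconds=0.5)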

Code and files download: DS3Lab/dataperf-vision-debugging: Alpha version of our data-centric vision benchmark for training data correction (github.com)

Dataset download: https://pan.baidu.com/s/1lA0A33BrQ5ZlQD4RBsYupQ?pwd=q0rh  extraction code: q0rh (to avoid needing a VPN, the dataset has been mirrored from Google Drive to Baidu Netdisk).

Note that:

Data source and description: The provided candidate pool is a custom subset of the training set for the Open Images dataset. You may refer to non-label metadata from the Open Images dataset [link]

Only the provided dataset may be used, and the test set data must not be exploited.

The four MLCube tasks:

tasks:
  download:
    # Download data
    parameters:
      inputs: { parameters_file: { type: file, default: parameters.yaml } }
      outputs: { output_path: ./ }

  create_baselines:
    # Run selection script
    parameters:
      inputs: { embedding_folder: embeddings/, groundtruth_folder: data/ }
      outputs: { submission_folder: submissions/ }

  evaluate:
    # Run evaluation script
    parameters:
      inputs:
        {
          submission_folder: submissions/,
          groundtruth_folder: data/,
          embedding_folder: embeddings/,
        }
      outputs: { results_folder: results/ }

  plot:
    # Run plotter script
    parameters:
      inputs: { results_folder: results/, submission_folder: submissions/}

Some parameters of the benchmark setup:

data_id: 01g317-flipped
train_size: 300
noise_level: 0.3
test_size: 500
val_size: 100
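In other words, the training set contains 300 samples whose labels have been corrupted (noise_level 0.3 presumably means roughly 30% of the labels were flipped, matching the data_id 01g317-flipped), and the repository ships precomputed embeddings for them. As one possible starting point, below is a minimal sketch of a k-nearest-neighbour label-disagreement heuristic on those embeddings; it is not the repository's official baseline, and the array shapes and the loading step are assumptions.

# Sketch of one possible cleaning heuristic (NOT the official baseline):
# rank training samples by how strongly their (possibly flipped) label disagrees
# with the labels of their nearest neighbours in embedding space.
# Assumptions: `embeddings` is an (N, D) float array and `labels` an (N,) array,
# loaded in whatever way the repository's data format requires.
import numpy as np

def rank_by_knn_disagreement(embeddings, labels, k=10):
    """Return sample indices sorted from most suspicious to least suspicious."""
    # Cosine similarity between all pairs of embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity

    # For each sample, take its k most similar neighbours.
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    neighbour_labels = labels[neighbours]        # shape (N, k)

    # Suspicion score = fraction of neighbours whose label disagrees with the sample's.
    disagreement = (neighbour_labels != labels[:, None]).mean(axis=1)

    # Most suspicious samples first: fix these before the rest.
    return np.argsort(-disagreement)

# Example with the benchmark sizes: 300 training samples, ~30% noisy labels.
# ranking = rank_by_knn_disagreement(embeddings, labels, k=10)
# This ranking (a permutation of 0..299) is what would go into the submission file.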

Final Remarks

If you are also taking part in this challenge, feel free to get in touch and discuss it together.
For the two unresolved errors described above, any ideas or solutions would also be very welcome.
Finally, the author's research area is deep learning and computer vision, with a current focus on the data-centric paradigm; like-minded readers are welcome to discuss and improve together. Thanks!

[1] Liang, Weixin, et al. "Advances, challenges and opportunities in creating data for trustworthy AI." Nature Machine Intelligence 4 (2022): 669-677.
[2] https://sites.google.com/mlcommons.org/dataperf/home_1
Note: Unattributed links, code, screenshots, etc. are mostly taken from the official challenge materials; in case of infringement, please contact the author and they will be removed immediately.
