文献汇总|AI生成图像检测相关数据集汇总

前言:本博客汇总当前AI生成图像检测领域用到的数据集及相关链接。
⚠️:除标注「未公开」数据集,其余数据集均已开源。

2020

  • CNNSpot
    https://github.com/peterwang512/CNNDetection
    Testset: The zip file contains images from 13 CNN-based synthesis algorithms, including the 12 testsets from the paper and images downloaded from whichfaceisreal.com. Images from each algorithm are stored in a separate folder. In each category, real images are in the 0_real folder, and synthetic images are in the 1_fake folder.
    Note: ProGAN, StyleGAN, StyleGAN2, CycleGAN testset contains multiple classes, which are stored in separate subdirectories.
    Training set: The training set used in the paper can be downloaded here (Try alternative links 1,2 if the previous link does not work). All images are from LSUN or generated by ProGAN, and they are separated in 20 object categories. Similarly, in each category, real images are in the 0_real folder, and synthetic images are in the 1_fake folder.
    Validation set: The validation set consists of held-out ProGAN real and fake images, and can be downloaded here. The directory structure is identical to that of the training set.
    在这里插入图片描述

2022


2023

  • DiffusionForensics
    https://github.com/ZhendongWang6/DIRE
    在这里插入图片描述

  • DMimageDetection
    https://github.com/grip-unina/DMimageDetection/tree/main/training_code
    https://luminohope.org/pub/publication/arxiv_diffusion_detection_2022/

  • GenImage
    https://github.com/GenImage-Dataset/GenImage
    We employ eight generative models for image generation, namely BigGAN [2], GLIDE [21], VQDM [8], Stable Diffusion V1.4 [25], Stable Diffusion V1.5 [25], ADM [5], Midjourney [20], and Wukong [35].
    在这里插入图片描述

  • Fake2M
    https://arxiv.org/pdf/2304.13023
    We constructed 3 training fake datasets with about 2M images, named Fake2M, and 11 validation fake datasets with about 257K images using different latest modern generative models, which contain the SOTA Diffusion models (Stable Diffusion [46], IF [4]), the SOTA GAN model (StyleGAN3 [31]), the SOTA autoregressive model (CogView2 [19]), and the SOTA generative model (Midjounrey [6]), as shown in Tab. 2. We describe the details of our datasets in the following subsections.
    在这里插入图片描述

  • TWIGMA
    https://yiqunchen.github.io/TWIGMA/index.html#dataset
    在这里插入图片描述

  • ArtiFact
    https://github.com/awsaf49/artifact
    To include a diverse collection of real images from multiple categories, including Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and many other real-life objects, the proposed dataset utilizes 8 sources [7], [14]–[16] that are carefully chosen. Additionally, to inject diversity in terms of generators, the proposed dataset synthesizes images from 25 distinct methods [7]–[9], [14]–[24]. Specifically, it includes 13 GANs, 7 Diffusion, and 5 other miscellaneous generators. On the other hand, in terms of syntheticity, there are 20 fully manipulating and 5 partially manipulating generators, thus providing a broad spectrum of diversity in terms of generators used. The distribution of real and fake data with different sources is shown in Fig.1 and Fig.2, respectively. The dataset contains a total of 2,496,738 images, comprising 964,989 real images and 1,531,749 fake images. The most frequently occurring categories in the dataset are Human/Human Faces, Animal/Animal Faces, Vehicles, Places, and Art.

13GANs: BigGAN, CycleGAN, Denoising Diffusion GAN, Diffusion GAN, FaceSynthetics, GANformer, GauGAN, ProGAN, ProjectedGAN, StarGAN, StyleGAN1, StyleGAN2, StyleGAN3
7DMs: DDPM, Glide, LaMa, Latent Diffusion, Stable Diffusion, Taming Transformer, VQDiffusion
5 Others: CIPS, Generative Inpainting, MAT, Palette, SFHQ
在这里插入图片描述


2024

We generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators.

Generators: Stable Diffusion 1.4, Stable Diffusion 2.1, Stable Diffusion XL, and DeepFloyd IF

Consequently, we generate and release the Diffusion-generated Deepfake Detection (D3 ) dataset containing 2.3 million records, each composed of a real image coming from LAION-400M [44] dataset and images from four generators, for a total of 9.2 million generated images. To verify the generation capabilities of deepfake detection methods to unseen generators, we also collect a challenging test set composed of 4.8k real images, each paired with 12 fake images generated by as many diffusion-based generators.

With the aim of increasing the variance of the dataset, images have been generated with different aspect ratios, i.e. 256x256, 512x512, 640×480, and 640×360. Moreover, to mimic the distribution of real images, we also employ a variety of encoding and compression methods (BMP, GIF, JPEG, TIFF, PNG). In particular, we closely follow the distribution of encoding methods of LAION itself, therefore favoring the presence of JPEG-encoded images.
在这里插入图片描述

  • ImagiNet
    https://github.com/delyan-boychev/imaginet
    https://huggingface.co/datasets/delyanboychev/imaginet
    To support the development of defensive methods, we introduce ImagiNet, a high-resolution and balanced dataset for synthetic image detection, designed to mitigate potential biases in existing resources. It contains 200k examples, spanning four content categories: photos, paintings, faces, and uncategorized. Synthetic images are produced with open-source and proprietary generators, whereas real counterparts of the same content type are collected from public datasets.
    在这里插入图片描述
    在这里插入图片描述

  • AntifakePrompt
    https://github.com/nctu-eva-lab/AntifakePrompt
    We conduct full-spectrum experiments on datasets from a diversity of 3 held-in and 20 held-out generative models, covering modern text-to-image generation, image editing and adversarial image attacks.
    Real datasets. We use Microsoft COCO (COCO) (Lin et al. 2014) dataset and Flickr30k (Young et al. 2014) dataset. In our work, we selected 90K images, with shorter sides greater than 224, from COCO dataset for the real images in the training dataset. Moreover, to assess the generalizability of our method over various real images, we additionally select 3K images from Flickr30k dataset to form a held-out testing dataset, adhering to the same criterion of image size. (93k)
    Fake image for training: 150k;for testing:3k*21 = 63k
    在这里插入图片描述

  • FakeBench
    https://arxiv.org/abs/2404.13306
    Regarding the genuine images, we sample 3,000 images from ImageNet [76] and DIV2K dataset [77].
    在这里插入图片描述
    在这里插入图片描述

  • COCOXGEN
    https://github.com/heikeadel/cocoxgen
    COCOXGEN(COCO Extended With Generated Images), which consists of real photos from the COCO dataset as well as images generated with SDXL and Fooocus using prompts of two standardized lengths.
    在这里插入图片描述

  • WildRF
    https://github.com/barcavia/RealTime-DeepfakeDetection-in-the-RealWorld
    We propose to improve deepfake evaluation and align it with real-world settings by introducing WildRF, a realistic benchmark consisting of images sourced from popular social platforms. Specifically, we manually collected real images and fake images using keywords and hashtags associated with the suitable content. Our protocol is to train on one platform (e.g., Reddit) and test the detector on real and fake images from other unseen platforms (e.g., Twitter and Facebook).
    在这里插入图片描述

  • WildFake 未公开
    https://arxiv.org/pdf/2402.11843
    在这里插入图片描述
    在这里插入图片描述

  • LSUNDB
    https://github.com/jonasricker/diffusion-model-deepfake-detection
    The main dataset used in this work is hosted on Zenodo. In total, the dataset contains 50k samples (256x256) for each of the following generators trained on LSUN Bedroom, divided into train, validation, and test set (39k/1k/10k).

  • DIF
    https://sergo2020.github.io/DIF/
    在这里插入图片描述

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

_Meilinger_

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值