Figure 3. Training and validation accuracy w.r.t. training iterations for our DINO [11] based discriminator vs. the baseline StyleGAN2-ADA discriminator on the FFHQ 1k dataset. Our discriminator based on pretrained features has higher accuracy on validation real images and thus shows better generalization. In the above training, the vision-aided adversarial loss is added at the 2M iteration.
Figure 2. Performance on LSUN CAT and LSUN CHURCH. We compare with the leading methods StyleGAN2-ADA [41] and DiffAugment [109] across different training-set sizes and the full dataset. Our method outperforms them by a large margin, especially in the limited-sample setting. For LSUN CAT, we achieve an FID similar to StyleGAN2 [44] trained on the full dataset while using only 0.7% of the data.
Figure 1. The model bank F consists of widely used and state-of-the-art pretrained networks. We automatically select the subset of F that best distinguishes between the real and fake distributions. Our training procedure creates an ensemble of the original discriminator D and discriminators built on the feature spaces of the selected off-the-shelf models. Each vision-aided discriminator is a shallow trainable network over the frozen pretrained features.
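The ensemble in Figure 1 can be sketched as a set of shallow trainable heads placed on frozen off-the-shelf backbones, combined with the original discriminator. The sketch below is illustrative only: the class name, the two-layer head, and the `ensemble_logits` helper are our assumptions, not the exact architecture used in the paper.

```python
import torch.nn as nn

class VisionAidedDiscriminator(nn.Module):
    """Shallow trainable head on top of a frozen off-the-shelf backbone
    (e.g. a DINO ViT). Only the head is updated during GAN training."""

    def __init__(self, backbone, feat_dim, hidden_dim=256):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            # Freeze pretrained weights; gradients still flow through to
            # the input image, so the generator can be trained against it.
            p.requires_grad_(False)
        # Small trainable real/fake classifier over the frozen features.
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, img):
        return self.head(self.backbone(img))


def ensemble_logits(original_d, vision_ds, img):
    """Logits from the original discriminator D and every vision-aided
    discriminator; the adversarial loss is accumulated over all of them."""
    return [original_d(img)] + [d(img) for d in vision_ds]
```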
Figure 4. Model selection using linear probing of pretrained features. We show the correlation of FID with the accuracy of a logistic linear model trained for real vs. fake classification over the features of off-the-shelf models. The top dotted line is the FID of the StyleGAN2-ADA generator used during model selection, from which we finetune with our proposed vision-aided adversarial loss. A similar analysis for LSUN CAT is shown in Figure 12 in the appendix.
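The linear probe in Figure 4 amounts to fitting a logistic regression on precomputed real and generated features and reading off held-out accuracy. A minimal sketch, assuming features are already extracted as NumPy arrays (the function name and the train/validation split are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(real_feats, fake_feats, train_frac=0.8, seed=0):
    """Fit a logistic linear classifier to separate real from generated
    images in a pretrained feature space and return held-out accuracy.
    Higher accuracy means the feature space better distinguishes the two
    distributions, making that model a stronger ensemble candidate."""
    X = np.concatenate([real_feats, fake_feats])
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    split = int(train_frac * len(X))
    train, val = idx[:split], idx[split:]
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    return clf.score(X[val], y[val])

# Rank candidate pretrained models by probe accuracy (features precomputed):
# scores = {name: linear_probe_accuracy(real[name], fake[name]) for name in bank}
```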
Table 1. FFHQ and LSUN results with training-set sizes varying from 1k to 10k samples. FID↓ is measured with the complete dataset as the reference distribution. We select the best snapshot according to the training-set FID and report the mean of 3 FID evaluations. In Ours (w/ ADA) we finetune the StyleGAN2-ADA model, and in Ours (w/ DiffAugment) we finetune the model trained with DiffAugment while using the corresponding policy for augmentation. Our method works with both the ADA and DiffAugment strategies for augmenting the images input to the discriminators.
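The evaluation protocol behind Table 1 can be summarized as: pick the snapshot with the lowest FID against the training set, then report the mean of three FID evaluations against the complete dataset. A small sketch, where `fid_fn(snapshot, reference)` is a hypothetical helper that generates samples from a snapshot and computes FID against the named reference set:

```python
import numpy as np

def evaluate_snapshots(snapshots, fid_fn, n_repeats=3):
    """Select the snapshot with the lowest training-set FID, then report the
    mean of `n_repeats` FID evaluations against the complete dataset."""
    best = min(snapshots, key=lambda s: fid_fn(s, 'train'))
    scores = [fid_fn(best, 'full') for _ in range(n_repeats)]
    return best, float(np.mean(scores))
```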
Figure 8. Qualitative comparison of our method with StyleGAN2-ADA on AFHQ. Left: randomly generated samples from both methods. Right: for both our model and StyleGAN2-ADA, we independently generate 5k samples and find the worst-case samples relative to the real image distribution. We first fit a Gaussian model in the Inception [86] feature space of real images, then calculate the log-likelihood of each generated sample under this Gaussian prior and show the images with the minimum log-likelihood (maximum Mahalanobis distance).
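The worst-case selection in Figure 8 reduces to fitting a Gaussian on real Inception features and ranking generated samples by Mahalanobis distance to it. A minimal NumPy sketch, assuming the features are precomputed (the function name and the use of a pseudo-inverse are our choices):

```python
import numpy as np

def worst_case_indices(real_feats, gen_feats, k=8):
    """Fit a Gaussian to the Inception features of real images and rank
    generated samples by squared Mahalanobis distance; the k largest
    distances (lowest log-likelihood under the Gaussian) are the
    worst-case samples."""
    mu = real_feats.mean(axis=0)
    cov = np.cov(real_feats, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    diff = gen_feats - mu
    # Per-sample squared Mahalanobis distance: diff[i] @ cov_inv @ diff[i].
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.argsort(d2)[-k:][::-1]
```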
Figure 6. Linear probe accuracy of off-the-shelf models during our K-progressive ensemble training on FFHQ 1k. For the StyleGAN2-ADA model, ViT (DINO) has the highest accuracy and is selected first, followed by ViT (CLIP) and then Swin-T (MoBY). As we train with vision-aided discriminators, the linear probe accuracy decreases for most of the pretrained models.
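The K-progressive procedure in Figure 6 greedily adds one off-the-shelf model at a time, always picking the one whose features best separate real from fake under the current generator. A schematic sketch, where `probe_fn` and `train_fn` are hypothetical stand-ins for the linear-probing and GAN-training loops:

```python
def k_progressive_selection(model_bank, probe_fn, train_fn, k=3):
    """Greedy K-progressive selection: probe all remaining off-the-shelf
    models, add the one with the highest real-vs-fake probe accuracy to the
    discriminator ensemble, resume training, and repeat k times.
    `probe_fn(name)` returns linear-probe accuracy for the current generator;
    `train_fn(selected)` continues training with the selected models."""
    remaining = set(model_bank)
    selected = []
    for _ in range(k):
        best = max(remaining, key=probe_fn)  # highest probe accuracy wins
        selected.append(best)
        remaining.remove(best)
        train_fn(selected)                   # train until FID plateaus
    return selected
```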
Table 6. Additional ablation studies evaluated on the FID↓ metric. Adding a second discriminator during training (either frozen with random weights or trainable) or running standard adversarial training for more iterations yields only marginal FID improvements. Thus, the improvement comes from the ensemble of the original and vision-aided discriminators. ✗ indicates that FID increased to twice the baseline value, and we therefore stopped that training run.