Rethinking FID: Towards a Better Evaluation Metric for Image Generation
Summary
The authors find that FID disagrees with human evaluation, fails to reflect whether a model is improving across training iterations, and does not properly capture image distortions.
They worry that these shortcomings will affect how we select good models: when the evaluation criterion itself is flawed, good ideas may end up being thrown away as garbage.
The authors experimentally demonstrate FID's shortcomings in the text-to-image generation setting and compare it against their proposed CMMD (CLIP-MMD) metric, showing that CMMD is a better metric for evaluating modern text-to-image models.
My take: on general image generation, CMMD may sometimes perform similarly to FID, but like KID it is computed with MMD, which does not require a large sample size; moreover, its CLIP embeddings are trained on a much larger dataset and likely carry richer semantic information, so there is reason to believe it beats Inception-embedding-based FID in many scenarios.
Abstract
Important drawbacks of FID:
- Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity.
- We empirically demonstrate that FID contradicts human raters, that it does not reflect gradual improvement of iterative text-to-image models, that it does not capture distortion levels, and that it produces inconsistent results when varying the sample size.
We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient.
Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality. (A metric proposed for text-to-image models; it uses CLIP to make up for the Inception network's weakness on image semantics. Whether it beats FID in every generation domain is unclear to me.)
MMD, on the other hand, is an unbiased estimator, and as we empirically demonstrate, it does not exhibit a strong dependency on sample size the way the Fréchet distance does.
Contributions
- We call for a reevaluation of FID as the evaluation metric for modern image generation and text-to-image models. We show that it does not agree with human raters in some important cases, that it does not reflect gradual improvement of iterative text-to-image models and that it does not capture obvious image distortions.
- We identify and analyze some shortcomings of the Fréchet distance and of Inception features, in the context of evaluation of image generation models.
- We propose CMMD, a distance that uses CLIP features with the MMD distance, as a more reliable and robust alternative, and show that it alleviates some of FID's major shortcomings.
Related work
Generated image quality has been assessed using a variety of metrics including log-likelihood [9], Inception Score (IS) [1, 24], Kernel Inception Distance (KID) [2, 27], Fréchet Inception Distance (FID) [13], perceptual path length [14], Gaussian Parzen window [9], and HYPE [29]. (Wow, that's a lot; I hadn't heard of the last two.)
Both FID and KID suffer from the limitations of the underlying Inception embeddings: they have been trained on only 1 million images, limited to 1000 classes.
Limitations of FID
- As we show in Section 3.3, Inception embeddings for typical image sets are far from being normally distributed. The implications of this inaccurate assumption when calculating the Fréchet distance are discussed in Section 3.2.
- Estimating (2048 × 2048)-dimensional covariance matrices from a small sample can lead to large errors, as discussed in Section 6.3 (see the toy sketch after this list).
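To make the second point concrete, here is a small NumPy sketch (my own toy illustration, not from the paper): even with tens of thousands of samples, an empirical 2048 × 2048 covariance estimate stays noticeably noisy, and with fewer samples than dimensions it is not even full rank.

```python
import numpy as np

# Toy illustration: error of an empirical (2048 x 2048) covariance estimate
# as the sample size grows. d matches the Inception embedding dimension;
# the isotropic ground truth and the sample sizes are arbitrary choices.
rng = np.random.default_rng(0)
d = 2048
true_cov = np.eye(d)

for n in (1_000, 5_000, 20_000):
    samples = rng.standard_normal((n, d))      # draws from N(0, I)
    est_cov = np.cov(samples, rowvar=False)    # empirical d x d covariance
    err = np.linalg.norm(est_cov - true_cov)   # Frobenius-norm estimation error
    rank = np.linalg.matrix_rank(est_cov)      # < d when n <= d (rank-deficient)
    print(f"n={n:>6}: Frobenius error = {err:.1f}, rank = {rank}/{d}")
```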
Implications of Wrong Normality Assumptions
These distributions all have the same mean and covariance as the reference normal distribution, which exposes the limitation of FID's assumption that the embeddings follow a normal distribution.
FID∞, the unbiased version of FID proposed in [5], also suffers from this shortcoming, since it also relies on the normality assumption.
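A minimal 1-D sketch of why this matters (my own toy example, loosely in the spirit of the paper's Figure 1 but not taken from it): a two-point mixture with the same mean and variance as a standard Gaussian has essentially zero Fréchet distance to it, while an MMD estimate with an RBF kernel still separates the two.

```python
import numpy as np

def frechet_1d(mu1, var1, mu2, var2):
    # Squared 1-D Fréchet distance between N(mu1, var1) and N(mu2, var2).
    return (mu1 - mu2) ** 2 + (np.sqrt(var1) - np.sqrt(var2)) ** 2

def mmd2_rbf_1d(x, y, sigma=1.0):
    # Unbiased MMD^2 estimate for 1-D samples with a Gaussian RBF kernel.
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    m, n = len(x), len(y)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2.0 * kxy.mean())

rng = np.random.default_rng(0)
n = 2_000
gauss = rng.standard_normal(n)                      # reference: N(0, 1)
mixture = np.where(rng.random(n) < 0.5, -1.0, 1.0)  # +/-1 coin flips: mean 0, variance 1

# Moment-matched distributions: the Fréchet distance between the fitted
# Gaussians is ~0, yet the unbiased MMD^2 still tells them apart.
fd2 = frechet_1d(gauss.mean(), gauss.var(), mixture.mean(), mixture.var())
mmd2 = mmd2_rbf_1d(gauss, mixture)
print(f"Frechet^2 ~ {fd2:.4f}   MMD^2 ~ {mmd2:.4f}")
```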
Incorrectness of the Normality Assumption
Figure 2 shows a 2-dimensional t-SNE [26] visualization (a method for visualizing the embedding space) of Inception embeddings of the COCO 30K dataset, commonly used as the reference (real) image set in text-to-image FID benchmarks. It is clear that the low-dimensional visualization has multiple modes, and therefore, it is also clear that the original, 2048-dimensional distribution is not close to a multivariate normal distribution.
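A minimal sketch of this kind of visualization with scikit-learn, assuming you have already extracted an (N, 2048) array of Inception features; the random array below is only a stand-in for those real embeddings:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in for real data: replace with the (N, 2048) matrix of Inception-v3
# pool features extracted from the reference image set (e.g. COCO 30K).
inception_embeddings = np.random.default_rng(0).standard_normal((2000, 2048))

# Project to 2-D with t-SNE and scatter-plot the result.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(inception_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=2, alpha=0.3)
plt.title("t-SNE of Inception embeddings (2-D)")
plt.show()
```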
Finally, we applied three different widely-accepted statistical tests: Mardia's skewness test, Mardia's kurtosis test, and the Henze-Zirkler test (sophisticated statistical tests?) to test normality of Inception embeddings of the COCO 30K dataset. All of them strongly refute the hypothesis that Inception embeddings come from a multivariate normal distribution, with p-values of virtually zero (indicating an overwhelming confidence in rejecting the null hypothesis of normality).
In fact, CLIP embeddings of COCO 30K also fail the normality tests with virtually zero p-values, indicating that it is not reasonable to assume normality on CLIP embeddings either.
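A hedged sketch of how such a check could be run in Python. I'm using the Henze-Zirkler implementation from the `pingouin` package as an example (the authors do not say which implementation they used), and a small random matrix as a stand-in for the real embedding matrix:

```python
import numpy as np
import pingouin as pg

# Stand-in for real data: replace with the (N, d) matrix of Inception or CLIP
# embeddings of COCO 30K (possibly subsampled, since the test gets expensive
# for very high d).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2000, 50))  # toy data, actually normal here

# Henze-Zirkler multivariate normality test: a tiny p-value rejects normality.
hz, pval, normal = pg.multivariate_normality(embeddings, alpha=0.05)
print(f"HZ statistic = {hz:.3f}, p-value = {pval:.3g}, normal = {normal}")
```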
CMMD Metric
The CMMD (stands for CLIP-MMD) metric is the squared MMD distance between CLIP embeddings of the reference (real) image set and the generated image set.
MMD was originally developed as a part of a two-sample statistical test to determine whether two samples come from the same distribution. For two probability distributions $P$ and $Q$ over $\mathbb{R}^d$, the MMD distance with respect to a positive definite kernel $k$ is defined by

$$dist^2_{MMD}(P,Q) := E_{x,x'}[k(x,x')] + E_{y,y'}[k(y,y')] - 2E_{x,y}[k(x,y)],$$

where $x$ and $x'$ are independently distributed according to $P$, and $y$ and $y'$ are independently distributed according to $Q$. It is known that MMD is a metric for characteristic kernels $k$.
Given two sets of vectors, $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, sampled from $P$ and $Q$, respectively, an unbiased estimator for $dist^2_{MMD}(P,Q)$ is given by

$$dist^2_{MMD}(X,Y) = \frac{1}{m(m-1)}\sum_{i=1}^{m}\sum_{j\neq i}^{m}k(x_i,x_j) + \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}^{n}k(y_i,y_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}k(x_i,y_j).$$

As the kernel in the MMD calculation, we use the Gaussian RBF kernel $k(x, y) = \exp(-\lVert x-y \rVert^2 / 2\sigma^2)$, which is a characteristic kernel, with the bandwidth parameter set to $\sigma = 10$.
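A minimal NumPy sketch of this unbiased estimator with the Gaussian RBF kernel above (my own implementation, not the authors' released code; the default bandwidth follows the paper's σ = 10):

```python
import numpy as np

def mmd2_unbiased_rbf(x, y, sigma=10.0):
    """Unbiased MMD^2 between samples x of shape (m, d) and y of shape (n, d)."""
    def rbf(a, b):
        # k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)), computed pairwise.
        sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                    + np.sum(b ** 2, axis=1)[None, :]
                    - 2.0 * a @ b.T)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    kxx, kyy, kxy = rbf(x, x), rbf(y, y), rbf(x, y)
    m, n = x.shape[0], y.shape[0]
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))  # excludes j == i terms
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    term_xy = 2.0 * kxy.sum() / (m * n)
    return term_xx + term_yy - term_xy

# Toy usage with random stand-ins for two sets of unit-normalized embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 768))
x /= np.linalg.norm(x, axis=1, keepdims=True)        # "real" embeddings
y = rng.standard_normal((500, 768)) + 0.05
y /= np.linalg.norm(y, axis=1, keepdims=True)        # "generated" embeddings
print(mmd2_unbiased_rbf(x, y))
```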
For the CLIP embedding model, we use the publicly-available ViT-L/14@336px model, which is the largest and the best-performing CLIP model [20]. Also note that we have $m = n$ in Eq. (4) for text-to-image evaluation, since we evaluate generated images against real images sharing the same captions/prompts.
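A hedged end-to-end sketch of computing CMMD with the Hugging Face `transformers` CLIP wrapper. The model id `openai/clip-vit-large-patch14-336` (for ViT-L/14@336px), the unit-normalization of the embeddings, and the batching are my assumptions; the paper's official pipeline may differ. It reuses `mmd2_unbiased_rbf` from the previous sketch.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed Hugging Face id for the ViT-L/14@336px CLIP model mentioned above.
MODEL_ID = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_image_embeddings(image_paths, batch_size=32):
    """Return unit-normalized CLIP image embeddings as an (N, 768) NumPy array."""
    feats = []
    for i in range(0, len(image_paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        emb = model.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize (common CLIP convention)
        feats.append(emb.cpu().numpy())
    return np.concatenate(feats, axis=0)

# `real_paths` and `generated_paths` are hypothetical lists of image files that
# share the same prompts, so m = n as noted above:
# cmmd = mmd2_unbiased_rbf(clip_image_embeddings(real_paths),
#                          clip_image_embeddings(generated_paths), sigma=10.0)
```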
Human Evaluation
To this end, we picked two models. Model-A: the full Muse model as described in [3], with 24 base-model iterations and 8 super-resolution model iterations. Model-B: an early-stopped Muse model with only 20 base-model iterations and 3 super-resolution model iterations.
Performance Comparison
Progressive Image Generation Models
Image Distortions
To this end, we take a set of images generated by Muse and progressively distort them by adding noise in the VQ-GAN latent space.
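A rough sketch of what such progressive distortion could look like in code; `vqgan.encode` / `vqgan.decode` are hypothetical placeholders for whichever VQ-GAN interface is available, and the noise levels are my own choice:

```python
import numpy as np

def progressively_distort(images, vqgan, noise_levels=(0.1, 0.2, 0.4, 0.8), seed=0):
    """Yield (noise_level, distorted_images) pairs by perturbing VQ-GAN latents.

    `vqgan` is a hypothetical object exposing encode(images) -> latents and
    decode(latents) -> images; substitute the actual model API you use.
    """
    rng = np.random.default_rng(seed)
    latents = vqgan.encode(images)
    for level in noise_levels:
        noisy = latents + level * rng.standard_normal(latents.shape)
        yield level, vqgan.decode(noisy)
```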
Sample Efficiency
In Figure 8 we illustrate this by evaluating a Stable Diffusion model at different sample sizes (number of images) sampled randomly from the COCO 30K dataset. Note that we need more than 20,000 images to reliably estimate FID, whereas CMMD provides consistent estimates even with small image sets.
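A sketch of the kind of sample-size sweep behind such a plot (my own loop; `real_embeddings`, `generated_embeddings`, and the reuse of `mmd2_unbiased_rbf` are assumptions, not the authors' code):

```python
import numpy as np

def metric_vs_sample_size(real_embeddings, generated_embeddings, metric_fn,
                          sizes=(1_000, 5_000, 10_000, 20_000, 30_000), seed=0):
    """Evaluate `metric_fn` on random, equally-sized subsets of both embedding sets."""
    rng = np.random.default_rng(seed)
    results = {}
    for n in sizes:
        idx_real = rng.choice(len(real_embeddings), size=n, replace=False)
        idx_gen = rng.choice(len(generated_embeddings), size=n, replace=False)
        results[n] = metric_fn(real_embeddings[idx_real], generated_embeddings[idx_gen])
    return results

# Example with hypothetical arrays:
# metric_vs_sample_size(real_clip, gen_clip, mmd2_unbiased_rbf)
```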
Computational Cost
Table 4 shows an empirical runtime comparison of computing FD and MMD on a set of size $n = 30{,}000$ with $d = 2048$ dimensional features, on a TPUv4 platform with a JAX implementation.
Related Papers
[26] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[3] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, and Dilip Krishnan. Muse: Text-to-image generation via masked generative transformers. ICML, 2023.
[5] Min Jin Chong and David A. Forsyth. Effectively unbiased FID and Inception Score and where to find them. CoRR, abs/1911.07023, 2019.
Words I didn't know
- hinge on: to depend on another factor or condition; one thing's existence, development, or outcome is influenced or controlled by another.
- monotonically: in a monotonic manner, i.e. monotonically increasing/decreasing. The word comes from the mathematical concept of monotonicity, describing a function or sequence that increases or decreases throughout its domain or interval without reversing direction; more broadly, it can describe any process that continues steadily without fluctuation.
- aesthetics: the study (or philosophy) of beauty.
- disastrous: catastrophic, terrible, or tragic; typically describes events, situations, or outcomes with severe, unfortunate, or destructive consequences, e.g. a disastrous natural disaster can cause massive casualties and property damage.
- isotropic: having the same properties in all directions.