Rethinking FID: Towards a Better Evaluation Metric for Image Generation
Summary
The authors find that FID disagrees with human evaluation, fails to reflect whether a model is improving across training iterations, and does not properly capture image distortions.
They worry that these shortcomings will affect how we select good models: when the evaluation criterion itself is flawed, good ideas may end up being thrown away as garbage.
The authors experimentally demonstrate FID's shortcomings in the text-to-image generation setting and compare it against their proposed CMMD (CLIP-MMD) metric, showing that CMMD is a better metric for evaluating modern text-to-image models.
My take: on general image generation, CMMD may sometimes perform similarly to FID, but like KID it is computed with MMD, which does not require a large sample size; moreover, its CLIP embeddings are trained on a much larger dataset and likely carry richer semantic information, so there is reason to believe it beats Inception-embedding-based FID in many scenarios.
Abstract
Important drawbacks of FID:
- Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity.
- We empirically demonstrate that FID contradicts human raters, that it does not reflect gradual improvement of iterative text-to-image models, that it does not capture distortion levels, and that it produces inconsistent results when varying the sample size.
We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient.
Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality. (A metric proposed for text-to-image models; it uses CLIP to make up for the Inception network's weakness on image semantics. Whether it beats FID in every generation domain is unclear to me.)
MMD, on the other hand, is an unbiased estimator, and as we empirically demonstrate, it does not exhibit a strong dependency on sample size the way the Fréchet distance does.
Contributions
- We call for a reevaluation of FID as the evaluation metric for modern image generation and text-to-image models. We show that it does not agree with human raters in some important cases, that it does not reflect gradual improvement of iterative text-to-image models and that it does not capture obvious image distortions.
- We identify and analyze some shortcomings of the Fréchet distance and of Inception features, in the context of evaluation of image generation models.
- We propose CMMD, a distance that uses CLIP features with the MMD distance, as a more reliable and robust alternative, and show that it alleviates some of FID's major shortcomings.
Related work
Generated image quality has been assessed using a variety of metrics including log-likelihood [9], Inception Score (IS) [1, 24], Kernel Inception Distance (KID) [2, 27], Fréchet Inception Distance (FID) [13], perceptual path length [14], Gaussian Parzen window [9], and HYPE [29]. (Wow, that's a lot; I hadn't heard of the last two.)
Both FID and KID suffer from the limitations of the underlying Inception embeddings: they have been trained on only 1 million images, limited to 1000 classes.
Limitations of FID
- As we show in Section 3.3, Inception embeddings for typical image sets are far from being normally distributed. The implications of this inaccurate assumption when calculating the Fréchet distance are discussed in Section 3.2.
- Estimating (2048 × 2048)-dimensional covariance matrices from a small sample can lead to large errors, as discussed in Section 6.3 (see the toy sketch after this list).
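To make the second point concrete, here is a small NumPy sketch (my own toy illustration, not from the paper): even with tens of thousands of samples, an empirical 2048 × 2048 covariance estimate stays noticeably noisy, and with fewer samples than dimensions it is not even full rank.

```python
import numpy as np

# Toy illustration: error of an empirical (2048 x 2048) covariance estimate
# as the sample size grows. d matches the Inception embedding dimension;
# the isotropic ground truth and the sample sizes are arbitrary choices.
rng = np.random.default_rng(0)
d = 2048
true_cov = np.eye(d)

for n in (1_000, 5_000, 20_000):
    samples = rng.standard_normal((n, d))      # draws from N(0, I)
    est_cov = np.cov(samples, rowvar=False)    # empirical d x d covariance
    err = np.linalg.norm(est_cov - true_cov)   # Frobenius-norm estimation error
    rank = np.linalg.matrix_rank(est_cov)      # < d when n <= d (rank-deficient)
    print(f"n={n:>6}: Frobenius error = {err:.1f}, rank = {rank}/{d}")
```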
Implications of Wrong Normality Assumptions
These distributions all have the same mean and covariance as the reference normal distribution, which exposes the limitation of FID's assumption that the embeddings follow a normal distribution.
FID∞, the unbiased version of FID proposed in [5], also suffers from this shortcoming, since it also relies on the normality assumption.
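A minimal 1-D sketch of why this matters (my own toy example, loosely in the spirit of the paper's Figure 1 but not taken from it): a two-point mixture with the same mean and variance as a standard Gaussian has essentially zero Fréchet distance to it, while an MMD estimate with an RBF kernel still separates the two.

```python
import numpy as np

def frechet_1d(mu1, var1, mu2, var2):
    # Squared 1-D Fréchet distance between N(mu1, var1) and N(mu2, var2).
    return (mu1 - mu2) ** 2 + (np.sqrt(var1) - np.sqrt(var2)) ** 2

def mmd2_rbf_1d(x, y, sigma=1.0):
    # Unbiased MMD^2 estimate for 1-D samples with a Gaussian RBF kernel.
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    m, n = len(x), len(y)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2.0 * kxy.mean())

rng = np.random.default_rng(0)
n = 2_000
gauss = rng.standard_normal(n)                      # reference: N(0, 1)
mixture = np.where(rng.random(n) < 0.5, -1.0, 1.0)  # +/-1 coin flips: mean 0, variance 1

# Moment-matched distributions: the Fréchet distance between the fitted
# Gaussians is ~0, yet the unbiased MMD^2 still tells them apart.
fd2 = frechet_1d(gauss.mean(), gauss.var(), mixture.mean(), mixture.var())
mmd2 = mmd2_rbf_1d(gauss, mixture)
print(f"Frechet^2 ~ {fd2:.4f}   MMD^2 ~ {mmd2:.4f}")
```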
Incorrectness of the Normality Assumption
Figure 2 shows a 2-dimensional t-SNE [26] visualization (a method for visualizing the embedding space) of Inception embeddings of the COCO 30K dataset, commonly used as the reference (real) image set in text-to-image FID benchmarks. It is clear that the low-dimensional visualization has multiple modes, and therefore, it is also clear that the original, 2048-dimensional distribution is not close to a multivariate normal distribution.
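A minimal sketch of this kind of visualization with scikit-learn, assuming you have already extracted an (N, 2048) array of Inception features; the random array below is only a stand-in for those real embeddings:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in for real data: replace with the (N, 2048) matrix of Inception-v3
# pool features extracted from the reference image set (e.g. COCO 30K).
inception_embeddings = np.random.default_rng(0).standard_normal((2000, 2048))

# Project to 2-D with t-SNE and scatter-plot the result.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(inception_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=2, alpha=0.3)
plt.title("t-SNE of Inception embeddings (2-D)")
plt.show()
```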
Finally, we applied three different widely-accepted statistical tests: Mardia's skewness test, Mardia's kurtosis test, and the Henze-Zirkler test (sophisticated statistical tests?) to test normality of Inception embeddings of the COCO 30K dataset. All of them strongly refute the hypothesis that Inception embeddings come from a multivariate normal distribution, with p-values of virtually zero (indicating an overwhelming confidence in rejecting the null hypothesis of normality).
In fact, CLIP embeddings of COCO 30K also fail the normality tests with virtually zero p-values, indicating that it is not reasonable to assume normality on CLIP embeddings either.
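A hedged sketch of how such a check could be run in Python. I'm using the Henze-Zirkler implementation from the `pingouin` package as an example (the authors do not say which implementation they used), and a small random matrix as a stand-in for the real embedding matrix:

```python
import numpy as np
import pingouin as pg

# Stand-in for real data: replace with the (N, d) matrix of Inception or CLIP
# embeddings of COCO 30K (possibly subsampled, since the test gets expensive
# for very high d).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2000, 50))  # toy data, actually normal here

# Henze-Zirkler multivariate normality test: a tiny p-value rejects normality.
hz, pval, normal = pg.multivariate_normality(embeddings, alpha=0.05)
print(f"HZ statistic = {hz:.3f}, p-value = {pval:.3g}, normal = {normal}")
```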
CMMD Metric
The CMMD (stands for CLIP-MMD) metric is the squared MMD distance between CLIP embeddings of the reference (real) image set and the generated image set.
MMD was originally developed as a part of a two-sample statistical test to determine whether two samples come from the same distribution. For two probability distributions $P$ and $Q$ over $\mathbb{R}^d$, the MMD distance with respect to a positive definite kernel $k$ is defined by

$$dist^2_{MMD}(P,Q) := E_{x,x'}[k(x,x')] + E_{y,y'}[k(y,y')] - 2E_{x,y}[k(x,y)],$$

where $x$ and $x'$ are independently distributed according to $P$, and $y$ and $y'$ are independently distributed according to $Q$. It is known that MMD is a metric for characteristic kernels $k$.
Given two sets of vectors, $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, sampled from $P$ and $Q$, respectively, an unbiased estimator for $dist^2_{MMD}(P,Q)$ is given by

$$dist^2_{MMD}(X,Y) = \frac{1}{m(m-1)}\sum_{i=1}^{m}\sum_{j\neq i}^{m}k(x_i,x_j) + \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}^{n}k(y_i,y_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}k(x_i,y_j).$$

As the kernel in the MMD calculation, we use the Gaussian RBF kernel $k(x, y) = \exp(-\lVert x-y \rVert^2 / 2\sigma^2)$, which is a characteristic kernel, with the bandwidth parameter set to $\sigma = 10$.
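A minimal NumPy sketch of this unbiased estimator with the Gaussian RBF kernel above (my own implementation, not the authors' released code; the default bandwidth follows the paper's σ = 10):

```python
import numpy as np

def mmd2_unbiased_rbf(x, y, sigma=10.0):
    """Unbiased MMD^2 between samples x of shape (m, d) and y of shape (n, d)."""
    def rbf(a, b):
        # k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)), computed pairwise.
        sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                    + np.sum(b ** 2, axis=1)[None, :]
                    - 2.0 * a @ b.T)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    kxx, kyy, kxy = rbf(x, x), rbf(y, y), rbf(x, y)
    m, n = x.shape[0], y.shape[0]
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))  # excludes j == i terms
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    term_xy = 2.0 * kxy.sum() / (m * n)
    return term_xx + term_yy - term_xy

# Toy usage with random stand-ins for two sets of unit-normalized embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 768))
x /= np.linalg.norm(x, axis=1, keepdims=True)        # "real" embeddings
y = rng.standard_normal((500, 768)) + 0.05
y /= np.linalg.norm(y, axis=1, keepdims=True)        # "generated" embeddings
print(mmd2_unbiased_rbf(x, y))
```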
For the CLIP embedding model, we use the publicly-available ViT-L/14@336px model, which is the largest and the best-performing CLIP model [20]. Also note that we have $m = n$ in Eq. (4) for text-to-image evaluation, since we evaluate generated images against real images sharing the same captions/prompts.
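A hedged end-to-end sketch of computing CMMD with the Hugging Face `transformers` CLIP wrapper. The model id `openai/clip-vit-large-patch14-336` (for ViT-L/14@336px), the unit-normalization of the embeddings, and the batching are my assumptions; the paper's official pipeline may differ. It reuses `mmd2_unbiased_rbf` from the previous sketch.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed Hugging Face id for the ViT-L/14@336px CLIP model mentioned above.
MODEL_ID = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_image_embeddings(image_paths, batch_size=32):
    """Return unit-normalized CLIP image embeddings as an (N, 768) NumPy array."""
    feats = []
    for i in range(0, len(image_paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        emb = model.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize (common CLIP convention)
        feats.append(emb.cpu().numpy())
    return np.concatenate(feats, axis=0)

# `real_paths` and `generated_paths` are hypothetical lists of image files that
# share the same prompts, so m = n as noted above:
# cmmd = mmd2_unbiased_rbf(clip_image_embeddings(real_paths),
#                          clip_image_embeddings(generated_paths), sigma=10.0)
```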
Human Evaluation
To this end, we picked two models. Model-A: the full Muse model as described in [3], with 24 base-model iterations and 8 super-resolution model iterations. Model-B: an early-stopped Muse model with only 20 base-model iterations and 3 super-resolution model iterations.
Performance Comparison
Progressive Image Generation Models
Image Distortions
To this end, we take a set of images generated by Muse and progressively distort them by adding noise in the VQ-GAN latent space.
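A rough sketch of what such progressive distortion could look like in code; `vqgan.encode` / `vqgan.decode` are hypothetical placeholders for whichever VQ-GAN interface is available, and the noise levels are my own choice:

```python
import numpy as np

def progressively_distort(images, vqgan, noise_levels=(0.1, 0.2, 0.4, 0.8), seed=0):
    """Yield (noise_level, distorted_images) pairs by perturbing VQ-GAN latents.

    `vqgan` is a hypothetical object exposing encode(images) -> latents and
    decode(latents) -> images; substitute the actual model API you use.
    """
    rng = np.random.default_rng(seed)
    latents = vqgan.encode(images)
    for level in noise_levels:
        noisy = latents + level * rng.standard_normal(latents.shape)
        yield level, vqgan.decode(noisy)
```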
Sample Efficiency
In Figure 8 we illustrate this by evaluating a Stable Diffusion model at different sample sizes (number of images) sampled randomly from the COCO 30K dataset. Note that we need more than 20,000 images to reliably estimate FID, whereas CMMD provides consistent estimates even with small image sets.
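A sketch of the kind of sample-size sweep behind such a plot (my own loop; `real_embeddings`, `generated_embeddings`, and the reuse of `mmd2_unbiased_rbf` are assumptions, not the authors' code):

```python
import numpy as np

def metric_vs_sample_size(real_embeddings, generated_embeddings, metric_fn,
                          sizes=(1_000, 5_000, 10_000, 20_000, 30_000), seed=0):
    """Evaluate `metric_fn` on random, equally-sized subsets of both embedding sets."""
    rng = np.random.default_rng(seed)
    results = {}
    for n in sizes:
        idx_real = rng.choice(len(real_embeddings), size=n, replace=False)
        idx_gen = rng.choice(len(generated_embeddings), size=n, replace=False)
        results[n] = metric_fn(real_embeddings[idx_real], generated_embeddings[idx_gen])
    return results

# Example with hypothetical arrays:
# metric_vs_sample_size(real_clip, gen_clip, mmd2_unbiased_rbf)
```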
Computational Cost
Table 4 shows an empirical runtime comparison of computing FD and MMD on a set of size $n = 30{,}000$ with $d = 2048$ dimensional features, on a TPUv4 platform with a JAX implementation.
Related Papers
[26] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[3] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, and Dilip Krishnan. Muse: Text-to-image generation via masked generative transformers. ICML, 2023.
[5] Min Jin Chong and David A. Forsyth. Effectively unbiased FID and Inception Score and where to find them. CoRR, abs/1911.07023, 2019.
Words I didn't know
- hinge on: to depend on another factor or condition; one thing's existence, development, or outcome is influenced or controlled by another.
- monotonically: in a monotonic manner, i.e. monotonically increasing/decreasing. The word comes from the mathematical concept of monotonicity, describing a function or sequence that increases or decreases throughout its domain or interval without reversing direction; more broadly, it can describe any process that continues steadily without fluctuation.
- aesthetics: the study (or philosophy) of beauty.
- disastrous: catastrophic, terrible, or tragic; typically describes events, situations, or outcomes with severe, unfortunate, or destructive consequences, e.g. a disastrous natural disaster can cause massive casualties and property damage.
- isotropic: having the same properties in all directions.