1、Human evaluation
1)AMT perceptual studies
Turkers are presented with a series of trials that pit a "real" image against a "fake" image generated by the algorithm, and asked to pick the one they believe is real
2)Mean Opinion Score (MOS) testing
raters assign an integral score from 1 (bad quality) to 5 (excellent quality); the MOS is the mean of these ratings
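A minimal sketch of the MOS computation, using a hypothetical matrix of raw scores:

```python
import numpy as np

# Hypothetical raw scores: each row is one rater, each column one image.
ratings = np.array([[5, 3, 4],
                    [4, 3, 5],
                    [5, 2, 4]])

# MOS per image: average the 1-5 scores over all raters.
mos = ratings.mean(axis=0)
```

In practice MOS studies also report the number of raters and often a confidence interval, since individual ratings are noisy.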
2、Metrics based on pretrained networks
1)Inception
a)Inception score (IS)
-> how well a model captures the full ImageNet class distribution
-> produce individual samples that are convincing examples of a single class
-> do not reward covering the whole distribution or capturing diversity within a class; models which memorize a small subset of the full dataset will still achieve a high IS
-> measure fidelity
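A minimal numpy sketch of the IS formula, exp(E_x[KL(p(y|x) || p(y))]); the `probs` array is a stand-in for real Inception-v3 softmax outputs, and the two synthetic arrays below only illustrate the behavior:

```python
import numpy as np

def inception_score(probs, n_splits=2, eps=1e-12):
    # probs: (N, C) softmax class probabilities p(y|x) from a classifier.
    # IS = exp( mean over x of KL( p(y|x) || p(y) ) ), averaged over splits.
    scores = []
    for part in np.array_split(probs, n_splits):
        p_y = part.mean(axis=0, keepdims=True)            # marginal p(y)
        kl = part * (np.log(part + eps) - np.log(p_y + eps))
        scores.append(np.exp(kl.sum(axis=1).mean()))
    return float(np.mean(scores))

# Synthetic stand-ins: confident, class-diverse predictions vs. uniform ones.
rng = np.random.default_rng(0)
sharp = np.eye(10)[rng.integers(0, 10, size=200)]   # confident one-hot samples
uniform = np.full((200, 10), 0.1)                   # uninformative samples
```

Confident predictions spread across many classes score high; uniform (uninformative) predictions give an IS of 1, the minimum.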
b)Fréchet Inception Distance (FID) score
-> symmetric measure of the distance between two image distributions in the Inception-V3 latent space
-> more consistent with human judgement than IS
-> a reliable FID for inpainting is usually computed with more than 1000 images
-> captures both diversity and fidelity
c)sFID score
-> use spatial features rather than the standard pooled features
-> better captures spatial relationships, rewarding image distributions with coherent high-level structure
2)FCN
a)FCN score (image-to-image generation)
Train a semantic segmentation network (an FCN) on real images. The FCN predicts a label map for a generated photo, and this label map is then compared against the input ground-truth labels using standard semantic segmentation metrics (per-pixel accuracy, per-class accuracy, class IoU)
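A minimal sketch of the comparison step: build a confusion matrix between the ground-truth and predicted label maps, then derive per-pixel accuracy and mean IoU (the label maps and class count below are hypothetical):

```python
import numpy as np

def segmentation_scores(pred, gt, n_classes):
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    hist = np.bincount(n_classes * gt.ravel() + pred.ravel(),
                       minlength=n_classes ** 2).reshape(n_classes, n_classes)
    pixel_acc = np.diag(hist).sum() / hist.sum()
    # IoU per class: intersection / union, averaged over classes that appear.
    union = hist.sum(axis=0) + hist.sum(axis=1) - np.diag(hist)
    iou = np.diag(hist) / np.maximum(union, 1)
    mean_iou = iou[union > 0].mean()
    return float(pixel_acc), float(mean_iou)

# Hypothetical 2x3 label maps with 3 classes; one pixel is mislabeled.
gt = np.array([[0, 0, 1],
               [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 2]])
```

A perfect prediction gives pixel accuracy and mean IoU of 1.0; the single wrong pixel above lowers both.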
3、diversity score (e.g. average pairwise distance between samples generated for the same input)
4、LPIPS (Learned Perceptual Image Patch Similarity): distance between deep features of two images; lower means more perceptually similar
5、votes
6、Improved Precision and Recall metrics
-> Precision: fidelity. fraction of model samples which fall into the data manifold
-> Recall: diversity. fraction of data samples which fall into the sample manifold
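A minimal sketch of the improved precision/recall estimator: each manifold is approximated by the union of hyperspheres around each point, with radius given by the distance to that point's k-th nearest neighbor (the feature arrays are stand-ins for real embedding outputs):

```python
import numpy as np

def knn_radii(feats, k=3):
    # Distance from each point to its k-th nearest neighbor within the set
    # (column 0 of the sorted distances is the zero self-distance).
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]

def precision_recall(real, fake, k=3):
    r_rad = knn_radii(real, k)
    f_rad = knn_radii(fake, k)
    d = np.linalg.norm(fake[:, None] - real[None, :], axis=-1)   # (F, R)
    # Precision: fraction of fake samples inside some real hypersphere.
    precision = float((d <= r_rad[None, :]).any(axis=1).mean())
    # Recall: fraction of real samples inside some fake hypersphere.
    recall = float((d.T <= f_rad[None, :]).any(axis=1).mean())
    return precision, recall

rng = np.random.default_rng(1)
real = rng.normal(size=(100, 4))    # stand-in "real" features
same = real.copy()                  # identical samples: both metrics are 1
far = real + 100.0                  # disjoint samples: both metrics are 0
```

A mode-collapsed model would score high precision but low recall; a model producing diverse but unrealistic samples shows the opposite pattern.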