Lab midterm report: Self-Attention Generative Adversarial Networks
Paper Title: Self-Attention Generative Adversarial Networks
Paper Authors: Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena
Year: 2018
Student: Liang Yinglin-16340132
Description of the problem
Image synthesis is an important problem in computer vision, and Generative Adversarial Networks (GANs) have brought remarkable progress. However, while convolutional GANs excel at synthesizing image classes with few structural constraints, they often fail to capture geometric or structural patterns that occur consistently in some classes. One likely reason is that the convolution operator has a local receptive field, so long-range dependencies can only be processed after passing through several convolutional layers.
A related disadvantage is that, after training on large datasets containing many types of images, GANs cannot clearly distinguish image categories and have difficulty capturing the structure, texture and details of each one. This makes it hard to use a single GAN to generate a large number of high-quality images across different categories.
Increasing the size of the convolution kernel enlarges the receptive field and the representational capacity of the network, but it does so at the cost of the computational and statistical efficiency that local convolutions provide.
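The trade-off above can be made concrete with a little arithmetic. The sketch below assumes plain stride-1 convolutions without dilation (the networks in this report use stride 2, so this is only an illustration), and the function names are hypothetical.

```python
import math


def receptive_field(num_layers: int, kernel_size: int) -> int:
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def layers_needed(image_size: int, kernel_size: int) -> int:
    """Smallest number of stride-1 layers whose receptive field spans image_size pixels."""
    return math.ceil((image_size - 1) / (kernel_size - 1))


def weights_per_filter(kernel_size: int) -> int:
    """Number of weights in one 2-D filter, per input channel."""
    return kernel_size * kernel_size


if __name__ == "__main__":
    for k in (3, 5, 7):
        print(k, layers_needed(64, k), weights_per_filter(k))
```

Under these assumptions, relating two opposite edges of a 64-pixel image takes 32 layers of 3x3 kernels or 11 layers of 7x7 kernels, and each 7x7 filter carries 49 weights per input channel instead of 9.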
Introduction of the method
The authors propose Self-Attention Generative Adversarial Networks (SAGANs), which introduce a self-attention mechanism into convolutional GANs.
The self-attention module is complementary to convolutions and helps with modeling long-range, multi-level dependencies across image regions. Armed with self-attention, the generator can draw images in which fine details at every location are carefully coordinated with fine details in distant portions of the image. Moreover, the discriminator can more accurately enforce complicated geometric constraints on the global image structure.
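The following is a minimal NumPy sketch of such a self-attention block on a single feature map, not the code used in this experiment. It assumes that the 1x1 convolutions can be written as channel-wise matrix products with biases omitted, and that the block output is `gamma * attention_output + x` with a learnable scalar `gamma`. The names `self_attention`, `w_q`, `w_k` and `w_v` are illustrative.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v, gamma):
    """Self-attention over all spatial positions of one feature map.

    x:        (C, H, W) feature map
    w_q, w_k: (C_qk, C) weights of the 1x1 query/key convolutions (bias omitted)
    w_v:      (C, C) weights of the 1x1 value convolution (bias omitted)
    gamma:    scalar weight of the attention branch
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)       # (C, N) with N = H * W positions
    q = w_q @ flat                   # (C_qk, N)
    k = w_k @ flat                   # (C_qk, N)
    v = w_v @ flat                   # (C, N)
    energy = q.T @ k                 # (N, N), energy[i, j] = q_i . k_j
    attn = softmax(energy, axis=-1)  # row i: weights of position i over all positions j
    out = v @ attn.T                 # (C, N), out[:, i] = sum_j attn[i, j] * v[:, j]
    return gamma * out.reshape(c, h, w) + x, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 4, 4))
    w_q = rng.standard_normal((2, 16))
    w_k = rng.standard_normal((2, 16))
    w_v = rng.standard_normal((16, 16))
    y, attn = self_attention(x, w_q, w_k, w_v, gamma=0.0)
    print(y.shape, attn.shape)
```

With `gamma = 0` the block is an identity mapping, so the network can start from purely local convolutional features and gradually give weight to non-local evidence as `gamma` is learned.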
In addition to self-attention, they propose enforcing good conditioning of GAN generators using the spectral normalization technique, which had previously been applied only to the discriminator.
As a result, SAGAN significantly outperforms the state of the art in image synthesis on the ImageNet dataset, boosting the best reported Inception score from 36.8 to 52.52 and reducing the Fréchet Inception distance from 27.62 to 18.65.
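Spectral normalization divides a weight matrix by an estimate of its largest singular value, usually obtained by power iteration. Below is a minimal NumPy sketch under the assumption that a convolution kernel has already been reshaped to a 2-D matrix; `spectral_normalize` is a hypothetical helper, not the `SpectralNorm` module used in this experiment.

```python
import numpy as np


def spectral_normalize(w, u, n_iters=1, eps=1e-12):
    """Return (w / sigma, updated u), where sigma estimates the top singular value of w.

    w: (out_dim, in_dim) weight matrix (a conv kernel reshaped to 2-D)
    u: (out_dim,) vector kept between training steps so the estimate improves over time
    n_iters: number of power-iteration steps, must be at least 1
    """
    for _ in range(n_iters):
        v = w.T @ u
        v = v / (np.linalg.norm(v) + eps)
        u = w @ v
        u = u / (np.linalg.norm(u) + eps)
    sigma = u @ w @ v  # approximates the largest singular value of w
    return w / sigma, u


if __name__ == "__main__":
    w = np.diag([3.0, 1.0, 0.5])
    w_sn, u = spectral_normalize(w, np.ones(3), n_iters=50)
    print(np.linalg.svd(w_sn, compute_uv=False))
```

In training code a single iteration per step is typically enough, because `u` is reused across steps while the weights change slowly.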
Preliminary results of the experiment
The structure of the SAGAN model
The generator has five layers and two self-attention layers. Spectral normalization wraps the weights of every transposed convolution except the last one:
- Layer one
  - x = SpectralNorm(ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(1, 1)))(x)
  - x = BatchNorm2d(512)(x)
  - x = ReLU()(x)
- Layer two
  - x = SpectralNorm(ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)))(x)
  - x = BatchNorm2d(256)(x)
  - x = ReLU()(x)
- Layer three
  - x = SpectralNorm(ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)))(x)
  - x = BatchNorm2d(128)(x)
  - x = ReLU()(x)
  - x = Self_Attn(128)(x)
- Layer four
  - x = SpectralNorm(ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)))(x)
  - x = BatchNorm2d(64)(x)
  - x = ReLU()(x)
  - x = Self_Attn(64)(x)
- Layer five
  - x = ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))(x)
  - x = Tanh()(x)
Generator(
  (l1): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(1, 1))
    )
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l2): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l3): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l4): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (last): Sequential(
    (0): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): Tanh()
  )
  (attn1): Self_Attn(
    (query_conv): Conv2d(128, 16, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(128, 16, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
  (attn2): Self_Attn(
    (query_conv): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
)
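As a sanity check on the generator listing, the spatial size after each transposed convolution can be traced with the standard output-size formula (dilation 1, no output padding). The sketch assumes the latent vector enters as a 128-channel 1x1 map; the 128 channels come from the first layer, while the 1x1 spatial size is an assumption. `GENERATOR_LAYERS` and `trace_generator` are illustrative names.

```python
def conv_transpose2d_out(size, kernel, stride=1, padding=0):
    """Output height/width of a ConvTranspose2d with dilation 1 and no output_padding."""
    return (size - 1) * stride - 2 * padding + kernel


# (name, out_channels, kernel, stride, padding), copied from the generator listing
GENERATOR_LAYERS = [
    ("l1", 512, 4, 1, 0),
    ("l2", 256, 4, 2, 1),
    ("l3", 128, 4, 2, 1),
    ("l4", 64, 4, 2, 1),
    ("last", 3, 4, 2, 1),
]


def trace_generator(size=1):
    """Return (name, channels, spatial_size) after every generator layer."""
    shapes = []
    for name, channels, kernel, stride, padding in GENERATOR_LAYERS:
        size = conv_transpose2d_out(size, kernel, stride, padding)
        shapes.append((name, channels, size))
    return shapes


if __name__ == "__main__":
    for name, channels, size in trace_generator():
        print(f"{name}: {channels} x {size} x {size}")
```

Under this assumption the spatial size goes 4, 8, 16, 32, 64, so the output is a 3x64x64 image. The two attention blocks then operate on 16x16 and 32x32 maps, that is, on 256 and 1024 positions.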
The discriminator also has five layers and two self-attention layers. Spectral normalization wraps the weights of every convolution except the last one:
- Layer one
  - x = SpectralNorm(Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)))(x)
  - x = LeakyReLU(negative_slope=0.1)(x)
- Layer two
  - x = SpectralNorm(Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)))(x)
  - x = LeakyReLU(negative_slope=0.1)(x)
- Layer three
  - x = SpectralNorm(Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)))(x)
  - x = LeakyReLU(negative_slope=0.1)(x)
  - x = Self_Attn(256)(x)
- Layer four
  - x = SpectralNorm(Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)))(x)
  - x = LeakyReLU(negative_slope=0.1)(x)
  - x = Self_Attn(512)(x)
- Layer five
  - x = Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1))(x)
Discriminator(
  (l1): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l2): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l3): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l4): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (last): Sequential(
    (0): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1))
  )
  (attn1): Self_Attn(
    (query_conv): Conv2d(256, 32, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(256, 32, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
  (attn2): Self_Attn(
    (query_conv): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
)
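The same kind of check works for the discriminator, using the standard Conv2d output-size formula (dilation 1). The sketch assumes a 64x64 input image, as set by imsize in this experiment; `DISCRIMINATOR_LAYERS` and `trace_discriminator` are illustrative names.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output height/width of a Conv2d with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, out_channels, kernel, stride, padding), copied from the discriminator listing
DISCRIMINATOR_LAYERS = [
    ("l1", 64, 4, 2, 1),
    ("l2", 128, 4, 2, 1),
    ("l3", 256, 4, 2, 1),
    ("l4", 512, 4, 2, 1),
    ("last", 1, 4, 1, 0),
]


def trace_discriminator(size=64):
    """Return (name, channels, spatial_size) after every discriminator layer."""
    shapes = []
    for name, channels, kernel, stride, padding in DISCRIMINATOR_LAYERS:
        size = conv2d_out(size, kernel, stride, padding)
        shapes.append((name, channels, size))
    return shapes


if __name__ == "__main__":
    for name, channels, size in trace_discriminator():
        print(f"{name}: {channels} x {size} x {size}")
```

The spatial size goes 32, 16, 8, 4, 1, so each image is reduced to a single score. The attention blocks in the discriminator see 8x8 and 4x4 maps, that is, 64 and 16 positions, far fewer than in the generator.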
The hyperparameters of the SAGAN model
batch_size = 64
g_lr = 0.0001
d_lr = 0.0004
lr_decay = 0.95
imsize = 64
total_step = 100000
optimizer = 'Adam'
beta1 = 0.0
beta2 = 0.9
The training set
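import torchvision.datasets as dsets
from torchvision import transforms

# Method of the data-loading class; self.imsize and self.path are set elsewhere.
def load_lsun(self, classes='church_outdoor_train'):
    # Resize to imsize x imsize, convert to a tensor in [0, 1],
    # then shift and scale every channel to [-1, 1].
    lsun_transforms = transforms.Compose([
        transforms.Resize((self.imsize, self.imsize)),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    dataset = dsets.LSUN(self.path, classes=[classes], transform=lsun_transforms)
    return dataset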
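The Normalize step with mean 0.5 and standard deviation 0.5 maps pixel values from [0, 1] to [-1, 1], which is the range of the generator's Tanh output. The per-value arithmetic, and the inverse that would be needed before saving generated images, can be sketched as follows. The helper names are hypothetical, and the clamp in `denormalize` is an added safety step.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Map a pixel value in [0, 1] to [-1, 1], as Normalize(0.5, 0.5) does per channel."""
    return (pixel - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Map a generator output in [-1, 1] back to [0, 1], clamped to the valid range."""
    return min(1.0, max(0.0, value * std + mean))


if __name__ == "__main__":
    print(normalize(0.0), normalize(0.5), normalize(1.0))
    print(denormalize(-1.0), denormalize(1.0))
```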
The ground truth
The generated photos
After 1000 steps:
Elapsed [0:09:07.832233], G_step [1000/100000], D_step[1000/100000], d_out_real: 1.1565, ave_gamma_l3: -0.0323, ave_gamma_l4: -0.0486
After 10000 steps:
Elapsed [0:45:39.616833], G_step [10000/100000], D_step[10000/100000], d_out_real: 0.7750, ave_gamma_l3: -0.1495, ave_gamma_l4: -0.2459
After 35000 steps:
Elapsed [2:32:44.314908], G_step [35000/100000], D_step[35000/100000], d_out_real: 0.2414, ave_gamma_l3: -0.2588, ave_gamma_l4: -0.3762
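To follow how d_out_real and the two gamma values evolve, log lines like the ones above can be parsed into numbers. This is a small sketch that assumes every line keeps exactly this format; `LOG_PATTERN` and `parse_log_line` are hypothetical names.

```python
import re

# Matches the step counter and the three logged quantities of one training log line.
LOG_PATTERN = re.compile(
    r"G_step \[(?P<step>\d+)/\d+\].*?"
    r"d_out_real: (?P<d_out_real>-?\d+\.\d+), "
    r"ave_gamma_l3: (?P<gamma_l3>-?\d+\.\d+), "
    r"ave_gamma_l4: (?P<gamma_l4>-?\d+\.\d+)"
)


def parse_log_line(line):
    """Return a dict with step, d_out_real, gamma_l3, gamma_l4, or None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "d_out_real": float(match["d_out_real"]),
        "gamma_l3": float(match["gamma_l3"]),
        "gamma_l4": float(match["gamma_l4"]),
    }


if __name__ == "__main__":
    line = (
        "Elapsed [0:09:07.832233], G_step [1000/100000], D_step[1000/100000], "
        "d_out_real: 1.1565, ave_gamma_l3: -0.0323, ave_gamma_l4: -0.0486"
    )
    print(parse_log_line(line))
```

In the three lines logged so far, the magnitude of both gamma values grows with the step count, which suggests that the attention branches contribute more as training proceeds.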
The planned work
- Compare spectral normalization with other normalization methods in this experiment.
- Use the two-timescale update rule (TTUR) to compensate for slow learning in a regularized discriminator, which makes it possible to use fewer discriminator steps per generator step.
- Evaluate the effect of the self-attention module on the experimental results.
- Tune the hyperparameters and continue training the model.