Self-Attention Generative Adversarial Networks (SAGAN): Paper Model Reproduction

Lab midterm report: Self-Attention Generative Adversarial Networks

Paper Title: Self-Attention Generative Adversarial Networks

Paper Authors: Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena

Year: 2018

Student: Liang Yinglin-16340132

Description of the problem

Image synthesis is an important problem in computer vision, and Generative Adversarial Networks (GANs) have driven remarkable progress on it. However, convolutional GANs excel at synthesizing image classes with few structural constraints but fail to capture geometric or structural patterns that occur consistently in some classes. Because the convolution operator has a local receptive field, long-range dependencies can only be processed after passing through several convolutional layers.

One disadvantage of GANs is that, after training on large datasets containing many image categories, they cannot clearly distinguish the categories and struggle to capture the structure, texture, and details of the images. As a result, a single GAN has difficulty generating large numbers of high-quality images across different categories.

On the other hand, although increasing the convolution kernel size enlarges the receptive field and lets a layer capture more of these dependencies, it does so at the expense of computational efficiency.

Introduction of the method

The authors propose Self-Attention Generative Adversarial Networks (SAGANs), which introduce a self-attention mechanism into convolutional GANs.

The self-attention module is complementary to convolutions and helps model long-range, multi-level dependencies across image regions. Armed with self-attention, the generator can draw images in which fine details at every location are carefully coordinated with fine details in distant portions of the image. The discriminator, in turn, can more accurately enforce complicated geometric constraints on the global image structure.
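The computation can be sketched in NumPy for a single feature map. This is a minimal illustration rather than the report's code: it assumes the layout suggested by the printed `Self_Attn` modules (1x1 query, key, and value projections, with query and key reduced to C/8 channels) and a learnable scalar `gamma` initialized to 0. A 1x1 convolution is written as a per-position matrix product with the bias omitted, and the function name `self_attention` and the random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, gamma):
    """x: (C, H, W) feature map; w_q, w_k: (C // 8, C); w_v: (C, C)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)          # (C, N) with N = H * W positions
    q = w_q @ flat                      # (C // 8, N)
    k = w_k @ flat                      # (C // 8, N)
    v = w_v @ flat                      # (C, N)
    energy = q.T @ k                    # (N, N): energy[i, j] = q_i . k_j
    attn = softmax(energy, axis=-1)     # row i: how much position i attends to each j
    out = v @ attn.T                    # (C, N): out[:, i] = sum_j attn[i, j] * v[:, j]
    # Residual connection scaled by gamma: with gamma = 0 the layer is the identity.
    return gamma * out.reshape(c, h, w) + x, attn

rng = np.random.default_rng(0)
c, h, w = 16, 4, 4
x = rng.standard_normal((c, h, w))
w_q = 0.1 * rng.standard_normal((c // 8, c))
w_k = 0.1 * rng.standard_normal((c // 8, c))
w_v = 0.1 * rng.standard_normal((c, c))

y, attn = self_attention(x, w_q, w_k, w_v, gamma=0.0)
print(attn.shape)  # (16, 16)
```

Because every position attends to every other position, the attention map has N x N entries, which is why the module is usually placed on intermediate feature maps rather than at full resolution.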

In addition to self-attention, the authors propose enforcing good conditioning of the GAN generator using spectral normalization, a technique that had previously been applied only to the discriminator.
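Spectral normalization divides a layer's weight by an estimate of its largest singular value, usually obtained by power iteration. The NumPy sketch below illustrates that idea under stated assumptions: the function name `spectral_normalize`, the toy 1x1 kernel with a known spectrum, and the iteration count are all chosen for illustration and are not taken from the report's `SpectralNorm` wrapper.

```python
import numpy as np

def spectral_normalize(weight, u, n_iters=1, eps=1e-12):
    """Return (weight / sigma, updated u), with sigma estimated by power iteration."""
    w_mat = weight.reshape(weight.shape[0], -1)   # flatten the kernel to (out, in * kh * kw)
    for _ in range(n_iters):
        v = w_mat.T @ u
        v = v / (np.linalg.norm(v) + eps)
        u = w_mat @ v
        u = u / (np.linalg.norm(u) + eps)
    sigma = u @ w_mat @ v                         # estimate of the largest singular value
    return weight / sigma, u

# Toy 1x1 convolution kernel of shape (out=3, in=3, 1, 1) with singular values 4, 2, 1.
weight = np.zeros((3, 3, 1, 1))
weight[0, 0, 0, 0] = 4.0
weight[1, 1, 0, 0] = 2.0
weight[2, 2, 0, 0] = 1.0
u0 = np.ones(3) / np.sqrt(3.0)

w_sn, u = spectral_normalize(weight, u0, n_iters=20)
top_singular_value = np.linalg.svd(w_sn.reshape(3, -1), compute_uv=False)[0]
print(top_singular_value)  # close to 1.0
```

In training, implementations typically run a single power-iteration step per forward pass and carry `u` over between steps, which is why the sketch returns the updated vector.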

As a result, SAGAN significantly outperforms the prior state of the art in image synthesis on ImageNet, boosting the best reported Inception score from 36.8 to 52.52 and reducing the Fréchet Inception distance from 27.62 to 18.65.

Preliminary results of the experiment

The structure of the SAGAN model

The generator consists of five layers and two self-attention layers:

  • Layer one
    • x = ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(1, 1))
    • x = SpectralNorm(x)
    • x = BatchNorm2d(512)
    • x = ReLU(x)
  • Layer two
    • x = ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = BatchNorm2d(256)
    • x = ReLU(x)
  • Layer three
    • x = ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = BatchNorm2d(128)
    • x = ReLU(x)
  • Self_Attn(128)
  • Layer four
    • x = ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = BatchNorm2d(64)
    • x = ReLU(x)
  • Self_Attn(64)
  • Layer five
    • x = ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = Tanh()
Generator(
  (l1): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(1, 1))
    )
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l2): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l3): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (l4): Sequential(
    (0): SpectralNorm(
      (module): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (last): Sequential(
    (0): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): Tanh()
  )
  (attn1): Self_Attn(
    (query_conv): Conv2d(128, 16, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(128, 16, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
  (attn2): Self_Attn(
    (query_conv): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(64, 8, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
)
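The spatial sizes implied by the printed generator can be checked with the standard `ConvTranspose2d` output-size formula (dilation 1, no output padding). The sketch assumes the 128-dimensional latent vector enters the first layer as a 128 x 1 x 1 tensor, which the report does not state explicitly; the helper name is hypothetical.

```python
def conv_transpose_out(size, kernel, stride, padding=0):
    """Output size of ConvTranspose2d with dilation=1 and output_padding=0."""
    return (size - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) for l1, l2, l3, l4, last, copied from the printed Generator.
generator_layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1  # assumption: latent vector reshaped to 128 x 1 x 1
g_sizes = []
for kernel, stride, padding in generator_layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    g_sizes.append(size)

print(g_sizes)  # [4, 8, 16, 32, 64]
```

Under that assumption the two self-attention layers operate on 16 x 16 and 32 x 32 feature maps, and the final output matches `imsize = 64`.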

The discriminator also consists of five layers and two self-attention layers:

  • Layer one
    • x = Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = LeakyReLU(negative_slope=0.1)
  • Layer two
    • x = Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = LeakyReLU(negative_slope=0.1)
  • Layer three
    • x = Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = LeakyReLU(negative_slope=0.1)
  • Self_Attn(256)
  • Layer four
    • x = Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    • x = SpectralNorm(x)
    • x = LeakyReLU(negative_slope=0.1)
  • Self_Attn(512)
  • Layer five
    • x = Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1))
Discriminator(
  (l1): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l2): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l3): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (l4): Sequential(
    (0): SpectralNorm(
      (module): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
    (1): LeakyReLU(negative_slope=0.1)
  )
  (last): Sequential(
    (0): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1))
  )
  (attn1): Self_Attn(
    (query_conv): Conv2d(256, 32, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(256, 32, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
  (attn2): Self_Attn(
    (query_conv): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
    (key_conv): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
    (value_conv): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
    (softmax): Softmax()
  )
)
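The discriminator mirrors the generator, and its feature-map sizes follow from the standard `Conv2d` output-size formula (dilation 1). The sketch below traces a 64 x 64 input through the printed layers and derives how many positions each self-attention layer covers; the helper and variable names are hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output size of Conv2d with dilation=1."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) for l1, l2, l3, l4, last, copied from the printed Discriminator.
discriminator_layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]

size = 64  # imsize from the hyperparameter list
d_sizes = []
for kernel, stride, padding in discriminator_layers:
    size = conv_out(size, kernel, stride, padding)
    d_sizes.append(size)

print(d_sizes)  # [32, 16, 8, 4, 1]

# Self-attention follows l3 and l4, so each attention map covers N = H * W positions.
attn_positions = [d_sizes[2] ** 2, d_sizes[3] ** 2]
print(attn_positions)  # [64, 16]
```

The last layer reduces the 4 x 4 map to a single score per image.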
The hyperparameters of the SAGAN model
batch_size = 64
g_lr = 0.0001
d_lr = 0.0004
lr_decay = 0.95
imsize = 64
total_step = 100000
optimizer = 'Adam'
beta1 = 0.0
beta2 = 0.9
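One way to keep these settings together is a small configuration object. The sketch below is only an illustration: `TrainConfig` and `adam_kwargs` are hypothetical names, and since the report does not say how `lr_decay` is scheduled, that value is stored but not used. The discriminator learning rate is four times the generator's, which is the kind of setting a two-timescale update rule relies on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 64
    g_lr: float = 0.0001     # generator learning rate
    d_lr: float = 0.0004     # discriminator learning rate (4x the generator's)
    lr_decay: float = 0.95   # schedule not stated in the report, so unused here
    imsize: int = 64
    total_step: int = 100000
    beta1: float = 0.0
    beta2: float = 0.9

def adam_kwargs(cfg, network):
    """Hypothetical helper: Adam keyword arguments for one of the two networks."""
    if network not in ("generator", "discriminator"):
        raise ValueError(f"unknown network: {network}")
    lr = cfg.g_lr if network == "generator" else cfg.d_lr
    return {"lr": lr, "betas": (cfg.beta1, cfg.beta2)}

cfg = TrainConfig()
g_opt_kwargs = adam_kwargs(cfg, "generator")
d_opt_kwargs = adam_kwargs(cfg, "discriminator")
# With PyTorch these could be passed on as, for example,
# torch.optim.Adam(generator.parameters(), **g_opt_kwargs).
print(g_opt_kwargs, d_opt_kwargs)
```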
The training set
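# Imports assumed by this method (common torchvision aliases).
import torchvision.datasets as dsets
from torchvision import transforms

# Method of the data-loader class; self.path is passed as the LSUN root
# directory and self.imsize is the target resolution (64 here).
def load_lsun(self, classes='church_outdoor_train'):
    lsun_transforms = transforms.Compose([
        transforms.Resize((self.imsize, self.imsize)),  # resize to imsize x imsize
        transforms.ToTensor(),                          # scale pixels to [0, 1]
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # shift to [-1, 1]
    ])

    dataset = dsets.LSUN(self.path, classes=[classes], transform=lsun_transforms)
    return dataset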
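The `Normalize` step with mean 0.5 and standard deviation 0.5 maps pixel values from [0, 1] to [-1, 1], the same range the generator's final `Tanh` produces. A plain-Python sketch of that mapping and its inverse (useful when saving generated samples) follows; the function names are hypothetical.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Map a pixel value in [0, 1] to [-1, 1], as Normalize(0.5, 0.5) does per channel."""
    return (pixel - mean) / std

def denormalize(value, mean=0.5, std=0.5):
    """Inverse mapping, from the Tanh range [-1, 1] back to [0, 1]."""
    return value * std + mean

print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
print(denormalize(-1.0), denormalize(1.0))             # 0.0 1.0
```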

The ground truth

[Figure: ground-truth images from the training set]

The generated images

After 1000 steps:

Elapsed [0:09:07.832233], G_step [1000/100000], D_step[1000/100000], d_out_real: 1.1565,  ave_gamma_l3: -0.0323, ave_gamma_l4: -0.0486

[Figure: generated samples after 1000 steps]

After 10000 steps:

Elapsed [0:45:39.616833], G_step [10000/100000], D_step[10000/100000], d_out_real: 0.7750,  ave_gamma_l3: -0.1495, ave_gamma_l4: -0.2459

[Figure: generated samples after 10000 steps]

After 35000 steps:

Elapsed [2:32:44.314908], G_step [35000/100000], D_step[35000/100000], d_out_real: 0.2414,  ave_gamma_l3: -0.2588, ave_gamma_l4: -0.3762

[Figure: generated samples after 35000 steps]

The planned work

  • Compare spectral normalization with other normalization techniques in this experiment.
  • Use the two-timescale update rule (TTUR) specifically to compensate for slow learning in a regularized discriminator, making it possible to use fewer discriminator steps per generator step.
  • Verify the effect of the self-attention module on the experimental results.
  • Tune the hyperparameters used to train the model.