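Preface
In "Fine-tuning a StyleGAN2 model (1): building the dataset", we finally got the dataset ready, only to find that my own GPU (an RTX 2080 with 8 GB of VRAM) ran out of memory, and no amount of parameter tweaking would get training to run. (I later found out it actually can run; I just hadn't set the parameters correctly.)
Since I wasn't sure it really was a VRAM problem, I tried every keyword I could think of on Baidu and could not find a single note or blog post about training StyleGAN2 (as of February 9, 2020). It seems nobody in China has done this and written it up yet.
I then dug through Google and Twitter and still didn't find the answer I wanted, but I did notice that everyone was running their experiments on Google Colab.
Figuring that, if all else failed, I could just pay for some compute, I took a closer look at Google Colab.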
Google Colab
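Google Colab is a cloud service from Google, built around Jupyter Notebook and aimed at developers working in data science and AI. It needs no setup, offers free GPU access, and makes sharing easy.
What I cared about most was the hardware, and whether I would have to pay.
According to https://colab.research.google.com/signup , Google offers free users:
- K80 GPU with 16 GB of VRAM
- 12 GB of RAM
- 60 GB of disk, with Google Drive mountable directly
A notebook can run for at most 12 hours, and there is an idle timeout.
16 GB of VRAM should be plenty.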
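The figures above come from the signup page, and what a session actually gets can differ, so it may be worth checking from inside the notebook. Below is a minimal sketch using only the standard library; the helper names `gib` and `runtime_resources` are my own, and it assumes a Linux runtime (which is what Colab provides).

```python
import os
import shutil


def gib(num_bytes):
    """Convert a byte count to GiB, rounded to one decimal place."""
    return round(num_bytes / 2**30, 1)


def runtime_resources(path="/"):
    """Return total RAM and disk figures (in GiB) for the current machine."""
    # os.sysconf with these names is available on Linux runtimes.
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk = shutil.disk_usage(path)
    return {
        "ram_total_gib": gib(ram_bytes),
        "disk_total_gib": gib(disk.total),
        "disk_free_gib": gib(disk.free),
    }


print(runtime_resources())
```

GPU details are not covered here; `nvidia-smi` is the simpler tool for that.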
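Setting up the environment
My original plan was to zip up my code, put it on Google Drive, mount the drive, and unzip it there.
But someone more experienced pointed me to a better way:
https://github.com/pbaylies/stylegan2
This repository not only adds many handy features but is also tuned specifically for Google Colab. Deployment is essentially one step: basically all you have to do is git clone.
Once in Google Colab, create a new notebook, then click Edit -> Notebook settings in the menu bar
and set the hardware accelerator to GPU.
Then mount Google Drive, which can be done from the GUI in the left sidebar.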
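Mounting can also be done from a cell instead of the sidebar. A minimal sketch, assuming the usual `/content/drive` mount point; the wrapper `mount_drive` is my own helper, added so the cell does nothing harmful when run outside Colab.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True on success."""
    try:
        from google.colab import drive  # only importable inside a Colab runtime
    except ImportError:
        print("Not running in Colab; skipping Drive mount.")
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True


mounted = mount_drive()
```

After a successful mount, the Drive contents appear under `/content/drive/My Drive`, which is the path used in the commands that follow.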
Back in the main view, first check the configuration:
%tensorflow_version 1.x
import tensorflow as tf
print('Tensorflow version: {}'.format(tf.__version__) )
!nvidia-smi -L
print('GPU Identified at: {}'.format(tf.test.gpu_device_name()))
Tensorflow version: 1.15.0
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-99f2b332-d720-4647-7f48-6504f2445523)
GPU Identified at: /device:GPU:0
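Then git clone the repository mentioned above. I recommend putting it on Google Drive, because the runtime's local disk appears to be wiped every time the environment is reset.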
import os
os.chdir("/content/drive/My Drive")
!git clone https://github.com/pbaylies/stylegan2.git
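# The working directory is still My Drive at this point, so point git at the cloned repo with -C
!git -C stylegan2 checkout .
!git -C stylegan2 checkout 90d548243b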
Unzip the image archive uploaded to Google Drive and build the dataset:
!unzip ./data_clean_4.zip
os.chdir("/content/drive/My Drive/stylegan2")
!python dataset_tool.py create_from_images_raw ./dataset/lex ./../data_clean_4
!rm -r ./../data_clean_4
Put the model to be fine-tuned in place:
!mkdir ./models
!cp ./../2020-01-11-skylion-stylegan2-animeportraits-networksnapshot-024664.pkl ./models
Running
Run the training script. Each run creates its own subdirectory under the stylegan2/results folder on Google Drive.
!python run_training.py --num-gpus=1 --data-dir=./dataset --config=config-f --dataset=lex --mirror-augment=true --metric=none --total-kimg=10000 --result-dir="/content/drive/My Drive/stylegan2/results" --resume-pkl=./models/2020-01-11-skylion-stylegan2-animeportraits-networksnapshot-024664.pkl
...
Building TensorFlow graph...
Initializing logs...
Training for 10000 kimg...
tick 0 kimg 0.1 lod 0.00 minibatch 32 time 58s sec/tick 57.7 sec/kimg 450.61 maintenance 0.0 gpumem 8.4
^C
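It stopped after training 0.1 kimg, with no error message at all. Very mysterious.
After some searching, the likely cause turned out to be insufficient RAM. There is also a way to get more RAM:
d = []
while True:
    d.append('1')
Sit back and wait for the OOM. A dialog then pops up asking whether this notebook needs more memory.
Yes!
My environment was then reset, and RAM went up to 25 GB.
Because the environment was reset, anything on the local disk (not on Google Drive) is wiped, and of course so is everything in memory.
Re-enter the commands and run again.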
import os
os.chdir("/content/drive/My Drive/stylegan2")
!python run_training.py --num-gpus=1 --data-dir=./dataset --config=config-f --dataset=lex --mirror-augment=true --metric=none --total-kimg=10000 --result-dir="/content/drive/My Drive/stylegan2/results" --resume-pkl=./models/2020-01-11-skylion-stylegan2-animeportraits-networksnapshot-024664.pkl
...
Building TensorFlow graph...
Initializing logs...
Training for 10000 kimg...
tick 0 kimg 0.8 lod 0.00 minibatch 192 time 4m 07s sec/tick 246.7 sec/kimg 321.16 maintenance 0.0 gpumem 7.4
tick 1 kimg 6.9 lod 0.00 minibatch 192 time 33m 12s sec/tick 1723.7 sec/kimg 280.54 maintenance 21.3 gpumem 7.4
tick 2 kimg 13.1 lod 0.00 minibatch 192 time 1h 02m 01s sec/tick 1721.7 sec/kimg 280.23 maintenance 7.5 gpumem 7.4
tick 3 kimg 19.2 lod 0.00 minibatch 192 time 1h 30m 48s sec/tick 1720.0 sec/kimg 279.94 maintenance 7.1 gpumem 7.4
tick 4 kimg 25.3 lod 0.00 minibatch 192 time 1h 59m 34s sec/tick 1719.1 sec/kimg 279.80 maintenance 6.8 gpumem 7.4
tick 5 kimg 31.5 lod 0.00 minibatch 192 time 2h 28m 22s sec/tick 1721.1 sec/kimg 280.13 maintenance 6.9 gpumem 7.4
tick 6 kimg 37.6 lod 0.00 minibatch 192 time 2h 57m 09s sec/tick 1719.9 sec/kimg 279.93 maintenance 7.1 gpumem 7.4
tick 7 kimg 43.8 lod 0.00 minibatch 192 time 3h 25m 57s sec/tick 1721.3 sec/kimg 280.17 maintenance 7.0 gpumem 7.4
tick 8 kimg 49.9 lod 0.00 minibatch 192 time 3h 54m 44s sec/tick 1719.6 sec/kimg 279.88 maintenance 6.9 gpumem 7.4
tick 9 kimg 56.1 lod 0.00 minibatch 192 time 4h 23m 30s sec/tick 1719.9 sec/kimg 279.93 maintenance 7.0 gpumem 7.4
tick 10 kimg 62.2 lod 0.00 minibatch 192 time 4h 52m 20s sec/tick 1722.6 sec/kimg 280.37 maintenance 7.1 gpumem 7.4
tick 11 kimg 68.4 lod 0.00 minibatch 192 time 5h 21m 10s sec/tick 1722.8 sec/kimg 280.41 maintenance 7.3 gpumem 7.4
tick 12 kimg 74.5 lod 0.00 minibatch 192 time 5h 49m 57s sec/tick 1718.9 sec/kimg 279.76 maintenance 7.5 gpumem 7.4
tick 13 kimg 80.6 lod 0.00 minibatch 192 time 6h 18m 45s sec/tick 1720.7 sec/kimg 280.07 maintenance 7.1 gpumem 7.4
tick 14 kimg 86.8 lod 0.00 minibatch 192 time 6h 47m 36s sec/tick 1724.4 sec/kimg 280.66 maintenance 7.5 gpumem 7.4
tick 15 kimg 92.9 lod 0.00 minibatch 192 time 7h 16m 28s sec/tick 1724.3 sec/kimg 280.65 maintenance 7.3 gpumem 7.4
tick 16 kimg 99.1 lod 0.00 minibatch 192 time 7h 45m 20s sec/tick 1724.6 sec/kimg 280.69 maintenance 7.3 gpumem 7.4
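When I woke up and went to check on the training results, I opened Google Drive and saw:
14.8 GB of 15 GB used
Training had run for 16 ticks, but only the snapshots up to tick 6 had been saved.
I had worked out beforehand that there was enough space, but afterwards I created and deleted a lot of files, and the deleted files went to the trash (even though I had used rm -rf). The trash counts toward the quota, so emptying it fixed the problem.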
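To keep the 15 GB quota from filling up again, old snapshots can be pruned between sessions. The sketch below is my own helper, not part of the repository; it assumes snapshot files are named `network-snapshot-NNNNNN.pkl`, as in the log output. Files deleted through the mounted drive still end up in the Drive trash, so the trash has to be emptied as well.

```python
import os
import re

SNAPSHOT_RE = re.compile(r"^network-snapshot-(\d+)\.pkl$")


def prune_snapshots(run_dir, keep=3):
    """Delete all but the `keep` newest snapshots in run_dir.

    Returns the names of the deleted files, oldest first.
    """
    snapshots = []
    for name in os.listdir(run_dir):
        match = SNAPSHOT_RE.match(name)
        if match:
            snapshots.append((int(match.group(1)), name))
    snapshots.sort()  # ascending by the kimg number in the file name

    num_to_delete = max(len(snapshots) - keep, 0)
    deleted = []
    for _, name in snapshots[:num_to_delete]:
        os.remove(os.path.join(run_dir, name))
        deleted.append(name)
    return deleted


# Example call (hypothetical run directory):
# prune_snapshots("/content/drive/My Drive/stylegan2/results/00002-stylegan2-lex-1gpu-config-f")
```

Only files matching the snapshot pattern are touched; logs and sample images in the run directory are left alone.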
Restart Colab, then resume from the last saved .pkl (note the changed argument: --resume-pkl=latest):
import os
os.chdir("/content/drive/My Drive/stylegan2")
!python run_training.py --num-gpus=1 --data-dir=./dataset --config=config-f --dataset=lex --mirror-augment=true --metric=none --total-kimg=10000 --result-dir="/content/drive/My Drive/stylegan2/results" --resume-pkl=latest
Local submit - run_dir: /content/drive/My Drive/stylegan2/results/00003-stylegan2-lex-1gpu-config-f
dnnlib: Running training.training_loop.training_loop() on localhost...
Streaming data using training.dataset.TFRecordDataset...
Dataset shape = [3, 512, 512]
Dynamic range = [0, 255]
Label size = 0
Loading networks from "/content/drive/My Drive/stylegan2/results/00002-stylegan2-lex-1gpu-config-f/network-snapshot-000037.pkl"...
...
Building TensorFlow graph...
Initializing logs...
Training for 10000 kimg...
tick 0 kimg 37.8 lod 0.00 minibatch 192 time 4m 04s sec/tick 244.3 sec/kimg 318.14 maintenance 0.0 gpumem 7.5
tick 1 kimg 43.8 lod 0.00 minibatch 192 time 4m 07s sec/tick 246.7 sec/kimg 321.28 maintenance 0.0 gpumem 7.4
tick 2 kimg 49.9 lod 0.00 minibatch 192 time 33m 14s sec/tick 1726.9 sec/kimg 281.07 maintenance 19.9 gpumem 7.4
tick 3 kimg 56.1 lod 0.00 minibatch 192 time 1h 02m 05s sec/tick 1725.2 sec/kimg 280.79 maintenance 6.6 gpumem 7.4
tick 4 kimg 62.2 lod 0.00 minibatch 192 time 1h 31m 02s sec/tick 1729.7 sec/kimg 281.53 maintenance 6.9 gpumem 7.4
tick 5 kimg 68.3 lod 0.00 minibatch 192 time 1h 59m 59s sec/tick 1730.3 sec/kimg 281.62 maintenance 7.0 gpumem 7.4
tick 6 kimg 74.5 lod 0.00 minibatch 192 time 2h 28m 57s sec/tick 1731.0 sec/kimg 281.73 maintenance 7.0 gpumem 7.4
tick 7 kimg 80.6 lod 0.00 minibatch 192 time 2h 57m 55s sec/tick 1730.4 sec/kimg 281.64 maintenance 7.0 gpumem 7.4
tick 8 kimg 86.8 lod 0.00 minibatch 192 time 3h 26m 53s sec/tick 1731.2 sec/kimg 281.77 maintenance 6.9 gpumem 7.4
tick 9 kimg 92.9 lod 0.00 minibatch 192 time 3h 55m 45s sec/tick 1725.6 sec/kimg 280.86 maintenance 6.9 gpumem 7.4
tick 10 kimg 99.1 lod 0.00 minibatch 192 time 4h 24m 42s sec/tick 1729.7 sec/kimg 281.53 maintenance 6.9 gpumem 7.4
tick 11 kimg 105.2 lod 0.00 minibatch 192 time 4h 53m 37s sec/tick 1728.1 sec/kimg 281.26 maintenance 6.9 gpumem 7.4
tick 12 kimg 111.4 lod 0.00 minibatch 192 time 5h 22m 34s sec/tick 1730.2 sec/kimg 281.61 maintenance 7.1 gpumem 7.4
tick 13 kimg 117.5 lod 0.00 minibatch 192 time 5h 51m 28s sec/tick 1727.0 sec/kimg 281.09 maintenance 7.2 gpumem 7.4
tick 14 kimg 123.6 lod 0.00 minibatch 192 time 6h 20m 23s sec/tick 1727.5 sec/kimg 281.17 maintenance 7.0 gpumem 7.4
tick 15 kimg 129.8 lod 0.00 minibatch 192 time 6h 49m 22s sec/tick 1732.4 sec/kimg 281.97 maintenance 7.1 gpumem 7.4
tick 16 kimg 135.9 lod 0.00 minibatch 192 time 7h 18m 22s sec/tick 1732.1 sec/kimg 281.92 maintenance 7.3 gpumem 7.4
tick 17 kimg 142.1 lod 0.00 minibatch 192 time 7h 47m 20s sec/tick 1731.3 sec/kimg 281.78 maintenance 7.3 gpumem 7.4
tick 18 kimg 148.2 lod 0.00 minibatch 192 time 8h 16m 18s sec/tick 1730.8 sec/kimg 281.71 maintenance 7.1 gpumem 7.4
...
Although a new run directory 00003 was created and the tick count starts from 0 again, training did resume from the last saved .pkl (37.6 kimg).
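The log shows run 00003 loading network-snapshot-000037.pkl from run 00002. I have not read how the repository resolves `latest`, so the following is only a sketch of one way such a lookup could work: walk the run directories from the highest number down and take the newest snapshot in the first one that has any. The function `find_latest_snapshot` and both regular expressions are hypothetical.

```python
import os
import re

RUN_DIR_RE = re.compile(r"^(\d+)-")
SNAPSHOT_RE = re.compile(r"^network-snapshot-(\d+)\.pkl$")


def find_latest_snapshot(results_dir):
    """Return (path, kimg) for the newest snapshot in the highest-numbered
    run directory that contains one, or (None, 0.0) if none exists."""
    run_dirs = sorted(
        (name for name in os.listdir(results_dir) if RUN_DIR_RE.match(name)),
        key=lambda name: int(RUN_DIR_RE.match(name).group(1)),
        reverse=True,
    )
    for run_dir in run_dirs:
        run_path = os.path.join(results_dir, run_dir)
        if not os.path.isdir(run_path):
            continue
        snapshots = []
        for name in os.listdir(run_path):
            match = SNAPSHOT_RE.match(name)
            if match:
                snapshots.append((int(match.group(1)), name))
        if snapshots:
            kimg, name = max(snapshots)
            return os.path.join(run_path, name), float(kimg)
    return None, 0.0
```

A freshly created run directory with no snapshots is simply skipped, which matches what the log shows for 00003.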
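After about 12 hours of continuous GPU use, I was disconnected, and a pop-up appeared when I tried to reconnect to the server.
I looked it up. The gist is that usage is capped (generally believed to be about 12 hours a day), and once you exceed the cap you are temporarily blocked from using the GPU. Waiting about 8 hours is usually enough to get access back.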
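With the speed from the training log it is easy to estimate what a 12-hour cap means in practice. A rough calculation, assuming the steady-state rate of about 280 sec/kimg seen above and ignoring startup and maintenance time:

```python
SEC_PER_KIMG = 280.0   # steady-state sec/kimg from the training log
TOTAL_KIMG = 10000     # value passed as --total-kimg
SESSION_HOURS = 12     # approximate usage cap per session


def kimg_per_session(sec_per_kimg, session_hours):
    """How many kimg fit into one session at the given speed."""
    return session_hours * 3600 / sec_per_kimg


def days_of_compute(total_kimg, sec_per_kimg):
    """Days of uninterrupted GPU time needed for total_kimg."""
    return total_kimg * sec_per_kimg / 86400


print("kimg per session: %.1f" % kimg_per_session(SEC_PER_KIMG, SESSION_HOURS))
print("days for the full run: %.1f" % days_of_compute(TOTAL_KIMG, SEC_PER_KIMG))
```

That works out to roughly 154 kimg per session and about 32 days of pure compute for the full 10000 kimg, so on the free tier the total is an upper bound rather than a realistic target; the samples compared later in this post are at 123 kimg.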
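Results
Another problem: the trained model now sits in the stylegan2/results folder on Google Drive, but downloading from Google Drive is painfully slow. So I used https://www.multcloud.com/ to transfer the files I needed from Google Drive to OneDrive, and downloaded them from OneDrive.
Sample comparison
Initial model, 0 kimg
Filtered dataset 4, 123 kimg
Filtered dataset 5, 123 kimg
As the samples show, the results from filtered dataset 5 look somewhat higher in quality.
Generating images
Download the trained .pkl file to your local machine, set up the paths, and run: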
python run_generator.py generate-images --network=config --seeds=0-100 --truncation-psi=0.5
The generated images are in the results folder.
Generating videos
video.py
import math
import pickle

import moviepy.editor
import numpy as np
import scipy.ndimage  # the submodule must be imported explicitly
from numpy import linalg

import config
import dnnlib.tflib as tflib


def main():
    tflib.init_tf()
    # Load pre-trained network.
    # url = 'https://drive.google.com/uc?id=1MEGjdvVpUsu1jB4zrXZN7Y4kBBOzizDQ'
    # with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
    ## NOTE: insert model here:
    with open(config.Model, "rb") as f:
        _G, _D, Gs = pickle.load(f)
    # _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
    # _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
    # Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.
    grid_size = [2, 2]
    image_shrink = 1
    image_zoom = 1
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'
    random_seed = 404
    mp4_file = 'results/random_grid_%s.mp4' % random_seed
    minibatch_size = 8
    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_state = np.random.RandomState(random_seed)

    # Generate latent vectors, then smooth them along the time axis.
    shape = [num_frames, np.prod(grid_size)] + Gs.input_shape[1:]  # [frame, image, channel, component]
    all_latents = random_state.randn(*shape).astype(np.float32)
    all_latents = scipy.ndimage.gaussian_filter(
        all_latents, [smoothing_sec * mp4_fps] + [0] * len(Gs.input_shape), mode='wrap')
    all_latents /= np.sqrt(np.mean(np.square(all_latents)))

    def create_image_grid(images, grid_size=None):
        assert images.ndim == 4  # expects a batch in NHWC layout
        num, img_h, img_w, channels = images.shape
        if grid_size is not None:
            grid_w, grid_h = tuple(grid_size)
        else:
            grid_w = max(int(np.ceil(np.sqrt(num))), 1)
            grid_h = max((num - 1) // grid_w + 1, 1)
        grid = np.zeros([grid_h * img_h, grid_w * img_w, channels], dtype=images.dtype)
        for idx in range(num):
            x = (idx % grid_w) * img_w
            y = (idx // grid_w) * img_h
            grid[y: y + img_h, x: x + img_w] = images[idx]
        return grid

    # Frame generation func for moviepy.
    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        latents = all_latents[frame_idx]
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        images = Gs.run(latents, None, truncation_psi=0.7,
                        randomize_noise=False, output_transform=fmt)
        grid = create_image_grid(images, grid_size)
        if image_zoom > 1:
            grid = scipy.ndimage.zoom(grid, [image_zoom, image_zoom, 1], order=0)
        if grid.shape[2] == 1:
            grid = grid.repeat(3, 2)  # grayscale => RGB
        return grid

    # Generate video.
    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)


def circular():
    tflib.init_tf()
    with open(config.Model, "rb") as f:
        _G, _D, Gs = pickle.load(f)
    rnd = np.random
    latents_a = rnd.randn(1, Gs.input_shape[1])
    latents_b = rnd.randn(1, Gs.input_shape[1])
    latents_c = rnd.randn(1, Gs.input_shape[1])

    # Map a position in [0, 1] to a point on a closed loop around latents_a.
    def circ_generator(latents_interpolate):
        radius = 40.0
        latents_axis_x = (latents_a - latents_b).flatten() / linalg.norm(latents_a - latents_b)
        latents_axis_y = (latents_a - latents_c).flatten() / linalg.norm(latents_a - latents_c)
        latents_x = math.sin(math.pi * 2.0 * latents_interpolate) * radius
        latents_y = math.cos(math.pi * 2.0 * latents_interpolate) * radius
        latents = latents_a + latents_x * latents_axis_x + latents_y * latents_axis_y
        return latents

    def mse(x, y):
        return (np.square(x - y)).mean()

    # Pick the next frame by bisection so that consecutive frames differ
    # by an MSE between change_min and change_max.
    def generate_from_generator_adaptive(gen_func):
        max_step = 1.0
        current_pos = 0.0
        change_min = 10.0
        change_max = 11.0
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        current_latent = gen_func(current_pos)
        current_image = Gs.run(current_latent, None, truncation_psi=0.7,
                               randomize_noise=False, output_transform=fmt)[0]
        array_list = []
        video_length = 1.0
        while current_pos < video_length:
            array_list.append(current_image)
            lower = current_pos
            upper = current_pos + max_step
            current_pos = (upper + lower) / 2.0
            current_latent = gen_func(current_pos)
            current_image = Gs.run(current_latent, None, truncation_psi=0.7,
                                   randomize_noise=False, output_transform=fmt)[0]
            current_mse = mse(array_list[-1], current_image)
            while current_mse < change_min or current_mse > change_max:
                if current_mse < change_min:
                    lower = current_pos
                    current_pos = (upper + lower) / 2.0
                if current_mse > change_max:
                    upper = current_pos
                    current_pos = (upper + lower) / 2.0
                current_latent = gen_func(current_pos)
                current_image = Gs.run(current_latent, None, truncation_psi=0.7,
                                       randomize_noise=False, output_transform=fmt)[0]
                current_mse = mse(array_list[-1], current_image)
            print("%s / %s : %s" % (current_pos, video_length, current_mse))
        return array_list

    frames = generate_from_generator_adaptive(circ_generator)
    frames = moviepy.editor.ImageSequenceClip(frames, fps=30)
    # Generate video.
    mp4_file = 'results/circular.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '3M'
    mp4_fps = 20
    frames.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)


if __name__ == "__main__":
    main()
    circular()
Run it to generate the videos:
python video.py
Once the progress bars finish, the ./results folder contains:
- Random interpolation video
random_grid_404.mp4
- Seamlessly looping interpolation video
circular.mp4
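The reason circular.mp4 loops seamlessly is that `circ_generator` traces a closed curve in latent space: the position t is mapped to a + r·sin(2πt)·x̂ + r·cos(2πt)·ŷ, so t = 0 and t = 1 give the same latent. Strictly speaking the curve is an ellipse rather than a circle, since the two axes are not orthogonalized. Here is a dependency-free sketch of the same formula; the latent size of 512 and the function names are my assumptions (video.py reads the size from `Gs.input_shape[1]`).

```python
import math
import random


def unit_difference(u, v):
    """Return (u - v) scaled to unit Euclidean length."""
    diff = [ui - vi for ui, vi in zip(u, v)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def circular_latent(a, b, c, t, radius=40.0):
    """Point at phase t (0..1) on a closed loop in latent space around a."""
    axis_x = unit_difference(a, b)
    axis_y = unit_difference(a, c)
    x = math.sin(2.0 * math.pi * t) * radius
    y = math.cos(2.0 * math.pi * t) * radius
    return [ai + x * ex + y * ey for ai, ex, ey in zip(a, axis_x, axis_y)]


rng = random.Random(0)
dim = 512  # assumed latent size
a, b, c = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
start = circular_latent(a, b, c, 0.0)
end = circular_latent(a, b, c, 1.0)
# The largest coordinate gap is at floating-point noise level: the loop closes.
print(max(abs(s - e) for s, e in zip(start, end)))
```

Note that the loop is centered on `a` and never passes through it; every frame sits at distance `radius` or thereabouts from that center.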
Afterword
- Datasets (with varying degrees of manual filtering):
https://pan.baidu.com/s/1QxxzutXfqSMv1p7WONnipw extraction code: 3ujb
- Trained StyleGAN2 model:
https://pan.baidu.com/s/1FRGzD6MwiD-qbnFcKTE34Q extraction code: v381
https://drive.google.com/file/d/1HAuNpiovAPgxt9U4gAbisLZ-fh4Mza6r
Thanks to gwern for his notes and guidance.
Thanks to @document for providing the RTX 2080.