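When training a model, which batch_size makes the fullest use of the GPU?
Here is a function that helps you quickly find a suitable batch_size!
Reference: original article link
Function definition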
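import time

import torch


def proc_time(b_sz, model, n_iter=10):
    # Build a random input batch on the GPU
    x = torch.rand(b_sz, 16, 11).cuda()  # <----- set the input shape here
    torch.cuda.synchronize()  # wait for pending GPU work before starting the timer
    start = time.time()
    for _ in range(n_iter):
        model(x)  # <---- feed the input to the model
    torch.cuda.synchronize()  # wait until every forward pass has finished
    elapsed = time.time() - start
    throughput = b_sz * n_iter / elapsed  # samples processed per second
    print(f"Batch: {b_sz} \t {throughput} samples/sec")
    return (b_sz, throughput)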
Function call
# `model` is assumed to be on the GPU already (e.g. via model.cuda())
for b_sz in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]:
    proc_time(b_sz, model)
Batch: 1 16.793156063735697 samples/sec
Batch: 2 38.83115043526805 samples/sec
Batch: 4 77.96799714472667 samples/sec
Batch: 8 153.83649638382983 samples/sec
Batch: 16 304.7619878029563 samples/sec
Batch: 32 600.1129780317017 samples/sec
Batch: 64 1350.1580643181849 samples/sec
Batch: 128 2644.7298943577844 samples/sec
Batch: 256 5297.651717512998 samples/sec
Batch: 512 9337.831389005929 samples/sec
Batch: 1024 14020.95845977864 samples/sec
Batch: 2048 16672.3204029026 samples/sec
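In the output above, throughput roughly doubles with each doubling of the batch size up to 256, after which the gains shrink. One way to turn that observation into a choice is sketched below. This is only an illustrative heuristic, not part of the original function: the names `doubling_gains` and `pick_batch_size` and the `min_gain` threshold are assumptions, and the throughput values are the ones printed above, rounded to one decimal place.

```python
# Throughput figures copied from the output above, rounded to one decimal place.
results = [
    (1, 16.8), (2, 38.8), (4, 78.0), (8, 153.8), (16, 304.8), (32, 600.1),
    (64, 1350.2), (128, 2644.7), (256, 5297.7), (512, 9337.8),
    (1024, 14021.0), (2048, 16672.3),
]


def doubling_gains(results):
    """Return (b_sz, gain) pairs, where gain is the throughput ratio
    relative to the previous (smaller) batch size in the list."""
    return [(b, t / prev_t) for (_, prev_t), (b, t) in zip(results, results[1:])]


def pick_batch_size(results, min_gain=1.3):
    """Hypothetical heuristic: keep moving to the next batch size while the
    step still multiplies throughput by at least `min_gain`; return the last
    batch size that satisfied this."""
    best = results[0][0]
    for (_, prev_t), (b, t) in zip(results, results[1:]):
        if t / prev_t < min_gain:
            break  # returns have diminished too much; stop here
        best = b
    return best


if __name__ == "__main__":
    for b, gain in doubling_gains(results):
        print(f"Batch: {b} \t x{gain:.2f} vs previous")
    print("Suggested batch_size:", pick_batch_size(results))
```

The threshold is a judgment call: a stricter `min_gain` stops earlier and favors smaller batches, while a looser one accepts smaller improvements. The tuples returned by `proc_time` can be collected into a list and passed in directly. Throughput is only one criterion, since GPU memory limits and the effect of batch size on training convergence also matter.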