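While profiling an API today, I noticed that the first call to the endpoint took noticeably longer than every call after it.
After some digging, the cause turned out to be the GPU: the first computation on the GPU takes much longer than the later ones. My guess is that this is initialization overhead, since the GPU has to be set up on first use.
Here is a small experiment that runs the same computation many times in a loop: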
import torch
import time

if __name__ == '__main__':
    device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
    for i in range(0, 100):
        st = time.time()
        # Create two matrices on the CPU, copy them to the device, multiply element-wise.
        a1 = torch.rand(9999, 9999).to(device)
        b1 = torch.rand(9999, 9999).to(device)
        torch.mul(a1, b1)
        et = time.time()
        # Note: time.time() returns seconds, so (et - st) is in seconds despite the "ms" label.
        print('In computation {}, the computing operation costs {}ms'.format(str(i), (et-st)))
        time.sleep(2)
Output (the values are actually in seconds; the "ms" in the print statement is a mislabel):
In computation 0, the computing operation costs 4.841561555862427ms
In computation 1, the computing operation costs 1.8976020812988281ms
In computation 2, the computing operation costs 1.9691121578216553ms
In computation 3, the computing operation costs 1.8744938373565674ms
In computation 4, the computing operation costs 1.8520243167877197ms
...
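However, if any GPU operation runs beforehand so that the GPU has already been initialized, the timing is normal from the very first iteration.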
import torch
import time

if __name__ == '__main__':
    device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
    # Throwaway GPU operation: forces GPU initialization before the timed loop starts.
    before_load = torch.rand(9999, 9999).to(device)
    for i in range(0, 100):
        st = time.time()
        a1 = torch.rand(9999, 9999).to(device)
        b1 = torch.rand(9999, 9999).to(device)
        torch.mul(a1, b1)
        et = time.time()
        # Note: time.time() returns seconds, so (et - st) is in seconds despite the "ms" label.
        print('In computation {}, the computing operation costs {}ms'.format(str(i), (et-st)))
        time.sleep(2)
Output:
In computation 0, the computing operation costs 1.9176990985870361ms
In computation 1, the computing operation costs 1.9287288188934326ms
In computation 2, the computing operation costs 1.8840551376342773ms
...
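A caveat about the measurements above: the timed region also covers generating the random matrices on the CPU and copying them to the device, and CUDA kernels are launched asynchronously, so the clock may stop before the GPU has actually finished. Below is a minimal sketch of a stricter timing helper. The name `time_calls` and its `sync` parameter are my own, not part of PyTorch; the helper simply calls an optional synchronization function before and after each run and reports real milliseconds.

```python
import time


def time_calls(fn, repeats, sync=None):
    """Time `fn` `repeats` times and return the per-call durations in milliseconds.

    `sync` is an optional zero-argument callable (hypothetical parameter) that
    blocks until pending asynchronous work has finished. It is called right
    before the clock starts and right before the clock stops.
    """
    timings_ms = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # make sure earlier queued work is not billed to this call
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to really finish
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


if __name__ == "__main__":
    # Pure-CPU demo so the sketch runs anywhere; no sync function is needed here.
    results = time_calls(lambda: sum(range(100_000)), repeats=5)
    for i, ms in enumerate(results):
        print("call {}: {:.3f} ms".format(i, ms))
```

With PyTorch on a GPU, one would presumably pass `torch.cuda.synchronize` as `sync` and create `a1` and `b1` outside of `fn`, so that only `torch.mul(a1, b1)` is measured. I have not rerun the experiment this way, so the numbers above are still the original ones.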
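This may look like it has little effect in practice, but we run an online API service, and most users may call it only once after starting the service, so the latency of that first call matters a lot.
So when building an API service, remember to initialize (warm up) the GPU as part of service startup.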
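As a rough sketch of what that could look like: a small warm-up function called once during startup, before the service accepts requests. The function name `warm_up_gpu`, the default device string, and the 1024 x 1024 tensor size are assumptions for illustration, not something from our actual service. It runs the same kind of throwaway operation as the experiment above, then waits for it to finish.

```python
def warm_up_gpu(device_name="cuda:0"):
    """Run one throwaway GPU operation so the first real request is not slowed down.

    Returns True if a warm-up ran on a CUDA device, False otherwise.
    `device_name` is a hypothetical parameter; pass whichever device the service uses.
    """
    if not device_name.startswith("cuda"):
        return False  # nothing to warm up on a CPU-only deployment

    # Imported here so a CPU-only deployment does not need torch at import time.
    import torch

    if not torch.cuda.is_available():
        return False

    device = torch.device(device_name)
    # Allocating a tensor on the device triggers GPU initialization.
    x = torch.rand(1024, 1024, device=device)
    torch.mul(x, x)
    # Kernels launch asynchronously; block until the warm-up work has completed.
    torch.cuda.synchronize(device)
    return True


if __name__ == "__main__":
    # Call this once in the service's startup path, before serving traffic.
    print("GPU warmed up:", warm_up_gpu("cuda:0"))
```

If the service runs a real model, pushing one dummy input through the model at startup is probably an even better warm-up, since it exercises the exact code path that requests will hit.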