Can Converting FP32 to FP16 Speed Up libtorch Inference?
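###1. FP16 speedup in PyTorch
In PyTorch, the half() function converts a model from FP32 to FP16 quickly and concisely. Whether FP16 actually runs faster, however, depends on the GPU. Take the following benchmark as an example: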
import time
import torch
from torch.autograd import Variable
import torchvision.models as models
import torch.backends.cudnn as cudnn

# Let cuDNN autotune its convolution algorithms for the fixed input shape.
cudnn.benchmark = True

# FP32 model and input
net = models.resnet18().cuda()
inp = torch.randn(64, 3, 224, 224).cuda()

# Warm-up: 5 untimed forward/backward passes
for i in range(5):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()

# Time 100 FP32 forward/backward passes.
# CUDA kernels run asynchronously, so synchronize before reading the clock.
torch.cuda.synchronize()
start = time.time()
for i in range(100):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()
torch.cuda.synchronize()
end = time.time()
print("FP32 Iterations per second: ", 100 / (end - start))

# FP16 model and input
net = models.resnet18().cuda().half()
inp = torch.randn(64, 3, 224, 224).cuda().half()

# Time 100 FP16 forward/backward passes.
# Note: there is no separate warm-up here, so cuDNN autotuning for the
# FP16 kernels falls inside the timed loop.
torch.cuda.synchronize()
start = time.time()
for i in range(100):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()
torch.cuda.synchronize()
end = time.time()
print("FP16 Iterations per second: ", 100 / (end - start))
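The benchmark above repeats one pattern: warm up, synchronize, time a fixed number of steps, synchronize again. As a minimal sketch (not the author's code), that pattern could be factored into a helper. The names `iterations_per_second`, `step` and `sync` are hypothetical and chosen only for illustration; the demo at the bottom uses a CPU-only stand-in so it runs without a GPU.

```python
import time


def iterations_per_second(step, sync=lambda: None, warmup=5, iters=100):
    """Return how many times per second `step` runs, excluding warm-up runs."""
    for _ in range(warmup):
        step()  # warm-up runs are not timed
    sync()  # wait for pending asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    sync()  # wait again so that all timed work has actually finished
    return iters / (time.perf_counter() - start)


# CPU-only stand-in for one training step, just to show the call shape.
rate = iterations_per_second(lambda: sum(range(10_000)), warmup=2, iters=20)
print(f"{rate:.1f} iterations per second")
```

For the GPU benchmark, `step` would wrap the zero_grad/forward/backward sequence and `sync` would be `torch.cuda.synchronize`. Calling the helper once per precision would also give the FP16 run its own warm-up.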
Results on a 1080Ti:
FP32 Iterations per second: 10.37743206218922
FP16 Iterations per second: 9.855269155760238
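FP32 Memory: 2497M
FP16 Memory: 1611M
FP16 clearly reduces GPU memory usage, but it brings no speedup; throughput actually drops slightly. The 1080Ti (Pascal architecture) has no Tensor Cores to accelerate FP16 arithmetic.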
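The memory saving follows from the element size: an FP16 value takes 2 bytes and an FP32 value takes 4. The sketch below illustrates this with Python's standard `struct` module, whose `"e"` and `"f"` format codes are IEEE 754 half and single precision. The helper `to_fp16` is a hypothetical name introduced here, and the batch-size arithmetic covers only the input tensor, not the whole model.

```python
import struct

# struct format codes: "f" is IEEE 754 binary32 (FP32), "e" is binary16 (FP16).
fp32_bytes = struct.calcsize("f")
fp16_bytes = struct.calcsize("e")
print(fp32_bytes, fp16_bytes)  # 4 2


def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]


# FP16 trades precision for size: 0.1 is stored only approximately,
# and not every integer above 2048 is representable.
print(to_fp16(0.1))
print(to_fp16(2049))

# Size in MiB of one 64x3x224x224 input batch in each format.
elements = 64 * 3 * 224 * 224
print(elements * fp32_bytes / 2**20, elements * fp16_bytes / 2**20)
```

The measured totals above shrink by less than half, so not all of the GPU memory in use scales with the element size.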
Now the same comparison on a V100:
FP32 Iterations per second: 16.325794715481173
FP16 Iterations per second: 24.853492643300903
FP32 Memory: 3202M
FP16 Memory: 2272M
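One likely reason for the difference is hardware: the V100 (compute capability 7.0) has Tensor Cores for FP16 matrix math, while the 1080Ti (compute capability 6.1) does not. The sketch below encodes this as a rough heuristic. The function name `may_have_tensor_cores` is hypothetical, and the threshold is an assumption that does not hold for every card: some consumer GPUs with capability 7.5 (the GTX 16 series) ship without Tensor Cores.

```python
# Tensor Cores first appeared with compute capability 7.0 (Volta).
# Assumption: this is only a heuristic, since a few capability 7.5 cards lack them.
TENSOR_CORE_MIN_CAPABILITY = (7, 0)


def may_have_tensor_cores(capability):
    """capability is a (major, minor) tuple, as returned by torch.cuda.get_device_capability()."""
    return tuple(capability) >= TENSOR_CORE_MIN_CAPABILITY


# Compute capabilities of the two GPUs used in this post.
gpus = {"GTX 1080 Ti": (6, 1), "Tesla V100": (7, 0)}
for name, capability in gpus.items():
    print(name, may_have_tensor_cores(capability))
```

On a machine with PyTorch and a GPU, the tuple can be obtained with `torch.cuda.get_device_capability()` instead of being hard-coded.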
###2. FP16 speedup in libtorch
Next we test on the V100 whether FP16 also speeds up inference in libtorch.
####2.1 Download libtorch
wget https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.6.0%2Bcu101.zip
unzip libtorch-cxx11-abi-shared-with-deps-1.6.0+cu101.zip
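Find the matching libtorch build on the PyTorch website. libtorch is generally backward compatible, so a newer libtorch can usually load a model exported by an older PyTorch; here libtorch is 1.6.0 and the installed PyTorch is 1.1.0.
####2.2 Generate the trace file with PyTorch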
import torch
import torchvision.models as models

net = models.resnet18().cuda()
net.eval()  # inference mode: BatchNorm uses its running statistics
inp = torch.randn(64, 3, 224, 224).cuda()
# Trace the model with an example input and serialize it for libtorch
traced_script_module = torch.jit.trace(net, inp)
traced_script_module.save("RESNET18_trace.pt")
print("trace has been saved!")
####2.3 Load the trace from libtorch
#include <torch/script.h>
#include <cuda_runtime_api.h>
#include <chrono>
#include <iostream>
using namespace std;

int main()
{
    // Same effect as cudnn.benchmark = True in Python
    at::globalContext().setBenchmarkCuDNN(true);
    std::string model_file = "/home/zwzhou/Code/test_libtorch/RESNET18_trace.pt";
    torch::Tensor inputs = torch::rand({64, 3, 224, 224}).to(at::kCUDA);
    torch::jit::script::Module net = torch::jit::load(model_file); // load model
    net.to(at::kCUDA);

    // FP32: one untimed warm-up pass, then 100 timed forward passes
    auto outputs = net.forward({inputs});
    cudaDeviceSynchronize();
    auto before = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i)
    {
        outputs = net.forward({inputs});
    }
    cudaDeviceSynchronize();
    auto after = std::chrono::steady_clock::now();
    std::chrono::duration<double> all_time = after - before;
    std::cout << "FP32 iteration per second: " << 100 / all_time.count() << std::endl;

    // FP16: convert the weights, then time 100 forward passes.
    // Note: the input cast to half happens inside the timed loop,
    // and there is no separate FP16 warm-up pass.
    net.to(torch::kHalf);
    cudaDeviceSynchronize();
    before = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i)
    {
        outputs = net.forward({inputs.to(torch::kHalf)});
    }
    cudaDeviceSynchronize();
    after = std::chrono::steady_clock::now();
    std::chrono::duration<double> all_time2 = after - before;
    std::cout << "FP16 iteration per second: " << 100 / all_time2.count() << std::endl;
}
####2.4 Write CMakeLists.txt
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(FP_TEST)
set(CMAKE_PREFIX_PATH "/home/zwzhou/packages/libtorch")
find_package(Torch REQUIRED)
add_executable(mtest ./libtorch_test.cpp)
target_link_libraries(mtest ${TORCH_LIBRARIES})
set_property(TARGET mtest PROPERTY CXX_STANDARD 14)
####2.5 Build and run the benchmark
mkdir -p build
cd build
cmake ..
make
./mtest
####2.6 Timing results
FP32 iteration per second: 60.6978
FP16 iteration per second: 91.5507
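The libtorch figures are much higher than the PyTorch ones, although the two are not directly comparable: the PyTorch benchmark times a forward and a backward pass, while the libtorch benchmark times only the forward pass. On the V100, FP16 also speeds up libtorch inference, by roughly 1.5x here.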
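To put the measurements side by side, the FP16/FP32 throughput ratio can be computed for each run. The numbers below are copied from the outputs reported in this post; the helper name `speedup` and the dictionary labels are introduced here only for illustration.

```python
# Iterations per second (FP32, FP16), copied from the runs reported in this post.
results = {
    "PyTorch, 1080Ti (forward + backward)": (10.37743206218922, 9.855269155760238),
    "PyTorch, V100 (forward + backward)": (16.325794715481173, 24.853492643300903),
    "libtorch, V100 (forward only)": (60.6978, 91.5507),
}


def speedup(fp32_ips, fp16_ips):
    """Ratio of FP16 throughput to FP32 throughput; values above 1 mean FP16 is faster."""
    return fp16_ips / fp32_ips


for name, (fp32_ips, fp16_ips) in results.items():
    print(f"{name}: {speedup(fp32_ips, fp16_ips):.2f}x")
```

The ratios come out at about 0.95x on the 1080Ti and about 1.5x for both V100 runs.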