Can FP32 → FP16 Conversion Speed Up libtorch Inference?


### 1. Does FP16 speed up PyTorch?

PyTorch can convert a model from FP32 to FP16 with a single call to `half()`, but whether FP16 actually runs faster depends on the GPU. Take the following benchmark as an example:

```python
import time

import torch
from torch.autograd import Variable
import torchvision.models as models
import torch.backends.cudnn as cudnn

cudnn.benchmark = True

net = models.resnet18().cuda()
inp = torch.randn(64, 3, 224, 224).cuda()

# Warm-up iterations so cudnn.benchmark can settle on its algorithms
for i in range(5):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()

# Time 100 FP32 forward + backward passes
torch.cuda.synchronize()
start = time.time()
for i in range(100):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()
torch.cuda.synchronize()
end = time.time()
print("FP32 Iterations per second: ", 100 / (end - start))

# Same benchmark with the model and input converted to FP16
net = models.resnet18().cuda().half()
inp = torch.randn(64, 3, 224, 224).cuda().half()

torch.cuda.synchronize()
start = time.time()
for i in range(100):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()
torch.cuda.synchronize()
end = time.time()
print("FP16 Iterations per second: ", 100 / (end - start))
```

Results on a GTX 1080 Ti:

```
FP32 Iterations per second: 10.37743206218922
FP16 Iterations per second: 9.855269155760238
FP32 Memory: 2497M
FP16 Memory: 1611M
```

FP16 cuts memory usage by roughly a third, but throughput does not improve; it even drops slightly. This is expected: the 1080 Ti is a Pascal card with no Tensor Cores and very limited native FP16 arithmetic throughput.

For comparison, the same benchmark on a V100:

```
FP32 Iterations per second: 16.325794715481173
FP16 Iterations per second: 24.853492643300903
FP32 Memory: 3202M
FP16 Memory: 2272M
```

On the V100, FP16 is roughly 1.5× faster than FP32 while still saving memory.
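The gap between the two cards is a hardware story: FP16 pays off mainly on GPUs with Tensor Cores (compute capability 7.0 and up, such as the V100 at sm_70), while Pascal cards like the 1080 Ti (sm_61) execute half-precision math slowly. A minimal CUDA sketch, assuming the CUDA toolkit is installed, that prints each visible GPU's compute capability:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Print the compute capability of each visible GPU. Devices reporting
// major version >= 7 (Volta and later) have Tensor Cores and benefit
// most from FP16.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```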

### 2. Does FP16 speed up libtorch?

Next we test, on a V100, whether FP16 can speed up libtorch inference.

#### 2.1 Download libtorch

```bash
wget https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.6.0%2Bcu101.zip
unzip libtorch-cxx11-abi-shared-with-deps-1.6.0+cu101.zip
```

Find the matching build on the PyTorch website. libtorch is generally backward compatible with traces exported by older PyTorch versions; here the libtorch version is 1.6.0 while the installed PyTorch is 1.1.0.

#### 2.2 Export a TorchScript trace

```python
import torch
import torchvision.models as models

net = models.resnet18().cuda()
net.eval()

inp = torch.randn(64, 3, 224, 224).cuda()
traced_script_module = torch.jit.trace(net, inp)
traced_script_module.save("RESNET18_trace.pt")
print("trace has been saved!")
```

#### 2.3 Call the trace from libtorch

```cpp
#include <torch/script.h>
#include <cuda_runtime.h>
#include <chrono>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    at::globalContext().setBenchmarkCuDNN(true);

    std::string model_file = "/home/zwzhou/Code/test_libtorch/RESNET18_trace.pt";
    torch::Tensor inputs = torch::rand({64, 3, 224, 224}).to(at::kCUDA);

    torch::jit::script::Module net = torch::jit::load(model_file); // load the trace
    net.to(at::kCUDA);

    // Warm-up pass, then time 100 FP32 forward passes
    auto outputs = net.forward({inputs});
    cudaDeviceSynchronize();
    auto before = std::chrono::system_clock::now();
    for (int i = 0; i < 100; ++i)
    {
        outputs = net.forward({inputs});
    }
    cudaDeviceSynchronize();
    auto after = std::chrono::system_clock::now();
    std::chrono::duration<double> all_time = after - before;
    std::cout << "FP32 iteration per second: " << 100.0 / all_time.count() << std::endl;

    // Convert the model to FP16 and time 100 half-precision forward passes.
    // Note that the per-iteration input cast to kHalf is included in the timing.
    net.to(torch::kHalf);
    cudaDeviceSynchronize();
    before = std::chrono::system_clock::now();
    for (int i = 0; i < 100; ++i)
    {
        outputs = net.forward({inputs.to(torch::kHalf)});
    }
    cudaDeviceSynchronize();
    after = std::chrono::system_clock::now();
    std::chrono::duration<double> all_time2 = after - before;
    std::cout << "FP16 iteration per second: " << 100.0 / all_time2.count() << std::endl;

    return 0;
}
```
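One practical caveat when the whole network runs in `kHalf`: it is usually safer to cast the output back to FP32 before numerically sensitive post-processing. A minimal sketch (the helper function and the softmax step are illustrative assumptions, not part of the program above):

```cpp
#include <torch/script.h>

// Upcast a half-precision network output to FP32 before post-processing;
// reductions such as softmax accumulate more accurately in FP32.
torch::Tensor postprocess_fp16_output(const torch::Tensor& logits_fp16)
{
    torch::Tensor logits = logits_fp16.to(torch::kFloat); // FP16 -> FP32
    return logits.softmax(/*dim=*/1);
}
```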

#### 2.4 Write the CMakeLists.txt

```cmake
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(FP_TEST)

# Tell CMake where the unpacked libtorch lives (adjust for your machine)
set(CMAKE_PREFIX_PATH "/home/zwzhou/packages/libtorch")

find_package(Torch REQUIRED)

add_executable(mtest ./libtorch_test.cpp)
target_link_libraries(mtest ${TORCH_LIBRARIES})
set_property(TARGET mtest PROPERTY CXX_STANDARD 14)
```
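Instead of hard-coding the path, `CMAKE_PREFIX_PATH` can also be supplied at configure time, e.g. `cmake -DCMAKE_PREFIX_PATH=/home/zwzhou/packages/libtorch ..`.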

#### 2.5 Build and run

```bash
mkdir build && cd build
cmake ..
make
./mtest
```

#### 2.6 Timing output

```
FP32 iteration per second: 60.6978
FP16 iteration per second: 91.5507
```

Two observations. First, the libtorch numbers are much higher than the PyTorch numbers above, though the comparison is not apples-to-apples: the PyTorch loop also runs a backward pass, while the libtorch loop only does forward inference. Second, on the V100, FP16 speeds up libtorch inference as well, by roughly 1.5× here.
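Finally, since the libtorch loop above only does inference, it is also idiomatic to disable autograd bookkeeping while timing it. A minimal sketch (the wrapper function is an illustrative assumption):

```cpp
#include <torch/script.h>

// Run one forward pass with gradient tracking disabled, the libtorch
// equivalent of Python's `with torch.no_grad():`.
void run_inference(torch::jit::script::Module& net, const torch::Tensor& inputs)
{
    torch::NoGradGuard no_grad; // RAII guard; autograd is off in this scope
    auto out = net.forward({inputs});
}
```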
