Preface
The author often sees people mix numpy and pytorch when writing code to save and load data. Is that actually necessary?
This post investigates the question by comparing the disk read/write efficiency of torch and numpy.
PyTorch vs. NumPy disk read/write efficiency experiment
import torch
import numpy as np
import time
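The same wall-clock vs. CPU-time measurement pattern recurs in every cell below. For reference, it could be wrapped in a small helper like the sketch here (`timed` is not part of the original code, just a convenience):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs), printing CPU time and wall-clock time.

    Mirrors the time.time() / time.process_time() pattern used in the
    cells below: process_time() counts CPU time only, so the difference
    between the two numbers is roughly the time spent waiting on I/O.
    """
    start = time.time()
    u_start = time.process_time()
    result = fn(*args, **kwargs)
    u_end = time.process_time()
    end = time.time()
    print('cpu time: ', u_end - u_start)
    print('time: ', end - start)
    return result
```

With this helper, each experiment below reduces to a one-liner such as `timed(torch.save, l, 'torch.pth')`.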
save() method
Python List data type
length = int(1e7)
l = [i for i in range(length)]
torch.save()
start = time.time()
u_start = time.process_time()
torch.save(l, 'torch.pth')
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 2.796875
time: 4.318857431411743
numpy.save()
start = time.time()
u_start = time.process_time()
np.save('np.npy', l)
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.484375
time: 0.6534786224365234
PyTorch Tensor data type
length = int(1e6)
batch_size = 256
t = torch.randn(batch_size, length, dtype=torch.float32).to('cuda:0')
torch.save()
start = time.time()
u_start = time.process_time()
torch.save(t, 'torch.pth')
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.375
time: 2.1566760540008545
numpy.save()
start = time.time()
u_start = time.process_time()
np.save('np.npy', t.to('cpu'))
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.375
time: 2.4183483123779297
NumPy ndarray data type
length = int(1e7)
a = np.array(range(length))
torch.save()
start = time.time()
u_start = time.process_time()
torch.save(a, 'torch.pth')
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.21875
time: 0.3329944610595703
numpy.save()
start = time.time()
u_start = time.process_time()
np.save('np.npy', a)
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.078125
time: 0.14899635314941406
load() method
Python List data type
length = int(1e7)
l = [i for i in range(length)]
torch.load()
torch.save(l, 'torch.pth')
start = time.time()
u_start = time.process_time()
l = torch.load('torch.pth')
u_end = time.process_time()
end = time.time()
print(type(l))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'list'>
cpu time: 1.1875
time: 1.6928999423980713
numpy.load()
np.save('np.npy', l)
start = time.time()
u_start = time.process_time()
l = np.load('np.npy')
u_end = time.process_time()
end = time.time()
print(type(l))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'numpy.ndarray'>
cpu time: 0.140625
time: 0.18318390846252441
PyTorch Tensor data type
length = int(1e6)
batch_size = 256
t = torch.randn(batch_size, length, dtype=torch.float32).to('cuda:0')
torch.load()
torch.save(t, 'torch.pth')
start = time.time()
u_start = time.process_time()
t = torch.load('torch.pth')
u_end = time.process_time()
end = time.time()
print(type(t))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'torch.Tensor'>
cpu time: 0.46875
time: 0.6859660148620605
numpy.load()
np.save('np.npy', t.to('cpu'))
start = time.time()
u_start = time.process_time()
t = torch.from_numpy(np.load('np.npy'))
u_end = time.process_time()
end = time.time()
print(type(t))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'torch.Tensor'>
cpu time: 0.21875
time: 0.3618292808532715
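One further advantage of the `.npy` route worth noting: `np.load` accepts a `mmap_mode` argument, which maps the file into memory instead of reading it all at once. When only part of a large array is needed, this can make loading nearly instantaneous. A minimal sketch (the file name `demo.npy` is illustrative):

```python
import os
import tempfile
import numpy as np

# Write a moderately large array to disk first.
path = os.path.join(tempfile.gettempdir(), 'demo.npy')
np.save(path, np.arange(int(1e6)))

# mmap_mode='r' returns a read-only memory map; bytes are pulled from
# disk lazily, only when the corresponding elements are accessed.
a = np.load(path, mmap_mode='r')
print(a[:5])
```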
NumPy ndarray data type
length = int(1e7)
a = np.array(range(length))
torch.load()
torch.save(a, 'torch.pth')
start = time.time()
u_start = time.process_time()
a = torch.load('torch.pth')
u_end = time.process_time()
end = time.time()
print(type(a))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'numpy.ndarray'>
cpu time: 0.09375
time: 0.14090633392333984
numpy.load()
np.save('np.npy', a)
start = time.time()
u_start = time.process_time()
a = np.load('np.npy')
u_end = time.process_time()
end = time.time()
print(type(a))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'numpy.ndarray'>
cpu time: 0.03125
time: 0.022988080978393555
Conclusion
Saving and loading large amounts of data with numpy is generally more efficient than with pytorch. For Python lists and ndarrays the gap approaches an order of magnitude; for GPU tensors it narrows, since the device-to-host copy dominates either way. When saving and loading large volumes of data, numpy is worth considering as an optimization.
(Corrections are welcome if anything here is mistaken.)
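Putting the recommendation into practice, a GPU tensor can be round-tripped through numpy as sketched below (the file name `tensor.npy` and the CPU fallback are illustrative choices, not from the original experiments):

```python
import numpy as np
import torch

# Fall back to CPU so the sketch also runs on machines without a GPU.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
t = torch.randn(4, 8, device=device)

# Save: move to host memory first, since np.save operates on CPU data.
np.save('tensor.npy', t.cpu().numpy())

# Load: torch.from_numpy shares memory with the loaded array, so no
# extra copy is made until the tensor is moved back to the device.
t2 = torch.from_numpy(np.load('tensor.npy')).to(device)

print(torch.equal(t.cpu(), t2.cpu()))
```

Since `.npy` stores the raw float32 bytes, the round trip is exact and the final comparison prints `True`.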