Preface
The author often sees people mix numpy and pytorch when writing code to save and load data. Is that actually necessary?
This post investigates the question by comparing the disk read/write efficiency of torch and numpy.
PyTorch vs. NumPy disk read/write efficiency experiment
import torch
import numpy as np
import time
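The same wall-clock vs. CPU-time measurement pattern recurs in every cell below. For reference, it could be wrapped in a small helper like the sketch here (`timed` is not part of the original code, just a convenience):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs), printing CPU time and wall-clock time.

    Mirrors the time.time() / time.process_time() pattern used in the
    cells below: process_time() counts CPU time only, so the difference
    between the two numbers is roughly the time spent waiting on I/O.
    """
    start = time.time()
    u_start = time.process_time()
    result = fn(*args, **kwargs)
    u_end = time.process_time()
    end = time.time()
    print('cpu time: ', u_end - u_start)
    print('time: ', end - start)
    return result
```

With this helper, each experiment below reduces to a one-liner such as `timed(torch.save, l, 'torch.pth')`.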
save() method
Python List data type
length = int(1e7)
l = [i for i in range(length)]
torch.save()
start = time.time()
u_start = time.process_time()
torch.save(l, 'torch.pth')
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 2.796875
time: 4.318857431411743
numpy.save()
start = time.time()
u_start = time.process_time()
np.save('np.npy', l)
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.484375
time: 0.6534786224365234
PyTorch Tensor data type
length = int(1e6)
batch_size = 256
t = torch.randn(batch_size, length, dtype=torch.float32).to('cuda:0')
torch.save()
start = time.time()
u_start = time.process_time()
torch.save(t, 'torch.pth')
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.375
time: 2.1566760540008545
numpy.save()
start = time.time()
u_start = time.process_time()
np.save('np.npy', t.to('cpu'))
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.375
time: 2.4183483123779297
NumPy ndarray data type
length = int(1e7)
a = np.array(range(length))
torch.save()
start = time.time()
u_start = time.process_time()
torch.save(a, 'torch.pth')
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.21875
time: 0.3329944610595703
numpy.save()
start = time.time()
u_start = time.process_time()
np.save('np.npy', a)
u_end = time.process_time()
end = time.time()
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
cpu time: 0.078125
time: 0.14899635314941406
load() method
Python List data type
length = int(1e7)
l = [i for i in range(length)]
torch.load()
torch.save(l, 'torch.pth')
start = time.time()
u_start = time.process_time()
l = torch.load('torch.pth')
u_end = time.process_time()
end = time.time()
print(type(l))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'list'>
cpu time: 1.1875
time: 1.6928999423980713
numpy.load()
np.save('np.npy', l)
start = time.time()
u_start = time.process_time()
l = np.load('np.npy')
u_end = time.process_time()
end = time.time()
print(type(l))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'numpy.ndarray'>
cpu time: 0.140625
time: 0.18318390846252441
PyTorch Tensor data type
length = int(1e6)
batch_size = 256
t = torch.randn(batch_size, length, dtype=torch.float32).to('cuda:0')
torch.load()
torch.save(t, 'torch.pth')
start = time.time()
u_start = time.process_time()
t = torch.load('torch.pth')
u_end = time.process_time()
end = time.time()
print(type(t))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'torch.Tensor'>
cpu time: 0.46875
time: 0.6859660148620605
numpy.load()
np.save('np.npy', t.to('cpu'))
start = time.time()
u_start = time.process_time()
t = torch.from_numpy(np.load('np.npy'))
u_end = time.process_time()
end = time.time()
print(type(t))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'torch.Tensor'>
cpu time: 0.21875
time: 0.3618292808532715
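One further advantage of the `.npy` route worth noting: `np.load` accepts a `mmap_mode` argument, which maps the file into memory instead of reading it all at once. When only part of a large array is needed, this can make loading nearly instantaneous. A minimal sketch (the file name `demo.npy` is illustrative):

```python
import os
import tempfile
import numpy as np

# Write a moderately large array to disk first.
path = os.path.join(tempfile.gettempdir(), 'demo.npy')
np.save(path, np.arange(int(1e6)))

# mmap_mode='r' returns a read-only memory map; bytes are pulled from
# disk lazily, only when the corresponding elements are accessed.
a = np.load(path, mmap_mode='r')
print(a[:5])
```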
NumPy ndarray data type
length = int(1e7)
a = np.array(range(length))
torch.load()
torch.save(a, 'torch.pth')
start = time.time()
u_start = time.process_time()
a = torch.load('torch.pth')
u_end = time.process_time()
end = time.time()
print(type(a))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'numpy.ndarray'>
cpu time: 0.09375
time: 0.14090633392333984
numpy.load()
np.save('np.npy', a)
start = time.time()
u_start = time.process_time()
a = np.load('np.npy')
u_end = time.process_time()
end = time.time()
print(type(a))
print('cpu time: ', u_end - u_start)
print('time: ', end - start)
<class 'numpy.ndarray'>
cpu time: 0.03125
time: 0.022988080978393555
Conclusion
Saving and loading large amounts of data with numpy is generally more efficient than with pytorch. For Python lists and ndarrays the gap approaches an order of magnitude; for GPU tensors it narrows, since the device-to-host copy dominates either way. When saving and loading large volumes of data, numpy is worth considering as an optimization.
(Corrections are welcome if anything here is mistaken.)
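Putting the recommendation into practice, a GPU tensor can be round-tripped through numpy as sketched below (the file name `tensor.npy` and the CPU fallback are illustrative choices, not from the original experiments):

```python
import numpy as np
import torch

# Fall back to CPU so the sketch also runs on machines without a GPU.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
t = torch.randn(4, 8, device=device)

# Save: move to host memory first, since np.save operates on CPU data.
np.save('tensor.npy', t.cpu().numpy())

# Load: torch.from_numpy shares memory with the loaded array, so no
# extra copy is made until the tensor is moved back to the device.
t2 = torch.from_numpy(np.load('tensor.npy')).to(device)

print(torch.equal(t.cpu(), t2.cpu()))
```

Since `.npy` stores the raw float32 bytes, the round trip is exact and the final comparison prints `True`.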