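NumPy files come with two extensions: .npy and .npz. An .npy file stores a single NumPy array, while an .npz file stores multiple NumPy arrays.
Python's pickle library can also serialize NumPy arrays. The experiment below compares how fast a pickle file and an .npz file can be read: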
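As a quick illustration of the two formats, here is a minimal sketch that writes one array with np.save and two arrays with np.savez, then reads them back. The file names and the keyword name `named` are made up for the example; arrays passed positionally to np.savez are stored under the automatic names arr_0, arr_1, and so on.

```python
import os
import tempfile

import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "single.npy")   # hypothetical file names
    npz_path = os.path.join(tmp, "bundle.npz")

    np.save(npy_path, a)             # one array per .npy file
    np.savez(npz_path, a, named=b)   # positional arrays get the keys arr_0, arr_1, ...

    loaded = np.load(npy_path)       # an .npy file comes back as an ndarray
    with np.load(npz_path) as bundle:
        keys = sorted(bundle.files)  # names of the arrays stored in the archive
        first = bundle["arr_0"]
        second = bundle["named"]

print(keys)
```

Each array inside the .npz file is looked up by name, which is why the experiment further down indexes the loaded object with 'arr_0'.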
import numpy as np
import pickle
import time


class Timer:
    """Context manager that records the wall-clock time spent in its block."""

    def __init__(self):
        self.start = 0
        self.end = 0
        self.interval = 0

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        self.end = time.time()
        self.interval = self.end - self.start


# 1000 x 1000 x 500 float64 values, about 4 GB in memory
shape = (1000, 1000, 500)
arr = np.random.rand(*shape)

# pickle: write, then read back
name_pickle = "test.pkl"
with open(name_pickle, "wb") as f:
    with Timer() as t:
        pickle.dump(arr, f)
    print(f"pickle dump time: {t.interval}s")

with open(name_pickle, "rb") as f:
    with Timer() as t:
        x = pickle.load(f)
    print(f"pickle load time: {t.interval}s")
    print(type(x))

print("\n\n")

# npz: write, then read back
name = "test.npz"
with Timer() as t:
    np.savez(name, arr)
print(f"np.savez time: {t.interval}s")

with Timer() as t:
    x = np.load(name, allow_pickle=True)
print(f"np.load time: {t.interval}s")
The output is:
pickle dump time: 3.172863721847534s
pickle load time: 1.2087347507476807s
<class 'numpy.ndarray'>
np.savez time: 3.008016347885132s
np.load time: 0.2541012763977051s
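It looks like np.load is an order of magnitude faster than pickle, which is exciting. But not so fast: let's add one more experiment that actually accesses the array inside the npz file and prints its type: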
with Timer() as t:
    print(type(x))
    print(type(x['arr_0']))
print(f"read time: {t.interval}")
print("\n\n")
This prints:
<class 'numpy.lib.npyio.NpzFile'>
<class 'numpy.ndarray'>
read time: 2.2933874130249023
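What on earth? Just touching what was loaded from the npz file takes more than two seconds, which is slower than the earlier pickle.load that read the whole file from disk.
The explanation is that np.load is lazy for npz files: it returns an NpzFile object that only acts as a handle to the arrays in the file, and an array is actually read from disk only when you index the handle by its name.
So np.load is not an order of magnitude faster than pickle.load. Adding the 0.25s for np.load to the 2.29s for the indexing gives about 2.5s, against 1.2s for pickle.load in this run. There is no need to go out of your way to replace pickle.load with np.load in a program!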
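The lazy behavior can be seen on a tiny array. The sketch below is not part of the original experiment; the file name demo.npz is made up. It checks what np.load returns for an npz file and whether indexing the same name twice hands back the same object.

```python
import os
import tempfile

import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npz")  # hypothetical file name
    np.savez(path, arr)

    with np.load(path) as handle:
        handle_type = type(handle).__name__  # the archive wrapper, not an ndarray
        names = handle.files                 # array names stored in the archive
        first = handle["arr_0"]              # the array is read here
        second = handle["arr_0"]             # and read again here

same_values = np.array_equal(first, second)
same_object = first is second
print(handle_type, names, same_values, same_object)
```

The two lookups return equal but distinct arrays, so each index pays the read cost again. If an array from an npz file is needed more than once, it is worth keeping it in a variable instead of indexing the handle repeatedly.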
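Also note that NumPy's lazy loading applies only to npz files. An npy file is read immediately by np.load, unless you explicitly ask for memory mapping with the mmap_mode argument.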
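For a fairer comparison, the array access has to sit inside the timed region. The following is a rough sketch under a few assumptions: the helpers timed, load_pickle and load_npz_array are names invented for this example, time.perf_counter replaces time.time, and the array is much smaller than in the experiment above so that it runs quickly. The timings it prints depend on the machine and on disk caching, so it does not reproduce the figures quoted earlier.

```python
import os
import pickle
import tempfile
import time

import numpy as np


def timed(fn):
    """Call fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def load_npz_array(path, key="arr_0"):
    # Index the handle inside the timed call so the deferred read is counted.
    with np.load(path) as data:
        return data[key]


arr = np.random.rand(200, 200, 50)  # about 16 MB, small enough for a quick run

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "test.pkl")
    npz_path = os.path.join(tmp, "test.npz")
    npy_path = os.path.join(tmp, "test.npy")

    with open(pkl_path, "wb") as f:
        pickle.dump(arr, f)
    np.savez(npz_path, arr)
    np.save(npy_path, arr)

    from_pickle, t_pickle = timed(lambda: load_pickle(pkl_path))
    from_npz, t_npz = timed(lambda: load_npz_array(npz_path))
    from_npy, t_npy = timed(lambda: np.load(npy_path))  # read immediately

print(f"pickle.load:        {t_pickle:.4f}s")
print(f"np.load (npz) + []: {t_npz:.4f}s")
print(f"np.load (npy):      {t_npy:.4f}s")
```

All three calls return a complete ndarray by the time the timer stops, so the numbers describe the same amount of work.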