我有非常大的数据集存储在硬盘上的二进制文件中.这是一个文件结构的例子:
文件头
149 Byte ASCII Header
记录开始
4 Byte Int - Record Timestamp
样品开始
2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample
样品结束
每个记录共有122,880个样本和每个文件713个记录.这产生的总大小为700,910,521字节.采样率和记录数量有时会有所不同,所以我必须编写检测每个文件数量的代码.
目前我用来将这些数据导入数组的代码如下:
from time import clock
from numpy import zeros , int16 , int32 , hstack , array , savez
from struct import unpack
from os.path import getsize
start_time = clock()
file_size = getsize(input_file)
with open(input_file,'rb') as openfile:
input_data = openfile.read()
header = input_data[:149]
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2
time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)
for record in xrange(number_of_records):
time_stamp = array( unpack( '
unpacked_record = unpack( '
record_t = zeros(sample_rate , dtype=int16)
record_x = zeros(sample_rate , dtype=int16)
record_y = zeros(sample_rate , dtype=int16)
record_z = zeros(sample_rate , dtype=int16)
for sample in xrange(sample_rate):
record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]
time_series = hstack ( ( time_series , time_stamp ) )
t_series = hstack ( ( t_series , record_t ) )
x_series = hstack ( ( x_series , record_x ) )
y_series = hstack ( ( y_series , record_y ) )
z_series = hstack ( ( z_series , record_z ) )
savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'
这个目前每700 MB文件大概需要250秒,对我来说似乎很高.有没有更有效的方法我可以做到这一点?
最终解决方案
使用具有自定义dtype的numpy fromfile方法将运行时切割为9秒,比上述原始代码快27倍.最后的代码如下.
from numpy import savez, dtype , fromfile
from os.path import getsize
from time import clock
start_time = clock()
file_size = getsize(input_file)
openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2
record_dtype = dtype( [ ( 'timestamp' , '
data = fromfile(openfile , dtype = record_dtype , count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()
savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, fid=time_series)
end_time = clock()
print 'It took',end_time - start_time,'seconds'