Summary:
pandas reads a roughly 2.2 GB csv file and converts float64 columns to float32 to reduce memory usage.
File description: 802 columns; all but the first and the last are stored as float64.
Code:
import time
import pandas as pd
import os
import numpy as np
DatasetPath = '/home/Datasets/Astro_Part4DL/Astropart.csv'
# Report the file's size on disk
print('File Usage {} GB'.format((os.path.getsize(DatasetPath))/ 1024**3))
df_start_time = time.time()
df = pd.read_csv(DatasetPath, sep=',')
print(df.dtypes[0:6])
df_end_time = time.time()
df_total_time = df_end_time - df_start_time
# Print how long the read took
print('df read time {} s'.format(df_total_time))
# Memory footprint of the float64 DataFrame
print('df float64 Memory Usage {} MB'.format(df.memory_usage().sum() / 1024**2))
to_f32 = {}
for col in df.columns:
    if col.startswith('point'):
        to_f32[col] = 'float32'
print('*'*40)
df1_start_time = time.time()
df1 = pd.read_csv(DatasetPath, sep=',', dtype=to_f32)
print(df1.dtypes[0:6])
df1_end_time = time.time()
df1_total_time = df1_end_time - df1_start_time
# Print how long the read took
print('df1 read time {} s'.format(df1_total_time))
# Memory footprint of the float32 DataFrame
print('df1 trans float32 Memory Usage {} MB'.format(df1.memory_usage().sum() / 1024**2))
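Note that the dtype map above is built from a full float64 read of the 2.2 GB file, so the data is parsed twice. The column names alone can be obtained much more cheaply by reading only the header with `nrows=0` and then doing a single full read with the narrower dtypes. A minimal sketch, using a small in-memory CSV in place of Astropart.csv:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the real file.
csv_text = "ID,point0,point1\nA,1.5,2.5\nB,3.5,4.5\n"

# Read only the header row (nrows=0) to discover column names
# without parsing any data rows.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)
to_f32 = {col: 'float32' for col in header.columns if col.startswith('point')}

# Single full read with float32 applied at parse time.
df = pd.read_csv(io.StringIO(csv_text), dtype=to_f32)
print(df.dtypes)
```

For the real file this avoids the initial 20-second float64 read entirely; only the header line is scanned before the typed read.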
Output:
File Usage 2.148865501396358 GB
[5 rows x 802 columns]
ID object
point0 float64
point1 float64
point2 float64
point3 float64
point4 float64
dtype: object
df read time 20.757500886917114 s
df float64 Memory Usage 978.3126068115234 MB
****************************************
[5 rows x 802 columns]
ID object
point0 float32
point1 float32
point2 float32
point3 float32
point4 float32
dtype: object
df1 read time 18.900606155395508 s
df1 trans float32 Memory Usage 490.37620544433594 MB
Conclusion:
Memory usage is roughly halved (978 MB to 490 MB), and the read is also slightly faster (20.8 s to 18.9 s).
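If a DataFrame has already been loaded as float64, the same reduction can be applied in place without re-reading the file, by selecting the float64 columns and downcasting them with `astype`. A minimal sketch with a small stand-in frame (in the article the data columns are point0 onward):

```python
import numpy as np
import pandas as pd

# Small stand-in DataFrame; ID stays object, points are float64.
df = pd.DataFrame({
    "ID": ["A", "B"],
    "point0": np.array([1.5, 2.5], dtype="float64"),
    "point1": np.array([3.5, 4.5], dtype="float64"),
})

# Select every float64 column and downcast all of them in one assignment.
f64_cols = df.select_dtypes(include="float64").columns
df[f64_cols] = df[f64_cols].astype("float32")
print(df.dtypes)
```

float32 keeps only about seven significant decimal digits, so this is appropriate when, as here, the data does not need double precision.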