Method 1:
pd.read_csv() takes a chunksize parameter that reads the data in blocks. For example, setting chunksize to 1,000,000 splits a large dataset into many small chunks of one million rows each.
Iterating over the chunks, I apply a chunk_preprocessing function to filter/preprocess the data in each chunk before appending it to a list. Finally, I concatenate the list into one final dataframe that fits in local memory.
import pandas as pd

# read the large csv file with the specified chunksize
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)

chunk_list = []  # append each chunk df here

# each chunk is a DataFrame
for chunk in df_chunk:
    # perform data filtering / preprocessing
    chunk_filter = chunk_preprocessing(chunk)
    # once the filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

# concat the list into a single dataframe
df_concat = pd.concat(chunk_list)
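chunk_preprocessing is a user-defined helper whose body isn't shown above; a minimal sketch of what such a function might look like, assuming hypothetical column names value and category:

# hypothetical example: shrink each chunk before storing it, by
# filtering rows and keeping only the columns that are needed
def chunk_preprocessing(chunk):
    # 'value' and 'category' are placeholder column names, not from the original
    chunk = chunk[chunk['value'] > 0]       # drop unwanted rows
    return chunk[['value', 'category']]     # keep only needed columns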
Method 2: use astype() to convert each column to a smaller dtype and save memory, e.g. float64 -> float32.
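A minimal sketch of this idea (the file path is reused from Method 1; which columns are safe to downcast depends on your data's value range and precision needs):

import numpy as np
import pandas as pd

df = pd.read_csv(r'../input/data.csv')

# downcast 64-bit floats to 32-bit, roughly halving memory for those columns
for col in df.select_dtypes(include='float64').columns:
    df[col] = df[col].astype(np.float32)

# integer columns can be downcast the same way, e.g. int64 -> int32
for col in df.select_dtypes(include='int64').columns:
    df[col] = df[col].astype(np.int32)

# check the memory footprint after conversion
print(df.memory_usage(deep=True).sum())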