1. Build a large CSV file
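import pandas as pd

# Build 30 million rows; the long text in column4 pads each row so the CSV gets large.
# (The text roughly reads: "I refuse to believe it; I'm writing text this long to see how much it affects this CSV.")
dict_array = []
for i in range(30000000):
    dict_array.append({"column1": i, "column2": "", "column3": i, "column4": "我就不信了,我写一个这么长的文本,到底对这个csv会产生多大的影响"})
data = pd.DataFrame(dict_array)
data.to_csv(r'C:\Users\84977\Desktop\ellis.csv', index=False)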
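Holding 30 million dicts in a list before building the DataFrame can use a lot of memory. As a minimal sketch of an alternative, the rows can be streamed to disk in batches with the standard `csv` module. The function name `write_big_csv`, the `batch_size` parameter and the filler text are hypothetical, chosen only for illustration; the column names follow the snippet above.

```python
import csv


def write_big_csv(path, n_rows, batch_size=100_000):
    """Write n_rows rows to path in batches, without holding them all in memory."""
    # Hypothetical filler text; any long string makes the file bigger.
    text = "a long piece of filler text that pads every row of the csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["column1", "column2", "column3", "column4"])
        for start in range(0, n_rows, batch_size):
            stop = min(start + batch_size, n_rows)
            # The generator yields one row at a time, so only a single
            # row is built in Python before it is written out.
            writer.writerows((i, "", i, text) for i in range(start, stop))
    return n_rows
```

Calling `write_big_csv(path, 30000000)` with a path of your choice should produce a file with the same four columns, though not necessarily byte-for-byte the same as the pandas output.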
2. Read the data with pandas and time it
import pandas as pd
import time
start = time.time()
# pd.read_csv parses the whole file into memory before it returns
pandas_df = pd.read_csv(r'C:\Users\84977\Desktop\ellis.csv')
end = time.time()
print("Read csv with pandas:", (end - start), "sec")
Read csv with pandas: 27.334847688674927 sec
3. Read with pandas using chunksize
import pandas as pd
import time
# With chunksize set, read_csv returns a lazy TextFileReader that yields 1,000,000-row chunks
# until the file is exhausted; iterator=True is kept here but is not required for that.
start = time.time()
chunk_reader = pd.read_csv(r'C:\Users\84977\Desktop\ellis.csv', chunksize=1000000, iterator=True)
end = time.time()
# This times only the creation of the reader; the rows are parsed in the loop below.
print("Read csv with pandas (chunksize):", (end - start), "sec")
for i, data in enumerate(chunk_reader):
    print(data)
Read csv with pandas (chunksize): 0.00599980354309082 sec
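The few milliseconds reported for the chunked read cover only the creation of the reader, so they are not directly comparable with the full pandas read in step 2. A minimal sketch of a fairer measurement is to time one complete pass over all chunks while doing some per-chunk work. The helper name `sum_column_in_chunks` is hypothetical, and summing a column is only an illustrative workload.

```python
import time

import pandas as pd


def sum_column_in_chunks(path, column, chunksize=1_000_000):
    """Return (total, n_rows, seconds) for one full pass over the CSV in chunks."""
    start = time.perf_counter()
    total = 0
    n_rows = 0
    # Each iteration parses the next `chunksize` rows, so only one
    # chunk needs to be in memory at a time.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += int(chunk[column].sum())
        n_rows += len(chunk)
    return total, n_rows, time.perf_counter() - start
```

For example, `sum_column_in_chunks(r'C:\Users\84977\Desktop\ellis.csv', "column3")` would walk the whole file; the elapsed time it returns includes the parsing that the timing above leaves out.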
4. Read the CSV with dask
pip install dask
from dask import dataframe as dd
import time
start = time.time()
dask_df = dd.read_csv(r'C:\Users\84977\Desktop\ellis.csv')
end = time.time()
print("Read csv with dask: ",(end-start),"sec")
Read csv with dask: 0.007996797561645508 sec
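Looks very fast, doesn't it? Keep in mind that, like the chunked read in step 3, this measures only the setup: dd.read_csv builds a lazy dataframe and does not parse the full file until a result is actually computed.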