I have a large CSV file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs etc. Obviously just trying to read it normally:
df = pd.read_csv('Check400_900.csv', sep='\t')
doesn't work, so I found iterator and chunksize in a similar post and used
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
All good, I can for example print df.get_chunk(5) and iterate through the whole file with just
for chunk in df:
    print(chunk)
My problem is that I don't know how to use things like the ones below on the whole df, not just on one chunk:
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
I hope my question is not too confusing.
Solution
I think you need to concatenate the chunks into a DataFrame, because the output of:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame but a pandas.io.parsers.TextFileReader - source.
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
# <pandas.io.parsers.TextFileReader object at ...>
df = pd.concat(tp, ignore_index=True)
I think it is necessary to pass ignore_index=True to concat, to avoid duplicate index values across chunks.
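To make the whole pipeline concrete, here is a minimal, self-contained sketch. It uses a tiny in-memory TSV in place of Check1_900.csv, and the UserID/Amount data is made up purely for illustration:

```python
import io

import pandas as pd

# Illustrative stand-in for Check1_900.csv (hypothetical data and columns)
data = "UserID\tAmount\n1\t10\n2\t20\n1\t30\n2\t40\n1\t50\n"

# Reading with chunksize returns a TextFileReader, not a DataFrame
tp = pd.read_csv(io.StringIO(data), sep='\t', chunksize=2)

# Concatenate all chunks into one DataFrame; ignore_index=True renumbers
# rows 0..n-1 instead of repeating each chunk's own 0..chunksize-1 index
df = pd.concat(tp, ignore_index=True)

# Now the usual whole-DataFrame operations work
print(df.describe())
y3 = df.groupby('UserID').size()
print(y3)
```

If even the concatenated DataFrame won't fit in memory, an alternative is to compute aggregates per chunk (e.g. chunk.groupby('UserID').size()) and combine the partial results with Series.add(..., fill_value=0) instead of concatenating the raw rows.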