pandas读取前几行数据_读取pandas数据框的前几行的方法

本文介绍了如何利用pandas的nrows参数高效地从大型CSV文件中只读取前几行数据,以获取样本,避免加载整个文件。
摘要由CSDN通过智能技术生成

Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the length of the lines ahead of time? I have a large file that takes a long time to read, and occasionally only want to use the first, say, 20 lines to get a sample of it (and prefer not to load the full thing and take the head of it).

If I knew the total number of lines I could do something like footer_lines = total_lines - n and pass this to the skipfooter keyword arg. My current solution is to manually grab the first n lines with python and StringIO it to pandas:

import pandas as pd

from StringIO import StringIO

n = 20

with open('big_file.csv', 'r') as f:

head = ''.join(f.readlines(n))

df = pd.read_csv(StringIO(head))

It's not that bad, but is there a more concise, 'pandasic' (?) way to do it with keywords or something?

解决方案

I think you can use the nrows parameter. From the docs:

nrows : int, default None

Number of rows of file to read. Useful for reading pieces of large files

which seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):

In [1]: import pandas as pd

In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)

CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s

Wall time: 0.00 s

In [3]: len(z)

Out[3]: 20

In [4]: time z = pd.read_csv("P00000001-ALL.csv")

CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s

Wall time: 30.23 s

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值