pandas: Read a small random sample from a big CSV file into a Python data frame

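The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?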

8 answers

55 votes

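Assuming no header in the CSV file:

import random
import pandas

n = 1000000  # number of records in file
s = 10000  # desired sample size
filename = "data.txt"
# line numbers to skip: every line except s randomly chosen ones
skip = sorted(random.sample(range(n), n - s))
# header=None so that the first kept line is read as data, not as column names
df = pandas.read_csv(filename, skiprows=skip, header=None)

It would be better if read_csv had a keeprows parameter, or if skiprows took a callback function instead of a list.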

With a header and unknown file length:

import random
import pandas

filename = "data.txt"
with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

dlm answered 2020-02-09T17:34:30Z

31 votes

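@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as an argument.

If you can specify what percentage of lines you want, rather than how many lines, you don't even need to get the file size, and you only need to read through the file once. Assuming a header on the first row:

import pandas as pd
import random

filename = "data.txt"  # assumed, same file name as in the answer above
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)

Or, if you want to take every n-th row: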

n = 100 # every 100th line = 1% of the lines

df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
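The callable form also covers the case where an exact number of rows is wanted: pick the line numbers to keep and skip everything else. A minimal sketch, assuming a header on the first line and no line breaks inside quoted fields; the function name sample_rows and its seed parameter are illustrative and not part of pandas:

```python
import random

import pandas as pd


def sample_rows(filename, s, seed=None):
    """Read the header plus s randomly chosen data rows (all rows if fewer exist)."""
    rng = random.Random(seed)
    # First pass: count data rows (the header is line 0 and is not counted).
    with open(filename) as f:
        n = sum(1 for _ in f) - 1
    # Choose the line numbers to KEEP; a set makes the membership test cheap.
    keep = set(rng.sample(range(1, n + 1), min(s, n)))
    # pandas calls the function with each line index; returning True skips it.
    return pd.read_csv(
        filename,
        header=0,
        skiprows=lambda i: i > 0 and i not in keep,
    )
```

This still reads the file twice (once to count, once to parse), but only the kept line numbers are stored, which is a much smaller list than the skip list when the sample is small.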

exp1orer answered 2020-02-09T17:34:59Z

20 votes

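This is not in pandas, but it achieves the same result much faster through bash, without reading the whole file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.

Relevant question: https://unix.stackexchange.com/q/108581

Benchmark on a 7-million-line CSV (2008):

Top answer: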

import random
import pandas

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
    s = 100000  # desired sample size
    skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

Timing for pandas:

%time pd_read()

CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s

Wall time: 18.9 s

While using shuf:

time shuf -n 100000 2008.csv > temp.csv

real 0m1.583s

user 0m1.445s

sys 0m0.136s

So shuf is about 12x faster and, importantly, does not read the whole file into memory.

Bar answered 2020-02-09T17:35:50Z

10 votes

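Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), then with probability m/i it uses that sample to replace a randomly chosen one of the samples already selected.

By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.

See the code below: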

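import random

filename = "data.txt"  # assumed path; the original snippet left the file object f undefined
n_samples = 10
samples = []

with open(filename) as f:
    for i, line in enumerate(f):
        if i < n_samples:
            # keep the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            # with probability n_samples / (i + 1), overwrite a randomly chosen kept line
            samples[random.randint(0, n_samples - 1)] = line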
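For reference, a self-contained variant of the same idea (reservoir sampling) that draws a single random index per line and sets the header row aside. This is a sketch under stated assumptions: the function name, the path argument, and the seed parameter are made up for illustration, and the first line is assumed to be a header.

```python
import random


def reservoir_sample_lines(path, m, seed=None):
    """Return (header, sample): sample holds up to m data lines chosen uniformly."""
    rng = random.Random(seed)
    sample = []
    with open(path) as f:
        header = next(f, "")  # assumes the first line is a header row
        for i, line in enumerate(f):  # i is the 0-based index of the data line
            if not line.endswith("\n"):
                line += "\n"  # normalize a final line that lacks a newline
            if i < m:
                sample.append(line)  # fill the reservoir with the first m lines
            else:
                j = rng.randint(0, i)  # uniform over 0..i inclusive
                if j < m:  # true with probability m / (i + 1)
                    sample[j] = line  # overwrite the slot that j points at
    return header, sample
```

The result can be handed to pandas with pd.read_csv(io.StringIO(header + "".join(sample))), so only the m sampled lines are held in memory at any time.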

desktable answered 2020-02-09T17:36:23Z

2 votes

The following code first reads the header, and then a random sample of the other lines:

import pandas as pd

import numpy as np

filename = 'hugedatafile.csv'

nlinesfile = 10000000

nlinesrandomsample = 10000

lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)

df = pd.read_csv(filename, skiprows=lines2skip)

queise answered 2020-02-09T17:36:43Z

1 votes

from random import randint

class magic_checker:
    """Sentinel for iter(): compares equal once it has been compared target_count times."""
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)
# binary mode, so that seeking to an arbitrary byte offset is well defined
with open("big.csv", "rb") as f:
    f.seek(seek_target)
    f.readline()  # discard this line
    rand_lines = [line.decode() for line in iter(lambda: f.readline(), magic_checker(nlines))]

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

I think something like this should work.

Joran Beasley answered 2020-02-09T17:37:03Z

1 votes

No pandas!

import random
from os import fstat
from sys import exit

# binary mode: Python 3 only allows relative seeks on binary files
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline().decode())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print(" min_bytes_to_skip")
            print(" max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline().decode())

print(sampled_lines)

You'll end up with a sampled_lines list. What kind of statistics do you mean?
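If the sampled lines come from a CSV file, one possible way to get from such a list to the simple statistics the question asks about is to parse the lines with pandas. A rough sketch, assuming the list holds str lines without the header row; the helper name lines_to_frame, the column names, and the example values are hypothetical:

```python
import io

import pandas as pd


def lines_to_frame(lines, column_names=None):
    """Parse raw CSV lines (str) into a DataFrame."""
    # Strip line endings first so a line without a trailing newline cannot merge with the next.
    text = "\n".join(line.rstrip("\r\n") for line in lines)
    # header=None because sampled lines do not include the file's header row.
    return pd.read_csv(io.StringIO(text), header=None, names=column_names)


# Hypothetical sampled lines standing in for a real sampled_lines list.
example_lines = ["1,2.5\n", "2,3.5\n", "3,4.5\n"]
df = lines_to_frame(example_lines, column_names=["id", "value"])
print(df["value"].mean())  # average of one sampled column
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
```

Note that seeking to a random byte and taking the next line favors lines that follow long lines, so statistics computed this way are only approximate when line lengths vary a lot.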

Vagner Guedes answered 2020-02-09T17:37:27Z

1 votes

Use subsample:

pip install subsample

subsample -n 1000 file.csv > file_1000_sample.csv

Zhongjun 'Mark' Jin answered 2020-02-09T17:37:48Z
