Reading large CSV files in Python: pandas - Read a small random sample from a big CSV file into a Python data frame...

Latest recommended article published on 2023-05-26 16:24:16

weixin_39782433

Tags: reading large CSV files in Python

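pandas - Read a small random sample from a big CSV file into a Python data frame

The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random lines from it and compute some simple statistics on the resulting data frame?

8 answers

55 votes

Assuming the CSV file has no header: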

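import pandas
import random

n = 1000000  # number of records in the file
s = 10000  # desired sample size
filename = "data.txt"
# range() replaces xrange(), which no longer exists on Python 3
skip = sorted(random.sample(range(n), n - s))
# header=None: the file has no header, so the first kept row must stay a data row
df = pandas.read_csv(filename, skiprows=skip, header=None)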

It would be better if read_csv had a keeprows option, or if skiprows took a callback function instead of a list.

With a header and unknown file length:

import pandas
import random

filename = "data.txt"
with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of records in the file (excludes header)
s = 10000  # desired sample size
# line 0 is the header, so it is never put in the skip list
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)

dlm answered 2020-02-09T17:34:30Z
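On pandas 0.20.0 or later, where skiprows accepts a callable, the same fixed-size sample can be drawn by remembering the s line numbers to keep instead of building a sorted list of the n - s line numbers to skip. The following is only a sketch under stated assumptions: the file has a header row, every record sits on one physical line, and the names `sample_csv_rows` and `seed` are hypothetical, not part of pandas.

```python
import random

import pandas as pd


def sample_csv_rows(path, sample_size, seed=None):
    """Read the header plus `sample_size` randomly chosen data rows from `path`."""
    # First pass: count the data rows (total lines minus the header line).
    with open(path) as fh:
        n_rows = sum(1 for _ in fh) - 1

    rng = random.Random(seed)
    k = min(sample_size, n_rows)
    # Line numbers to keep; data rows are lines 1..n_rows, the header is line 0.
    keep = set(rng.sample(range(1, n_rows + 1), k))
    keep.add(0)

    # Second pass: pandas skips every line whose number is not in `keep`.
    return pd.read_csv(path, skiprows=lambda i: i not in keep)
```

The set holds only about s integers, so memory use no longer grows with the number of skipped lines. The file is still read twice, once for counting and once for parsing.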

31 votes

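@dlm's answer is great, but since v0.20.0 skiprows does accept a callable. The callable receives the row number as its argument.

If you can specify what percentage of the lines you want, rather than how many lines, you don't even need to get the file size, and you only need to read through the file once. Assuming a header on the first row: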

import pandas as pd
import random

filename = "data.txt"  # assumed path; the original snippet left it undefined
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of the lines:
# a row is skipped whenever the random draw from [0, 1) is greater than p
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)

Or, if you want to take every n-th row:

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)

exp1orer answered 2020-02-09T17:34:59Z
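A different technique, not taken from the answers here, is to read the file in chunks with the chunksize argument of read_csv and call DataFrame.sample on each chunk. The sketch below assumes a header row; the function name `sample_fraction_chunked`, the default chunk size, and the per-chunk seeding scheme are all illustrative choices.

```python
import pandas as pd


def sample_fraction_chunked(path, frac, chunksize=100_000, seed=None):
    """Sample roughly `frac` of the rows of a CSV file, one chunk at a time."""
    parts = []
    for k, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        # Use a different seed per chunk so that chunks do not all
        # select rows at the same relative positions.
        state = None if seed is None else seed + k
        parts.append(chunk.sample(frac=frac, random_state=state))
    if not parts:
        # No data rows at all: return an empty frame that still has the columns.
        return pd.read_csv(path, nrows=0)
    return pd.concat(parts, ignore_index=True)
```

Only one chunk plus the rows sampled so far are held in memory at any time. The result is stratified by chunk rather than a simple random sample of the whole file, which is usually acceptable for summary statistics but is worth keeping in mind.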

20 votes

This is not done in pandas, but the same result can be achieved much faster from bash, without reading the whole file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

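The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.

Relevant question: [https://unix.stackexchange.com/q/108581]

A benchmark on a CSV file with 7 million lines is available here (2008):

Top answer: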

import random
import pandas

def pd_read():
    filename = "2008.csv"
    with open(filename) as f:
        n = sum(1 for line in f) - 1  # number of records in the file (excludes header)
    s = 100000  # desired sample size
    # line 0 is the header, so it is never put in the skip list
    skip = sorted(random.sample(range(1, n + 1), n - s))
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

Timing for pandas:

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

While using shuf:

time shuf -n 100000 2008.csv > temp.csv

real 0m1.583s
user 0m1.445s
sys 0m0.136s

So shuf is about 12 times faster and, importantly, does not read the whole file into memory.

Bar answered 2020-02-09T17:35:50Z

10 votes

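Here is an algorithm that does not require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), then with probability m / i it uses that sample to replace a randomly chosen sample among those already selected.

By doing so, for any i > m, we always have a subset of m samples selected uniformly at random from the first i samples.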
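This invariant can be checked by induction. As a sketch, write S_i for the set of samples held after the first i items and x_j for the j-th item (notation introduced here for illustration), and assume each of the first i - 1 items is in S_{i-1} with probability m / (i - 1), which holds trivially for i - 1 = m:

```latex
\begin{align*}
P(x_i \in S_i) &= \frac{m}{i}, \\
P(x_j \in S_i) &= P(x_j \in S_{i-1}) \left(1 - \frac{m}{i} \cdot \frac{1}{m}\right)
                = \frac{m}{i-1} \cdot \frac{i-1}{i}
                = \frac{m}{i}, \qquad j < i .
\end{align*}
```

The factor in parentheses is the probability that x_j survives step i: the new item is accepted with probability m / i and, if accepted, evicts x_j with probability 1 / m.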

See the code below:

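import random

filename = "data.txt"  # assumed path; the original snippet used an undefined file object f
n_samples = 10
samples = []

with open(filename) as f:
    for i, line in enumerate(f):
        if i < n_samples:
            # fill the reservoir with the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            # accept line i+1 with probability n_samples / (i+1), evicting a random slot
            samples[random.randint(0, n_samples - 1)] = line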

desktable answered 2020-02-09T17:36:23Z
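For a CSV file with a header row, the same single-pass idea can be wrapped in a small function. This is a minimal sketch using only the standard library; the name `reservoir_sample_csv` and the `seed` parameter are hypothetical, and the replacement step uses the common formulation that draws one index per row, which is a variant of the code above rather than a copy of it.

```python
import csv
import random


def reservoir_sample_csv(path, k, seed=None):
    """Return (header, rows): the header and k data rows picked uniformly at random."""
    rng = random.Random(seed)
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)  # None if the file is empty
        reservoir = []
        for i, row in enumerate(reader):
            if i < k:
                # The first k data rows fill the reservoir.
                reservoir.append(row)
            else:
                # Draw j uniformly from 0..i; the row is kept only if j < k,
                # which happens with probability k / (i + 1).
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = row
    return header, reservoir
```

The result can be turned into a data frame with `pd.DataFrame(rows, columns=header)`. All values come back as strings, so numeric columns would still need converting before computing statistics.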

2 votes

The following code first reads the header, and then a random sample of the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
# choose the line numbers to skip from 1..nlinesfile, so the header (line 0) is always kept
lines2skip = np.random.choice(np.arange(1, nlinesfile + 1), (nlinesfile - nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

queise answered 2020-02-09T17:36:43Z

1 votes

from random import randint

class magic_checker:
    """Sentinel for iter(): compares equal once it has been compared target_count times."""
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0

    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)

with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  # discard this line, which is probably partial
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

I think something like this should work.

Joran Beasley answered 2020-02-09T17:37:03Z

1 votes

No pandas!

import random
from os import fstat
from sys import exit

# Binary mode: Python 3 only allows relative seeks on binary files,
# so the sampled lines are bytes objects
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print(" min_bytes_to_skip")
            print(" max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)

You will end up with a sampled_lines list. What kind of statistics do you mean?

Vagner Guedes answered 2020-02-09T17:37:27Z
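A related variant jumps to independent random byte offsets instead of skipping forward by random amounts. The sketch below is illustrative only: `sample_lines_by_offset` and `seed` are hypothetical names, the file is assumed to be UTF-8 with no header, and the sample is not uniform, because a line is more likely to be chosen when the line before it is long, and the same line can be drawn more than once.

```python
import os
import random


def sample_lines_by_offset(path, k, seed=None):
    """Return k lines found by seeking to random byte offsets in the file."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    if size == 0:
        return []
    lines = []
    with open(path, "rb") as fh:
        while len(lines) < k:
            # Jump to a random byte and drop the rest of the line we landed in.
            fh.seek(rng.randrange(size))
            fh.readline()
            line = fh.readline()
            if not line:
                # We landed in the last line: wrap around to the first one.
                fh.seek(0)
                line = fh.readline()
            lines.append(line.decode("utf-8").rstrip("\r\n"))
    return lines
```

Each draw costs one seek and two short reads, so the running time depends on k rather than on the file size. That makes it a reasonable option for a quick look at a very large file when exact uniformity is not required.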

1 votes

Use subsample:

pip install subsample

subsample -n 1000 file.csv > file_1000_sample.csv

Zhongjun 'Mark' Jin answered 2020-02-09T17:37:48Z
