Reading large CSV files in Python: python – Pandas when reading multiple CSV files into HDF5...

Latest recommended article published 2024-05-16 15:39:03

weixin_39932344


Tags: reading large CSV files in Python

Using Python 3, Pandas 0.12.

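I am trying to write multiple CSV files (7.9 GB in total) to an HDF5 store for later processing. Each CSV file contains roughly one million rows and 15 columns; the data types are mostly strings, with some floats. However, when I try to read the CSV files, I get the following error: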

Traceback (most recent call last):
  File "filter-1.py", line 38, in <module>
    to_hdf()
  File "filter-1.py", line 31, in to_hdf
    for chunk in reader:
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
  File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done

Edit:

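I managed to find one file that produces this problem. I think it is reading an EOF character, but I cannot get around it. Given the size of the combined files, I think checking every single character in every string is too cumbersome. (Even then, I would not be sure what to do.) As far as I have checked, the CSV files contain no strange characters that could raise the error.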
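The message "EOF inside string" indicates that the tokenizer reached the end of the file while it still considered itself inside a quoted field, so an unbalanced quote character is one plausible cause. The sketch below is a rough way to look for such lines without inspecting every character by hand. The function name `lines_with_odd_quotes` and its parameters are hypothetical, and the check assumes that quoted fields do not legitimately span several lines (such fields would show up as pairs of flagged lines).

```python
def lines_with_odd_quotes(path, quotechar='"', encoding='utf-8'):
    """Return 1-based numbers of lines holding an odd count of quote characters.

    Hypothetical helper: a line with an odd number of quotes either opens a
    quoted field that is never closed or belongs to a multi-line field.
    """
    suspects = []
    # errors='replace' keeps the scan going even if some bytes do not decode.
    with open(path, encoding=encoding, errors='replace', newline='') as handle:
        for number, line in enumerate(handle, start=1):
            if line.count(quotechar) % 2 == 1:
                suspects.append(number)
    return suspects
```

Running it over each file in the list and comparing the reported numbers with the line named in the error message may narrow the search to a handful of rows.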

I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.
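As documented, error_bad_lines deals with lines that have too many fields, which may be why it does not help with a tokenizing error. One option that is sometimes tried for this kind of error is to switch quote handling off with `quoting=csv.QUOTE_NONE`, so that quote characters are read as ordinary data. This is only a sketch on made-up sample data, and it assumes that no field relies on quotes to protect an embedded comma or line break; if some do, those rows would be split incorrectly.

```python
import csv
import io

import pandas as pd

# Made-up sample: the second line opens a quote that is never closed.
sample = 'id,name\n1,"unterminated\n2,ok\n'

# QUOTE_NONE makes the parser treat the quote character as plain text,
# so the unbalanced quote can no longer swallow the rest of the input.
frame = pd.read_csv(io.StringIO(sample), quoting=csv.QUOTE_NONE)

print(frame)
```

The stray quote stays in the data as a literal character, so it would still need cleaning afterwards.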

My code is as follows:

# -*- coding: utf-8 -*-

import pandas as pd
import os
from glob import glob


def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files


def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)

    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)

    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()
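Separately from the parser error, the loop above writes every chunk under the same key 'ta_data'. If the intent is to accumulate all chunks in one table, appending each chunk is one way to do it. The following is a minimal sketch, not the original code: the helper name `append_csvs` is hypothetical, and `store` is assumed to be any object with an `append(key, frame)` method, as `pandas.HDFStore` provides.

```python
import pandas as pd


def append_csvs(store, key, paths, chunksize=50000, **read_csv_kwargs):
    """Append every chunk of every CSV file in `paths` to `store` under `key`.

    Hypothetical helper. `store` is assumed to expose append(key, frame),
    as pandas.HDFStore does. Returns the total number of rows appended.
    """
    total_rows = 0
    for path in paths:
        # With chunksize set, read_csv yields DataFrames of at most that many rows.
        for chunk in pd.read_csv(path, chunksize=chunksize, **read_csv_kwargs):
            store.append(key, chunk)
            total_rows += len(chunk)
    return total_rows
```

With a real store this might be called as `append_csvs(pd.HDFStore('ta_store.h5'), 'ta_data', glob('2013-06*.csv'))`. Because the data is mostly strings, the `min_itemsize` argument of `HDFStore.append` may also be needed when later chunks contain longer strings than the first one.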

Edit:

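If I go into the CSV file that raises the CParserError EOF... and manually delete all lines after the line that causes the problem, the CSV file is read correctly. However, everything I delete is blank lines.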
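Lines that look blank in an editor can still hold invisible bytes. One hypothesis in line with the "EOF character" guess is a stray control byte, for example 0x1A (Ctrl-Z), which some Windows tools treat as an end-of-file marker. The sketch below scans a file in binary mode and reports such bytes; the name `find_control_bytes` and the choice of which bytes count as suspicious are assumptions made for illustration.

```python
import re

# Control bytes other than tab (0x09), line feed (0x0A) and carriage return (0x0D).
_CONTROL = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def find_control_bytes(path, block_size=1 << 20):
    """Return (offset, byte value) pairs for suspicious control bytes in a file.

    Hypothetical helper: reads the file in blocks so that large files
    do not have to fit in memory.
    """
    hits = []
    offset = 0
    with open(path, 'rb') as handle:
        for block in iter(lambda: handle.read(block_size), b''):
            for match in _CONTROL.finditer(block):
                hits.append((offset + match.start(), match.group()[0]))
            offset += len(block)
    return hits
```

An empty result would suggest that the trailing lines really are blank and that the cause lies elsewhere, for instance in an unbalanced quote earlier in the file.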

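Strangely, when I manually correct the faulty CSV files, they load into the store individually without problems. But when I again use a list of multiple files, the 'false' files still return the error.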
