python read csv dtype,熊猫read_csv low_memory和dtype选项

When calling

df = pd.read_csv('somefile.csv')

I get:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130:

DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype

option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?

解决方案

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently[source]

The reason you get this low_memory warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id.

It contains 10 million rows where the user_id is always numbers.

Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.

Specifying dtypes (should always be done)

adding

dtype={'user_id': int}

to the pd.read_csv() call will make pandas know when it starts reading the file, that this is only integers.

Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd

from StringIO import StringIO

csvdata = """user_id,username

1,Alice

3,Bob

foobar,Caesar"""

sio = StringIO(csvdata)

pd.read_csv(sio, dtype={"user_id": int, "username": object})

ValueError: invalid literal for long() with base 10: 'foobar'

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值