UnicodeDecodeError when reading CSV file in Pandas with Python

When processing a large batch of CSV files, some of them may raise UnicodeDecodeError. The best solutions are to determine each file's encoding (e.g. utf-8) and specify it in Pandas, or to handle the encoding errors with an error handler. You can detect a file's encoding with chardet or file -I, then read the file with that encoding.

Translated from: UnicodeDecodeError when reading CSV file in Pandas with Python

I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error...

   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

The source/creation of these files all come from the same place. What's the best way to correct this to proceed with the import?


#1

Reference: https://stackoom.com/question/1EFIZ/使用Python在Pandas中读取CSV文件时出现UnicodeDecodeError


#2

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8" for reading, and generally utf-8 for to_csv.

You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see the Python docs, also for numerous other encodings you may encounter).

See the relevant Pandas documentation, the Python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.

To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see its man page) or file -i (Linux) or file -I (OSX) (see the man page).
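As a rough stdlib-only alternative to enca/chardet (a sketch; the function name, candidate list, and order are my own, with the catch-all encodings kept last), you can read a sample of raw bytes and try candidate encodings until one decodes cleanly:

```python
def sniff_encoding(raw: bytes, candidates=('utf-8', 'cp1252', 'ISO-8859-1')):
    """Return the first candidate encoding that decodes `raw` without error.

    Note: ISO-8859-1 decodes any byte sequence, so keep it last as a fallback.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# Byte 0xda is an invalid UTF-8 continuation byte (as in the traceback above),
# but is a valid cp1252 character.
print(sniff_encoding(b'Importer\xda'))  # cp1252
```

This only inspects a sample, so for large files a decode error could still lurk beyond the bytes you checked.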


#3

Simplest of all solutions:

  • Open the csv file in the Sublime text editor.
  • Save the file in utf-8 format.

In Sublime, click File -> Save with encoding -> UTF-8

Then, you can read your file as usual:

import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

EDIT 1:

If there are many files, then you can skip the Sublime step.

Just read the file using

data = pd.read_csv('file_name.csv', encoding='utf-8')

and the other different encoding types are:

encoding = "cp1252"
encoding = "ISO-8859-1"
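If the 30,000-odd files do not all share one encoding, a small helper (a sketch; the function name and candidate order are my own, with ISO-8859-1 last since it accepts any byte) can try each encoding in turn:

```python
import pandas as pd

def read_csv_any(path, encodings=('utf-8', 'cp1252', 'ISO-8859-1')):
    """Try pd.read_csv with each candidate encoding until one decodes cleanly."""
    last_err = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_err = err
    raise last_err  # none of the candidates worked

# data = read_csv_any('file_name.csv')
```

Because ISO-8859-1 never fails, this always returns a DataFrame for the candidates above; whether the decoded text is meaningful still depends on the file.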

#4

Struggled with this a while and thought I'd post on this question as it's the first search result. Adding encoding="iso-8859-1" to pandas read_csv didn't work, nor did any other encoding; it kept giving a UnicodeDecodeError.

If you're passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
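A minimal sketch of the fix (the helper name is mine; the key point from this answer is that open(), not read_csv, gets the encoding when a handle is passed):

```python
import pandas as pd

def read_with_handle(path, encoding='iso-8859-1'):
    # Decode at open() time; read_csv then consumes already-decoded text,
    # so an encoding= argument on read_csv itself would come too late.
    with open(path, encoding=encoding) as fh:
        return pd.read_csv(fh)
```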


#5

Pandas allows you to specify the encoding, but does not allow you to ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method; the right approach depends on the actual use case.

  1. You know the encoding, and there is no encoding error in the file. Great: you just have to specify the encoding:

     file_encoding = 'cp1252' # set file_encoding to the file encoding (utf8, latin1, etc.)
     pd.read_csv(input_file_and_path, ..., encoding=file_encoding)

  2. You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. OK, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character of the same code):

     pd.read_csv(input_file_and_path, ..., encoding='latin1')

  3. You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and contains some lines in a different encoding. Pandas has no provision for special error processing, but the Python open function has (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameters to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:

     file_encoding = 'utf8' # set file_encoding to the file encoding (utf8, latin1, etc.)
     input_fd = open(input_file_and_path, encoding=file_encoding, errors='backslashreplace')
     pd.read_csv(input_fd, ...)
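The effect of the two error handlers can be seen directly on bytes (a minimal illustration; the sample bytes are made up to mirror the 0xda byte in the traceback above):

```python
raw = b'Importer\xda file'  # 0xda is an invalid UTF-8 continuation byte

print(raw.decode('utf-8', errors='backslashreplace'))  # Importer\xda file
print(raw.decode('utf-8', errors='ignore'))            # Importer file
```

With 'backslashreplace' the offending byte survives as a visible escape you can grep for later; with 'ignore' it is silently dropped.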

#6

with open('filename.csv') as f:
   print(f)

After executing this code you will see the encoding Python opened 'filename.csv' with (note: this is the platform default encoding, not one detected from the file's contents, so it may not match the file). Then execute code as follows:

data = pd.read_csv('filename.csv', encoding="encoding as you found earlier")

there you go
