Python csv.reader without a specified quote character: how to fix a Python file import with csv.DictReader that fails on an undefined character...

First of all, I found the following which is basically the same as my question, but it is closed and I'm not sure I understand the reason for closing vs. the content of the post. I also don't really see a working answer.

I have 20+ input files from 4 apps. All files are exported as .csv files. The first 19 files worked (4 others exported from the same app work) and then I ran into a file that gives me this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5762: character maps to <undefined>

If I looked that up right, it is a <ctrl> character. The relevant lines of code are below:

import csv

with open(file, newline='') as f:
    reader = csv.DictReader(f, dialect='excel')
    for line in reader:
        ...  # per-row processing not shown in the original post

I know I'm going to be getting a file. I know it will be a .csv. There may be some variance in what I get due to the manual generation/export of the source files. There may also be some strange characters in some of the files (e.g. Japanese, Russian, etc.). I provide this information because going back to the source to get a different file might just kick the can down the road until I have to pull updated data (or worse, someone else does).

So the question is probably multi-part:

1) Is there a way to tell the csv.DictReader to ignore undefined characters? (Hint for the codec: if I can't see it, it is of no value to me.)

2) If I do have "crazy" characters, what should I do? I've considered opening each input as a binary file, filtering out offending hex characters, writing the file back to disk and then opening the new file, but that seems like a lot of overhead for the program and even more for me. It's also a few JCL statements from being 1977 again.

3) How do I figure out what I'm getting as an input if it crashes while I'm reading it in?

4) I chose "dialect = 'excel'" because many of the inputs are Excel files that can be downloaded from one of the source applications. From the docs on DictReader, my impression is that this just defines the delimiter, quote character and EOL characters to expect/use. Therefore, I don't think this is my issue, but I'm also a Python noob, so I'm not 100% sure.

Solution

I posted the solution I went with in the comments above; it was to set the errors argument of open() to 'ignore':

with open(file, newline = '', errors='ignore') as f:

This is exactly what I was looking for in my first question in the original post above (i.e. whether there is a way to tell the csv.DictReader to ignore undefined characters).
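To make the effect of the errors argument concrete, here is a minimal sketch rather than the code from the original post. It assumes the failing file was being decoded as cp1252 (a common Windows default, in which byte 0x8f is undefined); the sample bytes and the read_rows helper are made up for illustration. It reads from an in-memory buffer so it runs without any files on disk.

```python
import csv
import io

# Hypothetical sample data: a small CSV containing byte 0x8f,
# which the cp1252 code page leaves undefined.
RAW = b"name,city\r\nAl\x8fice,Paris\r\n"


def read_rows(raw, encoding="cp1252", errors="strict"):
    """Decode raw CSV bytes and return the rows as a list of dicts.

    The encoding/errors arguments behave the same way as in open().
    """
    text = io.TextIOWrapper(
        io.BytesIO(raw), encoding=encoding, errors=errors, newline=""
    )
    with text:
        return list(csv.DictReader(text, dialect="excel"))


# errors="ignore" silently drops the undecodable byte.
print(read_rows(RAW, errors="ignore"))

# errors="replace" keeps a visible U+FFFD marker where the byte was.
print(read_rows(RAW, errors="replace"))

# The default (errors="strict") raises, as in the question.
try:
    read_rows(RAW)
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc)
```

Note that 'ignore' discards data without any trace, so 'replace' may be the safer choice when you want to see afterwards where characters were lost.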

Update: Later I did need to work with some of the Unicode characters and couldn't ignore them. For a Unicode .csv file produced by Excel, the fix was to open the file with the 'utf_8_sig' codec. That codec decodes the file as UTF-8 and strips the byte order mark (the UTF-8 BOM, bytes EF BB BF, not a UTF-16 BOM) that Windows programs such as Excel write at the start of the file to signal that it contains Unicode text.
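The difference the BOM makes is easiest to see in the header row. The following is a small sketch under the assumption that the file is UTF-8 with a leading BOM; the sample bytes and the header_names helper are invented for illustration and are not from the original post.

```python
import codecs
import csv
import io

# Hypothetical sample: a UTF-8 CSV export as it sits on disk, BOM first.
RAW = codecs.BOM_UTF8 + "name,city\r\nЮрий,東京\r\n".encode("utf-8")


def header_names(raw, encoding):
    """Decode the bytes with the given codec and return the CSV header names."""
    text = io.StringIO(raw.decode(encoding), newline="")
    return csv.DictReader(text).fieldnames


# Plain utf-8 keeps the BOM as U+FEFF glued to the first column name,
# so row["name"] lookups would fail with a KeyError.
print(header_names(RAW, "utf-8"))

# utf_8_sig strips the BOM, so the first column name comes out clean.
print(header_names(RAW, "utf_8_sig"))
```

With a real file, the equivalent would be passing encoding='utf_8_sig' to open() alongside newline=''.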
