Python file operations and decoding: Python - Dealing with mixed-encoding files

Latest recommended article published on 2020-12-21 14:33:04

weixin_39761558

Article tags: Python file operations, decoding

Article link: https://blog.csdn.net/weixin_39761558/article/details/111763999


I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.

I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.

cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014' # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)

But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:

"\x85".replace("\x85", u'\u2026')

UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

Any ideas for how to deal with this?

Solution

If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid UTF-8.
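To make that concrete, here is a minimal sketch, assuming Python 3 (the console sessions in this answer are Python 2). The helper name try_decode is hypothetical and exists only for this illustration. It shows that the stray byte 0x85 is rejected by both the ASCII and UTF-8 codecs but is a valid character in cp1252:

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


stray = b"\x85"  # the cp1252 byte for the horizontal ellipsis

as_ascii = try_decode(stray, "ascii")    # rejected: outside range(128)
as_utf8 = try_decode(stray, "utf-8")     # rejected: not a valid UTF-8 start byte
as_cp1252 = try_decode(stray, "cp1252")  # accepted: maps to U+2026

print(as_ascii, as_utf8, repr(as_cp1252))
```

The ASCII failure is the same one reported in the question: Python 2 implicitly decodes a byte string as ASCII when it is combined with a unicode string.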

However, Python's codecs let you register a callback to handle encoding and decoding errors with the codecs.register_error function. The callback receives the UnicodeDecodeError as its parameter, so you can write a handler that attempts to decode the offending data as "cp1252" and then continues decoding the rest of the string as UTF-8.
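As a rough illustration of that callback interface, here is a minimal sketch, assuming Python 3. The handler name inspect_and_replace and the list seen are hypothetical names made up for this example. The handler only records what it is given and substitutes U+FFFD, so it is not yet the cp1252 fallback itself:

```python
import codecs

seen = []  # hypothetical: records what the handler receives on each call


def inspect_and_replace(error):
    """Record the failing byte range, then substitute U+FFFD and move on."""
    if not isinstance(error, UnicodeDecodeError):
        raise error  # this sketch only deals with decoding errors
    # error.object is the full input; start/end delimit the bytes that failed.
    seen.append((error.start, error.end, error.object[error.start:error.end]))
    # Return the replacement text and the position at which to resume decoding.
    return "\ufffd", error.end


codecs.register_error("inspect_and_replace", inspect_and_replace)

text = b"ab\x85cd".decode("utf-8", "inspect_and_replace")
print(repr(text), seen)
```

The handler's return value is a tuple: the text to insert, and the index in the input at which the decoder should continue.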

In my UTF-8 terminal, I can build a string with incorrectly mixed encodings like this:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")

>>> print a

maçã ma��

>>> a.decode("utf-8")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data

I wrote the callback function described above and found a catch: even if you increment the position from which to resume decoding by 1, so that it starts on the next character, the error is raised again at the first out-of-range(128) character if that next character is also not UTF-8 and out of range(128). In other words, the decoding "walks back" when consecutive non-ASCII, non-UTF-8 characters are found.

The workaround is to keep a state variable in the error handler that detects this "walking back" and resumes decoding from the position of the last call to it. In this short example I implemented it as a global variable, which has to be manually reset to -1 before each call to the decoder:

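import codecs

# Position of the last byte handled by the error handler.
# Reset it to -1 before each call to the decoder.
last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error.object  # the byte string being decoded
    position = unicode_error.start
    # If the decoder "walked back", resume right after the last handled byte.
    if position <= last_position:
        position = last_position + 1
    last_position = position
    # Slice rather than index, so a byte string is decoded on Python 3 as well.
    new_char = string[position:position + 1].decode("cp1252")
    #new_char = u"_"  # alternative: substitute a placeholder character
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)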

And on the console:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")

>>> last_position = -1

>>> print a.decode("utf-8", "mixed")

maçã maçã
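The console session above is Python 2. For comparison, here is a sketch of the same idea for Python 3. The names cp1252_fallback and decode_mixed are hypothetical. It assumes that the Python 3 decoder resumes at the position the handler returns, so it keeps no global state. I have not verified whether the "walking back" behaviour described above can still occur there.

```python
import codecs


def cp1252_fallback(error):
    """Decode the bytes that UTF-8 rejected as cp1252, then resume after them."""
    if not isinstance(error, UnicodeDecodeError):
        raise error  # this sketch only deals with decoding errors
    bad_bytes = error.object[error.start:error.end]
    # errors="replace" covers the few byte values that cp1252 leaves undefined.
    return bad_bytes.decode("cp1252", errors="replace"), error.end


codecs.register_error("cp1252_fallback", cp1252_fallback)


def decode_mixed(raw):
    """Decode bytes that are mostly UTF-8 but contain stray cp1252 bytes."""
    return raw.decode("utf-8", errors="cp1252_fallback")


mixed = "maçã ".encode("utf-8") + "maçã ".encode("cp1252")
print(decode_mixed(mixed))
```

For a file such as the file.txt from the question, the contents would be read in binary mode and passed to decode_mixed. One caveat: any run of cp1252 bytes that happens to form valid UTF-8 is decoded as UTF-8, so this approach cannot detect every mis-encoded character.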
