Reading EUC-KR encoded files in Python: fixing broken encoding (using Python)


weixin_39940425


I have a bunch of text files containing Korean characters with the wrong encoding. Specifically, it seems the characters were encoded as EUC-KR, but the files themselves were saved as UTF-8 with BOM.

So far I managed to fix a file with the following:

1. Open the file with EditPlus (it shows the file's encoding as UTF8+BOM).
2. In EditPlus, save the file as ANSI.
3. Lastly, in Python:

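import codecs

# Read the ANSI-saved file, decoding its bytes as EUC-KR
with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()

# Overwrite the same file with the text re-encoded as UTF-8
with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))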

I want to automate this, but I have not been able to do so. I can open the original file in Python:

codecs.open(html, 'rb', encoding='utf-8-sig')

However, I haven't been able to figure out how to do the second step (saving the file as ANSI) in Python.

Solution

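I am presuming here that the text was originally encoded as EUC-KR, that those bytes were then misread as Latin-1, and that the result was encoded again as UTF-8. If so, encoding to Latin-1 is indeed the best way to get back to the original EUC-KR bytestring. (What Windows calls ANSI is typically code page 1252 on a Western-locale system rather than Latin-1 proper, but the two agree on the bytes 0xA1 to 0xFE that EUC-KR uses for Korean characters.)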
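To make that presumption concrete, here is a small Python 3 sketch that reproduces the damage on a sample string. The variable names are illustrative only, and the sequence of steps is the assumed history of the files, not something confirmed by the question.

```python
# Python 3 sketch of how the damage is assumed to have happened.
original = '미술'                          # sample Korean text

euc_kr_bytes = original.encode('euc-kr')    # the bytes the file should have held
misread = euc_kr_bytes.decode('latin1')     # wrongly interpreted as Latin-1 text
damaged = misread.encode('utf-8-sig')       # then saved as UTF-8 with a BOM

print(euc_kr_bytes)
print(damaged)
```

Every byte of the EUC-KR data becomes one character in the range U+0000 to U+00FF, which is why encoding back to Latin-1 recovers the original bytes exactly.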

Open the file as UTF8 with BOM, encode to Latin1, decode as EUC-KR:

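import io

# Decode as UTF-8 (dropping the BOM), recover the raw bytes, decode as EUC-KR
with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')

# Overwrite the same file with correctly encoded UTF-8
with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)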

I'm using io.open() here instead of codecs.open() as the more robust method; io is the I/O library from Python 3, and it is also available in Python 2.6 and later. In Python 3, the built-in open() is the same function.
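Since the goal is to automate the fix across many files, the same three steps could be wrapped in a function. This is a minimal Python 3 sketch: the name `fix_file` and the choice to write to a separate output path are assumptions for illustration, not part of the original answer. It raises an error rather than guessing when a file does not fit the double-encoding pattern.

```python
import io


def fix_file(src_path, dst_path):
    """Repair one double-encoded file (hypothetical helper).

    Reads src_path as UTF-8 with BOM, recovers the original EUC-KR bytes
    via Latin-1, and writes the decoded text to dst_path as UTF-8.
    Raises a UnicodeError subclass if the file does not match the
    assumed damage.
    """
    with io.open(src_path, encoding='utf-8-sig') as infh:
        damaged = infh.read()
    # Characters above U+00FF make encode() raise UnicodeEncodeError;
    # byte sequences invalid in EUC-KR make decode() raise UnicodeDecodeError.
    text = damaged.encode('latin1').decode('euc-kr')
    with io.open(dst_path, 'w', encoding='utf8') as outfh:
        outfh.write(text)
    return text
```

A caller could loop over the files, for example with `glob.glob('*.html')`, and catch `UnicodeError` to skip any file that is not damaged in this way.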

Demo (Python 2):

>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
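The demo above uses Python 2 syntax, where a plain string literal holds bytes. A Python 3 equivalent, sketched here with the same sample bytes, needs a `b` prefix on the literal and calls `print` as a function:

```python
# Python 3 version of the demo: bytes literal in, repaired text out.
broken = b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
fixed = broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
print(fixed)
```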
