Appending to the end of a file in Python: utf-8-sig BOM ends up in the middle of the file

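When you append to a file with Python's codecs module using the utf-8-sig encoding, a BOM (byte order mark) is written on every call. This is normal behavior, not a bug: codecs cannot detect what has already been written to the file, so the BOM can be written several times. The fix is to check before writing whether the file already has content, and to handle the BOM manually. Use utf-8-sig only when creating a new file, so that the BOM is added only where it is needed.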

I've noticed recently that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:

>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')

The following text ends up in the file, with a BOM written in front of each line (the BOM itself is invisible here):

123
123

Isn't that a bug? It doesn't seem logical.

Could anyone explain to me why it was done this way?

Why isn't the BOM prepended only when the file doesn't exist and needs to be created?

Solution

No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.

Then there are other use cases where the codec is used on a stream or bytestring (i.e. not with codecs.open()) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.

Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
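To make that concrete, here is a minimal sketch that uses in-memory byte strings instead of a file. It assumes each append is a separate encode call, which is how two separate codecs.open() calls behave; the variable names are only illustrative.

```python
import codecs

# Each separate encode call starts a fresh encoder, so each one emits a BOM.
first = u'123\n'.encode('utf-8-sig')
second = u'123\n'.encode('utf-8-sig')
data = first + second  # roughly what two separate appends leave behind

bom_count = data.count(codecs.BOM_UTF8)

# Decoding with utf-8-sig strips only the leading BOM; the second one
# survives as a U+FEFF character in the middle of the text.
text = data.decode('utf-8-sig')
```

Here the decoded text still contains a stray U+FEFF before the second "123", which is the "BOM in the middle of the file" effect from the question.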

If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:

import io

with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')

I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and is more robust than codecs for handling encoded files, in my experience.
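As a variation on the same idea, the check can be wrapped in a small helper that looks at the file size before opening the file. This is only a sketch: append_with_bom is a hypothetical name, not a standard-library function, and the check is not safe against another process writing to the file between the size test and the open.

```python
import codecs
import io
import os
import tempfile


def append_with_bom(path, text):
    """Append text as UTF-8, writing a BOM only if the file is missing or empty.

    Hypothetical helper for illustration; not part of the standard library.
    """
    needs_bom = not os.path.exists(path) or os.path.getsize(path) == 0
    with io.open(path, 'a', encoding='utf-8') as outfh:
        if needs_bom:
            # U+FEFF encoded as UTF-8 is the three-byte BOM.
            outfh.write(u'\ufeff')
        outfh.write(text)


# Usage: two appends to a fresh temporary file, then inspect the raw bytes.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
append_with_bom(demo_path, u'123\n')
append_with_bom(demo_path, u'123\n')
with io.open(demo_path, 'rb') as fh:
    raw = fh.read()
bom_count = raw.count(codecs.BOM_UTF8)
```

After the two appends, the file should start with a single BOM and contain no other.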

Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
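The difference can be seen from the BOM constants that the codecs module exposes. The snippet below is just an illustration of the point above; it encodes the same U+FEFF character in each encoding and compares the results.

```python
import codecs

# UTF-8 has a single possible BOM; UTF-16 has one per byte order.
utf8_bom = codecs.BOM_UTF8          # b'\xef\xbb\xbf'
utf16_le_bom = codecs.BOM_UTF16_LE  # b'\xff\xfe'
utf16_be_bom = codecs.BOM_UTF16_BE  # b'\xfe\xff'

# The same character, U+FEFF, produces all three byte sequences.
encoded_utf8 = u'\ufeff'.encode('utf-8')
encoded_le = u'\ufeff'.encode('utf-16-le')
encoded_be = u'\ufeff'.encode('utf-16-be')
```

The two UTF-16 forms are byte-swapped versions of each other, which is what lets a reader infer the byte order; the UTF-8 form carries no such information.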

The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (i.e. to recognize it as UTF-8 rather than one of the legacy code pages).
