Appending to the end of a file in Python: the utf-8-sig BOM ends up in the middle of the file when appending

weixin_39648297

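When you append to a file through Python's codecs module with the utf-8-sig encoding, a BOM (byte order mark) is written on every open; this is normal behavior, not a bug. The codec cannot detect what has already been written to the file, so the BOM can end up being written several times. The fix is to check before writing whether the file already has content and to handle the BOM manually. Use utf-8-sig only when creating a new file, so that the BOM is added only where it is needed.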

I've noticed recently that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:

>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')

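The following text ends up in the file, preceded by a BOM; every further append in this mode writes another BOM, which then sits in the middle of the file: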

123

Isn't that a bug? It seems illogical.

Could anyone explain why it was done this way?

Why isn't the BOM prepended only when the file doesn't exist and needs to be created?

Solution

No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.

Then there are other use-cases where the codec is used on a stream or bytestring (i.e. not with codecs.open()), where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.
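To make the stream and bytestring case concrete, here is a minimal Python 3 sketch. The variable names are illustrative and not from the original answer. It shows that the codec has nothing to inspect in these cases: a direct str.encode() call prepends the BOM every time, while a stream writer emits it once, on its first write.

```python
import codecs
import io

# Encoding a string directly: the BOM is prepended on every call,
# so concatenating two results leaves a BOM in the middle.
first = "123\n".encode("utf-8-sig")
second = "123\n".encode("utf-8-sig")
joined = first + second

# Wrapping an in-memory byte stream: the writer emits the BOM once,
# on its first write, and never again for that writer object.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8-sig")(buffer)
writer.write("123\n")
writer.write("456\n")
streamed = buffer.getvalue()

print(repr(joined))
print(repr(streamed))
```

In both cases the decision is made per encode call or per writer object, never by looking at what the destination already contains.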

Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
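The behavior described above can be reproduced with a short sketch, written for Python 3 and using a temporary directory so nothing is left behind. The helper name append_with_sig and the file name demo.txt are made up for illustration. It appends twice through codecs.open() and then inspects the raw bytes.

```python
import codecs
import os
import tempfile


def append_with_sig(path, text):
    """Append text through the utf-8-sig codec, as in the question."""
    with codecs.open(path, "a", encoding="utf-8-sig") as fh:
        fh.write(text)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")  # hypothetical file name
    append_with_sig(path, "123\n")
    append_with_sig(path, "123\n")
    with open(path, "rb") as fh:
        raw = fh.read()

# Each open-and-write cycle produced its own BOM.
bom_count = raw.count(codecs.BOM_UTF8)

# Decoding with utf-8-sig strips only the leading BOM; the second one
# survives as a U+FEFF character in the middle of the text.
decoded = raw.decode("utf-8-sig")

print(bom_count)
print(repr(decoded))
```

The stray U+FEFF in the decoded text is what makes a mid-file BOM a practical problem: it is invisible in most editors but still part of the data.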

If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:

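import io

# filename is assumed to hold the path of the file to append to
with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file (new or empty): write the BOM manually
        outfh.write(u'\ufeff')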

I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and is more robust than codecs for handling encoded files, in my experience.
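As a fuller sketch of the same idea for Python 3 (where the built-in open() is io.open()), the check can be wrapped in a small helper. The name append_text and the file name are hypothetical. The example also covers the pre-created but empty file mentioned earlier, and passes newline="" only so that the bytes written are the same on every platform.

```python
import codecs
import os
import tempfile


def append_text(path, text):
    """Append text as UTF-8, writing a BOM only if the file is still empty."""
    with open(path, "a", encoding="utf-8", newline="") as fh:
        # In append mode the position starts at the end of the file,
        # so 0 means the file is new or empty.
        if fh.tell() == 0:
            fh.write("\ufeff")
        fh.write(text)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")  # hypothetical file name
    open(path, "wb").close()  # pre-created but empty file
    append_text(path, "123\n")
    append_text(path, "456\n")
    with open(path, "rb") as fh:
        raw = fh.read()

print(repr(raw))
```

Checking the position rather than whether the file exists is the point of the design: an existing but empty file still gets its BOM, and a file with content never gets a second one.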

Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
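The difference is easy to see by encoding U+FEFF in each encoding. This is a small illustrative sketch; the constants come from the standard codecs module and the variable names are arbitrary.

```python
import codecs

# U+FEFF has exactly one byte sequence in UTF-8...
utf8_bom = "\ufeff".encode("utf-8")

# ...but two possible sequences in UTF-16, one per byte order.
utf16_le_bom = "\ufeff".encode("utf-16-le")
utf16_be_bom = "\ufeff".encode("utf-16-be")

print(utf8_bom, utf16_le_bom, utf16_be_bom)
```

A reader of UTF-16 data can tell the byte order from which of the two sequences it finds; for UTF-8 the mark carries no such information and only signals the encoding.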

The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (i.e. to tell that it is UTF-8 and not one of the legacy code pages).