I get an error with the following pattern:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)
I'm not sure what u'\ufeff' is; it shows up when I'm web scraping. How can I remedy the situation? The .replace() string method doesn't work on it.
Solution
The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:
#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print 'utf-8 %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le %r' % e16le
print 'utf-16be %r' % e16be
print 'utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8')
print 'utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le')
Note that EF BB BF is the UTF-8-encoded BOM. It is not required for UTF-8; it serves only as a signature (usually on Windows).
Output:
utf-8 'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16 '\xff\xfeA\x00B\x00C\x00' # Adds BOM and encodes using native processor endian-ness.
utf-16le 'A\x00B\x00C\x00'
utf-16be '\x00A\x00B\x00C'
utf-8 w/ BOM decoded with utf-8 u'\ufeffABC' # doesn't remove BOM if present.
utf-8 w/ BOM decoded with utf-8-sig u'ABC' # removes BOM if present.
utf-16 w/ BOM decoded with utf-16 u'ABC' # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le u'\ufeffABC' # doesn't remove BOM if present.
Note that the utf-16 codec requires the BOM to be present when decoding, or Python won't know whether the data is big- or little-endian.
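Applied to the web-scraping case, that means decoding the raw bytes of the page with 'utf-8-sig' (or whatever codec the page actually uses) before doing any string work. A minimal sketch, assuming the page is UTF-8 and using urllib2 with a placeholder URL purely for illustration:
#!python2
#coding: utf8
import urllib2

# placeholder URL for illustration; substitute the page you are scraping
raw = urllib2.urlopen('http://example.com/').read()   # byte string, may start with EF BB BF

# 'utf-8-sig' drops a leading BOM if one is present and behaves like plain 'utf-8'
# otherwise, so it is safe to use whether or not the page was saved with a signature.
text = raw.decode('utf-8-sig')

print repr(text[:20])   # no u'\ufeff' at the front
After decoding this way the BOM is already gone, so there is nothing left for .replace() to strip.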