Python illegal character ufeff: u'\ufeff' in a Python string

I get an error with the following pattern:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

Not sure what u'\ufeff' is, it shows up when I'm web scraping. How can I remedy the situation? The .replace() string method doesn't work on it.

Solution

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print 'utf-8 %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le %r' % e16le
print 'utf-16be %r' % e16be
print
print 'utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8')
print 'utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. UTF-8 does not need it, since UTF-8 has no byte-order ambiguity; it serves only as a signature (usually on Windows).

Output:

utf-8 'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16 '\xff\xfeA\x00B\x00C\x00' # Adds BOM and encodes using native processor endian-ness.
utf-16le 'A\x00B\x00C\x00'
utf-16be '\x00A\x00B\x00C'

utf-8 w/ BOM decoded with utf-8 u'\ufeffABC' # doesn't remove BOM if present.
utf-8 w/ BOM decoded with utf-8-sig u'ABC' # removes BOM if present.
utf-16 w/ BOM decoded with utf-16 u'ABC' # uses BOM to detect byte order, then removes it.
utf-16 w/ BOM decoded with utf-16le u'\ufeffABC' # doesn't remove BOM if present.

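Note that the utf-16 codec relies on the BOM to tell whether the data is big- or little-endian. Without one, a one-shot .decode('utf-16') falls back to the platform's native byte order, which may not match the data, while the utf-16 stream and incremental decoders raise an error.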
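
Applied to the scraping case in the question, the advice above might look like the sketch below. It assumes the page is UTF-8 and that you have its raw bytes. The helper names decode_page, strip_bom and remove_bom_chars are made up for illustration, and raw is a stand-in for real scraped data. The error was reported at position 155, so the U+FEFF may not be at the start of the text; for that case the last helper removes it wherever it appears, working on a decoded (unicode) string.

```python
# -*- coding: utf-8 -*-
import codecs

BOM = u'\ufeff'


def decode_page(raw_bytes):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return raw_bytes.decode('utf-8-sig')


def strip_bom(text):
    """Remove a single leading U+FEFF from already-decoded text."""
    return text[1:] if text.startswith(BOM) else text


def remove_bom_chars(text):
    """Remove every U+FEFF in already-decoded text, wherever it occurs."""
    return text.replace(BOM, u'')


# Stand-in for scraped bytes: a UTF-8 BOM followed by the payload.
raw = codecs.BOM_UTF8 + u'ABC'.encode('utf-8')

clean = decode_page(raw)                   # BOM removed by the codec
leftover = raw.decode('utf-8')             # plain utf-8 keeps U+FEFF
fixed = strip_bom(leftover)                # remove it after the fact
scattered = remove_bom_chars(u'AB' + BOM + u'C')
```

Decoding with utf-8-sig is harmless when no BOM is present, so it can be used whenever the input is known to be UTF-8. Stripping after decoding is the fallback when the decoding step is outside your control.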
