Python Unicode decoding: double-decoding Unicode in Python

Latest recommended article published 2023-07-11 16:54:46

weixin_39625468

Tags: Python Unicode decoding

I am working against an application that seems keen on returning what I believe to be double UTF-8 encoded strings.

I send the string u'XüYß' (that is, u'X\u00fcY\u00df') encoded using UTF-8, so it goes over the wire as X\xc3\xbcY\xc3\x9f.

The server should simply echo what I sent it, yet it returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (it should be X\xc3\xbcY\xc3\x9f). If I decode it using str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a ... unicode string containing the original string encoded using UTF-8.

But Python won't let me decode a unicode string without re-encoding it first - which fails for a reason that escapes me:

>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...

How do I persuade Python to re-decode the string? - and/or is there any (practical) way of debugging what's actually in the strings, without passing them through all the implicit conversion print uses?

(And yes, I have reported this behaviour to the developers of the server side.)

Solution

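In Python 2, ret.decode() on a unicode string first implicitly tries to encode ret with the system default encoding - in your case ascii. That implicit encode is what raises the UnicodeEncodeError.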
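As a rough illustration of that implicit step, here is a Python 3 sketch. Python 3 strings have no .decode method, so the helper py2_style_decode is hypothetical and only mimics what Python 2 appears to do here (encode with ascii, then decode):

```python
# Python 3 sketch: mimic the implicit encode that Python 2 performs
# when .decode() is called on a unicode string.

# The bytes the server returned, decoded once as UTF-8.
ret = b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')


def py2_style_decode(text, implicit_encoding='ascii'):
    """Hypothetical helper: encode implicitly first, then decode as UTF-8."""
    return text.encode(implicit_encoding).decode('utf-8')


try:
    py2_style_decode(ret)
    error = None
except UnicodeEncodeError as exc:
    # The failure comes from the encode step, not from the decode step.
    error = exc

print(repr(error))
```

The exception is raised before any UTF-8 decoding happens, which is why the message mentions the 'ascii' codec and "encode" even though the call was decode.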

If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:

>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'
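To see the whole round trip in one place, the following Python 3 sketch reproduces the double encoding and then undoes it. It assumes the server misreads the incoming UTF-8 bytes as Latin-1 before encoding them again; that is a guess at the server's behaviour, not something confirmed. The name undo_double_utf8 is made up for illustration:

```python
original = 'XüYß'

# What the client sends: one layer of UTF-8.
once = original.encode('utf-8')

# Assumed server bug: the bytes are misread as Latin-1 text,
# then encoded as UTF-8 a second time.
twice = once.decode('latin-1').encode('utf-8')


def undo_double_utf8(data):
    """Hypothetical helper: strip both UTF-8 layers from the server reply."""
    half_decoded = data.decode('utf-8')
    return half_decoded.encode('raw_unicode_escape').decode('utf-8')


repaired = undo_double_utf8(twice)

# ascii() shows every non-ASCII character as an escape, so the contents
# can be inspected without relying on the terminal encoding.
inspected = ascii(twice.decode('utf-8'))
print(inspected)
print(repaired)
```

Using ascii() (or repr() in Python 2) is one practical way to check what is really in a string, since the output contains only escapes and plain ASCII.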

Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:

>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
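If you want the strictness of latin1 where it works and the tolerance of raw_unicode_escape where it doesn't, the two can be combined. This is only a sketch; redecode is a hypothetical name, and it assumes the input really is UTF-8 that was misread one byte per character:

```python
def redecode(text):
    """Hypothetical helper: undo one layer of mis-decoding.

    Tries strict Latin-1 first. If the text contains characters above
    U+00FF, falls back to raw_unicode_escape, which writes them as
    backslash-u escapes instead of raising.
    """
    try:
        raw = text.encode('latin-1')
    except UnicodeEncodeError:
        raw = text.encode('raw_unicode_escape')
    return raw.decode('utf-8')


clean = redecode('X\xc3\xbcY\xc3\x9f')
mixed = redecode('€\xe2\x82\xac')

print(clean)
print(ascii(mixed))
```

The final decode('utf-8') can still raise UnicodeDecodeError if the text was never double encoded in the first place, so callers may want to catch that separately.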

In case you run into this sort of mixed data, you can use the codec again, to normalize everything:

>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
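The two passes above can be wrapped into one function. This is a minimal sketch under the same assumptions as the session above; repair_mixed is a hypothetical name:

```python
def repair_mixed(text):
    """Hypothetical helper: repair text mixing correct and double-encoded parts."""
    # Pass 1: characters above U+00FF become backslash-u escapes, the rest
    # become raw bytes; decoding as UTF-8 fixes the double-encoded parts.
    partially = text.encode('raw_unicode_escape').decode('utf-8')
    # Pass 2: encode again, then let the codec turn the escapes back
    # into real characters.
    return partially.encode('raw_unicode_escape').decode('raw_unicode_escape')


repaired_mixed = repair_mixed('€\xe2\x82\xac')
repaired_plain = repair_mixed('X\xc3\xbcY\xc3\x9f')

print(repaired_mixed)
print(repaired_plain)
```

One caveat: any literal backslash-u sequence that was genuinely part of the data is also turned into a character by the second pass, so this is only safe when such sequences are not expected in the input.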
