I have a UTF-8 character whose bytes are hex-encoded with '_' in between, e.g., '_ea_b4_80'.
I'm trying to convert it into the actual UTF-8 character using the replace method, but I can't get the correct encoding.
This is a code example:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
r = '_ea_b4_80'
r2 = '\xea\xb4\x80'
r = r.replace('_', '\\x')
print r
print r.encode("utf-8")
print r2
In this example, r is not the same as r2. This is the output:
\xea\xb4\x80
\xea\xb4\x80
관 <-- correctly shown
What might be wrong?
Solution
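\x is only meaningful in string literals; you can't use replace to add it. The escape is processed when the source code is parsed, so at runtime replace just produces a literal backslash followed by an x.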
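A minimal sketch of the difference, assuming Python 3 (the variable names replaced and literal are only for illustration):

```python
# replace() inserts a backslash character and an 'x' character;
# it does not create the bytes that an escape in a literal would.
replaced = '_ea_b4_80'.replace('_', '\\x')
literal = b'\xea\xb4\x80'

print(len(replaced))  # 12: each '_' became the two characters '\' and 'x'
print(len(literal))   # 3: three actual bytes
print(literal.decode('utf-8'))  # the three bytes decode to one character
```

The replaced string is twelve ordinary characters, while the literal holds the three bytes that actually encode the character.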
To get your desired result, convert to bytes, then decode:
import binascii
r = '_ea_b4_80'
rhexonly = r.replace('_', '') # Returns 'eab480'
rbytes = binascii.unhexlify(rhexonly) # Returns b'\xea\xb4\x80'
rtext = rbytes.decode('utf-8') # Returns '관' (unicode on Py2, str on Py3)
print(rtext)
This should print 관, as desired.
If you're using modern Py3, you can avoid the import by using the bytes.fromhex class method in place of binascii.unhexlify. This assumes r is in fact a str: bytes.fromhex, unlike binascii.unhexlify, only takes str inputs, not bytes inputs:
rbytes = bytes.fromhex(rhexonly) # Returns b'\xea\xb4\x80'
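Putting the Py3 steps together, a small helper might look like the sketch below. The function name decode_underscore_hex and its encoding parameter are hypothetical, chosen only for this example, and the input is assumed to be a str containing nothing but '_' separators and hex digits:

```python
def decode_underscore_hex(s, encoding='utf-8'):
    """Decode a string such as '_ea_b4_80' into text (Python 3 only)."""
    hex_digits = s.replace('_', '')      # '_ea_b4_80' -> 'eab480'
    raw = bytes.fromhex(hex_digits)      # 'eab480' -> b'\xea\xb4\x80'
    return raw.decode(encoding)          # bytes -> str


print(decode_underscore_hex('_ea_b4_80'))
```

If the input contains characters that are not hex digits, bytes.fromhex raises ValueError, so callers handling untrusted data may want to catch that.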