[Python] Character Encoding and Decoding

xl365t

Article link: https://blog.csdn.net/u010318270/article/details/81327972

# UTF-8 byte string (a Python 2 str; the bytes shown assume a UTF-8 terminal)
>>> utf8_str = "你好"
>>> utf8_str
'\xe4\xbd\xa0\xe5\xa5\xbd'

# unicode string
>>> unicode_str = u"你好"
>>> unicode_str
u'\u4f60\u597d'

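A unicode string is an in-memory representation of text (a sequence of code points), not a storage format.
To save it to a file, one option is to encode it into the byte encoding of a specific character set, such as UTF-8 or GBK.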

# unicode to UTF-8
>>> unicode_str.encode('utf8')
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> type(unicode_str.encode('utf8'))
<type 'str'>

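For readers on Python 3, where str is always Unicode text and bytes is a separate type, the conversion above might be sketched as follows. This is an illustration under stated assumptions, not part of the original session: the variable names and the temporary file hello.txt are made up for the example.

```python
import os
import tempfile

text = "你好"                       # Python 3 str: a sequence of Unicode code points

utf8_bytes = text.encode("utf-8")   # three bytes per character for these two characters
gbk_bytes = text.encode("gbk")      # two bytes per character in GBK

# Write the text to a file with an explicit encoding, then read it back.
path = os.path.join(tempfile.mkdtemp(), "hello.txt")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "rb") as f:
    raw = f.read()                  # the raw bytes stored on disk
with open(path, "r", encoding="utf-8") as f:
    restored = f.read()             # decoded back into text

print(utf8_bytes)                   # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(len(gbk_bytes))               # 4
print(raw == utf8_bytes, restored == text)
```

Passing the encoding explicitly to open() avoids depending on the platform default.
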
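Another option is to store the text with the unicode-escape codec, which writes each non-ASCII character as an ASCII \uXXXX escape, and to reverse the conversion when reading the file back.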

# unicode to unicode-escape
>>> unicode_escape_str = unicode_str.encode('unicode-escape')
>>> unicode_escape_str
'\\u4f60\\u597d'
# unicode-escape back to unicode
>>> unicode_escape_str.decode('unicode-escape')
u'\u4f60\u597d'
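
The same round trip can be sketched in Python 3, where the codec is spelled unicode_escape and the escaped form is a bytes object. The variable names are illustrative only.

```python
text = "你好"

# Encode to ASCII-only bytes that contain literal backslash-u escapes.
escaped = text.encode("unicode_escape")

# Decode the escapes back into the original text.
restored = escaped.decode("unicode_escape")

print(escaped)            # b'\\u4f60\\u597d'
print(restored == text)   # True
```
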

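A UTF-8 byte string is usually stored as-is, but there is another way to store its byte values: the string_escape codec (Python 2 only), which writes each non-ASCII byte as an ASCII \xNN escape.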

# UTF-8 to unicode
>>> utf8_str.decode('utf8')
u'\u4f60\u597d'
>>> utf8_str.decode('utf8') == unicode_str
True
# UTF-8 to string_escape
>>> string_escape_str = utf8_str.encode('string_escape')
>>> string_escape_str
'\\xe4\\xbd\\xa0\\xe5\\xa5\\xbd'
# string_escape back to UTF-8
>>> string_escape_str.decode('string_escape')
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print string_escape_str.decode('string_escape')
你好
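
Python 3 has no string_escape codec. One way to approximate it, sketched below, is to route the bytes through latin-1 and unicode_escape. The helpers escape_bytes and unescape_bytes are hypothetical names introduced for this example, and the result is only an approximation: it may differ from string_escape in details such as quote handling.

```python
def escape_bytes(data: bytes) -> bytes:
    """Write each non-ASCII byte as a literal backslash-x escape (hypothetical helper)."""
    # latin-1 maps every byte 0x00-0xff to the code point with the same number,
    # so unicode_escape then emits \xNN for each byte outside printable ASCII.
    return data.decode("latin-1").encode("unicode_escape")


def unescape_bytes(escaped: bytes) -> bytes:
    """Reverse escape_bytes; only intended for input produced by escape_bytes."""
    return escaped.decode("unicode_escape").encode("latin-1")


utf8_bytes = "你好".encode("utf-8")
escaped = escape_bytes(utf8_bytes)

print(escaped)                                  # b'\\xe4\\xbd\\xa0\\xe5\\xa5\\xbd'
print(unescape_bytes(escaped).decode("utf-8"))  # 你好
```
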

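Converting between URL encoding and Chinese text
# URLs often contain sequences such as "%E6%89%8B". This is URL (percent) encoding:
# each byte of the encoded text (UTF-8 here) is written as "%" followed by two hex digits.
# quote() and unquote() in the urllib library convert between Chinese text and URL encoding.

# URL-encoded to Chinese
>>> import urllib
>>> url_str = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA"
>>> decoded_url = urllib.unquote(url_str)  # not named str, which would shadow the built-in type
>>> print decoded_url
https://search.jd.com/Search?keyword=手机
# Chinese to URL-encoded; quote() also escapes ":", "?" and "=" because only "/" is safe by default
>>> urllib.quote(decoded_url)
'https%3A//search.jd.com/Search%3Fkeyword%3D%E6%89%8B%E6%9C%BA'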
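
In Python 3 these functions live in urllib.parse. The sketch below repeats the conversion and also shows two variations that are not in the original session: quoting only the query value, and passing safe so that the URL delimiters stay readable. The variable names are illustrative.

```python
from urllib.parse import quote, unquote

url = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA"

decoded = unquote(url)               # percent escapes to text, decoded as UTF-8 by default
keyword = quote("手机")              # encode just the query value
whole = quote(decoded)               # only "/" is left unescaped by default
kept = quote(decoded, safe=":/?=")   # keep the URL delimiters as they are

print(decoded)   # https://search.jd.com/Search?keyword=手机
print(keyword)   # %E6%89%8B%E6%9C%BA
print(whole)     # https%3A//search.jd.com/Search%3Fkeyword%3D%E6%89%8B%E6%9C%BA
print(kept)      # https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA
```

Quoting only the value, or widening safe, is usually what is wanted when building a URL, since quoting the whole string also escapes its delimiters.
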

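The error "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)" appears when Python 2 implicitly decodes a byte string with the default ASCII codec, for example when encode() is called on a str that already holds UTF-8 bytes.
The workaround shown here sets the default encoding to "utf8". It changes behavior for the whole process and is widely discouraged, so decoding explicitly with utf8_str.decode('utf8') is generally the safer fix.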

>>> utf8_str.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf8')
>>> utf8_str.encode('utf8')
'\xe4\xbd\xa0\xe5\xa5\xbd'
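
An alternative that avoids touching the process-wide default is to convert explicitly wherever text and bytes meet. Below is a minimal Python 3 sketch of that idea. The helpers to_utf8_bytes and to_text are hypothetical names introduced for this example, not functions from the original post or from any library.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        # Already bytes: check that they are valid UTF-8 instead of re-encoding them.
        value.decode("utf-8")
        return value
    return value.encode("utf-8")


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding bytes with an explicit encoding (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(to_utf8_bytes("你好"))                        # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(to_utf8_bytes(b"\xe4\xbd\xa0\xe5\xa5\xbd"))   # unchanged
print(to_text(b"\xe4\xbd\xa0\xe5\xa5\xbd") == "你好")
```
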