Python: ordinary str strings and unicode strings, and detecting and converting string encodings


weixin_34138255


Original link: https://my.oschina.net/tinyhare/blog/295293


The environment used for this article: CentOS release 6.4, kernel version 2.6.32-358.el6.x86_64, Python 2.6.6.

Contents: the two string-related magic methods __str__() and __unicode__(); the two functions str() and unicode(); type conversion with encode and decode; and encoding detection with chardet and cchardet.

First, a look at the two magic methods of an object.

The first: object.__str__(self)

Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object. The return value must be a string object.

(The “string object” here should mean a byte string, that is, a bytes object.)

The second: object.__unicode__(self)

Called to implement unicode() built-in; should return a Unicode object. When this method is not defined, string conversion is attempted, and the result of string conversion is converted to Unicode using the system default encoding.

In short: unicode() calls this method if it exists; otherwise the object is first converted to an ordinary string, which is then decoded to a Unicode string with the system default encoding.

str() and unicode()

str(object='')

Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. If no argument is given, returns the empty string, ''.

In short: str() calls the object's __str__() method to get a printable description. A string argument (a byte string) is returned as-is, and with no argument the result is the empty string.

unicode(object='')

unicode(object[, encoding[, errors]])

If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.

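In short: with no optional arguments, unicode() behaves like str() except that it returns a Unicode string rather than an 8-bit string. If the argument is already a Unicode string or an instance of a subclass, it is returned directly with no decoding applied.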

For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.

In short: for an object that provides a __unicode__() method, that method is called with no arguments. For any other object, its 8-bit string version or representation is requested first, and that string is then decoded to a Unicode string with the codec for the default encoding, in 'strict' mode.

If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, while a value of 'ignore' causes errors to be silently ignored, and a value of 'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded. See also the codecs module.

Simply put, the 8-bit string or buffer passed in is decoded with the specified encoding to produce a Unicode string, and the errors parameter specifies what to do with input that cannot be decoded.
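The two-argument form described above can be sketched as follows. This is a minimal illustration, not part of the original article: the name text_type is an assumption introduced so the same lines run on Python 2 (where the text type is unicode) and on Python 3 (where it is str), and 'no-such-encoding' is a deliberately invalid codec name.

```python
import sys

# Assumption for portability: Python 2 calls the text type `unicode`,
# Python 3 calls it `str`. On Python 2.6 this is simply unicode().
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821

utf8_bytes = b'\xe4\xbd\xa0\xe5\xa5\xbd'   # two Chinese characters encoded as UTF-8

# Same result as utf8_bytes.decode('utf-8').
decoded = text_type(utf8_bytes, 'utf-8')

# An unknown encoding name raises LookupError, as the documentation says.
try:
    text_type(utf8_bytes, 'no-such-encoding')
    unknown_encoding_error = None
except LookupError as exc:
    unknown_encoding_error = exc
```

Passing the encoding explicitly avoids any dependence on the system default encoding.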

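Summary:

str(): calls the object's __str__() method and produces an 8-bit string.

unicode(): calls the object's __unicode__() method and returns a Unicode string. If the object has no __unicode__() method, __str__() is called to produce an 8-bit string (this step is skipped when the argument is already an 8-bit string), and that string is then decoded with the system default encoding to produce a Unicode string.

So for classes we write ourselves, it is best to provide both __str__() and __unicode__(); they are very useful for logging, debugging, and the like.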
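A class that provides both methods might look like the sketch below. The class User and its name attribute are hypothetical, and the choice of UTF-8 in __str__() is an assumption; the version check is there only so the same code also runs on Python 3, where str is already Unicode.

```python
import sys

PY2 = sys.version_info[0] == 2


class User(object):  # hypothetical class, for illustration only
    def __init__(self, name):
        self.name = name  # expected to be a Unicode string

    def __unicode__(self):
        # Text form: always a Unicode string.
        return u'User(%s)' % self.name

    def __str__(self):
        if PY2:
            # Python 2: str() must return an 8-bit string, so encode
            # explicitly instead of relying on the default encoding.
            return self.__unicode__().encode('utf-8')
        # Python 3: str is already Unicode.
        return self.__unicode__()


user = User(u'\u4f60\u597d')
text = user.__unicode__()   # what unicode(user) returns on Python 2
as_str = str(user)          # what print and logging will use
```

Keeping all formatting in __unicode__() and letting __str__() only encode means the two representations cannot drift apart.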

Note: to check Python's system default encoding (the default that unicode() uses in the text above):

import sys
print sys.getdefaultencoding()

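Setting the default encoding in interactive mode:

>>> reload(sys) # required: site.py deletes sys.setdefaultencoding at startup, and reload restores it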
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf8')
>>> sys.getdefaultencoding()
'utf8'

The difference between ordinary strings and Unicode strings

First confirm that the system environment is set to UTF-8:
echo $LANG
en_US.UTF-8

Then make sure the terminal in use is also set to UTF-8; otherwise some of the experiments will give confusing results.

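1. An ordinary string is what we usually think of as a string: a buffer holding the string's bytes, and those bytes may be in any of several encodings. (I think of it as a C char array.)

>>> str_string = '你好，编码'

Depending on the encoding, the content of str_string might be:

utf8: '\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe7\xbc\x96\xe7\xa0\x81'

utf16: '\xff\xfe`O}Y\x0c\xff\x16\x7f\x01x'

utf32: '\xff\xfe\x00\x00`O\x00\x00}Y\x00\x00\x0c\xff\x00\x00\x16\x7f\x00\x00\x01x\x00\x00'

gb2312: '\xc4\xe3\xba\xc3\xa3\xac\xb1\xe0\xc2\xeb'

2. unicode is an object type that stores the string using the Unicode character set. (Internally it is stored as UCS-2 or UCS-4; at a lower level, depending on the platform, wchar_t, unsigned short, or unsigned long is used. For details, search the Python documentation for “Encodings and Unicode” and Py_UNICODE.)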

>>> unicode_string = u'你好，字符编码'
>>> unicode_string
u'\u4f60\u597d\uff0c\u5b57\u7b26\u7f16\u7801'

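In the case below, a unicode object should display as u'...' with \uxxxx escapes, but here it shows \x escapes instead. The cause is most likely a mismatch between the system encoding and the terminal encoding (I was using Xshell). In fact u'\xe4' == u'\u00e4', so every UTF-8 byte has been stored as a separate character. This easily leads to mojibake, so avoid such an environment.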

>>> unicode_string = u'你好，字符编码'
>>> unicode_string
u'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe5\xad\x97\xe7\xac\xa6\xe7\xbc\x96\xe7\xa0\x81'

See http://stackoverflow.com/questions/9845842/bytes-in-a-unicode-python-string
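The broken state shown above can be reproduced, and undone, with nothing but encode and decode. The sketch below is an illustration rather than something from the original article; it assumes the bytes were UTF-8 and were misread one byte per character, which is exactly what decoding as latin-1 does.

```python
raw = u'\u4f60\u597d'.encode('utf-8')    # 6 bytes of UTF-8 for 2 characters

# Misinterpreting the bytes: latin-1 maps every byte to the code point
# with the same number, giving a 6-character unicode object.
broken = raw.decode('latin-1')

# Repair: turn the code points back into bytes, then decode them correctly.
repaired = broken.encode('latin-1').decode('utf-8')
```

Here broken is u'\xe4\xbd\xa0\xe5\xa5\xbd', the same \x form as in the session above, and repaired is the original two-character string.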

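Summary:

Ordinary string (8-bit string, byte string):

It stores the string as a sequence of consecutive 8-bit bytes, so which encoding was used matters.

Unicode string:

It is an object that stores the string as UCS-2 (or UCS-4, decided when Python is built). It has no encoding distinctions, and ordinary strings in any encoding can be generated from it.

Using encode and decode

Once the difference between the two kinds of string is clear, these two methods are easy to understand.

encode turns a Unicode string into bytes; decode turns bytes back into a Unicode string.

To turn a Unicode string into an ordinary string, it has to be encoded with some encoding.

For an ordinary string, you need to know which encoding it is in, and decode it with that same encoding to get a Unicode string object.

So:

Using encode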

>>> a = u'你好，世界！'
>>> a
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> a.encode('utf8')
'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x81'
>>> a.encode('utf16')
'\xff\xfe`O}Y\x0c\xff\x16NLu\x01\xff'
>>> a.encode('gb2312')
'\xc4\xe3\xba\xc3\xa3\xac\xca\xc0\xbd\xe7\xa3\xa1'
>>> a.encode('ascii') # ASCII cannot represent Chinese characters, so this raises an error; a string such as u'Hello,World!' would work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
>>>

Using decode

>>> b = a.encode('utf8') # first produce an ordinary string in some encoding from a, then decode it; the two encodings must match!
>>> b.decode('utf8')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> b = a.encode('utf16')
>>> b.decode('utf16')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> b = a.encode('gb2312')
>>> b.decode('gb2312')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> b = a.encode('utf8')
>>> b.decode('gb2312') # mismatched encodings: b is UTF-8 encoded, so it cannot be decoded as gb2312
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 2-3: illegal multibyte sequence
>>> b.decode('gb2312','ignore') # this is where the second argument comes in: bad bytes can be ignored, or replaced with U+FFFD, so that no exception is raised
u'\u6d63\u30bd\u951b'
>>> print b.decode('gb2312','ignore')
浣ソ锛
>>> b.decode('gb2312','replace')
u'\u6d63\ufffd\u30bd\u951b\ufffd\ufffd\ufffd\ufffd\ufffd'
>>> print b.decode('gb2312','replace')
浣�ソ锛�����
>>>
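The three error handlers are easiest to compare on input that is certain to be invalid. The following sketch is not from the original session; it assumes a byte string ending in 0xff, a byte that can never occur in valid UTF-8.

```python
bad = b'abc\xff'   # 0xff is never valid in UTF-8

# 'strict' is the default and raises UnicodeDecodeError,
# which is a subclass of ValueError.
try:
    bad.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = bad.decode('utf-8', 'ignore')     # silently drops the bad byte
replaced = bad.decode('utf-8', 'replace')   # substitutes U+FFFD for it
```

'replace' keeps a visible marker where data was lost, which usually makes it the safer choice for logs and diagnostics than 'ignore'.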

And now a mixed example :-)

>>> a = u'你好，世界！'
>>> a
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> a.encode('utf8').decode('utf8')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> a.encode('utf8').decode('utf8').encode('gb2312')
'\xc4\xe3\xba\xc3\xa3\xac\xca\xc0\xbd\xe7\xa3\xa1'
>>> a.encode('utf8').decode('utf8').encode('gb2312').decode('gb2312')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> a.encode('utf8').decode('utf8').encode('gb2312').decode('gb2312').encode('utf16')
'\xff\xfe`O}Y\x0c\xff\x16NLu\x01\xff'
>>> a.encode('utf8').decode('utf8').encode('gb2312').decode('gb2312').encode('utf16').decode('utf16')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
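The chain above boils down to one reusable step: decode from the source encoding, then encode to the target encoding. A minimal sketch follows; the helper name transcode is hypothetical, and the byte values are the ones from the sessions above.

```python
def transcode(data, src, dst, errors='strict'):
    """Re-encode the byte string `data` from encoding `src` to encoding `dst`."""
    # The Unicode string is the neutral middle step between any two encodings.
    return data.decode(src, errors).encode(dst, errors)


gb_bytes = b'\xc4\xe3\xba\xc3\xa3\xac\xca\xc0\xbd\xe7\xa3\xa1'   # gb2312 bytes from above
utf8_bytes = transcode(gb_bytes, 'gb2312', 'utf-8')
round_trip = transcode(utf8_bytes, 'utf-8', 'gb2312')
```

Converting directly between two byte encodings is not possible; the data always passes through a Unicode string on the way.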

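Detecting the encoding of ordinary Python strings

Ordinary strings differ by encoding. We often read strings from the network or from a file, and when the other end is not our own program there is no telling which encoding it used. That is where encoding detection comes in.

A few caveats first:

1. In theory an encoding can never be detected with 100% certainty, because encodings overlap: the same byte sequence can be valid in several encodings while standing for different characters. All a detector can report is the encoding the string most likely uses. (For example, 'abc' encoded as UTF-8 will be detected as ASCII, because ASCII and UTF-8 encode a, b, and c identically. This is UTF-8's ASCII compatibility, and many other encodings are designed to be ASCII-compatible as well. That is why, on a screen full of mojibake, letters, digits, and underscores still come out right: abc123 decodes correctly under any ASCII-compatible encoding.)

2. Only encodings supported by the detection library can be detected; unsupported ones are out of reach. (Imagine a private encoding you defined yourself: how could anyone else detect it?)

3. The more characters the sample string contains, the more accurate the detection. Too few will not work: a gb2312 string of two or three Chinese characters cannot be detected correctly.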
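The first caveat can be checked with the standard library alone. The sketch below is an illustration, not part of the original article; the list of encodings is an arbitrary sample of ASCII-compatible codecs that ship with Python.

```python
ascii_bytes = b'abc123'

# ASCII-only bytes decode to the same text under every ASCII-compatible codec,
# so no detector can tell which of them was "really" used.
decodings = [ascii_bytes.decode(enc)
             for enc in ('ascii', 'utf-8', 'gb2312', 'big5', 'latin-1')]

# Non-ASCII bytes can also be valid in more than one encoding,
# while standing for completely different characters.
ambiguous = b'\xc4\xe3\xba\xc3'
as_gb2312 = ambiguous.decode('gb2312')    # two Chinese characters
as_latin1 = ambiguous.decode('latin-1')   # four Western European characters
```

Both decodings of ambiguous succeed without error, so only statistics over a longer sample can suggest which reading is the plausible one.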

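Back to the topic: let's start detecting!

First download an encoding detection library from PyPI: chardet, or cchardet (its documentation says it is faster, but it depends on another library).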

https://pypi.python.org/pypi?%3Aaction=search&term=chardet&submit=search

An example from the chardet documentation:

The easiest way to use the Universal Encoding Detector library is with the detect function.

>>> import urllib
>>> rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
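Once a guess is available, the usual next step is to decode with it. The sketch below is one possible way to do that, not something taken from the chardet documentation: decode_detected, min_confidence, and fallback are hypothetical names, and fake_detect is a stand-in that returns a dict of the same shape as chardet.detect so the example runs without chardet installed.

```python
def decode_detected(raw, detect, min_confidence=0.5, fallback='utf-8'):
    """Decode `raw` with the encoding guessed by `detect`.

    `detect` is any callable returning a dict with 'encoding' and
    'confidence' keys, which is the shape chardet.detect returns.
    Returns a (text, encoding_used) pair.
    """
    guess = detect(raw)
    encoding = guess.get('encoding')
    confidence = guess.get('confidence') or 0.0
    if not encoding or confidence < min_confidence:
        # The detector could not decide: use the fallback and mark bad bytes.
        return raw.decode(fallback, 'replace'), fallback
    return raw.decode(encoding, 'replace'), encoding


def fake_detect(raw):
    # Stand-in for chardet.detect, with a fixed answer for the demo.
    return {'encoding': 'GB2312', 'confidence': 0.99}


text, used = decode_detected(b'\xc4\xe3\xba\xc3', fake_detect)
```

With chardet installed, the call would be decode_detected(rawdata, chardet.detect). The confidence threshold of 0.5 is an arbitrary assumption and should be tuned to the data.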

A more advanced example:

If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.

import urllib
from chardet.universaldetector import UniversalDetector

usock = urllib.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)
    if detector.done: break
detector.close()
usock.close()
print detector.result
{'encoding': 'EUC-JP', 'confidence': 0.99}

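One more:

If you want to detect the encoding of multiple texts (such as separate files), you can reuse a single UniversalDetector object by calling detector.reset() at the start of each file: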

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print filename.ljust(60),
    detector.reset()
    for line in open(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print detector.result

Appendix: a fairly good article about Python encodings; it is in English but easy to follow.