Python has two commonly used string types: `str` and `bytes`.
- Converting between str and bytes:

```python
>>> str1 = 'hello world!'
>>> type(str1)  # check the data type of str1
<class 'str'>
>>> b = str1.encode('utf-8')  # str to bytes
>>> b, type(b)
(b'hello world!', <class 'bytes'>)
>>> str2 = b.decode('utf-8')  # bytes to str
>>> str2, type(str2)
('hello world!', <class 'str'>)
```
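With ASCII-only text the two types look almost identical, so here is a small sketch using non-ASCII text. The sample string `'你好'` is an assumption chosen only for illustration. It shows that the same `str` produces byte sequences of different lengths under different encodings, and that decoding with the matching codec restores the original text.

```python
text = '你好'  # sample text, chosen only for illustration

utf8_bytes = text.encode('utf-8')  # each of these characters takes 3 bytes in UTF-8
gbk_bytes = text.encode('gbk')     # each of these characters takes 2 bytes in GBK

print(len(text), len(utf8_bytes), len(gbk_bytes))  # 2 6 4

# Decoding with the matching codec round-trips to the same str.
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text
```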
- Converting between different encodings:
The usual approach is to decode the bytes to str using their original encoding, then encode that str to bytes in the target encoding.

```python
>>> strs = "常用的字符串格式"
>>> b = strs.encode('GBK')
>>> b
b'\xb3\xa3\xd3\xc3\xb5\xc4\xd7\xd6\xb7\xfb\xb4\xae\xb8\xf1\xca\xbd'
>>> str1 = b.decode('GBK')
>>> str1
'常用的字符串格式'
>>> str1.encode('UTF-8')
b'\xe5\xb8\xb8\xe7\x94\xa8\xe7\x9a\x84\xe5\xad\x97\xe7\xac\xa6\xe4\xb8\xb2\xe6\xa0\xbc\xe5\xbc\x8f'
```
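The decode-then-encode step can be wrapped in a small helper. This is a minimal sketch; the function name `transcode` and its parameters `src` and `dst` are hypothetical names introduced here for illustration.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from the src encoding to the dst encoding."""
    text = data.decode(src)  # bytes -> str, using the original encoding
    return text.encode(dst)  # str -> bytes, using the target encoding


gbk_bytes = "常用的字符串格式".encode("GBK")
utf8_bytes = transcode(gbk_bytes, "GBK", "UTF-8")
print(utf8_bytes.decode("UTF-8"))
```

If `src` does not match the real encoding of `data`, the `decode` call typically raises `UnicodeDecodeError`, or it silently produces wrong characters, so the caller has to know the original encoding.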
- Opening a file with a specific encoding

```python
>>> output = open('test', 'r', encoding='utf-8')
>>> output.read()
'百度百科——全球最大中文百科全书'
>>> output.close()
>>> output = open('test', 'r', encoding='GBK')
>>> output.read()
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    output.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa7 in position 10: illegal multibyte sequence
```
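The session above can be reproduced in a self-contained way. The sketch below writes a temporary UTF-8 file (the sample text and the temporary file are assumptions for illustration), reads it back with the matching encoding, and then reads it with the wrong codec using the `errors='replace'` argument of `open()` so that undecodable bytes are substituted instead of raising `UnicodeDecodeError`.

```python
import os
import tempfile

text = '百度百科'  # sample text, chosen only for illustration

# Write the sample text to a temporary file as UTF-8.
with tempfile.NamedTemporaryFile('w', encoding='utf-8',
                                 suffix='.txt', delete=False) as f:
    f.write(text)
    path = f.name

try:
    # Reading with the matching encoding returns the original text.
    with open(path, 'r', encoding='utf-8') as f:
        good = f.read()

    # Reading with the wrong codec: errors='replace' avoids the exception,
    # but the result is garbled text, not the original.
    with open(path, 'r', encoding='gbk', errors='replace') as f:
        lossy = f.read()
finally:
    os.remove(path)  # clean up the temporary file

print(good == text, lossy == text)
```

Using `with` also closes the file automatically, which replaces the explicit `output.close()` calls.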
- Non-ASCII characters in a URL

```python
import re
from urllib.parse import quote

def url2ascii(self, url_addr):  # convert a URL to ASCII
    index = 0
    url_ascii = ""
    try:  # try a plain ASCII conversion first; handle failure below
        url_ascii = url_addr.encode('ascii')  # bytes
        url_ascii = str(url_ascii, encoding="ascii")  # str
    except UnicodeEncodeError:
        # [^\x00-\x7f] also covers characters above U+FFFF,
        # which the range \u0080-\uffff would miss.
        url_Nonascii = re.findall(r'[^\x00-\x7f]+', url_addr)  # extract runs of non-ASCII characters
        url_asciilist = re.split(r'[^\x00-\x7f]+', url_addr)  # split url_addr on those runs
        for s in url_Nonascii:
            # quote() turns non-ASCII text into ASCII of the form %E4%BD%A0%E5%A5%BD
            url_ascii += url_asciilist[index] + quote(s)
            index += 1
        if index < len(url_asciilist):
            url_ascii += url_asciilist[index]
    print(url_ascii)
    return url_ascii
```
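A shorter alternative is to let `urllib.parse.quote()` handle the whole URL and use its `safe` argument to protect the URL's structural characters. This is a sketch, not the method used above: the name `url_to_ascii`, the constant `URL_SAFE_CHARS`, and the example URL are hypothetical, and the exact set of characters to keep is an assumption.

```python
from urllib.parse import quote

# Assumed set of characters to leave untouched: URL delimiters plus '%',
# so that existing percent-escapes are not encoded a second time.
URL_SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def url_to_ascii(url_addr: str) -> str:
    """Percent-encode the non-ASCII characters in url_addr as UTF-8."""
    return quote(url_addr, safe=URL_SAFE_CHARS)


print(url_to_ascii('http://example.com/你好?q=1'))
# http://example.com/%E4%BD%A0%E5%A5%BD?q=1
```

One difference in behavior: ASCII characters that are not in `URL_SAFE_CHARS`, such as spaces, are also percent-encoded here, whereas `url2ascii` leaves every ASCII character unchanged.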
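- Automatically detecting the encoding of bytes (Python V3.3 win32)
The chardet module can automatically detect the encoding of a byte stream, such as a web page or a file opened in binary mode.
Download and installation:
Download URL: https://github.com/byroot/chardet
To install, unpack the archive, change into the chardet-master folder, and run: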
```
python setup.py --help-commands
python setup.py build
python setup.py install
```

You may also need to install the setuptools module (the version I installed was distribute-0.6.38).
Example:

```python
>>> import urllib.request
>>> urlp = urllib.request.urlopen('http://www.baidu.com')
>>> import chardet
>>> chardet.detect(urlp.read())
{'encoding': 'utf-8', 'confidence': 0.99}
>>> urlp.close()
>>> inputs = open('test', 'rb')
>>> chardet.detect(inputs.read())
{'encoding': 'GB2312', 'confidence': 0.99}
>>> inputs.close()
>>> inputs = open('test', 'rt')
>>> chardet.detect(inputs.read())
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    chardet.detect(inputs.read())
  File "D:\program file\Python33\lib\site-packages\chardet2-2.0.3-py3.3.egg\chardet\__init__.py", line 24, in detect
    u.feed(aBuf)
  File "D:\program file\Python33\lib\site-packages\chardet2-2.0.3-py3.3.egg\chardet\universaldetector.py", line 98, in feed
    if self._highBitDetector.search(aBuf):
TypeError: can't use a bytes pattern on a string-like object
```
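The final traceback occurs because the file was opened in text mode (`'rt'`), so `read()` returned a `str`, while detection works on `bytes`. When chardet is not installed, a much cruder stand-in is to try a short list of candidate encodings in order. The sketch below is not chardet's algorithm and gives no confidence score; the function name `guess_encoding` and the default candidate list are assumptions for illustration.

```python
def guess_encoding(data, candidates=('utf-8', 'gbk')):
    """Return the first candidate that decodes data without error, else None."""
    if not isinstance(data, bytes):
        # Mirrors the failure above: the input must come from binary mode.
        raise TypeError('expected bytes; open the file in binary mode ("rb")')
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
        return name
    return None


sample = '常用的字符串格式'
print(guess_encoding(sample.encode('gbk')))
print(guess_encoding(sample.encode('utf-8')))
```

The order of `candidates` matters: UTF-8 is strict and rejects most GBK data, so it is tried first, whereas GBK can accept some UTF-8 byte sequences and yield garbled text.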