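First, one thing needs to be clear: in Python 3, every string (str) is Unicode text.
The document on disk that you want to read, however, is stored in a specific encoding.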
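To make that distinction concrete, here is a minimal sketch. The sample string and variable names are chosen purely for illustration. A str holds Unicode code points, and bytes only appear once you call encode, with a length that depends on the encoding you pick.

```python
# A str is a sequence of Unicode code points; bytes appear only after encoding.
text = '共享单车'  # illustrative sample text

utf8_bytes = text.encode('utf-8')  # 3 bytes per character for these CJK characters
gbk_bytes = text.encode('gbk')     # 2 bytes per character in GBK

print(type(text).__name__, len(text))              # str 4
print(type(utf8_bytes).__name__, len(utf8_bytes))  # bytes 12
print(type(gbk_bytes).__name__, len(gbk_bytes))    # bytes 8

# Decoding with the matching codec recovers the same str.
print(utf8_bytes.decode('utf-8') == gbk_bytes.decode('gbk') == text)  # True
```

The same four characters produce byte sequences of different lengths, yet both decode back to an identical str.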
2. Demo
f1 = r'./Files/UTF8.txt'
f2 = r'./Files/GBK.txt'

# Text mode: the file contents are decoded into a Unicode str.
with open(f1, encoding='utf-8') as f:
    s_utf8 = f.read()
with open(f2, encoding='gbk') as f:
    s_gbk = f.read()

# Binary mode: the raw encoded bytes are returned unchanged.
with open(f1, 'rb') as f:
    b_utf8 = f.read()
with open(f2, 'rb') as f:
    b_gbk = f.read()

print('The str read in is Unicode'.center(70, '-'))
print('s_utf8:', s_utf8)
print('s_gbk:', s_gbk)
print('s_utf8 == s_gbk:', s_utf8 == s_gbk)

print('The bytes read in are encoded (UTF-8, GBK, etc.)'.center(70, '-'))
print('b_utf8:')
print(b_utf8)
print('b_gbk:')
print(b_gbk)

print('Encoding str with encode'.center(70, '-'))
print('s_utf8 and s_gbk are the same Unicode text; the result depends only on the target encoding')
print("s_utf8.encode('utf-8') == b_utf8:", s_utf8.encode('utf-8') == b_utf8)
print("s_utf8.encode('gbk') == b_gbk:", s_utf8.encode('gbk') == b_gbk)
print("s_gbk.encode('utf-8') == b_utf8:", s_gbk.encode('utf-8') == b_utf8)
print("s_gbk.encode('gbk') == b_gbk:", s_gbk.encode('gbk') == b_gbk)

print('Decoding bytes with decode'.center(70, '-'))
print("b_utf8.decode('utf-8'):", b_utf8.decode('utf-8'))
print('b_gbk.decode("gbk"):', b_gbk.decode("gbk"))

The result is as follows:

----------------------The str read in is Unicode----------------------
s_utf8: 共享单车是人类社会的一大进步。
s_gbk: 共享单车是人类社会的一大进步。
s_utf8 == s_gbk: True
-----------The bytes read in are encoded (UTF-8, GBK, etc.)-----------
b_utf8:
b'\xe5\x85\xb1\xe4\xba\xab\xe5\x8d\x95\xe8\xbd\xa6\xe6\x98\xaf\xe4\xba\xba\xe7\xb1\xbb\xe7\xa4\xbe\xe4\xbc\x9a\xe7\x9a\x84\xe4\xb8\x80\xe5\xa4\xa7\xe8\xbf\x9b\xe6\xad\xa5\xe3\x80\x82'
b_gbk:
b'\xb9\xb2\xcf\xed\xb5\xa5\xb3\xb5\xca\xc7\xc8\xcb\xc0\xe0\xc9\xe7\xbb\xe1\xb5\xc4\xd2\xbb\xb4\xf3\xbd\xf8\xb2\xbd\xa1\xa3'
-----------------------Encoding str with encode-----------------------
s_utf8 and s_gbk are the same Unicode text; the result depends only on the target encoding
s_utf8.encode('utf-8') == b_utf8: True
s_utf8.encode('gbk') == b_gbk: True
s_gbk.encode('utf-8') == b_utf8: True
s_gbk.encode('gbk') == b_gbk: True
----------------------Decoding bytes with decode----------------------
b_utf8.decode('utf-8'): 共享单车是人类社会的一大进步。
b_gbk.decode("gbk"): 共享单车是人类社会的一大进步。
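The demo above depends on two files under ./Files that are not included here. As a rough, self-contained sketch of the same idea, the snippet below first writes the sample sentence into two temporary files and then reads them back in text mode and binary mode. The temporary directory and the file names inside it are assumptions made for illustration. It also tries decoding the GBK bytes as UTF-8 to show what a codec mismatch looks like.

```python
import os
import tempfile

text = '共享单车是人类社会的一大进步。'

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; any writable paths would do.
    path_utf8 = os.path.join(tmp, 'UTF8.txt')
    path_gbk = os.path.join(tmp, 'GBK.txt')

    # Write the same text using two different on-disk encodings.
    with open(path_utf8, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path_gbk, 'w', encoding='gbk') as f:
        f.write(text)

    # Text mode decodes to str; binary mode returns the raw bytes.
    with open(path_utf8, encoding='utf-8') as f:
        s_utf8 = f.read()
    with open(path_gbk, encoding='gbk') as f:
        s_gbk = f.read()
    with open(path_utf8, 'rb') as f:
        b_utf8 = f.read()
    with open(path_gbk, 'rb') as f:
        b_gbk = f.read()

print(s_utf8 == s_gbk)          # True: identical Unicode text
print(b_utf8 == b_gbk)          # False: different bytes on disk
print(len(b_utf8), len(b_gbk))  # 45 30

# Decoding with the wrong codec fails instead of returning the text.
try:
    b_gbk.decode('utf-8')
except UnicodeDecodeError as exc:
    print('wrong codec:', exc.reason)
```

The two str values compare equal even though the files differ byte for byte, which is why the encoding argument to open matters only at the boundary between disk and memory.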
3. How to open a document whose encoding is unknown
"""
Use chardet to demonstrate how to open a document whose encoding is unknown.
"""
import chardet

f1 = r'./Files/UTF8.txt'
f2 = r'./Files/GBK.txt'

def openFile(path):
    # Read the raw bytes, let chardet guess the encoding, then decode with it.
    with open(path, 'rb') as f:
        byteSeq = f.read()
    res = chardet.detect(byteSeq)
    print(res)
    s = byteSeq.decode(res['encoding'])
    return s

s = openFile(f1)
print(s)
s = openFile(f2)
print(s)

The result is as follows:

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
共享单车是人类社会的一大进步。
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
共享单车是人类社会的一大进步。
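chardet's answer is a statistical guess, and its 'encoding' field can be None when nothing matches. As a hedged alternative that needs no third-party package, one could try a short list of candidate encodings in order. This is a different and cruder technique than chardet's detection. The function name decode_with_fallback and the candidate list are assumptions made for illustration.

```python
def decode_with_fallback(byte_seq, candidates=('utf-8', 'gbk')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return byte_seq.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes; try the next one
    raise ValueError('none of the candidate encodings could decode the data')


sample = '共享单车是人类社会的一大进步。'
print(decode_with_fallback(sample.encode('utf-8')))
print(decode_with_fallback(sample.encode('gbk')))
```

Order matters in this sketch: strict codecs such as UTF-8 should come first, because a permissive single-byte codec such as latin-1 accepts any byte sequence and would hide the real encoding.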