Learn Python 3 the Hard Way: Exercise 23, Strings, Bytes, and Character Encodings

License: CC 4.0 BY-SA


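Using the Exercise 23 source code, this article looks at encoding and decoding in Python, covering open(), readline(), encode(), and decode(). It processes a text file with several encodings (utf-8, utf-16, big5) and shows how encoding errors are handled. It also introduces ASCII, Unicode, and UTF-8, and looks at a terminal display problem and how to work around it. The material spans file handling, string processing, and encoding conversion.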

Exercise 23 source code

# encode: str -> bytes; decode: bytes -> str
import sys

script, encoding, error = sys.argv


def main(language_file, encoding, errors):
    line = language_file.readline()  # read one line from the file

    if line:  # a non-empty string is truthy; readline() returns "" at end of file
        print_line(line, encoding, errors)  # call our own print_line function
        return main(language_file, encoding, errors)  # call main() again for the next line


def print_line(line, encoding, errors):
    next_lang = line.strip()  # strip() removes leading and trailing whitespace, including the newline
    raw_bytes = next_lang.encode(encoding, errors=errors)  # encode() turns the string into bytes
    cooked_string = raw_bytes.decode(encoding, errors=errors)  # decode() turns the bytes back into a string

    print(raw_bytes, "<===>", cooked_string)


# raw string, so the backslashes in the Windows path are not read as escape sequences
languages = open(r"C:\Users\limin\Desktop\Python3_exercises\languages.txt", encoding="utf-8")

main(languages, encoding, error)

Output

In the terminal, run C:\Users\limin\Desktop\Python3_exercises\ex23.py utf-8 strict

b'Afrikaans' <===> Afrikaans
b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b' <===> አማርኛ
b'\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0' <===> Аҧсшәа
b'\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9' <===> العربية
b'Aragon\xc3\xa9s' <===> Aragonés
b'Arpetan' <===> Arpetan
b'Az\xc9\x99rbaycanca' <===> Azərbaycanca
b'Bamanankan' <===> Bamanankan
b'\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbe' <===> বাংলা
b'B\xc3\xa2n-l\xc3\xa2m-g\xc3\xba' <===> Bân-lâm-gú
b'\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0\xb0\xd1\x8f' <===> Беларуская
b'\xd0\x91\xd1\x8a\xd0\xbb\xd0\xb3\xd0\xb0\xd1\x80\xd1\x81\xd0\xba\xd0\xb8' <===> Български
b'Boarisch' <===> Boarisch
b'Bosanski' <===> Bosanski
...... (many more lines, not all copied here)

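Note:
The terminal output will not necessarily be as complete as shown above. In practice many characters may appear as blank boxes, most likely because the terminal is not displaying text as UTF-8.
Fix: on Windows 10, type chcp 65001 in the cmd terminal to switch the code page to UTF-8. To confirm, right-click the terminal's title bar, choose "Properties", and check on the "Options" tab that the current code page is UTF-8. The default is chcp 936, that is, Simplified Chinese GBK. (Even after switching to UTF-8, a small number of characters still could not be displayed.)
There is also a way to change the terminal's default encoding directly; see this article: https://blog.csdn.net/weixin_45265547/article/details/121931397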

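Summary of key points

1. Encoding and decoding
2. Using the open() function
3. Using the readline() method
4. The encode() and decode() methods
5. Functions are called without a ".", while methods are called on an object with a "."
6. A first look at the if statement, and Boolean values
7. Bits, bytes, byte strings, strings, ASCII, and the UTF-8 convention

Switches, conventions, and encodings: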

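What is a bit: at its core a computer is a huge collection of switches, where 1 means on and 0 means off. These 1s and 0s are called "bits".
What is a byte: to encode larger numbers, the computer groups bits together. With 8 ones and zeros it can encode 256 numbers, where 00000000 stands for 0 and 11111111 for 255. A byte is defined as a sequence of 8 bits. There are also conventions that use 16, 32, 64, or even more bits to encode larger numbers.
What is encoding: a convention for mapping numbers to text. More generally, encoding converts information from one form or format into another; decoding is the reverse process, in which the receiver turns the symbols or codes it received back into information.
What is "DBES": "Decode Bytes, Encode Strings", which can be memorized as "dee-bess" (迪拜斯).
What is ASCII: one of the most common conventions for mapping numbers to text, the American Standard Code for Information Interchange. In it, 90 stands for Z. Decimal 90 in binary is 01011010 (1*2^1 + 1*2^3 + 1*2^4 + 1*2^6 = 2 + 8 + 16 + 64 = 90), so in the computer Z is stored as the 8-bit byte 01011010.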
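The arithmetic above can be checked in Python itself. This is a small sketch using only built-ins (int, ord, chr, format); the variable names are chosen here for illustration.

```python
# A byte is 8 bits, so it can hold 2**8 = 256 different values (0..255).
values_per_byte = 2 ** 8
largest_byte = int("11111111", 2)        # parse a binary string as an integer

# ASCII maps the number 90 to the letter Z.
code_for_z = ord("Z")                    # character -> number
bits_for_z = format(code_for_z, "08b")   # number -> 8-digit binary string

print(values_per_byte)   # 256
print(largest_byte)      # 255
print(code_for_z)        # 90
print(bits_for_z)        # 01011010
print(chr(0b01011010))   # Z
```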
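What is Unicode: ASCII has a problem: it can only encode English and a few similar languages. So Unicode (the name suggests "universal encoding") was created; 32 bits are enough to encode any Unicode character.
Using 32 bits for every Unicode character is wasteful, so a compressed convention is used instead: 8 bits for the most common characters, and 16, 24, or 32 bits when a larger number is needed. The convention used for encoding text in Python is called UTF-8 (Unicode Transformation Format 8 Bits). You can choose other encodings, such as UTF-16 or Big5, but UTF-8 is the current standard.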
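To see the variable-width idea concretely, the sketch below encodes a few sample characters (picked here only as examples) and counts the bytes each one needs in UTF-8, UTF-16, and UTF-32. The helper name byte_lengths is made up for this illustration, and the little-endian codec names are used so that no byte order mark is included in the counts.

```python
# Sample characters that need 1, 2, 3 and 4 bytes in UTF-8.
# "\U0001F600" is an emoji, written as an escape to keep this file easy to display.
samples = ["Z", "é", "中", "\U0001F600"]


def byte_lengths(ch):
    """Return the number of bytes ch needs in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # the "-le" variants add no byte order mark
        len(ch.encode("utf-32-le")),
    )


for ch in samples:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

UTF-32 always spends 4 bytes per character, while UTF-8 spends 1 byte on ASCII characters and more only when the code point requires it.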

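Breaking down the output

On the left, inside b'', the script shows the raw bytes of each line in the chosen encoding (utf-8 here); on the right is the string obtained by decoding those bytes.
The b'' prefix tells Python that this is a sequence of bytes. These raw bytes are "cooked" (decoded) and shown on the right so that the terminal can display the actual characters.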
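A minimal sketch of the same left/right relationship, using the bytes for Aragonés copied from the output above. The variable names follow the exercise script.

```python
raw_bytes = b'Aragon\xc3\xa9s'              # bytes literal, as printed on the left
cooked_string = raw_bytes.decode("utf-8")   # bytes -> str ("Decode Bytes")

print(type(raw_bytes).__name__, len(raw_bytes))            # bytes 9
print(type(cooked_string).__name__, len(cooked_string))    # str 8
print(list(raw_bytes[6:8]))                                # [195, 169]
print(cooked_string.encode("utf-8") == raw_bytes)          # True
```

The byte string is one element longer than the text because é takes two bytes (0xc3, 0xa9) in UTF-8.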

Breaking down the code

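The open() function:
Syntax: open(file, mode="r", encoding=None)
Usage: file <=> the name (path) of the file to open, given as a string;
mode <=> the mode in which the file is opened; the common ones are write-only "w" and read-only "r". Write-only mode truncates an existing file before returning the file object and creates the file if it does not exist. The default mode is read-only "r";
encoding <=> the encoding of the file being opened. Pass it as a keyword argument, since the third positional parameter of open() is buffering. If it does not match the file's actual encoding, reading can raise a UnicodeDecodeError or return garbled text.
Example: languages = open(r"C:\Users\limin\Desktop\Python3_exercises\languages.txt", encoding="utf-8")  # open languages.txt in read-only mode with the utf-8 encoding.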
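The effect of the encoding argument can be tried without touching real files. The sketch below writes a throwaway file with tempfile (the file name and the sample text are made up for this illustration) and reads it back with the matching and then a mismatched encoding. It uses with blocks, which close the file automatically; the exercise script itself leaves the file open.

```python
import os
import tempfile

text = "中文\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name

    # "w" creates the file (or truncates an existing one) and writes UTF-8 bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the matching encoding returns the original text.
    with open(path, encoding="utf-8") as f:
        same_text = f.read()

    # Reading with a mismatched encoding fails here: ASCII cannot decode these bytes.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        mismatch_error = None
    except UnicodeDecodeError as exc:
        mismatch_error = exc

print(same_text == text)                # True
print(type(mismatch_error).__name__)    # UnicodeDecodeError
```

Note that the error is raised by read(), not by open(): the file opens fine, and the mismatch only shows up when bytes are actually decoded.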
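The if statement:
In the form "if variable", the string is evaluated as a Boolean (True/False). None, 0 (the number zero), "" (the empty string), and (), [], {} (the empty tuple, list, and dict) all count as False; non-empty values and nonzero numbers, including negative ones, count as True. When the value is true, the body of the if statement runs.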
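The truth rules above can be checked with bool(), which applies the same test an if statement uses. The list of candidate values is sample data chosen for this sketch.

```python
candidates = [None, 0, "", (), [], {}, "Afrikaans\n", -1, [0]]

for value in candidates:
    # bool() applies the same truth test that an if statement uses
    print(repr(value), "->", bool(value))

falsy = [v for v in candidates if not v]
truthy = [v for v in candidates if v]
```

The last entry, [0], is true because the list itself is not empty, even though the only element in it is false.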
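The read/readline methods:
Difference: read() reads the whole file into a single string; readline() reads one line per call and returns it as a string that includes the trailing newline. At the end of the file it returns the empty string "", which is what stops main().
For example: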

line = language_file.readline()
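How readline() signals the end of the file can be seen with an in-memory file. The sketch below uses io.StringIO as a stand-in for languages.txt (the three lines are sample data) and a while loop in place of the recursion in main(). The loop is an alternative written for this illustration, not the exercise's own code; one practical difference is that a loop is not bound by Python's recursion limit when the file has very many lines.

```python
import io

language_file = io.StringIO("Afrikaans\nAragonés\n中文\n")  # stand-in for the real file

lines = []
while True:
    line = language_file.readline()
    if not line:          # "" means end of file, and "" is falsy
        break
    lines.append(line)

print(lines)                            # each line still ends with "\n"
print(repr(language_file.readline()))   # '' once the file is exhausted
```

A blank line inside the file is read as "\n", which is truthy, so it does not end the loop; only the empty string returned at the end of the file does.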

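The encode/decode methods:
Syntax: str.encode(encoding, errors) / bytes.decode(encoding, errors)
Usage: encoding <=> the encoding to use, such as "utf-8" or "utf-16"; the default is "utf-8"
errors <=> the error-handling scheme. The default is "strict", meaning that an encoding error raises a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace' (encoding only), 'backslashreplace', and any name registered with codecs.register_error()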
For example:
Input

a = "Python世界"
a_utf8 = a.encode(encoding='utf-8', errors='strict')  # encode with the utf-8 convention
a_gbk = a.encode(encoding='GBK', errors='strict')  # encode with the GBK convention

print(a)
print("UTF-8 encoded:", a_utf8)
print("GBK encoded:", a_gbk)

print("UTF-8 decoded:", a_utf8.decode(encoding="utf-8", errors="strict"))  # decode with utf-8
print("GBK decoded:", a_gbk.decode(encoding='GBK', errors='strict'))  # decode with GBK

Output

Python世界
UTF-8 encoded: b'Python\xe4\xb8\x96\xe7\x95\x8c'
GBK encoded: b'Python\xca\xc0\xbd\xe7'
UTF-8 decoded: Python世界
GBK decoded: Python世界
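The error handlers listed above behave differently on the same input. The sketch below encodes one word from languages.txt to ASCII, which has no code for é, under each built-in handler. The variable names are chosen for this illustration.

```python
word = "Aragonés"

# "strict" (the default) raises an error, because é has no ASCII code.
try:
    word.encode("ascii", errors="strict")
    strict_result = "no error"
except UnicodeEncodeError:
    strict_result = "UnicodeEncodeError"

# The other handlers return bytes and deal with é in different ways.
results = {
    handler: word.encode("ascii", errors=handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

print(strict_result)
for handler, encoded in results.items():
    print(handler, encoded)
```

'ignore' drops the character, 'replace' substitutes ?, and the last two keep the code point in an escaped form (&#233; and \xe9), so only those two preserve which character was there.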

A closer look at the code

For the runs below, languages.txt was cut down to a subset of its lines.
1. Encoding with utf-8:

C:\Users\limin>python C:\Users\limin\Desktop\Python3_exercises\ex23.py utf-8 strict
b'Afrikaans' <===> Afrikaans
b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b' <===> አማርኛ
b'\xd0\x90\xd2\xa7\xd1\x81\xd1\x88\xd3\x99\xd0\xb0' <===> Аҧсшәа
b'\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9' <===> العربية
b'Aragon\xc3\xa9s' <===> Aragonés
b'Arpetan' <===> Arpetan
b'Az\xc9\x99rbaycanca' <===> Azərbaycanca
b'Bamanankan' <===> Bamanankan
b'' <===>
b'Hrvatski' <===> Hrvatski
b'Ido' <===> Ido
b'Interlingua' <===> Interlingua
b'Italiano' <===> Italiano
b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e' <===> 日本語
b'Norsk bokm\xc3\xa5l' <===> Norsk bokmål
b'Nouormand' <===> Nouormand
b'V\xc3\xb5ro' <===> Võro
b'\xe6\x96\x87\xe8\xa8\x80' <===> 文言
b'\xe5\x90\xb4\xe8\xaf\xad' <===> 吴语
b'\xd7\x99\xd7\x99\xd6\xb4\xd7\x93\xd7\x99\xd7\xa9' <===> ייִדיש
b'\xe4\xb8\xad\xe6\x96\x87' <===> 中文

2. Encoding with utf-16:

C:\Users\limin>python C:\Users\limin\Desktop\Python3_exercises\ex23.py utf-16 strict
b'\xff\xfeA\x00f\x00r\x00i\x00k\x00a\x00a\x00n\x00s\x00' <===> Afrikaans
b'\xff\xfe\xa0\x12\x1b\x12-\x12\x9b\x12' <===> አማርኛ
b'\xff\xfe\x10\x04\xa7\x04A\x04H\x04\xd9\x040\x04' <===> Аҧсшәа
b"\xff\xfe'\x06D\x069\x061\x06(\x06J\x06)\x06" <===> العربية
b'\xff\xfeA\x00r\x00a\x00g\x00o\x00n\x00\xe9\x00s\x00' <===> Aragonés
b'\xff\xfeA\x00r\x00p\x00e\x00t\x00a\x00n\x00' <===> Arpetan
b'\xff\xfeA\x00z\x00Y\x02r\x00b\x00a\x00y\x00c\x00a\x00n\x00c\x00a\x00' <===> Azərbaycanca
b'\xff\xfeB\x00a\x00m\x00a\x00n\x00a\x00n\x00k\x00a\x00n\x00' <===> Bamanankan
b'\xff\xfe' <===>
b'\xff\xfeH\x00r\x00v\x00a\x00t\x00s\x00k\x00i\x00' <===> Hrvatski
b'\xff\xfeI\x00d\x00o\x00' <===> Ido
b'\xff\xfeI\x00n\x00t\x00e\x00r\x00l\x00i\x00n\x00g\x00u\x00a\x00' <===> Interlingua
b'\xff\xfeI\x00t\x00a\x00l\x00i\x00a\x00n\x00o\x00' <===> Italiano
b'\xff\xfe\xe5e,g\x9e\x8a' <===> 日本語
b'\xff\xfeN\x00o\x00r\x00s\x00k\x00 \x00b\x00o\x00k\x00m\x00\xe5\x00l\x00' <===> Norsk bokmål
b'\xff\xfeN\x00o\x00u\x00o\x00r\x00m\x00a\x00n\x00d\x00' <===> Nouormand
b'\xff\xfeV\x00\xf5\x00r\x00o\x00' <===> Võro
b'\xff\xfe\x87e\x00\x8a' <===> 文言
b'\xff\xfe4T\xed\x8b' <===> 吴语
b'\xff\xfe\xd9\x05\xd9\x05\xb4\x05\xd3\x05\xd9\x05\xe9\x05' <===> ייִדיש
b'\xff\xfe-N\x87e' <===> 中文
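Every line in the utf-16 run starts with \xff\xfe. That prefix is the byte order mark (BOM) that Python's utf-16 codec writes when encoding; \xff\xfe indicates little-endian byte order, which is what the machine in this run used. The sketch below, using a word from the output above, compares it with the utf-16-le codec, which fixes the byte order and writes no BOM.

```python
import codecs

word = "Ido"

with_bom = word.encode("utf-16")         # BOM first, then the machine's native byte order
without_bom = word.encode("utf-16-le")   # explicit little-endian, no BOM

print(with_bom)
print(without_bom)           # b'I\x00d\x00o\x00'
print(codecs.BOM_UTF16_LE)   # b'\xff\xfe'
```

This also explains why the blank line in the file shows up as b'\xff\xfe': the BOM is written even when there is no text to encode.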

3. Encoding with big5:
This raises an error, as follows:

C:\Users\limin>python C:\Users\limin\Desktop\Python3_exercises\ex23.py big5 strict
b'Afrikaans' <===> Afrikaans
Traceback (most recent call last):
  File "C:\Users\limin\Desktop\Python3_exercises\ex23.py", line 31, in <module>
    main(languages,encoding,error)
  File "C:\Users\limin\Desktop\Python3_exercises\ex23.py", line 18, in main
    return main(language_file, encoding,errors) # call main() again for the next line
  File "C:\Users\limin\Desktop\Python3_exercises\ex23.py", line 17, in main
    print_line(line,encoding,errors)    # call our own print_line function
  File "C:\Users\limin\Desktop\Python3_exercises\ex23.py", line 22, in print_line
    raw_bytes = next_lang.encode(encoding, errors = errors) # encode() turns the string into bytes
UnicodeEncodeError: 'big5' codec can't encode character '\u12a0' in position 0: illegal multibyte sequence

Change the error handling to replace:

C:\Users\limin>python C:\Users\limin\Desktop\Python3_exercises\ex23.py big5 replace
b'Afrikaans' <===> Afrikaans
b'????' <===> ????
b'??\xc7\xda\xc7\xe1?\xc7\xc8' <===> ??сш?а
b'???????' <===> ???????
b'Aragon?s' <===> Aragon?s
b'Arpetan' <===> Arpetan
b'Az?rbaycanca' <===> Az?rbaycanca
b'Bamanankan' <===> Bamanankan
b'' <===>
b'Hrvatski' <===> Hrvatski
b'Ido' <===> Ido
b'Interlingua' <===> Interlingua
b'Italiano' <===> Italiano
b'\xa4\xe9\xa5\xbb\xbby' <===> 日本語
b'Norsk bokm?l' <===> Norsk bokm?l
b'Nouormand' <===> Nouormand
b'V?ro' <===> V?ro
b'\xa4\xe5\xa8\xa5' <===> 文言
b'??' <===> ??
b'??????' <===> ??????
b'\xa4\xa4\xa4\xe5' <===> 中文
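The two big5 runs can be reproduced on single words. In the sketch below, encode_with_fallback is a hypothetical helper written for this illustration: it tries strict encoding first and falls back to errors="replace" only when that fails. The sample words and the expected bytes are taken from the output above.

```python
def encode_with_fallback(text, encoding):
    """Encode strictly if possible; otherwise replace unencodable characters with '?'.

    Returns (encoded_bytes, lossless), where lossless is False if the fallback was used.
    """
    try:
        return text.encode(encoding, errors="strict"), True
    except UnicodeEncodeError:
        return text.encode(encoding, errors="replace"), False


for word in ("中文", "አማርኛ"):
    raw_bytes, lossless = encode_with_fallback(word, "big5")
    status = "lossless" if lossless else "lossy"
    print(raw_bytes, "<===>", raw_bytes.decode("big5"), f"({status})")
```

Once a character has been replaced with ?, decoding cannot bring it back, which is why the right-hand column of the replace run also shows question marks.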