LearnPython3theHardWay__Exercise 23 Strings, Bytes, and Character Encodings

byakki

Category: LearnPython3theHardWay

Article link: https://blog.csdn.net/byakki/article/details/86761510

Strings, Bytes, and Character Encodings
This exercise is very, very long... reading it made my head spin.

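Before starting this exercise, download languages.txt, a UTF-8 encoded text file.

The file contains a list of human languages, and it illustrates a few interesting concepts:

How modern computers store human languages, and how Python 3 treats that text as strings
How to "encode" and "decode" Python strings into a type called bytes
How to handle errors when working with strings and bytes
How to read code and work out what it means, even if you have never seen it before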

import sys

# Unpack the script name, the encoding (codec) name, and the error-handling mode.
script, input_encoding, error = sys.argv


def main(language_file, encoding, errors):
    # Main part of the script; it is called at the bottom of the file to start things off.
    line = language_file.readline()  # Read one line from the file.

    if line:  # readline() returns '' at end of file, which stops the recursion.
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)  # Call main again to handle the next line.


def print_line(line, encoding, errors):
    next_lang = line.strip()  # Remove leading and trailing whitespace, including the trailing \n.
    raw_bytes = next_lang.encode(encoding, errors=errors)  # Encode the str to raw bytes, handling errors as requested.
    cooked_string = raw_bytes.decode(encoding, errors=errors)  # The reverse: decode the bytes back to a str.

    print(raw_bytes, "<===>", cooked_string)  # Print both forms side by side.


# The functions are ready; the file has to be opened before it can be processed.
languages = open("languages.txt", encoding="utf-8")

main(languages, input_encoding, error)  # Start the script by calling main.
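
The recursive call in main behaves like a loop: read a line, process it, and repeat until readline() returns an empty string. As a rough sketch of the same idea written with a for loop (this is not the book's code; `format_line`, `main_loop`, and the in-memory `io.StringIO` sample are names introduced here purely for illustration):

```python
import io


def format_line(line, encoding, errors):
    """Encode one line to bytes, decode it back, and return both forms."""
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)
    return raw_bytes, cooked_string


def main_loop(language_file, encoding, errors):
    """Iterate over the file line by line instead of recursing."""
    results = []
    for line in language_file:
        raw_bytes, cooked_string = format_line(line, encoding, errors)
        print(raw_bytes, "<===>", cooked_string)
        results.append((raw_bytes, cooked_string))
    return results


# Hypothetical two-line stand-in for languages.txt, held in memory.
sample = io.StringIO("Türkçe\n中文\n")
results = main_loop(sample, "utf-8", "strict")
```

A loop also sidesteps Python's recursion limit (1000 frames by default), which the recursive version would hit on a sufficiently long file.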

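There is probably a lot here that does not make sense yet, but since the code is already typed in, why not run it once and see? The output is very long, so only the command line and the last 10 lines are shown.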

C:\Users\PycharmProjects\learnpythonthehardway>python ex23.py utf-8 strict
...
b'\xd0\xa2\xd0\xbe\xd2\xb7\xd0\xb8\xd0\xba\xd3\xa3' <===> Тоҷикӣ
b'T\xc3\xbcrk\xc3\xa7e' <===> Türkçe
b'\xd0\xa3\xd0\xba\xd1\x80\xd0\xb0\xd1\x97\xd0\xbd\xd1\x81\xd1\x8c\xd0\xba\xd0\xb0' <===> Українська
b'\xd8\xa7\xd8\xb1\xd8\xaf\xd9\x88' <===> اردو
b'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t' <===> Tiếng Việt
b'V\xc3\xb5ro' <===> Võro
b'\xe6\x96\x87\xe8\xa8\x80' <===> 文言
b'\xe5\x90\xb4\xe8\xaf\xad' <===> 吴语
b'\xd7\x99\xd7\x99\xd6\xb4\xd7\x93\xd7\x99\xd7\xa9' <===> ייִדיש
b'\xe4\xb8\xad\xe6\x96\x87' <===> 中文

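These examples use the utf-8, utf-16, and big5 encodings to demonstrate the conversion and the kinds of errors you can get. In Python 3 each of these names is called a codec, but you pass it through the parameter named encoding.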
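
To make the codec/encoding naming a little more concrete, here is a small sketch using the standard `codecs` module. The variable names and the choice of "Türkçe" (taken from the output above) are mine, not the book's:

```python
import codecs

# codecs.lookup() resolves a codec name (aliases included) to a CodecInfo object.
canonical_name = codecs.lookup("UTF8").name

word = "Türkçe"
# The codec name is passed through the parameter called `encoding`.
as_utf8 = word.encode(encoding="utf-8")
as_utf16 = word.encode(encoding="utf-16")

print(canonical_name)
print(as_utf8)
print(as_utf16)
```

The same text produces different bytes under each codec, which is why the script takes the codec name as a command-line argument.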

Switches, Conventions, and Encodings

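Next the author covers some computer fundamentals, such as why a computer's language consists only of 0s and 1s, which we call bits.
In the past, 8-bit machines could represent the numbers 0-255 with their 0s and 1s. (In games from the 1980s and 1990s a character's stats maxed out at 255; now you can see why.)
ASCII is still in common use today. It maps numbers to letters: for example, the number 90 is Z, which in bits is 1011010. You can try this in Python: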

>>> 0b1011010
90
>>> ord('Z')
90
>>> chr(90)
'Z'

To write the author's name, "Zed A. Shaw", all that is needed is a sequence of bytes: [90, 101, 100, 32, 65, 46, 32, 83, 104, 97, 119]
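
That byte sequence can be checked directly. A minimal sketch, assuming plain ASCII for the name (the variable names are illustrative only):

```python
name = "Zed A. Shaw"

# Encode to ASCII and view each byte as an integer.
codes = list(name.encode("ascii"))
print(codes)

# Going the other way: integers -> bytes -> str.
restored = bytes(codes).decode("ascii")
print(restored)

# Show the first byte (Z) as an 8-bit pattern, padded with a leading zero.
first_bits = format(codes[0], "08b")
print(first_bits)
```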

The author then goes on at length, and I won't repeat it all here. I recommend reading up on the history of ASCII and Unicode.

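As the output shows, Python encodes each language name into a byte string, such as b'\xd7\x99…', and then decodes those bytes back into a string with the same codec before printing it.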
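
The round trip can be reproduced on a single entry. This sketch uses 中文, the last line of the output above, and mirrors the variable names from print_line:

```python
text = "中文"

raw_bytes = text.encode("utf-8")           # str -> bytes
cooked_string = raw_bytes.decode("utf-8")  # bytes -> str

print(type(raw_bytes).__name__, raw_bytes)
print(type(cooked_string).__name__, cooked_string)
print(len(text), "characters,", len(raw_bytes), "bytes")
```

Two characters become six bytes here, because each of these characters takes three bytes in UTF-8.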

Let's change the command and see what happens.

python ex23.py utf-16 strict

python ex23.py big5 strict
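
To get a feel for what those two commands do without running the whole file, here is a hedged sketch on one entry. It rebuilds the Urdu name from the UTF-8 bytes shown in the output above; the variable names are my own, and the exact error text may vary:

```python
# Rebuild the Urdu entry from the UTF-8 bytes in the earlier output.
urdu = b"\xd8\xa7\xd8\xb1\xd8\xaf\xd9\x88".decode("utf-8")

# utf-16 can represent these characters, so "strict" succeeds.
utf16_bytes = urdu.encode("utf-16", errors="strict")

# big5 has no Arabic letters, so "strict" raises UnicodeEncodeError.
try:
    urdu.encode("big5", errors="strict")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

# "replace" substitutes "?" for each character the codec cannot encode.
replaced = urdu.encode("big5", errors="replace")
print("replace:", replaced)
```

So with big5 the error mode matters: strict stops at the first unencodable character, while replace keeps going and loses information.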

Finally, the author gives us some challenge tasks. My head is already spinning, so I'll leave them to you!

1. Find strings of text encoded in other encodings and place them in the ex23.py file to see how it breaks.
2. Find out what happens when you give an encoding that doesn't exist.
3. Extra challenging: Rewrite this using the b'' bytes instead of the UTF-8 strings, effectively reversing the script.
4. If you can do that, then you can also break these bytes by removing some to see what happens. How much do you need to remove to cause Python to break? How much can you remove to damage the string output but pass Python's decoding system?
5. Use what you learned from item 4 to see if you can mangle the files. What errors do you get? How much damage can you cause and get the file past Python's decoding system?
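
As a starting point for the second and fourth challenges, here is a small sketch rather than a full solution. The codec name "no-such-encoding" is deliberately made up, and the damaged bytes are the UTF-8 form of 中文 from the output above with the last byte removed:

```python
# Second challenge: an encoding name that does not exist raises LookupError.
try:
    "abc".encode("no-such-encoding")
    lookup_failed = False
except LookupError as exc:
    lookup_failed = True
    print("LookupError:", exc)

# Fourth challenge: drop the final byte of the UTF-8 bytes for 中文.
damaged = b"\xe4\xb8\xad\xe6\x96\x87"[:-1]

try:
    damaged.decode("utf-8", errors="strict")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("UnicodeDecodeError:", exc.reason)

# "replace" gets past the decoder but leaves U+FFFD in the result.
patched = damaged.decode("utf-8", errors="replace")
print(repr(patched))
```

Removing a single byte is enough to break strict decoding here; with replace the first character survives and the damaged one is swapped for the replacement character.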