Fluent Python, Chapter 4 Study Notes

测试游记

Article link: https://blog.csdn.net/weixin_37786060/article/details/110790019

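A string is a sequence of characters.

Byte sequences: machine core dumps

Unicode: human-readable text

Converting a byte sequence into a human-readable text string is decoding「decode」.

Converting a string into a byte sequence for storage or transmission is encoding「encode」.

The Python 3「str」type is roughly equivalent to the Python 2「unicode」type.

Python 3 uses「UTF-8」as its default encoding.

Python 2 uses ASCII by default.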

Encoding and decoding

def encode(self, *args, **kwargs): # real signature unknown
    """
    Encode the string using the codec registered for encoding.

      encoding「parameter 1: the encoding」
        The encoding in which to encode the string.
      errors「parameter 2: the error handling scheme」
        The error handling scheme to use for encoding errors.
        The default is 'strict' meaning that encoding errors raise a
        UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
        'xmlcharrefreplace' as well as any other name registered with
        codecs.register_error that can handle UnicodeEncodeErrors.
    """
    pass

def decode(self, *args, **kwargs): # real signature unknown
    """
    Decode the bytearray using the codec registered for encoding.

      encoding
        The encoding with which to decode the bytearray.
      errors
        The error handling scheme to use for the handling of decoding errors.
        The default is 'strict' meaning that decoding errors raise a
        UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
        as well as any other name registered with codecs.register_error that
        can handle UnicodeDecodeErrors.
    """
    pass
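The two signatures above can be tried with a minimal round trip. This is only an illustrative sketch; the sample string and the variable names are assumptions, not taken from the original notes.

```python
# Encode a str to bytes and decode it back, naming the codec explicitly.
text = 'caf\u00e9'                 # 'café' written with a precomposed é

data = text.encode('utf-8')        # str -> bytes
restored = data.decode('utf-8')    # bytes -> str

print(len(text), len(data))        # 4 characters, but 5 bytes in UTF-8
print(data)                        # é takes two bytes: \xc3\xa9
print(restored == text)            # the round trip is lossless
```

The character count and the byte count differ because UTF-8 is a variable-width encoding.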

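Error handlers

Codecs can implement different error handling schemes by accepting an errors string argument.

Value	Meaning
`'strict'`	Raise `UnicodeError` (or a subclass); this is the default. Implemented in `strict_errors()`.
`'ignore'`	Ignore the malformed data and continue without further notice. Implemented in `ignore_errors()`.

The following error handlers apply only to text encodings:

Value	Meaning
`'replace'`	Replace with a suitable replacement marker; Python's built-in codecs use the official `U+FFFD` replacement character when decoding and '?' when encoding. Implemented in `replace_errors()`.
`'xmlcharrefreplace'`	Replace with the appropriate XML character reference (encoding only). Implemented in `xmlcharrefreplace_errors()`.
`'backslashreplace'`	Replace with backslashed escape sequences. Implemented in `backslashreplace_errors()`.
`'namereplace'`	Replace with `\N{...}` escape sequences (encoding only). Implemented in `namereplace_errors()`.
`'surrogateescape'`	On decoding, replace each byte with an individual surrogate code in the range `U+DC80` to `U+DCFF`. When the `'surrogateescape'` error handler is used while encoding, that surrogate is turned back into the same byte. (See PEP 383 for details.)

In addition, the following error handler is specific to the given codecs:

Value	Codecs	Meaning
`'surrogatepass'`	utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le	Allow encoding and decoding of surrogate codes. These codecs…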
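To see how several of these handlers behave, here is a small sketch. The sample values (`city`, `bad`) and variable names are made up for illustration and are not part of the original notes.

```python
# Compare error handlers on a string that ASCII cannot represent.
city = 'S\u00e3o Paulo'  # 'São Paulo'

ignored = city.encode('ascii', errors='ignore')              # drops ã
replaced = city.encode('ascii', errors='replace')            # ã becomes ?
xml_refs = city.encode('ascii', errors='xmlcharrefreplace')  # ã becomes &#227;
escaped = city.encode('ascii', errors='backslashreplace')    # ã becomes \xe3

# The default handler, 'strict', raises instead of guessing.
try:
    city.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Decoding side: these bytes are Latin-1, not valid UTF-8.
bad = b'caf\xe9'
decoded = bad.decode('utf-8', errors='replace')           # ends with U+FFFD
smuggled = bad.decode('utf-8', errors='surrogateescape')  # ends with U+DCE9
round_trip = smuggled.encode('utf-8', errors='surrogateescape')

print(ignored, replaced, xml_refs, escaped)
print(strict_failed, round_trip == bad)
```

With `'surrogateescape'` the undecodable byte survives a decode/encode round trip, which is why it is useful for data such as file names of unknown encoding.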

Defining your own encoding error handler

codecs.register_error(name, error_handler)

name: the name of the handler
error_handler: the error handling function

Custom error handling
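A minimal sketch of a custom handler registered with `codecs.register_error`. The handler name `'underscore'` and the function `underscore_errors` are hypothetical, chosen only for this example.

```python
import codecs


def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # Return the replacement text and the position where encoding resumes.
        return ('_' * (exc.end - exc.start), exc.end)
    # This sketch only supports encoding errors.
    raise exc


# Register the handler under a name usable as the errors argument.
codecs.register_error('underscore', underscore_errors)

encoded = 'S\u00e3o Paulo'.encode('ascii', errors='underscore')
print(encoded)
```

The handler receives the exception instance; its `start` and `end` attributes mark the span of the input that could not be encoded.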

Detecting the encoding of a byte string

import chardet

print(chardet.detect(b'aaaa'))
# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
print(chardet.detect(b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'))
# {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}

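The third-party chardet module can guess how content is encoded. The result is a heuristic with a confidence score, not a guarantee.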

import locale

print(locale.getpreferredencoding())  # e.g. UTF-8; depends on the platform and locale

BOM

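On Windows, opening a UTF-8 encoded txt file with open can yield an extra character \ufeff at the start of the text. It is the BOM (byte order mark), which declares information such as the encoding, but Python parses it as part of the text.

For UTF-16, Python decodes the BOM to an empty string.

For UTF-8, the BOM is decoded as the single character \ufeff.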
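The behaviour described above can be checked directly on bytes. The sketch below also uses the standard `utf-8-sig` codec, which the notes do not mention, as one way to drop a leading UTF-8 BOM; the sample data is made up.

```python
import codecs

# Bytes as some Windows editors would save a UTF-8 file: BOM + content.
raw_utf8 = codecs.BOM_UTF8 + 'abc'.encode('utf-8')

with_bom = raw_utf8.decode('utf-8')         # BOM survives as '\ufeff'
without_bom = raw_utf8.decode('utf-8-sig')  # 'utf-8-sig' strips a leading BOM

# The 'utf-16' codec writes a BOM when encoding and consumes it when decoding.
raw_utf16 = 'abc'.encode('utf-16')
utf16_text = raw_utf16.decode('utf-16')

print(repr(with_bom), repr(without_bom), repr(utf16_text))
```

Passing `encoding='utf-8-sig'` to `open` has the same effect when reading files.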

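Unicode sandwich: the current best practice for handling text

「bytes」->「str」: decode the incoming byte sequence
「str」: process text only
「str」->「bytes」: encode the outgoing text

⚠️ Code that has to run on multiple machines or in multiple environments must never rely on the "default encoding".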
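The three layers of the sandwich might look like this in practice. This is a sketch under assumptions: the file names and the uppercase "processing" step are invented for illustration, and every `open` call names its encoding instead of relying on the default.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')    # hypothetical input file
    dst = os.path.join(tmp, 'out.txt')   # hypothetical output file

    # Set up some input on disk for the example.
    with open(src, 'w', encoding='utf-8') as f:
        f.write('caf\u00e9')

    # Bottom slice: decode bytes to str as early as possible.
    with open(src, 'r', encoding='utf-8') as f:
        text = f.read()

    # Filling: work only with str.
    processed = text.upper()

    # Top slice: encode str to bytes as late as possible.
    with open(dst, 'w', encoding='utf-8') as f:
        f.write(processed)

    # Inspect the raw bytes that were written.
    with open(dst, 'rb') as f:
        written = f.read()

print(processed, written)
```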

Normalizing text for matching

unicodedata.normalize(form,unistr)

normalize

import unicodedata


def nfc_equal(s1, s2):
    print(unicodedata.normalize('NFC', s1))
    print(unicodedata.normalize('NFC', s2))
    return unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)


def fold_equal(s1, s2):
    print(unicodedata.normalize('NFC', s1).casefold())
    print(unicodedata.normalize('NFC', s2).casefold())
    return unicodedata.normalize('NFC', s1).casefold() == unicodedata.normalize('NFC', s2).casefold()


s1 = 'café'
s2 = 'cafe\u0301'
print(s1, s2)  # café café
print(nfc_equal(s1, s2))
# café
# café
# True
print(nfc_equal('A', 'a'))
# A
# a
# False
print(fold_equal('A', 'a'))
# a
# a
# True
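As a complement to the comparison helpers above, the following sketch shows what the different `form` values do to the string itself. The sample strings are assumptions chosen for illustration; NFD, NFKC and the `ß` example do not appear in the original notes.

```python
import unicodedata

composed = 'caf\u00e9'      # é as one code point
decomposed = 'cafe\u0301'   # e followed by a combining acute accent

# The two strings look identical but differ in length and compare unequal.
lengths = (len(composed), len(decomposed))
raw_equal = composed == decomposed

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)

# NFKC also replaces compatibility characters, such as the 'fi' ligature.
ligature = unicodedata.normalize('NFKC', '\ufb01')

# casefold() is more aggressive than lower(): German sharp s becomes 'ss'.
folded = '\u00df'.casefold()
lowered = '\u00df'.lower()

print(lengths, raw_equal, len(nfc), len(nfd), ligature, folded, lowered)
```

Because NFKC and NFKD can change the meaning of text, they are better suited to searching and indexing than to storage.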
