Python Cookbook: Chapter 2 (Personal Notes)

Gozen Sanji

Categories: Personal Notes, Advanced Python

Article link: https://blog.csdn.net/JayChang9/article/details/108397323

Chapter 2: Strings and Text

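2.1 Splitting a string on multiple delimiters: re.split(), capturing and non-capturing groups

Suppose you need to split a string into fields, but the delimiter is not fixed. The split() method of str objects only handles very simple cases: it does not allow multiple delimiters or a variable amount of whitespace around them. When you need more flexibility, use re.split():

import re

line = 'asdf fjdk; afed, fjek,asdf, foo'
items = re.split(r'[;,\s]\s*', line)    # [;,\s] matches ';', ',' or a single whitespace character
print(items)    # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

The RegEx above says the delimiter can be a semicolon, a comma, or a whitespace character, followed by any amount of extra whitespace. Like str.split, re.split returns a list.

When using re.split, pay attention to whether the RegEx contains a capturing group in parentheses. If it does, the matched delimiter text is also included in the result list: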

import re

line = 'asdf fjdk; afed, fjek,asdf, foo'
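fields = re.split(r'(;|,|\s)\s*', line)    # uses a capturing group (); | in a RegEx means "or"
print(fields)    # ['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

Getting the delimiters back is sometimes useful. For example, you may want to keep them so you can rebuild an output string:

# 1. Take every non-delimiter element of fields
values = fields[::2]
# 2. Take every delimiter in fields
delimiters = fields[1::2] + ['']    # append a list holding one empty string so the lengths match (see zip below)

print(f"values = {values}")
print(f"delimiters = {delimiters}")

# 3. Rebuild the string with the original delimiters (the whitespace after each delimiter is lost)
s = ''.join(v+d for v, d in zip(values, delimiters))
print(s)

If you don't want the delimiters in the result list but still need parentheses to group parts of the RegEx, make the group non-capturing:

fields_ = re.split(r'(?:,|;|\s)\s*', line)    # non-capturing group: (?:...)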
print(fields_)
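The rebuilt string in this section loses whitespace because the trailing \s* sits outside the capturing group. A minimal sketch of one way around that, assuming an exact round trip is wanted: capture the whole delimiter, trailing whitespace included. The helper name split_keep_delims is made up for illustration.

```python
import re

def split_keep_delims(line):
    """Split on ';', ',' or whitespace, keeping each full delimiter."""
    # The capturing group wraps the delimiter AND its trailing whitespace,
    # so no character of the input is discarded.
    fields = re.split(r'([;,\s]\s*)', line)
    values = fields[::2]        # text between delimiters
    delimiters = fields[1::2]   # the delimiters themselves
    return values, delimiters

line = 'asdf fjdk; afed, fjek,asdf, foo'
values, delimiters = split_keep_delims(line)
# Pad the delimiters with one empty string so zip() pairs every value.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters + ['']))
print(values)
print(rebuilt == line)
```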

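2.2 Matching text at the start or end of a string: startswith / endswith

To check the start or end of a string against a given text pattern, such as a filename extension or a URL scheme, use the startswith and endswith methods: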

filename = "spam.txt"
url = 'http://www.python.org'

print(filename.startswith('file:'))     # False
print(filename.endswith('.txt'))        # True
print(url.startswith('http:'))          # True
print(url.endswith('.com'))             # False

To check against several candidates at once, put them all in a tuple and pass it to startswith / endswith:

import os
# 1. os.listdir() returns the names of the files and directories under the given path
filenames = os.listdir('..')    # here: the parent of the current directory
print(filenames)

# 2. List comprehension: files ending in .py or .json
result_1 = [name for name in filenames if name.endswith(('.py', '.json'))]    # note that the argument is a tuple
# 3. Is there any file ending in .py?
result_2 = any(name.endswith('.py') for name in filenames)

print(result_1, result_2, sep="\n")
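The tuple requirement is easy to trip over when the candidates are stored in a list or set. A small sketch of the failure and the fix; the helper starts_with_any and the list of schemes are assumptions made for this example.

```python
def starts_with_any(text, prefixes):
    """Return True if text starts with any of the given prefixes."""
    # startswith() accepts a str or a tuple of str, so convert first.
    return text.startswith(tuple(prefixes))

choices = ['http:', 'https:', 'ftp:']   # hypothetical list of accepted schemes
url = 'http://www.python.org'

try:
    url.startswith(choices)             # passing the list directly is rejected
except TypeError as exc:
    print(f"TypeError: {exc}")

print(starts_with_any(url, choices))
```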

2.3 Matching strings with shell wildcards: fnmatch & fnmatchcase

To match strings with the wildcards commonly used in Unix shells, the fnmatch module provides two functions, fnmatch and fnmatchcase:

from fnmatch import fnmatch, fnmatchcase

print(fnmatch('foo.txt', '*.txt'))			# True
print(fnmatch('foo.txt', '?oo.txt'))		# True
print(fnmatch('Dat45.csv', 'Dat[0-9]*'))	# True

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

for name_ in (name for name in names if fnmatch(name, 'Dat[0-9].csv')):    # genExpr
    print(name_)

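fnmatch matches patterns using the case-sensitivity rules of the underlying operating system, so results can differ between platforms. fnmatchcase avoids this by always comparing case-sensitively:

print(fnmatch('foo.txt', '*.TXT'))      # True on Windows, False on Linux and macOS
print(fnmatchcase('foo.txt', '*.TXT'))  # False

An often overlooked feature of these two functions is that they are also useful on strings that are not filenames.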

addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
            ]

for x in (addr for addr in addresses if fnmatchcase(addr, '* ST')):
    print(f"■ - {x}")

print("~~~~~~~~~~~~~~~~~~~~~")

for x in (addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')):
    print(f"● - {x}")

If you need to match actual filenames on disk, the glob module is a better choice; see section 5.13.
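For filtering a whole list of names, the fnmatch module also has a filter() function that saves writing the generator expression by hand. A short sketch, reusing the names list from this section; like fnmatch(), filter() follows the platform's case rules, so the comprehension with fnmatchcase() is shown as the case-sensitive counterpart.

```python
import fnmatch

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# fnmatch.filter() returns the names that match the pattern, using the
# same platform-dependent case rules as fnmatch.fnmatch().
csv_files = fnmatch.filter(names, 'Dat[0-9].csv')
print(csv_files)

# A case-sensitive equivalent built on fnmatchcase():
py_files = [n for n in names if fnmatch.fnmatchcase(n, '*.py')]
print(py_files)
```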

2.4 Matching and searching text: str.find / match & findall & finditer

To match a literal string, the basic string methods are enough: str.find, str.endswith, str.startswith and the like:

text = 'yeah, but no, but yeah, but no, but yeah'

# 1. Exact match
print(text == "yeah")    # False

# 2. Match at start or end
print(text.startswith('yeah'), text.endswith('no'), sep='\n')    # True, False

# 3. Search for the location of the first occurrence
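print(text.find('no'))    # 10: the first 'no' starts at index 10

For more complex matching, use regular expressions and the re module. If you will match the same pattern many times, precompile the pattern string into a pattern object first:

import re

date_pattern = re.compile(r'\d+/\d+/\d+')    # \d+ means one or more digits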
text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

print('match' if date_pattern.match(text1) else 'no match')
print('match' if date_pattern.match(text2) else 'no match')

# Remark: match() always matches at the start of the string. To find every occurrence of the pattern anywhere in the string, use findall() instead
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print(date_pattern.findall(text))    # ['11/27/2012', '3/13/2013']

When defining a RegEx it is common to add capturing groups with parentheses, because each group's content can then be extracted separately, which simplifies later processing:

import re

date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')    # () creates a capturing group
m = date_pattern.match('12/25/2020')
print(m)    # <re.Match object; span=(0, 10), match='12/25/2020'>

# 1. Extract the contents of each group
print(m.group(0))    # the whole match: 12/25/2020
print(m.group(1))    # first group: 12
print(m.group(2))    # second group: 25
print(m.group(3))    # third group: 2020
print(m.groups())    # all groups as a tuple: ('12', '25', '2020')

# 2. Find all matches (because date_pattern has capturing groups, each element of the returned list is a tuple)
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print(date_pattern.findall(text))    # [('11', '27', '2012'), ('3', '13', '2013')]

for month, day, year in date_pattern.findall(text):
    print('{}-{}-{}'.format(year, month, day))    # print as year-month-day

findall searches the text and returns all matches as a list. To get the matches one at a time instead, use finditer:

for m in date_pattern.finditer(text):    # finditer yields match objects lazily
    print(m.groups())

It is common practice to write RegEx patterns as raw strings, such as r'(\d+)/(\d+)/(\d+)'. In a raw string, backslashes are left uninterpreted.
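One thing to watch with match(): it only anchors at the start, so trailing text after a valid date is silently accepted. A small sketch of requiring the whole string to match with fullmatch(); the helper name is_exact_date is an assumption for this example.

```python
import re

date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

def is_exact_date(text):
    """Return True only if the entire string looks like 11/27/2012."""
    return date_pattern.fullmatch(text) is not None

# match() succeeds because the string merely starts with a date.
print(date_pattern.match('11/27/2012abcdef') is not None)
# fullmatch() rejects the trailing text.
print(is_exact_date('11/27/2012abcdef'))
print(is_exact_date('11/27/2012'))
```

Ending the pattern with $ and calling match() is an older way to get a similar effect.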

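2.5 Searching and replacing text: re.sub & re.subn & substitution callbacks

To search for a text pattern in a string and replace it: for simple literal patterns, str.replace is enough. For more complex patterns, use the sub function from the re module:

import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')    # precompiling helps when the pattern is reused many times

text_ = date_pattern.sub(r'\3-\1-\2', text)    # a backslash plus digit (e.g. \3) refers to that capturing group of the matched pattern
print(text_)

# Alternative form: the first argument of re.sub() is the pattern to match, the second is the replacement
print(re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))

If you also want to know how many substitutions were made, use re.subn:

new_text, n = date_pattern.subn(r'\3-\1-\2', text)
print(f"{n} substitutions were made")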

For more complex substitutions, pass a callback function as the replacement argument:

import re
from calendar import month_abbr


def change_date(match):
    """
    Substitution callback.
    :param match: a RegEx match object, as returned by match() / search()
    :return: the replacement string
    """
    month_name = month_abbr[int(match.group(1))]    # note the square brackets: month_abbr is indexed, not called
    return '{} {} {}'.format(match.group(2), month_name, match.group(3))


text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
date_pattern = re.compile(r'(\d+)/(\d+)/(\d+)')

# Match date_pattern against text and replace each match with whatever change_date returns
text_ = date_pattern.sub(change_date, text)    # here the replacement argument of sub() is a callback function
print(text_)
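Numbered references like \3 get hard to read once a pattern has more than a few groups. A sketch of the same date rewrite with named groups, where the replacement refers to groups as \g<name>; the group names month, day and year are choices made for this example.

```python
import re

# (?P<name>...) gives each capturing group a name.
date_pattern = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# \g<name> in the replacement inserts the text of the named group.
iso_text = date_pattern.sub(r'\g<year>-\g<month>-\g<day>', text)
print(iso_text)
```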

2.6 Case-insensitive search and replace: re.IGNORECASE

To search and replace text while ignoring case, pass the re.IGNORECASE flag to the re operations:

import re

text = 'UPPER PYTHON, lower python, Mixed Python'
# 1. flags=re.IGNORECASE makes the match ignore case differences
res = re.findall('python', text, flags=re.IGNORECASE)
print(res)

# 2. The same works for substitution
res_ = re.sub('python', 'java', text, flags=re.IGNORECASE)
print(res_)

The re.sub() example above has a small flaw: the replacement does not automatically follow the case of the matched text. Fixing that takes a helper function:

def matchcase(word):
    """ matchcase defines an inner function and returns it as the substitution callback """
    def replace(match):
        # 1. Get the matched text
        text = match.group()
        if text.isupper():    # 2. all cased characters are uppercase
            return word.upper()
        elif text.islower():    # 3. all cased characters are lowercase
            return word.lower()
        elif text[0].isupper():    # 4. the first character is uppercase
            return word.capitalize()
        else:    # anything else (e.g. digits, which have no case)
            return word
    return replace    # return the function object itself, without calling it


# matchcase('java') returns the callback replace (whose argument must be a match object)
res__ = re.sub('python', matchcase('java'), text, flags=re.IGNORECASE)
print(res__)

2.7 Shortest match (greedy vs. non-greedy): .* or .*?

Matching with (.*) can give an unexpected result, as in this example:

import re

str_pattern = re.compile(r'\"(.*)\"')    # match the text between a pair of double quotes
text1 = 'Computer says "no."'
text2 = 'Computer says "no." Phone says "yes."'

print(str_pattern.findall(text1))    # ['no.']
print(str_pattern.findall(text2))    # ['no." Phone says "yes.'], not what was intended

The * operator in a RegEx is greedy: it matches as much as it can, which is sometimes not what you want. To fix this, match in non-greedy mode:

str_pattern_2 = re.compile(r'\"(.*?)\"')    # a ? after * or + makes it non-greedy
print(str_pattern_2.findall(text2))    # the intended result: ['no.', 'yes.']
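Another common way to get the same result, sketched here as an alternative rather than as the book's approach, is to stay greedy but forbid the closing delimiter inside the match with a negated character class.

```python
import re

# [^"]* matches any run of characters that are not a double quote,
# so the match can never run past the closing quote.
quoted = re.compile(r'"([^"]*)"')

text2 = 'Computer says "no." Phone says "yes."'
found = quoted.findall(text2)
print(found)
```

This form only works when the closing delimiter is a single character; for multi-character delimiters such as */ the non-greedy .*? is simpler.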

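2.8 Matching across multiple lines: the re.DOTALL flag

When using a RegEx to match a large block of text, the match may need to span several lines. A plain dot (.) cannot do this, because by default it does not match a newline.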

import re

text1 = '/* this is a comment */'
text2 = '''
/* this is a 
multiline comment */ 
'''
comment_pattern = re.compile(r'/\*(.*?)\*/')    # \* matches a literal asterisk

res1 = comment_pattern.findall(text1)
res2 = comment_pattern.findall(text2)    # nothing matches here!
print(res1, res2, sep='\n')

There are two ways to fix the empty res2. The first: change the pattern string to add support for newlines.

comment_pattern_2 = re.compile(r'/\*((?:.|\n)*?)\*/')    # non-capturing group: (.) or (\n)
res3 = comment_pattern_2.findall(text2)
print(res3)    # harder to read, but holds up better in complex patterns

The second: pass the re.DOTALL flag to re.compile, which makes the dot (.) in a RegEx match any character, including a newline.

comment_pattern_3 = re.compile(r'/\*(.*?)\*/', re.DOTALL)    # the flag makes . match newlines too
res4 = comment_pattern_3.findall(text2)
print(res4)    # works well in simple cases, but may cause problems in complex patterns
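The DOTALL behavior can also be switched on inside the pattern string itself with the inline flag (?s), which is handy when the pattern is stored as plain text (in a config file, say) and there is no place to pass flags. A minimal sketch with a sample comment made up for this example:

```python
import re

text = '/* line one\nline two */'

# (?s) at the very start of the pattern enables DOTALL for this pattern.
inline_pattern = re.compile(r'(?s)/\*(.*?)\*/')
flag_pattern = re.compile(r'/\*(.*?)\*/', re.DOTALL)

inline_result = inline_pattern.findall(text)
print(inline_result)
print(inline_result == flag_pattern.findall(text))
```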

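2.9 Normalizing Unicode text: unicodedata.normalize() / unicodedata.combining()

In Unicode, some characters can be represented by more than one valid sequence of code points. When working with Unicode strings you therefore need to make sure all of them use the same underlying representation.

import unicodedata

s1 = 'Spicy Jalape\u00f1o'     # precomposed character (U+00F1)
s2 = 'Spicy Jalapen\u0303o'    # 'n' followed by a combining tilde (U+0303)
print(s1, len(s1), s2, len(s2), sep='\n')    # the two strings look the same but are not equal
print(s1 == s2)    # False

# Mixed representations cause trouble in programs that compare strings. To fix this, normalize the text with the unicodedata module:
t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2)    # True
print(ascii(t1), ascii(t2))    # shows that the characters were normalized

t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)    # True
print(ascii(t3), ascii(t4))    # the other normal form

"""
	The first argument of normalize() selects the normalization form:
		1. NFC: characters are fully composed (a single code point where possible)
		2. NFD: characters are fully decomposed into base characters plus combining characters.
"""

Python also supports the compatibility normalization forms NFKC and NFKD, which add extra compatibility handling for certain characters: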

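s = '\ufb01'    # a single character
print(s)    # ﬁ
print(unicodedata.normalize('NFC', s))    # len = 1
print(unicodedata.normalize('NFD', s))    # len = 1

# The compatibility forms split the single character into two
print(unicodedata.normalize('NFKC', s))    # len = 2
print(unicodedata.normalize('NFKD', s))    # len = 2

Normalization matters for any program that has to process Unicode text consistently, especially when the strings come from user input and you have little control over their encoding. It also matters when sanitizing and filtering text. For example, you may want to strip the diacritical marks from some text (perhaps for searching and matching):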

import unicodedata

s1 = 'Spicy Jalape\u00f1o'
t1 = unicodedata.normalize('NFD', s1)

# combining() tests whether a character is a combining character
de_hat = ''.join(c for c in t1 if not unicodedata.combining(c))

# Note that the second output has no diacritical mark
print(t1, de_hat, sep='\t')
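In practice the normalization step usually ends up wrapped in a small comparison helper so that callers cannot forget it. A minimal sketch; the function name same_text and its form parameter are assumptions made for this example.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

s1 = 'Spicy Jalape\u00f1o'     # precomposed n-with-tilde
s2 = 'Spicy Jalapen\u0303o'    # 'n' plus combining tilde

print(s1 == s2)                            # raw comparison
print(same_text(s1, s2))                   # comparison after NFC
print(same_text('\ufb01', 'fi'))           # the fi ligature under NFC
print(same_text('\ufb01', 'fi', 'NFKC'))   # the fi ligature under NFKC
```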

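2.10 Using Unicode in regular expressions

I only took away one thing from this section: it is best not to mix Unicode and RegEx unless you have a third-party regex library to help. (I'll come back and fill in the rest once I know more.)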
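As a small illustration of why the mix can surprise, here is a sketch of one behavior of the standard re module: with str patterns, \d matches any Unicode decimal digit, not only 0-9, unless the re.ASCII flag is given. The sample string of Arabic-Indic digits is chosen for this example.

```python
import re

unicode_digits = re.compile(r'\d+')            # Unicode-aware by default for str patterns
ascii_digits = re.compile(r'\d+', re.ASCII)    # restrict \d to 0-9

arabic = '\u0661\u0662\u0663'                  # Arabic-Indic digits one, two, three

print(unicode_digits.match('123') is not None)
print(unicode_digits.match(arabic) is not None)
print(ascii_digits.match(arabic) is not None)
```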

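2.11 Stripping unwanted characters from strings: strip / lstrip / rstrip / str.replace / re.sub

strip removes characters from both the start and the end of a string; lstrip and rstrip strip only from the left and only from the right. By default these methods remove whitespace, but you can specify other characters to remove.

# 1. Whitespace stripping
s = ' hello world \n'
print(s.strip(), s.rstrip(), s.lstrip(), sep='\n')

# 2. Character stripping
t = '-----hello====='
print(t.rstrip('='), t.lstrip('-'), t.strip('-='), sep='\n')    # the argument of strip() is a set of characters, not a prefix or suffix

To deal with whitespace in the middle of a string you need another technique, such as replace or re.sub: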

import re

s = 'hello    world'
print(s.replace(' ', ''))    # helloworld
print(re.sub(r'\s+', ' ', s))    # hello world

If you want to combine stripping with some other iteration, such as reading lines from a file, a generator expression (genExpr) is a good fit:

def read_pretty_lines(filename):
    with open(filename, 'r', encoding='utf-8') as f:
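        # f can be iterated over directly; there is no need for contents = f.readlines()
        lines = (line.strip() for line in f)    # genExpr
        for line in lines:
            print(line)

"""
	Here the expression lines = (line.strip() for line in f) acts as a data transform.
	It is efficient because it never reads all the data into a temporary list first;
	it only creates a generator that applies strip() to each line before yielding it.
"""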

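2.12 Mapping characters to other characters: str.translate() / dict.fromkeys()

To strip the diacritical marks from text such as pýtĥöñ, you can use the often overlooked str.translate() method:

s = 'pýtĥöñ\fis\tawesome\r\n'
print(s)

# 1. First clean up the whitespace. Build a translation table and pass it to translate():
re_map = {
    ord('\t'): ' ',
    ord('\f'): ' ',
    ord('\r'): None,    # Deleted
}    # the keys of a translate() table must be code points (integers), obtained here with ord()

a = s.translate(re_map)
print(a)

# 2. Building on this idea, create a much larger table that deletes every combining character: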
# ------------------------------------------------------------------->
import sys
import unicodedata

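# dict.fromkeys() builds a new dict whose keys are the elements of the given iterable, with every value set to None
# sys.maxunicode is an integer: the largest code point supported
# chr(i) returns the one-character string whose code point is the integer i
cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))

# Use unicodedata.normalize to convert the input to its decomposed form
b = unicodedata.normalize('NFD', a)
print(a, b, sep='\t')
print(ascii(a), ascii(b), sep='\n')    # they look the same, but the underlying characters differ

# Call translate to delete all the combining marks
c = b.translate(cmb_chrs)
print(c)

Remark: ord() takes a one-character string and returns its Unicode code point as an integer (for ASCII characters, this is the ASCII value). Passing anything other than a single character raises a TypeError.

There are two examples on page 64 that I don't fully understand yet; I'll come back and add them later.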
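A related technique that often appears alongside this recipe, sketched here without claiming it is what the page 64 examples show: decompose the text, then encode to ASCII while ignoring whatever cannot be encoded. It only makes sense when the end goal really is plain ASCII. The helper name to_ascii is an assumption.

```python
import unicodedata

def to_ascii(text):
    """Drop diacritics (and any other non-ASCII character) from text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # errors='ignore' silently discards everything outside ASCII,
    # which removes the combining marks left over from the decomposition.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

sample = 'p\u00fdt\u0125\u00f6\u00f1 is awesome'   # "python" with accents, as escapes
print(to_ascii(sample))
```

Note that this also discards characters that have no ASCII base letter at all, so it is lossier than the translate() table.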

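The main concern when sanitizing text is usually runtime performance. ① For simple substitutions, str.replace() is usually the fastest approach, even when you have to call it several times:

def clean_space(s):
    """
        For simple substitutions, this is still much faster than translate() or a RegEx.
        str.replace() returns a new string, so each result must be assigned back to s.
    """
    s = s.replace('\r', '')
    s = s.replace('\t', ' ')
    s = s.replace('\f', ' ')
    return s

② If you need any nontrivial character-to-character remapping, translate() is very fast.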
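When building a translation table by hand, str.maketrans() saves the ord() calls. A short sketch that expresses the same whitespace cleanup as a table; the sample string is made up for this example.

```python
# str.maketrans() accepts one-character string keys and converts them
# to the integer code points that translate() expects.
table = str.maketrans({'\t': ' ', '\f': ' ', '\r': None})   # None deletes the character

s = 'python\fis\tawesome\r\n'
cleaned = s.translate(table)
print(repr(cleaned))
```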

2.13 Aligning strings: ljust() & rjust() & center() / format()

When you need to align strings, the following approaches work:

# >>>>>> 1. For basic alignment, use the string methods ljust(), rjust() and center():
text = 'Hello World'
print(f"[{text.ljust(20)}]")    # left-aligned
print(f"[{text.rjust(20)}]")    # right-aligned
print(f"[{text.center(20)}]")   # centered

# All of these methods accept an optional fill character:
print(f"[{text.ljust(20, '-')}]")
print(f"[{text.center(20, '*')}]")


# >>>>>> 2. The format() function can also align strings:
print(f"[{format(text, '>20')}]")    # > means right-aligned
print(f"[{format(text, '<20')}]")    # < means left-aligned
print(f"[{format(text, '^20')}]")    # ^ means centered

# To use a fill character other than a space, put it in front of the alignment character:
print(f"[{format(text, '=>20')}]")
print(f"[{format(text, '%^20')}]")

# The same format codes work in str.format() when formatting several values:
print(f"[{'{:<10s} {:>10s}'.format('Hello', 'World')}]")    # the colon (:) before each format spec is required

One benefit of format() is that it is not limited to strings. It works on any value, which makes it very general:

x = 1.2345    # float
print(f"[{format(x, '*>10')}]")
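print(f"[{format(x, '^10.2f')}]")    # this one also limits the number of decimal places

In older code you will often see the % operator used to format text. In new code you should prefer format(), because it is more powerful than the % operator, and it is also more general than ljust(), rjust() and center() (it can format arbitrary objects, not only strings).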
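The same format codes can be written directly inside an f-string after a colon, which is convenient for lining up columns. A minimal sketch; the row data and the column widths are invented for this example.

```python
def format_row(name, shares, price):
    """Format one table row: name left-aligned, numbers right-aligned."""
    # <8 pads the name to 8 columns, >6d and >10.2f right-align the numbers.
    return f'{name:<8}{shares:>6d}{price:>10.2f}'

rows = [('ACME', 50, 91.1), ('IBM', 100, 123.45)]   # hypothetical data
for row in rows:
    print(format_row(*row))
```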

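2.14 Combining and concatenating strings: join() / (+)

When you want to combine several small strings into one larger string:

# 1. If the strings are in a sequence or other iterable, the fastest way is join():
parts = ['Is', 'Chicago', 'Not', 'Chicago?']
print(' '.join(parts))
print(','.join(parts))
print(''.join(parts))

# 2. If you only need to combine a few strings, the plus sign (+) is usually good enough:
a = 'Is Chicago'
b = 'Not Chicago?'
print(a + ' ' + b)              # the (+) operator
print('{} {}'.format(a, b))     # string formatting (same result as above)

# 3. To combine two string literals in source code, simply place them next to each other:
a = 'Hello' 'World'    # no operator at all
print(a)

Joining a large number of strings with the plus (+) operator is very inefficient, because of the memory copies and garbage collection it triggers. In particular, never write string concatenation like this:

s = ''
for p in parts:    # very inefficient for a large number of parts
    s += p

This runs slower than join(), because every (+=) creates a new string object.

# 4. A neat trick is to use a genExpr to convert data to strings and join them in one step:
data = ['ACME', 50, 91.1]
print(','.join(str(d) for d in data))    # the argument of join() is a genExpr

# 5. Also watch out for unnecessary string concatenation:
a, b, c = 'ABC'
print(a + ':' + b + ':' + c)    # Ugly
print(':'.join([a, b, c]))      # Still ugly
print(a, b, c, sep=':')         # Better

When mixing I/O operations with string concatenation, you need to look at your program carefully. For example, consider the following two code fragments: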

# Version 1 (string concatenation) 
f.write(chunk1 + chunk2)

# Version 2 (separate I/O operations) 
f.write(chunk1)
f.write(chunk2)

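If both strings are small, the first version is likely to perform better, because I/O system calls are inherently expensive. If both strings are large, the second version may be more efficient, because it avoids building a large temporary result and copying large blocks of memory.

I don't fully understand the last example in this section yet.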
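One pattern that fits this trade-off, sketched here as a guess at the general idea rather than a reproduction of the book's example: produce the output as a generator of small fragments, and let a second generator buffer them into larger chunks before any I/O happens. The names sample and combine and the maxsize threshold are assumptions.

```python
def sample():
    """Yield text fragments one at a time instead of building one string."""
    yield 'Is'
    yield 'Chicago'
    yield 'Not'
    yield 'Chicago?'

def combine(source, maxsize):
    """Group fragments into chunks, flushing once a chunk exceeds maxsize."""
    parts = []
    size = 0
    for part in source:
        parts.append(part)
        size += len(part)
        if size > maxsize:
            yield ''.join(parts)
            parts = []
            size = 0
    if parts:                      # flush whatever is left at the end
        yield ''.join(parts)

for chunk in combine(sample(), 10):
    print(chunk)                   # a real program might call f.write(chunk) here
```

The producer never needs to know how its fragments will be assembled; the caller decides between join(), buffered writes, or something else.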

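2.15 Interpolating variables in strings: str.format() / str.format_map() / vars() / __missing__()

At the time the book was written, Python had no direct support for substituting variable values into a string (f-strings, added in Python 3.6, now cover the simple cases). The str.format() method handles it like this:

s = '{name} has {n} messages.'
print(s.format(name='Jay', n=16))

If the values to be substituted are found in variables in the current scope, you can combine format_map with vars:

s = '{name} has {n} messages.'
name, n = 'Jay', 16
print(s.format_map(vars()))    # same result as above

Remark: format_map(mapping) is similar to str.format(**mapping), except that the mapping is used directly instead of being copied into a new dict, which is what lets a dict subclass with __missing__() work. vars(obj) returns the dict of an object's attributes and their values; called with no argument, it returns the names and values at the call site, like locals().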

class Info:
    def __init__(self, name, n):
        self.name = name
        self.n = n


a = Info('Doris', 27)    # create an instance
print(vars(a))    # vars(a) returns the instance's attributes and values as a dict: {'name': 'Doris', 'n': 27}
print(s.format_map(vars(a)))

One drawback of format() and format_map() is that they do not cope well with missing values:

print(s.format(name='Doris'))    # raises KeyError: 'n'

# One way to avoid this error is to define a dict subclass with a __missing__() method:


class safesub(dict):
    """ A dict that tolerates missing keys """
    # The little-known __missing__() method of mapping / dict classes lets you define how missing keys are handled
    def __missing__(self, key):
        # Here __missing__() returns the missing key as a placeholder, which avoids the KeyError
        return '{' + key + '}'


del n    # Make sure n is undefined
print(s.format_map(safesub(vars())))    # Jay has {n} messages.

I don't fully understand the example that follows on page 71.
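A sketch of how the safesub idea can be hidden behind a small utility function that reads the caller's local variables. This relies on sys._getframe(), a CPython implementation detail, so treat it as an illustration rather than a recommendation; the names SafeSub, sub and demo are chosen for this example.

```python
import sys

class SafeSub(dict):
    """A dict that returns '{key}' for keys it does not contain."""
    def __missing__(self, key):
        return '{' + key + '}'

def sub(text):
    # sys._getframe(1) is the stack frame of whoever called sub();
    # its f_locals mapping holds that caller's local variables.
    return text.format_map(SafeSub(sys._getframe(1).f_locals))

def demo():
    name = 'Guido'
    n = 37
    return (
        sub('Hello {name}'),
        sub('You have {n} messages.'),
        sub('Your favorite color is {color}'),   # color is not defined
    )

for line in demo():
    print(line)
```

Copying f_locals into a new SafeSub means the function can read the caller's variables but cannot modify them.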

2.16 Reformatting text to a fixed column width: textwrap.fill()

You have long strings that you want to reformat to a given column width. Use the textwrap module to format the output:

import textwrap

s = """Look into my eyes, look into my eyes, the eyes, the eyes,
the eyes, not around the eyes, don't look around the eyes,
look into my eyes, you're under."""

print(textwrap.fill(s, 60))    # wrap to a width of 60 columns
print(textwrap.fill(s, 40))
print(textwrap.fill(s, 40, initial_indent='    '))    # initial_indent sets the indent of the first line
print(textwrap.fill(s, 40, subsequent_indent='    '))    # subsequent_indent sets the indent of every line after the first
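Instead of hard-coding the width, the text can be wrapped to the current terminal. A minimal sketch using shutil.get_terminal_size(), which falls back to a default when no terminal is attached; the helper name wrap_to_terminal and the 80-column fallback are assumptions.

```python
import shutil
import textwrap

def wrap_to_terminal(text, fallback_width=80):
    """Wrap text to the terminal width, or to fallback_width if unknown."""
    # get_terminal_size() returns a (columns, lines) named tuple and uses
    # the fallback when the size cannot be determined (e.g. output is piped).
    width = shutil.get_terminal_size(fallback=(fallback_width, 24)).columns
    return textwrap.fill(text, width)

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")
print(wrap_to_terminal(s))
```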

2.17 Handling HTML and XML in strings: ?

I didn't understand this section; skipping it for now.
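For orientation, a minimal sketch of the most basic part of this topic using the standard html module: escaping special characters before putting text into markup, and turning entities back into characters. The sample strings are made up, and this is not a summary of the book's section.

```python
import html

s = 'Elements are written as "<tag>text</tag>".'

escaped = html.escape(s)                  # also escapes quotes by default
escaped_keep_quotes = html.escape(s, quote=False)
print(escaped)
print(escaped_keep_quotes)

# html.unescape() converts named and numeric entities back to characters.
restored = html.unescape('Spicy &quot;Jalape&#241;o&quot;.')
print(restored)
```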

2.18 Tokenizing text: ?

I didn't understand this section; skipping it for now.
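For orientation, a rough sketch of what tokenizing means: turning a string into a stream of (type, value) pairs with one RegEx made of named groups. The token names and the tiny grammar are invented for this example, and it uses finditer() with lastgroup, which may differ from the book's approach.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

# One named group per token type; the group that matched names the token.
MASTER_PATTERN = re.compile('|'.join([
    r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)',
    r'(?P<NUM>\d+)',
    r'(?P<PLUS>\+)',
    r'(?P<TIMES>\*)',
    r'(?P<EQ>=)',
    r'(?P<WS>\s+)',
]))

def generate_tokens(text):
    """Yield Token(type, value) pairs, skipping whitespace."""
    # Caveat: finditer() silently skips characters that match no group,
    # so this sketch does not report illegal input.
    for m in MASTER_PATTERN.finditer(text):
        if m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())

for tok in generate_tokens('foo = 23 + 42 * 10'):
    print(tok)
```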

2.19 Writing a simple recursive descent parser: ?

I didn't understand this section; skipping it for now.
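For orientation, a compact sketch of the recursive descent idea: each grammar rule becomes one method, and the methods call each other following the grammar. The grammar below (integers, + - * / and parentheses) and every name in it are assumptions made for this example, not the book's code.

```python
import re

# Grammar:
#   expr   ::= term   { ('+' | '-') term }*
#   term   ::= factor { ('*' | '/') factor }*
#   factor ::= NUM | '(' expr ')'

def tokenize(text):
    # Integers and single-character operators; anything else is ignored.
    return re.findall(r'\d+|[-+*/()]', text)

class ExpressionEvaluator:
    def parse(self, text):
        self.tokens = tokenize(text)
        self.pos = 0
        value = self.expr()
        if self.pos != len(self.tokens):
            raise SyntaxError('unexpected token: ' + self.tokens[self.pos])
        return value

    def _peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def _advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self._peek() in ('+', '-'):
            op = self._advance()
            right = self.term()
            value = value + right if op == '+' else value - right
        return value

    def term(self):
        value = self.factor()
        while self._peek() in ('*', '/'):
            op = self._advance()
            right = self.factor()
            value = value * right if op == '*' else value / right
        return value

    def factor(self):
        tok = self._peek()
        if tok is None:
            raise SyntaxError('unexpected end of input')
        if tok.isdigit():
            return int(self._advance())
        if tok == '(':
            self._advance()
            value = self.expr()
            if self._peek() != ')':
                raise SyntaxError('expected )')
            self._advance()
            return value
        raise SyntaxError('expected a number or (')

evaluator = ExpressionEvaluator()
print(evaluator.parse('2 + 3 * 4'))
print(evaluator.parse('2 + (3 + 4) * 5'))
```

Operator precedence falls out of the structure: term() is called from expr(), so multiplication binds tighter than addition without any explicit precedence table.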

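2.20 Performing text operations on byte strings: how byte strings and text strings compare

You want to perform common text operations (such as stripping, searching and replacing) on byte strings. Byte strings support most of the same built-in operations as text strings.

data = b'Hello World'    # this is a byte string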
# 1. slice
print(data[0:5])
# 2. startswith / endswith
print(data.startswith(b'Hello'))
# 3. split
print(data.split(b' '))
# 4. replace
print(data.replace(b'Hello', b'Goodbye'))

The same operations also work on byte arrays:

data2 = bytearray(b'Hello World')    # create a bytearray object
# 1. slice
print(data2[0:5])
# 2. startswith / endswith
print(data2.startswith(b'Hello'))
# 3. split
print(data2.split(b' '))
# 4. replace
print(data2.replace(b'Hello', b'Goodbye'))

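# You can match byte strings with a RegEx, but the pattern itself must also be a byte string
import re

data3 = b'FOO: BAR, SPAM'

# 1. Wrong: a str pattern on bytes raises TypeError: cannot use a string pattern on a bytes-like object
try:
    print(re.split('[:,]', data3))
except TypeError as e:
    print(e)

# 2. Right: [b'FOO', b' BAR', b' SPAM']
print(re.split(b'[:,]', data3))

Now for the differences between text strings and byte strings:

# 1. Indexing a byte string returns an integer
a = 'Hello World'
print(a[0], a[1])    # characters: H e
b = b'Hello World'
print(b[0], b[1])    # integers: 72 101

# 2. Byte strings have no pretty string representation and do not print cleanly (unless decoded to a text string first)
s = b'Hello World'
print(s)                    # b'Hello World'
print(s.decode('ascii'))    # Hello World

# 3. Byte strings have no format() method: AttributeError: 'bytes' object has no attribute 'format'
try:
    print(b'{}{}{}'.format(b'Doris', 1996, 628))
except AttributeError as e:
    print(e)

# To format a byte string, build a normal text string first and then encode it to bytes
res = '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')
print(res)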
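Since Python 3.5, bytes objects do accept %-style formatting, even though they still have no format() method. A short sketch showing that it produces the same bytes as the format-then-encode approach; the field widths mirror the example in this section.

```python
# %-formatting on bytes: %s needs a bytes-like argument, while %d and %f take numbers.
formatted = b'%-10s %10d %10.2f' % (b'ACME', 100, 490.1)

# The format-then-encode route, for comparison.
via_text = '{:10s} {:10d} {:10.2f}'.format('ACME', 100, 490.1).encode('ascii')

print(formatted)
print(formatted == via_text)
```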

The example on page 87, and a final piece of advice: when working with text, prefer ordinary text strings over byte strings in your program.
