[Python Cookbook] Chapter 2: Strings and Text

Prymce-Q

Last modified: 2022-09-28 22:31:26

First published: 2022-09-28 11:56:03

Article link: https://blog.csdn.net/weixin_47691066/article/details/127042614

Contents

I. Strings
II. Text
Summary

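I. Strings

1.1 Splitting strings

split() handles simple cases, but when you need more flexibility, such as splitting on any of several delimiters, use re.split(). The pattern below splits on a semicolon, a comma, or a whitespace character (\s); the trailing \s* consumes any extra whitespace that follows the delimiter: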

string1 = 'a; b  c, d,   e'
import re
re.split(r'[;,\s]\s*', string1)
['a', 'b', 'c', 'd', 'e']

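If the character class [] in the pattern is replaced with a capture group ( | ), the matched delimiters are included in the result as well: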

string1 = 'a; b  c, d,   e'
import re
fields = re.split(r'(;|,|\s)\s*', string1)
print(fields)
['a', ';', 'b', ' ', 'c', ',', 'd', ',', 'e']

values = fields[::2]
print(values)
['a', 'b', 'c', 'd', 'e']

delimiter = fields[1::2]
print(delimiter)
[';', ' ', ',', ',']
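As a small sketch of what the captured delimiters are good for (the variable names are only illustrative), the fields can be stitched back together with the extra whitespace dropped. If the parentheses are needed for grouping but the delimiters are not wanted in the result, a non-capturing group (?:...) can be used instead:

```python
import re

line = 'a; b  c, d,   e'

# Non-capturing group: same alternation, but delimiters are not returned.
values = re.split(r'(?:;|,|\s)\s*', line)

# Capturing group: delimiters are returned, so the line can be rebuilt.
fields = re.split(r'(;|,|\s)\s*', line)
delimiters = fields[1::2] + ['']          # pad so zip() keeps the last value
rebuilt = ''.join(v + d for v, d in zip(fields[::2], delimiters))

print(values)
print(rebuilt)
```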

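1.2 Matching and searching text in strings

1.2.1 Matching

First, prefix and suffix matching: startswith() and endswith() check whether a string begins or ends with the given text: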

url = 'http://www.python.org'
url.startswith('http')
True
url.endswith('.txt')
False

The next example filters several strings at once and checks each of them against several suffixes in a single call:

filenames = ['http://www.python.org', 'detect.py', 'test.ipynb', 'try.py']
[file1 for file1 in filenames if file1.endswith(('.py', '.ipynb'))]
['detect.py', 'test.ipynb', 'try.py']

Note that when checking against several candidates, they must be passed as a tuple; passing a list raises TypeError.
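A minimal sketch of the tuple requirement, assuming the candidates happen to live in a list (the names choices and is_remote are made up for illustration): convert the list with tuple() before passing it in.

```python
choices = ['http:', 'https:', 'ftp:']
url = 'http://www.python.org'

# Passing the list directly is rejected.
list_rejected = False
try:
    url.startswith(choices)
except TypeError:
    list_rejected = True

# Converting to a tuple first works.
is_remote = url.startswith(tuple(choices))
print(list_rejected, is_remote)
```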

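The fnmatch module matches text against common shell-style wildcards. fnmatch() follows the case-sensitivity rules of the underlying operating system, while fnmatchcase() always compares case exactly as given: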

from fnmatch import fnmatch, fnmatchcase
fnmatch('test.txt', '*.txt')
True

fnmatch('test.txt', '?est.txt')
True

fnmatch('test45.txt', 'test[0-9][0-9]*')
True

fnmatchcase('test.txt', '*.TXT')
False
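The same functions also work on ordinary, non-filename strings. The sketch below uses a made-up list of names and fnmatchcase() so that the result does not depend on the operating system:

```python
from fnmatch import fnmatchcase

# Hypothetical names, used only to illustrate the filtering.
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

csv_files = [name for name in names if fnmatchcase(name, 'Dat*.csv')]
upper_ext = fnmatchcase('foo.PY', '*.py')   # case must match exactly

print(csv_files)
print(upper_ext)
```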

1.2.2 Searching

find() returns the index of the first occurrence of a substring, or -1 if the substring is not found:

string1 = 'I am a student.'
string1.find('a')
2
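A short sketch of the related behaviour: find() accepts an optional start position, which makes it possible to look for the next occurrence, and index() is the variant that raises ValueError instead of returning -1.

```python
s = 'I am a student.'

first = s.find('a')               # first occurrence
second = s.find('a', first + 1)   # continue searching after the first hit
missing = s.find('z')             # -1 when nothing is found

index_raised = False
try:
    s.index('z')                  # index() raises instead of returning -1
except ValueError:
    index_raised = True

print(first, second, missing, index_raised)
```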

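re.compile() builds a compiled regular expression object, which can then be used for matching. Note that match() only tries to match at the start of the string: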

import re
test1 = '11/27/2021'
test2 = 'Nov 27, 2021'
datapat = re.compile(r'\d+/\d+/\d+')

if re.match(datapat, test1):
    print(True)
else:
    print(False)
True

if re.match(datapat, test2):
    print(True)
else:
    print(False)
False
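Because match() only anchors the start, a string with trailing junk still matches. The following sketch (with capture groups added to the pattern and the illustrative name datepat) contrasts match(), fullmatch(), and search():

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

prefix = datepat.match('11/27/2021abc')       # matches the leading date only
exact = datepat.fullmatch('11/27/2021abc')    # None: the whole string must match
at_start = datepat.match('Today is 11/27/2021')   # None: not at the start
found = datepat.search('Today is 11/27/2021')     # scans the whole string

month, day, year = found.groups()
print(prefix.group(), exact, at_start, (month, day, year))
```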

The findall() method returns every substring that matches the pattern:

import re
test1 = 'Today is 11/27/2021, tomorrow is 11/28/2021'
datapat = re.compile(r'\d+/\d+/\d+')
datapat.findall(test1)
['11/27/2021', '11/28/2021']

re.findall() can also search case-insensitively with flags=re.IGNORECASE, and the same flag works for re.sub():

text = 'upper PYTHON, lower python, mixed Python'
re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']

re.sub('python', 'PP', text, flags=re.IGNORECASE)
'upper PP, lower PP, mixed PP'
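One limitation of the case-insensitive substitution is that the replacement does not follow the case of the matched text. A possible workaround, sketched here with a helper function named matchcase (an illustrative name), is to pass a callback to re.sub():

```python
import re

def matchcase(word):
    """Return a callback that adapts `word` to the case of each match."""
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        if text.islower():
            return word.lower()
        if text[0].isupper():
            return word.capitalize()
        return word
    return replace

text = 'upper PYTHON, lower python, mixed Python'
result = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(result)
```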

1.3 Replacing text in strings

For simple cases, str.replace() is enough:

test1 = 'Today is 11/27/2021, tomorrow is 11/28/2021'
test1.replace('is', 'IS')
'Today IS 11/27/2021, tomorrow IS 11/28/2021'

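For more complex cases, use re.sub(). The first argument is the pattern to match and the second is the replacement. In the replacement, a backslash followed by a digit, such as \3, refers to the corresponding capture group, so the groups can be reordered: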

test1 = 'Today is 11/27/2021, tomorrow is 11/28/2021'
import re
re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', test1)
'Today is 2021-11-27, tomorrow is 2021-11-28'
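As a variation on the substitution above, the groups can be given names and referenced with \g<name>, and subn() additionally reports how many replacements were made. This is only a sketch; the group names month, day, and year are chosen here for illustration:

```python
import re

text = 'Today is 11/27/2021, tomorrow is 11/28/2021'
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

# subn() returns the new string together with the number of substitutions.
new_text, count = datepat.subn(r'\g<year>-\g<month>-\g<day>', text)
print(new_text)
print(count)
```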

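1.4 Stripping unwanted characters from strings

strip() removes characters from both ends of a string, while lstrip() and rstrip() work on the left or right end only. By default they remove whitespace (including newlines), but other characters can be specified: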

s = '  hello world  \n'
s.strip()
'hello world'
s.lstrip()
'hello world  \n'
s.rstrip()
'  hello world'

t = '----hello===='
t.strip('-=')
'hello'
t.lstrip('-')
'hello===='

The strip methods never touch characters in the middle of a string. To remove inner spaces, use replace():

s = 'hello  world'
s.replace(' ', '')
'helloworld'
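If the goal is to collapse runs of inner whitespace rather than delete them, a regular expression substitution is one option. The sketch below also strips a few made-up input lines in one pass:

```python
import re

s = '  hello     world  \n'

# Strip the ends, then squeeze every run of whitespace into one space.
collapsed = re.sub(r'\s+', ' ', s.strip())

# Hypothetical input lines, cleaned with a list comprehension.
lines = ['  alpha \n', '\tbeta\n', 'gamma   \n']
cleaned = [line.strip() for line in lines]

print(repr(collapsed))
print(cleaned)
```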

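1.5 Joining and combining strings

Simple ways to combine strings are join(), the + operator, and format(); print() can also join values through its sep argument: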

parts = ['I', 'love', 'you']
','.join(parts)
'I,love,you'

a, b, c = ['I', 'love', 'you']
a + b + c
'Iloveyou'

print('{} {} {}'.format(a,b,c))
I love you

print(a, b, c, sep=';')
I;love;you

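You can also build a string with a for loop and +=, but this is clumsy and can be slow for many pieces, since each += may build a new string. Passing a generator expression to join() is the cleaner choice: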

parts = ['I', 'love', 'you']
s = ''
for i in parts:
    s += i
print(s)
Iloveyou

' '.join(str(d) for d in parts)
'I love you'
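The str() call in the generator expression matters once the items are not all strings, because join() only accepts strings. A minimal sketch with made-up data:

```python
# Mixed types: join() would raise TypeError without the str() conversion.
data = ['ACME', 50, 91.1]

row = ','.join(str(d) for d in data)
print(row)
```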

1.6 Interpolating variables in strings

A string can contain embedded variable names that are replaced by the string form of the corresponding values:

s = '{name} has {n} messages'
s.format(name='Jhon', n=3)
'Jhon has 3 messages'

Alternatively, if the values to substitute already exist as variables, combine format_map() with vars():

s = '{name} has {n} messages'
name = 'Jhon'
n = 3
s.format_map(vars())
'Jhon has 3 messages'
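On Python 3.6 and later, an f-string is another way to pull values from variables in scope. This is not part of the recipe above, just a sketch of the closest built-in alternative:

```python
name = 'Jhon'
n = 3

# The expressions inside {} are evaluated in the current scope.
message = f'{name} has {n} messages'
next_count = f'{name} will have {n + 1} messages'
print(message)
print(next_count)
```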

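vars() has another subtle feature: given an instance, it returns that instance's attribute dictionary, so the values can come from an object: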

class Info:
    def __init__(self, name, n):
        self.name = name
        self.n = n

a = Info('Jhon', 3)
s = '{name} has {n} messages'
s.format_map(vars(a))
'Jhon has 3 messages'

A drawback of format() and format_map() is that they cannot cope with a missing value:

s = '{name} has {n} messages'
s.format(name='Jhon')
KeyError: 'n'

To avoid this, define a dict subclass with a __missing__() method that returns the placeholder unchanged:

class safesub(dict):
    def __missing__(self, key):
        return '{' + key + '}'

# del n
s = '{name} has {n} messages'
s.format_map(safesub(name='Jhon'))
'Jhon has {n} messages'

String interpolation can also be done with string.Template from the string module:

name = 'Jhon'
n = 3
import string
s = string.Template('$name has $n messages.')
s.substitute(vars())
'Jhon has 3 messages.'
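string.Template also has a built-in answer to the missing-value problem: substitute() raises KeyError, while safe_substitute() leaves unknown placeholders in place. A short sketch:

```python
import string

tmpl = string.Template('$name has $n messages.')

# safe_substitute() keeps placeholders it cannot fill.
partial = tmpl.safe_substitute(name='Jhon')

# substitute() is strict and raises KeyError for the missing $n.
strict_failed = False
try:
    tmpl.substitute(name='Jhon')
except KeyError:
    strict_failed = True

print(partial)
print(strict_failed)
```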

II. Text

2.1 Sanitizing and cleaning up text

The str.translate() method is often the right tool for cleaning up messy strings. Here s is such a string:

s = 'pyth᷃on\fis\tawesome\r\n'
print(s)
pyth᷃onis	awesome

A small translation table can clean up the whitespace:

remap = {
    ord('\t') : ' ',
    ord('\f') : ' ',
    ord('\r') : None
}

a = s.translate(remap)
print(a)
pyth᷃on is awesome

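The whitespace is now normalized, but the combining character in h᷃ is still there. Building on the result above, the next step removes all Unicode combining characters.

A combining sequence like this one can be produced with, for example: print('h' + chr(0x1dc3)) -> h᷃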

import unicodedata
import sys
cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))

b = unicodedata.normalize('NFD', a)
print(b)
pyth᷃on is awesome

b.translate(cmb_chrs)
'python is awesome\n'

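dict.fromkeys() first builds a dictionary that maps every Unicode combining character to None. unicodedata.normalize() then converts the input to its decomposed form (NFD), and translate() deletes the combining marks.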
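For occasional use, building the full table may be more than is needed. A lighter sketch of the same idea filters the decomposed text directly with unicodedata.combining(); the function name strip_accents is made up for illustration:

```python
import unicodedata

def strip_accents(text):
    """Decompose the text (NFD) and drop every combining character."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

plain1 = strip_accents('Spicy Jalape\u00f1o')
plain2 = strip_accents('pyth\u1dc3on is awesome')
print(plain1)
print(plain2)
```

The translate() approach with a prebuilt table is usually preferable when many strings have to be processed, since the table is built only once.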

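A translation table can likewise map every Unicode decimal digit character to its ASCII equivalent (the size of the table depends on the Unicode database shipped with your Python version):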

digitmap = { c: ord('0') + unicodedata.digit(chr(c)) 
                    for c in range(sys.maxunicode) 
                    if unicodedata.category(chr(c)) == 'Nd' }

print(len(digitmap))
630
x = '\u0661\u0662\u0663'
x.translate(digitmap)
'123'

2.2 Aligning text

The string methods ljust(), rjust(), and center() align text and accept an optional fill character:

text = 'Hello world!'
text.ljust(20)
'Hello world!        '

text.rjust(20)
'        Hello world!'

text.center(20, '=')
'====Hello world!===='

The built-in format() function can do the same job; note the <, >, and ^ alignment characters:

text = 'Hello world!'
format(text, '<20')
'Hello world!        '

format(text, '>20')
'        Hello world!'

format(text, '*^20')    # put * before ^ to use it as the fill character
'****Hello world!****'

An advantage of format() is that it aligns not only strings but also numbers:

x = 1.255
format(x, '^20')
'       1.255        '
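The same alignment codes work inside str.format() fields, which is handy for lining up several values per row. A minimal sketch with made-up data, combining a left-aligned name column with a right-aligned price column rounded to two decimals:

```python
# Hypothetical rows of (name, price).
rows = [('apple', 1.5), ('kiwi', 12.25)]

# {:<8}  -> left-align in 8 columns
# {:>8.2f} -> right-align in 8 columns with 2 decimal places
lines = ['{:<8}{:>8.2f}'.format(name, price) for name, price in rows]
for line in lines:
    print(line)
```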

2.3 Reformatting text to a fixed number of columns

The textwrap module handles this:

s = "I'd found my best love .But i didn't treasure her.I felt regretful after that,It's the ultimate pain in the world!Just cut my throat ,Pleasee don't hesitate!If God can give me a chance.I'll tell her three words:'i love you'!If God wanna give me a time limit.I'll say this love will last 10 thousand years!"

import textwrap
print(textwrap.fill(s, 40, initial_indent=' '))

 I'd found my best love .But i didn't
treasure her.I felt regretful after
that,It's the ultimate pain in the
world!Just cut my throat ,Pleasee don't
hesitate!If God can give me a
chance.I'll tell her three words:'i
love you'!If God wanna give me a time
limit.I'll say this love will last 10
thousand years!

To make the output fit the terminal nicely, get the terminal size with os.get_terminal_size():

import os 
os.get_terminal_size().columns
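os.get_terminal_size() raises OSError when the script is not attached to a terminal (for example when output is piped). A sketch that avoids this uses shutil.get_terminal_size(), which accepts a fallback size; the sample text and the 20 to 40 column limits are arbitrary choices for illustration:

```python
import shutil
import textwrap

text = ('Look into my eyes, look into my eyes, the eyes, the eyes, '
        'not around the eyes.')

# Fall back to 80x24 when no terminal is available, then clamp the width.
columns = shutil.get_terminal_size(fallback=(80, 24)).columns
width = max(20, min(columns, 40))

wrapped = textwrap.fill(text, width,
                        initial_indent='  ',      # indent for the first line
                        subsequent_indent='    ') # indent for the other lines
print(wrapped)
```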

2.4 Tokenizing text

The tokenizing technique here is a bit more advanced. Suppose we have the following string:

s = 'foo = 23 + 42 * 10'

and want to turn it into a sequence of (type, value) pairs like this:

tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '23'), ('PLUS', '+'), ('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]

This can be done with named groups in a regular expression:

import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>\=)'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))
scanner = master_pat.scanner('foo = 1')
m = scanner.match()
print(m.lastgroup, m.group())
NAME foo

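In the patterns, the (?P<TOKENNAME>...) syntax assigns a name to a group; after a match, lastgroup tells which named group matched.

The same scanner can be wrapped in a generator function: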

from collections import namedtuple
Token = namedtuple('Token', ['type','value'])

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())
        
for tok in generate_tokens(master_pat, 'foo = 42'):
    print(tok)
Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='42')
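To arrive at the target list from the start of this section, the whitespace tokens still have to be filtered out. The sketch below repeats the definitions so it runs on its own; note that scanner() is an undocumented method of compiled patterns in CPython, so treat this as an illustration rather than a guaranteed API:

```python
import re
from collections import namedtuple

NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    # scanner.match() returns None at the first position nothing matches.
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

# Drop whitespace tokens and keep plain (type, value) tuples.
tokens = [(tok.type, tok.value)
          for tok in generate_tokens(master_pat, 'foo = 23 + 42 * 10')
          if tok.type != 'WS']
print(tokens)
```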

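2.5 Performing text operations on byte strings

A string (str) is a sequence of characters. It is an abstract concept and cannot be stored on disk directly.
A byte string (bytes) is a sequence of bytes, which can be stored on disk directly. The mapping between the two is called encoding/decoding.
For details, see the blog post "Byte strings explained" (字节串解释), which is cited here.

bytes and bytearray support most of the same built-in operations as text strings:

# byte string (bytes)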
data = b'Hello World'
print(type(data))
<class 'bytes'>

print(data[0:5])
b'Hello'
print(data.startswith(b'Hello'))
True
print(data.split())
[b'Hello', b'World']
print(data.replace(b'Hello', b'Hi'))
b'Hi World'

# byte array (bytearray)
data = bytearray(b'Hello World')
print(type(data))
<class 'bytearray'>

print(data[0:5])
bytearray(b'Hello')
print(data.startswith(b'Hello'))
True
print(data.split())
[bytearray(b'Hello'), bytearray(b'World')]
print(data.replace(b'Hello', b'Hi'))
bytearray(b'Hi World')

When applying a regular expression to a byte string, the pattern itself must also be given as bytes:

data = b'FRUIT:APPLE,BANANA'
import re
re.split('[:,]', data)
TypeError: cannot use a string pattern on a bytes-like object

re.split(b'[:,]', data)
[b'FRUIT', b'APPLE', b'BANANA']

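Byte strings also differ from text strings in some notable ways. Indexing returns an integer rather than a one-character string, and there is no format() method: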

data = b'FRUIT:APPLE,BANANA'
print(data[0])
70

b'{} {}'.format(b'FRUIT', 10)
AttributeError: 'bytes' object has no attribute 'format'

To apply any kind of formatting to a byte string, format a normal text string first and then encode it:

'{:10s} {:3d}'.format('FRUIT', 10).encode('ascii')
b'FRUIT       10'
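Since Python 3.5, bytes objects also support %-style formatting, which can produce the same result without the detour through str. A sketch for comparison; going the other way, decode() turns the bytes back into text for further processing:

```python
# %-formatting works directly on bytes (Python 3.5+).
record = b'%-10s %3d' % (b'FRUIT', 10)

# The str-then-encode approach gives the same bytes.
via_str = '{:10s} {:3d}'.format('FRUIT', 10).encode('ascii')

# Decode to get a normal text string back.
fields = record.decode('ascii').split()

print(record)
print(fields)
```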

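For everyday tasks, prefer normal text strings over byte strings.

Summary

Notes for reviewing and filling in gaps~

Reference: Python Cookbook (Chinese edition), by David Beazley and Brian K. Jones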