A Roundup of Common Python Text-Processing Functions

皮卡丘吃桃子


Article link: https://blog.csdn.net/qq_34124009/article/details/108593442

string: text-processing tools and the str class
textwrap: formatting text paragraphs
- dedent and fill
- indent()
re: regular expressions
difflib: comparing sequences

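string: text-processing tools and the str class

The capwords() function

The capwords() function capitalizes the first letter of every word in a string. In the example below, the second print() call outputs: The Quick Brown Fox Jumped Over The Lazy Dog.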

import string
s = 'The quick brown fox jumped over the lazy dog.'
print(s)
print(string.capwords(s))
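
As a rough illustration of what capwords() does, the sketch below compares it with a manual split, capitalize, and join. The input string is made up for this example; note that runs of whitespace collapse to single spaces and the rest of each word is lowercased.

```python
import string

# Made-up input with irregular whitespace and mixed case
s = 'the   quick\tbrown FOX'

# capwords() splits on whitespace, capitalizes each word,
# and joins the words back together with single spaces
result = string.capwords(s)

# The same steps written out by hand
manual = ' '.join(word.capitalize() for word in s.split())

print(result)
print(result == manual)
```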

Templates: text substitution with string.Template

import string
values = {'var': 'foo'}
t = string.Template("""
Variable        : $var
Escape          : $$
Variable in text: ${var}iable
""")

print('TEMPLATE:', t.substitute(values))

s = """
Variable        : %(var)s
Escape          : %%
Variable in text: %(var)siable
"""

print('INTERPOLATION:', s % values)

s = """
Variable        : {var}
Escape          : {{}}
Variable in text: {var}iable
"""

print('FORMAT:', s.format(**values))

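Output:
TEMPLATE:
Variable        : foo
Escape          : $
Variable in text: fooiable

INTERPOLATION:
Variable        : foo
Escape          : %
Variable in text: fooiable

FORMAT:
Variable        : foo
Escape          : {}
Variable in text: fooiable

A key difference between templates and string interpolation or formatting is that templates ignore the type of the arguments: each value is converted to a string, and that string is inserted into the result.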
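
To make the point about types concrete, here is a minimal sketch using a made-up price value. Template inserts str(value) as-is, while format() and the % operator accept a format specification.

```python
import string

# Hypothetical value chosen for illustration
values = {'price': 3.14159}

# Template has no format options: the float is converted with str()
template_result = string.Template('Price: $price').substitute(values)

# format() and % interpolation can control the number of digits
format_result = 'Price: {price:.2f}'.format(**values)
percent_result = 'Price: %(price).2f' % values

print(template_result)
print(format_result)
print(percent_result)
```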

The safe_substitute() method

Using safe_substitute() avoids the exception that may be raised when the template is not given all of the values it needs, as shown below:

import string
values = {'var': 'foo'}
t = string.Template("$var is here but $missing is not provided")
try:
    print('substitute()     :', t.substitute(values))
except KeyError as err:
    print('ERROR:', str(err))
print('safe_substitute():', t.safe_substitute(values))

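The values dictionary has no entry for missing, so substitute() raises a KeyError. safe_substitute() catches the error internally and leaves the variable expression in the text unchanged. Output:
ERROR: 'missing'
safe_substitute(): foo is here but $missing is not provided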
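
One possible use of this behavior is filling a template in two passes. The sketch below is only an illustration: the value 'bar' and the variable names partial and final are made up.

```python
import string

t = string.Template('$var is here but $missing is not provided')

# First pass: only 'var' is known, so $missing is left in place
partial = t.safe_substitute({'var': 'foo'})

# Second pass: wrap the partial result in a new Template
# and supply the remaining value
final = string.Template(partial).substitute(missing='bar')

print(partial)
print(final)
```

This only works cleanly when the values inserted in the first pass do not themselves contain the $ delimiter.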

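Advanced templates

The default syntax of string.Template can be changed by adjusting the regular expression patterns it uses to find variable names in the template body, that is, by overriding the delimiter and idpattern class attributes. The example below changes the substitution rules so that the delimiter is % instead of $, and variable names must contain an underscore somewhere in the middle: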

import string
class MyTemplate(string.Template):
    delimiter = '%'
    idpattern = '[a-z]+_[a-z]+'
template_text = '''
  Delimiter : %%
  Replaced  : %with_underscore
  Ignored   : %notunderscored
'''
d = {
    'with_underscore': 'replaced',
    'notunderscored': 'not replaced',
}
t = MyTemplate(template_text)
print('Modified ID pattern:')
print(t.safe_substitute(d))

Output:
Modified ID pattern:

  Delimiter : %
  Replaced  : replaced
  Ignored   : %notunderscored
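
A smaller variation, overriding only delimiter, may be useful when the text itself contains literal dollar signs. This is a sketch; the class name HashTemplate and the sample text are made up.

```python
import string

class HashTemplate(string.Template):
    # Use '#' instead of the default '$' to mark variables
    delimiter = '#'

t = HashTemplate('Price: $5, item ##1, owner #name')

# '$' is now ordinary text, '##' is the escape for a literal '#',
# and '#name' is replaced
result = t.substitute(name='Ann')
print(result)
```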

textwrap: formatting text paragraphs

dedent and fill

The dedented text can be passed to fill() with a few different width values.

import textwrap
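# textwrap_example is a local module that is not shown in this article;
# it is assumed to define a multi-line string named sample_text
from textwrap_example import sample_text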

dedented_text = textwrap.dedent(sample_text).strip()
for width in [45, 60]:
    print('{} Columns:\n'.format(width))
    print(textwrap.fill(dedented_text, width=width))
    print()

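
Because the textwrap_example module is not included in this article, here is a self-contained sketch of the same idea. The sample_text below is a made-up stand-in, not the original module's content.

```python
import textwrap

# Hypothetical stand-in for sample_text: an indented multi-line string
sample_text = '''
    The textwrap module can be used to format text for output in
    situations where pretty printing is desired.  It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors.
    '''

# dedent() removes the common leading whitespace from every line
dedented_text = textwrap.dedent(sample_text).strip()

# fill() rewraps the paragraph so that no line exceeds the given width
wrapped = textwrap.fill(dedented_text, width=45)
print(wrapped)
```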

indent()

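The indent() function adds consistent prefix text to the lines of a string (by default, lines consisting only of whitespace are skipped). The example below uses > as the prefix for each line.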

import textwrap
from textwrap_example import sample_text
dedented_text = textwrap.dedent(sample_text)
wrapped = textwrap.fill(dedented_text, width=50)
wrapped += '\n\nSecond paragraph after a blank line.'
final = textwrap.indent(wrapped, '> ')
print('Quoted block:\n')
print(final)

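The block of text is split on newlines, the prefix is added to each line that contains text, and the lines are then joined back into a new string, which is returned.

A callable can also be passed to indent() as the predicate argument. It is called with each line of text, and the prefix is added only to the lines for which it returns a true value.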

import textwrap
from textwrap_example import sample_text
def should_indent(line):
    print('Indent {!r}?'.format(line))
    return len(line.strip()) % 2 == 0
dedented_text = textwrap.dedent(sample_text)
wrapped = textwrap.fill(dedented_text, width=50)
final = textwrap.indent(wrapped, 'EVEN ',
                        predicate=should_indent)
print('\nQuoted block:\n')
print(final)

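As the code shows, len(line.strip()) % 2 == 0 means that the prefix EVEN is added to lines that contain an even number of characters after leading and trailing whitespace is stripped.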
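
The effect of the predicate is easiest to see on a blank line. This sketch uses a short made-up string and compares the default behavior with a predicate that accepts every line.

```python
import textwrap

# Made-up text with an empty line between two paragraphs
text = 'first line\n\nsecond line\n'

# Default: lines that contain only whitespace get no prefix
default_quoted = textwrap.indent(text, '> ')

# A predicate that always returns True prefixes every line,
# including the empty one
all_quoted = textwrap.indent(text, '> ', predicate=lambda line: True)

print(repr(default_quoted))
print(repr(all_quoted))
```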

re: regular expressions

Finding patterns in text with search()

import re
pattern = 'this'
text = 'Does this text match the pattern?'
match = re.search(pattern, text)
s = match.start()
e = match.end()
print('Found "{}"\nin "{}"\nfrom {} to {} ("{}")'.format(
    match.re.pattern, match.string, s, e, text[s:e]))

The start() and end() methods return the indexes into the string where the text matched by the pattern occurs. Output:
Found "this"
in "Does this text match the pattern?"
from 5 to 9 ("this")
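
The example above assumes the pattern is found. search() returns None when there is no match, so calling start() on the result would fail. A minimal guarded sketch follows; the helper name describe_match is made up for illustration.

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or 'no match'."""
    match = re.search(pattern, text)
    if match is None:
        # search() found nothing, so there is no Match object to query
        return 'no match'
    return '{!r} found at {}:{}'.format(
        match.group(0), match.start(), match.end())

text = 'Does this text match the pattern?'
found = describe_match('this', text)
missing = describe_match('that', text)
print(found)
print(missing)
```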

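Compiling expressions with compile()

The compile() function converts an expression string into a compiled pattern object (re.Pattern in current Python 3; older documentation calls it a RegexObject).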

import re
regexes = [
    re.compile(p)
    for p in ['this', 'that']
]
text = 'Does this text match the pattern?'
print('Text: {!r}\n'.format(text))
for regex in regexes:
    print('Seeking "{}" ->'.format(regex.pattern),
          end=' ')
    if regex.search(text):
        print('match!')
    else:
        print('no match')
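
Compiling is mostly worthwhile when the same expression is applied many times. The sketch below reuses one compiled pattern across several made-up strings.

```python
import re

# Compile once, then reuse the pattern object for every input
regex = re.compile('this')

# Made-up inputs for illustration
texts = [
    'Does this text match?',
    'Not here.',
    'Neither does that one.',
]

# Keep only the strings in which the pattern is found
hits = [t for t in texts if regex.search(t)]
print(hits)
```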

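Multiple matches

The findall() function returns all of the non-overlapping substrings of the input that match the pattern. For the example below, the output is:
Found 'ab'
Found 'ab'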

import re
text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.findall(pattern, text):
    print('Found {!r}'.format(match))

finditer() returns an iterator that produces Match objects instead of the strings returned by findall().

import re
text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found {!r} at {:d}:{:d}'.format(
        text[s:e], s, e))

This also finds the two occurrences of ab, and the Match objects additionally show where they appear in the original text:
Found 'ab' at 0:2
Found 'ab' at 5:7
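
To illustrate what "non-overlapping" means, this sketch searches a made-up string in which the pattern could start at three positions, yet only two matches are reported because each match consumes its characters.

```python
import re

text = 'aaaa'
pattern = 'aa'

# 'aa' could begin at index 0, 1, or 2, but matches may not overlap
found = re.findall(pattern, text)
spans = [m.span() for m in re.finditer(pattern, text)]

print(found)
print(spans)
```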

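difflib: comparing sequences

To compare two bodies of text, break each one into a sequence of individual lines and pass the two sequences to the compare() method of a Differ object.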

import difflib
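# difflib_data is a local module that is not shown in this article;
# it is assumed to define text1_lines and text2_lines as lists of text lines
from difflib_data import *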
d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print('\n'.join(diff))

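An excerpt of the output:

- pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
+ pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
?         +

The prefixes have the following meanings:
Lines prefixed with - are in the first sequence.
Lines prefixed with + are in the second sequence.
If a line has an incremental difference between versions, an extra line prefixed with ? highlights the change.
Additional parameters can be used to indicate which lines, and which characters within a line, should be ignored.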
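
Since the difflib_data module is not included in this article, here is a self-contained sketch with two short made-up lists. Lines present in both sequences are shown with a two-space prefix.

```python
import difflib

# Made-up stand-ins for text1_lines and text2_lines
text1_lines = ['apple', 'banana', 'cherry']
text2_lines = ['apple', 'cherry', 'date']

d = difflib.Differ()

# compare() yields one output line per input line, each with a prefix
diff = list(d.compare(text1_lines, text2_lines))
print('\n'.join(diff))
```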
