Python: Splitting Strings on Multiple Delimiters

OceanStar's Study Notes

Last modified: 2022-10-23 12:29:56

First published: 2022-01-02 17:34:42

Original link: https://time.geekbang.org/

Splitting a String on Multiple Delimiters

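The split() method of string objects only handles very simple cases of splitting. When you need more flexibility, use re.split() instead. In the example below, the separator is a comma, a semicolon, or whitespace, followed by any amount of extra whitespace: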

>>> line = 'asdf fjdk; afed, fjek,asdf, foo'
>>> import re
>>> re.split(r'[;,\s]\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

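When using re.split(), pay attention to whether the pattern contains a capture group enclosed in parentheses. If it does, the matched delimiter text is also included in the result list. For example, look at the output of this code: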

>>> fields = re.split(r'(;|,|\s)\s*', line)
>>> fields
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
>>>
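If the delimiters are not wanted in the result but parentheses are still needed to group alternatives, a non-capturing group, written (?:...), can be used. The sketch below also shows one possible way to put a captured split back together; the names plain, values, delimiters, and rebuilt are chosen here for illustration only.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# Non-capturing group: alternatives are grouped, delimiters are dropped.
plain = re.split(r'(?:,|;|\s)\s*', line)
print(plain)    # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

# Capturing group: the delimiters are kept at the odd indexes.
fields = re.split(r'(;|,|\s)\s*', line)
values = fields[::2]                  # the split-out words
delimiters = fields[1::2] + ['']      # pad so both lists have equal length

# Rebuild a string with the same delimiters (extra whitespace is lost).
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt)  # asdf fjdk;afed,fjek,asdf,foo
```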

Removing Unwanted Characters from a String

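The strip() method removes characters from the beginning and end of a string. lstrip() and rstrip() strip from the left side or the right side only. By default these methods remove whitespace, but you can specify other characters. For example: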

>>> # Whitespace stripping
>>> s = ' hello world \n'
>>> s.strip()
'hello world'
>>> s.lstrip()
'hello world \n'
>>> s.rstrip()
' hello world'
>>>
>>> # Character stripping
>>> t = '-----hello====='
>>> t.lstrip('-')
'hello====='
>>> t.strip('-=')
'hello'
>>>

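These methods do not affect text in the middle of the string. To deal with inner whitespace, you need another technique, such as the replace() method or a regular expression substitution. In the example below, s is stripped first so that the trailing newline is gone:

>>> s = s.strip()
>>> s.replace(' ', '')
'helloworld'
>>> import re
>>> re.sub(r'\s+', ' ', s)
'hello world'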
>>>
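Stripping is often combined with reading lines of input. The following is a minimal sketch of that idea using a generator expression; io.StringIO stands in for an open file, and its contents are made up for illustration.

```python
import io

# io.StringIO stands in for an open file; the contents are made up.
f = io.StringIO('  first line \n\tsecond line\n third   line  \n')

# Generator expression: each line is stripped lazily as it is read,
# so no intermediate list of raw lines is built.
lines = (line.strip() for line in f)

cleaned = list(lines)
print(cleaned)  # ['first line', 'second line', 'third   line']
```

Note that the inner run of spaces in the last line survives, since strip() only touches the two ends of the string.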

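Matching Text at the Start or End of a String

Method 1: startswith() and endswith() (recommended)

A simple way to check the beginning or end of a string is to use the str.startswith() or str.endswith() method. For example: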

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.python.org'
>>> url.startswith('http:')
True
>>>

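To check against multiple choices, put them all in a tuple and pass it to startswith() or endswith():

>>> # Sample file names, assumed here for illustration
>>> filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']
>>> [name for name in filenames if name.endswith(('.c', '.h'))]
['foo.c', 'spam.c', 'spam.h']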
>>> any(name.endswith('.py') for name in filenames)
True
>>>

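# Check whether a directory contains files of the given types
# (dirname is assumed to hold the path of the directory to check)
from os import listdir
if any(name.endswith(('.c', '.h')) for name in listdir(dirname)):
    ...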
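One detail worth knowing about the tuple form: startswith() and endswith() accept a single string or a tuple of strings, but not a list. The sketch below shows the error and the conversion; the choices list is a made-up example.

```python
choices = ['http:', 'ftp:']           # a list, not a tuple
url = 'http://www.python.org'

try:
    url.startswith(choices)           # lists are rejected
except TypeError as exc:
    print('TypeError:', exc)

# Converting to a tuple first makes the call valid.
matched = url.startswith(tuple(choices))
print(matched)  # True
```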

Method 2: slicing (not recommended)

>>> filename = 'spam.txt'
>>> filename[-4:] == '.txt'
True
>>> url = 'http://www.python.org'
>>> url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'
True
>>>

Method 3: regular expressions (not recommended)

>>> import re
>>> url = 'http://www.python.org'
>>> re.match('http:|https:|ftp:', url)
<re.Match object; span=(0, 5), match='http:'>
>>>

Aligning Text Strings

For basic alignment of strings, use the ljust(), rjust(), and center() methods of strings. For example:

>>> text = 'Hello World'
>>> text.ljust(20)
'Hello World         '
>>> text.rjust(20)
'         Hello World'
>>> text.center(20)
'    Hello World     '
>>>

>>> text.rjust(20,'=')
'=========Hello World'
>>> text.center(20,'*')
'****Hello World*****'
>>>

The format() function can also be used to align values easily. All you need to do is use the <, >, or ^ character followed by the desired width. For example:

>>> x = 1.2345
>>> format(x, '>10')
'    1.2345'
>>> format(x, '^10.2f')
'   1.23   '
>>>

>>> format(text, '>20')
'         Hello World'
>>> format(text, '<20')
'Hello World         '
>>> format(text, '^20')
'    Hello World     '
>>>


# To use a fill character other than a space, put it before the alignment character:
>>> format(text, '=>20s')
'=========Hello World'
>>> format(text, '*^20s')
'****Hello World*****'
>>>

# Formatting multiple values at once
>>> '{:>10s} {:>10s}'.format('Hello', 'World')
'     Hello      World'
>>>

In older code, you will often see the % operator used to format text.

>>> '%-20s' % text
'Hello World         '
>>> '%20s' % text
'         Hello World'
>>>

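In new code, however, you should prefer the format() function or method. format() is more powerful than the % operator. It is also more general than the ljust(), rjust(), and center() methods, because it works with any object, not just strings.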
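Because format() works on numbers as well as strings, it can line up mixed columns. The following is a small sketch of that idea; the rows and the column widths are invented for illustration.

```python
# Hypothetical rows: (item name, quantity, unit price).
rows = [('apple', 3, 0.5), ('watermelon', 12, 3.25), ('fig', 150, 0.75)]

lines = []
for name, qty, price in rows:
    # '<12' left-aligns the text, '>5' right-aligns the integer,
    # '>8.2f' right-aligns the float with two decimal places.
    lines.append(format(name, '<12') + format(qty, '>5') + format(price, '>8.2f'))

print('\n'.join(lines))
```

Every output line is 25 characters wide (12 + 5 + 8), so the columns stay aligned regardless of the values.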

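Reformatting Text to a Fixed Number of Columns

Use the textwrap module to reformat text for output. The textwrap module is very useful for printing text, especially when you want the output to fit the terminal size. You can get the terminal size with os.get_terminal_size(). For example: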

>>> import os
>>> os.get_terminal_size().columns
80
>>>

Here are some usage examples:


s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."


>>> import textwrap
>>> print(textwrap.fill(s, 70))
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.
>>> print(textwrap.fill(s, 40))
Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, initial_indent='    '))
    Look into my eyes, look into my
eyes, the eyes, the eyes, the eyes, not
around the eyes, don't look around the
eyes, look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, subsequent_indent='    '))
Look into my eyes, look into my eyes,
    the eyes, the eyes, the eyes, not
    around the eyes, don't look around
    the eyes, look into my eyes, you're
    under.

fill() accepts a number of other optional arguments that control how it handles tabs, sentence endings, and so on.
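Putting the two pieces together, text can be wrapped to whatever width the terminal currently has. The sketch below swaps in shutil.get_terminal_size() for os.get_terminal_size(), an assumption made here because the shutil version accepts a fallback size and therefore also works when no terminal is attached.

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Falls back to 80 columns when no terminal is attached
# (for example, when the output is piped to a file).
width = shutil.get_terminal_size(fallback=(80, 24)).columns

wrapped = textwrap.fill(s, width)
print(wrapped)
```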

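Combining and Concatenating Strings

If the strings you want to combine are in a sequence or another iterable, the fastest way is to use the join() method. For example: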

>>> parts = ['Is', 'Chicago', 'Not', 'Chicago?']
>>> ' '.join(parts)
'Is Chicago Not Chicago?'
>>> ','.join(parts)
'Is,Chicago,Not,Chicago?'
>>> ''.join(parts)
'IsChicagoNotChicago?'
>>>

If you are only combining a few strings, the plus sign (+) usually works well enough:

>>> a = 'Is Chicago'
>>> b = 'Not Chicago?'
>>> a + ' ' + b
'Is Chicago Not Chicago?'
>>>

To combine two string literals in source code, simply place them next to each other with no plus sign (+). For example:

>>> a = 'Hello' 'World'
>>> a
'HelloWorld'
>>>

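However, joining a large number of strings with the plus sign (+) is very inefficient, because it causes memory copies and garbage collection. In particular, you should never write string-joining code like this:

s = ''
for p in parts:
    s += p

A related trick is to convert the data to strings and join them in the same step with a generator expression. For example: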

>>> data = ['ACME', 50, 91.1]
>>> ','.join(str(d) for d in data)
'ACME,50,91.1'
>>>

Also watch out for unnecessary concatenation when printing:

print(a + ':' + b + ':' + c) # Ugly
print(':'.join([a, b, c])) # Still ugly
print(a, b, c, sep=':') # Better
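When output is built from many small pieces, one possible approach is to produce the fragments with a generator and let the caller decide how to combine them. This is a sketch only; sample() is a hypothetical generator, and io.StringIO stands in for a real file.

```python
import io

def sample():
    # Hypothetical generator that produces output fragments.
    yield 'Is'
    yield 'Chicago'
    yield 'Not'
    yield 'Chicago?'

# Option 1: join all fragments in a single pass.
text = ' '.join(sample())
print(text)             # Is Chicago Not Chicago?

# Option 2: write the fragments one by one to a file-like object.
out = io.StringIO()     # stands in for a real file
for part in sample():
    out.write(part)
print(out.getvalue())   # IsChicagoNotChicago?
```

The generator does not need to know how the fragments are assembled, so the same code can feed join(), a file, or a network socket.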

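Interpolating Variables in Strings

Before f-strings arrived in Python 3.6, Python had no direct support for simply substituting variable values in strings. The format() method of strings handles this case. For example: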

>>> s = '{name} has {n} messages.'
>>> s.format(name='Guido', n=37)
'Guido has 37 messages.'
>>>

Alternatively, if the values to be substituted are available as variables in the current scope, you can combine format_map() and vars(), like this:

>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())
'Guido has 37 messages.'
>>>

vars() also works with instances. For example:

>>> class Info:
...     def __init__(self, name, n):
...         self.name = name
...         self.n = n
...
>>> a = Info('Guido',37)
>>> s.format_map(vars(a))
'Guido has 37 messages.'
>>>

One drawback of format() and format_map() is that they do not deal gracefully with missing values. For example:

>>> s.format(name='Guido')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'n'
>>>

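One way to avoid this error is to define a separate dictionary class with a __missing__() method, like this:

class safesub(dict):
    """Prevent KeyError when a key is not found."""
    def __missing__(self, key):
        return '{' + key + '}'

Now you can use this class to wrap the inputs passed to format_map():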

>>> del n # Make sure n is undefined
>>> s.format_map(safesub(vars()))
'Guido has {n} messages.'
>>>

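The little-known __missing__() method of mapping and dictionary classes lets you define how missing values are handled. In the safesub class, this method returns a placeholder for the missing value. As a result, the missing value shows up in the resulting string (which can be useful for debugging) instead of raising a KeyError exception.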
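To see when __missing__() is actually invoked, here is a minimal sketch with a throwaway dict subclass. Placeholder is a hypothetical name used only for this demonstration.

```python
class Placeholder(dict):
    """Hypothetical dict subclass used only to demonstrate __missing__()."""
    def __missing__(self, key):
        return '{' + key + '}'

d = Placeholder(name='Guido')

print(d['name'])     # Guido  (key exists, __missing__ is not called)
print(d['n'])        # {n}    (missing key, __missing__ supplies the value)
print(d.get('n'))    # None   (get() does not call __missing__)
print('n' in d)      # False  (the placeholder is never stored)
```

Only subscript lookup, which is what format_map() performs, goes through __missing__().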

If you find yourself performing these steps frequently in your code, you can wrap the variable substitution in a small utility function, like this:

import sys

def sub(text):
	return text.format_map(safesub(sys._getframe(1).f_locals))

>>> name = 'Guido'
>>> n = 37
>>> print(sub('Hello {name}'))
Hello Guido
>>> print(sub('You have {n} messages.'))
You have 37 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}
>>>

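Matching Strings Using Shell Wildcard Patterns

To match text using the wildcard patterns common in Unix shells (such as *.py or Dat[0-9]*.csv), use the fnmatch module, which provides two functions for this purpose: fnmatch() and fnmatchcase().
The matching done by fnmatch() sits somewhere between simple string methods and the full power of regular expressions. If simple wildcards are all you need in a data processing operation, it is often a reasonable choice.
If your code needs to match file names, it is better to use the glob module.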

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
>>> [name for name in names if fnmatch(name, 'Dat*.csv')]
['Dat1.csv', 'Dat2.csv']
>>>

fnmatch() matches patterns using the case-sensitivity rules of the underlying operating system, which vary from system to system. For example:

>>> # On OS X (Mac)
>>> fnmatch('foo.txt', '*.TXT')
False
>>> # On Windows
>>> fnmatch('foo.txt', '*.TXT')
True
>>>

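If this distinction matters to you, use fnmatchcase() instead. It matches exactly according to the lowercase and uppercase characters in your pattern. For example: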

>>> fnmatchcase('foo.txt', '*.TXT')
False
>>>

These two functions can also be used on strings that are not file names:

addresses = [ '5412 N CLARK ST', '1060 W ADDISON ST', '1039 W GRANVILLE AVE', '2122 N CLARK ST', '4802 N BROADWAY', ]

>>> from fnmatch import fnmatchcase
>>> [addr for addr in addresses if fnmatchcase(addr, '* ST')]
['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
>>> [addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
['5412 N CLARK ST']
>>>
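To illustrate how wildcard matching relates to regular expressions, the fnmatch module also offers translate(), which converts a shell-style pattern into a regular expression string. The sketch below is one possible use; the exact text of the generated expression differs between Python versions, so it is compiled rather than printed.

```python
import re
from fnmatch import translate

addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

# translate() turns the wildcard pattern into a regular expression string;
# compiling it once lets the pattern be reused for every address.
regex = re.compile(translate('54[0-9][0-9] *CLARK*'))

matches = [addr for addr in addresses if regex.match(addr)]
print(matches)  # ['5412 N CLARK ST']
```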

Matching and Searching for Text Patterns

To match a literal string, simply call str.find(), str.endswith(), str.startswith(), or a similar method:

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> # Exact match
>>> text == 'yeah'
False
>>> # Match at start or end
>>> text.startswith('yeah')
True
>>> text.endswith('no')
False
>>> # Search for the location of the first occurrence
>>> text.find('no')
10
>>>

For more complicated matching, use regular expressions and the re module.

>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if re.match(r'\d+/\d+/\d+', text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>

If you are going to perform many matches with the same pattern, precompile the pattern string into a pattern object first. For example:

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if datepat.match(text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>

match() always tries to match at the start of a string. To search the text for all occurrences of a pattern, use the findall() method instead. For example:

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
['11/27/2012', '3/13/2013']
>>>
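Since match() only anchors at the start, it also succeeds when extra text follows the pattern. The sketch below contrasts it with fullmatch(), which requires the whole string to match, and search(), which finds the first occurrence anywhere; the sample strings are made up.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')

# match() succeeds because the date sits at the start; the tail is ignored.
m = datepat.match('11/27/2012abcdef')
print(m.group())    # 11/27/2012

# fullmatch() requires the entire string to match the pattern.
print(datepat.fullmatch('11/27/2012abcdef'))    # None
print(datepat.fullmatch('11/27/2012') is not None)    # True

# search() finds the first occurrence anywhere in the string.
print(datepat.search('Today is 11/27/2012.').group())    # 11/27/2012
```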

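When defining regular expressions, it is common to introduce capture groups by enclosing parts of the pattern in parentheses. The contents of each group can then be extracted individually. For example: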

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>>
>>> m = datepat.match('11/27/2012')
>>> m
<re.Match object; span=(0, 10), match='11/27/2012'>
>>> # Extract the contents of each group
>>> m.group(0)
'11/27/2012'
>>> m.group(1)
'11'
>>> m.group(2)
'27'
>>> m.group(3)
'2012'
>>> m.groups()
('11', '27', '2012')
>>> month, day, year = m.groups()
>>>
>>> # Find all matches (notice splitting into tuples)
>>> text
'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
[('11', '27', '2012'), ('3', '13', '2013')]
>>> for month, day, year in datepat.findall(text):
...     print('{}-{}-{}'.format(year, month, day))
...
2012-11-27
2013-3-13
>>>

The findall() method searches the text and returns all matches as a list. To find matches iteratively instead, use the finditer() method. For example:

>>> for m in datepat.finditer(text):
...     print(m.groups())
...
('11', '27', '2012')
('3', '13', '2013')
>>>

Searching and Replacing Text

For simple literal patterns, use the str.replace() method:

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> text.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'
>>>

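For more complicated patterns, use the sub() function in the re module. The first argument to sub() is the pattern to match, and the second is the replacement pattern. Backslashed digits such as \3 refer to capture group numbers in the pattern.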

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>>

If you are going to perform repeated substitutions with the same pattern, consider compiling it first for better performance.

>>> import re
>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> datepat.sub(r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>>

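For more complicated substitutions, you can pass a substitution callback function instead. The callback receives a match object and returns the replacement text. For example:

>>> from calendar import month_abbr
>>> def change_date(m):
...     mon_name = month_abbr[int(m.group(1))]
...     return '{} {} {}'.format(m.group(2), mon_name, m.group(3))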
...
>>> datepat.sub(change_date, text)
'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'
>>>

If you want to know how many substitutions were made in addition to getting the replacement text, use re.subn() instead. For example:

>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2
>>>
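Numbered references such as \3 can become hard to read as patterns grow. As an alternative sketch, the re module also supports named groups, written (?P<name>...), which can be referenced as \g<name> in the replacement. The group names month, day, and year are chosen here for illustration.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Each group gets a name instead of relying on its position.
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

# \g<name> refers to a named group in the replacement pattern.
result = datepat.sub(r'\g<year>-\g<month>-\g<day>', text)
print(result)  # Today is 2012-11-27. PyCon starts 2013-3-13.
```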

Searching and Replacing Case-Insensitive Text

To ignore case in text operations, use the re module and supply the re.IGNORECASE flag to the operations. For example:

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'
>>>

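The last example reveals a limitation: the replacement text does not automatically match the case of the matched text. To fix this, you may need a support function, like this: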

def matchcase(word):
	def replace(m):
		text = m.group()
		if text.isupper():
			return word.upper()
		elif text.islower():
			return word.lower()
		elif text[0].isupper():
			return word.capitalize()
		else:
			return word
	return replace

>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, Mixed Snake'
>>>