Cookbook: 2. Strings and Text

Article link: https://blog.csdn.net/weixin_44901257/article/details/115520403

2. Strings and Text

2.1 Splitting a string on any of multiple delimiters

The re.split() method

line='asasd sda; asdw, sad ,sdaw,      sdppppp'
import re
re.split(r'[;,\s]\s*',line)
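It can help to look at what the call above actually returns, because a delimiter preceded by whitespace (as in 'sad ,sdaw') produces an empty field with this pattern. The sketch below shows that result next to one possible variant that also swallows whitespace before the delimiter. The variant pattern and the name clean_fields are assumptions added for illustration; they are not part of the original recipe.

```python
import re

line = 'asasd sda; asdw, sad ,sdaw,      sdppppp'

# Original pattern: one delimiter character followed by optional whitespace.
fields = re.split(r'[;,\s]\s*', line)
print(fields)
# ['asasd', 'sda', 'asdw', 'sad', '', 'sdaw', 'sdppppp']

# Variant (assumption): also consume whitespace before the delimiter,
# so ' ,' counts as a single separator and no empty field appears.
clean_fields = re.split(r'\s*[;,\s]\s*', line)
print(clean_fields)
# ['asasd', 'sda', 'asdw', 'sad', 'sdaw', 'sdppppp']
```

Whether the empty field should be kept depends on the data; filtering with a list comprehension afterwards is another reasonable choice.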

2.2 Matching text at the start or end of a string

str.startswith() and str.endswith()

filename='spam.txt'
filename.endswith('.txt')

True

Checking against multiple choices

import os
filenames=os.listdir('.')
filenames

['.ipynb_checkpoints',
 '3kettles.ipynb',
 'book_py',
 'CookBook',
 'DL',
 'doors.ipynb',
 'gen1.ipynb',
 'homework2',
 'juewei.jpg',
 'plate.ipynb',
 'sorts.ipynb',
 'spider1.ipynb',
 'Untitled.ipynb',
 '__pycache__']

[name for name in filenames if name.endswith(('.c','.py'))]
#Two pairs of parentheses are needed here: the argument must be a single string or a tuple of strings

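Note that startswith() and endswith() accept only a string or a tuple of strings, so a list of choices must be converted to a tuple first:

choices=['http:','https:']
url='http://www.python.org'	#sample URL, assumed here for illustration
url.startswith(choices)	#raises TypeError: a list is not accepted
url.startswith(tuple(choices))	#works and returns True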

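Combined with other techniques such as any(), these checks become concise and efficient:

if any(name.endswith(('.c','.h')) for name in os.listdir(dirname)):	#dirname is the path of the directory to scan
    ...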
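The checks in this section can be tried without touching the file system. The sketch below uses a hard-coded list of hypothetical file names (names, c_sources, has_c_files and list_accepted are all made up for illustration) in place of os.listdir(dirname), and also shows the TypeError raised when a list is passed instead of a tuple.

```python
# Hypothetical file names standing in for os.listdir(dirname).
names = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# A tuple of suffixes is checked in a single call.
c_sources = [name for name in names if name.endswith(('.c', '.h'))]
print(c_sources)    # ['foo.c', 'spam.c', 'spam.h']

# any() stops at the first name that matches.
has_c_files = any(name.endswith(('.c', '.h')) for name in names)
print(has_c_files)  # True

# Passing a list instead of a tuple raises TypeError.
suffixes = ['.c', '.h']
try:
    'foo.c'.endswith(suffixes)
    list_accepted = True
except TypeError:
    list_accepted = False
print(list_accepted)  # False
```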

2.3 Matching strings with shell wildcard patterns

Matching with the same wildcard patterns used in the UNIX shell

addresses=[
    '3134 N SDS ST',
    '1341 S SDAQS AHS',
    '1322 N ADA ST',
    '1312 S sdq',
]

from fnmatch import fnmatchcase
[addr for addr in addresses if fnmatchcase(addr,'* ST')]	#the pattern uses shell wildcards (*, ?, [seq]), not regular expressions

['3134 N SDS ST', '1322 N ADA ST']
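Shell-style patterns support more than *. As a rough illustration, the sketch below reuses the address list with a character-class pattern and shows that fnmatchcase() compares case exactly. The variable name south_13xx and the specific patterns are assumptions chosen for this example.

```python
from fnmatch import fnmatchcase

addresses = [
    '3134 N SDS ST',
    '1341 S SDAQS AHS',
    '1322 N ADA ST',
    '1312 S sdq',
]

# [0-9] matches a single digit; * matches any run of characters.
south_13xx = [addr for addr in addresses
              if fnmatchcase(addr, '13[0-9][0-9] S *')]
print(south_13xx)  # ['1341 S SDAQS AHS', '1312 S sdq']

# fnmatchcase() is always case-sensitive, regardless of platform.
print(fnmatchcase('foo.txt', '*.TXT'))  # False
print(fnmatchcase('foo.txt', '*.txt'))  # True
```

The related fnmatch() function follows the case-sensitivity rules of the underlying operating system, which is why fnmatchcase() is the safer choice for data that is not a file name.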

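2.4 Matching and searching for text patterns

str.find()

For simple literal text, str.find() and similar methods such as str.startswith() and str.endswith() are enough.

For more complex patterns, use regular expressions: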

import re

text1='11/27/2012'
text2='Nov 27,2012'

if re.match(r'\d+/\d+/\d+',text1):
    print('yes')
else:
    print('no')

yes
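One detail worth keeping in mind with re.match(): it anchors only the start of the string, so trailing text is ignored. The sketch below contrasts it with re.fullmatch(), which requires the whole string to match; the sample string with trailing letters is an assumption added for illustration.

```python
import re

pattern = r'\d+/\d+/\d+'

# match() only requires the pattern to match at the beginning.
m = re.match(pattern, '11/27/2012abcdef')
print(m.group())  # 11/27/2012

# fullmatch() requires the entire string to match the pattern.
partial = re.fullmatch(pattern, '11/27/2012abcdef')
exact = re.fullmatch(pattern, '11/27/2012')
print(partial)            # None
print(exact is not None)  # True
```

Ending the pattern with $ is another common way to get an exact match when using match().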

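Precompiling the pattern

If the same pattern will be matched many times, precompile the regular expression into a pattern object first:

datepat=re.compile(r'\d+/\d+/\d+')
if datepat.match(text1):
    print('yes')
else:
    print('no')

yes

The findall() method

match() always tries to find a match at the start of the string. To search the text for all occurrences of the pattern, use findall():

text='Today is 11/27/2012,Pycon starts 3/13/2013.'
datepat.findall(text)

['11/27/2012', '3/13/2013']

When defining a regular expression, it is common to introduce capture groups by enclosing parts of the pattern in parentheses: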

datepat=re.compile(r'(\d+)/(\d+)/(\d+)')

Capture groups often simplify later processing of the text, because the contents of each group can be extracted individually.

m=datepat.match('11/27/2012')

print(m.group(0))	#'11/27/2012'
print(m.group(1))	#'11'
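Capture groups also change what findall() returns, and finditer() offers an iterative alternative. The following is a minimal sketch of both, using the same date pattern; the names iso_dates and found are introduced here only for illustration.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# groups() returns every captured group at once, as strings.
m = datepat.match('11/27/2012')
month, day, year = m.groups()
print(month, day, year)  # 11 27 2012

text = 'Today is 11/27/2012,Pycon starts 3/13/2013.'

# With capture groups, findall() returns one tuple per match.
all_dates = datepat.findall(text)
print(all_dates)  # [('11', '27', '2012'), ('3', '13', '2013')]

# finditer() yields match objects one at a time instead of a list.
iso_dates = ['{}-{}-{}'.format(found.group(3), found.group(1), found.group(2))
             for found in datepat.finditer(text)]
print(iso_dates)  # ['2012-11-27', '2013-3-13']
```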

2.5 Searching and replacing text

str.replace()

text='yeah, but no, but ,yeah'
text.replace('yeah','yep')

'yep, but no, but ,yep'

For more complex patterns, use the sub() function in the re module.

re.sub()

text='Today is 11/27/2012 ,Pycon starts 3/13/2013'
import re
re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',text)
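#The first argument is the pattern to match, the second is the replacement, and the third is the input string; \3, \1 and \2 refer to the capture groups in the pattern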

'Today is 2012-11-27 ,Pycon starts 2013-3-13'

Precompile the pattern for better performance when it is reused:

import re
datepat=re.compile(r'(\d+)/(\d+)/(\d+)')
datepat.sub(r'\3-\1-\2',text)

'Today is 2012-11-27 ,Pycon starts 2013-3-13'

For more complicated substitutions, a replacement callback function can be specified:

from calendar import month_abbr
def change_date(m):
    mon_name=month_abbr[int(m.group(1))]
    return '{} {} {}'.format(m.group(2),mon_name,m.group(3))

datepat.sub(change_date,text)

'Today is 27 Nov 2012 ,Pycon starts 13 Mar 2013'

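The argument to the replacement callback is a match object, of the kind returned by match() or search(). Use its .group() method to extract specific parts of the match. The function returns the replacement text.

To find out how many substitutions were made, use

the re.subn() method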

newtext,n =datepat.subn(r'\3-\1-\2',text)
n
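For reference, here is a self-contained version of the subn() call, a small sketch that repeats the text and pattern from earlier in this section so it can be run on its own.

```python
import re

text = 'Today is 11/27/2012 ,Pycon starts 3/13/2013'
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# subn() returns a tuple: (new string, number of substitutions made).
newtext, n = datepat.subn(r'\3-\1-\2', text)
print(newtext)  # Today is 2012-11-27 ,Pycon starts 2013-3-13
print(n)        # 2
```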

2.6 Searching and replacing text in a case-insensitive way

The re.IGNORECASE flag

text='UPPER PYTHON,lower python,Mixed Python'
re.findall('python',text,flags=re.IGNORECASE)
re.sub('python','java',text,flags=re.IGNORECASE)

'UPPER java,lower java,Mixed java'

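The example above has a limitation: the replacement text does not follow the case of the matched text. A support function can fix this: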

def matchcase(word):
    def replace(m):
        text=m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

re.sub('python',matchcase('snake'),text,flags=re.IGNORECASE)

'UPPER SNAKE,lower snake,Mixed Snake'

2.7 Writing a regular expression for the shortest match

str_pat=re.compile(r'\"(.*)\"')
text1='Computer says "no."'
str_pat.findall(text1)

['no.']

text2='Computer says "no." Phone says "yes."'
str_pat.findall(text2)

['no." Phone says "yes.']

This result is wrong. The * operator is greedy, so matching finds the longest possible match. To fix this, add ? after the *:

str_pat=re.compile(r'\"(.*?)\"')

text2='Computer says "no." Phone says "yes."'
str_pat.findall(text2)

['no.', 'yes.']

The dot (.) matches any character except a newline. Adding ? after * or + forces the matching algorithm to look for the shortest possible match.
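To see the three behaviours side by side, the sketch below compares the greedy pattern, the non-greedy pattern, and a negated character class. The negated class [^"]* is a commonly used alternative that is not part of the notes above; it is shown here as an assumption about what a reader might also try.

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# Greedy: .* runs to the last quote in the string.
greedy = re.findall(r'"(.*)"', text2)
print(greedy)   # ['no." Phone says "yes.']

# Non-greedy: .*? stops at the first closing quote it can.
lazy = re.findall(r'"(.*?)"', text2)
print(lazy)     # ['no.', 'yes.']

# Alternative: match any run of characters that are not quotes.
negated = re.findall(r'"([^"]*)"', text2)
print(negated)  # ['no.', 'yes.']
```

Unlike the dot, the negated class also matches newlines, so the two approaches can differ on multiline input.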

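2.8 Writing a regular expression for multiline patterns

Matching a block of text with a pattern that needs to span multiple lines.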

comment=re.compile(r'/\*(.*?)\*/')

text1='/* this is a comment */'
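text2='''/*this is a
multiline comment*/
'''

comment.findall(text1)

[' this is a comment ']

comment.findall(text2)

[]

The match fails on text2 because the dot does not match newlines. One fix is to add newline support to the pattern:

comment=re.compile(r'/\*((?:.|\n)*?)\*/')
comment.findall(text2)

['this is a\nmultiline comment']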

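In this pattern, (?:.|\n) specifies a non-capturing group: it takes part in matching, but its result is not captured and it is not assigned a group number.

Alternatively, the re.DOTALL flag makes the dot match every character, including newlines:

comment=re.compile(r'/\*(.*?)\*/',re.DOTALL)  #try running it to see the result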
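The two approaches can be checked against each other. A minimal sketch, assuming the same two-line comment string as above; the names explicit and dotall are introduced only for this comparison.

```python
import re

text2 = '''/*this is a
multiline comment*/
'''

# Explicit newline support through a non-capturing group.
explicit = re.compile(r'/\*((?:.|\n)*?)\*/')

# Same idea using the re.DOTALL flag, which lets . match newlines.
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL)

explicit_result = explicit.findall(text2)
dotall_result = dotall.findall(text2)
print(explicit_result)                   # ['this is a\nmultiline comment']
print(explicit_result == dotall_result)  # True
```

For simple cases like this one the flag is the shorter option; the explicit group keeps the pattern independent of flags when it is combined with other patterns.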

2.9 Normalizing Unicode text to a standard representation

The same text can have more than one representation.

The unicodedata module

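s1='Spicy Jalape\u00f1o'
s2='Spicy Jalapen\u0303o'
#s1==s2    #False
import unicodedata
t1=unicodedata.normalize('NFC',s1)
t2=unicodedata.normalize('NFC',s2)
#t1==t2		#True

print(ascii(s1))

'Spicy Jalape\xf1o'

The first argument to normalize() specifies how the string should be normalized. NFC means characters are fully composed (a single code point where possible). NFD means characters are fully decomposed into base characters plus combining characters.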
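The difference between the two forms shows up in the string lengths. The sketch below normalizes in both directions and, as one possible application, strips the combining marks after NFD decomposition. The accent-stripping step and the names nfc, nfd and stripped are assumptions added for illustration.

```python
import unicodedata

s1 = 'Spicy Jalape\u00f1o'   # precomposed n-with-tilde (U+00F1)
s2 = 'Spicy Jalapen\u0303o'  # 'n' followed by a combining tilde (U+0303)

print(len(s1), len(s2))      # 14 15

# NFC composes, NFD decomposes; each form round-trips to the other.
nfc = unicodedata.normalize('NFC', s2)
nfd = unicodedata.normalize('NFD', s1)
print(nfc == s1)  # True
print(nfd == s2)  # True

# After decomposition, combining marks can be filtered out.
stripped = ''.join(c for c in nfd if not unicodedata.combining(c))
print(stripped)   # Spicy Jalapeno
```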

2.10 Working with Unicode characters in regular expressions

import re

num=re.compile(r'\d+')
num.match('123')

<re.Match object; span=(0, 3), match='123'>
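By default, \d in a str pattern matches any Unicode decimal digit, not only 0-9. The sketch below illustrates this with Arabic-Indic digits and shows how the re.ASCII flag restricts matching to ASCII digits. The sample string and the names unicode_num and ascii_num are assumptions chosen for this example.

```python
import re

arabic_digits = '\u0661\u0662\u0663'  # Arabic-Indic digits one, two, three

unicode_num = re.compile(r'\d+')            # Unicode-aware (default for str)
ascii_num = re.compile(r'\d+', re.ASCII)    # restrict \d to [0-9]

print(unicode_num.match(arabic_digits) is not None)  # True
print(ascii_num.match(arabic_digits) is None)        # True
print(ascii_num.match('123').group())                # 123
```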