2022-4-27 Python Cookbook (v3.0) study notes (part 2)

Strings and Text

Splitting strings on multiple delimiters

>>> line = 'asdf fjdk; afed, fjek,asdf, foo'
>>> import re
>>> re.split(r'[;,\s]\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
>>> fields = re.split(r'(;|,|\s)\s*', line)	# with a capture group, the delimiters also appear in the result
>>> fields
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
>>> re.split(r'(?:,|;|\s)\s*', line)		# to group the pattern without keeping the delimiters, use a non-capture group (?:...)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
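
Keeping the delimiters is useful when the string has to be rebuilt later. The following is a minimal sketch of that idea, not part of the original notes; the names values, delimiters and rebuilt are chosen here for illustration only.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# A capture group makes re.split() keep the delimiters in the result.
fields = re.split(r'(;|,|\s)\s*', line)

values = fields[::2]               # even indexes hold the fields
delimiters = fields[1::2] + ['']   # odd indexes hold the delimiters; pad to equal length

# Rebuild the string with the same delimiters. The whitespace that
# followed each delimiter was consumed by \s* and is not restored.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))

print(values)
print(delimiters)
print(rebuilt)
```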

Matching text at the start or end of a string

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.python.org'
>>> url.startswith('http:')
True
>>> import os
>>> filename = os.listdir('.')
>>> filename
['DLLs', 'Doc', 'include', 'Lib', 'libs', 'LICENSE.txt', 'NEWS.txt', 'python.exe', 'python3.dll', 'python38.dll', 'pythonw.exe', 'Scripts', 'tcl', 'Tools', 'vcruntime140.dll', 'vcruntime140_1.dll']
>>> [name for name in filename if name.endswith(('.dll', '.txt'))]		# multiple choices must be passed as a tuple
['LICENSE.txt', 'NEWS.txt', 'python3.dll', 'python38.dll', 'vcruntime140.dll', 'vcruntime140_1.dll']
>>> any(name.endswith('.txt') for name in filename)
True
>>> 
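
Because startswith() and endswith() only accept a string or a tuple of strings, choices stored in a list or set need converting first. A small sketch of this; the list of prefixes is a made-up example.

```python
choices = ['http:', 'https:', 'ftp:']   # hypothetical list of prefixes
url = 'http://www.python.org'

# Passing a list raises TypeError.
try:
    url.startswith(choices)
    accepted_list = True
except TypeError:
    accepted_list = False

# Converting to a tuple first works.
matches = url.startswith(tuple(choices))

print(accepted_list, matches)
```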

Matching strings with shell wildcard patterns

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True

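fnmatch() follows the case-sensitivity rules of the underlying operating system, so the result differs between systems. Use fnmatchcase() instead when the match must always be case-sensitive.

>>> fnmatch('foo.txt', '*.TXT')		# True on Windows (as here), False on Linux and macOS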
True
>>> fnmatchcase('foo.txt', '*.TXT')
False
>>> 
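
These functions are not limited to filenames. The sketch below filters ordinary strings with fnmatchcase(); the address list is invented sample data.

```python
from fnmatch import fnmatchcase

# Hypothetical sample data, not filenames.
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

# Addresses that end in " ST".
streets = [addr for addr in addresses if fnmatchcase(addr, '* ST')]

# Addresses on CLARK whose number is 54xx.
clark_54xx = [addr for addr in addresses
              if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

print(streets)
print(clark_54xx)
```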

Matching and searching for text patterns

Matching literal strings

>>> test = 'yeah, but, no, but, yeah, but, no, but, yeah'
>>> test == 'yeah'
False
>>> test.startswith('yeah')
True
>>> test.endswith('no')
False
>>> test.find('but')
6
>>> 

More complex matching requires regular expressions and the re module

>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>> import re
>>> if re.match(r'\d+/\d+/\d+', text1):
	print('yes')
else:
	print('no')

yes
>>> 

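When the same pattern is used for many matches, precompile the pattern string into a pattern object first

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
	print('yes')
else:
	print('no')

yes
>>> 

match() always starts matching at the beginning of the string. To find matches anywhere in the text, use findall()

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
['11/27/2012', '3/13/2013']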
>>> 

When defining a regular expression, parentheses are commonly used to create capture groups

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> m = datepat.match('11/27/2012')
>>> m
<re.Match object; span=(0, 10), match='11/27/2012'>
>>> m.group(1)
'11'
>>> m.group(2)
'27'
>>> m.group(3)
'2012'
>>> m.groups()
('11', '27', '2012')
>>> month, day, year = m.groups()
>>> datepat.findall(text)
[('11', '27', '2012'), ('3', '13', '2013')]
>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)$')		# for an exact match, end the pattern with $
>>> datepat.match('11/27/2012abcd')		# returns None: trailing text is not allowed
>>> datepat.match('11/27/2012')
<re.Match object; span=(0, 10), match='11/27/2012'>
>>> 
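
An alternative to the trailing $ is the fullmatch() method, which requires the whole string to match. This is a sketch for comparison and is not taken from the original notes. One difference worth knowing is that $ also matches just before a trailing newline, while fullmatch() does not.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
anchored = re.compile(r'(\d+)/(\d+)/(\d+)$')

# fullmatch() succeeds only if the entire string matches the pattern.
exact = datepat.fullmatch('11/27/2012')
rejected = datepat.fullmatch('11/27/2012abcd')

# $ tolerates one trailing newline; fullmatch() does not.
dollar_with_newline = anchored.match('11/27/2012\n')
full_with_newline = datepat.fullmatch('11/27/2012\n')

print(exact.groups())
print(rejected, dollar_with_newline is not None, full_with_newline)
```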

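findall() returns all matches as a list. To get the matches one at a time as an iterator, use finditer()

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')		# back to the pattern without the trailing $
>>> for m in datepat.finditer(text):
	print(m.groups())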

('11', '27', '2012')
('3', '13', '2013')
>>> 

Searching and replacing text

Simple literal patterns

>>> text = 'yeah, but. no. bnut, yeah, but, no, but, yeah'
>>> text.replace('yeah', 'yep')
'yep, but. no. bnut, yep, but, no, but, yep'
>>> 

For more complex patterns, use sub()

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)		# \3 refers to capture group 3 of the pattern
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>>

For repeated substitutions, compile the pattern first

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> datepat.sub(r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> 

For more complex substitutions, pass a callback function

>>> from calendar import month_abbr
>>> def change_date(m):
	mon_name = month_abbr[int(m.group(1))]
	return '{} {} {}'.format(m.group(2), mon_name, m.group(3))

>>> datepat.sub(change_date, text)
'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'
>>> 

To find out how many substitutions were made, use subn()

>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2
>>> 

Case-insensitive search and replace

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'
>>> 
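
Note that the replacement above does not follow the case of the matched text. One possible workaround is a callback that adapts the replacement word to each match. The sketch below assumes a helper named matchcase, which is not defined anywhere else in these notes.

```python
import re

def matchcase(word):
    """Return a callback that replaces a match with word, copying the match's case."""
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

text = 'UPPER PYTHON, lower python, Mixed Python'
result = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(result)
```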

Shortest-match (non-greedy) patterns

>>> str_pat = re.compile(r'\"(.*)\"')
>>> text1 = 'Computer says "no."'
>>> str_pat.findall(text1)
['no.']
>>> text2 = 'Computer says "no." Phone says "yes."'
>>> str_pat.findall(text2)		# greedy .* matches too much, which is not what we want
['no." Phone says "yes.']
>>> str_pat = re.compile(r'\"(.*?)\"')		# add ? after * to make the match non-greedy
>>> str_pat.findall(text2)
['no.', 'yes.']
>>> 

Matching across multiple lines

>>> comment = re.compile(r'/\*(.*?)\*/')
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
multiline comment */
'''
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')		# . does not match a newline, so allow \n explicitly
>>> comment.findall(text2)
[' this is a\nmultiline comment ']
>>> comment.findall(text1)
[' this is a comment ']
>>> 
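
The re.DOTALL flag is another way to get the same result: it makes . match every character, including a newline. A short sketch using the same two sample strings:

```python
import re

# With re.DOTALL, '.' also matches '\n', so the simple pattern is enough.
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)

text1 = '/* this is a comment */'
text2 = '/* this is a\nmultiline comment */\n'

single = comment.findall(text1)
multi = comment.findall(text2)
print(single)
print(multi)
```

For simple cases the flag is usually enough. When a pattern is combined with others into a larger expression, the explicit (?:.|\n) form avoids changing the meaning of . elsewhere.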

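Normalizing Unicode text

>>> s1 = 'Spicy Jalape\u00f1o'	# precomposed character (U+00F1)
>>> s2 = 'Spicy Jalapen\u0303o'		# Latin letter n followed by a combining tilde (U+0303)
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1 == s2	# the strings look identical, but the comparison returns False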
False
>>> len(s1)
14
>>> len(s2)
15
>>>
>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)	# supported forms: NFC, NFD, NFKC, NFKD
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> print(ascii(t2))
'Spicy Jalape\xf1o'
>>> 
>>> t1 = unicodedata.normalize('NFD', s1)
>>> ''.join(c for c in t1 if not unicodedata.combining(c))	# combining() tests whether a character is a combining character
'Spicy Jalapeno'
>>> 
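
In practice it can be convenient to normalize both sides before comparing. The helper below is a sketch under that assumption; the name same_text is hypothetical. It also shows how the compatibility form NFKC differs from NFC, using the single-character ligature U+FB01 as an example.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

s1 = 'Spicy Jalape\u00f1o'    # precomposed character
s2 = 'Spicy Jalapen\u0303o'   # n + combining tilde

equal_raw = (s1 == s2)
equal_normalized = same_text(s1, s2)

# NFC keeps the ligature as one character; NFKC expands it to 'f' + 'i'.
s3 = '\ufb01sh'
nfc_is_fish = unicodedata.normalize('NFC', s3) == 'fish'
nfkc_is_fish = unicodedata.normalize('NFKC', s3) == 'fish'

print(equal_raw, equal_normalized, nfc_is_fish, nfkc_is_fish)
```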

Using Unicode in regular expressions

>>> import re
>>> num = re.compile(r'\d+')	# the re module already has basic Unicode support
>>> num.match('123')
<re.Match object; span=(0, 3), match='123'>
>>> num.match('\u0661\u0662\u0663')
<re.Match object; span=(0, 3), match='١٢٣'>
>>> 
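
If only the ASCII digits 0-9 should match, the re.ASCII flag restricts \d (and \w, \s) to ASCII. A short sketch for comparison:

```python
import re

arabic_digits = '\u0661\u0662\u0663'   # ARABIC-INDIC DIGITS ONE, TWO, THREE

unicode_num = re.compile(r'\d+')            # default for str patterns: any Unicode digit
ascii_num = re.compile(r'\d+', re.ASCII)    # only 0-9

matches_unicode = unicode_num.match(arabic_digits) is not None
matches_ascii = ascii_num.match(arabic_digits) is not None
plain = ascii_num.match('123').group()

print(matches_unicode, matches_ascii, plain)
```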

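Stripping unwanted characters from strings

strip() removes characters from both the start and the end of a string; lstrip() and rstrip() strip only the left or the right side

>>> s = '  hello world \n'
>>> s.strip()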
'hello world'
>>> s.lstrip()
'hello world \n'
>>> s.rstrip()
'  hello world'
>>> t = '---hello ==='
>>> t.lstrip('-')
'hello ==='
>>> t.rstrip('=')
'---hello '
>>> t.strip('-=')
'hello '
>>> t.strip(' -=')
'hello'
>>> 
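
The strip methods never touch text in the middle of a string. To collapse inner whitespace as well, another technique is needed, for example re.sub(). A minimal sketch:

```python
import re

s = '  hello     world \n'

# strip() only removes leading and trailing whitespace.
stripped = s.strip()

# Collapse every run of whitespace in the middle into a single space.
collapsed = re.sub(r'\s+', ' ', stripped)

print(repr(stripped))
print(repr(collapsed))
```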

Sanitizing and cleaning up text

>>> s = 'python\fis\tawesome\r\n'
>>> s
'python\x0cis\tawesome\r\n'
>>> remap = {
	ord('\t'):'',
	ord('\f'):'',
	ord('\r'):None
	}
>>> a = s.translate(remap)
>>> a
'pythonisawesome\n'
>>> 
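
The translation table can also be built with str.maketrans(), which accepts single-character keys directly instead of ord() values. The sketch below maps the whitespace characters to a space rather than deleting them, which is a different choice from the table above and keeps the words apart.

```python
s = 'python\fis\tawesome\r\n'

# Keys are single characters; a value of None deletes the character.
remap = str.maketrans({
    '\t': ' ',
    '\f': ' ',
    '\r': None,
})

cleaned = s.translate(remap)
print(repr(cleaned))
```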

Aligning text strings

>>> text = 'Hello World'
>>> text.ljust(20)
'Hello World         '
>>> text.rjust(20)
'         Hello World'
>>> text.center(20)
'    Hello World     '
>>> text.rjust(20, '=')
'=========Hello World'
>>> text.center(20, '*')
'****Hello World*****'
>>> format(text, '=>20s')
'=========Hello World'
>>> format(text, '>20')
'         Hello World'
>>> format(text, '<20')
'Hello World         '
>>> format(text, '*^20s')
'****Hello World*****'
>>> '{:>10s} {:>10s}'.format('Hello', 'World')
'     Hello      World'
>>> x = 1.2345
>>> format(x, '-10.2f')		# '-' (show a sign only for negative numbers) is the default sign option
'      1.23'
>>> format(x, '>10')
'    1.2345'
>>> 
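
The same format codes work inside f-strings, which is handy for lining up columns. The rows below are invented sample data, and the column widths are arbitrary.

```python
# Hypothetical sample rows: (name, shares, price)
rows = [('ACME', 50, 91.1), ('IBM', 100, 45.23)]

# Left-align the name in 8 columns, right-align the numbers.
lines = [f'{name:<8}{shares:>6d}{price:>10.2f}' for name, shares, price in rows]

for row in lines:
    print(row)
```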

Combining and concatenating strings

>>> parts = ['Is', 'Chicago', 'Not', 'Chicago?']
>>> ''.join(parts)
'IsChicagoNotChicago?'
>>> ','.join(parts)
'Is,Chicago,Not,Chicago?'
>>> ' '.join(parts)
'Is Chicago Not Chicago?'
>>> a = 'Is Chicago'
>>> b = 'Not Chicago?'
>>> a + ' ' + b
'Is Chicago Not Chicago?'
>>> a = 'Hello' 'World'		# adjacent string literals are concatenated automatically
>>> a
'HelloWorld'
>>> 

Never join strings like this:

>>> s = ''
>>> for p in parts:
	s += p		# each += creates a new string object

A generator expression can be used instead

>>> data = ['ACME', 50, 91, 1]
>>> ','.join(str(d) for d in data)
'ACME,50,91,1'
>>> 
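
The same idea extends to generator functions: code that builds output from many small pieces can yield the fragments and leave the joining to the caller. A sketch, where sample() is a made-up generator:

```python
def sample():
    """Yield text fragments one at a time instead of concatenating them."""
    yield 'Is'
    yield 'Chicago'
    yield 'Not'
    yield 'Chicago?'

# The caller decides how the fragments are combined.
text = ' '.join(sample())
print(text)
```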

Unnecessary string concatenation:

>>> print(a + ':' + b + ':' + c)	# works, but redundant
>>> print(a, b, c, sep=':')		# better
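
A runnable sketch of the two approaches, with sample values assumed for a, b and c. Unlike + and join(), print() converts non-string arguments itself. The output is captured in a StringIO buffer only so that it can be inspected.

```python
import io

a, b, c = 'ACME', 50, 91.1   # hypothetical values; note that b and c are not strings

# join() needs every item converted to str first.
joined = ':'.join(str(x) for x in (a, b, c))

# print() converts each argument itself and inserts the separator.
buf = io.StringIO()
print(a, b, c, sep=':', file=buf)
printed = buf.getvalue()

print(joined)
```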

Interpolating variables in strings

>>> s = '{name} has {n} messages.'
>>> s.format(name = 'Guido', n = 37)
'Guido has 37 messages.'
>>> 
>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())		# looks the names up in the current scope; vars() also works on object instances
'Guido has 37 messages.'
>>> 

One drawback of format() and format_map() is that they cannot handle missing values

>>> s.format(name='Guido')
Traceback (most recent call last):
  File "<pyshell#341>", line 1, in <module>
    s.format(name='Guido')
KeyError: 'n'
>>> 

One way to avoid this error is to define a dict subclass with a __missing__() method:

>>> class safesub(dict):
	def __missing__(self, key):
		return '{' + key + '}'

>>> del n
>>> s.format_map(safesub(vars()))
'Guido has {n} messages.'
>>> 
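
The standard library offers a comparable facility in string.Template, whose safe_substitute() leaves unknown placeholders in place. It uses $name syntax instead of {name}, so it is an alternative rather than a drop-in replacement. A minimal sketch:

```python
import string

s = string.Template('$name has $n messages.')

# All values supplied.
complete = s.substitute(name='Guido', n=37)

# A missing value is left as-is instead of raising KeyError.
partial = s.safe_substitute(name='Guido')

print(complete)
print(partial)
```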

This can be wrapped in a helper function:

>>> import sys
>>> def sub(text):
	return text.format_map(safesub(sys._getframe(1).f_locals))

>>> name = 'Guido'
>>> n = 37
>>> print(sub('Hello {name}'))
Hello Guido
>>> print(sub('You have {n} messages.'))
You have 37 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}
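
The helper also works when it is called from inside a function, because sys._getframe(1) refers to the caller's frame. The self-contained sketch below repeats the two definitions and adds a hypothetical greet() function. Note that sys._getframe() is a CPython implementation detail and may not exist in other interpreters.

```python
import sys

class safesub(dict):
    """A dict that returns the placeholder itself for missing keys."""
    def __missing__(self, key):
        return '{' + key + '}'

def sub(text):
    # Frame 1 is the caller of sub(); f_locals holds its local variables.
    return text.format_map(safesub(sys._getframe(1).f_locals))

def greet():
    # Hypothetical caller: name and n are local variables of greet().
    name = 'Guido'
    n = 37
    return sub('{name} has {n} messages, favorite color {color}.')

message = greet()
print(message)
```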

Reformatting text to a fixed column width

>>> s = "Look into my eyes, look into my eyes,the eyes, the eyes,\
the eyes, not around the eyes, don't look around the eyes,\
look into my eyes, you're under."
>>> import textwrap
>>> print(textwrap.fill(s, 70))
Look into my eyes, look into my eyes,the eyes, the eyes,the eyes, not
around the eyes, don't look around the eyes,look into my eyes, you're
under.
>>> print(textwrap.fill(s, 40, initial_indent='    '))
    Look into my eyes, look into my
eyes,the eyes, the eyes,the eyes, not
around the eyes, don't look around the
eyes,look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, subsequent_indent='    '))
Look into my eyes, look into my eyes,the
    eyes, the eyes,the eyes, not around
    the eyes, don't look around the
    eyes,look into my eyes, you're
    under.
>>> 
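
When the text is meant for a terminal, the width can be taken from the terminal itself instead of being hard-coded. A sketch using shutil.get_terminal_size(), which falls back to 80 columns by default when the size cannot be determined:

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Use the current terminal width (or the fallback) as the wrap width.
width = shutil.get_terminal_size(fallback=(80, 24)).columns
wrapped = textwrap.fill(s, width)

print(wrapped)
```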

Handling HTML and XML entities in text

>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> print(html.escape(s, quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
>>> 
>>> s = 'Spicy Jalapeño'
>>> s.encode('ascii', errors='xmlcharrefreplace')
b'Spicy Jalape&#241;o'
>>> 

For raw text that contains entity or character references, replace them as follows:

>>> s = 'Spicy &quot;Jalape&#241;o&quot.'
>>> import html
>>> html.unescape(s)
'Spicy "Jalapeño".'
>>> 
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'
>>> 
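
html.escape() and html.unescape() are inverses for ordinary text, which gives a quick way to check a round trip. A minimal sketch:

```python
import html

s = 'Elements are written as "<tag>text</tag>".'

# Escape the special characters, then convert the entities back.
escaped = html.escape(s)
restored = html.unescape(escaped)

print(escaped)
print(restored == s)
```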

Tokenizing text

Not needed yet

Writing a recursive descent parser

Not needed yet
