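Part 2: Python Strings and Regular Expressions
Strings are everywhere, and processing them is one of the most common tasks. This chapter summarizes the operations related to string processing in two groups: basic string operations, and advanced string operations using regular expressions. There are currently 25 small examples.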
-
Reverse a string
st = 'python'
# Method 1: reversed() plus join
''.join(reversed(st))  # 'nohtyp'
# Method 2: slice with a step of -1
st[::-1]  # 'nohtyp'
-
String slicing
# Replace multiples of 3 with 'java' and multiples of 5 with 'python'
[str("java"[i%3*4:] + "python"[i%5*6:] or i) for i in range(1, 15)]
'''
['1', '2', 'java', '4', 'python', 'java', '7', '8', 'java', 'python', '11', 'java', '13', '14']
'''
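The one-liner above is compact but hard to read, so here is a sketch that unpacks it. The helper name `label` is hypothetical and not part of the original. When i is a multiple of 3, `i % 3 * 4` is 0 and the slice keeps the whole word; otherwise the start index is 4 or 8, at or past the end of the 4-character string, so the slice is empty. The same reasoning applies to "python" with a step of 6.

```python
def label(i):
    # Hypothetical helper that spells out the one-liner step by step.
    java_part = "java"[i % 3 * 4:]      # 'java' only when i % 3 == 0, else ''
    python_part = "python"[i % 5 * 6:]  # 'python' only when i % 5 == 0, else ''
    # An empty string is falsy, so `or` falls back to the number itself.
    return str(java_part + python_part or i)

result = [label(i) for i in range(1, 16)]
print(result[-1])  # 'javapython', since 15 is a multiple of both 3 and 5
```

Extending the range to 15 shows a case the original range stops just short of: both slices are non-empty, so the two words are concatenated.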
-
Join strings with join
my = ['1', '2', 'java']
','.join(my)  # join the strings with commas: '1,2,java'
-
Byte length of a string
def str_byte_len(mystr):
    return len(mystr.encode('utf-8'))

str_byte_len('i love python')  # 13 (bytes)
str_byte_len('字符')  # 6 (bytes)
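The byte length depends on the encoding, not only on the string. The sketch below adds an `encoding` parameter to the function; that parameter is an assumption made for illustration and is not in the original.

```python
def str_byte_len(text, encoding='utf-8'):
    # The `encoding` parameter is an assumption added for this sketch.
    return len(text.encode(encoding))

print(str_byte_len('字符'))               # 6: three bytes per character in UTF-8
print(str_byte_len('字符', 'gbk'))        # 4: two bytes per character in GBK
print(str_byte_len('字符', 'utf-16-le'))  # 4: two bytes per character, no byte order mark
```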
The regular expression examples that follow require the re module:
import re
-
Find the first match
s = 'i love python very much'
pat = 'python'
r = re.search(pat, s)
print(r.span())  # (7, 13)
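Note that re.search returns None when nothing matches, so calling .span() on the result directly would raise an AttributeError in that case. A minimal sketch of a guarded version, using a hypothetical helper named `find_span`:

```python
import re

def find_span(pattern, text):
    # Hypothetical helper: (start, end) of the first match, or None if there is none.
    match = re.search(pattern, text)
    return match.span() if match else None

print(find_span('python', 'i love python very much'))  # (7, 13)
print(find_span('java', 'i love python very much'))    # None
```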
-
Find the indices of every 1
s = '山东省潍坊市青州第1中学高三1班'
pat = '1'
r = re.finditer(pat, s)
for i in r:
    print(i)
'''
Output on Python 3.7+ (older versions print _sre.SRE_Match instead of re.Match):
<re.Match object; span=(9, 10), match='1'>
<re.Match object; span=(14, 15), match='1'>
'''
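The loop above prints match objects rather than bare indices. If only the positions are needed, one possible approach is to call start() on each match, as in this sketch:

```python
import re

s = '山东省潍坊市青州第1中学高三1班'
# m.start() is the index at which each match begins.
indices = [m.start() for m in re.finditer('1', s)]
print(indices)  # [9, 14]
```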
-
\d matches a digit [0-9]
findall returns every match in the string
s = '一共20行代码运行时间13.59s'
pat = r'\d+'  # + matches the preceding item (\d, the character class for digits) one or more times
r = re.findall(pat, s)
print(r)
'''
['20', '13', '59']
'''
-
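Match floats and integers
? matches the preceding character 0 or 1 times
s = '一共20行代码运行时间13.59s'
pat = r'\d+\.?\d+'  # ? matches the decimal point (\.) 0 or 1 times; this pattern has a small bug: it cannot match a single-digit integer
r = re.findall(pat, s)
print(r)
'''
['20', '13.59']
'''
A better pattern:
pat = r'\d+\.\d+|\d+'  # A|B: B is tried only where A fails to match
re.findall(pat, s)  # ['20', '13.59']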
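The sample string has no single-digit integer, so both patterns give the same result there. The sketch below uses a hypothetical test string that does contain one, which makes the difference visible:

```python
import re

s = '3 apples cost 4.5 dollars, 12 cost 18'  # hypothetical test string

# \d+\.?\d+ needs at least two digits in total, so the lone '3' is skipped.
buggy = re.findall(r'\d+\.?\d+', s)
# The alternation tries the float form first, then falls back to a plain integer.
fixed = re.findall(r'\d+\.\d+|\d+', s)

print(buggy)  # ['4.5', '12', '18']
print(fixed)  # ['3', '4.5', '12', '18']
```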
-
^ matches the start of the string
s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'^[emrt]'  # match a string that starts with the character e, m, r, or t
r = re.findall(pat, s)
print(r)
'''
[]
The string starts with the character T, which is not in [emrt], so the result is empty.
'''
s2 = 'email for me is guozhennianhua@163.com'
re.findall('^[emrt].*', s2)  # starts with e, m, r, or t, followed by any number of arbitrary characters
'''
['email for me is guozhennianhua@163.com']
'''
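By default ^ anchors only at the very start of the whole string. With the re.MULTILINE flag it also matches at the start of each line. A short sketch on a hypothetical three-line sample:

```python
import re

text = 'email me\nThis line starts with T\nreply soon'  # hypothetical sample

first_only = re.findall(r'^[emrt]\w*', text)               # ^ matches at the string start only
per_line = re.findall(r'^[emrt]\w*', text, re.MULTILINE)   # ^ matches at every line start

print(first_only)  # ['email']
print(per_line)    # ['email', 'reply']
```

The second line is skipped in both cases because it starts with an uppercase T, which is not in [emrt].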
-
re.I ignores case
s = 'That'
pat = r't'
r = re.findall(pat, s, re.I)  # ['T', 't']
-
Understanding what compile does
If you need to match many times, you can compile the pattern first:
import re
pat = re.compile(r'\W+')  # \W matches any character that is not a letter, digit, or underscore
has_special_chars = pat.search('ed#2@edc')
if has_special_chars:
    print(f'str contains special characters:{has_special_chars.group(0)}')
'''
str contains special characters:#
'''
# Reuse the compiled pattern object pat for another match
again_pattern = pat.findall('guozhennianhua@163.com')
if '@' in again_pattern:
    print('possibly it is an email')
'''
possibly it is an email
'''
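To illustrate the "many matches" case, the sketch below compiles the pattern once and applies it to several inputs. The sample inputs and the `results` dictionary are hypothetical.

```python
import re

# Compile once, then reuse the same pattern object for every input.
non_word = re.compile(r'\W+')

samples = ['ed#2@edc', 'plain_text_123', 'a b']  # hypothetical inputs
results = {text: non_word.findall(text) for text in samples}

for text, found in results.items():
    print(text, found)
```

The re module also caches recently used patterns internally, so for a handful of patterns the benefit is mostly readability; a compiled object tends to matter more in tight loops.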
-
Use () to capture words without the whitespace
s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s([a-zA-Z]+)'
r = re.findall(pat, s)
print(r)
'''
['module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']
'''
The first word is missing from the result because no whitespace precedes it. Adding ? makes the preceding character optional (0 or 1 occurrences). Note that ? is also the marker for non-greedy matching when it follows another quantifier, so use it with care.
s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s?([a-zA-Z]+)'
r = re.findall(pat, s)
print(r)
'''
['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']
'''
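The other meaning of ?, non-greedy matching, can be seen by placing it after a quantifier such as +. A minimal sketch on a hypothetical sample string:

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # hypothetical sample

greedy = re.findall(r'<.+>', html)   # .+ grabs as much as possible
lazy = re.findall(r'<.+?>', html)    # .+? stops at the first closing '>'

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```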
-
Split words with split
The approach above is not a concise way to split words; it was only a demonstration. The simplest way to split words is the split function.
s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s+'
r = re.split(pat, s)
print(r)
'''
['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']
'''
# The same sentence can also be split with str's built-in split method:
s.split(' ')  # split on a single space
'''
['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']
'''
# However, when the separators are more complex, str.split cannot cope and a regular expression is needed
s = 'This,,, module ; \t provides|| regular ; '
words = re.split(r'[,\s;|]+', s)  # splitting this way leaves an empty string at the end
words = [i for i in words if len(i) > 0]
# ['This', 'module', 'provides', 'regular']
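An alternative to splitting and then filtering out the empty string is to match the words directly, using a negated character class for "anything that is not a separator". This is one possible approach, not the original author's method:

```python
import re

s = 'This,,, module ; \t provides|| regular ; '
# [^,\s;|]+ matches runs of characters that are NOT separators,
# so no empty strings are produced in the first place.
words = re.findall(r'[^,\s;|]+', s)
print(words)  # ['This', 'module', 'provides', 'regular']
```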