Python Practice Handbook: String Processing and Regular Expressions

1. Reversing a string

st = "this is a string"
## Method 1
''.join(reversed(st))
'gnirts a si siht'
## Method 2
st[::-1]
'gnirts a si siht'

2. Joining strings

my_str = ['this','is','a','string']
' '.join(my_str) ## join the strings with a space
'this is a string'
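str.join only accepts strings, so a list that contains numbers has to be converted first. A minimal sketch, assuming a made-up list named `items` that is not part of the original example:

```python
# Hypothetical mixed list; str.join raises TypeError on non-string items.
items = ['version', 3, 'build', 42]

try:
    ' '.join(items)
    join_failed = False
except TypeError:
    join_failed = True

# Convert every item to str before joining.
joined = ' '.join(map(str, items))
print(join_failed)
print(joined)
```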

3. Byte length of a string

length1 = len('this is a string')
length2 = len('this is a string'.encode('utf-8'))
print(f'length1:{length1}')
print(f'length2 (byte length):{length2}')
length1:16
length2 (byte length):16
length3 = len('中国梦')
length4 = len('中国梦'.encode('utf-8'))
print(f'length3:{length3}')
print(f'length4 (byte length):{length4}')
length3:3
length4 (byte length):9
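The byte length depends on the encoding, not only on the text. The sketch below compares a few encodings; the helper name `byte_lengths` and the choice of encodings are assumptions made for illustration.

```python
def byte_lengths(text, encodings=('utf-8', 'utf-16-le', 'gbk')):
    """Return {encoding: number of bytes} for text under each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}

ascii_sizes = byte_lengths('this is a string')
cjk_sizes = byte_lengths('中国梦')
print(ascii_sizes)
print(cjk_sizes)
```

ASCII text takes one byte per character in UTF-8 and GBK, while each of the three Chinese characters takes three bytes in UTF-8 and two bytes in GBK or UTF-16.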

4. Finding the index of a match with regex

import re

s1 = 'my name is jack,my code is 9527'
pat = 'jack'
r = re.search(pat,s1)  # re.search() finds the first match
print(r.span())
(11, 15)
s2 = '你好!我想了解下健康保险和医疗保险的区别'
pat = '保险'
r = re.finditer(pat,s2)
for i in r:
    print(i)
<re.Match object; span=(10, 12), match='保险'>
<re.Match object; span=(15, 17), match='保险'>
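When only the positions are needed, the spans can be collected into a list. re.search returns None when nothing matches, so it is worth checking before calling .span(). A minimal sketch; the helper `find_spans` and the sample sentence are illustrative assumptions.

```python
import re

def find_spans(pattern, text):
    """Return the (start, end) span of every non-overlapping match."""
    return [m.span() for m in re.finditer(pattern, text)]

text = 'health insurance and medical insurance'
spans = find_spans('insurance', text)
print(spans)

# re.search returns None when there is no match, so guard the .span() call.
m = re.search('pension', text)
first_span = m.span() if m else None
print(first_span)
```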

5. Matching digits [0-9]

findall returns every match in the string. \d is the shorthand class for a digit (0-9), and + means one or more repetitions.

s = '一共20行代码运行时间13.59s'
pat = r'\d+'  
r = re.findall(pat,s)
print(r)
['20', '13', '59']

6. Matching floating-point numbers

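? means zero or one occurrence. Note that \d+\.?\d+ requires at least two digits, which is why the lone 8 in the example is not matched.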

s = '一共20行代码运行时间13.59s,8'
pat = r'\d+\.?\d+'  
r = re.findall(pat,s)
print(r)
['20', '13.59']
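If single-digit numbers should be matched as well, one option is to make the whole fractional part optional with a non-capturing group. This is a sketch of one possible pattern, not the only way to write it; the English sample sentence is made up.

```python
import re

# Integer part, then an optional ".digits" fractional part.
# (?:...) is non-capturing, so findall returns the whole match.
NUMBER = re.compile(r'\d+(?:\.\d+)?')

s = 'ran 20 lines in 13.59s, 8 retries'
numbers = NUMBER.findall(s)
print(numbers)  # ['20', '13.59', '8']
```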

7. Matching the start of a string (^)

s1 = 'my name is jack,my code is 9527'
pat = r'^[emrt]'
r = re.findall(pat,s1)
print(r)
['m']
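By default ^ matches only at the very start of the string. With the re.M (re.MULTILINE) flag it also matches right after each newline. A small sketch with a made-up two-line string:

```python
import re

text = 'name: jack\ncode: 9527'

# Without re.M, ^ anchors only at the start of the whole string.
first_only = re.findall(r'^\w+', text)
# With re.M, ^ also anchors at the start of every line.
every_line = re.findall(r'^\w+', text, re.M)

print(first_only)  # ['name']
print(every_line)  # ['name', 'code']
```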

8. Ignoring case (re.I)

s = 'That'
pat = r't'
r = re.findall(pat,s,re.I)
r
['T', 't']
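The same flag can be supplied in three equivalent ways: as an argument, as an inline (?i) prefix in the pattern, or when compiling. A brief sketch:

```python
import re

s = 'That'

with_flag = re.findall(r't', s, flags=re.I)    # flag passed as an argument
inline = re.findall(r'(?i)t', s)               # inline flag at the start of the pattern
compiled = re.compile(r't', re.I).findall(s)   # flag stored in the compiled pattern

print(with_flag, inline, compiled)
```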

9. What compile is for

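If the same pattern will be matched many times, compile it first. (\W matches any character that is not a word character, i.e. not a letter, digit, underscore, or Chinese character.)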

import re

pat = re.compile(r'\W+')
## First use of the compiled pattern object
has_special_chars = pat.findall('ed#2@ed*&$cz中国梦!')
if has_special_chars:
    print(f'The sentence contains special characters:{has_special_chars}')

## Second use of the compiled pattern object
again_pattern = pat.findall("jack18588951684@gmail.com")
if '@' in again_pattern:
    print('This may be an email address!')
The sentence contains special characters:['#', '@', '*&$', '!']
This may be an email address!
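A compiled pattern is an ordinary object, so it can be stored in a module-level constant and reused inside functions. The sketch below assumes a hypothetical helper `special_runs` and made-up sample strings.

```python
import re

# Compiled once, reused for every string below.
NON_WORD = re.compile(r'\W+')

def special_runs(texts):
    """Map each text to the runs of non-word characters found in it."""
    return {t: NON_WORD.findall(t) for t in texts}

result = special_runs(['plain', 'a@b.com', 'x*&$y!'])
print(result)

# The same object also offers sub, split, search, and so on.
cleaned = NON_WORD.sub('_', 'a@b.com')
print(cleaned)  # a_b_com
```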

10. Capturing words with () and leaving out the whitespace

s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s([a-zA-Z]+)'
r = re.findall(pat,s)
print(r)
['module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']

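The first word is missing from the result because no whitespace precedes it. Adding ? after \s makes the whitespace optional (zero or one occurrence). Use ? with care: placed after another quantifier, it instead switches that quantifier to non-greedy matching.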

s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s?([a-zA-Z]+)' 
r = re.findall(pat,s)
print(r)
['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']
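An alternative is the word-boundary anchor \b, which needs no capture group. It is stricter than the \s? pattern: a run of letters that touches a digit is not treated as a word. A sketch comparing the two on a made-up string:

```python
import re

mixed = 'Perl5 is old'

# Optional whitespace, then letters: picks up the letter run inside "Perl5".
loose = re.findall(r'\s?([a-zA-Z]+)', mixed)
# Whole words only: "Perl5" is skipped because there is no boundary between "l" and "5".
strict = re.findall(r'\b[a-zA-Z]+\b', mixed)

print(loose)   # ['Perl', 'is', 'old']
print(strict)  # ['is', 'old']
```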

11. Tokenizing English text

## Method 1: re.split
s = 'This module provides regular expression matching operations similar to those found in Perl'
pat = r'\s+'
r = re.split(pat,s)
print(r)
['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']
## Method 2: the built-in str.split
res = s.split(' ')
print(res)
['This', 'module', 'provides', 'regular', 'expression', 'matching', 'operations', 'similar', 'to', 'those', 'found', 'in', 'Perl']

However, when the delimiters are more complex, str.split is not enough and a regular expression is needed.

s = 'This,,,  module ; \t  provides|| regular; '
words = re.split(r'[,\s;|]+',s)
print(f"words:{words}")
words = [i for i in words if len(i)>0]
print(f"words (empty strings removed):{words}")
words:['This', 'module', 'provides', 'regular', '']
words (empty strings removed):['This', 'module', 'provides', 'regular']
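Two related idioms avoid the empty strings altogether. One is to match the tokens instead of the delimiters with re.findall; the other, for plain whitespace, is str.split() with no argument. A sketch, with `spaced` as a made-up extra sample:

```python
import re

s = 'This,,,  module ; \t  provides|| regular; '

# Match runs of non-delimiter characters, so no empty strings are produced.
tokens = re.findall(r'[^,\s;|]+', s)
print(tokens)

spaced = 'This  module\tprovides'
by_space = spaced.split(' ')    # splits on every single space
by_whitespace = spaced.split()  # splits on runs of any whitespace
print(by_space)
print(by_whitespace)
```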

12. Matching from a given start position with match

match: the match must begin exactly at the given position (the start of the string by default)

import re

mystr = 'This'
pat = re.compile('hi')
r1 = pat.match(mystr)
print(f'r1:{r1}')
r2 = pat.match(mystr,1) # start matching at position 1
print(f'r2:{r2}')
r1:None
r2:<re.Match object; span=(1, 3), match='hi'>

search: scans the string and matches at any position

mystr = 'This'
pat = re.compile('hi')
r3 = pat.search(mystr)
print(f'r3:{r3}')
r3:<re.Match object; span=(1, 3), match='hi'>
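Two details are easy to miss here. Passing a start position is not the same as slicing: ^ still refers to the real start of the string. Compiled patterns also provide fullmatch with optional start and end positions. A small sketch:

```python
import re

mystr = 'This'
pat = re.compile('hi')
anchored = re.compile('^hi')

r_match = pat.match(mystr, 1)            # matches 'hi' at position 1
r_anchored = anchored.search(mystr, 1)   # None: ^ only matches at the real start
r_full = pat.fullmatch(mystr, 1, 3)      # the region mystr[1:3] is exactly 'hi'
r_full_fail = pat.fullmatch(mystr, 1)    # None: the region 'his' has an extra 's'

print(r_match, r_anchored, r_full, r_full_fail)
```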

13. Replacing matched substrings

The sub function replaces the matched substrings.

import re

content = "你好!我刚来中国不久,我想了解下你们公司都有哪些类型的保险?"
source = ['中国','保险']
target = ['China','Insurance']
for src, tgt in zip(source, target):
    pat = re.compile(src) # the part to be replaced
    content = pat.sub(tgt, content)
print(content)
你好!我刚来China不久,我想了解下你们公司都有哪些类型的Insurance?
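Replacing one pattern after another can interfere when a replacement text is itself a later search term. A common alternative is a single pass with an alternation pattern and a function as the replacement. The sketch below uses a made-up `mapping` that swaps two words, which sequential replacement could not do.

```python
import re

# Hypothetical mapping that swaps two words.
mapping = {'cat': 'dog', 'dog': 'cat'}

# One alternation pattern; re.escape keeps any special characters literal.
pat = re.compile('|'.join(re.escape(k) for k in mapping))

def replace_all(text):
    """Replace every key of mapping with its value in a single pass."""
    return pat.sub(lambda m: mapping[m.group(0)], text)

swapped = replace_all('cat chases dog')
print(swapped)  # dog chases cat
```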

14. Greedy and non-greedy capture

content='<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc'
pat1 = re.compile(r"<div>(.*)</div>")  # greedy mode (.*)
pat2 = re.compile(r"<div>(.*?)</div>") # non-greedy mode (.*?)
res1 = pat1.findall(content)
res2 = pat2.findall(content)
print(f"content:{content}")
print(f"Greedy capture result:{res1}")
print(f"Non-greedy capture result:{res2}")
content:<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc
Greedy capture result:['graph</div>bb<div>math']
Non-greedy capture result:['graph', 'math']
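One caveat applies to both modes: by default . does not match a newline, so content that spans several lines is skipped unless re.S (re.DOTALL) is set. A sketch with a made-up multi-line string:

```python
import re

content = '<div>graph\npaper</div><div>math</div>'

# Without re.S the first div cannot match, because . stops at the newline.
without_dotall = re.findall(r'<div>(.*?)</div>', content)
# With re.S the dot also matches newlines.
with_dotall = re.findall(r'<div>(.*?)</div>', content, re.S)

print(without_dotall)  # ['math']
print(with_dotall)     # ['graph\npaper', 'math']
```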

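15. Common metacharacters

[Figure: table of common metacharacters]

16. Common character classes

[Figure: table of common shorthand character classes]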
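A few of the most frequently used metacharacters and shorthand classes can be checked directly in the interpreter. The selection and the sample string below are illustrative assumptions, not a reproduction of the original tables.

```python
import re

sample = 'id_7 costs 3.50 USD'

digits = re.findall(r'\d', sample)       # \d: a single digit
word_runs = re.findall(r'\w+', sample)   # \w: letter, digit, or underscore
spaces = re.findall(r'\s', sample)       # \s: whitespace

# An unescaped . matches any character except a newline; \. matches a literal dot.
dot_any = re.findall(r'3.50', '3.50 3x50')
dot_literal = re.findall(r'3\.50', '3.50 3x50')

# $ anchors the match at the end of the string.
ends_with_usd = bool(re.search(r'USD$', sample))

print(digits, word_runs, spaces, dot_any, dot_literal, ends_with_usd)
```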

17. Password safety check

1) Password requirements: a) the password must be 6 to 20 characters long; b) it may contain only English letters and digits.

2) Use fullmatch, the safest choice, to check that the entire string matches.

pat = re.compile(r'\w{6,20}') # Wrong! \w matches letters, digits, and the underscore, but underscores are not allowed here
pat = re.compile(r'[\da-zA-Z]{6,20}')
res1 = pat.fullmatch('qaz12') # shorter than 6 characters
print(f"'res1':{res1}")
res2 = pat.fullmatch('qaz12wsxedcrfvtgb67890942234343434') # longer than 20 characters
print(f"'res2':{res2}")
res3 = pat.fullmatch('qaz_123') # contains an underscore
print(f"'res3':{res3}")
res4 = pat.fullmatch('n0passw0Rd') # meets the requirements
print(f"'res4':{res4}")
'res1':None
'res2':None
'res3':None
'res4':<re.Match object; span=(0, 10), match='n0passw0Rd'>
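For str patterns in Python 3, \d also matches decimal digits from other scripts (for example Arabic-Indic digits), which may be broader than "English letters and digits" intends. A stricter sketch spells out the ASCII ranges; the function name `is_valid_password` is an assumption for illustration.

```python
import re

# Explicit ASCII ranges: only 0-9, A-Z and a-z are accepted.
PASSWORD = re.compile(r'[0-9A-Za-z]{6,20}')

def is_valid_password(pw):
    """Return True if pw is 6 to 20 ASCII letters or digits."""
    return PASSWORD.fullmatch(pw) is not None

# 'abc' followed by three Arabic-Indic digits.
unicode_digits = 'abc\u0661\u0662\u0663'

print(is_valid_password('n0passw0Rd'))    # True
print(is_valid_password('qaz_123'))       # False
print(is_valid_password(unicode_digits))  # False
# The \d-based pattern accepts the non-ASCII digits:
print(re.fullmatch(r'[\da-zA-Z]{6,20}', unicode_digits) is not None)  # True
```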

18. Scraping the title of the Baidu homepage

import re 
from urllib import request

data = request.urlopen("http://www.baidu.com/").read().decode()

pat =  r'<title>(.*?)</title>'

result = re.search(pat,data)
print(result)
result.group()
<re.Match object; span=(1389, 1413), match='<title>百度一下,你就知道</title>'>
'<title>百度一下,你就知道</title>'
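group() with no argument returns the whole match, tags included. To get only the text captured by the parentheses, ask for group(1). The sketch below runs offline on a made-up HTML string rather than a live page.

```python
import re

# Hypothetical page source, used here so the example needs no network access.
html = '<html><head><title>Example Page</title></head><body></body></html>'

m = re.search(r'<title>(.*?)</title>', html, re.S)

# re.search returns None when the page has no <title> element.
whole = m.group() if m else None   # the full match, including the tags
title = m.group(1) if m else None  # only the captured text

print(whole)
print(title)
```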

19. Batch conversion to camelCase

import re

def camel(s):
    s = re.sub(r"(\s|_|-)+"," ",s).title().replace(" ","")
    return s[0].lower()+s[1:]

def batch_camel(slist):
    return [camel(s) for s in slist]

s = batch_camel(['student_id', 'student\tname', 'student-add'])
print(s)
['studentId', 'studentName', 'studentAdd']
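The camel function above indexes s[0], which raises IndexError for an empty string or a string made only of separators. A slightly more defensive sketch uses slicing instead; the name `camel_safe` is an assumption for illustration.

```python
import re

def camel_safe(s):
    """Like camel, but returns '' instead of failing on empty input."""
    s = re.sub(r"[\s_-]+", " ", s).title().replace(" ", "")
    # s[:1] is '' for an empty string, so no IndexError is raised.
    return s[:1].lower() + s[1:]

converted = [camel_safe(x) for x in ['student_id', 'student\tname', 'student-add', '', '__']]
print(converted)
```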