Python Hands-On Practice Check-In: Day 7 - CSDN Blog

Article link: https://blog.csdn.net/liang0502/article/details/125665656

match: matching from the start of the string

Note how match, search, and similar functions differ: 1) the match function

import re
### match
mystr = 'This'
pat = re.compile('hi')
pat.match(mystr) # None
pat.match(mystr,1) # start matching at position 1: <_sre.SRE_Match object; span=(1, 3), match='hi'>

2) the search function: search looks for a match starting at any position in the string

mystr='This'
pat=re.compile('hi')
pat.search(mystr) # <_sre.SRE_Match object; span=(1, 3), match='hi'>
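
To put the difference side by side, here is a minimal sketch that runs the same compiled pattern through match, search, and fullmatch. The variable names are only illustrative.

```python
import re

pat = re.compile('hi')

# match succeeds only if the pattern matches at the starting position.
m_match = pat.match('This')      # None: 'This' begins with 'T', not 'hi'
# search scans forward and returns the first position where the pattern matches.
m_search = pat.search('This')    # matches at span (1, 3)
# fullmatch requires the entire string to match the pattern.
m_full = pat.fullmatch('This')   # None: 'This' is not exactly 'hi'

print(m_match, m_search.span(), m_full)
```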

Replacing matched strings

The sub function replaces every substring that matches the pattern

content="hello 12345, hello 456321"
pat=re.compile(r'\d+') # the part to replace
m=pat.sub("666",content)
print(m) # hello 666, hello 666
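
Beyond replacing everything with a fixed string, sub also accepts a count limit and a callable replacement, and subn reports how many replacements happened. A short sketch on the same input (variable names are illustrative):

```python
import re

content = "hello 12345, hello 456321"
pat = re.compile(r'\d+')

# count=1 replaces only the first match.
first_only = pat.sub("666", content, count=1)

# A callable receives each match object and returns the replacement text;
# here every number is replaced by its number of digits.
lengths = pat.sub(lambda m: str(len(m.group())), content)

# subn returns the new string together with the number of replacements.
replaced, n = pat.subn("666", content)

print(first_only)
print(lengths)
print(replaced, n)
```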

Greedy capture

(.*) captures any number of characters and matches as many as it can

content='<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc'
pat=re.compile(r"<div>(.*)</div>") # greedy mode
m=pat.findall(content)
print(m) # result: ['graph</div>bb<div>math']

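Non-greedy capture

Adding a single question mark (?) gives a completely different result. This is non-greedy matching: it stops as soon as it can. Compare this example with the previous one to see how greedy and non-greedy matching differ.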
```
content='<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc'
pat=re.compile(r"<div>(.*?)</div>")
m=pat.findall(content)
print(m) # ['graph', 'math']
```
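
As a side-by-side sketch, the block below runs the greedy and non-greedy patterns on the same input and adds a third variant using a negated character class. That third pattern is not from the original text; it assumes the div contains no nested tags.

```python
import re

content = '<h>ddedadsad</h><div>graph</div>bb<div>math</div>cc'

# Greedy: .* runs to the last </div> it can reach.
greedy = re.findall(r'<div>(.*)</div>', content)
# Non-greedy: .*? stops at the first </div> it can reach.
lazy = re.findall(r'<div>(.*?)</div>', content)
# Negated class: [^<]* cannot cross a '<', so it also stops at the first
# closing tag (assumption: no nested tags inside the div).
negated = re.findall(r'<div>([^<]*)</div>', content)

print(greedy)
print(lazy)
print(negated)
```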
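Summary of common metacharacters
'''
`.` matches any character (except a newline by default)
`^` matches the start of the string
`$` matches the end of the string
`*` the preceding atom repeats 0, 1, or more times
`?` the preceding atom repeats 0 or 1 times
`+` the preceding atom repeats 1 or more times
`{n}` the preceding atom appears exactly n times
`{n,}` the preceding atom appears at least n times
`{n,m}` the preceding atom appears between n and m times
`( )` grouping: the part to capture and output
'''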
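
The quantifiers in this summary are easiest to check with fullmatch, which only succeeds when the whole string fits the pattern. The sample strings below are made up for illustration.

```python
import re

star = re.fullmatch(r'ab*', 'a') is not None            # True: zero b's is allowed
plus = re.fullmatch(r'ab+', 'a') is not None            # False: needs at least one b
optional = re.fullmatch(r'ab?', 'abb') is not None      # False: at most one b
exact = re.fullmatch(r'ab{2}', 'abb') is not None       # True: exactly two b's
at_least = re.fullmatch(r'ab{2,}', 'abbbb') is not None # True: two or more b's
between = re.fullmatch(r'ab{1,2}', 'abbb') is not None  # False: three b's is too many

# Parentheses capture the parts we want back.
groups = re.search(r'([0-9]+)-([0-9]+)', 'tel 010-1234').groups()

# ^ anchors the first alternative to the start, $ the second to the end.
edges = re.findall(r'^[a-z]+|[a-z]+$', 'first middle last')

print(star, plus, optional, exact, at_least, between)
print(groups, edges)
```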
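Summary of common character classes

'''
`\s` matches a whitespace character
`\w` matches any letter, digit, or underscore
`\W` the opposite of lowercase w: matches any character other than a letter, digit, or underscore
`\d` matches a decimal digit
`\D` matches anything other than a decimal digit
`[0-9]` matches one digit between 0 and 9
`[a-z]` matches a lowercase English letter
`[A-Z]` matches an uppercase English letter
'''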
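
A quick way to see what each class picks up is to run findall over one short string. The sample text here is invented for the demonstration.

```python
import re

text = 'Ab_1 c\t9!'

digits = re.findall(r'\d', text)       # decimal digits
non_digits = re.findall(r'\D', text)   # everything that is not a digit
word_chars = re.findall(r'\w', text)   # letters, digits, underscore
non_word = re.findall(r'\W', text)     # everything else
spaces = re.findall(r'\s', text)       # whitespace, including the tab
lower = re.findall(r'[a-z]', text)     # lowercase English letters
upper = re.findall(r'[A-Z]', text)     # uppercase English letters

print(digits, word_chars, non_word, spaces, lower, upper)
```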

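Password safety check

Password requirements: 1) the password must be 6 to 20 characters long; 2) the password may contain only English letters and digits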

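pat = re.compile(r'\w{6,20}') # wrong: \w matches letters, digits, and underscores, but the requirement rules out underscores
# The most reliable approach: [\da-zA-Z] satisfies `the password contains only English letters and digits`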
pat = re.compile(r'[\da-zA-Z]{6,20}')

Use fullmatch, the safest option, to check whether the entire string matches:

pat.fullmatch('qaz12') # returns None: shorter than 6
pat.fullmatch('qaz12wsxedcrfvtgb67890942234343434') # None: longer than 20
pat.fullmatch('qaz_231') # None: contains an underscore
pat.fullmatch('n0passw0Rd') # <_sre.SRE_Match object; span=(0, 10), match='n0passw0Rd'>
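
The check can be wrapped in a small helper. This is a sketch, not the original author's code: the name `is_valid_password` is hypothetical, and it swaps `\d` for `[0-9]` because in Python 3 str patterns `\d` also matches non-ASCII decimal digits such as fullwidth digits, which is probably not wanted for a password rule stated as "English letters and digits".

```python
import re

# [0-9] instead of \d keeps the check to ASCII digits only.
PASSWORD_PAT = re.compile(r'[0-9a-zA-Z]{6,20}')

def is_valid_password(pw: str) -> bool:
    """Return True if pw is 6-20 characters of ASCII letters and digits."""
    return PASSWORD_PAT.fullmatch(pw) is not None

fullwidth = '\uff11' * 6  # six FULLWIDTH DIGIT ONE characters

print(is_valid_password('qaz12'))        # too short
print(is_valid_password('qaz_231'))      # contains an underscore
print(is_valid_password('n0passw0Rd'))   # valid
print(is_valid_password(fullwidth))      # non-ASCII digits are rejected
# The \d-based pattern from the text accepts the fullwidth digits:
print(re.fullmatch(r'[\da-zA-Z]{6,20}', fullwidth) is not None)
```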

Scraping the title of the Baidu homepage

import re
from urllib import request
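# Fetch the content of the Baidu homepage
data=request.urlopen("http://www.baidu.com/").read().decode()
# Inspect the page and decide on the regular expression
pat=r'<title>(.*?)</title>'
result=re.search(pat,data)
print(result)
result.group() # the whole match, tags included; result.group(1) gives just the title text: 百度一下，你就知道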
'''
<_sre.SRE_Match object; span=(940, 964), match='<title>百度一下，你就知道</title>'>
'<title>百度一下，你就知道</title>'
'''
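
The same extraction can be tried without a network call. In this sketch the HTML string is a made-up stand-in for a downloaded page; it shows the difference between group() and group(1) and guards against a page with no title.

```python
import re

# Hypothetical page content standing in for the downloaded HTML.
html = '<html><head><title>Example Page</title></head><body>hi</body></html>'

result = re.search(r'<title>(.*?)</title>', html)

if result is not None:
    whole = result.group()    # the entire match, including the tags
    title = result.group(1)   # only the text captured by the parentheses
else:
    whole = title = None      # search returns None when nothing matches

missing = re.search(r'<title>(.*?)</title>', '<html></html>')

print(whole)
print(title)
print(missing)
```

If a title could span several lines, passing re.S as the flags argument lets `.` match newlines as well.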

Batch conversion to camel case

Batch-convert database field names to camel case

Analysis

'''
# Explanation of the regex pieces used
# \s matches: [ \t\n\r\f\v]
# A|B: matches either A or B
# re.sub(pattern, newchar, string):
# substitute: replaces every part of string that matches pattern with newchar.
# title(): capitalizes the first letter of each word, for example:
# 'Hello world'.title() # 'Hello World'
# print(re.sub(r"\s|_|", "", "He llo_worl\td"))
s = re.sub(r"(\s|_|-)+", " ",
    'some_database_field_name').title().replace(" ", "")
# Result: SomeDatabaseFieldName
# The first character is now uppercase and needs to be lowercased
s = s[0].lower()+s[1:] # final result
'''

Putting the analysis together gives the following code:

import re
def camel(s):
    s = re.sub(r"(\s|_|-)+", " ", s).title().replace(" ", "")
    return s[0].lower() + s[1:]
# Batch conversion
def batch_camel(slist):
    return [camel(s) for s in slist]
s = batch_camel(['student_id', 'student\tname', 'student-add'])
print(s) # ['studentId', 'studentName', 'studentAdd']
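
Two edge cases are worth knowing about with the code above: an empty string makes `s[0]` raise IndexError, and `title()` capitalizes a letter that follows a digit, so `'field_2nd'` becomes `'field2Nd'`. The variant below is a sketch of one way around both; `camel_safe` is a hypothetical name, and using split plus `capitalize()` is a swapped-in technique, not the original approach.

```python
import re

def camel_safe(s: str) -> str:
    """Camel-case s, splitting on runs of whitespace, '_' or '-'."""
    # Drop empty pieces produced by leading or trailing separators.
    parts = [p for p in re.split(r'[\s_-]+', s) if p]
    if not parts:
        return ''
    head, *tail = parts
    # capitalize() uppercases only the first character and lowercases the
    # rest, so '2nd' stays '2nd' instead of becoming '2Nd'.
    return head.lower() + ''.join(p.capitalize() for p in tail)

print([camel_safe(s) for s in ['student_id', 'student\tname', 'student-add']])
print(camel_safe('field_2nd'))
print(repr(camel_safe('')))
```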

Is str1 a permutation of str2?

Permutation: two strings contain the same characters, but in a different order.

from collections import defaultdict
def is_permutation(str1, str2):
    if str1 is None or str2 is None:
        return False
    if len(str1) != len(str2):
        return False
    unq_s1 = defaultdict(int)
    unq_s2 = defaultdict(int)
    for c1 in str1:
        unq_s1[c1] += 1
    for c2 in str2:
        unq_s2[c2] += 1
    return unq_s1 == unq_s2

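This small example uses Python's built-in defaultdict with int as the default factory, so every count starts at 0. At its core, the solution is a hash map lookup.

If the two resulting defaultdicts, unq_s1 and unq_s2, are equal, then str1 and str2 are permutations of each other.

Tests: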

r = is_permutation('nice', 'cine')
print(r) # True
r = is_permutation('', '')
print(r) # True
r = is_permutation('', None)
print(r) # False
r = is_permutation('work', 'woo')
print(r) # False
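
The same counting idea can be written more compactly with collections.Counter, which builds the character counts in one call. This is an alternative sketch; the function name `is_permutation_counter` is made up here.

```python
from collections import Counter

def is_permutation_counter(str1, str2):
    """Return True if str1 and str2 contain the same characters with the same counts."""
    if str1 is None or str2 is None:
        return False
    # Unequal lengths can never be permutations; checking first skips the counting.
    if len(str1) != len(str2):
        return False
    return Counter(str1) == Counter(str2)

print(is_permutation_counter('nice', 'cine'))
print(is_permutation_counter('', ''))
print(is_permutation_counter('', None))
print(is_permutation_counter('work', 'woo'))
```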

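Is str1 a rotation of str2?

Rotating stringbook gives bookstring. Write code that checks whether str1 can be obtained by rotating str2.

Idea:

Reduce the problem to checking whether str1 is a substring of str2+str2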

def is_rotation(s1: str, s2: str) -> bool:
    if s1 is None or s2 is None:
        return False
    if len(s1) != len(s2):
        return False
    def is_substring(s1: str, s2: str) -> bool:
        return s1 in s2
    return is_substring(s1, s2 + s2)
r = is_rotation('stringbook', 'bookstring')
print(r) # True
r = is_rotation('greatman', 'maneatgr')
print(r) # False
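
To see why the str2+str2 trick works, it helps to compare it against a brute-force check that tries every rotation of str2 explicitly. The sketch below is only for cross-checking; `is_rotation_brute` is a hypothetical name.

```python
def is_rotation_brute(s1, s2):
    """Return True if some rotation of s2 equals s1, trying each split point."""
    if s1 is None or s2 is None or len(s1) != len(s2):
        return False
    if not s2:
        # Two empty strings are trivially rotations of each other.
        return True
    # Rotating s2 at index i moves its first i characters to the end.
    return any(s2[i:] + s2[:i] == s1 for i in range(len(s2)))

print(is_rotation_brute('stringbook', 'bookstring'))
print(is_rotation_brute('greatman', 'maneatgr'))
```

Every rotation `s2[i:] + s2[:i]` appears as a length-len(s2) window inside `s2 + s2`, which is why the substring test gives the same answer.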

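Positive floats
From a collection of strings, pick out all the positive floating-point numbers.

How should we do it?

Let's play with regular expressions and solve it with a regex!

The key question is how to write the regular expression.

Got it!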

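`^[1-9]\d*\.\d*$`

`^` marks the start of the string

`[1-9]` matches one of the digits 1, 2, 3, 4, 5, 6, 7, 8, 9

`^[1-9]` together means the string starts with a digit from 1 to 9

`\d` matches a single digit from 0 to 9
`*` means the preceding character appears 0, 1, or more times
`\d*` means a digit appears 0, 1, or more times

`\.` matches the decimal point

`$` marks the end of the string, right after the preceding character

Together, `^[1-9]\d*\.\d*$` picks out all positive floats that are 1.0 or greater.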

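What about the positive floats between 0.0 and 1.0? Why not fold them straight into the regular expression above?

Wouldn't this work: `^[0-9]\d*\.\d*$`

OK!

Then let's test it right away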
```
import re
recom=re.compile(r'^[0-9]\d*\.\d*$')
recom.match('000.2') # <_sre.SRE_Match object; span=(0, 5), match='000.2'>
```
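The result shows that the regular expression `^[0-9]\d*\.\d*$` actually matches 000.2 and treats it as a positive float!

Oops!

So now you can see why we match the floats of 1.0 and above first!

Once you can write that regular expression, the other part is not hard!

Floats between 0.0 and 1.0: `^0\.\d*[1-9]\d*$`

Joining the two expressions gives the final result:

`^[1-9]\d*\.\d*|0\.\d*[1-9]\d*$`

Because `|` has the lowest precedence, `^` applies only to the first branch and `$` only to the second. Wrapping the branches in a group, `^(?:[1-9]\d*\.\d*|0\.\d*[1-9]\d*)$`, makes both anchors apply to both.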
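
Here is a small sketch that filters a list of strings with the combined expression. It uses a grouped form so that `^` and `$` apply to both branches; the grouping and the sample strings are additions for illustration.

```python
import re

# Both alternatives sit inside one non-capturing group so the anchors
# cover the whole alternation.
POS_FLOAT = re.compile(r'^(?:[1-9]\d*\.\d*|0\.\d*[1-9]\d*)$')

candidates = ['1.5', '12.0', '0.25', '000.2', '0.0', '-3.1', '7', 'abc', '1.5abc']
positives = [s for s in candidates if POS_FLOAT.match(s)]
print(positives)

# Without the group, the first branch has no end anchor, so a string with
# trailing junk still matches.
ungrouped = re.compile(r'^[1-9]\d*\.\d*|0\.\d*[1-9]\d*$')
loose = ungrouped.match('1.5abc') is not None
print(loose)
```

Note that the pattern requires a decimal point, so a plain integer such as 7 is not treated as a float.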

If it is still unclear, take a look at the regex breakdown diagram below