Common matching patterns
1. re.match()
re.match() tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.
Basic matching
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(result.group())
print(result.span())
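Because re.match() returns None when the pattern does not match at the start, calling .group() on a failed match raises AttributeError. The following is a minimal sketch of guarding against that case. The helper name `match_or_none` is hypothetical and only used for illustration.

```python
import re

def match_or_none(pattern, text):
    """Return the matched text, or None if the pattern does not match at the start."""
    result = re.match(pattern, text)
    return result.group() if result is not None else None

content = 'Hello 123 4567 World_This is a Regex Demo'
print(match_or_none(r'Hello\s\d+', content))  # Hello 123
print(match_or_none(r'World', content))       # None, 'World' is not at the start
```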
Generic matching
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match('Hello.*Demo',content)
print(result)
print(result.group())
Capturing a target
Capture 123 4567:
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+\s\d+)\sWorld.*Demo$', content)
print(result)
print(result.group(1))
Greedy matching
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))
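.* is greedy: it matches as many characters as possible. Here it consumes everything up to the last digit, so group(1) is only 7.
Non-greedy matching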
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+\s\d+).*Demo$', content)
print(result)
print(result.group(1))
.*? is non-greedy: it matches as few characters as possible.
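To make the difference concrete, the sketch below runs the same pattern twice on the sample string, once with .* and once with .*?, and compares what the capture group receives.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# Greedy: .* takes as much as it can and leaves a single digit for (\d+).
greedy = re.match(r'^He.*(\d+).*Demo$', content)
# Non-greedy: .*? stops as early as it can, so (\d+) gets the first number.
lazy = re.match(r'^He.*?(\d+).*Demo$', content)

greedy_digits = greedy.group(1)
lazy_digits = lazy.group(1)
print(greedy_digits)  # 7
print(lazy_digits)    # 123
```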
Match flags
import re
content = """Hello 123 4567 World_This
is a Regex Demo"""
result = re.match(r'^He.*?(\d+\s\d+).*?Demo$', content)
print(result)
By default . does not match a newline, so this match fails and result is None.
Specifying a flag:
import re
content = """Hello 123 4567 World_This
is a Regex Demo"""
result = re.match(r'^He.*?(\d+\s\d+).*?Demo$', content, re.S)
print(result)
print(result.group(1))
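re.S is the short name for re.DOTALL. As a small sketch on a two-line string (the sample text is made up for illustration), the same pattern fails without the flag and succeeds with it:

```python
import re

content = "line one\nline two"

# Without the flag, . stops at the newline and the match fails.
without_flag = re.match(r'line.*two', content)
# With re.DOTALL (alias re.S), . also matches the newline.
with_flag = re.match(r'line.*two', content, re.DOTALL)

print(without_flag)            # None
print(with_flag is not None)   # True
print(re.S is re.DOTALL)       # True
```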
Escaping
import re
content = "the price of shirt is $9.15"
result = re.match(r'the price of shirt is \$9\.15', content)
print(result)
print(result.group())
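When the literal text comes from a variable, escaping by hand is error-prone. One option, shown here as a sketch, is the standard library's re.escape(), which escapes the special characters for you. The variable name `literal` is only illustrative.

```python
import re

content = "the price of shirt is $9.15"
literal = "$9.15"

# Unescaped, $ means "end of string", so the pattern cannot match.
unescaped = re.search(literal, content)
# re.escape() backslash-escapes $ and . so they are matched literally.
escaped = re.search(re.escape(literal), content)

print(unescaped)        # None
print(escaped.group())  # $9.15
```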
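Summary: prefer generic matching and use parentheses to capture the target; prefer non-greedy mode; pass re.S when the text contains newlines.
2. re.search()
re.search() scans the whole string and returns the first successful match.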
import re
content = "the price of shirt is $9.15"
result = re.search('price',content)
print(result)
print(result.group())
import re
content = """<meta name="description" content="腾讯网从2003年创立至今,已经成为集新闻信息,区域垂直生活服务、社会化媒体资讯和产品为一体的互联网媒体平台。腾讯网下设新闻、科技、财经、娱乐、体育、汽车、时尚等多个频道,充分满足用户对不同类型资讯的需求。同时专注不同领域内容,打造精品栏目,并顺应技术发展趋势,推出网络直播等创新形式,改变了用户获取资讯的方式和习惯。" />"""
result = re.search(r'<meta.*?content="(.*?)"\s/>', content)
print(result)
print(result.group(1))
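The contrast with re.match() is easiest to see by running both on the same input. A minimal sketch:

```python
import re

content = "the price of shirt is $9.15"

# match() only tries position 0, where the text is 'the', so it fails.
matched = re.match(r'price', content)
# search() scans forward and finds 'price' further into the string.
searched = re.search(r'price', content)

print(matched)          # None
print(searched.span())  # (4, 9)
```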
3. re.findall()
re.findall() searches the string and returns all non-overlapping matches as a list.
import re
content = """<ul class="nav-main fl" bossexpo="bg_dh_1">
<li class="nav-item">
<a href="http://news.qq.com/" target="_blank" bosszone="dh_1">新闻</a>
</li>
<li class="nav-item">
<a href="http://v.qq.com/" target="_blank" bosszone="dh_2">视频</a>
</li>
<li class="nav-item">
<a href="http://new.qq.com/ch/photo/" target="_blank" bosszone="dh_3">图片</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/milite/" target="_blank" bosszone="dh_4">军事</a>
</li>
<li class="nav-item">
<a href="https://sports.qq.com/" target="_blank" bosszone="dh_5">体育</a>
</li>
<li class="nav-item">
<a href="http://sports.qq.com/nba/" target="_blank" bosszone="dh_6">NBA</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/ent/" target="_blank" bosszone="dh_7">娱乐</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/finance" target="_blank" bosszone="dh_8">财经</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/tech/" target="_blank" bosszone="dh_9">科技</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/fashion/" target="_blank" bosszone="dh_10">时尚</a>
</li>
<li class="nav-item">
<a href="http://auto.qq.com/" target="_blank" bosszone="dh_11">汽车</a>
</li>
<li class="nav-item">
<a href="http://house.qq.com/" target="_blank" bosszone="dh_12">房产</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/edu/" target="_blank" bosszone="dh_13">教育</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/cul/" target="_blank" bosszone="dh_14">文化</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/games/" target="_blank" bosszone="dh_15">游戏</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/astro/" target="_blank" bosszone="dh_16">星座</a>
</li>
</ul><!--124ab1f2c59361a8f083289f63e618ba--><!--[if !IE]>|xGv00|c8ad5e7a2a8e8bd6a70240bd0844a132<![endif]-->
<div class="nav-more fl">
<div class="more-txt" bosszone="dh_more">更多</div>
<div class="nav-sub" bossexpo="bg_dh_2">
<ul class="sub-list cf">
<li class="nav-item">
<a href="https://new.qq.com/ch/ori/" target="_blank" bosszone="dh_1_2">独家</a>
</li>
<li class="nav-item">
<a href="https://v.qq.com/tv/" target="_blank" bosszone="dh_2_2">热剧</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/antip/" target="_blank" bosszone="dh_3_2">抗肺炎</a>
</li>
<li class="nav-item">
<a href="http://new.qq.com/ch/history/" target="_blank" bosszone="dh_4_2">历史</a>
</li>
<li class="nav-item">
<a href="http://sports.qq.com/premierleague/" target="_blank" bosszone="dh_5_2">英超</a>
</li>
<li class="nav-item">
<a href="http://sports.qq.com/cba/" target="_blank" bosszone="dh_6_2">CBA</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch2/star" target="_blank" bosszone="dh_7_2">明星</a>
</li>
<li class="nav-item">
<a href="http://money.qq.com/" target="_blank" bosszone="dh_8_2">理财</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/5G/" target="_blank" bosszone="dh_9_2">5G</a>
</li>
<li class="nav-item">
<a href="http://health.qq.com/" target="_blank" bosszone="dh_10_2">健康</a>
</li>
<li class="nav-item">
<a href="http://auto.qq.com/" target="_blank" bosszone="dh_11_2">车型</a>
</li>
<li class="nav-item">
<a href="http://www.jia360.com" target="_blank" bosszone="dh_12_2">家居</a>
</li>
<li class="nav-item">
<a href="http://class.qq.com/" target="_blank" bosszone="dh_13_2">课程</a>
</li>
<li class="nav-item">
<a href="http://dajia.qq.com/" target="_blank" bosszone="dh_14_2">大家</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/comic/" target="_blank" bosszone="dh_15_2">动漫</a>
</li>
<li class="nav-item">
<a href="http://gongyi.qq.com/" target="_blank" bosszone="dh_16_2">公益</a>
</li>
<li class="nav-item">
<a href="http://tianqi.qq.com/index.htm" target="_blank" bosszone="dh_17_2">天气</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/politics/" target="_blank" bosszone="dh_18_2">政务</a>
</li>
<li class="nav-item">
<a href="https://v.qq.com/channel/variety" target="_blank" bosszone="dh_19_2">综艺</a>
</li>
<li class="nav-item">
<a href="http://news.qq.com/photon/photoex.htm" target="_blank" bosszone="dh_20_2">影展</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/world/" target="_blank" bosszone="dh_21_2">国际</a>
</li>
<li class="nav-item">
<a href="http://sports.qq.com/csocce/csl/" target="_blank" bosszone="dh_22_2">中超</a>
</li>
<li class="nav-item">
<a href="http://fans.sports.qq.com/#/" target="_blank" bosszone="dh_23_2">社区</a>
</li>
<li class="nav-item">
<a href="http://v.qq.com/movie/" target="_blank" bosszone="dh_24_2">电影</a>
</li>
<li class="nav-item">
<a href="http://stock.qq.com/" target="_blank" bosszone="dh_25_2">证券</a>
</li>
<li class="nav-item">
<a href="http://digi.tech.qq.com/" target="_blank" bosszone="dh_26_2">数码</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/baby/" target="_blank" bosszone="dh_27_2">育儿</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/visit/" target="_blank" bosszone="dh_28_2">旅游</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/life/" target="_blank" bosszone="dh_29_2">生活</a>
</li>
<li class="nav-item">
<a href="http://kid.qq.com/" target="_blank" bosszone="dh_30_2">儿童</a>
</li>
<li class="nav-item">
<a href="http://book.qq.com/" target="_blank" bosszone="dh_31_2">文学</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/omv/" target="_blank" bosszone="dh_32_2">享看</a>
</li>
<li class="nav-item">
<a href="https://new.qq.com/ch/cul_ru" target="_blank" bosszone="dh_33_2">新国风</a>
</li>
<li class="nav-item">
<a href="http://www.qq.com/map/" target="_blank" bosszone="dh_34_2">全部</a>
</li>
</ul>
"""
result = re.findall(r'<a\shref.*?>(.*?)</a>', content, re.S)
print(result)
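With one capture group, findall() returns a list of strings. With two or more groups it returns a list of tuples instead. The sketch below captures both the link and the link text. The two-item HTML snippet is a shortened, hypothetical stand-in for the navigation markup above.

```python
import re

# Hypothetical, shortened markup in the same shape as the navigation list.
html = '''<li><a href="http://news.qq.com/" target="_blank">News</a></li>
<li><a href="http://v.qq.com/" target="_blank">Video</a></li>'''

# Two groups: the href value and the text between <a ...> and </a>.
links = re.findall(r'<a\shref="(.*?)".*?>(.*?)</a>', html, re.S)

for url, text in links:
    print(text, url)
```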
4. re.sub()
re.sub() replaces every matching substring and returns the resulting string.
Removing digits:
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.sub(r'\d+', '', content)
print(result)
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.sub(r'\d+', 'replace', content)
print(result)
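The replacement string can also refer back to captured groups, and the count argument limits how many matches are replaced. A short sketch of both, using the same sample string:

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# \1 in the replacement inserts whatever group 1 captured.
wrapped = re.sub(r'(\d+)', r'[\1]', content)
# count=1 replaces only the first match.
first_only = re.sub(r'\d+', '', content, count=1)

print(wrapped)     # Hello [123] [4567] World_This is a Regex Demo
print(first_only)  # Hello  4567 World_This is a Regex Demo
```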
5. re.compile()
re.compile() compiles a regular expression into a pattern object so that the pattern can be reused.
import re
content = """Hello 123 4567
World_This is a Regex Demo
"""
pattern = re.compile('Hello.*Demo',re.S)
result = re.match(pattern,content)
print(result)
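A compiled pattern also exposes match(), search(), findall() and sub() as methods, which is convenient when the same pattern is applied to many strings. A minimal sketch, with made-up input lines:

```python
import re

# Compile once, then reuse the pattern object for every line.
pattern = re.compile(r'\d+')
lines = ['Hello 123 4567', 'no digits here', 'World 89']

found = [pattern.findall(line) for line in lines]
print(found)  # [['123', '4567'], [], ['89']]
```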
6. A small example
Scrape the 40 books in the "New Book Express" (新书速递) section of the Douban Books home page (link, author, title).
import requests
import re
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get('https://book.douban.com/', headers=headers)
content = r.text
print(r.status_code)
# print(content)
# Groups 1-3 capture the link, the title attribute and the author text;
# group 4 captures the text before </h4> in the more-meta block.
pattern = re.compile(r'<li.*?cover.*?href="(.*?)"\stitle="(.*?)".*?info.*?author">(.*?)</div>.*?more-meta.*?title">(.*?)</h4>.*?</li>', re.S)
result = re.findall(pattern, content)
for item in result:
    print(item[0])          # link
    print(item[1])          # title
    print(item[2].strip())  # author
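The live page can change or block requests, so it may help to try the extraction offline first. The sketch below applies a simplified version of the pattern to a hypothetical HTML fragment. Both the markup and the example.com URLs are assumptions for illustration and are not the real Douban page. Named groups are used here so the fields do not depend on tuple positions.

```python
import re

# Hypothetical markup: a simplified stand-in, not the real Douban page.
sample_html = """
<li>
  <div class="cover"><a href="https://example.com/book/1" title="Book One"></a></div>
  <div class="info"><div class="author">Author A</div></div>
</li>
<li>
  <div class="cover"><a href="https://example.com/book/2" title="Book Two"></a></div>
  <div class="info"><div class="author">
    Author B
  </div></div>
</li>
"""

# Simplified pattern with named groups; re.S lets . cross line breaks.
pattern = re.compile(
    r'<li.*?cover.*?href="(?P<link>.*?)"\stitle="(?P<title>.*?)"'
    r'.*?author">(?P<author>.*?)</div>.*?</li>',
    re.S,
)

books = [
    {
        "link": m.group("link"),
        "title": m.group("title"),
        "author": m.group("author").strip(),  # drop surrounding whitespace
    }
    for m in pattern.finditer(sample_html)
]

for book in books:
    print(book["link"], book["title"], book["author"])
```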