正则表达式的进阶用法——预查与分组_正向肯定预查不支持非定长-CSDN博客

本文链接：https://blog.csdn.net/weixin_45901207/article/details/106631955

文章目录

预查
分组

昨天刚发现正则表达式的分组用法，故在此记录。

预查

正向预查：`?=`, `?!`

?=: 检测包含此结尾的内容，但不捕获。

例：w+(?=\.com)，检测以.com结尾的字符串，但返回结果中不包含.com.

*注：需引入re库，后面不再赘述。

print(re.findall(r'\w+(?=\.com)', 'baidu.com google.com csdn.net'))

输入结果：

['baidu', 'google']

?!: 检测不包含此结尾的内容，但不捕获。

例：Windows(?!95|98)，检测不以95和98结尾的"Windows"，返回结果中不包含Windows之后的内容。

print(re.match(r'Windows(?!95|98|NT|2000)', 'Windows95'))
print(re.match(r'Windows(?!95|98|NT|2000)', 'Windows10').group)

输出：

None
Windows

负向预查：`?<=`, `?<!`/`?<!=`

?<=: 检测包含此开头的内容，但不捕获。

例：(?<=www\.)\w+，检测以www.开头的字符串，但返回结果中不包含开头。

print(re.findall(r'(?<=www\.)\w+', 'www.github.com cn.vuejs.org www.baidu.com'))

输出结果：

['github', 'baidu']

?<!和?<!=: 等价，检测不包含此开头的内容，但不捕获。此处不再举例。

练习：网页小爬虫

需求：爬取douban.com中所有用div标签包裹的内容

import requests
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
url = 'https://www.douban.com/'
http = requests.get(url, headers=headers)
reg = r'(?<=<div[^>]*>)\n*\w+\n*(?=</div>)'
result_list = re.findall(reg, http.text)
for result in result_list:
    print(result)

我们自然而然的想到左右用<?<=div...(任意字符)>...<?=/div>，然而结果却是：

re.error: look-behind requires fixed-width pattern

原来负向预查并不支持不定长字符串，我们需要找到更好的办法，不过现在我们可以先尝试不排除div标签，直接打印：

修改表达式为：reg = r'<div[^>]*>[^<]*</div>'

输出结果：

<div id="dale_anonymous_homepage_top_for_crazy_ad"></div>
<div id="dale_anonymous_homepage_right_top"></div>
<div id="dale_homepage_online_activity_promo_1"></div>
<div id="dale_anonymous_homepage_doublemint"></div>
<div class="side"></div>
<div id="dale_anonymous_homepage_movie_bottom" class="extra"></div>
<div class="author">〔日〕多利安助...</div>
<div class="author">〔日〕伊坂幸太...</div>
<div class="author">〔日〕池井户润...</div>
<div class="author">吴沚默</div>
<div class="author"></div>
...
<div class="title">你听过《东京爱情故事》吗？</div>
<div id="dale_anonymous_home_page_middle_2" class="extra"></div>
<div class="market-topic-pic"
            style="background-image:url(https://img3.doubanio.com/img/files/file-1513305186-3.jpg)">
          </div>
<div class="market-spu-pic"
            style="background-image: url(https://img3.doubanio.com/img/files/file-1546855945-0.jpg)">
          </div>
<div class="market-spu-pic"
            style="background-image: url(https://img3.doubanio.com/img/files/file-1545819571-0.jpg)">
          </div>
<div class="market-spu-pic"
            style="background-image: url(https://img9.doubanio.com/img/files/file-1513305186-4.jpg)">
          </div>
<div class="follow">
          3人关注
        </div>
<div class="datetime">
            4月4日 周六 19:30 - 21:30
        </div>
<div class="follow">
          1人关注
        </div>
<div class="datetime">
            12月21日 周六 - 4月12日 周日
        </div>
<div class="follow">
          5人关注
        </div>
<div id="dale_anonymous_home_page_bottom" class="extra"></div>

结果返回但非常杂乱，这就是我们之后需要改进的。

分组

普通分组

使用圆括号()表示分组。

这里介绍一下re.match()函数，这个函数会根据正则表达式从开头向后匹配，返回第一个符合的结果 (开头必须符合) 。它的返回值是一个object，object有group()方法和groups()方法。

group(): 返回正常匹配结果
groups(): 返回分组内容

print(re.match('(\w+)\.\w+\.(\w+)', 'www.baidu.com').group())
print(re.match('(\w+)\.\w+\.(\w+)', 'www.baidu.com').groups())

这里使用圆括号将网页URL的前缀作为一组，后缀作为一组，因此groups()会返回网页前后缀字符串。

输出结果：

www.baidu.com
('www', 'com')

命名分组

命名分组，顾名思义，可以给每个分组命名，返回值将以字典形式表示。

语法：?P<分组名>

例：

result = re.match('(?P<head>\w+)\.\w+\.(?P<tail>\w+)', 'www.baidu.com')
print(result['head'], result['tail'])

用head，tail命名网页前后缀，最终可通过访问字典的方式访问match返回值。

练习完善

之前的爬虫得到结果，却能杂乱，有个分组的知识我们就可以做出改进。

import requests
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
url = 'https://www.douban.com/'
http = requests.get(url, headers=headers)
reg = r'<div[^>]*>([^<]*)</div>'
result_list = re.findall(reg, http.text)
index = 1
for result in result_list:
    if str(result).strip():
        print(str(index) + ". " + str(result).strip())
        index += 1

只做了少许修改：

为reg标签包裹的内容加上括号（findall函数会以列表形式返回分组的所有内容，如果没有分组就返回匹配结果）
对匹配结果使用strip()方法去除多余的空字符
排除全空的结果，并为结果加上索引

输出结果：

1. 吃屎不忘拉屎人的日记
2. 〔日〕多利安助...
3. 〔日〕伊坂幸太...
4. 〔日〕池井户润...
5. 吴沚默
6. 免费
7. 免费
8. 免费
9. 免费
10. 「旅行」我想去精灵旅社度个假
11. 欧美丨在Uptown听Funk修个椰子皮
12. 日本民谣：我的歌，是用时间...
13. 「复古」音乐和情绪都不会过时
14. 给我一段音乐，推开看得见风...
15. 你听过《东京爱情故事》吗？
16. 3月25日 周三 19:30 - 21:30
17. 3人关注
18. 3月18日 周三 19:30 - 21:30
19. 3人关注
20. 4月4日 周六 19:30 - 21:30
21. 1人关注
22. 12月21日 周六 - 4月12日 周日
23. 5人关注

这次返回的结果就非常成功，简单利索。