A regular expression for crawling page links fails to exclude "."

阿智智

Category: Python. Tags: regular expression matching, crawler, link matching

Article link: https://blog.csdn.net/RobertChenGuangzhi/article/details/108020768

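Problem description

I want to crawl the links on a wiki article page and make sure that they point to other article pages. To do this I use a regular expression; the code is as follows: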

# webCrawler.py
# date: 2020-08-15

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Because Wikipedia cannot be opened, we use the following website
# as an alternative.
html = urlopen('https://encyclopedia.thefreedictionary.com/Kevin+Bacon')
bs = BeautifulSoup(html, 'html.parser')
# Intended filter: hrefs that contain '+' and no '.'; (?!\.) is meant to reject dots.
for link in bs.find_all('a', href=re.compile(r'((?!\.).)*\+.*')):
    if 'href' in link.attrs:
        print(link['href'])

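The code runs, but the results still include links that begin with //. The (?!\.) in the code above is meant to keep "." out of the matched links, yet links containing "." still show up.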
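
The symptom can be reproduced without the network. The following is a minimal sketch, assuming that Beautiful Soup tests a compiled pattern against an attribute value with the pattern's search() method. The href values and the helper name matched_part are made up for illustration; they are not taken from the real page.

```python
import re

# The original, unanchored pattern from the crawler above.
pattern = re.compile(r'((?!\.).)*\+.*')

# Hypothetical href values, made up for illustration.
sample_hrefs = [
    '/Kevin+Bacon',
    '//www.thefreedictionary.com/Kevin+Bacon',
    '/about.htm',
]

def matched_part(href):
    """Return the substring that pattern.search() matched, or None."""
    m = pattern.search(href)
    return m.group(0) if m else None

# Print which part of each href the pattern actually matched.
for href in sample_hrefs:
    print(repr(href), '->', repr(matched_part(href)))
```

For the // link the reported match is only the tail of the string that follows the last ".", which suggests the pattern is being satisfied by a substring rather than by the whole href.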

Solution

I worked this out by trial and error, changing the regular expression to:

re.compile(r'^((?!\.).)*\+.*$')

This gives the correct result: the links that begin with // are filtered out.

Question

Can anyone tell me why?
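
A likely explanation, offered as a sketch rather than a definitive answer: a search-style match may start at any position in the string, so without ^ the pattern can begin after the last "." and still find a "+". With ^ the match has to start at the first character, so every character before the "+" must pass the (?!\.) check. The example below compares the two patterns with re's search() on made-up href values; the helper name keep and the sample list are assumptions for illustration.

```python
import re

unanchored = re.compile(r'((?!\.).)*\+.*')   # original pattern
anchored = re.compile(r'^((?!\.).)*\+.*$')   # pattern from the solution

# Hypothetical href values, made up for illustration.
sample_hrefs = [
    '/Kevin+Bacon',
    '//www.thefreedictionary.com/Kevin+Bacon',
    '/Footloose+(1984+film)',
    '/about.htm',
]

def keep(pattern, hrefs):
    """Keep the hrefs for which pattern.search() finds a match anywhere."""
    return [h for h in hrefs if pattern.search(h)]

print('unanchored:', keep(unanchored, sample_hrefs))
print('anchored:  ', keep(anchored, sample_hrefs))
```

Under this reading, the ^ is what does the filtering. For hrefs without newlines the trailing $ changes nothing, because .* already runs to the end of the string. Note also that both patterns only constrain the text before the "+"; a "." after the "+" is still accepted.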
