Writing a Web Crawler in Python from Scratch (4): Writing a Link Crawler with Regular Expressions

Latest recommended article published on 2021-01-30 01:09:01

weixin_34081595

Tags: web crawler, python

Original link: http://www.cnblogs.com/mrruning/p/7638523.html

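In the earlier parts of this series we wrote two basic crawlers. For sites with a lot of content, however, we need to follow links, using a regular expression to decide which pages to download.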
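To make the idea concrete, here is a minimal sketch of filtering candidate links with a regular expression. The list of links is made up for illustration; the pattern `/(index|view)` is the one passed to the crawler at the end of this post. Note that `re.match` only matches at the start of the string, so a link passes only if it begins with `/index` or `/view`.

```python
import re

# Hypothetical links, as they might appear in a page's href attributes.
candidate_links = [
    '/index/1',
    '/view/Afghanistan-1',
    '/user/login',
    'http://example.com/index',
]

link_regex = '/(index|view)'

# re.match anchors at the beginning of the string, so the absolute URL
# above is rejected even though it contains "/index".
to_download = [link for link in candidate_links if re.match(link_regex, link)]
print(to_download)
```

If links may be absolute as well as relative, a pattern such as `.*/(index|view)` or `re.search` would be one way to loosen the filter.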
1. Downloading links with a regular expression. The urlparse module is used to convert relative paths into absolute paths, through a

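import re
try:
    import urlparse  # Python 2
except ImportError:
    import urllib.parse as urlparse  # Python 3: urljoin lives in urllib.parse

# download(url) is assumed to be the helper written in the earlier parts of this series.

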
def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]  # the queue of URLs to download
    seen = set(crawl_queue)  # URLs already queued, so no page is crawled twice
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                # convert the relative link into an absolute URL
                link = urlparse.urljoin(seed_url, link)
                if link not in seen:
                    # add this link to the crawl queue
                    seen.add(link)
                    crawl_queue.append(link)


def get_links(html):
    """Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '/(index|view)')
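
The crawl loop can be tried without touching the network. The sketch below is a Python 3 variant and rests on several assumptions: `FAKE_SITE`, `fake_download` and the `example.test` URLs are hypothetical stand-ins for a real site and for the `download` function from the earlier parts, and `crawl` returns the list of visited pages only so the result can be inspected. It shows two things a link crawler needs: resolving relative links with `urljoin`, and remembering URLs already queued so that pages linking to each other do not cause an endless loop.

```python
import re
from urllib.parse import urljoin  # Python 3 location (Python 2: urlparse.urljoin)

# Hypothetical in-memory site so the sketch runs without network access.
FAKE_SITE = {
    'http://example.test/':
        '<a href="/index/1">next</a> <a href="/user/login">login</a>',
    'http://example.test/index/1':
        '<a href="/view/A-1">A</a> <a href="/index/1">self</a>',
    'http://example.test/view/A-1':
        '<a href="/">home</a>',
}


def fake_download(url):
    """Stand-in for download(): return the page HTML, or '' if unknown."""
    return FAKE_SITE.get(url, '')


def get_links(html):
    """Return a list of href values found in <a> tags."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)


def crawl(seed_url, link_regex):
    """Follow links matching link_regex; return pages in visiting order."""
    crawl_queue = [seed_url]
    seen = set(crawl_queue)  # URLs already queued
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        visited.append(url)
        for link in get_links(fake_download(url)):
            if re.match(link_regex, link):
                absolute = urljoin(seed_url, link)  # relative -> absolute
                if absolute not in seen:
                    seen.add(absolute)
                    crawl_queue.append(absolute)
    return visited


visited = crawl('http://example.test/', '/(index|view)')
print(visited)
```

Each of the three pages is visited exactly once: the self-link on `/index/1` is skipped because it is already in `seen`, and `/user/login` and `/` are skipped because they do not match the pattern.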