Python Crawlers: Breadth-First and Depth-First

鬼子口音

Column: 左手Python右手Go. Tags: python, redis

Article link: https://blog.csdn.net/weixin_40287356/article/details/103981912

Breadth-First and Depth-First

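Whether you crawl breadth-first or depth-first, you first need to define a maximum crawl depth, crawl_deepth. Depth-first is the easier of the two to implement: explicit recursion keeps track of the crawl level for you.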

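Breadth-first means crawling every link on the current level before moving on to the next depth. That calls for several queues, typically a to-crawl queue, a crawled queue, a newly-added queue, and a pop queue. The key point is that the to-crawl queue for a given depth must be completely drained before the links for the next level are fetched. A while 1 loop is one rough way to do this.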

Alternatively, write a pop method, keep popping until the queue is empty, and then move on.
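The level-by-level idea can be tried without any network access. The following is a minimal sketch, assuming a small hard-coded link graph (LINKS and crawl_bfs are hypothetical names used only for illustration) in place of real pages. It drains the to-crawl queue for one depth before touching the links collected for the next.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": [],
    "F": [],
}


def crawl_bfs(seeds, crawl_deepth):
    """Visit pages level by level; return (page, depth) pairs in visit order."""
    pending = deque(seeds)   # to-crawl queue for the current level
    seen = set(seeds)        # everything already queued or crawled
    visited = []             # crawled pages together with their depth
    depth = 1
    while pending and depth <= crawl_deepth:
        next_level = deque()  # newly discovered links, kept for the next depth
        while pending:        # drain the current level completely first
            page = pending.popleft()
            visited.append((page, depth))
            for link in LINKS.get(page, []):
                if link not in seen:
                    seen.add(link)
                    next_level.append(link)
        pending = next_level
        depth += 1
    return visited


order = crawl_bfs(["A"], crawl_deepth=3)
print(order)
```

With a depth limit of 3, page F is discovered but never crawled, because it belongs to the fourth level.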

Here is a simple example for reference. It lacks a real parsing function, but the approach works.

Defining the Redis Queue

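import redis


class RedisQueue(object):
    def __init__(self, name, namespace='queue', **redis_kwargs):
        # Defaults point at a local Redis server; redis_kwargs can override them
        options = {'host': '127.0.0.1', 'port': 6379, 'db': 0, 'password': None}
        options.update(redis_kwargs)
        self.__db = redis.Redis(**options)
        self.key = f"{namespace}:{name}"

    # Return the queue size
    def qsize(self):
        return self.__db.llen(self.key)

    # Check whether the queue is exhausted
    def empty(self):
        return self.qsize() == 0

    # rpush here, paired with lpop in get, gives first-in, first-out order
    def put(self, item):
        self.__db.rpush(self.key, item)

    # Pop an item; with block=True, wait until one is available
    def get(self, block=True, timeout=None):
        if block:
            # blpop returns a (key, value) tuple, or None on timeout
            item = self.__db.blpop(self.key, timeout=timeout)
            if item:
                item = item[1]
        else:
            item = self.__db.lpop(self.key)
        return item

    def get_nowait(self):
        return self.get(False)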
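For experimenting without a running Redis server, an in-process stand-in with the same method names can be handy. This is only a sketch: MemoryQueue is a hypothetical helper, not part of redis-py, and it covers just the non-blocking calls. It stores bytes so that callers can still call .decode() on what they pop, as they would with a Redis reply.

```python
from collections import deque


class MemoryQueue:
    """In-process stand-in with the same method names as RedisQueue (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._items = deque()

    def qsize(self):
        return len(self._items)

    def empty(self):
        return self.qsize() == 0

    def put(self, item):
        # Redis hands back bytes, so encode strings to mimic an lpop reply.
        self._items.append(item.encode() if isinstance(item, str) else item)

    def get_nowait(self):
        # Like lpop: return None instead of raising when the queue is empty.
        return self._items.popleft() if self._items else None


q = MemoryQueue("unvistedQueue")
q.put("http://example.com/a")
q.put("http://example.com/b")
first = q.get_nowait().decode()
print(first, q.qsize())
```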

Breadth-First
Here is the whole package.

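import re

import requests
from bs4 import BeautifulSoup


class MyCrawler:

    def __init__(self):

        # Initialize the current crawl depth
        self.current_deepth = 1

        # URL queues: already visited and not yet visited
        self.vistedQueue = RedisQueue("vistedQueue")
        self.unvistedQueue = RedisQueue("unvistedQueue")

    # Add one seed URL, or a list of seed URLs, to the unvisited queue
    def put_unvistedQueue(self, seeds):
        if isinstance(seeds, str):
            self.unvistedQueue.put(seeds)
        elif isinstance(seeds, list):
            for seed in seeds:
                self.unvistedQueue.put(seed)
        print("Added to the unvisited queue")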

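    # Main function
    def crawling(self, crawl_deepth):
        visited = set()  # in-memory record so that no URL is fetched twice

        # Crawl at most crawl_deepth levels
        while self.current_deepth <= crawl_deepth and not self.unvistedQueue.empty():

            # Drain only the URLs queued for this level: breadth first, then depth
            for _ in range(self.unvistedQueue.qsize()):
                # Dequeue
                item = self.unvistedQueue.get_nowait()
                if item is None:
                    break
                visitUrl = item.decode()
                if visitUrl in visited:
                    continue
                visited.add(visitUrl)
                print(f"Dequeued url {visitUrl}")

                # Get the hyperlinks
                links = self.getHyperLinks(visitUrl)
                print(f"Links on page: {len(links)}")

                # Record the url as visited
                self.vistedQueue.put(visitUrl)
                print("Current depth: " + str(self.current_deepth))

                # Enqueue the links for the next level
                for link in links:
                    self.unvistedQueue.put(link)

            # The whole level is finished, so increase the depth by 1
            self.current_deepth += 1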

    # Extract hyperlinks from the page source
    def getHyperLinks(self, url):
        links = []
        data = self.getPageSource(url)
        if data:
            soup = BeautifulSoup(data, "html.parser")
            a = soup.find_all("a", {"href": re.compile('^http|^/')})
            for i in a:
                # Keep absolute links only; relative ones (starting with /) are skipped
                if i["href"].startswith(("http://", "https://")):
                    links.append(i["href"])
        return links

    # Fetch the HTML
    def getPageSource(self, url, headers=None, timeout=15):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            if response.status_code in [200, 201]:
                response.encoding = response.apparent_encoding
                return response.text
        except requests.RequestException:
            pass
        return None

Run the main function

if __name__ == "__main__":

    # Crawl to a depth of 10
    c = MyCrawler()
    c.put_unvistedQueue(["http://www.baidu.com", "http://www.google.com"])
    c.crawling(10)
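The crawler above queues only absolute links, so hrefs such as /about never get followed. One possible way to include them is to resolve every href against the page URL first. The sketch below uses only the standard library (html.parser and urllib.parse) rather than BeautifulSoup so that it runs on its own. LinkCollector, extract_links, and the sample HTML are made up for illustration and are not part of the original crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links in document order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative hrefs, then drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


sample = '''
<a href="/about">About</a>
<a href="https://example.org/x#top">X</a>
<a href="news/today.html">Today</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="/about">About again</a>
'''
links = extract_links("http://example.com/index.html", sample)
print(links)
```

The mailto link is filtered out and the repeated /about link is kept only once.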

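I use Redis for the unvisited and visited queues here. For a single process, an in-memory collections.deque for the pending URLs plus a built-in set for the visited ones would do the same job.
Because items go in with rpush (on the right) and come out with lpop (on the left), the queue is first-in, first-out, which is what produces the breadth-before-depth crawl order.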
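The effect of the pop side is easy to see on a toy graph. In this sketch, appending on the right plays the role of rpush, popleft plays the role of lpop, and pop takes from the same end that was pushed. LINKS and traverse are hypothetical names, and the graph is invented for the demonstration.

```python
from collections import deque

# Hypothetical link graph: page -> links found on it.
LINKS = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}


def traverse(start, pop_from_left):
    """Walk the graph; only the end we pop from differs between the two modes."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        page = frontier.popleft() if pop_from_left else frontier.pop()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)  # always push on the right, like rpush
    return order


fifo_order = traverse("A", pop_from_left=True)   # push right, pop left
lifo_order = traverse("A", pop_from_left=False)  # push right, pop right
print(fifo_order)
print(lifo_order)
```

Popping from the left visits A, then B and C, then their children, while popping from the right dives down one branch before returning to the other.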

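Depth-First
This one is a bit easier to write than breadth-first: crawl recursively, and the only thing you need to define is the depth.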

import re
import time

import requests

exist_urls = set()  # full URLs that have already been requested
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
}


def get_link(url):
    # Mark the url as requested up front, even if the request fails
    exist_urls.add(url)
    try:
        response = requests.get(url=url, headers=headers, timeout=15)
        response.encoding = 'UTF-8'
        html = response.text
        link_lists = re.findall('.*?<a target=_blank href="/item/([^:#=<>]*?)".*?</a>', html)
        return link_lists
    except requests.RequestException as e:
        print(f"Request failed: {url} ({e})")
        return []


# While the depth is below 10, call main recursively on each newly found link
def main(start_url, depth=1):
    for name in set(get_link(start_url)):
        unique_url = 'https://baike.baidu.com/item/' + name
        # Compare full URLs, and check at visit time because the recursion keeps adding to exist_urls
        if unique_url in exist_urls:
            continue
        output = 'Depth:' + str(depth) + '\t' + start_url + '======>' + unique_url + '\n'
        print(output)

        # The with block closes the file, so no explicit close() is needed
        with open('url.txt', 'a+') as f:
            f.write(unique_url + '\n')
        if depth < 10:
            main(unique_url, depth + 1)

Run the main function

if __name__ == '__main__':
    t1 = time.time()
    start_url = 'https://baike.baidu.com/item/%E7%99%BE%E5%BA%A6%E7%99%BE%E7%A7%91'
    main(start_url)
    t2 = time.time()
    print('Total time', t2 - t1)
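To watch the recursion order without hitting a real site, the same depth-limited recursion can be run over a hard-coded link graph. This is a sketch under that assumption: LINKS and crawl_dfs are hypothetical names, and the seen set plays the role of exist_urls.

```python
# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": [],
    "F": [],
}


def crawl_dfs(page, max_depth, depth=1, seen=None, trail=None):
    """Follow each link as deep as allowed before trying its siblings."""
    seen = set() if seen is None else seen
    trail = [] if trail is None else trail
    seen.add(page)
    trail.append((page, depth))
    if depth < max_depth:  # stop recursing once the depth limit is reached
        for link in LINKS.get(page, []):
            if link not in seen:
                crawl_dfs(link, max_depth, depth + 1, seen, trail)
    return trail


trail = crawl_dfs("A", max_depth=3)
print(trail)
```

Page D is reached through B before C is visited at all, which is the signature of depth-first order. F sits at depth 4 and is cut off by the limit.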

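The examples above implement depth-first and breadth-first crawling. Neither is complicated, and both are fairly easy to follow.
They are meant to illustrate the idea and are for reference only.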

Reposts are welcome, but credit the source, or I will follow the network cable to your end and throw a punch.
Personal tech blog: http://www.gzky.live
