Python Crawlers: The Scrapy Framework (Part 3)

ForsetiRe

Tags: python, distributed

Article link: https://blog.csdn.net/ForsetiRe/article/details/107567903

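This article covers how the Scrapy framework is used in Python crawlers, including the queue-based implementation of a breadth-first crawler and how distributed crawling improves efficiency. The distributed section details a single-machine version and a multi-machine version, both of which use Redis for data storage and coordination, so that a crawl can resume from where it stopped and several machines can work together.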

The Scrapy Framework

1. Crawler queue

Crawlers come in two kinds: breadth-first and depth-first.

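A breadth-first crawler stores its URLs in a queue. When we hand it a seed URL, it puts that URL into the queue, then takes out the URL that was enqueued earliest, downloads and parses that page, and puts every URL it finds back into the queue. The loop repeats until the queue is empty.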
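The first-in, first-out order described above can be illustrated without touching the network. The following is a minimal sketch, not the crawler from this article: `LINK_GRAPH`, `extract_links`, and `breadth_first_crawl` are hypothetical names, and the "pages" are just entries in a dictionary. It also assumes a `seen` set, which the description above does not mention, so that pages linking back to each other do not keep the loop running forever.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
LINK_GRAPH = {
    "/index": ["/post/1", "/post/2"],
    "/post/1": ["/post/2", "/post/3"],
    "/post/2": ["/index"],
    "/post/3": [],
}


def extract_links(url):
    """Stand-in for downloading and parsing a page; returns its outgoing links."""
    return LINK_GRAPH.get(url, [])


def breadth_first_crawl(seed_url):
    """Visit URLs in first-in, first-out order and return the visit order."""
    pending = deque([seed_url])  # the crawl queue
    seen = {seed_url}            # assumption: skip URLs that were already enqueued
    visited = []
    while pending:
        url = pending.popleft()  # take the URL that was enqueued earliest
        visited.append(url)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)  # newly found URLs go to the back of the queue
    return visited


crawl_order = breadth_first_crawl("/index")
print(crawl_order)
```

Because new links always go to the back of the queue, every page one hop from the seed is visited before any page two hops away. Swapping `popleft()` for `pop()` would turn the queue into a stack and give a depth-first order instead.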

Consider the following crawler:

from queue import Queue
import requests
import lxml.html


class DownloadItem:
    """
    A URL waiting to be downloaded.
    """
    def __init__(self, url_str, url_type):
        """
        Initialize the item.
        :param url_str: the URL
        :param url_type: the URL type: 0 = index page, 1 = detail page
        """
        self.url = url_str
        self.type = url_type


download_queue = Queue()

seed_item = DownloadItem("https://tieba.baidu.com/f?fr=ala0&kw=python&tpl=5", 0)
download_queue.put(seed_item)

while not download_queue.empty():
    # Take the next item from the queue
    download_item = download_queue.get()
    if download_item.type == 0:
        # Fetch the page
        result = requests.get(download_item.url)
        # Parse the HTML
        parser = lxml.html.fromstring(result.text)

        posts = parser.xpath(<