Scrapy Quick Start

乐百川

Article link: https://blog.csdn.net/u011054333/article/details/70165401

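This article covers installing Scrapy, getting a spider running quickly, setting start URLs, extracting data, writing spiders, following links between pages, and using the scrapy commands. Worked examples show how to create and run a Scrapy spider and how to deal with encoding issues.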

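Installing Scrapy

Scrapy is a high-level Python crawling framework. Beyond the crawling itself, it makes it easy to save the scraped data to files in formats such as CSV and JSON.

First, install Scrapy.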

pip install scrapy

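On Windows, the installation may fail with an error saying that Microsoft Visual C++ cannot be found. In that case, download the VC++ 14 compiler from the visual-cpp-build-tools site named in the message, install it, and run the command again; Scrapy should then install successfully.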

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
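
Once the install command finishes, it can be useful to confirm from Python that the package is actually importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of Scrapy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    # find_spec returns None when no such top-level module is available.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "scrapy" is the import name of the package installed by `pip install scrapy`.
    print("scrapy installed:", is_installed("scrapy"))
```

If this reports False right after a seemingly successful install, one likely cause is that pip and the Python interpreter in use belong to different environments.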

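Quick Start

Your First Spider

Below is the first spider example from the official documentation. Unlike fetching pages by hand with the requests library and parsing them with BeautifulSoup, Scrapy provides a spider base class. We only need to override its methods to get a spider that keeps crawling.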

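import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name used to identify this spider when running it.
    name = "quotes"

    def start_requests(self):
        # Pages to fetch first; each response is handed to self.parse.
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends with "/", so the second-to-last piece is the page number.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        # response.body is bytes, so the file is opened in binary mode.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)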
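
To make the filename logic in `parse` easier to follow, here is a small standalone sketch that repeats the same string handling outside Scrapy. The function name `filename_for` is hypothetical and exists only for illustration; it needs no network access and no Scrapy install.

```python
def filename_for(url):
    """Mirror the naming logic in parse(): use the second-to-last URL piece."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") gives
    # ['http:', '', 'quotes.toscrape.com', 'page', '1', ''],
    # so index -2 is the page number because the URL ends with a slash.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(filename_for('http://quotes.toscrape.com/page/1/'))  # quotes-1.html
print(filename_for('http://quotes.toscrape.com/page/2/'))  # quotes-2.html
```

Note that this approach depends on the trailing slash: for a URL such as 'http://quotes.toscrape.com/page/1', index -2 would pick up 'page' instead of the page number.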