Scrapy-Spiders Project Tutorial

晏闻田Solitary

Published 2024-09-08 08:48:23

Article link: https://blog.csdn.net/gitblog_00240/article/details/142015081

scrapy-spiders: Collection of Python scripts I have created to crawl various websites, mostly for lead generation projects to match keywords and collect email addresses and post URLs. Project address: https://gitcode.com/gh_mirrors/sc/scrapy-spiders

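1. Project Introduction

Scrapy-Spiders is an open-source project built on the Scrapy framework that aims to help developers build and deploy web crawlers quickly. Scrapy is a powerful Python crawling framework for extracting data from websites efficiently. The project's ready-made spiders can serve as templates and examples, which simplifies working with Scrapy and lets developers concentrate on data extraction and processing.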

2. Quick Start

Install Scrapy

First, make sure Python and pip are installed. Then install Scrapy with the following command:

pip install scrapy

Clone the project

Use Git to clone the Scrapy-Spiders project to your machine:

git clone https://github.com/dcondrey/scrapy-spiders.git

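Create a Scrapy project

Enter the cloned directory and create a new Scrapy project (the spider itself is written in the next step):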

cd scrapy-spiders
scrapy startproject myproject

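Write the spider code

Inside the generated project, in the myproject/spiders directory (relative to the folder that contains scrapy.cfg), create a new spider file named example_spider.py with the following code: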

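import scrapy

# MyItem is not generated by `scrapy startproject`; define it in
# myproject/items.py with a `title` field before running the spider.
from myproject.items import MyItem


class ExampleSpider(scrapy.Spider):
    name = "example"
    # Restrict the crawl to example.com and its subdomains.
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/1.html",
        "http://www.example.com/2.html",
        "http://www.example.com/3.html",
    ]

    def parse(self, response):
        # Emit one item per <h3>; getall() returns each element's full HTML.
        for h3 in response.xpath("//h3").getall():
            yield MyItem(title=h3)
        # Follow every link on the page and parse it with this same callback.
        for href in response.xpath("//a/@href").getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)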
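
The spider above imports MyItem from myproject/items.py, but a freshly generated project does not define a class with that name. A minimal sketch of one possible definition follows, assuming Scrapy 2.2 or later, which accepts dataclass objects as items. The single title field is inferred from the MyItem(title=h3) call and is otherwise an assumption.

```python
# myproject/items.py (one possible definition, not the project's own)
from dataclasses import dataclass


@dataclass
class MyItem:
    # Raw HTML of one <h3> element, as returned by
    # response.xpath("//h3").getall() in ExampleSpider.parse.
    title: str
```

On older Scrapy versions, a scrapy.Item subclass with a title = scrapy.Field() attribute serves the same purpose.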

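Run the spider

From the project root (the myproject directory that contains scrapy.cfg), run the following command to start the spider: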

scrapy crawl example
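
During the crawl, each href yielded by parse is resolved against the URL of the page it came from, and requests to hosts outside allowed_domains are filtered out. The sketch below approximates those two steps with the standard library only. The is_allowed helper and the sample hrefs are hypothetical and are not Scrapy's actual implementation.

```python
from urllib.parse import urljoin, urlparse

# Same value as ExampleSpider.allowed_domains.
ALLOWED_DOMAINS = ["example.com"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Approximate the offsite check: the host must equal an allowed
    domain or be a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


page_url = "http://www.example.com/1.html"
hrefs = ["2.html", "/about", "https://other.org/x"]  # hypothetical links

# Resolve relative links the way response.urljoin(href) would for this page.
resolved = [urljoin(page_url, href) for href in hrefs]

# Keep only the URLs the spider would be allowed to request.
kept = [url for url in resolved if is_allowed(url)]
print(kept)
```

Because parse follows every link it finds, allowed_domains is what keeps this spider from wandering onto unrelated sites.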

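3. Use Cases and Best Practices

Case 1: Scraping a news website

Suppose we need to collect every headline and link from a news website. Starting from one of the spiders in the Scrapy-Spiders project as a template, this can be implemented quickly.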

import scrapy


class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = ["http://www.news-site.com"]

    def parse(self, response):
        # Each <article> element is assumed to hold one news entry.
        for article in response.xpath("//article"):
            yield {
                'title': article.xpath(".//h2/text()").get(),
                'link': article.xpath(".//a/@href").get(),
            }
        # Follow the "next" pagination link, if the page has one.
        next_page = response.xpath("//a[@class='next']/@href").get()
        if next_page is not None:
            # response.follow accepts relative URLs directly.
            yield response.follow(next_page, self.parse)
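
To check the extraction logic without touching the network, the same XPath steps can be tried offline. The sketch below is a rough stand-in that uses only the standard library. The sample markup and the extract helper are hypothetical, and xml.etree.ElementTree handles only well-formed markup and a subset of XPath, unlike the selectors Scrapy provides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample page; a real site's markup will differ.
SAMPLE_PAGE = """
<html><body>
  <article><h2>First headline</h2><a href="/news/1.html">Read more</a></article>
  <article><h2>Second headline</h2><a href="/news/2.html">Read more</a></article>
  <a class="next" href="/page/2">Next</a>
</body></html>
"""


def extract(page_url, html):
    """Return (items, next_page_url), mirroring what NewsSpider.parse yields."""
    root = ET.fromstring(html.strip())
    items = []
    for article in root.findall(".//article"):
        link = article.find(".//a")
        items.append({
            "title": article.findtext(".//h2"),
            "link": link.get("href") if link is not None else None,
        })
    # Resolve the pagination link against the page URL, as response.follow would.
    next_link = root.find(".//a[@class='next']")
    next_page = None
    if next_link is not None:
        next_page = urljoin(page_url, next_link.get("href"))
    return items, next_page


items, next_page = extract("http://www.news-site.com", SAMPLE_PAGE)
print(items)
print(next_page)
```

Note that the link values stay relative, just as in the spider. If absolute URLs are needed in the output, response.urljoin can be applied to each link before yielding the item.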