Crawling Data with Scrapy

Table of Contents

Introduction to the Scrapy Crawler Framework

Scrapy Crawling Steps

Step 1: Install Scrapy

Step 2: Create a New Project

Step 3: Create a Spider File

Step 4: Write the Spider Code

Step 5: Run the Spider


Introduction to the Scrapy Crawler Framework

Scrapy is an application framework, written in Python, for crawling websites and extracting structured data.

Scrapy is a very powerful crawler framework: it not only makes it convenient to build HTTP requests, it also provides powerful selectors for parsing HTTP responses quickly and easily. Scrapy's advantages can be summarized in the following five points:

(1) Scrapy handles network communication with the Twisted asynchronous networking framework, which speeds up data crawling.
(2) Scrapy has a powerful statistics and logging system, making it easy to inspect returned content and crawl statistics.
(3) Scrapy can crawl data from multiple different web pages concurrently.
(4) Scrapy provides an interactive shell, which is convenient for standalone debugging.
(5) Scrapy stores data into databases through item pipelines, which is flexible and can persist data in many forms (see the pipeline sketch after this list).
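
As a concrete illustration of point (5), here is a minimal sketch of an item pipeline that writes items into SQLite. It assumes items are plain dicts with Author, Committer, and Message keys (as yielded by the spider later in this article); the file name, table schema, and class name are illustrative, not part of Scrapy itself:

import json
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        # Open (or create) the database once when the spider starts.
        self.conn = sqlite3.connect('commits.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS commits (author TEXT, committer TEXT, message TEXT)')

    def process_item(self, item, spider):
        # Author/Committer are nested dicts in the GitHub API, so store them as JSON text.
        self.conn.execute(
            'INSERT INTO commits VALUES (?, ?, ?)',
            (json.dumps(item.get('Author')), json.dumps(item.get('Committer')),
             item.get('Message')))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

To activate such a pipeline, it must be registered in the project's settings.py, e.g. ITEM_PIPELINES = {'projectname.pipelines.SQLitePipeline': 300}.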

Scrapy's Overall Architecture

[Figure: Scrapy overall architecture diagram]

Scrapy Crawling Steps

The steps for crawling with Scrapy are as follows:

Step 1: Install Scrapy

First, install Scrapy (conda is used here; pip works just as well):

conda install scrapy
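
After installation, you can confirm that the scrapy command-line tool is available (the version printed will depend on your environment):

scrapy version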

Step 2: Create a New Project

scrapy startproject projectname
cd projectname

This creates the following directory structure:
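
The tree below is the standard layout generated by scrapy startproject (minor details may vary across Scrapy versions):

projectname/
    scrapy.cfg            # deployment configuration file
    projectname/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py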

Step 3: Create a Spider File

scrapy genspider scrapytest api.github.com
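
This generates spiders/scrapytest.py from Scrapy's default template, roughly as sketched below (the exact template varies slightly by Scrapy version). In the next step, the spider is renamed to commits and filled in:

import scrapy


class ScrapytestSpider(scrapy.Spider):
    name = 'scrapytest'
    allowed_domains = ['api.github.com']
    start_urls = ['https://api.github.com/']

    def parse(self, response):
        pass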

Step 4: Write the Spider Code

Taking the commit history of the PaddlePaddle/PaddleOCR project as an example, the spider code is as follows:

import json

import scrapy


class CommitsSpider(scrapy.Spider):
    name = 'commits'
    allowed_domains = ['api.github.com']
    start_urls = ['https://api.github.com/repos/PaddlePaddle/PaddleOCR/commits?page=1&per_page=100']
    current_page = 1

    def parse(self, response):
        # The GitHub API returns a JSON array; parse it into Python objects.
        result = json.loads(response.text)
        for data in result:
            commit = data.get('commit')
            # Yield one item per commit; Scrapy hands these to pipelines/exporters.
            yield {
                'Author': commit.get('author'),
                'Committer': commit.get('committer'),
                'Message': commit.get('message'),
            }

        # An empty page means the last commit has been passed;
        # otherwise, request the next page of results.
        if result:
            self.current_page += 1
            next_url = ('https://api.github.com/repos/PaddlePaddle/PaddleOCR/'
                        'commits?page=%d&per_page=100' % self.current_page)
            yield scrapy.Request(url=next_url, callback=self.parse)
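
One practical addition (not part of the original tutorial): the GitHub API rate-limits anonymous clients, so it helps to slow the crawl down and identify the client in projectname/settings.py. DOWNLOAD_DELAY and USER_AGENT are standard Scrapy settings; the User-Agent string here is a placeholder:

DOWNLOAD_DELAY = 1
USER_AGENT = 'scrapytest (+https://example.org)'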

Step 5: Run the Spider

scrapy crawl commits -o commits.json
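
Note that -o appends to an existing output file; recent Scrapy releases also support -O to overwrite it instead:

scrapy crawl commits -O commits.json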

A sample of the resulting JSON data is shown below (this particular sample was collected with an analogous spider for the repository's pull requests; you can filter the fields as needed):

[{"User": "andyjpaddle", "State": "closed", "Number": 8548, "CreatedAt": "2022-12-05T11:51:45Z", "UpdatedAt": "2022-12-05T11:52:11Z", "ClosedAt": "2022-12-05T11:52:11Z", "MergedAt": "2022-12-05T11:52:11Z", "Title": "[doc] fix dead link", "Body": "cp https://github.com/PaddlePaddle/PaddleOCR/pull/8547"},
{"User": "andyjpaddle", "State": "closed", "Number": 8547, "CreatedAt": "2022-12-05T11:42:47Z", "UpdatedAt": "2022-12-05T11:53:44Z", "ClosedAt": "2022-12-05T11:53:44Z", "MergedAt": "2022-12-05T11:53:44Z", "Title": "[doc] Fix dead link", "Body": null},
{"User": "WenmuZhou", "State": "closed", "Number": 8506, "CreatedAt": "2022-12-01T06:41:18Z", "UpdatedAt": "2022-12-01T08:02:48Z", "ClosedAt": "2022-12-01T08:02:47Z", "MergedAt": "2022-12-01T08:02:47Z", "Title": "mv Polygon import into func", "Body": null},
{"User": "andyjpaddle", "State": "closed", "Number": 8500, "CreatedAt": "2022-11-30T11:40:17Z", "UpdatedAt": "2022-12-05T11:29:58Z", "ClosedAt": "2022-11-30T11:41:48Z", "MergedAt": "2022-11-30T11:41:47Z", "Title": "[doc] add en doc", "Body": "cp"},
{"User": "andyjpaddle", "State": "closed", "Number": 8497, "CreatedAt": "2022-11-30T11:20:26Z", "UpdatedAt": "2022-12-05T11:29:56Z", "ClosedAt": "2022-11-30T11:35:50Z", "MergedAt": "2022-11-30T11:35:50Z", "Title": "[doc] add finetune en doc", "Body": "add finetune en doc"},
{"User": "LDOUBLEV", "State": "closed", "Number": 8491, "CreatedAt": "2022-11-30T04:43:22Z", "UpdatedAt": "2022-11-30T04:43:34Z", "ClosedAt": "2022-11-30T04:43:34Z", "MergedAt": "2022-11-30T04:43:34Z", "Title": "[bug] fix config", "Body": null}]
