Scraping CSDN Article Information with Python and Scrapy

Latest recommended article published on 2024-04-04 01:27:20

_虹猫少侠

Category: Python Practice (Python实践). Tags: python, scrapy

Article link: https://blog.csdn.net/qq_28537277/article/details/87297464

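Summary (generated automatically by CSDN): This article explains in detail how to use Python's Scrapy framework to scrape article information from a single CSDN page. Each step, from creating the project and analyzing the page to writing the spider and running it, is described clearly, with the aim of helping readers learn Scrapy development through practice. It reminds readers that the elements seen in the browser may differ from those the crawler actually receives, and it provides a link to download the source code.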

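This article uses the Scrapy framework to build a web crawler that scrapes some information about the articles on a single CSDN page. Writing a crawler is not the goal in itself; learning through practice is.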

Tip: to install Scrapy, see "Installing Scrapy on Windows" (Scrapy在Windows平台的安装).

Creating the project

Create the project:
scrapy startproject blog
Switch to the project directory:
cd blog
Create the spider file:
scrapy genspider csdn blog.csdn.net

Analyzing the page

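The page we want to scrape is a user's article list page.
For example: https://blog.csdn.net/qq_28537277

The key information we want to scrape is marked in the figure below.
[Image: the article list page with the target fields marked]
Open the browser's developer tools, press Ctrl+F, and use XPath expressions to locate each element.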

The spider

Define the fields in items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class BlogItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    article_type = scrapy.Field()   # article type label
    article_title = scrapy.Field()  # article title
    create_date = scrapy.Field()    # creation date
    read_num = scrapy.Field()       # read count
    comment_num = scrapy.Field()    # comment count