[Web Scraping Study Notes, Day 44] 5.2. (Scrapy Case Study 2) Sunshine Hotline Government Inquiry Platform Crawler

Latest recommended article published on 2021-11-14 15:48:23

汪雯琦

Category: [Web Scraping]  Tags: list, python, java, js, html

Article link: https://blog.csdn.net/qq_35456045/article/details/104111349

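This article documents in detail how to use the Scrapy framework to crawl complaint posts from the Sunshine Hotline Government Inquiry Platform, collecting each post's number, URL, title, and content. items.py defines the data structure, spiders/sunwz.py implements the Spider, a CrawlSpider version extends it, pipelines.py processes the data, settings.py configures the project, and main.py is used to run and debug the crawl.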

5.2. (Scrapy Case Study 2) Sunshine Hotline Government Inquiry Platform Crawler

Sunshine Hotline Government Inquiry Platform (阳光热线问政平台)

http://wz.sun0769.com/index.php/question/questionType?type=4

Goal: for each complaint post, crawl its number, its URL, its title, and the text content of the post.
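The listing URL above is paginated through a `page` query parameter, and the Spider later in this note steps that offset by 30 until it passes 71130. The following is a minimal sketch of the resulting URL sequence, not part of the project itself. The helper name `listing_urls` is hypothetical, and it assumes every follow-up request is simply the base URL plus the current offset.

```python
from typing import Iterator

# Listing URL prefix as used by the spider in this note.
BASE_URL = "http://wz.sun0769.com/index.php/question/questionType?type=4&page="


def listing_urls(step: int = 30, last_offset: int = 71130) -> Iterator[str]:
    """Yield listing-page URLs in the order the spider would request them.

    Hypothetical helper: it mirrors the spider's check-then-increment logic,
    assuming each follow-up request is BASE_URL plus the current offset.
    """
    offset = 0
    yield BASE_URL + str(offset)  # the start URL (offset 0)
    while offset <= last_offset:
        # The spider tests the offset first and only then adds the step,
        # so one page beyond last_offset is still requested.
        offset += step
        yield BASE_URL + str(offset)


if __name__ == "__main__":
    urls = list(listing_urls())
    print(urls[0])    # first listing page
    print(urls[-1])   # last listing page requested
    print(len(urls))  # total number of listing pages
```

Because the condition is checked before the offset is incremented, the last page requested under these assumptions has offset 71160 rather than 71130. If the crawl should stop exactly at 71130, the comparison would need to be `<` instead of `<=`.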

items.py

import scrapy

class DongguanItem(scrapy.Item):
    # Title of each post
    title = scrapy.Field()
    # Number (ID) of each post
    number = scrapy.Field()
    # Text content of each post
    content = scrapy.Field()
    # URL of each post
    url = scrapy.Field()

spiders/sunwz.py

Spider version

# -*- coding: utf-8 -*-

import scrapy
from dongguan.items import DongguanItem

# The plain Spider version subclasses scrapy.Spider (CrawlSpider is not imported here)
class SunSpider(scrapy.Spider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    url = 'http://wz.sun0769.com/index.php/question/questionType?type=4&page='
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        # Extract the list of post links on each listing page
        links = response.xpath("//div[@class='greyframe']/table//td/a[@class='news14']/@href").extract()
        # Send a request for each post and handle the response in parse_item
        for link in links:
            yield scrapy.Request(link, callback=self.parse_item)
        # Page termination condition; each new listing-page request is handled by parse again
        if self.offset <= 71130:
            self.offset += 30
            yield scrapy.</