Personal notes
# -*- coding: utf-8 -*-
import scrapy
import re
import time
from datetime import datetime, date, timedelta
from scrapy.http import Request
from fzggw.utils import *
from fzggw.items import FgwNewsItem
from fzggw.save_images import save_img
from fzggw.constants import *
import snowflake.client
from fzggw.replace_emoji import remove_emoji
import json


class GjcxGgwSpider(scrapy.Spider):
    name = 'gjcx_ggw'
    start_urls = ['http://sc.ndrc.gov.cn//policy/advancedQuery?']

    def get_form_data(self, page):
        # Form fields expected by the advancedQuery endpoint; only pageNum varies.
        return {
            'pageNum': f"{page}",
            'pageSize': '10',
            'timeStageId': '',
            'areaFlagId': '',
            'areaId': '',
            'zoneId': '',
            'businessPeopleId': '',
            'startDate': '',
            'endDate': '',
            'industryId': '',
            'unitName': '',
            'issuedno': '',
        }

    def start_requests(self):
        # First page; dont_filter is needed because every page POSTs to the same URL.
        form_data = self.get_form_data(1)
        yield scrapy.FormRequest(url=self.start_urls[0], method='post',
                                 formdata=form_data, dont_filter=True)

    def parse(self, response):
        # Computed for date filtering; not used further in this snippet.
        yesterday = (date.today() + timedelta(days=-1)).strftime("%Y-%m-%d")
        today = date.today().strftime("%Y-%m-%d")
        # The endpoint returns JSON; the records live under data.list.
        back_html = response.text
        item_json = json.loads(back_html.strip())
        item_list = item_json['data']['list']
        for info in item_list:
            item = FgwNewsItem()
            item["title"] = ''.join(info['title'])
            item["url"] = (
                "http://sc.ndrc.gov.cn//zhengcekuDetail.html?id=" + info["id"]
            )
            item['data'] = info['publishtime']
            print(item)
        # Pagination: request depth 0 corresponds to page 1, so the next page is depth + 2.
        depth = response.meta.get('depth', 0)
        print(depth)
        page = depth + 2
        print(page)
        yield scrapy.FormRequest(url=self.start_urls[0], method='post',
                                 formdata=self.get_form_data(page),
                                 dont_filter=True)
scrapy.FormRequest already sends a POST by default, so method='post' does not need to be specified explicitly; specifying it is not wrong either.
dont_filter=True disables Scrapy's duplicate-request filter. Keep it here: every page POSTs to the same URL, so without it the later page requests would be dropped as duplicates.
The code above is the pagination logic.
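A possible refinement, sketched below: make pagination stop explicitly instead of yielding the next page unconditionally. The JSON layout (data.list) matches the spider above; treating an empty list as the end of the results, and trimming the form down to just pageNum/pageSize, are assumptions made to keep the example compact, not something the API is known to guarantee.

import json
import scrapy

class GjcxGgwPagedSpider(scrapy.Spider):
    # Sketch: same POST flow as above, but the page number is carried in
    # request.meta and pagination stops when the API returns an empty list
    # (assumed end marker).
    name = 'gjcx_ggw_paged'
    start_urls = ['http://sc.ndrc.gov.cn//policy/advancedQuery?']

    def get_form_data(self, page):
        # Simplified form data (assumption); the full field set is shown above.
        return {'pageNum': str(page), 'pageSize': '10'}

    def start_requests(self):
        yield scrapy.FormRequest(self.start_urls[0],
                                 formdata=self.get_form_data(1),
                                 meta={'page': 1},
                                 dont_filter=True)

    def parse(self, response):
        rows = json.loads(response.text)['data']['list']
        if not rows:
            # No records returned: assume we are past the last page and stop.
            return
        for info in rows:
            yield {'title': info['title'], 'id': info['id']}
        next_page = response.meta['page'] + 1
        yield scrapy.FormRequest(self.start_urls[0],
                                 formdata=self.get_form_data(next_page),
                                 meta={'page': next_page},
                                 dont_filter=True)

Carrying the page number in meta keeps the pagination independent of Scrapy's depth counter, which the original relies on via response.meta.get('depth', 0).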
Running it
import time
import os
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess is created here but not used below; the crawls are launched
# through the command line instead.
process = CrawlerProcess(get_project_settings())

if __name__ == '__main__':
    while True:
        # Run both spiders, then sleep two hours before the next round.
        os.system('py -m scrapy crawl hunan')
        os.system('py -m scrapy crawl zhibang')
        time.sleep(7200)
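Side note: for a single run, both spiders can be started in-process as sketched below. This does not fit the while True loop, because process.start() blocks and the Twisted reactor cannot be started a second time in the same process, which is presumably why the loop shells out to scrapy crawl instead. The spider names are the same ones used in the os.system version.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    # One-off, in-process run (sketch). Blocks until both spiders finish.
    process = CrawlerProcess(get_project_settings())
    process.crawl('hunan')      # spider names taken from the os.system version
    process.crawl('zhibang')
    process.start()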