Handling 404 responses in Scrapy

First approach:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    handle_httpstatus_list = [404, 500]
    name = "my_crawler"

    start_urls = ["http://github.com/illegal_username"]

    def parse(self, response):
        if response.status in self.handle_httpstatus_list:
            return Request(url="https://github.com/kennethreitz/", callback=self.after_404)

    def after_404(self, response):
        # the fallback request lands here; continue scraping from it
        print(response.url)

Adapted from Stack Overflow:

http://stackoverflow.com/questions/16909106/scrapyin-a-request-fails-eg-404-500-how-to-ask-for-another-alternative-reque

Second approach:

from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(BaseSpider):
    handle_httpstatus_list = [404] 
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, category=None):
        super(MySpider, self).__init__()
        self.failed_urls = []
        # register the handler so the collected URLs are written to the
        # crawl stats when the spider closes
        dispatcher.connect(self.handle_spider_closed, signals.spider_closed)

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, spider, reason):
        self.crawler.stats.set_value('failed_urls', ','.join(self.failed_urls))

    def process_exception(self, request, exception, spider):
        # note: process_exception is a downloader-middleware hook, not a
        # spider method; it is shown here only to illustrate counting
        # download exceptions by type in the stats
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

Source: http://stackoverflow.com/questions/13724730/how-to-get-the-scrapy-failure-urls
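To read the `failed_urls` stat after the crawl finishes, one option is to drive the spider from a script with `CrawlerProcess` and query the crawler's stats collector afterwards. A sketch, assuming Scrapy is installed; `run_and_report` and the `myproject.spiders` module path are hypothetical names, and `MySpider` refers to the class from the second approach above:

```python
from scrapy.crawler import CrawlerProcess


def run_and_report(spider_cls):
    """Run a spider and return the 'failed_urls' value from its stats."""
    process = CrawlerProcess(settings={"LOG_ENABLED": False})
    # create_crawler gives us a Crawler object we can keep a handle on,
    # so its stats remain readable after the crawl
    crawler = process.create_crawler(spider_cls)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes
    return crawler.stats.get_value("failed_urls", "")


if __name__ == "__main__":
    from myproject.spiders import MySpider  # hypothetical module path
    print(run_and_report(MySpider))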

In short: when a Scrapy crawl hits a 404, define a `handle_httpstatus_list` on the spider containing the HTTP status codes you want to handle yourself (e.g. 404 and 500); without it, Scrapy's `HttpErrorMiddleware` silently drops those responses before your callback runs. Then, in `parse`, check whether `response.status` is in that list, and if so return a new `Request` pointing at an alternative URL, with a callback that processes the fallback response — sending another request, recording the failure, or whatever else your crawl needs.