[Scrapy FAQ] DEBUG: Filtered offsite request to

While crawling Douban's TOP250 movie list with Scrapy, automatic pagination hit a problem: after the next-page link was parsed and followed, the subsequent pages returned no data.

Test code:

# -*- coding: utf-8 -*-
import scrapy
from douDanMovie.items import DoudanmovieItem
from scrapy import Request

class DoubanSpiderSpider(scrapy.Spider):
    name = "douban_spider"
    allowed_domains = ["www.douban.com"]
    start_urls = (
        'https://movie.douban.com/top250',
    )
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }
    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url, headers=self.headers)

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            # Create a fresh item for each movie so earlier results are not overwritten
            item = DoudanmovieItem()
            item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            item['score_num'] = movie.xpath('.//div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
            yield item
        next_url = response.xpath('//span[@class="next"]/a/@href').extract()
        # next_url is parsed correctly at this point
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield Request(url=next_url, headers=self.headers)

Error log:

2018-11-24 12:06:01 [scrapy] DEBUG: Filtered offsite request to 'movie.douban.com': <GET https://movie.douban.com/top250?start=25&filter=>
2018-11-24 12:06:01 [scrapy] INFO: Closing spider (finished)
2018-11-24 12:06:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 301,

Problem analysis:
Because allowed_domains is set to "www.douban.com", the host of the second-page request, https://movie.douban.com/top250?start=25&filter=, does not match the allowed domain, so Scrapy filters the request as offsite.
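The filtering can be illustrated with a minimal sketch, not Scrapy's actual source, of the host check that the offsite middleware performs: a request host passes only if it equals an allowed domain or is a subdomain of one.

```python
# Minimal sketch (assumed behaviour, not Scrapy's real implementation) of
# the offsite check: the request host must equal an allowed domain or end
# with "." + that domain.
def is_allowed(host, allowed_domains):
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# 'movie.douban.com' is not a subdomain of 'www.douban.com', so the
# paginated request is filtered out:
print(is_allowed("movie.douban.com", ["www.douban.com"]))  # False
# With the parent domain 'douban.com' it passes:
print(is_allowed("movie.douban.com", ["douban.com"]))      # True
```

This is why the fix below of widening allowed_domains to the parent domain lets the movie.douban.com requests through.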

Solutions:

  1. Change allowed_domains = ['www.douban.com'] to allowed_domains = ['douban.com'], i.e. use the parent domain, which also matches the movie.douban.com subdomain.
  2. Set dont_filter=True on the second Request so it is not filtered out.
     For reference, the Request signature:
     class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
     The dont_filter parameter defaults to False:
     dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
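As a side note on the pagination code above: the next-page href on the TOP250 pages is a relative reference such as '?start=25&filter=', so string concatenation happens to work, but resolving it with the standard library's urljoin (which is what Scrapy's response.urljoin delegates to) is more robust for arbitrary relative links. A small stdlib-only sketch:

```python
from urllib.parse import urljoin

# Resolve the relative next-page href against the current page URL.
# '?start=25&filter=' is the form of href seen in the error log above.
base = "https://movie.douban.com/top250"
next_href = "?start=25&filter="
next_url = urljoin(base, next_href)
print(next_url)  # https://movie.douban.com/top250?start=25&filter=
```

In the spider, the equivalent would be yielding Request(response.urljoin(next_url[0]), headers=self.headers), optionally with dont_filter=True as described in solution 2.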