Data mismatches when passing an item in Scrapy, and a few things to watch out for

始識

Last modified: 2022-04-21 10:47:22


Article tags: python, web crawler

First published: 2021-12-17 10:08:46

Article link: https://blog.csdn.net/Zuko_chen/article/details/121989757

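Scrapy is mostly used to crawl a list page and then the detail pages it links to. That takes two kinds of request, handled by two callbacks: parse for the list page and parse_detail for the detail page. Each request is usually issued with yield and routed to its handler through the callback argument. Sometimes, though, the data from the two pages fails to line up.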

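When that happens, check your code for an extra yield. In particular, when two requests are chained, do not also write yield item in parse: the item then goes straight to the item pipeline before parse_detail has filled it in, and the results come out wrong.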

Wrong:
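
A minimal sketch of what this mistake might look like. To keep it runnable without installing Scrapy, `FakeRequest` is a hypothetical stand-in for `scrapy.Request`, the item is a plain dict, and the URL and the name `parse_wrong` are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request so the sketch runs without Scrapy."""
    url: str
    meta: dict = field(default_factory=dict)


def parse_wrong(links):
    """List-page callback that yields the request AND the unfinished item."""
    for url, title in links:
        item = {"title_url": url, "title_name": title}
        yield FakeRequest(url=url, meta={"item": item})
        yield item  # mistake: content_html has not been filled in yet


outputs = list(parse_wrong([("https://example.com/1", "first")]))
# Everything that is not a request would be handed to the item pipeline.
early_items = [o for o in outputs if isinstance(o, dict)]
print(early_items)  # [{'title_url': 'https://example.com/1', 'title_name': 'first'}]
```

The item reaches the output once without its detail-page fields, and it would arrive a second time after the detail callback finishes.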

Correct:
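
A minimal sketch of the intended flow, under the same kind of assumptions: `FakeRequest` is a hypothetical stand-in for `scrapy.Request`, the item is a plain dict, and the detail callback is called by hand with a made-up page body instead of a real response.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request so the sketch runs without Scrapy."""
    url: str
    meta: dict = field(default_factory=dict)


def parse(links):
    """List-page callback: yield only the request, never the item."""
    for url, title in links:
        item = {"title_url": url, "title_name": title}
        yield FakeRequest(url=url, meta={"item": item})


def parse_detail(meta, body):
    """Detail-page callback: complete the item, then yield it exactly once."""
    item = meta["item"]
    item["content_html"] = body
    yield item


requests = list(parse([("https://example.com/1", "first")]))
items = [it for req in requests for it in parse_detail(req.meta, "<div>body</div>")]
print(items)
```

Only parse_detail yields the item, so each item is emitted once and only after its detail fields are present.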

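Another problem shows up when passing the item along: every callback receives the last item, over and over. It behaves as if the yield part above runs inside the for loop while the callback below sits outside it, so only the last item is ever used. The cause is that all the requests share one item object, and by the time the callbacks run the loop has already overwritten its fields.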
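
The symptom can be reproduced in plain Python. This is a sketch rather than Scrapy itself: the `pending` list is an assumed stand-in for the scheduler queue, and the function name and titles are made up.

```python
def schedule_requests(titles):
    """Mimic a parse() loop that reuses one item for every request."""
    item = {}      # created once, outside the loop
    pending = []   # assumed stand-in for the scheduler queue
    for title in titles:
        item["title_name"] = title
        pending.append({"item": item})  # every meta dict holds the same object
    return pending


pending = schedule_requests(["first", "second", "third"])
# The callbacks run after the loop has finished, so they all read the final value.
seen = [meta["item"]["title_name"] for meta in pending]
print(seen)  # ['third', 'third', 'third']
```

In a real crawl the timing is less regular, because responses come back asynchronously, but the shared object is the same root cause.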

So we can use copy.deepcopy to make a deep copy.

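A deep copy duplicates the parent object together with all of its child objects, so each request carries its own independent item. The copy is passed to parse_detail, and parse_detail is the single place that yields the finished item.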
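
A short sketch of the difference between a shallow and a deep copy, using a plain dict in place of a Scrapy item. The `tags` field is hypothetical and is only there to give the item a nested child object.

```python
import copy

item = {"title_name": "first", "tags": ["a"]}
shallow = copy.copy(item)    # new dict, but the nested list is shared
deep = copy.deepcopy(item)   # new dict and a new nested list

item["title_name"] = "second"  # rebinding a top-level key
item["tags"].append("b")       # mutating a nested list in place

print(shallow["title_name"], shallow["tags"])  # first ['a', 'b']
print(deep["title_name"], deep["tags"])        # first ['a']
```

For an item that holds only strings a shallow copy would also do, but deepcopy stays safe if a list or dict field is added later. If the item is created inside the loop instead, every request gets its own object and no copy is needed.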

The code is as follows:

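    # These methods live in a scrapy.Spider subclass; the module also
    # needs `import copy` and `import scrapy` at the top.
    def parse(self, response, **kwargs):
        # Create the item object (one instance reused by the loop, hence the deepcopy below)
        item = CurrenyItem()
        # Extract the list page entries: a list of selector objects
        ul_resp = response.xpath('//*[@id="pc"]//div[@class="nyrtct"]/ul/li')
        # Iterate over the entries
        for li in ul_resp:
            # Link URL
            item['title_url'] = li.xpath('./a/@href').extract_first()
            # Link title
            item["title_name"] = li.xpath('./a/@title').extract_first()
            # Date shown next to the title
            item["title_date"] = li.xpath('./span/text()').extract_first()
            # Cast the URL to str; this step can be skipped
            item['title_url'] = str(item['title_url'])
            # Request the detail page and hand the item over
            yield scrapy.Request(
                url=item['title_url'],
                callback=self.parse_detail,
                # Deep-copy the item so each request carries its own snapshot
                meta={'item': copy.deepcopy(item)})
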
        # Extract the link to the next page
        next_url = response.xpath('//*[@id="pc"]//div[@class="show_page"]/a[@class="next"]/@href').extract_first()
        # As long as there is a next page, request it with parse as the callback again
        if next_url is not None:
            yield scrapy.Request(url=next_url, callback=self.parse)

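    def parse_detail(self, response):
        # Retrieve the item passed along with the request
        item = response.meta['item']
        # Extract the main content of the detail page
        item['content_html'] = response.xpath('//*[@id="pc"]/div/div/div[@class="main1"]').extract_first()
        # If the precise XPath matches nothing, fall back to another one
        if item['content_html'] is None:
            item['content_html'] = response.xpath('/html/body/div[4]/div[2]').extract_first()
        # Cast to str; this step can be skipped
        item['content_html'] = str(item['content_html'])
        # Print a marker to confirm the page was handled
        print(item['title_name'], ">>>>>>>>>ok")
        # The only place the finished item is yielded to the pipeline
        yield item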