A Custom Scrapy Middleware in Python for Avoiding Duplicate Scraping

This post presents a Scrapy spider middleware, IgnoreVisitedItems, that filters out item pages which have already been visited. The middleware inspects each request's meta to decide whether filtering applies, and records visit IDs so that the same page is never processed twice. This improves crawl efficiency and prevents duplicate data.

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from myproject.items import MyItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages that were
    already visited before.

    Requests to be filtered must have the meta['filter_visited']
    flag enabled, and may optionally define an id to use for
    identifying them, which defaults to the request fingerprint,
    although you'd want to use the item id, if you already have
    it beforehand, to make the filtering more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        # Keep the visited ids on the spider itself so they persist
        # across calls; a plain getattr default dict would be thrown
        # away after every call and nothing would ever be filtered.
        if not hasattr(spider, 'context'):
            spider.context = {}
        visited_ids = spider.context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                # Only filter requests that explicitly opted in.
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                # An item was scraped: record the page it came from
                # as visited and tag the item accordingly.
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                # Replace the duplicate request with a stub item
                # marked as already seen.
                ret.append(MyItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        # Prefer an explicit id from meta, falling back to the
        # request fingerprint.
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
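
For completeness, here is a minimal sketch of how the middleware would be wired into a project. Everything below is an illustrative assumption built around the code above, not part of the original post: the middlewares module path, the priority value 560, and the example spider are all hypothetical.

# settings.py -- register the spider middleware; the module path
# and the priority value 560 are illustrative assumptions.
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreVisitedItems': 560,
}

# items.py -- MyItem must declare the two fields the middleware
# writes; any other fields are up to the project.
from scrapy.item import Item, Field

class MyItem(Item):
    visit_id = Field()
    visit_status = Field()

# A hypothetical spider; the URLs and selectors are placeholders.
from scrapy import Spider
from scrapy.http import Request

from myproject.items import MyItem

class ExampleSpider(Spider):
    name = 'example'
    start_urls = ['http://example.com/items']

    def parse(self, response):
        for href in response.xpath('//a[@class="item"]/@href').extract():
            # meta['filter_visited'] opts this request into the
            # middleware; meta['visited_id'] could carry a known
            # item id instead of relying on the fingerprint.
            yield Request(response.urljoin(href),
                          callback=self.parse_item,
                          meta={'filter_visited': True})

    def parse_item(self, response):
        # The middleware fills in visit_id and visit_status.
        return MyItem()

Note that the middleware targets an older Scrapy API: on recent releases the scrapy.log module and BaseItem no longer exist, and spider.logger plus scrapy.Item are their replacements.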
