python queue 查询是否在队列中_如何检查任务是否已经在python Queue中?

I'm writing a simple crawler in Python using the threading and Queue modules. I fetch a page, check links and put them into a queue, when a certain thread has finished processing page, it grabs the next one from the queue. I'm using an array for the pages I've already visited to filter the links I add to the queue, but if there are more than one threads and they get the same links on different pages, they put duplicate links to the queue. So how can I find out whether some url is already in the queue to avoid putting it there again?

解决方案

If you don't care about the order in which items are processed, I'd try a subclass of Queue that uses set internally:

class SetQueue(Queue):

def _init(self, maxsize):

self.maxsize = maxsize

self.queue = set()

def _put(self, item):

self.queue.add(item)

def _get(self):

return self.queue.pop()

As Paul McGuire pointed out, this would allow adding a duplicate item after it's been removed from the "to-be-processed" set and not yet added to the "processed" set. To solve this, you can store both sets in the Queue instance, but since you are using the larger set for checking if the item has been processed, you can just as well go back to queue which will order requests properly.

class SetQueue(Queue):

def _init(self, maxsize):

Queue._init(self, maxsize)

self.all_items = set()

def _put(self, item):

if item not in self.all_items:

Queue._put(self, item)

self.all_items.add(item)

The advantage of this, as opposed to using a set separately, is that the Queue's methods are thread-safe, so that you don't need additional locking for checking the other set.

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值