Notes on Using Multithreading and Queues in Web Crawlers

feiyy404

Article link: https://blog.csdn.net/Enjolras_fuu/article/details/107382458

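Why use multithreading

To scrape data more efficiently:
Some sites limit how fast a single client may access them. For such sites you can start several threads, give each thread its own proxy, and have each thread fetch part of the content.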
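A minimal sketch of that idea, under stated assumptions: the URL list is split into slices and each thread works through its slice with its own proxy. `PROXIES`, `URLS`, `fetch`, `worker` and `crawl` are hypothetical names, and `fetch` only simulates a download; a real crawler would call an HTTP client there.

```python
import threading

# Hypothetical proxy addresses and URLs, for illustration only.
PROXIES = ["http://proxy-a:8080", "http://proxy-b:8080", "http://proxy-c:8080"]
URLS = ["https://example.com/page/{}".format(i) for i in range(9)]


def fetch(url, proxy):
    # Placeholder for a real HTTP request routed through `proxy`.
    return "{} via {}".format(url, proxy)


def worker(urls, proxy, results, lock):
    for url in urls:
        body = fetch(url, proxy)
        with lock:  # the results list is shared between threads
            results.append(body)


def crawl(urls, proxies):
    results = []
    lock = threading.Lock()
    threads = []
    for i, proxy in enumerate(proxies):
        # Thread i handles every len(proxies)-th URL, starting at offset i.
        chunk = urls[i::len(proxies)]
        t = threading.Thread(target=worker, args=(chunk, proxy, results, lock))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `crawl(URLS, PROXIES)` returns one entry per URL; the order depends on thread scheduling.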

About daemon threads

import threading
import time


def task():
    print("I am a task that needs to run in its own thread")
    time.sleep(30)
    print("Thread task finished")


def main():
    th1 = threading.Thread(target=task)
    th1.start()
    print("Main over")


main()

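Written this way, the main thread prints "Main over" right away, but the process does not exit until the child thread has finished.
If the child thread should stop as soon as the main thread ends, whether or not it has finished, mark it as a daemon thread. Set the daemon attribute before calling start() (the older setDaemon() method is deprecated since Python 3.10):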

import threading
import time


def task():
    print("I am a task that needs to run in its own thread")
    time.sleep(30)
    print("Thread task finished")


def main():
    th1 = threading.Thread(target=task)
    th1.daemon = True  # must be set before start(); setDaemon() is deprecated since Python 3.10
    th1.start()
    print("Main over")


main()

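Daemon threads suit work that is not essential: the main thread does not wait for them, and when the main thread ends they are stopped abruptly along with the process.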
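As a small sketch of how this can be combined with a bounded wait: the daemon flag can also be passed to the Thread constructor, and join(timeout) lets the main thread wait a limited time before moving on. `slow_task` and `run_with_deadline` are hypothetical names used only for this illustration.

```python
import threading
import time


def slow_task(done):
    time.sleep(0.2)  # stands in for real work
    done.set()


def run_with_deadline(timeout):
    done = threading.Event()
    # daemon=True in the constructor is equivalent to setting th.daemon = True
    th = threading.Thread(target=slow_task, args=(done,), daemon=True)
    th.start()
    th.join(timeout)  # wait at most `timeout` seconds, then carry on
    return th.daemon, done.is_set()
```

With a short timeout the function returns before the task has finished; because the thread is a daemon, it would not keep the process alive at exit.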

Queue usage

put_nowait

import queue
import traceback

q = queue.Queue(maxsize=100)


def queue_test1():
    for i in range(100):
        q.put(i)

    item = {}
    try:
        q.put_nowait(item)  # put without waiting; raises queue.Full when the queue is full
    except queue.Full:
        traceback.print_exc()  # print_exc() prints the traceback itself and returns None

The queue has a fixed maximum size. Adding to it once it is full with put_nowait does not wait; it raises queue.Full immediately.

put

With plain put, a full queue makes the call block and wait:

def queue_test2():
    for i in range(100):
        q.put(i)

    item = "ruiyang"
    q.put(item)  # blocks while the queue is full

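get and get_nowait mirror this on the consuming side: get blocks while the queue is empty, and get_nowait raises queue.Empty.

q.qsize() returns the number of items currently in the queue; with other threads running, the value is only approximate.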
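The behaviours described above can be seen together in one short sketch. The function name `demo` and the values are made up for the illustration; the timeout argument makes put wait a bounded time instead of forever.

```python
import queue


def demo():
    q = queue.Queue(maxsize=2)
    q.put("a")
    q.put("b")
    events = [("qsize", q.qsize())]

    try:
        q.put("c", timeout=0.05)  # waits up to 0.05 s for a free slot, then gives up
    except queue.Full:
        events.append("full")

    events.append(q.get())         # items come out in FIFO order
    events.append(q.get_nowait())

    try:
        q.get_nowait()             # the queue is empty now
    except queue.Empty:
        events.append("empty")
    return events
```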

About join and task_done

queue.Queue stores its items in a collections.deque, which CPython implements in C.

import threading
from collections import deque
from time import monotonic as time


class Full(Exception):
    # Raised when the queue is full
    'Exception raised by Queue.put(block=0)/put_nowait().'
    pass


class Empty(Exception):
    'Exception raised by Queue.get(block=0)/get_nowait().'
    pass


class Queue:
    '''Create a queue object with a given maximum size.

    If maxsize is <= 0, the queue size is infinite.
    '''

    def __init__(self, maxsize=0):
        # Maximum number of items in the queue
        self.maxsize = maxsize
        self._init(maxsize)

        # Thread lock protecting the queue's state
        self.mutex = threading.Lock()

        # Notify not_empty whenever an item is added to the queue; a thread waiting to get is notified then.
        # In the threading module, Condition is a condition variable: besides acquire and release, as on a Lock, it also provides wait and notify.
        self.not_empty = threading.Condition(self.mutex)

        # Notify not_full whenever an item is removed from the queue; a thread waiting to put is notified then.
        self.not_full = threading.Condition(self.mutex)

        # Notify all_tasks_done whenever the number of unfinished tasks drops to zero; thread waiting to join() is notified to resume
        self.all_tasks_done = threading.Condition(self.mutex)
        self.unfinished_tasks = 0

    def _init(self, maxsize):
        self.queue = deque()

    def _qsize(self):
        # Current number of items in the queue
        return len(self.queue)

    def put_nowait(self, item):
        '''Put an item into the queue without blocking.

        Only enqueue the item if a free slot is immediately available.
        Otherwise raise the Full exception.
        '''
        return self.put(item, block=False)

    # Put a new item in the queue
    def _put(self, item):
        self.queue.append(item)

    def put(self, item, block=True, timeout=None):
        '''Put an item into the queue.

        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until a free slot is available. If 'timeout' is
        a non-negative number, it blocks at most 'timeout' seconds and raises
        the Full exception if no free slot was available within that time.
        Otherwise ('block' is false), put an item on the queue if a free slot
        is immediately available, else raise the Full exception ('timeout'
        is ignored in that case).
        '''
        with self.not_full:
            if self.maxsize > 0:

                if not block:
                    if self._qsize() >= self.maxsize:
                        raise Full

                elif timeout is None:
                    while self._qsize() >= self.maxsize:
                        self.not_full.wait()    # block until a slot frees up

                elif timeout < 0:
                    raise ValueError("'timeout' must be a non-negative number")
                else:
                    endtime = time() + timeout
                    while self._qsize() >= self.maxsize:
                        remaining = endtime - time()
                        if remaining <= 0.0:
                            raise Full
                        self.not_full.wait(remaining)

            self._put(item)
            # Every put increments the queue's unfinished_tasks counter
            self.unfinished_tasks += 1
            self.not_empty.notify()

    # Get an item from the queue
    def _get(self):
        return self.queue.popleft()

    def get(self, block=True, timeout=None):
        '''Remove and return an item from the queue.

        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until an item is available. If 'timeout' is
        a non-negative number, it blocks at most 'timeout' seconds and raises
        the Empty exception if no item was available within that time.
        Otherwise ('block' is false), return an item if one is immediately
        available, else raise the Empty exception ('timeout' is ignored
        in that case).
        '''
        with self.not_empty:
            if not block:
                if not self._qsize():
                    raise Empty
            elif timeout is None:
                while not self._qsize():
                    self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a non-negative number")
            else:
                endtime = time() + timeout
                while not self._qsize():
                    remaining = endtime - time()
                    if remaining <= 0.0:
                        raise Empty
                    self.not_empty.wait(remaining)
            item = self._get()
            self.not_full.notify()
            return item

    def get_nowait(self):
        '''Remove and return an item from the queue without blocking.

        Only get an item if one is immediately available. Otherwise
        raise the Empty exception.
        '''
        return self.get(block=False)

    def empty(self):
        '''Return True if the queue is empty, False otherwise (not reliable!).

        This method is likely to be removed at some point.  Use qsize() == 0
        as a direct substitute, but be aware that either approach risks a race
        condition where a queue can grow before the result of empty() or
        qsize() can be used.

        To create code that needs to wait for all queued tasks to be
        completed, the preferred technique is to use the join() method.
        '''
        with self.mutex:
            return not self._qsize()

    def full(self):
        '''Return True if the queue is full, False otherwise (not reliable!).

        This method is likely to be removed at some point.  Use qsize() >= n
        as a direct substitute, but be aware that either approach risks a race
        condition where a queue can shrink before the result of full() or
        qsize() can be used.
        '''
        with self.mutex:
            return 0 < self.maxsize <= self._qsize()

    def qsize(self):
        '''Return the approximate size of the queue (not reliable!).'''
        with self.mutex:
            return self._qsize()

    def task_done(self):
        '''Indicate that a formerly enqueued task is complete.

        Used by Queue consumer threads.  For each get() used to fetch a task,
        a subsequent call to task_done() tells the queue that the processing
        on the task is complete.

        If a join() is currently blocking, it will resume when all items
        have been processed (meaning that a task_done() call was received
        for every item that had been put() into the queue).

        Raises a ValueError if called more times than there were items
        placed in the queue.
        '''
        with self.all_tasks_done:
            unfinished = self.unfinished_tasks - 1
            if unfinished <= 0:
                if unfinished < 0:
                    raise ValueError('task_done() called too many times')
                self.all_tasks_done.notify_all()
            self.unfinished_tasks = unfinished

    def join(self):
        '''Blocks until all items in the Queue have been gotten and processed.

        The count of unfinished tasks goes up whenever an item is added to the
        queue. The count goes down whenever a consumer thread calls task_done()
        to indicate the item was retrieved and all work on it is complete.

        When the count of unfinished tasks drops to zero, join() unblocks.
        '''
        with self.all_tasks_done:
            while self.unfinished_tasks:
                self.all_tasks_done.wait()


def main():
    q = Queue(10)
    print(q)
    for i in range(10):
        q.put(i)
    print(q)

    for j in range(10):
        print(q.get())
        # q.task_done()

    # q.join()


if __name__ == "__main__":
    main()

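That is the minimal core. It can be read as follows:

Queue.task_done(): after finishing one unit of work, the consumer calls Queue.task_done() to tell the queue that the task is complete.

Queue.join() blocks until every item that was put into the queue has been marked done, and only then lets the caller continue. It waits for the unfinished_tasks counter to reach zero, not merely for the queue to be empty.

If a thread takes items from the queue but never calls task_done(), the counter never reaches zero, so a final join() never returns and the program hangs.

Put differently: each put() adds one to unfinished_tasks and each task_done() subtracts one. get() is what removes the item from the queue; task_done() removes nothing. join() decides whether the work is finished from that counter, not from the queue length, and then lets the main thread continue.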
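To make the counter visible, here is a small sketch that records unfinished_tasks (the attribute shown in the source above) next to qsize() at each step. The function names `demo` and `drain` are hypothetical.

```python
import queue
import threading


def demo():
    q = queue.Queue()
    for i in range(3):
        q.put(i)

    steps = [q.unfinished_tasks]                    # three puts
    q.get()
    steps.append((q.qsize(), q.unfinished_tasks))   # get() shrinks the queue only
    q.task_done()
    steps.append(q.unfinished_tasks)                # task_done() lowers the counter

    def drain():
        # Take whatever is left and mark each item as done.
        while True:
            try:
                q.get_nowait()
            except queue.Empty:
                return
            q.task_done()

    t = threading.Thread(target=drain)
    t.start()
    q.join()   # returns once the counter reaches zero
    t.join()
    steps.append(q.unfinished_tasks)
    return steps
```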

Using a queue for communication between threads

from queue import Queue
import threading


def add_to_queue():
    for i in range(0, 100):
        print("put into queue: {}".format(i))
        q.put(i)


def get_from_queue():
    # The consumer does not know how many items will be put into the queue,
    # so it loops forever and keeps taking items as they arrive.
    # for i in range(0, 100):
    while True:
        print("taken from queue: {}".format(q.get()))
        q.task_done()


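q = Queue()
# Create the threads
t1 = threading.Thread(target=add_to_queue)

t2 = threading.Thread(target=get_from_queue)
# The consumer loops forever, so make it a daemon thread that ends with the main thread
t2.daemon = True

# Start the threads
t2.start()
t1.start()

# Wait for the producer first; otherwise q.join() could return before anything has been put
t1.join()
# Then block the main thread until every queued item has been marked done
q.join()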
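An alternative to an endless daemon consumer, sketched here as one possible design and not as the author's method, is to have the producer put a sentinel object on the queue when it is done, so the consumer can leave its loop and be joined normally. `_STOP`, `producer`, `consumer` and `run` are hypothetical names.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel meaning "no more items"


def producer(q, n):
    for i in range(n):
        q.put(i)
    q.put(_STOP)


def consumer(q, out):
    while True:
        item = q.get()
        if item is _STOP:
            q.task_done()
            break
        out.append(item)
        q.task_done()


def run(n):
    q = queue.Queue()
    out = []
    threads = [
        threading.Thread(target=consumer, args=(q, out)),
        threading.Thread(target=producer, args=(q, n)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    q.join()  # returns at once: every item, sentinel included, was marked done
    return out
```

With several consumers, one sentinel per consumer would be needed.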

About locks in multithreading

The simplest example of threads running without a lock is a bank balance.

import threading

# Assume this is your bank balance:
balance = 0


def change_it(n):
    # Deposit, then withdraw; the result should be 0:
    global balance
    balance = balance + n
    balance = balance - n


def run_thread(n):
    for i in range(10000000):
        change_it(n)


t1 = threading.Thread(target=run_thread, args=(5,))
t2 = threading.Thread(target=run_thread, args=(8,))

t1.start()
t2.start()

t1.join()
t2.join()
print(balance)

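Each call to change_it deposits and then withdraws the same amount, so no matter how many times it runs, the balance should stay at 0.

But when several threads call change_it, thread 1 may deposit 5 and, before it gets to withdraw, thread 2 may be scheduled and deposit 8.

The final result may then be non-zero. How often this actually shows up depends on the Python version and on timing.

Why:

At a shallow level, computing the new value and assigning it are two separate steps. If two threads have both computed but not yet assigned, one of the updates is lost. If both threads are adding, the total can end up short.

Wrapping the two steps in a lock makes them thread-safe.

Going one level deeper: balance = balance + n is not a single atomic operation. The interpreter loads the current value of balance, computes the sum, and then rebinds the name balance to the result. Python ints are immutable, so every assignment binds the name to a new object. If a thread switch happens after one thread has loaded the old value but before it stores the result, the other thread's update in between is overwritten by a value computed from stale data.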
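One way to see the separate steps is to disassemble the statement with the standard dis module. This is only a sketch: `deposit` and `opnames` are hypothetical helpers, and the exact opcode names between the load and the store differ across Python versions.

```python
import dis

balance = 0


def deposit(n):
    global balance
    balance = balance + n


def opnames(func):
    # Names of the bytecode instructions the function compiles to.
    return [ins.opname for ins in dis.get_instructions(func)]
```

Printing `opnames(deposit)` shows a LOAD_GLOBAL for balance, then the addition, then a STORE_GLOBAL; the window between the load and the store is where another thread's update can be lost.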

Protecting the threads' work with a lock:

import threading

balance = 0
lock = threading.Lock()


def change_it(n):
    # Deposit, then withdraw; the result should be 0:
    global balance
    balance = balance + n
    balance = balance - n


def run_thread(n):
    for i in range(100000):
        # Acquire the lock first:
        lock.acquire()
        try:
            # Safe to modify now:
            change_it(n)
        finally:
            # Always release the lock when done:
            lock.release()


t1 = threading.Thread(target=run_thread, args=(5,))
t2 = threading.Thread(target=run_thread, args=(8,))
t1.start()
t2.start()
t1.join()
t2.join()
print(balance)

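When several threads call lock.acquire() at the same time, only one of them gets the lock and carries on; the others wait until they can acquire it.

The thread that holds the lock must release it when done, otherwise the threads waiting for it will wait forever. That is why try...finally is used: it guarantees the lock is released.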
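The same acquire/release guarantee can be written more compactly with a with statement, since Lock supports the context manager protocol. A sketch, with a hypothetical `run` helper and a smaller round count so it finishes quickly:

```python
import threading

balance = 0
lock = threading.Lock()


def change_it(n):
    global balance
    balance = balance + n
    balance = balance - n


def run_thread(n, rounds):
    for _ in range(rounds):
        with lock:  # acquired here, released on exit even if an exception is raised
            change_it(n)


def run(rounds=10000):
    threads = [threading.Thread(target=run_thread, args=(n, rounds)) for n in (5, 8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balance
```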

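The benefit of a lock is that a critical piece of code is executed from start to finish by only one thread at a time. There are costs as well. First, it prevents concurrent execution: the code under the lock effectively runs single-threaded, so throughput drops. Second, because there can be several locks, threads that each hold one lock and try to acquire a lock held by another can deadlock. All of them hang, can neither run nor finish, and the process has to be killed from outside.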
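A common way to avoid the two-lock deadlock described above is to make every thread acquire the locks in one agreed order. The sketch below is one possible approach under that assumption; `transfer`, `run`, `lock_a` and `lock_b` are hypothetical names, and ordering by id() is just a convenient way to get a consistent order within one process.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()


def transfer(first, second, log, label):
    # Always take the locks in the same global order (here: by id), so two
    # threads can never each hold one lock while waiting for the other.
    low, high = sorted((first, second), key=id)
    with low:
        with high:
            log.append(label)


def run():
    log = []
    # The two threads name the locks in opposite order on purpose.
    t1 = threading.Thread(target=transfer, args=(lock_a, lock_b, log, "t1"))
    t2 = threading.Thread(target=transfer, args=(lock_b, lock_a, log, "t2"))
    t1.start()
    t2.start()
    t1.join(timeout=5)
    t2.join(timeout=5)
    return sorted(log)
```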

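An RLock can be acquired multiple times by the same thread; a Lock cannot. Note: with an RLock, acquire and release must be paired. After n calls to acquire, the lock is only really released after n calls to release.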

A case that deadlocks:

import threading
lock = threading.Lock()
# A plain Lock object
lock.acquire()
lock.acquire()
# Deadlock: the second acquire() above never returns, so the lines below are never reached.
lock.release()
lock.release()

Or:

import threading

m_lock = threading.Lock()


def h():
    with m_lock:
        g()
        print('h')


def g():
    with m_lock:
        print('g')


h()
g()

This is where a reentrant lock (RLock) comes in:

import threading
rLock = threading.RLock()
# An RLock object
rLock.acquire()
print("1")
rLock.acquire()
print("2")
# Within the same thread, the second acquire() does not block.
rLock.release()
print("3")
rLock.release()

import threading

m_lock = threading.RLock()


def h():
    with m_lock:
        g()
        print('h')


def g():
    with m_lock:
        print('g')


h()
g()

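Lock and RLock differ as follows:
(1) Lock is the lowest-level synchronization primitive available; a thread can acquire it only once before releasing it. RLock is a synchronization primitive that the same thread can acquire multiple times.
(2) A locked Lock is not owned by any particular thread. RLock tracks an owning thread and a recursion level, so while it is locked it belongs to a specific thread.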
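The ownership difference in point (2) can be observed directly: a Lock acquired in one thread may be released from another, while releasing an RLock from a thread that does not own it raises RuntimeError. A sketch, with `release_from_other_thread` as a hypothetical helper:

```python
import threading


def release_from_other_thread(lock):
    """Acquire in the calling thread, try to release in another thread."""
    outcome = []

    def releaser():
        try:
            lock.release()
            outcome.append("released")
        except RuntimeError:
            outcome.append("RuntimeError")

    lock.acquire()
    t = threading.Thread(target=releaser)
    t.start()
    t.join()
    if outcome == ["RuntimeError"]:
        lock.release()  # still held by this thread, so clean up here
    return outcome[0]
```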

Finally, the condition variable Condition, used for thread synchronization.

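Think of a Condition as an advanced lock: it offers more than Lock and RLock and lets us handle more complex thread synchronization problems.
threading.Condition maintains a lock object internally (an RLock by default); a lock object can also be passed in when the Condition is created.
Condition provides acquire and release as well. They mean the same as the lock's acquire and release; in fact they simply call the corresponding methods of the internal lock.
Condition also provides wait, notify and notify_all (notifyAll is a deprecated alias). Note in particular that these methods may only be called while the lock is held (after acquire); otherwise a RuntimeError is raised: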

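acquire()/release(): acquire/release the underlying lock.
wait(timeout=None): suspend the thread until it receives a notify or the timeout expires (optional float, in seconds); it then wakes up and continues.
wait() may only be called while the lock is held, otherwise a RuntimeError is raised. wait() releases the lock while waiting and re-acquires it before returning, whether it was woken by notify(), notify_all() or a timeout.
notify(n=1): wake up threads waiting on this condition. By default one waiting thread is woken; at most n waiting threads are woken.
notify() may only be called while the lock is held, otherwise a RuntimeError is raised. notify() does not release the lock itself.
notify_all() (formerly notifyAll()): wakes every thread waiting on the condition; useful when many threads are waiting (generally used less often).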
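As a small illustration of these methods, here is a one-slot mailbox built on a Condition. It is a sketch, not part of the original article: the `Mailbox` class and `run` function are hypothetical, and it uses wait_for(), which re-checks a predicate each time the thread is woken.

```python
import threading


class Mailbox:
    """One-slot mailbox guarded by a Condition (hypothetical example class)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._item = None
        self._has_item = False

    def put(self, item):
        with self._cond:  # the lock must be held for wait/notify
            self._cond.wait_for(lambda: not self._has_item)
            self._item, self._has_item = item, True
            self._cond.notify_all()

    def get(self, timeout=None):
        with self._cond:
            if not self._cond.wait_for(lambda: self._has_item, timeout):
                raise TimeoutError("no item arrived in time")
            item, self._item, self._has_item = self._item, None, False
            self._cond.notify_all()
            return item


def run():
    box = Mailbox()
    received = []
    consumer = threading.Thread(
        target=lambda: received.extend(box.get(timeout=5) for _ in range(3)))
    consumer.start()
    for i in range(3):
        box.put(i)
    consumer.join()
    return received
```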

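Now a game of hide-and-seek to show the basic use of threading.Condition. Assume two players: one hides (Hider), one seeks (Seeker).
The rules are:

Once the game starts, Seeker covers their eyes and then notifies Hider;
Hider receives the notification, finds a place to hide, and once hidden notifies Seeker that the search can begin;
Seeker receives the notification and starts looking for Hider.

Hider and Seeker are independent players, represented in the program by two separate threads. Their actions must happen in a certain order during the game, and we use a Condition to enforce that order.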

import threading
import time


def Seeker(cond, name):
    # Give Hider time to reach cond.wait() first; a notify() sent with no waiter is lost.
    time.sleep(2)
    cond.acquire()
    print('%s: I have covered my eyes!' % name)
    cond.notify()
    cond.wait()

    for i in range(3):
        print('%s is finding!!!' % name)
        time.sleep(2)

    print('%s: I win!' % name)
    cond.notify()
    cond.release()


def Hider(cond, name):
    cond.acquire()
    cond.wait()

    for i in range(2):
        print('%s is hiding!!!' % name)
        time.sleep(3)
    print('%s: I am hidden, come and find me!' % name)
    cond.notify()
    cond.wait()

    print('%s: you found me, oh well!' % name)
    cond.release()


if __name__ == '__main__':
    cond = threading.Condition()
    seeker = threading.Thread(target=Seeker, args=(cond, 'seeker'))
    hider = threading.Thread(target=Hider, args=(cond, 'hider'))
    seeker.start()
    hider.start()
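The version above depends on the initial sleep so that Hider is already waiting when Seeker sends the first notify. A variant that does not depend on timing can keep the current stage of the game in shared state and wait on that. This is a sketch under that assumption, with hypothetical names (`play`, `advance`, the stage strings), and it collects messages in a list instead of printing them.

```python
import threading


def play():
    cond = threading.Condition()
    state = {"stage": "start"}
    log = []

    def advance(expected, new, message):
        # Wait until the game reaches `expected`, then move it to `new`.
        with cond:
            cond.wait_for(lambda: state["stage"] == expected)
            log.append(message)
            state["stage"] = new
            cond.notify_all()

    def seeker():
        advance("start", "blindfolded", "seeker: eyes covered")
        advance("hidden", "found", "seeker: found you")

    def hider():
        advance("blindfolded", "hidden", "hider: hidden")
        advance("found", "over", "hider: you found me")

    threads = [threading.Thread(target=hider), threading.Thread(target=seeker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because each step checks the stage before proceeding, the order of the messages is the same regardless of which thread starts first.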
