Web Scraping Study Notes: Multithreading

梁会计，不识数。

Category: Scraping Notes. Tags: python, web scraping, multithreading

Article link: https://blog.csdn.net/The_accounting/article/details/98480220

Multithreaded Scraping

Concepts: Multithreading and Multiprocessing

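Process: each task running in the OS is a process.
Thread: a process contains one or more threads, and each thread does one thing.
A process that runs several threads is multithreaded.
A single-core computer can also run multiple processes and threads, because the OS switches between them quickly.
The operating system decides when to switch between threads and processes; the program cannot control this (a drawback).
Thread safety: threads in one process share variables, so several threads modifying the same variable at the same time can corrupt its value (protect it with a lock, or pass data through a queue).
GIL (global interpreter lock): in CPython each interpreter process has one GIL; a thread must hold it to execute Python bytecode, and gives it up when it blocks on I/O or when the interpreter switches to another thread.
Programs are broadly either CPU-bound or I/O-bound.
The longer a program spends waiting on I/O, the more it gains from multithreading (a scraper requesting pages is heavily I/O-bound, so running the requests in multiple threads is more efficient).
Synchronous: each operation must finish before the next one starts.
Asynchronous: an operation is started without waiting for it to finish; in these notes, the main thread starts the child threads together and then waits for all of them to complete.
Difference: while one operation is blocked, the CPU can go on to do other work.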

The threading Module

threading is the Python standard-library module for writing multithreaded code.

A simple example:

import time
import threading


def encoding():
    for i in range(4):
        print("Coding {}".format(i))
        time.sleep(1)

def writing():
    for i in range(4):
        print("Writing {}".format(i))
        time.sleep(1)

def single_thread():
    # Run the two tasks one after the other.
    start = time.time()
    encoding()
    writing()
    end = time.time()
    print("single_thread took", end - start)

def multi_thread():
    start = time.time()
    # Pass the function object as the target argument of Thread to create a thread
    t1 = threading.Thread(target=encoding)
    t2 = threading.Thread(target=writing)
    t1.start()
    t2.start()
    # Wait for both threads to finish so that the elapsed time can be measured
    t1.join()
    t2.join()
    end = time.time()
    print("multi_thread took", end - start)


if __name__ == '__main__':
    single_thread()
    multi_thread()

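Creating Threads with the Thread Class

threading.enumerate() returns a list of the Thread objects that are currently alive (threading.active_count() returns how many there are)
threading.current_thread() returns the Thread object of the thread that calls it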
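
As a rough illustration of these two functions, the sketch below starts three short-lived threads and inspects them from the main thread. The `worker` function and the `demo-N` thread names are made up for this example.

```python
import threading
import time


def worker(seconds):
    # current_thread() returns the Thread object of whichever thread calls it.
    print("running in", threading.current_thread().name)
    time.sleep(seconds)


threads = [
    threading.Thread(target=worker, args=(0.5,), name="demo-%d" % i)
    for i in range(3)
]
for t in threads:
    t.start()

# enumerate() lists every thread that is still alive, including the main thread.
alive_names = [t.name for t in threading.enumerate()]
print(alive_names)
print("active_count:", threading.active_count())

for t in threads:
    t.join()

# After join() the worker threads have finished and are no longer listed.
names_after_join = [t.name for t in threading.enumerate()]
print(names_after_join)
```

The order of the "running in" lines is not fixed, since the OS decides when each thread runs.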
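Subclassing threading.Thread
For better encapsulation, subclass threading.Thread and implement the run method. Calling start() on the thread then runs the code in run in the new thread.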

import threading
import time

class CodingThreading(threading.Thread):
    def run(self):
        for i in range(4):
            print("Coding %s" % threading.current_thread())
            time.sleep(1)

class WritingThreading(threading.Thread):
    def run(self):
        for i in range(4):
            print("Writing %s" % threading.current_thread())
            time.sleep(1)

def main():
    t1 = CodingThreading()
    t2 = WritingThreading()
    t1.start()
    t2.start()


if __name__ == '__main__':
    main()

# [output]:
Coding <CodingThreading(Thread-1, started 10144)>
Writing <WritingThreading(Thread-2, started 10676)>
Coding <CodingThreading(Thread-1, started 10144)>
Writing <WritingThreading(Thread-2, started 10676)>
Coding <CodingThreading(Thread-1, started 10144)>
Writing <WritingThreading(Thread-2, started 10676)>
Coding <CodingThreading(Thread-1, started 10144)>
Writing <WritingThreading(Thread-2, started 10676)>

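Shared Global Variables and Locks

When several threads modify the same variable at the same time, its value can end up wrong. VALUE += 1 is not a single step: the thread reads the value, adds one, and writes the result back, and another thread can run in between. The numbers in the output below come from one run and differ from run to run.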

import threading

# Define a global variable
VALUE = 0

def add_value():
    global VALUE
    for i in range(100000):
        VALUE += 1
    print("value: %d" %VALUE)

def main():
    for i in range(2):
        t = threading.Thread(target=add_value)
        t.start()

if __name__ == '__main__':
    main()

# [output]:
value: 143104
value: 177848

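A lock is only needed where threads modify a shared variable. Threads that only read a variable nobody is modifying do not cause any problem.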

import threading

VALUE = 0
# Add a global lock
glock = threading.Lock()

def add_value():
    global VALUE
    glock.acquire() # acquire the lock
    for i in range(100000):
        VALUE += 1
    glock.release() # release the lock
    print("value: %d" %VALUE)

def main():
    for i in range(2):
        t = threading.Thread(target=add_value)
        t.start()

if __name__ == '__main__':
    main()

# [output]:
value: 100000
value: 200000
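
A lock can also be used as a context manager, which releases it even if the protected code raises an exception. The following is a minimal sketch of the same counter written that way; the `run` helper and the `lock` name are introduced here for illustration and are not part of the original example.

```python
import threading

VALUE = 0
lock = threading.Lock()


def add_value(times):
    global VALUE
    for _ in range(times):
        # "with lock" acquires the lock on entry and releases it on exit,
        # including when the body raises an exception.
        with lock:
            VALUE += 1


def run(thread_count=2, times=100000):
    """Reset the counter, increment it from several threads, return the total."""
    global VALUE
    VALUE = 0
    threads = [
        threading.Thread(target=add_value, args=(times,))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return VALUE


print(run())
```

Here the lock is held for one increment at a time rather than for the whole loop, so the threads interleave, yet the final total is still exact.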

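The Producer-Consumer Pattern

Lock Version of the Producer-Consumer Pattern

The producer-consumer pattern is common in multithreaded development. Producer threads generate data and store it in an intermediate variable, and consumer threads take the data out of that variable and use it. These intermediate variables are usually global, so a lock is needed to keep the data consistent.

A simple example: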

import threading
import random
import time

gMONEY = 1000
gLOCK = threading.Lock()
gTOTAL = 10
gTIMES = 0

class Producer(threading.Thread):
    def run(self):
        global gMONEY
        global gTIMES
        while True:
            money = random.randint(100, 1000)
            gLOCK.acquire()
            # Check the counter while holding the lock so that the producers
            # together produce exactly gTOTAL times
            if gTIMES >= gTOTAL:
                gLOCK.release()
                break
            gMONEY += money
            print("%s produced %d yuan, total %d yuan" % (threading.current_thread(), money, gMONEY))
            gTIMES += 1
            gLOCK.release()
            time.sleep(1)

class Consumer(threading.Thread):
    def run(self):
        global gMONEY
        while True:
            money = random.randint(100, 1000)
            gLOCK.acquire()
            if gMONEY >= money: # check whether there is enough money to spend
                gMONEY -= money
                print("%s spent %d yuan, %d yuan left" % (threading.current_thread(), money, gMONEY))
            elif gTIMES >= gTOTAL: # check whether the producers have stopped producing
                gLOCK.release()
                break
            else:
                print("The card is maxed out!")
            gLOCK.release()
            time.sleep(1)

def main():
    for i in range(3):
        c = Consumer(name="Consumer thread %d" % i)
        c.start()

    for i in range(4):
        p = Producer(name="Producer thread %d" % i)
        p.start()


if __name__ == '__main__':
    main()

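The Lock version of the producer-consumer pattern runs correctly, but it has a shortcoming: each consumer keeps looping and acquires and releases the lock on every pass just to check whether there is enough money, even when nothing has changed. That repeated polling wastes CPU time.

Condition Version of the Producer-Consumer Pattern

To address this shortcoming of the Lock version, use threading.Condition. A thread can block on a Condition while there is no suitable data. Once the data is available, another thread calls one of the notify functions to wake the waiting threads. This avoids pointless lock and unlock operations and improves performance.
threading.Condition wraps a lock (an RLock by default) rather than inheriting from threading.Lock, so it can be locked while global data is modified and unlocked afterwards. Some commonly used methods:
1. acquire: acquire the lock.
2. release: release the lock.
3. wait: block the current thread and release the lock while it waits. The thread can be woken by another thread calling notify or notify_all.
4. notify: wake one of the threads that are in the wait state (one thread by default).
5. notify_all: wake all threads that are in the wait state. notify and notify_all do not release the lock, and must be called before release.
Condition version of the code: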

import threading
import random
import time

gMONEY = 1000
gCONDITION = threading.Condition()
gTOTAL = 10
gTIMES = 0

class Producer(threading.Thread):
    def run(self):
        global gMONEY
        global gTIMES
        while True:
            money = random.randint(100, 1000)
            gCONDITION.acquire()
            # Check the counter while holding the lock so that the producers
            # together produce exactly gTOTAL times
            if gTIMES >= gTOTAL:
                gCONDITION.release()
                break
            gMONEY += money
            print("%s produced %d yuan, total %d yuan" % (threading.current_thread(), money, gMONEY))
            gTIMES += 1
            # Call notify_all to wake every thread that is blocked in wait()
            gCONDITION.notify_all()
            gCONDITION.release()
            time.sleep(1)

class Consumer(threading.Thread):
    def run(self):
        global gMONEY
        while True:
            money = random.randint(100, 1000)
            gCONDITION.acquire()
            # With a plain "if", the condition might still be False after being notified,
            # so re-checking it in a while loop is safer
            while gMONEY < money:
                if gTIMES >= gTOTAL:
                    gCONDITION.release()
                    # break would only leave this inner while loop;
                    # return leaves the function and so stops the outer loop too
                    return
                print("The card is maxed out!")
                # Block this thread until a producer notifies it
                gCONDITION.wait()
            gMONEY -= money
            print("%s spent %d yuan, %d yuan left" % (threading.current_thread(), money, gMONEY))
            gCONDITION.release()
            time.sleep(1)

def main():
    for i in range(3):
        c = Consumer(name="Consumer thread %d" % i)
        c.start()

    for i in range(4):
        p = Producer(name="Producer thread %d" % i)
        p.start()

if __name__ == '__main__':
    main()
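
The acquire/wait/release sequence can also be written with a `with` block and `Condition.wait_for`, which re-checks a predicate each time the thread is woken. Below is a small sketch of that style. The `balance`, `finished` and `spent` names and the fixed amounts are assumptions chosen for this example, not part of the code above.

```python
import threading

balance = 0
finished = False
spent = []
cond = threading.Condition()


def producer(amounts):
    global balance, finished
    for amount in amounts:
        with cond:  # acquires the underlying lock, releases it on exit
            balance += amount
            cond.notify_all()
    with cond:
        finished = True
        cond.notify_all()  # wake any consumer that is still waiting


def consumer(price):
    global balance
    with cond:
        # wait_for releases the lock while blocked and re-checks the
        # predicate after every wake-up, like the while loop around wait().
        cond.wait_for(lambda: balance >= price or finished)
        if balance >= price:
            balance -= price
            spent.append(price)


consumers = [threading.Thread(target=consumer, args=(p,)) for p in (300, 500, 200)]
for c in consumers:
    c.start()
p = threading.Thread(target=producer, args=([100, 200, 300, 400],))
p.start()
for t in consumers + [p]:
    t.join()

print(sorted(spent), balance)
```

The deposits add up to exactly the sum of the three prices, so every consumer is able to buy and the balance ends at zero; the order in which they buy can vary.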

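Queue: Thread-Safe Queues

queue is a built-in Python module that provides synchronized, thread-safe queue classes, including the FIFO (first in, first out) queue Queue and the LIFO (last in, first out) queue LifoQueue. These queues use locking primitives internally, so they can be used to synchronize threads.

Queue(maxsize) creates a Queue instance with the given maximum size
qsize() returns the number of items in the queue
empty() checks whether the queue is empty
full() checks whether the queue is full
get(block=True) removes and returns an item from the queue (for Queue, the one that was put in first). The block parameter sets whether the calling thread blocks until an item is available; the default is True.
put(item, block=True) puts an item into the queue, blocking by default while the queue is full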
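
The methods listed above can be tried out in a single thread. This is a minimal sketch; the variable names are chosen for the example.

```python
from queue import Queue, LifoQueue, Empty

fifo = Queue(maxsize=3)
for item in ("a", "b", "c"):
    fifo.put(item)

size_when_full = fifo.qsize()
was_full = fifo.full()
first_out = fifo.get()  # FIFO: the item that was put in first comes out first

lifo = LifoQueue()
for item in ("a", "b", "c"):
    lifo.put(item)
last_in_first_out = lifo.get()  # LIFO: the most recently added item comes out first

# With block=False, get() raises queue.Empty instead of waiting for an item.
empty_queue = Queue()
try:
    empty_queue.get(block=False)
    raised_empty = False
except Empty:
    raised_empty = True

print(size_when_full, was_full, first_out, last_in_first_out, raised_empty)
```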

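Multithreaded Scraper Example

A short summary:

Every thread class is a subclass of threading.Thread that overrides the run method
A multithreaded scraper splits the whole project into two parts: producers and consumers
Producers fetch data and put it into a Queue
Consumers take data from the queue and process it, producing the final result we want
In this example, the producers collect image URLs from the website and the consumers download the images to the local disk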

import requests
from lxml import etree
from threading import Thread
from queue import Queue, Empty


class Producer(Thread):
    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'
        }

    def run(self):
        while True:
            # get_nowait() raises Empty once every page URL has been taken; the producer then exits.
            # Checking empty() and then calling get() is not safe, because another
            # producer can take the last URL between the two calls.
            try:
                url = self.page_queue.get_nowait()
            except Empty:
                break
            self.get_page(url=url)

    def get_page(self, url):
        res = requests.get(url=url, headers=self.headers, timeout=10)
        print("Requesting doutula\t%d" % res.status_code)
        text = res.text
        html = etree.HTML(text)
        div = html.xpath("//div[@class='page-content text-center']//img[starts-with(@class,'img')]/@data-original")
        for img in div:
            self.img_queue.put(img)


class Consumer(Thread):
    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'
        }

    def run(self):
        while True:
            # Wait a few seconds for an image URL. If none arrives and the page queue
            # is also empty, assume all downloads are done and let this consumer exit.
            # The timeout is a heuristic, not a guarantee.
            try:
                img = self.img_queue.get(timeout=5)
            except Empty:
                if self.page_queue.empty():
                    break
                continue
            self.get_img(img)

    def get_img(self, img):
        res = requests.get(url=img, headers=self.headers, timeout=10)
        print("Downloading image\t%d" % res.status_code)
        jpg = res.content
        # Use the last 20 characters of the URL as the file name
        with open(img[-20:], 'wb') as file:
            file.write(jpg)


def main():
    page_queue = Queue(100) # 100 pages are requested in total, so the queue can hold all of them without blocking
    img_queue = Queue(1000) # choose a maximum capacity that seems reasonable
    for i in range(1, 101):
        url = "https://www.doutula.com/photo/list/?page={}".format(i)
        page_queue.put(url)

    for i in range(5):
        t = Producer(page_queue, img_queue)
        t.start()

    for i in range(5):
        t = Consumer(page_queue, img_queue)
        t.start()


if __name__ == '__main__':
    main()
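
Deciding when the consumers may stop is the delicate part of this design, because an empty queue does not by itself mean that no more work is coming. One common alternative is to push a sentinel value through the queues once the upstream work is finished. The sketch below shows that idea without any network access: `fake_parse`, `STOP` and `crawl` are hypothetical names, and `fake_parse` merely stands in for requesting a page and extracting image URLs.

```python
import threading
from queue import Queue

STOP = object()  # sentinel: tells a worker that no more items will arrive


def fake_parse(page):
    # Hypothetical stand-in for fetching a page and extracting image URLs.
    return ["page%d-img%d.jpg" % (page, n) for n in range(3)]


def producer(page_queue, img_queue):
    while True:
        page = page_queue.get()
        if page is STOP:
            break
        for img in fake_parse(page):
            img_queue.put(img)


def consumer(img_queue, downloaded, lock):
    while True:
        img = img_queue.get()
        if img is STOP:
            break
        with lock:  # protect the shared result list
            downloaded.append(img)


def crawl(pages, producer_count=2, consumer_count=3):
    page_queue, img_queue = Queue(), Queue()
    downloaded, lock = [], threading.Lock()
    for page in pages:
        page_queue.put(page)
    for _ in range(producer_count):
        page_queue.put(STOP)  # one sentinel per producer, after all pages

    producers = [threading.Thread(target=producer, args=(page_queue, img_queue))
                 for _ in range(producer_count)]
    consumers = [threading.Thread(target=consumer, args=(img_queue, downloaded, lock))
                 for _ in range(consumer_count)]
    for t in producers + consumers:
        t.start()

    for t in producers:
        t.join()
    # Every producer has finished, so no more images will be queued:
    # now it is safe to tell each consumer to stop.
    for _ in range(consumer_count):
        img_queue.put(STOP)
    for t in consumers:
        t.join()
    return downloaded


result = crawl(range(1, 5))
print(len(result))
```

Because each worker only exits after reading its own sentinel, no thread is left blocked in get() and no image queued before the sentinels is skipped.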

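GIL: The Global Interpreter Lock

When the CPython interpreter runs multiple threads, it can use only one core of a multi-core CPU: only one thread executes Python bytecode at any given time.
To enforce this, CPython has the GIL (Global Interpreter Lock). The GIL is needed because CPython's memory management is not thread-safe. Not every Python interpreter has a GIL, but they are not listed one by one here.
Even so, Python multithreading is still efficient for I/O-bound work, because a thread releases the GIL while it waits on I/O. For CPU-bound work, multiple processes can be used instead to improve efficiency.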
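
For the CPU-bound case, a rough sketch using multiprocessing.Pool might look like the following. The prime-counting task, the function names and the chunking scheme are assumptions made for illustration; any pure-Python computation could take their place.

```python
from multiprocessing import Pool


def count_primes(bounds):
    """Count primes in [start, stop) by trial division (deliberately CPU-heavy)."""
    start, stop = bounds
    count = 0
    for n in range(max(start, 2), stop):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limit, workers=4):
    # Split [0, limit) into one chunk per worker; each chunk runs in its own
    # process, so each has its own interpreter and its own GIL.
    step = limit // workers + 1
    chunks = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    with Pool(workers) as pool:
        return sum(pool.map(count_primes, chunks))


if __name__ == "__main__":
    # The guard matters: on platforms that start workers by importing this
    # module again, unguarded code would be re-executed in every worker.
    print(count_primes_parallel(100000))
```

Whether this is actually faster than a single process depends on the number of cores and on the cost of starting the worker processes, so it is worth timing on the target machine.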