Learning Python Multithreaded Crawling and the Problems I Ran Into

村长诚不欺我

Article link: https://blog.csdn.net/weixin_42240407/article/details/89347902

The threading module

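In Python, threading is the standard-library module for multithreaded programming, and its most commonly used class is Thread. Multithreading lets a program make progress on several tasks concurrently, improving overall efficiency through better use of resources.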

Creating a thread:

threading.Thread(target)
"target" is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
Getting the current thread's name:
threading.current_thread() returns the Thread object for the current thread; its name attribute holds the thread's name.
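
As a minimal sketch of the two calls above: a thread is created with a target function, and the target reads its own name through threading.current_thread(). The names worker, run_demo and "demo-worker" are made up for this illustration.

```python
import threading

def worker(results):
    # current_thread() returns the Thread object that is running this code
    results.append(threading.current_thread().name)

def run_demo():
    results = []
    t = threading.Thread(target=worker, args=(results,), name="demo-worker")
    t.start()
    t.join()  # wait for the thread to finish before reading results
    return results

if __name__ == "__main__":
    print(run_demo())
```

Passing name= is optional; without it Python assigns a default name to the thread.
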
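Subclassing threading.Thread:
To encapsulate thread code more cleanly, subclass the Thread class from the threading module and implement the run method. Once start() is called, the thread automatically runs the code in run.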

import threading
import time

class CodingThreading(threading.Thread):

    def run(self):
        for x in range(10):
            print("coding…… %s" % threading.current_thread())
            time.sleep(1)

class DrawingThread(threading.Thread):

    def run(self):
        for x in range(10):
            print("drawing…… %s" % threading.current_thread())

def main():
    t1 = CodingThreading()
    t2 = DrawingThread()

    t1.start()
    t2.start()

if __name__ == '__main__':
    main()

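Sharing global variables between threads:
All threads of a program run inside the same process, so every thread can share that process's global variables. The catch is that the order in which threads run is not deterministic, and tickets += 1 is a read-modify-write sequence rather than one atomic step, so two threads can overwrite each other's updates and produce wrong data. For example: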

import threading

tickets = 0

def get_ticket():
    global tickets
    for x in range(1000000):
        tickets += 1
    print("ticket:%d" % tickets)

def main():
    for x in range(2):
        t = threading.Thread(target=get_ticket)
        t.start()

if __name__ == '__main__':
    main()

Locking:
Because concurrent modification by several threads can corrupt shared data, guard the operation with a lock:

import threading

tickets = 0
gLock = threading.Lock()

def get_ticket():
    global tickets
    gLock.acquire()  # acquire the lock so only one thread updates tickets at a time
    for x in range(1000000):
        tickets += 1
    gLock.release()  # release the lock so the next thread can proceed
    print("ticket:%d" % tickets)

def main():
    for x in range(2):
        t = threading.Thread(target=get_ticket)
        t.start()

if __name__ == '__main__':
    main()
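
A hedged variation on the locked counter above: the same idea written with a with block, which releases the lock even if the body raises, plus join() so the total is read only after every thread has finished. The function name count_with_lock and its parameters are made up for this sketch.

```python
import threading

def count_with_lock(n_threads=2, n_increments=100_000):
    total = 0
    lock = threading.Lock()

    def add():
        nonlocal total
        for _ in range(n_increments):
            with lock:  # acquired here, released automatically on exit
                total += 1

    threads = [threading.Thread(target=add) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait so the final value is read after all threads finish
    return total

if __name__ == "__main__":
    print(count_with_lock())
```

Locking around each single increment is slower than locking around the whole loop, but it lets the threads interleave; which granularity is right depends on the workload.
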

Producer-consumer pattern with Lock:

import threading
import random
import time

globalMoney = 1000
globalLock = threading.Lock()
produceTimes = 10

class Producer(threading.Thread):
    def run(self):
        global globalMoney, produceTimes
        while True:
            money = random.randint(100, 1000)
            globalLock.acquire()
            if produceTimes == 0:
                globalLock.release()
                break
            globalMoney += money
            produceTimes -= 1
            print("%s produced %d yuan, balance is now %d yuan" % (threading.current_thread(), money, globalMoney))
            # if produceTimes == 0:
            #     globalLock.release()
            #     break
            globalLock.release()
            time.sleep(1)

class Consumer(threading.Thread):
    def run(self):
        global globalMoney
        while True:
            money = random.randint(100, 1000)
            globalLock.acquire()
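            if money <= globalMoney:
                globalMoney -= money
                print("%s spent %d yuan, %d yuan left" % (threading.current_thread(), money, globalMoney))
            else:
                # Quit only once the producers are done; otherwise report and retry later
                if produceTimes == 0:
                    globalLock.release()
                    break
                print("%s wanted to spend %d yuan but only %d yuan is left: insufficient balance" % (threading.current_thread(), money, globalMoney))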
            globalLock.release()
            time.sleep(1)

def main():
    for x in range(3):
        y = Consumer(name="consumer%d" % x)
        y.start()

    for x in range(5):
        t = Producer(name="producer%d" % x)
        t.start()

if __name__ == '__main__':
    main()

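Producer-consumer pattern with Condition:
The Lock version of the producer-consumer pattern runs, but it has a weakness: each consumer sits in a while True loop, taking the lock over and over just to check whether there is enough money. Repeatedly locking and unlocking only to poll wastes CPU time, so this is not the best approach.
A better one is threading.Condition. A thread using threading.Condition can block and wait while there is no suitable data, and once suitable data arrives another thread can wake the waiting threads with the notify family of functions. That avoids the pointless lock and unlock cycles and can improve the program's performance. First, the relevant functions: threading.Condition is similar to threading.Lock in that it can be locked before modifying global data and unlocked once the modification is done. The commonly used functions are:

acquire: lock.
release: unlock.
wait: puts the current thread into the waiting state and releases the lock. Another thread can wake it with notify or notify_all; once woken, it waits to re-acquire the lock and then continues with the code after wait.
notify: wakes one waiting thread by default (in CPython, the one that has been waiting longest).
notify_all: wakes all waiting threads. Neither notify nor notify_all releases the lock, and both must be called before release, that is, while the lock is still held.
Code for the Condition version of the producer-consumer pattern: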

import threading
import random
import time

globalMoney = 1000
globalCondition = threading.Condition()
globalTimes = 0
globalTotalTimes = 10

class Producer(threading.Thread):
    def run(self):
        global globalMoney, globalTimes
        while True:
            money = random.randint(100, 1000)
            globalCondition.acquire()
            if globalTimes >= globalTotalTimes:
                globalCondition.release()
                break
            globalMoney += money
            print("%s produced %d yuan, balance is now %d yuan" % (threading.current_thread(), money, globalMoney))
            globalTimes += 1
            globalCondition.notify_all()
            globalCondition.release()
            time.sleep(0.5)

class Consumer(threading.Thread):
    def run(self):
        global globalMoney
        while True:
            money = random.randint(100, 1000)
            globalCondition.acquire()
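            # A plain `if` would check the balance only once: after wait() returns the thread
            # would go straight on to spend, even if the balance is still insufficient.
            # if money > globalMoney:
            #     globalCondition.wait()
            while money > globalMoney:  # `while` re-checks the balance every time the thread is woken
                if globalTimes >= globalTotalTimes:  # producers are finished, so stop waiting
                    globalCondition.release()
                    return
                print("%s wants to spend %d yuan but only %d yuan is left: insufficient balance" % (threading.current_thread(), money, globalMoney))
                globalCondition.wait()  # not enough money: wait until a producer notifies
            globalMoney -= money
            print("%s spent %d yuan, %d yuan left" % (threading.current_thread(), money, globalMoney))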
            globalCondition.release()
            time.sleep(0.5)

def main():
    for i in range(3):
        y = Consumer(name="consumer%d" % i)
        y.start()
    for i in range(5):
        t = Producer(name="producer%d" % i)
        t.start()

if __name__ == '__main__':
    main()

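When I got this far I was puzzled. After a thread has gone into wait and is woken by notify_all, it looked to me as if checking the condition with if made the blocked thread join some later queue instead of running right away, while checking with a while loop made the woken thread run immediately. So what is the actual difference between if and while here?
So I went and asked on Zhihu... emmmmm, a very basic question, but in this setting I just couldn't see it. The short version: wait() returns only after the thread has re-acquired the lock, and by then another consumer may already have spent the money (or the amount produced may still be too small), so the condition has to be checked again. while does that re-check; if does not.
[Image: the answer from Zhihu]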
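
To make the if-versus-while point concrete, here is a small sketch under simplified assumptions: several consumers wait for items, and each notify_all wakes all of them even though only one item was added. The names consume_all, items and taken are invented for this example and are not from the code above.

```python
import threading

def consume_all(n_consumers=3):
    items = []   # shared buffer
    taken = []   # what the consumers received
    cond = threading.Condition()

    def consumer():
        with cond:
            # Re-check after every wake-up: another consumer may have
            # emptied the buffer before this thread re-acquired the lock.
            while not items:
                cond.wait()
            taken.append(items.pop())

    threads = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for i in range(n_consumers):
        with cond:
            items.append(i)
            cond.notify_all()  # wakes every waiter, but only one item exists
    for t in threads:
        t.join()
    return sorted(taken)

if __name__ == "__main__":
    print(consume_all())
```

If the while were replaced by if, a consumer woken after another one had already taken the item would call pop() on an empty list and raise IndexError.
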

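Queue, a thread-safe queue:
Threads that access global variables end up locking all the time. If you want to store data in a queue instead, Python has a built-in thread-safe module for that: queue. The queue module provides synchronized, thread-safe queue classes, including the FIFO (first in, first out) queue Queue and the LIFO (last in, first out) queue LifoQueue. These queues implement the locking primitives internally (each operation is atomic: it either does not happen at all or happens completely), so they can be used directly from multiple threads, and a queue can be used to synchronize threads. The relevant functions:

Queue(maxsize): creates a first-in, first-out queue.
qsize(): returns the size of the queue.
empty(): checks whether the queue is empty.
full(): checks whether the queue is full.
get(): removes and returns an item from the queue; for a FIFO Queue this is the item that was put in earliest.
put(): puts an item into the queue.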
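
The functions listed above can be tried out in a few lines. This is only an illustrative sketch; queue_demo and the sample items are made up.

```python
from queue import Queue, LifoQueue

def queue_demo():
    fifo = Queue(maxsize=3)
    lifo = LifoQueue()
    for item in ("a", "b", "c"):
        fifo.put(item)
        lifo.put(item)
    # Inspect the FIFO queue while it holds three items
    state = (fifo.qsize(), fifo.full(), fifo.empty())
    first_out = fifo.get()  # FIFO: the item that was put in earliest
    last_in = lifo.get()    # LIFO: the item that was put in most recently
    return state, first_out, last_in

if __name__ == "__main__":
    print(queue_demo())
```

By default put() blocks when the queue is full and get() blocks when it is empty, which is what lets producers and consumers pace each other.
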
Downloading memes with multiple threads using the producer-consumer pattern:

from lxml import etree
import requests
from urllib import request
import re
from queue import Queue
import threading

PAGE_NUM = 51  # Crawl 50 pages

class Producer(threading.Thread):
    headers = {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    }

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse_html(url=url)

    def parse_html(self, url):
        """
        Parse a page and queue the images found on it.
        :param url: URL of the page to parse
        :return: None
        """
        response = requests.get(url=url, headers=self.headers)
        text = response.content.decode(encoding="utf-8")
        html = etree.HTML(text=text)
        imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
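        # # frame_pic = html.xpath("//div[@class='page-content text-center']/div")[0]
        # frame_pic = html.xpath("//*[@id='pic-detail']/div/div[3]/div[2]/ul/li/div")[0]
        # """
        # Even though frame_pic is the div I wanted, matching img tags under it still
        # matched img tags under other divs. The cause: an XPath starting with // searches
        # the whole document, not just frame_pic; .//img restricts it to that element.
        # # What I wrote at first
        # img_tags = frame_pic.xpath("//img[@class!='gif']")
        # Since that also matched img tags under other divs, I added more class restrictions on img
        # """
        # img_tags = frame_pic.xpath(
        #     "//img[@class!='gif' and @class!='gif' and @class!='img-responsive' and @class!='footer-logo']"
        # )
        for img_tag in imgs:
            pic_url = img_tag.get("data-original")  # image URL
            if not pic_url:
                continue  # skip img tags without a data-original attribute
            match = re.search(r"(\.bmp|\.jpg|\.png|\.tif|\.gif|\.jpeg)!", pic_url)  # file extension
            if match is None:
                continue  # skip URLs that do not match the expected pattern
            suffix = match.group(1)
            pic_name = img_tag.get("alt")  # image name
            pic_name = re.sub(r'[\\/*?":|<>]', "", pic_name)  # strip characters that are illegal in Windows file names
            filename = pic_name + suffix  # image name plus extension
            self.img_queue.put((pic_url, filename))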

class Consumer(threading.Thread):
    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
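            if self.img_queue.empty() and self.page_queue.empty():
                break
            pic_url, filename = self.img_queue.get()  # use the instance's queue, not the module-level global
            request.urlretrieve(pic_url, "images/%s" % filename)  # the images/ directory must already exist
            print(filename + " downloaded")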

if __name__ == '__main__':
    page_queue = Queue(100)
    img_queue = Queue(1000)
    for x in range(1, PAGE_NUM):
        page_url = "https://www.doutula.com/photo/list/?page=%d" % x
        page_queue.put(page_url)

    for x in range(5):
        t = Producer(page_queue=page_queue, img_queue=img_queue, name="producer-%s" % x)
        t.start()
    for x in range(5):
        y = Consumer(page_queue=page_queue, img_queue=img_queue, name="consumer-%s" % x)
        y.start()
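
One weak spot in the script above is the check of empty() followed by get(): another thread can take the last item in between, leaving get() blocked forever, and a consumer can also exit early while producers are still parsing. A common alternative, sketched here without any network access, is to push a sentinel value per worker once the work is finished. This is not the original author's method; STOP, run_pipeline and the fake "images" are assumptions made for illustration.

```python
import threading
from queue import Queue

STOP = object()  # hypothetical sentinel meaning "no more work"

def run_pipeline(pages, n_producers=2, n_consumers=3):
    page_queue, img_queue = Queue(), Queue()
    downloaded = []
    lock = threading.Lock()

    def producer():
        while True:
            page = page_queue.get()
            if page is STOP:
                break
            # Stand-in for parse_html: pretend each page yields two images
            for i in range(2):
                img_queue.put("%s-img%d" % (page, i))

    def consumer():
        while True:
            img = img_queue.get()
            if img is STOP:
                break
            with lock:
                downloaded.append(img)  # stand-in for the real download

    producers = [threading.Thread(target=producer) for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for page in pages:
        page_queue.put(page)
    for _ in producers:
        page_queue.put(STOP)  # one sentinel per producer
    for t in producers:
        t.join()              # every image is queued once producers finish
    for _ in consumers:
        img_queue.put(STOP)   # one sentinel per consumer
    for t in consumers:
        t.join()
    return sorted(downloaded)

if __name__ == "__main__":
    print(run_pipeline(["p1", "p2"]))
```

Because each worker blocks on get() until it receives either real work or its own sentinel, no thread has to guess from empty() whether more items are still on the way.
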