Advanced Web Scraping (16. Multithreaded and Multiprocess Crawlers)

Latest recommended article published on 2024-07-15 08:30:00

川野先生

Category column: Advanced Web Scraping Case Tutorials. Tags: web scraping, python

Article link: https://blog.csdn.net/to_upper/article/details/124205268

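Overview

Differences between processes and threads
Implementing threads in Python
Passing arguments to threads
Thread classes
Synchronizing threads with locks and semaphores
Producers and consumers
Implementing multiple processes
Real projects showing how to build crawlers with multiple threads and multiple processes

Hands-on examples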

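16.1 single_thread: a single thread

This example calls two functions, fun1 and fun2, from a single thread. Each function sleeps for a fixed time, so the two calls run one after the other and the total running time is the sum of the two sleeps (about 6 seconds).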

from time import sleep, ctime
def fun1():
    print('fun1 started:', ctime())
    # Sleep for 4 seconds
    sleep(4)
    print('fun1 finished:', ctime())
def fun2():
    print('fun2 started:', ctime())
    # Sleep for 2 seconds
    sleep(2)
    print('fun2 finished:', ctime())
def main():
    print('Start time:', ctime())
    # Call fun1 and fun2 from a single thread
    fun1()
    fun2()
    print('End time:', ctime())
if __name__ == '__main__':
    main()

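16.2 _thread: the low-level thread library

The start_new_thread function in the _thread module starts a thread immediately. Its first argument is the function to run in the new thread, called the thread function here; it is called automatically when the thread starts. Its second argument holds the arguments to pass to the thread function and must be a tuple.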

import _thread as thread
from time import sleep, ctime
def fun1():
    print('fun1 started:', ctime())
    # Sleep for 4 seconds
    sleep(4)
    print('fun1 finished:', ctime())
def fun2():
    print('fun2 started:', ctime())
    # Sleep for 2 seconds
    sleep(2)
    print('fun2 finished:', ctime())
def main():
    print('Start time:', ctime())
    # Start a thread that runs fun1
    thread.start_new_thread(fun1, ())
    # Start a thread that runs fun2
    thread.start_new_thread(fun2, ())
    # Sleep for 6 seconds so the main thread stays alive until both threads have finished
    sleep(6)
    print('End time:', ctime())
if __name__ == '__main__':
    main()

The output shows that fun2 does not sit idle during the 4 seconds fun1 spends sleeping: it runs in the meantime and finishes before fun1.
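
The difference between 16.1 and 16.2 can be measured directly. The following is a minimal sketch, not part of the original example: the helper names run_sequential and run_threaded and the short delays are assumptions made for illustration, and it waits with threading.Thread.join (covered in 16.5) instead of a fixed sleep.

```python
import threading
from time import perf_counter, sleep

def run_sequential(delays):
    """Sleep for each delay in turn and return the elapsed seconds."""
    start = perf_counter()
    for sec in delays:
        sleep(sec)
    return perf_counter() - start

def run_threaded(delays):
    """Sleep for all delays concurrently and return the elapsed seconds."""
    start = perf_counter()
    threads = [threading.Thread(target=sleep, args=(sec,)) for sec in delays]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return perf_counter() - start

if __name__ == '__main__':
    # Hypothetical short delays standing in for the 4 s and 2 s sleeps
    delays = [0.3, 0.2]
    print('sequential:', round(run_sequential(delays), 2))
    print('threaded:  ', round(run_threaded(delays), 2))
```

The sequential time should be close to the sum of the delays, and the threaded time close to the longest single delay.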

16.3 multi_thread_args: threads with arguments

A for loop calls start_new_thread to start 8 threads, passing different argument values to each thread function, which prints the values it receives.

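import random
from time import sleep
import _thread as thread
# Thread function; a and b are the arguments passed in through start_new_thread
def fun(a, b):
    print(a, b)
    # Sleep for a random time (1 to 5 seconds; randint includes both ends)
    sleep(random.randint(1, 5))
# Start 8 threads
for i in range(8):
    # Pass 2 argument values to each thread function
    thread.start_new_thread(fun, (i + 1, 'a' * (i + 1)))
# Wait for a line of input on the terminal so the main thread stays alive while the threads run
input()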

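Conclusion: the threads run concurrently and compete for the CPU and for standard output, so the order of the printed lines can differ from run to run.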
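
A common slip when passing arguments is to write (x) where a one-element tuple (x,) is needed. The sketch below is an illustration rather than part of the original example: the names demo, worker and label are made up, and a lock is used only so the function can wait for the worker. It also shows the optional third argument of start_new_thread, a dictionary of keyword arguments.

```python
import _thread as thread

def demo():
    """Return (rejected, seen): whether a non-tuple was rejected, and what the worker recorded."""
    seen = []
    done = thread.allocate_lock()
    done.acquire()  # held until the worker releases it

    def worker(value, label='item'):
        seen.append((label, value))
        done.release()

    try:
        # (1) is just the integer 1, not a tuple, so this call is rejected
        thread.start_new_thread(worker, (1))
        rejected = False
    except TypeError:
        rejected = True

    # (1,) is a one-element tuple; the third argument is a dict of keyword arguments
    thread.start_new_thread(worker, (1,), {'label': 'page'})
    done.acquire()  # block until the worker has run
    return rejected, seen

if __name__ == '__main__':
    print(demo())
```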

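16.4 lock: thread locks

How to use a lock

The allocate_lock function creates a lock object, and the lock's acquire method acquires the lock.
When the lock is no longer needed, release it with the release method.
To check whether the lock is currently held, use the locked method.

The example starts 2 threads and creates 2 locks. Each lock is acquired before its thread function runs, so both locks start out locked, and each lock is passed to its own thread at start-up.
When a thread function finishes, it calls release on its lock. At the end of main, a while loop uses locked to check whether both locks have been released.
As long as either lock is still held, the loop keeps running; once both are released, the program ends immediately.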

import _thread as thread
from time import sleep, ctime
# Thread function; index is an integer label, sec is the sleep time in seconds, lock is the lock object
def fun(index, sec, lock):
    print('Started {} at: {}'.format(index, ctime()))
    # Sleep for sec seconds
    sleep(sec)
    print('Finished {} at: {}'.format(index, ctime()))
    # Release the lock to signal that this thread is done
    lock.release()

def main():
    lock1 = thread.allocate_lock()
    # Acquire the lock, i.e. lock it
    lock1.acquire()
    # Start the first thread: 10 is the label, 4 is the sleep time, lock1 is its lock
    thread.start_new_thread(fun, (10, 4, lock1))

    lock2 = thread.allocate_lock()
    lock2.acquire()
    # Start the second thread: 20 is the label, 2 is the sleep time, lock2 is its lock
    thread.start_new_thread(fun, (20, 2, lock2))
    # Use a while loop and locked() to check whether lock1 and lock2 have been released
    # The loop keeps spinning as long as either lock is still held
    while lock1.locked() or lock2.locked():
        pass

if __name__ == '__main__':
    main()
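
The while loop above keeps the CPU busy while it waits. One possible alternative, sketched here with made-up helper names (worker, run) and short delays, is to have the main thread call acquire on each lock a second time: that call blocks without spinning until the worker thread releases the lock.

```python
import _thread as thread
from time import sleep

def worker(index, sec, lock, finished):
    sleep(sec)
    finished.append(index)
    # Releasing the lock signals the main thread that this worker is done
    lock.release()

def run(delays):
    """Start one thread per delay and block until all of them have finished."""
    finished = []
    locks = []
    for index, sec in enumerate(delays):
        lock = thread.allocate_lock()
        lock.acquire()
        locks.append(lock)
        thread.start_new_thread(worker, (index, sec, lock, finished))
    for lock in locks:
        # Blocks (without busy-waiting) until the matching worker calls release
        lock.acquire()
    return finished

if __name__ == '__main__':
    print(run([0.2, 0.05]))
```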

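16.5 threading: another thread library

An instance of the Thread class in the threading module is an object that runs a thread.
The _thread module can be seen as the procedural interface to threads, and the Thread class as the object-oriented one.
The target keyword argument of the Thread constructor specifies the thread function, and the args keyword argument specifies the arguments passed to it. Calling start then starts the thread.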

import threading
from time import sleep, ctime
# Thread function; index is an integer label, sec is the sleep time in seconds
def fun(index, sec):
    print('Started {} at: {}'.format(index, ctime()))
    # Sleep for sec seconds
    sleep(sec)
    print('Finished {} at: {}'.format(index, ctime()))

def main():
    # Create the first Thread object; target names the thread function fun, args passes label 10 and a 4 s sleep
    thread1 = threading.Thread(target=fun, args=(10, 4))
    thread1.start()
    # Create the second Thread object; target names the thread function fun, args passes label 20 and a 2 s sleep
    thread2 = threading.Thread(target=fun, args=(20, 2))
    thread2.start()
    # Wait for the first thread to finish
    thread1.join()
    # Wait for the second thread to finish
    thread2.join()

if __name__ == '__main__':
    main()

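Summary: join waits for a thread to finish, so there is no need to create and release locks by hand just to wait for completion. This makes threading more convenient than _thread.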
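
With more than a couple of threads it is usually easier to keep them in a list. The sketch below is an illustration with assumed names (run_all, short delays); it also shows that join accepts a timeout and that is_alive reports whether a thread is still running.

```python
import threading
from time import sleep

def run_all(delays):
    """Run one thread per delay; return (finished, still_running, alive_flags)."""
    finished = []

    def fun(index, sec):
        sleep(sec)
        finished.append(index)

    threads = [threading.Thread(target=fun, args=(i, sec))
               for i, sec in enumerate(delays)]
    for t in threads:
        t.start()
    # join(timeout=...) returns after at most that long, even if the thread is not done
    threads[0].join(timeout=0.01)
    still_running = threads[0].is_alive()
    # join() without a timeout waits until the thread has finished
    for t in threads:
        t.join()
    return finished, still_running, [t.is_alive() for t in threads]

if __name__ == '__main__':
    print(run_all([0.3, 0.1]))
```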

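16.6 thread_obj: a callable object as the thread target

The target keyword argument can be not only a function but also an object.
The object's class must define a __call__ method, which is called automatically when the thread starts.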

import threading
from time import sleep, ctime
# Class of the callable objects used as thread targets
class Mythread(object):
    # func is the thread function, args holds the arguments for the thread function
    def __init__(self, func, args):
        # Store the thread function and its arguments as instance attributes
        self.func = func
        self.args = args
    # Called when the thread starts
    def __call__(self):
        # Call the thread function, unpacking the argument tuple into individual arguments
        self.func(*self.args)

# Thread function
def fun(index, sec):
    print('Started {} at: {}'.format(index, ctime()))
    # Sleep for sec seconds
    sleep(sec)
    print('Finished {} at: {}'.format(index, ctime()))

def main():
    print('Start time:', ctime())
    # Create the first Thread object; its target is a Mythread wrapping fun with label 10 and a 4 s sleep
    thread1 = threading.Thread(target=Mythread(fun, (10, 4)))
    thread1.start()
    # Create the second Thread object; its target is a Mythread wrapping fun with label 20 and a 2 s sleep
    thread2 = threading.Thread(target=Mythread(fun, (20, 2)))
    thread2.start()
    # Create the third Thread object; its target is a Mythread wrapping fun with label 30 and a 1 s sleep
    thread3 = threading.Thread(target=Mythread(fun, (30, 1)))
    thread3.start()
    # Wait for all threads to finish
    thread1.join()
    thread2.join()
    thread3.join()
    print('All thread functions have finished:', ctime())

if __name__ == '__main__':
    main()
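
One practical reason to use a callable object as the target is that it can hold state, for example the return value of the thread function, which a Thread does not hand back by itself. A minimal sketch, assuming hypothetical names ResultTask and add that are not in the original example:

```python
import threading

class ResultTask:
    """Callable thread target that remembers what the wrapped function returned."""
    def __init__(self, func, args):
        self.func = func
        self.args = args
        self.result = None

    def __call__(self):
        # Runs in the new thread; keep the return value on the object
        self.result = self.func(*self.args)

def add(a, b):
    return a + b

task = ResultTask(add, (2, 3))
worker = threading.Thread(target=task)
worker.start()
worker.join()
# After join, the result stored by __call__ can be read from the task object
print(task.result)
```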

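16.7 thread_inherit: subclassing Thread

MyThread is a subclass of Thread that overrides the parent's constructor and run method.
The program creates and starts two threads from MyThread and uses join to wait for both to finish before exiting.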

import threading
from time import ctime, sleep
# Subclass derived from Thread
class MyThread(threading.Thread):
    # Override the parent constructor; func is the thread function, args holds its arguments, name is the thread name
    def __init__(self, func, args, name=''):
        # Call the parent constructor with the thread name
        super().__init__(name=name)
        # Keep the thread function and its arguments for run
        self.func = func
        self.args = args
    # Override the parent's run method, which start() calls in the new thread
    def run(self):
        self.func(*self.args)

# Thread function
def fun(index, sec):
    print('Started {} at: {}'.format(index, ctime()))
    # Sleep for sec seconds
    sleep(sec)
    print('Finished {} at: {}'.format(index, ctime()))

def main():
    print('Start:', ctime())
    # Create the first thread and name it "Thread 1"
    thread1 = MyThread(fun, (10, 4), "Thread 1")
    # Create the second thread and name it "Thread 2"
    thread2 = MyThread(fun, (20, 2), "Thread 2")
    thread1.start()
    thread2.start()
    print(thread1.name)
    print(thread2.name)
    thread1.join()
    thread2.join()

    print('End:', ctime())

if __name__ == '__main__':
    main()

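16.8 lock_demo: Lock objects

The thread function uses a for loop to print the thread name and the loop variable, and a lock turns the whole loop into a critical section (effectively atomic with respect to the other threads: their loops cannot be interleaved with it).
Only after the current thread has finished its for loop and released the lock can another thread acquire the lock and run its own loop.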

from atexit import register
import random
from threading import Thread, Lock, current_thread
from time import sleep, ctime
# Create the lock object
lock = Lock()
def fun():
    # Acquire the lock
    lock.acquire()
    # The for loop is now a critical section: while this thread holds the lock,
    # the other threads block in acquire, so the loop runs to completion without interleaving
    for i in range(5):
        print('Thread Name={} i={}'.format(current_thread().name, i))
        # Sleep for a random time (1 to 5 seconds)
        sleep(random.randint(1, 5))
    # Release the lock so that another thread function can acquire it
    lock.release()

def main():
    # Start three threads in a loop
    for i in range(3):
        Thread(target=fun).start()

# This function is called when the program exits
@register
def exit():
    print('Threads finished:', ctime())
if __name__ == '__main__':
    main()
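
A Lock can also be used as a context manager: `with lock:` acquires it on entry and releases it on exit, even if the body raises an exception. The sketch below applies this to a shared counter; the function name count_with_lock and the numbers are assumptions chosen for illustration.

```python
from threading import Lock, Thread

def count_with_lock(n_threads, n_increments):
    """Increment a shared counter from several threads and return the final value."""
    counter = {'value': 0}
    lock = Lock()

    def work():
        for _ in range(n_increments):
            # Only one thread at a time can run this read-modify-write step
            with lock:
                counter['value'] += 1

    threads = [Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter['value']

if __name__ == '__main__':
    print(count_with_lock(4, 1000))
```

Because every increment happens while the lock is held, the final value is always n_threads * n_increments.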

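16.9 semaphore: semaphores (resources)

This section touches on operating-system concepts. A semaphore models a pool of resources: the operating system allocates resources to processes (and their threads), so resource usage has to be planned carefully to avoid deadlock, which would otherwise cause failures.
The details are not covered here; the author's operating systems column discusses them in depth.

The example uses an instance of the BoundedSemaphore class. The acquire method takes a resource (counter -1) and the release method returns one (counter +1).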

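from threading import BoundedSemaphore
Max = 3
# Create the semaphore and set the maximum value of its counter; the counter cannot exceed this value
semaphore = BoundedSemaphore(Max)
# _value is an internal attribute, read here only to display the counter
print(semaphore._value)
# Take a resource: counter -1
semaphore.acquire()
print(semaphore._value)
semaphore.acquire()
print(semaphore._value)
semaphore.acquire()
print(semaphore._value)
# The counter is 0, so no resource is available and the non-blocking acquire returns False
print(semaphore.acquire(False))
print(semaphore._value)

# Return a resource: counter +1
semaphore.release()
print(semaphore._value)
semaphore.release()
print(semaphore._value)
semaphore.release()
print(semaphore._value)
# The counter is already at its maximum, so one more release raises ValueError
semaphore.release()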

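When acquire is called with False (non-blocking), it does not block when the counter is 0; it returns False immediately to indicate that no resource was obtained. If a resource is obtained, it returns True.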
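
In a crawler, a typical use of a semaphore is to cap how many threads work at the same time. The following sketch is an illustration with assumed names (peak_concurrency, work_seconds); the sleep stands in for a network request, and the function reports the highest number of workers that were active at once.

```python
from threading import BoundedSemaphore, Lock, Thread
from time import sleep

def peak_concurrency(n_workers, limit, work_seconds=0.05):
    """Run n_workers threads, at most `limit` at a time; return the observed peak."""
    slots = BoundedSemaphore(limit)
    state_lock = Lock()
    state = {'active': 0, 'peak': 0}

    def worker():
        # Blocks while `limit` other workers are already inside
        with slots:
            with state_lock:
                state['active'] += 1
                state['peak'] = max(state['peak'], state['active'])
            sleep(work_seconds)  # stands in for a network request
            with state_lock:
                state['active'] -= 1

    threads = [Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state['peak']

if __name__ == '__main__':
    print(peak_concurrency(6, 2))
```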

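16.10 semaphore_lock: a small resource example, the candy machine

This example simulates refilling a candy machine and customers buying candy from it. The machine has 5 slots.
When slots are empty, new candy needs to be added.
When all 5 slots are full, no more candy can be added.
When all 5 slots are empty, customers cannot buy candy.
For simplicity, the example assumes that a customer buys the entire contents of one slot at a time and that each refill fills one whole slot.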

from atexit import register
from random import randrange
from threading import BoundedSemaphore, Lock, Thread
from time import sleep, ctime
# Create the thread lock
lock = Lock()
# Number of slots in the candy machine, also the maximum value of the semaphore counter
MAX = 5
# Create the semaphore with that maximum value (the machine starts out full)
candytray = BoundedSemaphore(MAX)
# Refill the candy machine (one slot per call)
def refill():
    # Acquire the lock so the refill and its messages are not interleaved with other threads
    lock.acquire()
    print('Refilling candy...', end=' ')
    try:
        # Refill one slot: counter +1
        candytray.release()
    except ValueError:
        print('the machine is full, cannot refill')
    else:
        print('refilled')
    # Release the lock
    lock.release()

# A customer buys candy
def buy():
    # Acquire the lock so the purchase and its messages are not interleaved with other threads
    lock.acquire()
    print('Buying candy...', end=' ')
    # Buy one slot of candy: counter -1; returns False if the purchase fails (all 5 slots are empty)
    if candytray.acquire(False):
        print('bought')
    else:
        print('the machine is empty, cannot buy')
    lock.release()

# Perform several refills
def producer(loops):
    for i in range(loops):
        refill()
        sleep(randrange(3))
# Perform several purchases
def consumer(loops):
    for i in range(loops):
        buy()
        sleep(randrange(3))

def main():
    print('Start:', ctime())
    # Pick a random number from 2 to 5
    nloops = randrange(2, 6)
    print('The candy machine has %d slots!' % MAX)
    # Start a thread that runs consumer
    Thread(target=consumer, args=(randrange(nloops, nloops + MAX + 2),)).start()
    # Start a thread that runs producer
    Thread(target=producer, args=(nloops,)).start()

@register
def exit():
    print('Program finished:', ctime())

if __name__ == '__main__':
    main()

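16.11 producer_consumer: the classic producer and consumer example

The queue module provides a mechanism for communication between threads. The producer and the consumer share one queue.
The producer puts items into the queue and the consumer takes them out.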


from random import randrange
from time import sleep
from threading import Lock, Thread
from queue import Queue
# Create the lock object (it keeps each thread's output lines together)
lock = Lock()
# Subclass derived from Thread
class MyThread(Thread):
    def __init__(self, func, args):
        super().__init__(target=func, args=args)

# Add an item to the queue
def writeQ(queue):
    # Acquire the lock
    lock.acquire()
    print('Produced an item and added it to the queue,', end=' ')
    # Add the item to the queue
    queue.put('item')
    print('queue size:', queue.qsize())
    # Release the lock
    lock.release()

# Take an item from the queue
def readQ(queue):
    # Block until an item is available. This is done before taking the lock:
    # waiting on an empty queue while holding the lock would block the producer and deadlock
    val = queue.get(True)
    # Acquire the lock
    lock.acquire()
    print('Consumed an item, queue size:', queue.qsize())
    # Release the lock
    lock.release()

# Producer: adds several items
def writer(queue, loops):
    for i in range(loops):
        writeQ(queue)
        sleep(randrange(1, 4))

# Consumer: takes several items
def reader(queue, loops):
    for i in range(loops):
        readQ(queue)
        sleep(randrange(1, 4))

funcs = [writer, reader]
nfuncs = range(len(funcs))

def main():
    nloops = randrange(2, 6)
    q = Queue(32)

    threads = []
    # Create 2 threads, one running writer and one running reader
    for i in nfuncs:
        t = MyThread(funcs[i], (q, nloops))
        threads.append(t)
    # Start the threads
    for i in nfuncs:
        threads[i].start()

    # Wait for both threads to finish
    for i in nfuncs:
        threads[i].join()
    print('All work is done')

if __name__ == '__main__':
    main()
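
In the example above the consumer knows in advance how many items to expect. When it does not, one common pattern is for the producer to put a sentinel value into the queue to mark the end of the stream. The sketch below assumes made-up names (SENTINEL, producer, consumer, run) and relies on Queue doing its own locking, so no extra Lock is used.

```python
from queue import Queue
from threading import Thread

# Hypothetical marker object that tells the consumer to stop
SENTINEL = object()

def producer(q, items):
    for item in items:
        q.put(item)       # blocks while the queue is full
    q.put(SENTINEL)       # signal that nothing more will arrive

def consumer(q, out):
    while True:
        item = q.get()    # blocks while the queue is empty
        if item is SENTINEL:
            break
        out.append(item)

def run(items):
    """Pass items from a producer thread to a consumer thread and return them."""
    q = Queue(maxsize=2)
    out = []
    threads = [Thread(target=producer, args=(q, items)),
               Thread(target=consumer, args=(q, out))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

if __name__ == '__main__':
    print(run(['a', 'b', 'c']))
```

With one producer and one consumer the items arrive in the order they were put, because Queue is first in, first out.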

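16.12 multi_process: multiple processes

Differences between processes and threads:

Process: the smallest unit of resource allocation
Thread: the smallest unit of scheduling
A process can own multiple threads. Put simply, the operating system does not allocate resources directly to threads; it allocates them to processes, and a process creates threads that use those resources.

Introduction to process pools:

When many processes are needed, use the process pool (the Pool class) in the multiprocessing module. The processes argument of the Pool constructor sets the number of worker processes to create.
The Pool class has a map method that associates a callback function with the data to pass to it: the callback is called once for each item.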


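from multiprocessing import Pool
import time
# Callback function run in the worker processes
def get_value(value):
    i = 0
    while i < 3:
        # Sleep for 1 second
        time.sleep(1)
        print(value, i)
        i += 1

if __name__ == '__main__':
    # Generate 5 values for the worker processes to handle
    values = ['value{}'.format(str(i)) for i in range(0, 5)]
    # Create a pool of 4 processes; the with block shuts the pool down afterwards
    with Pool(processes=4) as pool:
        # Associate the callback with values; map returns once every value has been processed
        pool.map(get_value, values)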

While the program is running, a task manager shows 5 additional python processes: one is the main process, and the other 4 are the worker processes created by Pool.

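16.13 Hands-on example: scraping the Douban Music Top 250 chart with multiple threads

This example uses 4 threads to fetch and parse different pages at the same time.
A list serves as the pool of URLs.
The get_url function takes URLs from this list and is synchronized with a thread lock.
Because each URL is deleted from the list after it is taken, access to the list must be synchronized in a multithreaded environment; otherwise the threads could see inconsistent data.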

import threading
import datetime
import requests
from bs4 import BeautifulSoup
import re
# Record the start time
starttime = datetime.datetime.now()
# Create the thread lock
lock = threading.Lock()
# Take a URL from the URL list; the lock makes this function safe to call from several threads
def get_url():
    global urls
    # Hold the lock while reading and modifying the list; it is released on return
    with lock:
        if len(urls) == 0:
            return ""
        # Take the first URL and remove it from the list
        return urls.pop(0)

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
}

# Fetch one chart page and visit the detail page of every album on it
def get_url_music(url, thread_name):
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    aTags = soup.find_all("a", attrs={"class": "nbg"})
    for aTag in aTags:
        get_music_info(aTag['href'], thread_name)

# Fetch one album page and print its details
def get_music_info(url, thread_name):
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    name = soup.find(attrs={'id': 'wrapper'}).h1.span.text
    author = soup.find(attrs={'id': 'info'}).find('a').text
    # The Chinese labels in the patterns match the text of the Douban page and must stay as they are
    styles = re.findall('<span class="pl">流派:</span>&nbsp;(.*?)<br />', html.text, re.S)
    style = styles[0].strip() if styles else 'unknown'
    times = re.findall('发行时间:</span>&nbsp;(.*?)<br />', html.text, re.S)
    release_time = times[0].strip() if times else 'unknown'
    publishers = re.findall('<span class="pl">出版者:</span>&nbsp;(.*?)<br />', html.text, re.S)
    publisher = publishers[0].strip() if publishers else 'unknown'

    score = soup.find(class_='ll rating_num').text
    info = {
        'name': name,
        'author': author,
        'style': style,
        'time': release_time,
        'publisher': publisher,
        'score': score
    }
    print(thread_name, info)

# Thread class for the crawler
class SpiderThread(threading.Thread):
    def __init__(self, name):
        # name is the thread name
        super().__init__(name=name)
    def run(self):
        while True:
            # Once running, the thread keeps taking URLs from the list until the list is empty
            url = get_url()
            if url != "":
                get_url_music(url, self.name)
            else:
                break

if __name__ == '__main__':
    # The first 4 chart pages, 25 entries each; range(0, 250, 25) would cover all 250 entries
    urls = ['https://music.douban.com/top250?start={}'.format(str(i)) for i in range(0, 100, 25)]
    print(len(urls))
    # Create the threads
    thread1 = SpiderThread('thread1')
    thread2 = SpiderThread('thread2')
    thread3 = SpiderThread('thread3')
    thread4 = SpiderThread('thread4')

    # Start the threads
    thread1.start()
    thread2.start()
    thread3.start()
    thread4.start()
    # Wait for all threads to finish
    thread1.join()
    thread2.join()
    thread3.join()
    thread4.join()
    print("Crawler finished")
    endtime = datetime.datetime.now()
    print('Time taken:', (endtime - starttime).seconds, 'seconds')
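
An alternative to guarding a plain list with a lock is to keep the URLs in a queue.Queue, which is already thread-safe. The sketch below only illustrates the idea and makes no network requests: crawl_all and handle are hypothetical names, and handle stands in for a function such as get_url_music.

```python
from queue import Empty, Queue
from threading import Thread

def crawl_all(urls, n_threads, handle):
    """Call handle(url) for every URL, spreading the work over n_threads threads."""
    pool = Queue()
    for url in urls:
        pool.put(url)

    def worker():
        while True:
            try:
                # get_nowait raises Empty instead of blocking when nothing is left
                url = pool.get_nowait()
            except Empty:
                return
            handle(url)

    threads = [Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    pages = ['https://music.douban.com/top250?start={}'.format(i) for i in range(0, 100, 25)]
    # print stands in for the real page handler
    crawl_all(pages, 4, print)
```

Each URL is taken by exactly one thread, so no explicit lock is needed around the pool.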

16.14 Hands-on example 2: scraping the Douban Music Top 250 chart with a multiprocessing Pool

import requests
from bs4 import BeautifulSoup
import re
from multiprocessing import Pool

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
}

# Fetch one chart page and visit the detail page of every album on it
def get_url_music(url):
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    aTags = soup.find_all("a", attrs={"class": "nbg"})
    for aTag in aTags:
        get_music_info(aTag['href'])

# Fetch one album page and print its details
def get_music_info(url):
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    name = soup.find(attrs={'id': 'wrapper'}).h1.span.text
    author = soup.find(attrs={'id': 'info'}).find('a').text
    # The Chinese labels in the patterns match the text of the Douban page and must stay as they are
    styles = re.findall('<span class="pl">流派:</span>&nbsp;(.*?)<br />', html.text, re.S)
    style = styles[0].strip() if styles else 'unknown'
    times = re.findall('发行时间:</span>&nbsp;(.*?)<br />', html.text, re.S)
    release_time = times[0].strip() if times else 'unknown'
    publishers = re.findall('<span class="pl">出版者:</span>&nbsp;(.*?)<br />', html.text, re.S)
    publisher = publishers[0].strip() if publishers else 'unknown'

    score = soup.find(class_='ll rating_num').text
    info = {
        'name': name,
        'author': author,
        'style': style,
        'time': release_time,
        'publisher': publisher,
        'score': score
    }
    print(info)

if __name__ == '__main__':
    # The first 4 chart pages, 25 entries each; range(0, 250, 25) would cover all 250 entries
    urls = ['https://music.douban.com/top250?start={}'.format(str(i)) for i in range(0, 100, 25)]
    print(len(urls))
    # Create a pool of 4 processes and hand one chart page to each call of get_url_music
    with Pool(processes=4) as pool:
        pool.map(get_url_music, urls)