Web Crawler Study Notes: Speeding Up a Crawler

Latest recommended article published on 2025-04-10 10:07:18

勤奋的小学生


Article link: https://blog.csdn.net/gyt15663668337/article/details/86345690


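This article looks at ways to speed up a web crawler. It compares single-threaded, multithreaded, multiprocess, and coroutine-based crawlers, explains how each approach works along with its strengths and weaknesses, and shows how to use Python's thread, process, and coroutine libraries to make a crawler more efficient.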


I. Concurrency and Parallelism, Synchronous and Asynchronous

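The previous articles covered the basic operations of a web crawler. This one covers how to make a crawler faster. There are three approaches: multithreaded crawlers, multiprocess crawlers, and coroutine-based crawlers. Before getting into the details, let's go over the concepts of concurrency and parallelism, and of synchronous and asynchronous execution.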

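Concurrency: several events take place within the same period of time. The tasks take turns making progress, and at any given instant only one of them may actually be running.
Parallelism: several events take place at the same instant, that is, multiple tasks really execute at the same moment.
Synchronous: the concurrent or parallel tasks do not run independently. They follow a certain order, and one task has to wait for another task's result before it can start.
Asynchronous: the concurrent or parallel tasks can run independently. One task is not affected by another, and tasks can proceed without waiting for each other.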
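
To make the difference concrete, here is a minimal sketch that runs the same three waiting tasks first one after another and then concurrently in threads. The names fake_io, run_sequential, and run_concurrent and the 0.2 second delay are illustrative assumptions, with time.sleep standing in for a network wait.

```python
import threading
import time

def fake_io(delay):
    """Stand-in for a network request: just wait for `delay` seconds."""
    time.sleep(delay)

def run_sequential(n, delay):
    """Run n tasks one after another and return the elapsed time."""
    start = time.time()
    for _ in range(n):
        fake_io(delay)
    return time.time() - start

def run_concurrent(n, delay):
    """Run n tasks in n threads and return the elapsed time."""
    start = time.time()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

sequential_time = run_sequential(3, 0.2)
concurrent_time = run_concurrent(3, 0.2)
print("sequential:", round(sequential_time, 2))
print("concurrent:", round(concurrent_time, 2))
```

The sequential run has to pay for every wait in turn, while the concurrent run overlaps the waits, so its total time should be close to that of a single task.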

II. Multithreaded Crawlers

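A multithreaded crawler runs concurrently. In CPython the threads do not truly execute Python code at the same time; the crawler gets faster because the interpreter switches quickly between threads. Python has a Global Interpreter Lock (GIL): a thread's execution consists of acquiring the GIL, running code until it is suspended, and releasing the GIL. A thread must hold the GIL in order to run, so when several tasks arrive they first compete for the GIL and only then execute. A web crawler, however, is IO-bound, and multithreading improves its efficiency considerably. With a single thread, every IO operation means waiting on IO, which wastes time. With multiple threads, when thread A is waiting the interpreter switches to thread B automatically, so CPU time is not wasted and the program runs more efficiently.

Python multithreading works well for IO-bound code, so a crawler can use multiple threads while fetching pages to speed things up.

Python offers two ways to use threads:

(1) Functional style: call start_new_thread() in the _thread module to create a new thread.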

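import _thread
import time

# Define the function each thread will run
def print_time(threadName, delay):
    count = 0
    while count < 3:
        time.sleep(delay)
        count += 1
        print(threadName, time.ctime())

# Start two new threads
_thread.start_new_thread(print_time, ("Thread-1", 1))
_thread.start_new_thread(print_time, ("Thread-2", 2))
# _thread has no join(), so keep the main thread alive until both threads
# are done; otherwise the script exits and the threads are killed early.
time.sleep(7)
print("Main Finished")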

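_thread provides low-level, primitive threads, and compared with the threading module its functionality is fairly limited.

(2) Class style: create threads with the threading module by subclassing threading.Thread.

The threading module provides the Thread class for working with threads. It includes the following methods:

run(): the method that represents the thread's activity
start(): starts the thread's activity
join([timeout]): blocks the calling thread until the thread whose join() was called terminates, or until the optional timeout elapses
is_alive(): returns whether the thread is still alive (the older isAlive() spelling was removed in Python 3.9)
getName(): returns the thread name (deprecated since Python 3.10 in favor of the name attribute)
setName(): sets the thread name (deprecated since Python 3.10 in favor of the name attribute)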
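
As a quick illustration of these methods, the following sketch starts one thread and checks its state before it starts, after a short join() with a timeout, and after a full join(). The worker function and the timings are assumptions chosen for the demo.

```python
import threading
import time

def worker():
    # Stand-in for real work: sleep for a short while
    time.sleep(0.3)

t = threading.Thread(target=worker, name="Thread-demo")
before_start = t.is_alive()      # the thread has not been started yet
t.start()
t.join(0.05)                     # stop waiting after 0.05 seconds
after_short_join = t.is_alive()  # the worker is still sleeping at this point
t.join()                         # wait until the thread terminates
after_full_join = t.is_alive()

print(t.name, before_start, after_short_join, after_full_join)
```

Note that join() with a timeout returns whether or not the thread has finished, so is_alive() is the way to tell which of the two happened.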

import threading
import time

class myThread(threading.Thread):
    def __init__(self, name, delay):
        threading.Thread.__init__(self)
        self.name = name
        self.delay = delay
    def run(self):
        print("Starting " + self.name)
        print_time(self.name, self.delay)
        print("Exiting " + self.name)
def print_time(threadName, delay):
    counter = 0
    while counter < 3:
        time.sleep(delay)
        print(threadName, time.ctime())
        counter += 1
threads = []
# Create the threads
thread1 = myThread("Thread-1", 1)
thread2 = myThread("Thread-2", 2)

# Start the threads
thread1.start()
thread2.start()

# Add the threads to the thread list
threads.append(thread1)
threads.append(thread2)

# Wait for all threads to finish
for t in threads:
    t.join()
    
print("Exiting Main Thread")

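Next, we use threads to compare the efficiency of different approaches.

The first is a simple single-threaded crawler that sends requests to 200 websites:

This approach sends a request to each of the 200 websites in turn, and it took 207 seconds in total.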
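
The listing for the single-threaded version is not shown here. A minimal sketch of what such a crawler could look like follows; the function names fetch_status and crawl_sequential are hypothetical, and the fetch parameter is an assumption added so the loop can be tried without network access.

```python
import time

def fetch_status(url):
    """Fetch one page with requests and return its HTTP status code."""
    import requests  # imported here so the sketch loads without the package
    return requests.get(url, timeout=20).status_code

def crawl_sequential(links, fetch=fetch_status):
    """Request every link in order; return (results, elapsed seconds)."""
    results = []
    start = time.time()
    for url in links:
        try:
            results.append((url, fetch(url)))
        except Exception as e:
            # Record the failure and move on to the next link
            results.append((url, 'Error: ' + str(e)))
    return results, time.time() - start
```

With a link_list built from aa.txt as in the other listings, calling crawl_sequential(link_list) would visit the sites one at a time, so the total time is roughly the sum of all the individual request times.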

The second splits the 200 websites into four groups, each handled by one of four threads.

import threading
import requests
import time

# Build the list of links
link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

# Define the thread class
start = time.time()
class myThread(threading.Thread):
    def __init__(self, name, link_range):
        threading.Thread.__init__(self)
        self.name = name
        self.link_range = link_range
    def run(self):
        print("Starting " + self.name)
        crawler(self.name, self.link_range)
        print("Exiting " + self.name)

def crawler(threadName, link_range):
    for i in range(link_range[0], link_range[1] + 1):
        try:
            r = requests.get(link_list[i], timeout=20)
            print(threadName, r.status_code, link_list[i])
        except Exception as e:
            print(threadName, 'Error: ', e)

thread_list = []
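# Inclusive index ranges over the 200 links (indexes 0 to 199), 50 per thread
link_range_list = [(0,49), (50,99), (100,149), (150,199)]

# Create and start the threads
for i in range(1, 5):
    thread = myThread("Thread-" + str(i), link_range_list[i-1])
    thread.start()
    thread_list.append(thread)

# Wait for all threads to finish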
for thread in thread_list:
    thread.join()

end = time.time()
print("Total time for the simple multithreaded crawler:", end-start)
print("Exiting Main Thread")

A drawback of this approach is that when thread A finishes while thread B is still working, thread A sits idle, which wastes resources.
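
The index ranges passed to the threads are inclusive at both ends, because crawler() loops over range(link_range[0], link_range[1] + 1). Rather than typing them by hand, they can be computed. The helper below is a small sketch; split_ranges is a hypothetical name, not something from the original listing.

```python
def split_ranges(total, parts):
    """Split indexes 0..total-1 into `parts` inclusive (first, last) ranges."""
    base, extra = divmod(total, parts)
    ranges = []
    first = 0
    for i in range(parts):
        # The first `extra` ranges get one additional item each
        size = base + (1 if i < extra else 0)
        ranges.append((first, first + size - 1))
        first += size
    return ranges

print(split_ranges(200, 4))
# prints [(0, 49), (50, 99), (100, 149), (150, 199)]
```

Even with equal-sized groups, the idle-thread problem remains, since some sites answer much more slowly than others.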

The third puts the 200 websites into a queue and lets four threads take tasks from the queue.

import threading
import requests
import time
import queue as Queue

# Build the list of links
link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()
class myThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        self.name = name
        self.q = q
    def run(self):
        print("Starting " + self.name)
        while True:
            try:
                crawler(self.name, self.q)
            except Queue.Empty:
                # No URL arrived within the timeout, so the queue is drained
                break
        print("Exiting ",self.name)

def crawler(threadName, q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        print(q.qsize(), threadName, r.status_code, url)
    except Exception as e:
        print(q.qsize(), threadName, url, 'Error', e)

threadList = ["Thread-1", "Thread-2", "Thread-3", "Thread-4"]
workQueue = Queue.Queue(200)
threads = []

# Create and start the threads
for tName in threadList:
    thread = myThread(tName, workQueue)
    thread.start()
    threads.append(thread)

# Fill the queue
for url in link_list:
    workQueue.put(url)

# Wait for all threads to finish
for t in threads:
    t.join()
    
end = time.time()
print("Total time for the Queue multithreaded crawler:", end-start)
print("Exiting Main Thread")

This way no thread finishes early and sits idle. The results also show that the third method is the fastest.
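
The load-balancing effect of a shared queue can be seen without touching the network. In the sketch below, run_workers and fake_fetch are hypothetical names, the tasks are plain integers rather than URLs, and every tenth task is made artificially slow with time.sleep.

```python
import queue
import threading
import time

def run_workers(tasks, fetch, num_threads=4):
    """Let num_threads threads pull tasks from one shared queue."""
    q = queue.Queue()
    for task in tasks:
        q.put(task)
    handled = {}                       # thread name -> tasks it processed
    lock = threading.Lock()

    def worker():
        name = threading.current_thread().name
        while True:
            try:
                task = q.get_nowait()  # the queue is pre-filled, so no waiting
            except queue.Empty:
                return                 # nothing left: this thread is done
            fetch(task)
            with lock:
                handled.setdefault(name, []).append(task)

    threads = [threading.Thread(target=worker, name="Thread-%d" % (i + 1))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled

def fake_fetch(task):
    # Pretend that every tenth page is slow to respond
    time.sleep(0.05 if task % 10 == 0 else 0.005)

handled = run_workers(list(range(40)), fake_fetch)
for name in sorted(handled):
    print(name, len(handled[name]))
```

A thread that happens to draw a slow task simply ends up handling fewer tasks overall, while the others keep pulling work, so no thread waits around with work still in the queue.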

III. Multiprocess Crawlers

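Because of the GIL, Python threads effectively run on a single core, executing concurrently and asynchronously. A multithreaded crawler therefore cannot take full advantage of a multi-core CPU. Multiple processes can use multiple cores, and the number of processes is usually chosen according to the number of CPU cores. Since they run on different cores, the processes execute in parallel. There are two ways to use the multiprocessing library: Process + Queue, and Pool + Queue.

1. A multiprocess crawler using multiprocessing

When there are more processes than CPU cores, the extra processes cannot all run at once and have to wait for the operating system to give them a turn on a core. So we first need to know how many CPU cores the machine has.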

from multiprocessing import cpu_count
print(cpu_count())

The result is 4, which means this machine has 4 CPU cores. Next, we start three processes and send requests to the 200 pages.

from multiprocessing import Process, Queue
from queue import Empty
import time
import requests

# Build the list of links
link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

class MyProcess(Process):
    def __init__(self, q):
        Process.__init__(self)
        self.q = q
    def run(self):
        print("Starting ", self.pid)
        while not self.q.empty():
            try:
                crawler(self.q)
            except Empty:
                # Another process took the last URL first
                break
        print("Exiting ", self.pid)

def crawler(q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        print(q.qsize(), r.status_code, url)
    except Exception as e:
        print(q.qsize(), url, 'Error', e)

if __name__ == '__main__':
    start = time.time()
    workQueue = Queue(1000)

    # Fill the queue
    for url in link_list:
        workQueue.put(url)

    # Start all three processes first and only then wait for them.
    # Calling join() right after start() would run them one at a time.
    processes = []
    for i in range(0, 3):
        p = MyProcess(workQueue)
        p.daemon = True
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    end = time.time()
    print('Total time for the Process + Queue multiprocess crawler:', end-start)
    print("Main process Ended!")

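2. A multiprocess crawler using Pool + Queue

The second approach uses Pool. A Pool is a process pool that provides a specified number of worker processes for the user to call on. When a new request is submitted to the Pool and a worker is free, the request is run right away. If all the workers are busy, the request waits until one of them finishes its current task. Before going further, let's look at blocking and non-blocking, which describe the state of the program while it waits for the result of a call:

Blocking: the caller waits for the result to come back, and the current process is suspended until the result is available.

Non-blocking: after a task is added, there is no need to wait for its result before adding other tasks to run.

We use Pool's non-blocking method together with a Queue to fetch the pages.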

from multiprocessing import Pool, Manager
import time
import requests

link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()
def crawler(q, index):
    Process_id = 'Process-' + str(index)
    while not q.empty():
        url = q.get(timeout=2)
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, q.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, q.qsize(), url, 'Error: ', e)

if __name__ == '__main__':

    manager = Manager()
    workQueue = manager.Queue(1000)
    # Fill the queue
    for url in link_list:
        workQueue.put(url)
    pool = Pool(processes=3)
    for i in range(4):
        pool.apply_async(crawler, args=(workQueue, i))
    print("Started process")
    pool.close()
    pool.join()
    end = time.time()
    print("Total time for the Pool + Queue multiprocess crawler:", end-start)
    print('Main process Ended!')

For the blocking version, only one line needs to change: replace apply_async with apply.

pool.apply(crawler, args=(workQueue, i))

You can see that process 0 runs first, and process 1 only starts after it has finished.
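
The difference between the two calls shows up clearly in a timing comparison. The sketch below is only an illustration: slow_task, run_blocking, and run_non_blocking are hypothetical names, and time.sleep stands in for the crawling work so that no network access is needed.

```python
import time
from multiprocessing import Pool

def slow_task(index):
    """Stand-in for one crawl job: wait briefly, then return the index."""
    time.sleep(0.2)
    return index

def run_blocking(pool, n):
    """apply() returns only when the task is done, so tasks run one by one."""
    start = time.time()
    results = [pool.apply(slow_task, args=(i,)) for i in range(n)]
    return results, time.time() - start

def run_non_blocking(pool, n):
    """apply_async() returns at once; the results are collected afterwards."""
    start = time.time()
    pending = [pool.apply_async(slow_task, args=(i,)) for i in range(n)]
    results = [p.get() for p in pending]
    return results, time.time() - start

if __name__ == '__main__':
    with Pool(processes=3) as pool:
        print("apply:      ", run_blocking(pool, 3))
        print("apply_async:", run_non_blocking(pool, 3))
```

With three workers and three tasks, the blocking run should take roughly three times as long as the non-blocking one, because apply() never has more than one task in flight.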

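IV. Coroutine-Based Crawlers

A coroutine is a lightweight thread that lives in user space.

Advantages:

A coroutine works like a process that is simulated at the program level rather than managed by the operating system. Since everything runs in a single thread, there are no thread context switches, so the overhead is small and it runs fast.
Coroutines make it easy to switch control flow, which simplifies the programming model.
Coroutines are highly scalable and highly concurrent; a single CPU can support tens of thousands of coroutines without trouble.

Disadvantages:

A coroutine is essentially single-threaded, so it cannot use multiple CPU cores at once. To use them, coroutines have to be combined with processes.
Do not use coroutines for IO operations that block for a long time, since they can block the whole program.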
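
The single-threaded, cooperative nature of coroutines can be seen in a small experiment. Note that this sketch uses the standard library's asyncio rather than gevent, which is what the rest of this section uses, purely so that it runs without extra packages; fake_fetch and main are hypothetical names and asyncio.sleep stands in for a network wait.

```python
import asyncio
import threading
import time

async def fake_fetch(index, thread_ids):
    thread_ids.add(threading.get_ident())  # record which thread runs this
    await asyncio.sleep(0.2)               # yield control while "waiting"
    return index

async def main(n):
    thread_ids = set()
    start = time.time()
    results = await asyncio.gather(*(fake_fetch(i, thread_ids) for i in range(n)))
    return results, time.time() - start, thread_ids

results, elapsed, thread_ids = asyncio.run(main(100))
print(len(results), "coroutines,", len(thread_ids), "thread,", round(elapsed, 2), "seconds")
```

All 100 coroutines run in one thread, and since each one gives up control while it waits, the total time stays close to a single wait instead of 100 of them. If a coroutine called time.sleep instead of awaiting, it would never yield, and the others would be stuck behind it, which is the blocking problem described above.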

from gevent import monkey  # patch first, before other modules that do IO are imported
monkey.patch_all()  # replace blocking IO functions with cooperative (asynchronous) versions

import gevent
from gevent.queue import Queue, Empty
import time
import requests

link_list = []
with open('aa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()
def crawler(index):
    Process_id = 'Process-' + str(index)
    while not workQueue.empty():
        url = workQueue.get(timeout=2)
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, workQueue.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, workQueue.qsize(), url, 'Error: ', e)
def boss():
    for url in link_list:
        workQueue.put_nowait(url)

if __name__ == '__main__':
    workQueue = Queue(1000)
    gevent.spawn(boss).join()
    jobs = []
    for i in range(10):
        jobs.append(gevent.spawn(crawler, i))
    gevent.joinall(jobs)

    end = time.time()
    print('gevent + Queue 多协程爬虫的总时间为：', end-start)
    print('Main Ended!')

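from gevent import monkey  # marks the IO operations below that may block
monkey.patch_all()  # turns blocking IO into functions that run asynchronously

These two lines are what give the crawler its concurrency. Without them, the pages are fetched one after another.

Summary: this article covered multithreaded, multiprocess, and coroutine-based web crawlers. Each has its own advantages, and all are good tools for speeding up a crawler.

Previous article: Web Crawler Study Notes: Data Storage

Next article: Web Crawler Study Notes: Anti-Crawler Measures

Note: these study notes summarize the book 《Python网络爬虫从入门到实践》 (Python Web Crawling: From Beginner to Practice) by Tang Song. For the book's full content, please buy a copy.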