Python: Three Ways to Speed Up a Web Crawler


MOVEBXEAX


Article link: https://blog.csdn.net/qq_37246800/article/details/100748770


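1. Speeding up a crawler

There are three main ways to speed up a crawler:

Multithreaded crawling
Multiprocess crawling
Coroutine-based crawling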

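2. Concurrency vs. parallelism, synchronous vs. asynchronous

Concurrency means several events take place within the same period of time.
Parallelism means several events take place at the same instant.

On a single-core CPU, multiple tasks run concurrently: with only one core, each task occupies the CPU for a short time slice in turn. The slices are short and the switching is frequent, so the tasks appear to run "simultaneously", but they do not. On a multi-core CPU, tasks on different cores really do run at the same time, and that is parallelism.

Think of the task of finishing a bowl of rice and a plate of stir-fried pork. "Concurrency" is one person eating both, taking a bite of the dish and then a bite of rice; because the switching is fast, it looks as if the rice and the dish are being eaten "at the same time". "Parallelism" is two people eating at once, one eating the rice and the other eating the dish.

Synchronous means the concurrent or parallel tasks do not run independently: they follow a certain order, and one task may only start after another has finished and produced its result. It is like a relay race, where the next runner can start only after receiving the baton.
Asynchronous means the concurrent or parallel tasks run independently, and one task is not affected by another. It is like runners in separate lanes, where each runner's speed does not depend on the runners in the other lanes.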
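The relay-race and separate-lanes analogies above can be sketched with two threads. This is only an illustration under assumed names (relay, lanes, runner_a, runner_b are made up for this sketch), with time.sleep standing in for real work.

```python
import threading
import time


def relay():
    """Synchronous: runner B cannot start until runner A hands over the baton."""
    log = []
    baton = threading.Event()

    def runner_a():
        time.sleep(0.1)           # A runs its leg
        log.append('A finished')
        baton.set()               # hand over the baton

    def runner_b():
        baton.wait()              # B blocks until A has finished
        log.append('B finished')

    # B is started first on purpose: it still has to wait for A.
    threads = [threading.Thread(target=runner_b), threading.Thread(target=runner_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def lanes():
    """Asynchronous: each runner has its own lane and finishes at its own pace."""
    log = []

    def runner(name, seconds):
        time.sleep(seconds)
        log.append(name + ' finished')

    threads = [threading.Thread(target=runner, args=('A', 0.3)),
               threading.Thread(target=runner, args=('B', 0.05))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


relay_log = relay()
lanes_log = lanes()
print('relay:', relay_log)
print('lanes:', lanes_log)
```

In the relay version the order is fixed by the dependency between the tasks; in the lanes version the faster runner simply finishes first.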

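3. Multithreaded crawlers

A multithreaded crawler runs concurrently. The threads do not truly execute at the same time; the crawler gets faster because the interpreter switches between threads quickly.
This is due to Python's GIL (Global Interpreter Lock). A thread's execution cycle is: acquire the GIL, run code until it is suspended, then release the GIL. A Python process has only one GIL, and a thread that does not hold the GIL is not allowed to execute.

Each time the GIL is released, the threads compete for the lock, and switching threads costs resources. Because of the GIL, a Python process can only ever execute one thread at a time (the one holding the GIL), which is why Python multithreading is inefficient on multi-core CPUs.

Python multithreading works well for IO-bound code. A thread that is waiting on the network releases the GIL, so a crawler can use multiple threads to overlap page downloads and speed up fetching.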
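To see why threads help IO-bound work despite the GIL, here is a small sketch that needs no network access. It assumes a made-up fake_download function in which time.sleep stands in for waiting on a server; like a blocking socket read, sleeping releases the GIL so other threads can proceed.

```python
import threading
import time


def fake_download(seconds=0.2):
    # Stand-in for a network request (an assumption for this sketch):
    # the thread waits without holding the GIL.
    time.sleep(seconds)


def run_sequential(n):
    start = time.time()
    for _ in range(n):
        fake_download()
    return time.time() - start


def run_threaded(n):
    start = time.time()
    threads = [threading.Thread(target=fake_download) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


sequential_time = run_sequential(5)
threaded_time = run_threaded(5)
print('sequential: %.2fs, threaded: %.2fs' % (sequential_time, threaded_time))
```

The five waits overlap in the threaded run, so it takes roughly as long as a single wait, whereas the sequential run takes about five times as long.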

Example: crawl 1000 pages with multiple threads, using 5 threads:

import threading
import requests
import time
import queue as Queue


link_list = []
# Each line of alexa.txt is tab-separated; the URL is the second field.
with open('alexa.txt', 'r') as file:
    for eachone in file:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


class myThread(threading.Thread):
    def __init__(self, name, q):
        super(myThread, self).__init__()
        self.name = name
        self.q = q

    def run(self):
        print('Starting ' + self.name)
        while True:
            try:
                crawler(self.name, self.q)
            except Queue.Empty:
                # No URL arrived within the timeout: the queue is drained.
                break
        print('Exiting ' + self.name)


def crawler(threadName, q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=20)
        print(threadName, r.status_code)
    except Exception as e:
        print(threadName, 'Error', e)


threadList = ['Thread-1', 'Thread-2', 'Thread-3', 'Thread-4', 'Thread-5']
workQueue = Queue.Queue(1000)
threads = []


for tName in threadList:
    thread = myThread(tName, workQueue)
    thread.start()
    threads.append(thread)

for url in link_list:
    workQueue.put(url)

for t in threads:
    t.join()

end = time.time()
print('Total time for the simple multithreaded crawler:', end-start)
print('Exiting Main Thread')

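Using a queue improves thread utilization: a thread that finishes one page immediately takes the next URL from the queue instead of sitting idle.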
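The same queue-of-work idea is also available in the standard library as concurrent.futures.ThreadPoolExecutor, which manages the worker threads and the internal queue itself. The sketch below is an alternative to the listing above, not the author's method. The fetch function and the example.com URLs are placeholders invented for the sketch; a real crawler would call requests.get(url, timeout=20) there.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch(url):
    # Hypothetical stand-in for requests.get(url, timeout=20).status_code,
    # so that the sketch runs without network access.
    time.sleep(0.05)
    return url, 200


# Placeholder URLs; in the article these come from alexa.txt.
urls = ['http://example.com/page%d' % i for i in range(20)]

# Five worker threads pull URLs from the executor's internal queue.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))

for url, status in results[:3]:
    print(url, status)
```

executor.map returns results in the same order as the input URLs, even though the pages are fetched concurrently.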

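4. Multiprocess crawlers

A multiprocess crawler can make use of multiple CPU cores; the number of processes is usually chosen according to the number of cores. Processes running on different cores run in parallel. In Python, multiprocessing is done with the multiprocessing library.

There are two ways to use multiprocessing:

Process+Queue
Pool+Queue

When there are more processes than cores, the operating system time-slices them, so the extra processes have to wait for a core to become free. On a single-core CPU, therefore, processes cannot run in parallel.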
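Since the useful degree of parallelism depends on the number of cores, it can help to query it before deciding how many processes to start. A minimal sketch follows; the cap of 3 mirrors the examples in this article and is an arbitrary choice, not a rule.

```python
import multiprocessing

# Number of logical cores visible to Python on this machine.
core_count = multiprocessing.cpu_count()

# Hypothetical policy for this sketch: start at most 3 crawler processes,
# but never more than there are cores.
process_count = min(3, core_count)

print('cores:', core_count, 'crawler processes:', process_count)
```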

Process+Queue

Example 1: crawl 1000 pages with 3 processes:

from multiprocessing import Process,Queue
import time
import requests


link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


class MyProcess(Process):
    def __init__(self, q):
        super(MyProcess, self).__init__()
        self.q = q

    def run(self):
        print('Starting ', self.pid)
        while not self.q.empty():
            crawler(self.q)
        print('Exiting ', self.pid)


def crawler(q):
    url = q.get(timeout=2)
    try:
        r = requests.get(url, timeout=2)
        print(q.qsize(), r.status_code, url)
    except Exception as e:
        print(q.qsize(), url, 'Error: ', e)


if __name__ == '__main__':
    ProcessNames = ['Process-1', 'Process-2', 'Process-3']
    workQueue = Queue(1000)

    for url in link_list:
        workQueue.put(url)

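    processes = []
    for i in range(0, 3):
        p = MyProcess(workQueue)
        p.daemon = True
        p.start()
        processes.append(p)

    # Wait for every worker, not only the last one started.
    for p in processes:
        p.join()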


    end = time.time()
    print('Process + Queue :', end-start)
    print('Main process Ended!')

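In the code above, p.daemon = True sets the daemon attribute, which can be set separately for each process. When it is True, the child process is terminated automatically when the parent process exits.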

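Pool+Queue

When there are only a few items to process, the Process approach above works well for creating processes on demand. With many processes, however, managing each one by hand becomes tedious, and a process pool (Pool) is more efficient.

A Pool provides a specified number of processes for the caller to use.
Blocking and non-blocking describe the state of the program while it waits for a call's result. A blocking call waits until the result is available, and the current process is suspended until then. With a non-blocking call, a task is submitted and further tasks can be submitted without waiting for its result.
Example 2: crawl 1000 pages with Pool+Queue: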

from multiprocessing import Pool, Manager
import time
import requests


link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


def crawler(q, index):
    Process_id = 'Process-' + str(index)
    while not q.empty():
        url = q.get(timeout=2)
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, q.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, q.qsize(), url, 'Error', e)


if __name__ == '__main__':
    manager = Manager()
    workQueue = manager.Queue(1000)

    for url in link_list:
        workQueue.put(url)

    pool = Pool(processes=3)
    # Submit one crawler task per pool process.
    for i in range(3):
        pool.apply_async(crawler, args=(workQueue, i))

    print('Started process')
    pool.close()
    pool.join()

    end = time.time()
    print('Pool + Queue :', end-start)
    print('Main process Ended!')

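The Queue has to be created differently here. We use Manager from multiprocessing: manager = Manager() and workQueue = manager.Queue(1000) create a queue that can be passed to the pool's worker processes and used for communication between the parent and child processes.
Tasks are submitted in non-blocking mode with pool.apply_async(func, args=(...)); the blocking counterpart is pool.apply(func, args=(...)).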
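The difference between blocking and non-blocking submission can be measured directly. To keep the sketch runnable anywhere, it swaps in multiprocessing.pool.ThreadPool, which has the same apply and apply_async interface as Pool but runs its workers as threads; slow_square is a made-up task standing in for a page download.

```python
from multiprocessing.pool import ThreadPool
import time


def slow_square(x):
    # Made-up task for this sketch: wait a little, then return a result.
    time.sleep(0.2)
    return x * x


pool = ThreadPool(processes=3)

# Blocking: apply() returns only once the result is ready,
# so the three calls run one after another.
start = time.time()
blocking_results = [pool.apply(slow_square, args=(i,)) for i in range(3)]
blocking_time = time.time() - start

# Non-blocking: apply_async() returns an AsyncResult immediately,
# so the three calls overlap; get() collects each result afterwards.
start = time.time()
pending = [pool.apply_async(slow_square, args=(i,)) for i in range(3)]
async_results = [r.get() for r in pending]
async_time = time.time() - start

pool.close()
pool.join()

print('blocking: %.2fs, non-blocking: %.2fs' % (blocking_time, async_time))
```

Both variants return the same values; only the non-blocking one lets the pool's workers run at the same time.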

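5. Coroutine crawlers

A coroutine is a lightweight, user-space thread. Coroutines have several advantages:

A coroutine is like a program-level imitation of a system-level process. Because everything runs in a single thread and there is no thread context switching, the overhead is very low.
Coroutines make it easy to switch control flow, which simplifies the programming model. A coroutine preserves the state of its previous call, so each time it is re-entered it resumes from that state.
Coroutines scale well and handle high concurrency: tens of thousands of coroutines on one CPU is not a problem, which makes them well suited to highly concurrent workloads.

Coroutines also have drawbacks:

A coroutine is essentially single-threaded, so it cannot use several cores of a CPU at once; it has to be combined with multiple processes to run on a multi-core CPU.
Do not use coroutines around long blocking IO operations, because they can block the whole program.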
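The last drawback is easy to reproduce. The sketch below uses the standard library's asyncio rather than gevent, purely so that it runs without extra packages; the function names are made up for the illustration. A coroutine that awaits gives control back while it waits, whereas one that makes an ordinary blocking call stalls every other coroutine on the same thread.

```python
import asyncio
import time


async def cooperative(delay):
    # Yields control to the event loop while waiting.
    await asyncio.sleep(delay)


async def blocking(delay):
    # Ordinary blocking call: never yields, so all other coroutines wait too.
    time.sleep(delay)


async def run_all(worker, count, delay):
    start = time.time()
    await asyncio.gather(*(worker(delay) for _ in range(count)))
    return time.time() - start


cooperative_time = asyncio.run(run_all(cooperative, 5, 0.1))
blocking_time = asyncio.run(run_all(blocking, 5, 0.1))
print('cooperative: %.2fs, blocking: %.2fs' % (cooperative_time, blocking_time))
```

The five cooperative waits overlap, while the five blocking waits run back to back on the single thread.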

The coroutine example here uses the gevent library.
Example: crawl 1000 pages with coroutines:

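from gevent import monkey

# Patch the standard library before importing requests,
# so that the sockets it uses are cooperative.
monkey.patch_all()

import gevent
from gevent.queue import Queue, Empty
import time
import requests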

link_list = []
with open('alexa.txt', 'r') as file:
    file_list = file.readlines()
    for eachone in file_list:
        link = eachone.split('\t')[1]
        link = link.replace('\n', '')
        link_list.append(link)

start = time.time()


def crawler(index):
    Process_id = 'Process-' + str(index)
    while not workQueue.empty():
        url = workQueue.get(timeout=2)
        try:
            r = requests.get(url, timeout=20)
            print(Process_id, workQueue.qsize(), r.status_code, url)
        except Exception as e:
            print(Process_id, workQueue.qsize(), url, 'Error:', e)

def boss():
    for url in link_list:
        workQueue.put_nowait(url)


if __name__ == '__main__':
    workQueue = Queue(1000)

    gevent.spawn(boss).join()
    jobs = []
    for i in range(10):
        jobs.append(gevent.spawn(crawler, i))
    gevent.joinall(jobs)

    end = time.time()
    print('gevent + Queue :', end-start)
    print('Main Ended!')

from gevent import monkey
monkey.patch_all()
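These two lines are what give the crawler its concurrency. gevent's monkey module patches the standard-library functions that may perform IO, replacing them with cooperative versions so that the IO can run asynchronously.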
