Comparing Crawlers: Synchronous, Multiprocessing, and Coroutine Approaches

Using Python, we write a crawler that fetches the titles of the top ten Weibo hot searches, implemented three ways: synchronously, with multiprocessing, and with coroutines. We compare how the implementations differ; their execution efficiency, of course, also differs dramatically.

I. Synchronous approach: crawl and print the hot-search titles one by one

This approach is the simplest: each page is fetched and processed one after another, so it is also the least efficient.

import time
import requests
from bs4 import BeautifulSoup

def get_title(url):
    try:
        # Sina Weibo has anti-crawling measures, so a header dict carrying cookie
        # information is required; press F12 on the page to view and copy it.
        header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
        r=requests.get(url,timeout=30,headers=header)  # fetch the page
        r.encoding=r.apparent_encoding  # use the detected encoding
        soup=BeautifulSoup(r.text,'html.parser')  # parse the page with BeautifulSoup
        print(soup.find(attrs={'class':'title'}).string)  # print the string of the tag with class 'title', i.e. the page title
    except:
        print('error')
# list of page URLs
urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]
# process the URLs one by one
def main(urls):
    for url in urls:
        get_title(url)

start=time.time()
main(urls)
end=time.time()
print('run time is %.5f'%(end-start))  # print the elapsed time

Running this prints the titles of the top ten Weibo hot searches; the elapsed time is 7.73 s.

II. Next, implement the crawler with multiprocessing

Now we use multiprocessing.Pool to build a multiprocess crawler. Since my computer's CPU has 4 cores, I set the process pool size to 4 (p=Pool(4)).
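If you would rather not hard-code the number 4, a minimal sketch (standard library only) is to ask the OS for the core count at runtime and size the pool to match:

import multiprocessing

n_cores = multiprocessing.cpu_count()  # number of logical CPU cores reported by the OS
p = multiprocessing.Pool(n_cores)      # one worker process per core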

import multiprocessing
from multiprocessing import Pool
import time
import requests
from bs4 import BeautifulSoup

def get_title(url):
    try:
        header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
        r=requests.get(url,timeout=30,headers=header)
        r.encoding=r.apparent_encoding
        soup=BeautifulSoup(r.text,'html.parser')
        print(soup.find(attrs={'class':'title'}).string)
    except:
        print('error...')

urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]

def main(urls):
    p=Pool(4)  # create a pool of 4 worker processes
    for url in urls:
        p.apply_async(get_title,args=[url])  # submit each URL to the pool; tasks run concurrently
    p.close()
    p.join()  # wait for all child processes to finish before running the rest of the program
# Wrap the driver code in a function and call it under if __name__ == '__main__':, otherwise an error is raised (see below).
if __name__=='__main__':
    start=time.time()
    main(urls)
    end=time.time()
    print('run time is %.5f'%(end-start))
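As a side note, the apply_async loop can also be written with Pool.map, which submits the whole URL list at once and blocks until every task has finished. A minimal sketch of the same main, reusing the get_title and urls defined above:

def main(urls):
    with Pool(4) as p:          # the with-block cleans up the pool for us
        p.map(get_title, urls)  # fan the URLs out to the worker processes and wait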

Note that on Windows the driver code must be wrapped in a function and executed under if __name__=='__main__':; otherwise a RuntimeError mentioning freeze_support() is raised, as shown below:

    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

This error does not occur on Linux. The reason is that Windows has no fork call, so child processes are started by spawning a fresh interpreter that re-imports the main module.
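A minimal sketch of the idiom the error message asks for (the freeze_support() call only matters if the script is later frozen into an executable, e.g. with PyInstaller; otherwise it is a harmless no-op and can be omitted):

from multiprocessing import Pool, freeze_support

def main():
    with Pool(4) as p:
        p.map(print, range(4))  # any pool work goes here

if __name__ == '__main__':
    freeze_support()  # no-op unless the program is frozen into an executable
    main()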
The output of the multiprocess crawler shows a run time of 4.56 s, a clear improvement over the synchronous version.
The diagram below shows the execution flow of the multiprocess crawler (image from the web).

III. The coroutine approach

Since requests is not awaitable and cannot be placed after await, a dedicated library, aiohttp, is available for asynchronous HTTP requests and related functionality. It can be thought of as an asynchronous counterpart of requests and has to be installed separately: pip install aiohttp

import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
sem=asyncio.Semaphore(10)  # semaphore that caps the number of concurrent coroutines so we don't crawl too fast
async def get_title(url,header):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.request('GET',url,headers=header) as result:
                try:
                    text=await result.text()
                    soup=BeautifulSoup(text,'html.parser')
                    print(soup.find(attrs={'class':'title'}).string)
                except:
                    print('error...')

urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]

def main(urls,header):
    loop=asyncio.get_event_loop()  # get the event loop
    tasks=[get_title(url,header) for url in urls]  # build the list of coroutines
    loop.run_until_complete(asyncio.wait(tasks))  # run the coroutines until they all complete

if __name__=='__main__':
    start=time.time()
    main(urls,header)
    end=time.time()
    print('run time is %.5f'%(end-start))

The result: this version takes only 0.82 s, by far the most efficient.
Explanation:
1. A Semaphore is a synchronization primitive that limits how many coroutines are allowed to work at the same time.
2. Inside the coroutine, the page is requested through the request method of aiohttp.ClientSession().
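One caveat: passing bare coroutine objects to asyncio.wait() has been deprecated since Python 3.8 and is rejected on recent versions. On Python 3.7+ the same main() can be written with asyncio.run() and asyncio.gather(); a minimal sketch, reusing the get_title and header above:

async def crawl_all(urls, header):
    # schedule every request concurrently and wait for all of them to finish
    await asyncio.gather(*(get_title(url, header) for url in urls))

def main(urls, header):
    asyncio.run(crawl_all(urls, header))  # creates and closes the event loop for us

With this form it is safest to create the Semaphore inside the running coroutine rather than at module level, since asyncio synchronization primitives are bound to an event loop.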

Unlike the multiprocess crawler, the asynchronous crawler handles multiple tasks concurrently on a single thread: it creates just one event loop and adds every task to it. When the loop reaches a task that hits a time-consuming operation (such as requesting a URL), that task is suspended and the loop moves on to the next one; once the suspended task's state is updated (e.g. the response arrives), it is woken up and resumes from where it left off. This eliminates most of the unnecessary waiting in between.
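The following toy sketch (nothing Weibo-specific; asyncio.sleep stands in for the network wait) shows how the event loop switches between tasks at each await point, so the total wall time is close to the longest single wait rather than the sum of all waits:

import asyncio
import time

async def fake_fetch(name, delay):
    print(f'{name}: started')
    await asyncio.sleep(delay)  # the task is suspended here and the loop runs other tasks
    print(f'{name}: done after {delay}s')

async def demo():
    await asyncio.gather(*(fake_fetch(f'task{i}', 1) for i in range(3)))

start = time.time()
asyncio.run(demo())
print('total: %.2fs' % (time.time() - start))  # roughly 1 s, not 3 s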

The diagram below shows the execution flow of the coroutine crawler (image from the web).
