Comparing Crawlers: Synchronous, Multiprocessing, and Coroutine Approaches

Using Python, we write a crawler that fetches the titles of the top ten Weibo hot searches, implemented three ways: synchronously, with multiprocessing, and with coroutines. We compare how the implementations differ; their execution efficiency, of course, also differs dramatically.

I. Synchronous approach: crawl and print the hot-search titles one by one

This approach is the simplest: each page is fetched and processed one after another, so it is also the least efficient.

import time
import requests
from bs4 import BeautifulSoup

def get_title(url):
    try:
        # Sina Weibo has anti-crawling measures, so a header dict carrying cookie
        # information is required; press F12 on the page to view and copy it.
        header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
        r=requests.get(url,timeout=30,headers=header)  # fetch the page
        r.encoding=r.apparent_encoding  # use the detected encoding
        soup=BeautifulSoup(r.text,'html.parser')  # parse the page with BeautifulSoup
        print(soup.find(attrs={'class':'title'}).string)  # print the string of the tag with class 'title', i.e. the page title
    except:
        print('error')
# list of page URLs
urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]
# process the URLs one by one
def main(urls):
    for url in urls:
        get_title(url)

start=time.time()
main(urls)
end=time.time()
print('run time is %.5f'%(end-start))  # print the elapsed time

Running this prints the titles of the top ten Weibo hot searches; the elapsed time is 7.73 s.

II. Next, implement the crawler with multiprocessing

Now we use multiprocessing.Pool to build a multiprocess crawler. Since my computer's CPU has 4 cores, I set the process pool size to 4 (p=Pool(4)).
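If you would rather not hard-code the number 4, a minimal sketch (standard library only) is to ask the OS for the core count at runtime and size the pool to match:

import multiprocessing

n_cores = multiprocessing.cpu_count()  # number of logical CPU cores reported by the OS
p = multiprocessing.Pool(n_cores)      # one worker process per core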

import multiprocessing
from multiprocessing import Pool
import time
import requests
from bs4 import BeautifulSoup

def get_title(url):
    try:
        header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
        r=requests.get(url,timeout=30,headers=header)
        r.encoding=r.apparent_encoding
        soup=BeautifulSoup(r.text,'html.parser')
        print(soup.find(attrs={'class':'title'}).string)
    except:
        print('error...')

urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]

def main(urls):
    p=Pool(4)  # create a pool of 4 worker processes
    for url in urls:
        p.apply_async(get_title,args=[url])  # submit each URL to the pool; tasks run concurrently
    p.close()
    p.join()  # wait for all child processes to finish before running the rest of the program
# Wrap the driver code in a function and call it under if __name__ == '__main__':, otherwise an error is raised (see below).
if __name__=='__main__':
    start=time.time()
    main(urls)
    end=time.time()
    print('run time is %.5f'%(end-start))
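As a side note, the apply_async loop can also be written with Pool.map, which submits the whole URL list at once and blocks until every task has finished. A minimal sketch of the same main, reusing the get_title and urls defined above:

def main(urls):
    with Pool(4) as p:          # the with-block cleans up the pool for us
        p.map(get_title, urls)  # fan the URLs out to the worker processes and wait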

Note that on Windows the driver code must be wrapped in a function and executed under if __name__=='__main__':; otherwise a RuntimeError mentioning freeze_support() is raised, as shown below:

    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

This error does not occur on Linux. The reason is that Windows has no fork call, so child processes are started by spawning a fresh interpreter that re-imports the main module.
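A minimal sketch of the idiom the error message asks for (the freeze_support() call only matters if the script is later frozen into an executable, e.g. with PyInstaller; otherwise it is a harmless no-op and can be omitted):

from multiprocessing import Pool, freeze_support

def main():
    with Pool(4) as p:
        p.map(print, range(4))  # any pool work goes here

if __name__ == '__main__':
    freeze_support()  # no-op unless the program is frozen into an executable
    main()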
The output of the multiprocess crawler shows a run time of 4.56 s, a clear improvement over the synchronous version.
The diagram below shows the execution flow of the multiprocess crawler (image from the web).

III. The coroutine approach

Since requests is not awaitable and cannot be placed after await, a dedicated library, aiohttp, is available for asynchronous HTTP requests and related functionality. It can be thought of as an asynchronous counterpart of requests and has to be installed separately: pip install aiohttp

import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
sem=asyncio.Semaphore(10)  # semaphore that caps the number of concurrent coroutines so we don't crawl too fast
async def get_title(url,header):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.request('GET',url,headers=header) as result:
                try:
                    text=await result.text()
                    soup=BeautifulSoup(text,'html.parser')
                    print(soup.find(attrs={'class':'title'}).string)
                except:
                    print('error...')

urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]

def main(urls,header):
    loop=asyncio.get_event_loop()  # get the event loop
    tasks=[get_title(url,header) for url in urls]  # build the list of coroutines
    loop.run_until_complete(asyncio.wait(tasks))  # run the coroutines until they all complete

if __name__=='__main__':
    start=time.time()
    main(urls,header)
    end=time.time()
    print('run time is %.5f'%(end-start))

The result: this version takes only 0.82 s, by far the most efficient.
Explanation:
1. A Semaphore is a synchronization primitive that limits how many coroutines are allowed to work at the same time.
2. Inside the coroutine, the page is requested through the request method of aiohttp.ClientSession().
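One caveat: passing bare coroutine objects to asyncio.wait() has been deprecated since Python 3.8 and is rejected on recent versions. On Python 3.7+ the same main() can be written with asyncio.run() and asyncio.gather(); a minimal sketch, reusing the get_title and header above:

async def crawl_all(urls, header):
    # schedule every request concurrently and wait for all of them to finish
    await asyncio.gather(*(get_title(url, header) for url in urls))

def main(urls, header):
    asyncio.run(crawl_all(urls, header))  # creates and closes the event loop for us

With this form it is safest to create the Semaphore inside the running coroutine rather than at module level, since asyncio synchronization primitives are bound to an event loop.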

Unlike the multiprocess crawler, the asynchronous crawler handles multiple tasks concurrently on a single thread: it creates just one event loop and adds every task to it. When the loop reaches a task that hits a time-consuming operation (such as requesting a URL), that task is suspended and the loop moves on to the next one; once the suspended task's state is updated (e.g. the response arrives), it is woken up and resumes from where it left off. This eliminates most of the unnecessary waiting in between.
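The following toy sketch (nothing Weibo-specific; asyncio.sleep stands in for the network wait) shows how the event loop switches between tasks at each await point, so the total wall time is close to the longest single wait rather than the sum of all waits:

import asyncio
import time

async def fake_fetch(name, delay):
    print(f'{name}: started')
    await asyncio.sleep(delay)  # the task is suspended here and the loop runs other tasks
    print(f'{name}: done after {delay}s')

async def demo():
    await asyncio.gather(*(fake_fetch(f'task{i}', 1) for i in range(3)))

start = time.time()
asyncio.run(demo())
print('total: %.2fs' % (time.time() - start))  # roughly 1 s, not 3 s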

The diagram below shows the execution flow of the coroutine crawler (image from the web).
