Crawlers - High-Performance Crawlers - Reducing IO Blocking

LSYHhhhh

Category: Crawlers

Article link: https://blog.csdn.net/qq_33961117/article/details/86626950

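This article looks at how tools such as asyncio, gevent, grequests, twisted, and tornado can reduce IO blocking in web crawlers. It covers how to make HTTP requests with the asyncio module, including building the request headers by hand and delegating them to aiohttp or requests.get, and touches on how gevent and tornado help with IO blocking.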

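1. asyncio module - detecting network IO and switching at the application level

1-1 Basic usage - TCP-level requests only

1-2 Making asyncio support HTTP requests

1-2-1 Building the request headers by hand

1-2-2 Letting aiohttp build the headers

1-2-3 Letting requests.get build the headers

2. gevent module - reducing IO blocking

3. grequests module

4. twisted framework - automatic IO detection and switching

5. tornado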

1. asyncio module - detecting network IO and switching at the application level

1-1 Basic usage - TCP-level requests only

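import asyncio

# Note: generator-based coroutines (@asyncio.coroutine / yield from) were removed in
# Python 3.11, as was passing bare coroutines to asyncio.wait; this form runs on Python 3.4-3.10.
@asyncio.coroutine
def task(task_id,seconds):
    print('%s is start' %task_id)
    yield from asyncio.sleep(seconds) # Stands in for network IO; asyncio only detects network IO, and switches to another task when it does
    print('%s is end' %task_id)

tasks=[task(task_id="task 1",seconds=3),task("task 2",2),task(task_id="task 3",seconds=1)]
# Submit several tasks to the event loop; IO switching happens inside a single process
loop=asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
loop.close()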



# Same as above, written a second way
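The second version is not shown here. As a minimal sketch of what an equivalent could look like, assuming Python 3.7+ and the native `async`/`await` syntax (the `main` function and the shortened sleep times are choices made for this illustration, not taken from the article):

```python
import asyncio

async def task(task_id, seconds):
    print('%s is start' % task_id)
    # await hands control back to the event loop while the (simulated) IO is pending
    await asyncio.sleep(seconds)
    print('%s is end' % task_id)
    return task_id

async def main():
    # gather runs the three coroutines concurrently and returns results in argument order
    return await asyncio.gather(
        task("task 1", 0.3),
        task("task 2", 0.2),
        task("task 3", 0.1),
    )

# asyncio.run creates the event loop, runs main(), and closes the loop afterwards
results = asyncio.run(main())
```

The tasks start in the order given and finish in order of their sleep time, while `results` keeps the order in which the coroutines were passed to `gather`.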

1-2 Making asyncio support HTTP requests

1-2-1 Building the request headers by hand

Note: if a request needs SSL, make sure the pyopenssl module is installed.
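Since asyncio's stream API works at the TCP level, the HTTP request text has to be assembled by hand before it is written to the connection. The following is a rough, self-contained sketch of that idea, written with `async`/`await` (Python 3.7+) rather than `yield from`. To avoid depending on an external site it talks to a throwaway server on 127.0.0.1; `build_request`, `handle`, `fetch`, and the `demo-agent` string are hypothetical names used only for this illustration.

```python
import asyncio

def build_request(host, url="/", user_agent="demo-agent"):
    # An HTTP/1.0 GET: request line, headers, then an empty line ends the head
    return ("GET %s HTTP/1.0\r\nHost: %s\r\nUser-Agent: %s\r\n\r\n"
            % (url, host, user_agent)).encode("utf-8")

async def handle(reader, writer):
    # Hypothetical local server: read the request head, answer with a fixed body
    await reader.readuntil(b"\r\n\r\n")
    body = b"hello"
    writer.write(b"HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body) + body)
    await writer.drain()
    writer.close()

async def fetch(host, port, url="/"):
    # Open a plain TCP connection, then speak HTTP over it manually
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(build_request(host, url))  # hands the bytes to the transport buffer
    await writer.drain()                    # waits until the buffer has been flushed
    raw = await reader.read()               # HTTP/1.0: read until the server closes
    writer.close()
    return raw

async def main():
    # Port 0 asks the OS for any free port
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    try:
        return await fetch("127.0.0.1", port)
    finally:
        server.close()

raw = asyncio.run(main())
head, _, body = raw.partition(b"\r\n\r\n")
```

Splitting the raw bytes on the first empty line separates the response head from the body, which is the same parsing step a hand-rolled client needs before it can pass the page to a callback.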

import asyncio
import requests
import uuid

user_agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'

def parse_page(host,res):
    print('%s parsed, response length %s' %(host,len(res)))
    with open('%s.html' %(uuid.uuid1()),'wb') as f:
        f.write(res)

@asyncio.coroutine
def get_page(host,port=80,url='/',callback=parse_page,ssl=False):
    print('Downloading http://%s:%s%s' %(host,port,url))

    # Step 1 (blocking IO): open the TCP connection; this blocks, so it needs yield from
    # If the page is served over SSL, switch to the HTTPS port
    if ssl:
        port=443
    recv,send=yield from asyncio.open_connection(host=host,port=port,ssl=ssl)

    # Step 2: build the HTTP request headers. asyncio only sends raw TCP data here, so the HTTP message has to be assembled by hand
    request_headers="""GET %s HTTP/1.0\r\nHost: %s\r\nUser-agent: %s\r\n\r\n""" %(url,host,user_agent)
    # request_headers="""POST %s HTTP/1.0\r\nHost: %s\r\n\r\nname=egon&password=123""" % (url, host,)
    request_headers=request_headers.encode('utf-8')

    # Step 3 (blocking IO): send the HTTP request (the header data)
    # Hand the data over to the operating system
    send.write(request_headers)
    # Send the data