2. Tornado Asynchronous Non-blocking Client

Latest recommended article published on 2021-02-20 16:00:43

tianv5

Article link: https://blog.csdn.net/sunt2018/article/details/88850689


Tornado asynchronous non-blocking I/O

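Prefer native coroutines written with async def and await over the older gen.coroutine decorator,
because async and await are now Python keywords (native coroutine syntax arrived in Python 3.5, and both words became reserved keywords in 3.7).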
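
To make "non-blocking" concrete, here is a minimal sketch that uses only the standard library's asyncio, so it runs without installing Tornado. The names fake_request and demo and the delays are invented for illustration, and asyncio.sleep stands in for waiting on the network. It is not Tornado's own API, only the same idea on a single thread.

```python
import asyncio

events = []  # order in which things happen, recorded for illustration only

async def fake_request(name, delay):
    # asyncio.sleep stands in for waiting on a socket; while this coroutine
    # is suspended at `await`, the event loop is free to run the other one.
    events.append(f"{name} start")
    await asyncio.sleep(delay)
    events.append(f"{name} done")
    return name

async def demo():
    # Run both "requests" concurrently on a single thread.
    return await asyncio.gather(fake_request("slow", 0.05),
                                fake_request("fast", 0.01))

results = asyncio.run(demo())
print(events)
print(results)
```

The slow request starts first but does not hold up the fast one: both start before either finishes, which is the behaviour the asynchronous client relies on.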

The synchronous client provided by Tornado

from tornado import httpclient

# Synchronous client. Rarely useful: if you want blocking calls, you may as well use requests.
http_client = httpclient.HTTPClient()
response = http_client.fetch("http://www.tornadoweb.org/en/stable/")
print(response.body.decode("utf8"))
http_client.close()

The asynchronous client provided by Tornado

from tornado import httpclient

async def f():
    # fetch() returns an awaitable; await it to get the response
    http_client = httpclient.AsyncHTTPClient()
    response = await http_client.fetch("http://www.tornadoweb.org/en/stable/")
    print(response.body.decode("utf8"))

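"""
The coroutine f() cannot simply be called: calling it only creates a coroutine object.
To actually run a coroutine, there must be an event loop.
"""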
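
The point about calling f() directly can be checked with a small sketch. It uses the standard library's asyncio instead of Tornado's IOLoop so that it runs anywhere, and the body of f here is a hypothetical stand-in that does no network I/O.

```python
import asyncio
import inspect

async def f():
    # Hypothetical stand-in for the fetch coroutine; no network involved.
    await asyncio.sleep(0)
    return "body"

coro = f()                           # nothing runs yet; this only builds a coroutine object
is_coro = inspect.iscoroutine(coro)
coro.close()                         # discard it cleanly to avoid a "never awaited" warning

result = asyncio.run(f())            # an event loop actually drives the coroutine
print(is_coro, result)
```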

Starting the event loop

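if __name__ == '__main__':
    import tornado.ioloop
    # Get the IOLoop for the current thread (the same instance is returned on every call).
    io_loop = tornado.ioloop.IOLoop.current()
    # run_sync starts the loop, runs the given coroutine, and stops the loop once it finishes
    io_loop.run_sync(f)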
    
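"""
Another way to start the event loop

Why can asyncio's event loop be used here? On Python 3 (Tornado 5.0 and later), Tornado's IOLoop is a wrapper around the asyncio event loop, so both drive the same loop.

    import asyncio
    asyncio.ensure_future(f())  # schedule f() on the event loop
    asyncio.get_event_loop().run_forever()  # start the loop; it never stops by itself
    # Note: recent Python versions deprecate relying on get_event_loop() to create a loop implicitly.

"""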
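
The difference between the two start-up styles can be sketched with asyncio alone. This is an illustration under assumptions, not Tornado code: run_until_complete plays the role that run_sync plays above, the loop is created explicitly with asyncio.new_event_loop(), and f is a hypothetical coroutine that does no I/O.

```python
import asyncio

async def f():
    await asyncio.sleep(0)
    return "finished"

loop = asyncio.new_event_loop()

# Style 1: run one coroutine to completion, then return (comparable to run_sync).
first = loop.run_until_complete(f())

# Style 2: schedule the coroutine, then run the loop until something stops it.
task = loop.create_task(f())
task.add_done_callback(lambda _t: loop.stop())  # without this, run_forever() never returns
loop.run_forever()
second = task.result()

loop.close()
print(first, second)
```

The second style suits long-running servers, where nothing is expected to stop the loop; for a one-off script, the first style avoids having to stop the loop by hand.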

Building an asynchronous crawler with AsyncHTTPClient()

"""
A simple concurrent crawler written with Tornado's asynchronous client
"""
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from tornado import gen, httpclient, ioloop, queues

base_url = "http://www.tornadoweb.org/en/stable/"
concurrency = 3

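async def get_url_links(url):
    # Fetch the page that was passed in, not a hard-coded URL
    response = await httpclient.AsyncHTTPClient().fetch(url)
    html = response.body.decode("utf8")
    # Parsing is CPU/memory work rather than I/O, so async gains little here; do it synchronously
    soup = BeautifulSoup(html, "html.parser")
    # Resolve relative hrefs against the page they were found on
    links = [urljoin(url, a.get("href")) for a in soup.find_all("a", href=True)]
    return links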
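
To see what the link-extraction step produces, here is a small offline sketch. It swaps BeautifulSoup for the standard library's html.parser so that it has no third-party dependency; the LinkParser class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

page_url = "http://www.tornadoweb.org/en/stable/"
html = (
    '<a href="guide.html">Guide</a> '
    '<a href="../latest/">Latest</a> '
    '<a href="https://github.com/tornadoweb/tornado">Source</a> '
    '<a name="top">no href</a>'
)

parser = LinkParser()
parser.feed(html)
# urljoin turns relative hrefs into absolute URLs and leaves absolute ones alone.
links = [urljoin(page_url, href) for href in parser.hrefs]
# Keep only links under the starting URL, as the crawler does with base_url.
in_site = [link for link in links if link.startswith(page_url)]
print(links)
print(in_site)
```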

async def main():
    seen_set = set()  # URLs that have already been fetched
    q = queues.Queue()  # tornado.queues is coroutine-aware; the standard library queue.Queue would block the event loop

    async def fetch_url(current_url):
        # Producer: fetch a page and enqueue the links found on it
        if current_url in seen_set:
            return
        print("fetching url", current_url)
        seen_set.add(current_url)

        next_urls = await get_url_links(current_url)
        for new_url in next_urls:
            if new_url.startswith(base_url):  # only follow links under base_url
                await q.put(new_url)  # if put/get cannot proceed, the queue yields to the event loop instead of blocking

    async def worker():
        async for url in q:  # Queue implements __aiter__ and __anext__
            if url is None:
                break
            try:
                await fetch_url(url)
            except Exception as e:
                print("exception while fetching", url, e)
            finally:
                q.task_done()  # mark this item as processed so that q.join() can return

    # Put the initial URL into the queue
    await q.put(base_url)
    # Start the worker coroutines (3 of them)
    workers = gen.multi([worker() for _ in range(concurrency)])
    await q.join()

    # One None per worker tells each of them to exit
    for _ in range(concurrency):
        await q.put(None)

    await workers

if __name__ == '__main__':
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)
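
The queue, worker and sentinel pattern used by the crawler can be tried without a network. The sketch below is an approximation under stated assumptions: it uses asyncio.Queue from the standard library in place of tornado.queues.Queue (so it loops on q.get() rather than async for), and SITE is a made-up in-memory link graph standing in for real pages.

```python
import asyncio

# Hypothetical in-memory "site": page -> links found on it (no network needed).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}
CONCURRENCY = 3

async def get_links(url):
    await asyncio.sleep(0)          # stand-in for the HTTP round trip
    return SITE.get(url, [])

async def crawl(start):
    seen = set()
    q = asyncio.Queue()

    async def worker():
        while True:
            url = await q.get()
            if url is None:         # sentinel: no more work
                return
            try:
                if url not in seen:
                    seen.add(url)
                    for link in await get_links(url):
                        await q.put(link)
            finally:
                q.task_done()       # pairs with q.get() so q.join() can finish

    await q.put(start)
    workers = [asyncio.create_task(worker()) for _ in range(CONCURRENCY)]
    await q.join()                  # returns once every queued URL was processed
    for _ in range(CONCURRENCY):
        await q.put(None)           # one sentinel per worker
    await asyncio.gather(*workers)
    return seen

seen = asyncio.run(crawl("/"))
print(sorted(seen))
```

Because a worker enqueues the new links before it calls task_done(), the count of unfinished items never reaches zero while work remains, so q.join() only returns when the whole site has been visited.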
