【aiohttp】

Latest recommended article published 2024-09-11 10:51:29

fenfyue

Article tags: python, programming language, crawler

Article link: https://blog.csdn.net/fenfyue/article/details/129617074

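Asynchronous crawler (coroutines)

Coroutines

Coroutines run on top of threads. They exist for two reasons: first, a large number of threads consumes a great deal of memory; second, switching between too many threads costs significant system time, whereas switching between coroutines happens in user space.
With 100 threads each running 100 coroutines, 10000 tasks can be handled, at a far lower cost than running 10000 threads.
Coroutines reach their full potential only when combined with asynchronous I/O, because a coroutine can still call code that blocks its thread, and that must be avoided. Another approach is to rely on native support in the programming language, such as Python's async/await syntax.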
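
To make the cost argument concrete, here is a minimal sketch that uses only the standard library. It runs 10,000 coroutines on a single thread. The names `fake_task` and `run_many` are illustrative, and `asyncio.sleep` is an assumed stand-in for a real network wait.

```python
import asyncio
import threading
import time


async def fake_task(task_id: int) -> tuple[int, int]:
    # asyncio.sleep stands in for a non-blocking network wait (an assumption
    # made for this sketch); control returns to the event loop while waiting.
    await asyncio.sleep(0.1)
    return task_id, threading.get_ident()


async def run_many(count: int) -> list[tuple[int, int]]:
    # gather() schedules every coroutine on the current event loop and
    # returns the results in the order the coroutines were passed in.
    return await asyncio.gather(*(fake_task(i) for i in range(count)))


if __name__ == "__main__":
    started = time.perf_counter()
    results = asyncio.run(run_many(10_000))
    elapsed = time.perf_counter() - started
    thread_ids = {thread_id for _, thread_id in results}
    print(f"{len(results)} tasks on {len(thread_ids)} thread(s) in {elapsed:.2f}s")
```

Every task reports the same thread id, and the total time stays close to a single wait rather than the sum of all waits.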

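asyncio, on which aiohttp is built, is a library for writing single-threaded concurrent code: it multiplexes I/O access over sockets and other resources, runs network clients and servers, and provides other related primitives.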
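
The warning about blocking calls inside a coroutine can be illustrated with a small standard-library sketch. A heartbeat coroutine records timestamps while a blocking function runs, either directly on the event loop thread or in a worker thread via `asyncio.to_thread` (Python 3.9+). The helper names `blocking_io`, `heartbeat`, and `max_gap` are hypothetical, and `time.sleep` is an assumed stand-in for any blocking call.

```python
import asyncio
import time


def blocking_io(seconds: float) -> str:
    # time.sleep blocks the whole thread, so the event loop cannot run
    # anything else while it is in progress.
    time.sleep(seconds)
    return "blocking done"


async def heartbeat(ticks: list[float], interval: float, count: int) -> None:
    # Records a timestamp every `interval` seconds; a large gap between
    # timestamps means the event loop was stalled.
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.perf_counter())


async def max_gap(offload: bool) -> float:
    ticks: list[float] = [time.perf_counter()]
    beat = asyncio.create_task(heartbeat(ticks, 0.05, 6))
    await asyncio.sleep(0)  # give the heartbeat task a chance to start
    if offload:
        # Run the blocking function in a worker thread.
        await asyncio.to_thread(blocking_io, 0.3)
    else:
        # Calling it directly freezes every coroutine on this thread.
        blocking_io(0.3)
    await beat
    # Largest interval between two consecutive timestamps.
    return max(b - a for a, b in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("direct call, largest gap:", asyncio.run(max_gap(offload=False)))
    print("to_thread,   largest gap:", asyncio.run(max_gap(offload=True)))
```

With the direct call, the heartbeat misses several beats; with `to_thread`, it keeps its regular rhythm.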

ClientSession

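aiohttp sends requests through a session, which it manages with ClientSession. A single session can serve many requests and reuses connections between them.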

async with aiohttp.ClientSession() as session:
    async with session.get('http://httpbin.org/get') as resp:
        print(resp.status)

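The required argument of get() must be a str or a yarl.URL instance. The session offers the other HTTP methods as well, such as put, delete, head, options, and patch.
Query parameters can be passed as a dict through the params argument, and aiohttp encodes them into the URL.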

params = {'key1': 'value1', 'key2': 'value2'}
async with session.get('http://httpbin.org/get',
                       params=params) as resp:
    # dicts keep insertion order, so key1 comes first in the query string
    expect = 'http://httpbin.org/get?key1=value1&key2=value2'
    assert str(resp.url) == expect
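
Because get() also accepts a yarl.URL instance, the same query string can be built ahead of time without making a request. The sketch below assumes the `yarl` package is available (it is installed as a dependency of aiohttp); the variable names are illustrative.

```python
from yarl import URL

params = {'key1': 'value1', 'key2': 'value2'}

# with_query() returns a new URL whose query string is built from the
# mapping; the original URL object is left unchanged.
base = URL('http://httpbin.org/get')
url = base.with_query(params)

print(url)                # the URL including the encoded query string
print(url.query['key1'])  # individual parameters can be read back
```

The resulting object can be passed directly, as in `session.get(url)`.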

Reading the response content

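import asyncio

import aiohttp


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/get') as resp:
            print(resp.status)
            # text() decodes the body; the encoding must be given as a string
            print(await resp.text(encoding='utf-8'))
            # read() returns the raw bytes, so it also works for non-text
            # content such as images
            await resp.read()

asyncio.run(main())

"""Output:
200
<!doctype html>
<html lang="zh-CN">
<head>
.....
"""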
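
The three ways of reading a body can be compared without depending on an external site. The following is a sketch only: it starts a throwaway local server with `aiohttp.web`, and the `/data` endpoint, the handler, and the function names are hypothetical.

```python
import asyncio

import aiohttp
from aiohttp import web


async def handle_data(request: web.Request) -> web.Response:
    # Hypothetical endpoint used only by this sketch.
    return web.json_response({"message": "hello"})


async def read_three_ways() -> tuple[int, bytes, str, dict]:
    # Start a local server on a free port (port 0 lets the OS choose one).
    app = web.Application()
    app.router.add_get("/data", handle_data)
    runner = web.AppRunner(app)
    await runner.setup()
    site = web.TCPSite(runner, "127.0.0.1", 0)
    await site.start()
    port = runner.addresses[0][1]
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(f"http://127.0.0.1:{port}/data") as resp:
                status = resp.status
                raw = await resp.read()   # bytes, no decoding
                text = await resp.text()  # str, decoded from the same body
                data = await resp.json()  # parsed JSON object
    finally:
        await runner.cleanup()
    return status, raw, text, data


if __name__ == "__main__":
    print(asyncio.run(read_three_ways()))
```

Once the body has been read, aiohttp keeps it in memory, so calling `read()`, `text()`, and `json()` on the same response does not fetch it again.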

Customizing requests

Custom headers

A request can carry custom headers so that the server treats the client as a browser.

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko)"
                  " Chrome/78.0.3904.108 Safari/537.36"
}
await session.post(url, headers=headers)
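
Headers can be set once on the session or per request. The sketch below suggests how the two combine, using a local echo server so it runs offline. The `/echo` endpoint, the `X-Trace` header, and the `my-crawler/1.0` value are all assumptions made for illustration.

```python
import asyncio

import aiohttp
from aiohttp import web


async def echo_headers(request: web.Request) -> web.Response:
    # Hypothetical endpoint: report two request headers back to the client.
    return web.json_response({
        "User-Agent": request.headers.get("User-Agent"),
        "X-Trace": request.headers.get("X-Trace"),
    })


async def demo_headers() -> tuple[dict, dict]:
    app = web.Application()
    app.router.add_post("/echo", echo_headers)
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "127.0.0.1", 0).start()
    url = f"http://127.0.0.1:{runner.addresses[0][1]}/echo"
    defaults = {"User-Agent": "my-crawler/1.0"}
    try:
        async with aiohttp.ClientSession(headers=defaults) as session:
            # Only the session-level defaults are sent.
            async with session.post(url) as resp:
                first = await resp.json()
            # Per-request headers are merged with the defaults and
            # replace any default that has the same name.
            override = {"User-Agent": "Mozilla/5.0", "X-Trace": "abc"}
            async with session.post(url, headers=override) as resp:
                second = await resp.json()
    finally:
        await runner.cleanup()
    return first, second


if __name__ == "__main__":
    print(asyncio.run(demo_headers()))
```

Setting the browser-like User-Agent on the session avoids repeating it on every call, while one-off headers stay on the individual request.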

Custom cookies

url = 'http://httpbin.org/cookies'
cookies = {'cookies_are': 'working'}
# Cookies passed to the session are sent with every request it makes
async with aiohttp.ClientSession(cookies=cookies) as session:
    async with session.get(url) as resp:
        assert await resp.json() == {
            "cookies": {"cookies_are": "working"}}

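A session also remembers cookies that the server sets. The sketch below is one way to observe this offline, under several assumptions: the local `/cookies` endpoint and the `visited` cookie are hypothetical, and `aiohttp.CookieJar(unsafe=True)` is used because the default jar does not store cookies received from IP-address hosts such as 127.0.0.1.

```python
import asyncio

import aiohttp
from aiohttp import web


async def echo_cookies(request: web.Request) -> web.Response:
    # Hypothetical endpoint: echo the received cookies and set one more.
    response = web.json_response({"cookies": dict(request.cookies)})
    response.set_cookie("visited", "yes")
    return response


async def demo_cookies() -> tuple[dict, dict]:
    app = web.Application()
    app.router.add_get("/cookies", echo_cookies)
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "127.0.0.1", 0).start()
    url = f"http://127.0.0.1:{runner.addresses[0][1]}/cookies"
    try:
        # unsafe=True lets the jar keep cookies set by an IP-address host.
        jar = aiohttp.CookieJar(unsafe=True)
        async with aiohttp.ClientSession(
            cookies={"cookies_are": "working"}, cookie_jar=jar
        ) as session:
            async with session.get(url) as resp:
                first = await resp.json()
            # The cookie set by the first response is sent back here.
            async with session.get(url) as resp:
                second = await resp.json()
    finally:
        await runner.cleanup()
    return first, second


if __name__ == "__main__":
    print(asyncio.run(demo_cookies()))
```

The first response reflects only the cookie supplied by the client; the second also carries the cookie that the server set.
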
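Using aiohttp with asyncio

asyncio serves as the foundation for many high-performance Python asynchronous frameworks, including network and web servers, database connection libraries, and distributed task queues. It is often the best choice for I/O-bound and high-level structured network code.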

import asyncio
from datetime import datetime

import aiohttp
from lxml import etree
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit"
                         "/537.36 (KHTML, like Gecko) "
                         "Chrome/72.0.3626.121 Safari/537.36"}


async def get_movie_url():
    """Fetch the Douban chart page and return the movie URLs it links to."""
    req_url = "https://movie.douban.com/chart"
    # Headers set on the session are sent with every request made through it
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url=req_url) as response:
            result = await response.text()
            result = etree.HTML(result)
        return result.xpath("//*[@id='content']/div/div[1]/div/div/table/tr/td/a/@href")


async def get_movie_content(movie_url):
    """Fetch one movie page and extract its name and author fields."""
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url=movie_url) as response:
            result = await response.text()
            result = etree.HTML(result)
        movie = dict()
        # xpath() returns a list of matching text nodes
        name = result.xpath('//*[@id="content"]/h1/span[1]//text()')
        author = result.xpath('//*[@id="info"]/span[1]/span[2]//text()')
        movie["name"] = name
        movie["author"] = author
    return movie

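async def main():
    movie_url_list = await get_movie_url()
    # Create one coroutine per movie page and run them all concurrently
    tasks = [get_movie_content(url) for url in movie_url_list]
    return await asyncio.gather(*tasks)


if __name__ == '__main__':
    start = datetime.now()
    # asyncio.run() creates and closes the event loop itself; it replaces the
    # older get_event_loop() / run_until_complete() pattern
    movies = asyncio.run(main())
    print(movies)
    print("Async elapsed time: {}".format(datetime.now() - start))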
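
The script above starts a request for every movie URL at the same moment. When the list is long, it can be useful to cap the number of requests in flight. The following standard-library sketch shows one possible approach with `asyncio.Semaphore`. The names `fetch_one` and `crawl`, the `example.com` URLs, and the `asyncio.sleep` call standing in for `session.get` are all assumptions for illustration.

```python
import asyncio


async def fetch_one(url: str, limiter: asyncio.Semaphore, stats: dict) -> str:
    # The semaphore admits at most `limit` coroutines into this block at once.
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        # Stand-in for an aiohttp request (assumption for this sketch).
        await asyncio.sleep(0.01)
        stats["active"] -= 1
    return f"content of {url}"


async def crawl(urls: list[str], limit: int) -> tuple[list[str], int]:
    limiter = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    # gather() still returns results in the same order as `urls`.
    pages = await asyncio.gather(*(fetch_one(u, limiter, stats) for u in urls))
    return pages, stats["peak"]


if __name__ == "__main__":
    # Placeholder URLs, not real endpoints.
    urls = [f"https://example.com/movie/{i}" for i in range(20)]
    pages, peak = asyncio.run(crawl(urls, limit=5))
    print(len(pages), "pages fetched, peak concurrency", peak)
```

In a real crawler the semaphore would wrap the `session.get(...)` block, ideally with a single ClientSession shared by all tasks.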