[Python Crawler Series] 11: Asynchronous Crawlers

Latest recommended article published on 2024-03-31 23:26:58

ZEVIN LI

Article tags: python, multithreading, programming language

Article link: https://blog.csdn.net/ai_linnglong/article/details/104612524

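Section 11: Asynchronous Crawlers
Note: Python 3.7 or later is required, because the examples use asyncio.run() and asyncio.create_task(), both of which were added in 3.7.
11.1. Introduction to async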

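The asynchronous model is the foundation of the event-driven model. An asynchronous program can run with just one main control flow, and it works on both single-core and multi-core systems. In a concurrent asynchronous model, many tasks are interleaved on the same timeline, and all of them are executed by a single control flow (one thread). A task can be suspended and later resumed; while it is suspended, the thread goes off to run other tasks.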
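
To make the "one thread, interleaved tasks" idea concrete, here is a minimal sketch. The names worker and log are made up for this illustration; asyncio.sleep(0) is used only as a way to hand control back to the event loop.

```python
import asyncio
import threading

async def worker(name, log):
    # Record which thread runs each step, then yield to the event loop.
    for step in range(2):
        log.append((name, step, threading.get_ident()))
        await asyncio.sleep(0)  # suspend here so the other task can run

async def main():
    log = []
    # Both workers share one event loop and therefore one thread.
    await asyncio.gather(worker("A", log), worker("B", log))
    return log

log = asyncio.run(main())
for name, step, thread_id in log:
    print(name, step, thread_id)
```

The steps of A and B alternate (A 0, B 0, A 1, B 1), and every line reports the same thread id: the tasks take turns on a single thread instead of running in parallel.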

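Coroutine basics:
A coroutine is a function with the following characteristics:
It contains operations that depend on I/O.
It can be suspended while an I/O operation is in progress.
It cannot be executed directly: calling it only returns a coroutine object, which must be awaited or handed to the event loop.
Its purpose is to speed up programs that perform a lot of I/O.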

Python coroutines are awaitable objects, so they can be awaited from inside other coroutines.
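
The "cannot be executed directly" point can be checked with a small sketch. It uses the standard-library helper inspect.iscoroutine; the variable names are only for illustration.

```python
import asyncio
import inspect

async def net():
    return 1

async def main():
    coro = net()                         # calling net() runs nothing yet
    is_coro = inspect.iscoroutine(coro)  # it is just a coroutine object
    value = await coro                   # awaiting is what actually runs it
    return is_coro, value

is_coro, value = asyncio.run(main())
print(is_coro, value)  # True 1
```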

Basic async code, example 1:
import asyncio

async def main():
	print("hello...")
	await asyncio.sleep(1)
	print("....world")

asyncio.run(main())

Output:

hello...
....world

import asyncio
async def net():
	return 1

async def main():
	# net()  # wrong: without await this only creates a coroutine object and never runs it
	a=await net()
	print(a) 


asyncio.run(main())

Output: 1

For readability, we can also write it with an explicit task:

import asyncio

async def net():
	return 1

async def main():
	task=asyncio.create_task(net())
	await task

asyncio.run(main())
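
As a small extension of the task version, the sketch below shows that create_task only schedules the coroutine, and that awaiting the task gives back the coroutine's return value. The short asyncio.sleep(0.1) inside net is an assumption added here so the task is visibly unfinished at first.

```python
import asyncio

async def net():
    await asyncio.sleep(0.1)  # stand-in for a short network wait
    return 1

async def main():
    task = asyncio.create_task(net())  # scheduled, but not finished yet
    done_before = task.done()
    a = await task                     # wait for it and get the return value
    return done_before, a, task.done()

done_before, a, done_after = asyncio.run(main())
print(done_before, a, done_after)  # False 1 True
```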
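Let's use sleep to simulate a time-consuming I/O operation:
import asyncio

async def hello(i):
	print("hello",i)
	await asyncio.sleep(3)  # simulated I/O; control goes back to the event loop
	print("word",i)

async def main():
	# Wrap each coroutine in a Task before waiting on it:
	# asyncio.wait() no longer accepts bare coroutines on Python 3.11+.
	tasks=[asyncio.create_task(hello(i)) for i in range(4)]
	await asyncio.wait(tasks)

if __name__ == '__main__':
	asyncio.run(main())

Note that this sleep is not the same as the time.sleep from import time that we normally use. time.sleep really stops execution and does not let any other task run, whereas asyncio.sleep only makes the current task wait and lets the other tasks run in the meantime.
Output (from the author's original run, which passed the bare coroutines to asyncio.wait on an older Python):

hello 2
hello 3
hello 1
hello 0
word 2
word 1
word 3
word 0

The output shows what asynchronous execution looks like: all four hello lines are printed before any word line, because each task hands control back at its sleep. In this run hello also did not start from 0. The older asyncio.wait collected the coroutines into a set before scheduling them, so their start order was arbitrary; with explicit tasks they start in creation order, but the order in which they finish should still not be relied on. Not being able to assume an execution order is an important characteristic of asynchronous code.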
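
The difference between the two kinds of sleep can be measured directly. This is a minimal sketch; the function names blocking, cooperative and timed are made up for the comparison, and the delays are shortened to 0.1 seconds.

```python
import asyncio
import time

async def blocking(i):
    time.sleep(0.1)           # blocks the whole thread; no other task can run

async def cooperative(i):
    await asyncio.sleep(0.1)  # suspends only this task

async def timed(worker):
    # Run three workers "concurrently" and measure the total wall time.
    start = time.perf_counter()
    await asyncio.gather(*(worker(i) for i in range(3)))
    return time.perf_counter() - start

blocking_time = asyncio.run(timed(blocking))
cooperative_time = asyncio.run(timed(cooperative))
print(f"time.sleep: {blocking_time:.2f}s, asyncio.sleep: {cooperative_time:.2f}s")
```

With time.sleep the three waits add up to roughly 0.3 seconds, while with asyncio.sleep they overlap and the total stays close to 0.1 seconds.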

Basic async crawler code
import asyncio
import aiohttp

headers={'Referer': 'http://xiaohua.zol.com.cn/lengxiaohua/',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36'}

async def crawl(i):
	url="http://xiaohua.zol.com.cn/lengxiaohua/{}.html".format(i)
	async with aiohttp.ClientSession(headers=headers) as session:
		async with session.get(url) as resp:
			print(resp.status)
			text=await resp.text()  # fetching the page HTML takes time, so the output of different pages is interleaved

	print(text)

async def main():
	# Run the nine page requests concurrently.
	await asyncio.gather(*(crawl(i) for i in range(1,10)))

if __name__ == '__main__':
	asyncio.run(main())

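Use the asynchronous HTTP library aiohttp in place of requests.
Install it with: pip install aiohttp
Add headers as before.
In essence not much changes:
the requests simply go from serial to asynchronous.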
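
Because the pages now finish in an unpredictable order, it can help to collect results with asyncio.gather, which returns them in the order the coroutines were passed in. The sketch below does not touch the network: fake_fetch is a hypothetical stand-in for session.get(...), with delays chosen so that later pages answer sooner.

```python
import asyncio

async def fake_fetch(page):
    # Hypothetical stand-in for an HTTP request: page 3 answers first.
    await asyncio.sleep(0.05 * (4 - page))
    return f"<html>page {page}</html>"

async def crawl_all(pages):
    finished = []  # records the order in which pages complete

    async def crawl(page):
        html = await fake_fetch(page)
        finished.append(page)
        return html

    # gather() returns results in the order the coroutines were passed in.
    results = await asyncio.gather(*(crawl(p) for p in pages))
    return finished, results

finished, results = asyncio.run(crawl_all([1, 2, 3]))
print(finished)  # completion order
print(results)   # input order
```

Here finished comes out as [3, 2, 1], while results still lists pages 1, 2, 3 in that order.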
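11.2. Using thread pools and process pools with async (the concurrent.futures module)
Here we combine the async material from this section with the multithreading and multiprocessing material from the previous one.
This module can be connected to async code.
It provides thread pools and process pools.
It manages concurrent programming.
It handles non-deterministic execution flows.
It provides synchronization.

10 threads: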

import asyncio
from concurrent.futures import ThreadPoolExecutor

def crawl(i):
	# Blocking work goes here; it runs in a worker thread.
	print("1")

async def main():
	loop=asyncio.get_running_loop()
	tasks=[]
	with ThreadPoolExecutor(max_workers=10) as t:
		for i in range(10):
			tasks.append(loop.run_in_executor(
				t,crawl,i
				))
		# Wait for all ten calls to finish without blocking the event loop.
		await asyncio.gather(*tasks)

if __name__ == '__main__':
	asyncio.run(main())
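
The thread-pool example only prints from each worker. The following sketch shows one way the return values of blocking functions could be collected back in the coroutine. The function blocking_crawl and its time.sleep(0.1) are assumptions standing in for a real blocking call such as requests.get.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_crawl(i):
    time.sleep(0.1)  # stand-in for a blocking request
    return i * i

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=10) as pool:
        # Each call runs in a worker thread and is wrapped in an awaitable future.
        futures = [loop.run_in_executor(pool, blocking_crawl, i) for i in range(10)]
        return await asyncio.gather(*futures)

results = asyncio.run(main())
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

run_in_executor takes any concurrent.futures executor, so a ProcessPoolExecutor can be passed in the same position for CPU-bound work, provided the function and its arguments can be pickled.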