Recently I needed to write a crawler for a site, so I tried implementing it with aiohttp plus asyncio, drawing on some related material I found online.
Round 1: the "concurrent" async code surprisingly runs just like synchronous code
The code is as follows:
import asyncio
import random

import aiohttp
from lxml import etree

# FBASE_URL and parse() are defined elsewhere in the crawler and omitted here.

async def fetch_get(session, url):
    # Random delay before each request; asyncio.sleep must be awaited,
    # otherwise the coroutine is created and discarded without ever pausing.
    await asyncio.sleep(random.randint(3, 6))
    # print('get:', url)
    async with session.get(url) as response:
        return await response.text(encoding='utf-8')

async def result_get(session, url):
    pass

async def fetch_main():
    async with aiohttp.ClientSession() as session:
        shelf_text = await fetch_get(session, FBASE_URL)
        shelf_html = etree.HTML(shelf_text)
        urls = parse(shelf_html)
        for url in urls:
            # Awaiting inside the loop serializes the requests.
            await result_get(session, url)

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())
After running it, the URLs are still fetched one by one; there is no concurrency at all. The reason is that each result_get call is awaited inside the for loop, so the next request cannot start until the previous one has finished.
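To see why this is slow, here is a minimal, self-contained sketch (no network; asyncio.sleep stands in for a request): awaiting each coroutine inside the loop makes the total time the sum of the individual delays rather than roughly the longest one.

import asyncio
import time

async def fake_fetch(i):
    await asyncio.sleep(1)  # stand-in for one network request
    return i

async def main():
    start = time.monotonic()
    results = []
    for i in range(3):
        results.append(await fake_fetch(i))  # each await blocks until the previous call finishes
    print(results, round(time.monotonic() - start, 1))  # roughly 3 seconds, not 1

loop = asyncio.get_event_loop()
loop.run_until_complete(main())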
Round 2: add all the tasks first, then run them concurrently
async def fetch_get(session, url):
    await asyncio.sleep(random.randint(3, 6))
    # print('get:', url)
    async with session.get(url) as response:
        return await response.text(encoding='utf-8')

async def result_get(session, url):
    pass

async def fetch_main():
    async with aiohttp.ClientSession() as session:
        shelf_text = await fetch_get(session, FBASE_URL)
        shelf_html = etree.HTML(shelf_text)
        urls = parse(shelf_html)
        tasks = []
        for url in urls:
            # Schedule every request up front without awaiting it yet.
            task = asyncio.ensure_future(result_get(session, url))
            tasks.append(task)
        # All scheduled tasks now run concurrently.
        await asyncio.wait(tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())
In this mode, if urls contains many entries, every task is scheduled immediately, so the crawler keeps firing off requests with no limit on how many are in flight at once.
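If the goal is simply to keep the number of in-flight requests bounded, one alternative sketch (not what this post ends up using) is to wrap each request in an asyncio.Semaphore and gather everything; MAX_CONCURRENCY and the helper names below are illustrative assumptions:

import asyncio
import aiohttp

MAX_CONCURRENCY = 30  # assumed limit, tune for the target site

async def bounded_get(sem, session, url):
    async with sem:  # at most MAX_CONCURRENCY holders at any moment
        async with session.get(url) as response:
            return await response.text(encoding='utf-8')

async def crawl(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(bounded_get(sem, session, u)) for u in urls]
        return await asyncio.gather(*tasks)

All tasks are still created up front, but only MAX_CONCURRENCY of them can be inside the session.get block at the same time.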
Round 3: add tasks in batches and run each batch concurrently
sem = asyncio.Semaphore(30)

async def fetch_get(session, url):
    await asyncio.sleep(random.randint(3, 6))
    async with sem:
        # The semaphore caps the number of requests in flight at 30.
        async with session.get(url) as response:
            return await response.text(encoding='utf-8')

async def result_get(session, url):
    pass

async def fetch_main():
    async with sem:
        async with aiohttp.ClientSession() as session:
            shelf_text = await fetch_get(session, FBASE_URL)
            shelf_html = etree.HTML(shelf_text)
            urls = parse(shelf_html)
            tasks = []
            part_tasks = []
            for index, url in enumerate(urls):
                if index % 15 == 0 and index > 0:
                    # Every 15 URLs: wait for the current batch to finish,
                    # pause 240 seconds, then start a fresh batch.
                    await asyncio.wait(part_tasks)
                    await asyncio.sleep(240)
                    part_tasks = []
                task = asyncio.ensure_future(result_get(session, url))
                tasks.append(task)
                part_tasks.append(task)
            await asyncio.wait(tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())
This is the mode I am using now: after every 15 URLs have been added, that batch runs asynchronously, the loop then waits 240 seconds before starting the next batch, and the semaphore caps the total number of concurrent connections at 30.
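As a side note, aiohttp can also cap concurrency on its own: TCPConnector takes a limit argument, so the session itself will never open more than that many connections at once. This is only an alternative sketch, not the approach used above:

import asyncio
import aiohttp

async def fetch_all(urls):
    # The connector limits simultaneous connections regardless of how many tasks exist.
    connector = aiohttp.TCPConnector(limit=30)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def get(url):
            async with session.get(url) as response:
                return await response.text(encoding='utf-8')
        return await asyncio.gather(*(get(u) for u in urls))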
Another option is to have one async task collect URLs and store them in a database, while a second async task reads URLs from the database and fetches the results. The result-fetching task can run a while loop that pulls a batch of URLs from the database on each iteration and schedules them as async tasks, until every URL in the database has been processed.
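A rough sketch of that idea, reusing result_get from above and two hypothetical database helpers (db_fetch_pending_urls and db_mark_done), which would need to be implemented against whatever database is actually used:

import asyncio
import aiohttp

BATCH_SIZE = 15  # assumed batch size

async def result_worker():
    async with aiohttp.ClientSession() as session:
        while True:
            urls = await db_fetch_pending_urls(BATCH_SIZE)  # hypothetical: pull a batch of unprocessed URLs
            if not urls:
                break  # nothing left in the database, stop the worker
            tasks = [asyncio.ensure_future(result_get(session, u)) for u in urls]
            await asyncio.wait(tasks)
            await db_mark_done(urls)  # hypothetical: mark this batch as processed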