Python coroutine crawler tutorial: Python study notes on ordinary crawlers and coroutine crawlers

Latest recommended article published on 2021-02-07 16:49:12

weixin_39618456

Tags: Python coroutine crawler tutorial

Ordinary crawlers and coroutine crawlers

Ordinary crawler logic:

import time

import requests
from lxml import etree

urls = [
    'https://henan.qq.com/a/20190822/001115.htm',
    'https://henan.qq.com/a/20190822/001128.htm',
    'https://henan.qq.com/a/20190822/001086.htm',
    'https://henan.qq.com/a/20190822/001764.htm',
    'https://henan.qq.com/a/20190822/001163.htm',
    'https://henan.qq.com/a/20190822/001169.htm',
    'https://henan.qq.com/a/20190822/001196.htm',
    'https://henan.qq.com/a/20190822/001278.htm'
]

def get_titles(url, cnt):
    # Fetch the page, then extract the headline text with XPath.
    response = requests.get(url)
    html = response.content
    title = etree.HTML(html).xpath('//*[@id="Main-Article-QQ"]/div[2]/div[1]/div[2]/div[1]/h1/text()')
    print('Title %d: %s' % (cnt, ''.join(title)))

if __name__ == '__main__':
    start1 = time.time()
    i = 0
    # Requests run one after another, so the total time is the sum of all of them.
    for url in urls:
        i = i + 1
        start = time.time()
        get_titles(url, i)
        print('Fetching title %d took %.5f s' % (i, time.time() - start))
    print('Total crawl time: %.5f s' % (time.time() - start1))

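The get_titles() function first uses the requests module to send a GET request and fetch the HTML source of the page.

It then uses xpath from etree to parse out the content we want.

The string passed to xpath('') is an XPath expression, which pinpoints the content to extract. In the browser's element inspector you can right-click a node and choose Copy -> Copy XPath to copy the expression, then append /text() to the end to select the node's text.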
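To make the XPath step concrete without touching the network, here is a minimal sketch that runs an expression against an inline HTML string. The markup in SAMPLE_HTML and the shortened expression are assumptions made for illustration; they are not the structure of the real pages crawled above.

```python
from lxml import etree

# Hypothetical page fragment; real pages will be nested differently.
SAMPLE_HTML = """
<html><body>
  <div id="Main-Article-QQ">
    <div class="hd"><h1>Sample headline</h1></div>
    <p>Body text</p>
  </div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# text() selects the text node inside the matched <h1> element.
title = tree.xpath('//*[@id="Main-Article-QQ"]//h1/text()')

# An expression that matches nothing yields an empty list, not an error.
missing = tree.xpath('//*[@id="no-such-id"]//h1/text()')

print(''.join(title))
print(repr(''.join(missing)))
```

Because xpath() returns a list, ''.join(...) turns a failed match into an empty string. A blank title in the crawler output therefore usually means the copied expression no longer matches the page layout.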

Coroutine crawler

import asyncio
import time

import aiohttp
from lxml import etree

urls = [
    'https://henan.qq.com/a/20190822/001115.htm',
    'https://henan.qq.com/a/20190822/001128.htm',
    'https://henan.qq.com/a/20190822/001086.htm',
    'https://henan.qq.com/a/20190822/001764.htm',
    'https://henan.qq.com/a/20190822/001163.htm',
    'https://henan.qq.com/a/20190822/001169.htm',
    'https://henan.qq.com/a/20190822/001196.htm',
    'https://henan.qq.com/a/20190822/001278.htm',
]

async def get_title(url, sem):
    # The semaphore caps how many requests are in flight at the same time.
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.request('GET', url) as resp:
                html = await resp.read()
                title = etree.HTML(html).xpath('//*[@id="Main-Article-QQ"]/div[2]/div[1]/div[2]/div[1]/h1/text()')
                print(''.join(title))

async def main():
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(10)
    tasks = [get_title(url, sem) for url in urls]
    # gather() runs all coroutines concurrently and waits for them to finish.
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    start = time.time()
    asyncio.run(main())
    print('Total time: %.5f s' % (time.time() - start))
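The crawler above creates asyncio.Semaphore(10) so that at most ten requests run at once. The sketch below isolates that idea using only the standard library: asyncio.sleep stands in for the network request, and the names worker, run and state are hypothetical helpers introduced for this illustration.

```python
import asyncio

async def worker(sem, state):
    # Only `limit` workers can be inside this block at the same time.
    async with sem:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for a network request
        state["running"] -= 1

async def run(n_tasks, limit):
    sem = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(n_tasks)))
    return state["peak"]

# 20 tasks are started together, but the semaphore admits 3 at a time.
peak = asyncio.run(run(n_tasks=20, limit=3))
print("peak concurrency:", peak)
```

The recorded peak never exceeds the limit passed to the semaphore, which is how the crawler avoids flooding the target site when the URL list grows.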

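The request methods of the requests library are not awaitable, so they cannot be placed after await. Calling them directly inside a coroutine would block the event loop until the response arrives, so requests cannot be used as-is for asynchronous fetching.

To solve this, the third-party aiohttp library provides asynchronous HTTP requests, among other features.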
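The difference between a blocking call and an awaitable one can be shown without any network access. In this sketch time.sleep stands in for a blocking requests.get() call and asyncio.sleep stands in for an awaitable aiohttp request; both stand-ins and the helper names blocking_fetch, awaitable_fetch and timed are assumptions for illustration only.

```python
import asyncio
import time

async def blocking_fetch(delay):
    # Blocks the whole event loop, like calling requests.get() in a coroutine.
    time.sleep(delay)

async def awaitable_fetch(delay):
    # Yields control to the event loop while waiting, like an aiohttp request.
    await asyncio.sleep(delay)

async def timed(fetch, n=5, delay=0.1):
    # Run n "requests" concurrently and measure the wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*(fetch(delay) for _ in range(n)))
    return time.perf_counter() - start

blocking_elapsed = asyncio.run(timed(blocking_fetch))
awaitable_elapsed = asyncio.run(timed(awaitable_fetch))

print("blocking:  %.2f s" % blocking_elapsed)
print("awaitable: %.2f s" % awaitable_elapsed)
```

The blocking version takes roughly the sum of all delays because each call holds the event loop, while the awaitable version takes roughly one delay because the waits overlap.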

Requesting a web page with the aiohttp module:

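import asyncio

import aiohttp

async def main():
    # async with is only valid inside a coroutine function.
    async with aiohttp.ClientSession() as session:
        async with session.get('http://test.com') as resp:
            print(resp.status)
            print(await resp.text())

asyncio.run(main())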

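Here, async with is the statement that enters an asynchronous context manager, that is, a context manager whose setup and teardown steps can themselves await.

In the aiohttp snippet, the ClientSession instance is bound to the name session and the ClientResponse object to resp. ClientSession is a class, not a method. The only required argument of ClientSession.get() is a URL. resp.text() then returns the full body of the page as text.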
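To show what async with does under the hood, the following sketch defines a small asynchronous context manager using only the standard library. FakeSession is a hypothetical class written for this illustration; it is not part of aiohttp and only records the order in which its hooks run.

```python
import asyncio

class FakeSession:
    """Hypothetical stand-in for a session object; not an aiohttp class."""

    def __init__(self):
        self.events = []

    async def __aenter__(self):
        # Runs on entering `async with`; may await, e.g. to open a connection.
        await asyncio.sleep(0)
        self.events.append("enter")
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Runs on leaving the block, even after an exception; may await cleanup.
        await asyncio.sleep(0)
        self.events.append("exit")
        return False  # do not suppress exceptions

async def main():
    session = FakeSession()
    async with session as s:
        s.events.append("body")
    return session.events

events = asyncio.run(main())
print(events)  # ['enter', 'body', 'exit']
```

An object that defines __aenter__ and __aexit__ as coroutines can be used with async with, so cleanup such as closing a connection runs reliably when the block exits.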
