500lines crawler study notes (part 2)

格物致理，

Category: python

Article link: https://blog.csdn.net/qiuqiuit/article/details/86671749

1. First, a look at how two functions are used:

normalized = urllib.parse.urljoin(response.url, url)
defragmented, frag = urllib.parse.urldefrag(normalized)

Reference article:

https://pythoncaff.com/docs/pymotw/urllibparse-split-urls-into-components/149

urllib.parse.urljoin(base, url, allow_fragments=True)

It builds an absolute URL by resolving a relative reference against a base URL.

from urllib.parse import urljoin

print(urljoin('http://www.example.com/path/file.html', 'anotherfile.html'))
print(urljoin('http://www.example.com/path/file.html', '../anotherfile.html'))
print(urljoin('http://www.example.com/path/', '/subpath/file.html'))
print(urljoin('http://www.example.com/path/', 'subpath/file.html'))

Output:

http://www.example.com/path/anotherfile.html

http://www.example.com/anotherfile.html

http://www.example.com/subpath/file.html

http://www.example.com/path/subpath/file.html
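A crawler also meets links that are not plain relative paths. The sketch below runs a few such cases through urljoin; the host names other.example.org and cdn.example.com are made up for illustration and do not come from the crawler code.

```python
from urllib.parse import urljoin

base = 'http://www.example.com/path/file.html'

# An absolute URL with its own scheme replaces the base entirely.
absolute = urljoin(base, 'https://other.example.org/page.html')
# A scheme-relative reference keeps only the scheme of the base.
scheme_relative = urljoin(base, '//cdn.example.com/lib.js')
# A fragment-only reference keeps the whole base and appends the fragment.
fragment_only = urljoin(base, '#section')
# An empty reference returns the base unchanged.
empty = urljoin(base, '')

for result in (absolute, scheme_relative, fragment_only, empty):
    print(result)
```

The fragment-only case is the reason the crawler follows urljoin with urldefrag: without that step, every in-page anchor would look like a new URL.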

urllib.parse.urldefrag(url)

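If the URL contains a fragment identifier, urldefrag returns a modified URL with the fragment removed, together with the fragment identifier as a separate string.

If the URL has no fragment identifier, it returns the URL unchanged and an empty string.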

from urllib.parse import urldefrag

original = 'http://netloc/path;param?query=arg#frag'
print('original:', original)
d = urldefrag(original)
print('url     :', d.url)
print('fragment:', d.fragment)

Output:

original: http://netloc/path;param?query=arg#frag

url     : http://netloc/path;param?query=arg

fragment: frag
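Putting the two calls together gives the link normalization step shown at the top of this section. The following is a minimal sketch: normalize_link is a hypothetical helper name, and a plain string stands in for response.url.

```python
from urllib.parse import urljoin, urldefrag


def normalize_link(page_url, href):
    """Resolve href against page_url, then split off any fragment."""
    normalized = urljoin(page_url, href)
    defragmented, frag = urldefrag(normalized)
    return defragmented, frag


page = 'http://www.example.com/path/file.html'

with_fragment = normalize_link(page, '../anotherfile.html#top')
without_fragment = normalize_link(page, 'anotherfile.html')
print(with_fragment)
print(without_fragment)

# Links that differ only in their fragment collapse to a single URL,
# so the same page is not queued more than once.
seen = {normalize_link(page, href)[0] for href in ('a.html#x', 'a.html#y', 'a.html')}
print(seen)
```

The second call shows the no-fragment case: the URL comes back unchanged and the fragment is an empty string.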

2. About aiohttp

Reference article:

https://segmentfault.com/a/1190000012141784

Basic usage:

import asyncio

import aiohttp

async def fetch(session, url):
    async with session.get(url, allow_redirects=False) as resp:
        print(resp.status)
        # text() is a coroutine method, so it must be awaited
        return await resp.text()

async def main():
    url = 'https://xkcd.com'
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        # print(html)

# asyncio.run() creates the event loop, runs main() and closes the loop
asyncio.run(main())

Python 3.5 introduced async and await, both aimed at coroutines. Switching to the new syntax takes two substitutions:

replace @asyncio.coroutine with async;
replace yield from with await.
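The two substitutions can be seen side by side in a small sketch. slow_double is a made-up coroutine used only for illustration. The old form appears only in a comment, because the asyncio.coroutine decorator was removed in Python 3.11 and would no longer run there.

```python
import asyncio

# Old generator-based style (shown as a comment only):
#
#   @asyncio.coroutine
#   def slow_double(x):
#       yield from asyncio.sleep(0.01)
#       return x * 2


# New style: "async def" replaces the decorator, "await" replaces "yield from".
async def slow_double(x):
    await asyncio.sleep(0.01)
    return x * 2


result = asyncio.run(slow_double(21))
print(result)  # 42
```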

The module-level function request()

With request(), the code looks like this:

import asyncio

import aiohttp

async def main():
    url = 'https://xkcd.com'
    # request() opens a session for this single call and cleans it up afterwards
    async with aiohttp.request("GET", url) as resp:
        html = await resp.text(encoding="utf-8")
        print(html)

asyncio.run(main())

request() is just a thin wrapper around ClientSession. Roughly, it does the following:

creates a TCPConnector;
creates a ClientSession;
calls ClientSession._request().

With multiple URLs:

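import asyncio

import aiohttp

async def fetch_url(url):
    async with aiohttp.request("GET", url) as resp:
        html = await resp.text(encoding="utf-8")
        print(html)

# Create the loop explicitly and make it current, so that gather() can
# find it when it wraps the coroutines below.
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

tasks = [fetch_url('https://xkcd.com'), fetch_url('http://www.baidu.com')]
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()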

gather aggregates: it wraps several futures into a single future, which is needed because loop.run_until_complete accepts only one future.
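The aggregating behaviour can be tried without any network access. In this sketch fake_fetch is a hypothetical stand-in for a real request that just sleeps; the point is that gather returns one awaitable whose result is a list in argument order, regardless of which coroutine finishes first.

```python
import asyncio


async def fake_fetch(name, delay):
    # Stand-in for a network request: wait, then return a value.
    await asyncio.sleep(delay)
    return name


async def main():
    # 'b' finishes first, but results follow the order of the arguments.
    return await asyncio.gather(fake_fetch('a', 0.02), fake_fetch('b', 0.01))


results = asyncio.run(main())
print(results)  # ['a', 'b']
```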

Opening a new session for every request is wasteful; the usual practice is one session per application.
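One possible shape for "one session per application" is sketched below. It assumes aiohttp 3.x; fetch_status and the 10-second timeout are illustrative choices, not taken from the crawler code. A single ClientSession, and so a single connection pool, is shared by all requests.

```python
import asyncio

import aiohttp


async def fetch_status(session, url):
    # Reuse the caller's session instead of opening one per request.
    async with session.get(url) as resp:
        return resp.status


async def main(urls):
    timeout = aiohttp.ClientTimeout(total=10)
    # One session for the whole run.
    async with aiohttp.ClientSession(timeout=timeout) as session:
        # return_exceptions=True returns errors as values instead of
        # raising the first one, so every URL gets an entry in the result.
        return await asyncio.gather(
            *(fetch_status(session, url) for url in urls),
            return_exceptions=True,
        )


urls = ['https://xkcd.com', 'http://www.baidu.com']
results = asyncio.run(main(urls))
for url, result in zip(urls, results):
    print(url, result)
```

Each entry in results is either an HTTP status code or the exception raised for that URL, so an unreachable site does not hide the results for the others.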