Using urllib's parse module (4 important methods)

灵剑山真人

Article link: https://blog.csdn.net/weixin_45850939/article/details/104758233

parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.

I: urlparse/urlunparse

Parsing a URL and building a URL.

1:

from urllib.parse import urlparse

result=urlparse('https://prodev.jd.com/mall/active/4G5xap7fUEzJkqP4ZqmpEc7xtV7v/index.html')
print(type(result),result,sep='\n')

<class 'urllib.parse.ParseResult'> 
ParseResult(scheme='https', netloc='prodev.jd.com', path='/mall/active/4G5xap7fUEzJkqP4ZqmpEc7xtV7v/index.html', params='', query='', fragment='')

urlparse splits a URL into its components.

scheme://netloc/path;params?query#fragment
# Most URLs follow this pattern, and urlparse() splits them into these six parts.
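
As a quick check of the pattern above, the sketch below parses a URL in which all six components are non-empty. The URL is made up for illustration and does not come from the article.

```python
from urllib.parse import urlparse

# Hypothetical URL chosen so that every component is present.
url = 'https://www.example.com/index.html;user?id=5#comment'
parts = urlparse(url)

print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /index.html
print(parts.params)    # user
print(parts.query)     # id=5
print(parts.fragment)  # comment
```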

2:

The values of a ParseResult can be read either by index or by attribute name (it is actually a named tuple).

print(result[0],result.scheme,result[1],result.netloc,sep='\n')
https
https
prodev.jd.com
prodev.jd.com
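
Because the result is a named tuple, it can also be unpacked like an ordinary tuple, and the standard named-tuple method `_replace` returns a modified copy. A minimal sketch, reusing the same URL; the variable name `http_result` is only illustrative:

```python
from urllib.parse import urlparse

result = urlparse('https://prodev.jd.com/mall/active/4G5xap7fUEzJkqP4ZqmpEc7xtV7v/index.html')

# ParseResult is a tuple subclass, so len() and unpacking both work.
print(isinstance(result, tuple), len(result))   # True 6
scheme, netloc, path, params, query, fragment = result

# _replace() returns a new ParseResult; the original is left unchanged.
http_result = result._replace(scheme='http')
print(http_result.geturl())
```

`geturl()` reassembles the components into a string, so this is a convenient way to change one part of a URL without manual string slicing.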

3:

from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

http://www.baidu.com/index.html;user?a=6#comment

urlunparse combines the parts into a URL. Its argument must be an iterable, such as a list or a tuple, and it must contain exactly 6 items.
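
One way to see that urlunparse reverses urlparse is a round trip. The sketch below reuses the URL built above; the second call uses an illustrative set of components (the `/s` path and `wd=python` query are assumptions, not from the article) to show that a tuple with empty strings for unused parts also works.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=6#comment'
parts = urlparse(url)

# A ParseResult is itself a 6-item iterable, so it can be passed straight back.
rebuilt = urlunparse(parts)
print(rebuilt == url)   # True

# Unused components are given as empty strings; a tuple works like a list.
short = urlunparse(('https', 'www.baidu.com', '/s', '', 'wd=python', ''))
print(short)            # https://www.baidu.com/s?wd=python
```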

II: urlsplit/urlunsplit

Like urlparse/urlunparse, these are used to parse and build URLs.

1:

from urllib.parse import urlsplit
result=urlsplit('https://prodev.jd.com/mall/active/4G5xap7fUEzJkqP4ZqmpEc7xtV7v/index.html')
print(result)

SplitResult(scheme='https', netloc='prodev.jd.com', path='/mall/active/4G5xap7fUEzJkqP4ZqmpEc7xtV7v/index.html', query='', fragment='')

urlsplit is very similar to urlparse(), except that it does not parse params separately. It returns only 5 items, and any params stay merged into path.
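
The difference only shows up when the URL actually contains a `;params` part. A small comparison, assuming the same Baidu URL used in the urlunparse example:

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.baidu.com/index.html;user?a=6#comment'

parsed = urlparse(url)
split = urlsplit(url)

print(parsed.path, parsed.params)  # /index.html user
print(split.path)                  # /index.html;user
print(len(parsed), len(split))     # 6 5
```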

2:

As with urlparse, the values can be accessed by index or by attribute.
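
A short sketch of both access styles on a SplitResult, using the same URL as above. Note that the indices shift compared with ParseResult, because there is no params slot.

```python
from urllib.parse import urlsplit

result = urlsplit('https://prodev.jd.com/mall/active/4G5xap7fUEzJkqP4ZqmpEc7xtV7v/index.html')

# Index 0 is scheme and index 1 is netloc, just as with urlparse.
print(result[0], result.scheme, result[1], result.netloc, sep='\n')

# With only five items, index 3 is query and index 4 is fragment.
print(result[3] == result.query, result[4] == result.fragment)   # True True
```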

3:

from urllib.parse import urlunsplit
data=['http','www.baidu.com','index.html','user','a=6',]
print(urlunsplit(data))

http://www.baidu.com/index.html?user#a=6

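urlunsplit combines the parts into a URL. Its argument must be an iterable, such as a list or a tuple, and it must contain exactly 5 items, in the order scheme, netloc, path, query, fragment. In the example above, 'user' is therefore treated as the query and 'a=6' as the fragment.

This mirrors urlsplit, which parses a URL into 5 parts. Passing six items raises an error: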

ValueError: too many values to unpack (expected 6)
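
The failure can be reproduced with a try/except, as in the sketch below. The six-item list is the one from the urlunparse example; the exact wording of the message comes from the library's internal unpacking and may differ between Python versions, so the code only relies on the exception type.

```python
from urllib.parse import urlunsplit

# Six items: one more than urlunsplit accepts.
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']

error = None
try:
    urlunsplit(data)
except ValueError as exc:
    error = exc
    print('ValueError:', exc)
```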

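III: urljoin

To use this function, supply a base URL and a new URL. If the new URL lacks a scheme, netloc, or path, urljoin fills in the missing part from the base URL and returns the result. When the new URL has its own path, the params, query, and fragment of the base URL are not carried over.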

from urllib.parse import urljoin
base_url='https://blog.csdn.net/'
new_url='weixin_45850939?orderby=UpdateTime'
print(urljoin(base_url,new_url))

https://blog.csdn.net/weixin_45850939?orderby=UpdateTime

from urllib.parse import urljoin
base_url='https://courses.gdut.edu.cn/lingxiaoyun/view.php?id=353'
new_url='/course/view.php?id=353'
print(urljoin(base_url,new_url))

https://courses.gdut.edu.cn/course/view.php?id=353
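
A few more cases help show which parts urljoin takes from the base. This is a sketch reusing the base URL above; the relative file name `index.php` is hypothetical.

```python
from urllib.parse import urljoin

base = 'https://courses.gdut.edu.cn/lingxiaoyun/view.php?id=353'

# Relative path: replaces the last path segment of the base; the base query is dropped.
relative = urljoin(base, 'index.php')
print(relative)   # https://courses.gdut.edu.cn/lingxiaoyun/index.php

# Full URL: nothing is taken from the base.
absolute = urljoin(base, 'https://blog.csdn.net/weixin_45850939')
print(absolute)   # https://blog.csdn.net/weixin_45850939

# Scheme-relative link: only the scheme comes from the base.
scheme_only = urljoin(base, '//blog.csdn.net/weixin_45850939')
print(scheme_only)  # https://blog.csdn.net/weixin_45850939
```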

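IV: quote/unquote

These convert text to URL-encoded (percent-encoded) form, or decode it again. A typical use case:

When a URL carries Chinese parameters, it can sometimes end up garbled. quote converts the Chinese characters to URL encoding to avoid this.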

from urllib.parse import quote,unquote
key_word='爱因斯坦'
# quote() percent-encodes the UTF-8 bytes of the keyword;
# here it is simply appended to the end of the URL for demonstration
url='https://cn.bing.com/?mkt=zh-CN' + quote(key_word)
print(url)
# unquote() decodes the percent-encoded text back to the original characters
print(unquote(url))

https://cn.bing.com/?mkt=zh-CN%E7%88%B1%E5%9B%A0%E6%96%AF%E5%9D%A6
https://cn.bing.com/?mkt=zh-CN爱因斯坦
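
Two details of quote are worth a quick sketch: encoding and decoding round-trip cleanly, and the `safe` parameter (default `'/'`) controls which reserved characters are left unescaped. The sample string `'a/b c'` is only illustrative.

```python
from urllib.parse import quote, unquote

key_word = '爱因斯坦'
encoded = quote(key_word)
print(encoded)                       # %E7%88%B1%E5%9B%A0%E6%96%AF%E5%9D%A6
print(unquote(encoded) == key_word)  # True

# By default quote() leaves '/' unescaped; pass safe='' to escape it too.
default_safe = quote('a/b c')
no_safe = quote('a/b c', safe='')
print(default_safe)   # a/b%20c
print(no_safe)        # a%2Fb%20c
```

Using `safe=''` is usually the better fit when the text is a single query value rather than a path, since a literal `/` in the value is then escaped as well.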

Note: this article was written by 小云 and their assistant 崔庆才.