Mastering the urllib.parse Module and Its Common Functions

是垚不si壵

Categories: Python, Web Scraping. Tags: python, web scraping

Article link: https://blog.csdn.net/ktsmeb/article/details/119271529

1. Overview

Python's urllib.parse module provides functions for splitting a URL into its components and for building a URL back up from components.
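As a quick illustration of the "building" side, here is a minimal sketch using urlunparse and urljoin, both of which are part of urllib.parse. The example.com host and the paths are placeholders chosen for illustration, not taken from the article.

```python
from urllib.parse import urlunparse, urljoin

# Build a URL from its six components:
# (scheme, netloc, path, params, query, fragment)
built = urlunparse(("https", "example.com", "/s", "", "wd=python", ""))
print(built)

# Resolve a relative link against the page it was found on,
# which is a common step when following links in a crawler.
joined = urljoin("https://example.com/a/b.html", "c.html")
print(joined)
```

urljoin replaces the last path segment of the base URL with the relative link, the same way a browser resolves links on a page.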

2. Common Functions

urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

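urlstring: the URL to parse (a string).
scheme: the default scheme, used only when the URL itself does not specify one.
allow_fragments: when True (the default), the fragment identifier after '#' is split off into the fragment field of the result. When False, fragment identifiers are not recognized; they are parsed as part of the path, params, or query component instead, and fragment is set to the empty string in the return value.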
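The effect of the scheme and allow_fragments parameters is easiest to see side by side. The following is a small sketch; example.com is a placeholder host used only for illustration.

```python
from urllib.parse import urlparse

url = "http://example.com/path#section"

# Default behaviour: the fragment is split off.
with_fragment = urlparse(url)
print(with_fragment.path, with_fragment.fragment)

# allow_fragments=False: '#section' stays attached to the path.
without_fragment = urlparse(url, allow_fragments=False)
print(without_fragment.path, repr(without_fragment.fragment))

# The scheme argument is only a fallback for URLs that lack a scheme.
relative = urlparse("//example.com/path", scheme="https")
explicit = urlparse("http://example.com/path", scheme="https")
print(relative.scheme, explicit.scheme)
```

Note that a URL's own scheme always wins over the scheme argument, which is why the last line prints https for the first URL and http for the second.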

quote

The characters allowed in a URL are:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789
-_.~!*'();:@&=+$,/?#[]

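So when a URL contains any other character (Chinese characters, for example), that character has to be percent-encoded.
An online URL encoder/decoder is a quick way to see what an encoded URL looks like: type "图片" into the input box and click Encode to get the percent-encoded result.

The same encoding can be done in code with the quote function, and the result matches the online tool: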

import urllib.parse

text = "图片"
print(urllib.parse.quote(text))  # prints %E5%9B%BE%E7%89%87

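There is also urllib.parse.quote_plus(). The two functions differ in two ways. First, quote leaves '/' unencoded by default (its safe parameter defaults to '/'), while quote_plus encodes '/' as %2F. Second, quote_plus replaces spaces with '+', whereas quote encodes them as %20. Trying both on the same string makes the difference obvious.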
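A short comparison of the two functions on the same input, as a sketch. The sample string "a b/c" is made up for illustration; unquote is the standard inverse of quote.

```python
from urllib.parse import quote, quote_plus, unquote

sample = "a b/c"

quoted = quote(sample)            # '/' kept, space becomes %20
quoted_plus = quote_plus(sample)  # '/' becomes %2F, space becomes '+'
print(quoted)
print(quoted_plus)

# Passing safe='' makes quote encode '/' as well.
quoted_strict = quote(sample, safe="")
print(quoted_strict)

# unquote reverses the percent-encoding.
round_trip = unquote(quote("图片"))
print(round_trip)
```

quote is usually the right choice for path segments, while quote_plus matches the form-style encoding used in query strings.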

urlparse

urlparse() parses a URL into a ParseResult object, a named tuple with six elements:

Scheme (scheme)
Network location (netloc)
Path (path)
Path parameters (params)
Query string (query)
Fragment (fragment)

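The object also exposes the user name, password, host name, and port as attributes; each is None when the URL does not contain it:

result.username, result.password, result.hostname, result.port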

Complete code

from urllib.parse import urlparse

url='https://home.firefoxchina.cn/'

parsed_result=urlparse(url)

print('parsed_result contains', len(parsed_result), 'elements')
print(parsed_result)

print('scheme  :', parsed_result.scheme)
print('netloc  :', parsed_result.netloc)
print('path    :', parsed_result.path)
print('params  :', parsed_result.params)
print('query   :', parsed_result.query)
print('fragment:', parsed_result.fragment)
print('username:', parsed_result.username)
print('password:', parsed_result.password)
print('hostname:', parsed_result.hostname)
print('port    :', parsed_result.port)
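The URL in the listing above has no credentials, port, query, or fragment, so most fields come back empty or None. For comparison, here is a sketch with a URL that fills every field. The URL, the credentials, and the query parameters are invented for illustration; parse_qs is the standard helper for turning the query string into a dictionary.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL that exercises all six components.
url = ("https://user:secret@www.example.com:8443"
       "/docs/index.html;v=1?q=python&page=2#intro")

result = urlparse(url)
for field in ("scheme", "netloc", "path", "params", "query", "fragment"):
    print(f"{field:8}:", getattr(result, field))

# Attributes derived from netloc.
print("username:", result.username)
print("password:", result.password)
print("hostname:", result.hostname)
print("port    :", result.port)

# Turn the raw query string into a dict of lists.
query_params = parse_qs(result.query)
print(query_params)
```

parse_qs returns a list for each key because a query string may repeat the same parameter several times.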

3. Example

Combined with urlopen, code like the following may come up when writing a crawler:

# encoding : utf-8
"""
@author: LY
@contact: 13904442175@163.com
@software: PyCharm
@file: url_parse_code.py
"""
import urllib.request
import urllib.parse
import string


def get_method_params():
    url = "http://www.baidu.com/s?wd="
    # Append the search keyword (Chinese characters) to the URL.
    # After encoding, the URL should look like:
    # http://www.baidu.com/s?wd=%E5%9B%BE%E7%89%87
    name = "图片"
    final_url = url + name
    print(final_url)

    # Passing final_url to urlopen directly fails with
    # UnicodeEncodeError: 'ascii' codec can't encode
    # characters in position 10-11: ordinal not in range(128)
    # because the HTTP request line is encoded as ASCII,
    # and ASCII has no Chinese characters.
    # safe=string.printable leaves every printable ASCII character as-is,
    # so only the non-ASCII characters are percent-encoded.
    encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
    print(encode_new_url)

    # Send the request.
    response = urllib.request.urlopen(encode_new_url)
    print(response)
    # Read the body and decode it (UTF-8 by default).
    data = response.read().decode()
    print(data)
    # Save the page locally.
    with open("parse_图片.html", "w", encoding="utf-8") as f:
        f.write(data)


get_method_params()
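An alternative to quoting the whole URL with safe=string.printable is to encode only the query parameters with urlencode, which is also part of urllib.parse. The sketch below is not the author's method; build_search_url is a hypothetical helper name, and the extra pn parameter is an assumption added only to show how several parameters are joined. It builds the URL without sending any request.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Return base_url with params appended as an encoded query string."""
    # urlencode percent-encodes each key and value and joins them with '&'.
    return f"{base_url}?{urlencode(params)}"


search_url = build_search_url("http://www.baidu.com/s", {"wd": "图片"})
print(search_url)

# Several parameters at once; 'pn' is a made-up parameter for illustration.
paged_url = build_search_url("http://www.baidu.com/s", {"wd": "图片", "pn": 10})
print(paged_url)
```

Because only the parameter values are encoded, characters such as '&' or '=' inside a keyword cannot be mistaken for query-string delimiters, which can happen when the whole URL is quoted with a broad safe set.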