Python standard library: urllib, for handling URL requests

半斗烟草

Last modified: 2022-02-07 20:21:09


Category: Python crawlers. Tags: crawler

First published: 2021-09-21 22:37:32

Article link: https://blog.csdn.net/qq_40494873/article/details/120406457


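Python 3: the urllib library

Lately my grasp of urllib has been fuzzy, and I couldn't stand it! I had quickly laid out the whole structure of urllib in a mind map, but it looked terrible once imported into CSDN, and for someone who has been tortured by wikis, that was unbearable! So I decided to write it all out again from scratch…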

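I. What is urllib?

urllib is part of the Python standard library and is mainly used for network requests. A typical use case is a Python crawler that fetches web page content.
Python 3 merged Python 2's urllib and urllib2 into a single package, so just learn the Python 3 version and don't dwell on the past!
urllib contains four modules: request (network requests), parse (URL parsing, building, joining, and encoding), error (exceptions raised by request), and robotparser (handling the Robots protocol for crawlers).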

II. urllib source structure diagram

[Figure: diagram of the urllib source structure]

III. The four urllib modules

1. The request module (network requests)

Basic usage
response = urllib.request.urlopen(url, data=None, [timeout,] *, cafile=None, capath=None, cadefault=False, context=None)

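Parameters:
url: the address of the page, either a string or a Request object
data: defaults to None, which means a GET request; if data is passed (it must be bytes), the request becomes a POST
timeout: request timeout, in seconds
cafile and capath: a CA certificate file and a directory of CA certificates, used for HTTPS requests
context: an ssl.SSLContext instance that specifies the SSL settings
cadefault: defaults to False; deprecated and ignored
Note: cafile, capath, and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.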
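
A minimal sketch of the basic call, assuming a throwaway local server so that it runs without internet access. HelloHandler, its response text, and the OS-chosen port are all made up for illustration; against a real site only the urlopen lines would be needed.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = "hello urllib".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port; the server runs in a background thread.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# data is omitted, so this is a GET request; timeout is given in seconds.
with request.urlopen(url, timeout=5) as response:
    status_code = response.getcode()
    body_text = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(status_code, body_text)
```

Using the response as a context manager closes the connection once the body has been read.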


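Advanced usage
Step 1: myRequest = urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
Step 2: response = urllib.request.urlopen(myRequest)

headers: a request sent through urllib carries a default header, "User-Agent": "Python-urllib/3.6" (the number follows the Python version), which reveals that the request came from urllib. For sites that check the User-Agent, we need to set custom headers to disguise the request as a browser.
headers = {
    # pose as a browser (this particular string is an old Internet Explorer User-Agent)
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Host": "httpbin.org",
}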
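
The two steps can be sketched as follows. Only step 1 is executed here, so nothing touches the network; the URL path /get is an assumption added for illustration, and the header values are the ones shown above.

```python
from urllib import request

# Hypothetical target; the header values are only for illustration.
url = "http://httpbin.org/get"
headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Host": "httpbin.org",
}

# Step 1: build the Request; nothing is sent over the network yet.
my_request = request.Request(url, headers=headers)

# Request stores header names capitalized ("User-agent"), so look them up that way.
user_agent = my_request.get_header("User-agent")
method_without_data = my_request.get_method()

# Supplying data (bytes) switches the default method from GET to POST.
post_request = request.Request(url, data=b"kw=hi", headers=headers)
method_with_data = post_request.get_method()

print(user_agent, method_without_data, method_with_data)

# Step 2 (needs network access, so it is left commented out here):
# response = request.urlopen(my_request)
```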

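Handling the result returned by urlopen

Read the whole page as bytes into a variable: context_bytes = response.read()
Read the whole page as a list of lines, each one bytes: context_list_bytes = response.readlines()
Read a single line as bytes: context_bytes = response.readline()
Get the status code: status_code = response.getcode(); for example 200 (success). Error codes such as 404 (URL not found) or 503 (service unavailable) do not come back as a normal response; urlopen raises HTTPError for them
Get the response headers as a list of (name, value) tuples: headers = response.getheaders()
Get the URL that was actually retrieved: url = response.geturl()
Get the header information as a message object: response_info = response.info()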

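Handling the result of read

Decode the bytes to a string: context_str = context_bytes.decode() (UTF-8 by default)
Save to a local file: open the file in binary mode and write the bytes, e.g. with open(path, 'wb') as file: file.write(context_bytes)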
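
The calls listed above can be tried together in one place. This is a sketch that assumes a small local test server (PageHandler and its two-line body are invented for the example) and a temporary directory for the output file.

```python
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves a two-line UTF-8 page."""

    def do_GET(self):
        body = "line one\nline two\n".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/page"

with request.urlopen(url, timeout=5) as response:
    status_code = response.getcode()
    final_url = response.geturl()
    header_pairs = response.getheaders()     # list of (name, value) tuples
    content_type = response.info()["Content-Type"]
    first_line = response.readline()         # one line, as bytes
    remaining_lines = response.readlines()   # the rest, as a list of bytes

server.shutdown()
server.server_close()

context_bytes = first_line + b"".join(remaining_lines)
context_str = context_bytes.decode("utf-8")

# Save the raw bytes; binary mode ("wb") avoids any re-encoding.
output_file = os.path.join(tempfile.mkdtemp(), "response.txt")
with open(output_file, "wb") as f:
    f.write(context_bytes)

print(status_code, final_url, content_type)
```

Recent Python versions also expose the same information as the attributes response.status, response.url, and response.headers.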

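2. The parse module (URL parsing, building, joining, and encoding)

URL parsing

result = urllib.parse.urlparse(url)

Splits the URL into 6 parts and returns a named tuple of 6 strings: scheme, netloc, path, params, query, fragment.
query_list = urllib.parse.parse_qsl(result.query)
Parses the query component and returns the query parameters as a list of (key, value) pairs.

result = urllib.parse.urlsplit(url)

Splits the URL into 5 parts and returns a named tuple of strings: scheme, netloc, path, query, fragment. Unlike urlparse, it does not split params out of the path.

query_dict = urllib.parse.parse_qs(result.query)

Parses the query component further and returns a dictionary whose values are lists.

Extension:
[URL components] scheme (protocol), netloc (network location: host and optional port), path, params, query, fragment, plus URL encoding.
[General URL format] <scheme>://<user>:<password>@<host or IP address>:<port>/<path>;<params>?<query>#<fragment>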
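
A short sketch of the parsing functions side by side. The URL is a made-up one, chosen so that all six components are present and one query key repeats.

```python
from urllib import parse

# Hypothetical URL that exercises all six components.
url = "http://www.baidu.com/index.html;user?a=6&b=7&b=8#comment"

parsed = parse.urlparse(url)   # 6 fields, params split out of the path
split = parse.urlsplit(url)    # 5 fields, params stay inside the path

query_dict = parse.parse_qs(parsed.query)    # dict, every value is a list
query_list = parse.parse_qsl(parsed.query)   # list of (key, value) pairs

print(parsed)
print(split)
print(query_dict)
print(query_list)
```

Note that parse_qs and parse_qsl take the query string (parsed.query), not the whole result tuple.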

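Building URLs

urllib.parse.urlunparse(url_dict)

url_dict must supply all six components (scheme, netloc, path, params, query, fragment) in that order; a component that does not exist must still be given, as an empty string. Despite its name, url_dict is a list or tuple here, not a dict.
Example: url_dict = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
Result: http://www.baidu.com/index.html;user?a=6#comment

urllib.parse.urlunsplit(url_dict)

url_dict must supply all five components (scheme, netloc, path, query, fragment) in that order; a component that does not exist must still be given, as an empty string.
Example: url_dict = ['http', 'www.baidu.com', 'index.html;user', 'a=6', 'comment']
Result: http://www.baidu.com/index.html;user?a=6#comment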
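
The two examples above can be checked directly. The variable names and the third, "empty components" case are additions for illustration.

```python
from urllib import parse

six_parts = ["http", "www.baidu.com", "index.html", "user", "a=6", "comment"]
five_parts = ["http", "www.baidu.com", "index.html;user", "a=6", "comment"]

url_from_six = parse.urlunparse(six_parts)
url_from_five = parse.urlunsplit(five_parts)

# A component that does not exist is passed as an empty string.
url_without_extras = parse.urlunparse(
    ["http", "www.baidu.com", "index.html", "", "", ""]
)

# Round trip: parsing the built URL and un-parsing it gives the same string.
round_trip = parse.urlunparse(parse.urlparse(url_from_six))

print(url_from_six)
print(url_from_five)
print(url_without_extras)
```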

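Joining URLs

urllib.parse.urljoin(base_url, url, allow_fragments=True) joins two URLs. Components that url lacks (scheme, netloc, path) are taken from base_url; if url is already an absolute URL, it is returned essentially unchanged.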
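
A few illustrative cases of how urljoin resolves a second URL against a base. The base URL and the relative forms are chosen for this sketch and are not taken from the original post.

```python
from urllib.parse import urljoin

base_url = "http://www.baidu.com/a/b.html"

examples = {
    # relative file name: replaces the last path segment of the base
    "c.html": urljoin(base_url, "c.html"),
    # leading slash: replaces the whole path
    "/c.html": urljoin(base_url, "/c.html"),
    # parent-directory reference is resolved against the base path
    "../c.html": urljoin(base_url, "../c.html"),
    # query only: the base path is kept
    "?q=1": urljoin(base_url, "?q=1"),
    # absolute URL: the base is ignored
    "https://example.com/x": urljoin(base_url, "https://example.com/x"),
}

for relative, joined in examples.items():
    print(f"{relative!r:28} -> {joined}")
```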

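URL encoding and decoding

[Encoding] Format request parameters: urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

Purpose: converts a dict or a sequence of two-element tuples into a percent-encoded ASCII string such as a=1&b=2. The result is a str; call .encode() on it to get the bytes that the data parameter expects.

Parameters:
query: the query parameters; a dict or a sequence of two-element tuples
doseq: whether each element of a sequence value is converted into its own key=value pair
safe: characters that should not be encoded; passed on to quote_via
encoding: the encoding used to turn str into bytes before quoting
errors: how encoding errors are handled
quote_via: the quoting function; defaults to quote_plus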
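
A small sketch of how these parameters change the output. The keys and values are arbitrary sample data.

```python
from urllib import parse

params = {"page": 4, "astr": "李白"}

query_string = parse.urlencode(params)      # a str, not bytes
post_data = query_string.encode("utf-8")    # bytes, usable as urlopen's data

# A sequence of two-element tuples works too and may repeat a key.
pairs_encoded = parse.urlencode([("kw", "hi"), ("kw", "hello world")])

# doseq controls how a sequence value is handled.
as_one_value = parse.urlencode({"k": [1, 2]})
as_many_values = parse.urlencode({"k": [1, 2]}, doseq=True)

# quote_via swaps the quoting function: quote() gives %20 instead of +.
with_quote = parse.urlencode({"q": "a b"}, quote_via=parse.quote)

print(query_string)
print(pairs_encoded)
print(as_one_value, as_many_values, with_quote)
```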

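[Encoding] Encode a str or bytes for use in a URL: urllib.parse.quote() / urllib.parse.quote_plus()
urllib.parse.quote(string, safe='/', encoding=None, errors=None)

Parameters:
string: str or bytes. Letters, digits, and the characters _ . - ~ are never encoded; other characters are. Characters that cannot appear in a URL, such as Chinese, are converted to bytes and each byte is written as a percent sign followed by two hexadecimal digits, i.e. the %xx form.
safe: additional characters that should not be encoded; the default is the slash (/).
encoding, errors: how non-ASCII characters are handled, as accepted by str.encode().

The two functions differ in how they encode certain special characters:
quote() does not encode the slash; a space ' ' becomes '%20'
quote_plus() encodes the slash as '%2F'; a space ' ' becomes '+'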
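
The difference is easiest to see on one sample string. The string below is invented to contain a space, a slash, a comma, and the characters that are never encoded; the expected behaviour assumes Python 3.7 or later, where ~ counts as unreserved.

```python
from urllib import parse

text = "a b/c_d.e-f~g,h"

quoted = parse.quote(text)                    # "/" kept, space -> %20
quoted_plus = parse.quote_plus(text)          # "/" -> %2F, space -> +
quoted_no_safe = parse.quote(text, safe="")   # empty safe: "/" is encoded too

# Non-ASCII text is encoded as UTF-8 first, then percent-escaped byte by byte.
chinese = parse.quote("李白")

print(quoted)
print(quoted_plus)
print(quoted_no_safe)
print(chinese)
```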

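[Decoding] Decode a string from a URL: unquote() / unquote_plus()

result = unquote(string, encoding='utf-8', errors='replace'): does not decode the plus sign; the percent-encoded bytes are treated as UTF-8 by default.
result = unquote_plus(string, encoding='utf-8', errors='replace'): decodes the plus sign as a space; UTF-8 by default as well.
Note that encoding here refers to the encoding of the percent-encoded content in the string argument, not the terminal's encoding.

[Encoding/decoding] Working with bytes: quote_from_bytes() / unquote_to_bytes()

quote_from_bytes(): like quote(), but it accepts only bytes and does not perform the str-to-bytes encoding step. A str must first be converted to bytes, in one of two ways: str.encode(encoding) or bytes(str, encoding).
unquote_to_bytes(string): like unquote(), but it accepts a str or bytes and returns a bytes object.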
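
A brief sketch of the decoding side, including the role of the encoding argument. The sample strings are arbitrary, and GBK is used only as an example of a non-default encoding.

```python
from urllib import parse

decoded = parse.unquote("%E6%9D%8E%E7%99%BD")          # UTF-8 by default
plus_kept = parse.unquote("hello+world%21")            # "+" stays a plus sign
plus_as_space = parse.unquote_plus("hello+world%21")   # "+" becomes a space

# encoding names the charset of the percent-encoded bytes, here GBK.
gbk_quoted = parse.quote("李白", encoding="gbk")
gbk_decoded = parse.unquote(gbk_quoted, encoding="gbk")

# The bytes variants skip the str <-> bytes conversion step.
raw = "李白".encode("utf-8")
from_bytes = parse.quote_from_bytes(raw)
to_bytes = parse.unquote_to_bytes(from_bytes)

print(decoded, plus_kept, plus_as_space)
print(gbk_quoted, gbk_decoded)
print(from_bytes, to_bytes)
```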

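3. The error module (exceptions raised by request)

URLError: wraps errors that are generally caused by the network, including a bad URL; for example, no network connection, or a network connection that for some reason fails to reach the server.
HTTPError: wraps the case where the server returned an error status code, such as 404 or 403 (codes of 400 and above).
Summary: URLError is a subclass of OSError, and HTTPError is a subclass of URLError, so when catching them, the except clause for HTTPError must come before the one for URLError.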
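
The catch order can be sketched as below. To stay offline, the example assumes a local server that always answers 404 (NotFoundHandler is invented for this purpose) and a local port with nothing listening on it; fetch_status is a hypothetical helper, not part of urllib.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_status(url):
    """Return a short description of how the request ended."""
    try:
        with request.urlopen(url, timeout=5) as response:
            return f"ok {response.getcode()}"
    except error.HTTPError as e:   # subclass first: the server sent an error code
        return f"HTTPError {e.code}"
    except error.URLError as e:    # then the general case: no usable reply at all
        return f"URLError {type(e.reason).__name__}"


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
http_error_result = fetch_status(f"http://127.0.0.1:{server.server_port}/missing")
server.shutdown()
server.server_close()

# Find a port with nothing listening on it by binding a socket and closing it.
with socket.socket() as probe:
    probe.bind(("127.0.0.1", 0))
    closed_port = probe.getsockname()[1]
url_error_result = fetch_status(f"http://127.0.0.1:{closed_port}/")

print(http_error_result)
print(url_error_result)
```

If the two except clauses were swapped, the URLError clause would also catch every HTTPError and the status code branch would never run.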

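4. The robotparser module (handling the Robots protocol)

The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, placed in the root directory of the website. When a search crawler visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the crawler fetches pages within the range defined there; if not, the crawler visits every page that is directly accessible.

Here is a sample robots.txt:

User-agent: *
Disallow: /
Allow: /public/

This allows all search crawlers to fetch only the public directory. Save the content above as robots.txt and place it in the site's root directory, alongside the site's entry files (such as index.php, index.html, and index.jsp).

User-agent names the crawler the rules apply to; setting it to * means the rules apply to every crawler. For example, User-agent: Baiduspider means the rules apply to Baidu's crawler. With several User-agent records, several crawlers are subject to the restrictions; at least one record is required.

Disallow specifies the directories that must not be fetched; setting it to / as in the example above means no page may be fetched.

Allow is generally used together with Disallow rather than alone, to carve out exceptions to a restriction. Set to /public/ here, it means that no page may be fetched except those in the public directory.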

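The RobotFileParser object: rp = RobotFileParser(url='')

rp.set_url(url): sets the URL of the robots.txt file.
rp.read(): fetches the robots.txt file and analyzes it. It returns nothing, but it performs the fetch; robots.txt must be loaded this way (or fed in through parse()), otherwise every subsequent check returns False.
rp.parse(lines): parses the lines of a robots.txt file.
rp.can_fetch(useragent, url): the first argument is the user agent and the second is the URL to fetch; returns whether that crawler may fetch the URL.
rp.mtime(): returns the time robots.txt was last fetched and analyzed.
rp.modified(): sets the time of the last fetch and analysis to the current time.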
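
A minimal offline sketch of these methods, feeding the rules in through parse() instead of set_url() plus read(). The example.com URLs are placeholders. One assumption worth flagging: as far as I understand Python's implementation, rules within a User-agent group are checked in order and the first match wins, so the Allow line is placed before Disallow here, unlike in the sample file above.

```python
from urllib.robotparser import RobotFileParser

# Allow is listed before Disallow on purpose: Python's parser is understood
# to apply the first matching rule within a User-agent group.
robots_lines = [
    "User-agent: *",
    "Allow: /public/",
    "Disallow: /",
]

rp = RobotFileParser()

# Nothing has been read or parsed yet, so every check is answered with False.
before_parse = rp.can_fetch("*", "http://example.com/public/a.html")

rp.parse(robots_lines)  # offline alternative to set_url() + read()
public_ok = rp.can_fetch("*", "http://example.com/public/a.html")
private_ok = rp.can_fetch("Baiduspider", "http://example.com/private/a.html")
last_checked = rp.mtime()  # timestamp recorded when parse() ran

print(before_parse, public_ok, private_ok)
```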

Comprehensive example

Code example:

# -*- coding:utf-8 -*-
import os
from urllib import request, parse, error
import fake_useragent

current_workspace = os.path.dirname(os.path.abspath(__file__))
print(current_workspace)

output_path = os.path.join(current_workspace, 'output')
os.makedirs(output_path, exist_ok=True)  # make sure the output directory exists
print(output_path)

fake_useragent_json_path = os.path.join(current_workspace, 'config/fake_useragent_0.1.11.json')
print(fake_useragent_json_path)


# Browser User-Agent module
def fake_useragent_demo():
    # fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
    # The library fetches online resources and the request can time out because of caching, so use a local JSON file
    ua = fake_useragent.UserAgent(path=fake_useragent_json_path)
    print(ua.ie)  # a random Internet Explorer User-Agent
    print(ua.google)  # a random Google browser User-Agent
    print(ua.chrome)  # a random Chrome User-Agent
    print(ua.firefox)  # a random Firefox User-Agent
    print(ua.random)  # a random User-Agent from any vendor

# The simplest page download
def simple_urllib_demo():
    try:
        # default header: "User-Agent": "Python-urllib/3.6"
        response_obj = request.urlopen("https://so.gushiwen.cn/mingjus/", timeout=1)
        print(response_obj)  # an instance of <class 'http.client.HTTPResponse'>
        if response_obj.getcode() == 200:
            response_binary = response_obj.read()  # returns bytes; readlines() and readline() also work
            with open(os.path.join(output_path, 'response.txt'), 'w', encoding='utf-8') as f:
                f.write(response_binary.decode())
            print("get response page successful!")

            response_headers_list = response_obj.getheaders()  # returns a list: [( ),( ),( )...]
            print(str(response_headers_list))
            response_info_list = response_obj.info()  # returns <class 'http.client.HTTPMessage'>
            print(str(response_info_list))
        else:
            print("get failed!")

    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)

# GET request with parameters
# Page to download:
# https://so.gushiwen.cn/mingjus/default.aspx?page=1&tstr=春天&astr=李白&cstr=唐代&xstr=诗文
#   tstr=边塞  theme
#   astr=李白  author
#   cstr=唐代  dynasty
#   xstr=诗文  form
def get_with_params_demo():
    try:
        url_base = 'https://so.gushiwen.cn/mingjus/default.aspx?'
        params = {
            'page': 4,
            'tstr': '边塞',
            'astr': '李白',
            'cstr': '唐代',
            'xstr': '诗文'
        }
        params_encode = parse.urlencode(params)  # percent-encoded ASCII string
        url = url_base + params_encode  # the complete URL

        # Set the UA (browser User-Agent) to mimic a browser and avoid being blocked by the site
        # List of common browser UAs: https://www.cnblogs.com/zhenning-li/p/11429831.html
        # Setting the UA dynamically in Python (pip install fake-useragent): https://www.cnblogs.com/shaosks/p/10183919.html
        ua = fake_useragent.UserAgent(path=fake_useragent_json_path)
        headers = {'User-agent': ua.random}  # random UA

        request_obj = request.Request(url, headers=headers)  # disguise Python as a browser
        response_obj = request.urlopen(request_obj)

        if response_obj.getcode() == 200:
            response_binary = response_obj.read()  # returns bytes; readlines() and readline() also work
            with open(os.path.join(output_path, 'response.txt'), 'w', encoding='utf-8') as f:
                f.write(response_binary.decode())
            print("get response page successful!")

            response_url = response_obj.geturl()
            print(response_url)
            response_headers_list = response_obj.getheaders()  # returns a list: [( ),( ),( )...]
            print(str(response_headers_list))
            response_info_list = response_obj.info()  # returns <class 'http.client.HTTPMessage'>
            print(str(response_info_list))
        else:
            print("get failed!")

    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)

# POST request
# Translation sites typically accept POST requests
# A blog post on analyzing POST requests: https://blog.csdn.net/weixin_45228198/article/details/116169634

def post_with_params_demo():
    url = 'https://fanyi.baidu.com/sug'

    ua = fake_useragent.UserAgent(path=fake_useragent_json_path)
    headers_res = {'User-agent': ua.random}  # random UA

    data_res = {
        'kw': 'hi'
    }
    data_byte = parse.urlencode(data_res).encode('utf-8')  # data must be converted to bytes

    request_obj = request.Request(url, data=data_byte, headers=headers_res)
    response_obj = request.urlopen(request_obj)
    print(response_obj.read())  # sequences starting with \u are Unicode escapes
    print(response_obj.geturl())

if __name__ == '__main__':
    # the simplest urllib page download
    simple_urllib_demo()

    # GET request with parameters
    get_with_params_demo()

    # POST request
    post_with_params_demo()
