The urllib Library

大妮子噻

Tags: python

Article link: https://blog.csdn.net/bd_nini/article/details/105806564

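Definition

urllib is Python's built-in HTTP request library. It contains four modules:

urllib.request (the request module),

urllib.error (the exception handling module),

urllib.parse (the URL parsing module),

urllib.robotparser (the robots.txt parsing module, which determines which pages may be crawled).

Usage

1. urlopen()

urlopen(url, data, timeout): the key is knowing how to use these three parameters.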

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))

Output:
<class 'http.client.HTTPResponse'>

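Methods and attributes of the HTTPResponse type

The response object has a number of methods and attributes; calling them returns various pieces of information about the result.

Methods include read(), readinto(), getheader(name), getheaders(), and fileno().
Attributes include msg, version, status, reason, debuglevel, and closed.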
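
The attributes and methods listed above can be tried without touching the public internet. The following is only a sketch under stated assumptions: HelloHandler and fetch_summary are hypothetical names, the server is a throwaway one bound to 127.0.0.1 on a free port, and an opener with an empty ProxyHandler is used instead of urlopen() so that proxy environment variables are ignored. The response object returned is the same HTTPResponse type.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_summary():
    """Start a local server, request it once, and collect response details."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # An opener with an empty ProxyHandler ignores proxy environment variables
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return {
                "type": type(response).__name__,
                "status": response.status,
                "reason": response.reason,
                "content_type": response.getheader("Content-Type"),
                "body": response.read(),
            }
    finally:
        server.shutdown()
        server.server_close()


summary = fetch_summary()
print(summary)
```

status and reason come from the status line, getheader(name) looks up a single response header, and read() returns the body as bytes.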

POST requests

import urllib.parse
import urllib.request
data=bytes(urllib.parse.urlencode({'world':'hello'}),encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

Output:
b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "world": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "11", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7", \n    "X-Amzn-Trace-Id": "Root=1-5ea799c6-6a91f9108e25adf27f3a889c"\n  }, \n  "json": null, \n  "origin": "119.248.145.161", \n  "url": "http://httpbin.org/post"\n}\n'

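data is optional. If it is supplied, it must be of type bytes (a byte stream), which can be built with bytes().
The first argument to bytes() must be a string, so urlencode() from the urllib.parse module is used to turn the dictionary into a string; the second argument specifies the encoding, utf8.
When data is supplied, the request is sent as POST instead of GET.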
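
The encoding step can be checked on its own, without sending anything. This is a minimal sketch: the Request objects are only constructed so that get_method() can show which HTTP method urllib would use, and no network call is made.

```python
from urllib import parse, request

form = {'world': 'hello'}
encoded = parse.urlencode(form)          # a str: 'world=hello'
data = bytes(encoded, encoding='utf8')   # same as encoded.encode('utf8')

# Building a Request does not send it; it only describes the request
get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=data)

print(encoded, data)
print(get_req.get_method(), post_req.get_method())
```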

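Setting a timeout and handling the exception

import socket
import urllib.error
import urllib.request

try:
    # timeout is in seconds; 0.1 is deliberately too short for this request
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # e.reason holds the underlying exception object
    if isinstance(e.reason, socket.timeout):
        print('timeout')

Output:
timeout

If timeout is not specified, the global default timeout is used.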
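
The global default mentioned here is the one managed by the socket module. A small sketch of reading and changing it follows; the value 5.0 is an arbitrary choice for illustration, and the original setting is restored at the end.

```python
import socket

original = socket.getdefaulttimeout()   # None means "no timeout"
socket.setdefaulttimeout(5.0)           # applies to sockets created afterwards
current = socket.getdefaulttimeout()
socket.setdefaulttimeout(original)      # restore the previous setting
restored = socket.getdefaulttimeout()

print(original, current, restored)
```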

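2. Building a Request object

urlopen() can send the most basic requests, but its few parameters are not enough to build a complete request (for example, one with custom headers). The more powerful Request class can be used for that.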

import urllib.request
request=urllib.request.Request('http://www.baidu.com')
response=urllib.request.urlopen(request)
print(response.read())

from urllib import request, parse

url = 'https://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Mobile Safari/537.36',
    'Host': 'httpbin.org',
}
# The form data must be URL-encoded and converted to bytes
data = bytes(parse.urlencode({'world': 'hello'}), encoding='utf8')
req = request.Request(url=url, headers=headers, data=data, method='POST')
response = request.urlopen(req)
print(response.read())
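
A Request object can also be inspected before it is sent, which is a convenient way to check what was built. The sketch below assumes a made-up User-Agent value and adds it with add_header() rather than the headers argument; nothing is sent over the network.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
data = parse.urlencode({'world': 'hello'}).encode('utf8')

req = request.Request(url=url, data=data, method='POST')
# Headers can also be added after construction
req.add_header('User-Agent', 'Mozilla/5.0 (compatible; example)')

print(req.get_method())               # the HTTP method that will be used
print(req.full_url)                   # the URL passed to the constructor
print(req.get_header('User-agent'))   # header names are stored capitalized
print(req.data)                       # the request body as bytes
```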

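3. Advanced usage: Handlers

Handlers can be thought of as processors for specific tasks: some handle login authentication, some handle Cookies, and some handle proxy settings. With them, almost everything in an HTTP request can be controlled.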
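
As a rough illustration of how handlers are combined, the sketch below builds an opener with a cookie handler and a proxy handler. The proxy address is a made-up placeholder, and no request is sent; the code only shows which handlers the opener ends up with.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    # Stores cookies from responses and sends them back on later requests
    urllib.request.HTTPCookieProcessor(cookie_jar),
    # Hypothetical proxy address, used here only for illustration
    urllib.request.ProxyHandler({'http': 'http://127.0.0.1:9743'}),
)

handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Requests would then be sent with opener.open(url) instead of urlopen(url), so that every request passes through the chosen handlers.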

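4. Exception handling

The error module of urllib defines the exceptions raised by the request module. When something goes wrong, the request module raises one of the exceptions defined in the error module.

Python's exception handling mechanism (catching exceptions with try)

An exception can be caught as soon as it occurs and handled internally. This is done with the try statement: any exception raised inside the try block can be caught. The try statement has two forms: try-except and try-finally.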

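1-1. Format of the try-except statement

try:
    code to monitor
except Exception [as reason]:
    code that handles the exception

1-2. Multiple except clauses for different exceptions

try:
    code to monitor
except exception_1 as reason:
    handling code 1
except exception_2 as reason:
    handling code 2
except exception_3 as reason:
    handling code 3

1-3. Handling several exceptions the same way

try:
    code to monitor
except (exception_1, exception_2):
    code that handles the exception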
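
The templates above can be made concrete with a small runnable sketch. The function names safe_divide and safe_divide_unified are hypothetical and exist only to show one except clause per exception next to a single clause that handles a tuple of exceptions.

```python
def safe_divide(text, divisor):
    """Separate except clauses, one per exception type."""
    try:
        return int(text) / divisor
    except ValueError as reason:
        return 'bad number: %s' % reason
    except ZeroDivisionError as reason:
        return 'bad divisor: %s' % reason


def safe_divide_unified(text, divisor):
    """One except clause that handles several exception types the same way."""
    try:
        return int(text) / divisor
    except (ValueError, ZeroDivisionError):
        return None


print(safe_divide('10', 4))
print(safe_divide('ten', 4))
print(safe_divide('10', 0))
print(safe_divide_unified('ten', 0))
```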

URLError

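URLError comes from the error module of the urllib library. It inherits from OSError (the exception class for operating system errors) and is the base class of the error module, so any exception raised by the request module can be handled by catching this class.

It has a reason attribute, which returns the cause of the error.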


from urllib import request, error

try:
    response = request.urlopen('http://dalaogan.com/123')
except error.HTTPError as e:
    # HTTPError is the subclass, so it is caught first
    print(e.reason)
except error.URLError as e:
    print(e.reason)
else:
    print(response.read())

Output:
Not Found

HTTPError

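HTTPError is a subclass of URLError. It has three attributes:

code: returns the HTTP status code
reason: returns the cause of the error, as in the parent class
headers: returns the response headers

Because URLError is the parent class of HTTPError, the subclass can be caught first and the parent class afterwards.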
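
The catch order can be checked offline by constructing the exceptions by hand. This is only a sketch: classify is a hypothetical helper, the URL and header values are made up, and real code would receive these exceptions from urlopen() rather than build them.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def classify(exc):
    """Raise exc and report which except clause caught it."""
    try:
        raise exc
    except HTTPError as e:   # subclass first
        return ('HTTPError', e.code, e.reason)
    except URLError as e:    # parent class second
        return ('URLError', None, e.reason)


headers = Message()
headers['Content-Type'] = 'text/html'
# Hand-built exceptions, standing in for what urlopen() would raise
http_error = HTTPError('http://example.com/missing', 404, 'Not Found',
                       headers, io.BytesIO())
url_error = URLError('connection refused')

http_result = classify(http_error)
url_result = classify(url_error)
content_type = http_error.headers['Content-Type']
is_subclass = issubclass(HTTPError, URLError)
http_error.close()

print(http_result, url_result, content_type, is_subclass)
```

If the two except clauses were swapped, the URLError clause would catch HTTPError as well, and the status code would never be reported.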

The reason attribute does not always return a string; it may also be an object.

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://www.dalaogan.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Output:
<class 'socket.timeout'>
TIME OUT
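
The same check can be exercised without a network by wrapping the reason by hand. In this sketch reason_text is a hypothetical helper and both URLError objects are constructed manually. Note that from Python 3.10 onward socket.timeout is an alias of the built-in TimeoutError, so the printed type name may differ from the output above.

```python
import socket
from urllib.error import URLError


def reason_text(error):
    """Describe a URLError whose reason may be a string or an exception object."""
    if isinstance(error.reason, socket.timeout):
        return 'TIME OUT'
    return str(error.reason)


timeout_text = reason_text(URLError(socket.timeout('timed out')))
string_text = reason_text(URLError('unknown host'))
print(timeout_text, string_text)
```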

5. URL parsing

【1】urlparse(): identifying and splitting a URL

from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/')
print(type(result),result)

Output:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='www.baidu.com', path='/', params='', query='', fragment='')


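Standard format: scheme://netloc/path;params?query#fragment

scheme - protocol
netloc - domain name
path - access path
params - parameters
query - query conditions
fragment - anchor


ParseResult is actually a tuple; its values can be read either by attribute name or by index.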

from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)
print(result.scheme,result[0],result.netloc,result[1],sep='\n')

Output:
http
http
www.baidu.com
www.baidu.com
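
To see all six components at once, a URL that contains every part can be parsed. The URL below is made up for illustration; the sketch also shows that the result unpacks like an ordinary tuple.

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# A ParseResult unpacks like any 6-element tuple
scheme, netloc, path, params, query, fragment = result

print(result)
print(result.path, result[2])   # attribute access and index access agree
```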

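In addition to the basic usage above, urlparse() takes three parameters:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: required; the URL to parse.

scheme: the default protocol. If the URL does not contain a scheme, this value is used as the default. If the URL does contain one, the scheme is parsed from the URL and the scheme parameter has no effect.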

from urllib.parse import urlparse
result = urlparse('www.baidu.com',scheme='https')
print(result)

Output:
ParseResult(scheme='https', netloc='', path='www.baidu.com', params='', query='', fragment='')


from urllib.parse import urlparse
result = urlparse('http://www.baidu.com',scheme='https')
print(result)

Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='', params='', query='', fragment='')

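allow_fragments: whether to parse the fragment separately. If it is set to False, the fragment is not split out: it is parsed as part of path, params, or query, and the fragment field is left empty.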
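
A short sketch of the effect of allow_fragments=False, using made-up URLs. Where the '#comment' text ends up depends on which component comes last in the URL.

```python
from urllib.parse import urlparse

# With a query present, the fragment text becomes part of the query
with_query = urlparse('http://www.baidu.com/index.html;user?id=5#comment',
                      allow_fragments=False)

# Without params or query, it becomes part of the path
path_only = urlparse('http://www.baidu.com/index.html#comment',
                     allow_fragments=False)

print(with_query.query, repr(with_query.fragment))
print(path_only.path, repr(path_only.fragment))
```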

【2】urlunparse()

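Builds a URL from its components.

It accepts an iterable whose length must be exactly 6; otherwise an exception is raised for too few or too many values.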

from urllib.parse import urlunparse
data=['http','www.baidu.com','index','user','a=8','comment']
print(urlunparse(data))

Output:
http://www.baidu.com/index;user?a=8#comment
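
The length requirement can be observed directly. In this sketch build_url is a hypothetical wrapper that returns None when urlunparse() rejects the input; in CPython the wrong length surfaces as a ValueError from tuple unpacking.

```python
from urllib.parse import urlunparse


def build_url(parts):
    """Return the assembled URL, or None if parts does not have 6 elements."""
    try:
        return urlunparse(parts)
    except ValueError:
        return None


ok = build_url(['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment'])
too_short = build_url(['http', 'www.baidu.com'])
too_long = build_url(['http', 'www.baidu.com', 'index.html', 'user', 'a=6',
                      'comment', 'extra'])

print(ok, too_short, too_long)
```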

【3】urlsplit()

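Similar to urlparse(), except that it does not parse params separately (params stays part of path) and returns only 5 components.

The result is of type SplitResult, which is also a tuple, so values can likewise be read by attribute or by index.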
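
A minimal sketch of urlsplit() on a made-up URL, showing that the ';user' part stays inside path:

```python
from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')

print(result)
print(result.scheme, result[0])   # attribute access and index access agree
print(len(result))                # 5 components instead of 6
```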

【4】urlunsplit()

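Similar to urlunparse(): it also combines the parts of a link into a complete link, and its argument is also an iterable. The only difference is that the length must be 5.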
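
A minimal sketch of urlunsplit() with five made-up components, followed by a round trip through urlsplit():

```python
from urllib.parse import urlsplit, urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
url = urlunsplit(data)
print(url)

# Splitting the result and joining it again yields the same URL
round_trip = urlunsplit(urlsplit(url))
print(round_trip)
```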

【5】urljoin()

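This method makes it easy to parse, combine, and generate links.

urlunparse() and urlunsplit() both assemble a complete link, but they require an object of a specific length in which every part of the link is clearly separated.

urljoin() works differently. A base link, base_url, is passed as the first argument and a new link as the second. The method analyzes the scheme, netloc, and path of base_url, fills in whatever is missing from the new link, and returns the result.

base_url supplies scheme, netloc, and path: if a part is absent from the new link, it is filled in from base_url; if the new link already has it, the new link's part is used.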
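
The filling-in rule is easiest to see with a few cases side by side. The URLs in this sketch are made up for illustration.

```python
from urllib.parse import urljoin

# New link has no scheme or netloc: both come from base_url
relative = urljoin('http://www.baidu.com', 'FAQ.html')

# New link is complete: base_url contributes nothing
absolute = urljoin('http://www.baidu.com/about.html',
                   'https://example.org/FAQ.html')

# New link is a sibling path: it replaces the last path segment of base_url
sibling = urljoin('http://www.baidu.com/a/b.html', 'c.html')

# New link is only a query: the base path is kept, the base query is replaced
query_only = urljoin('http://www.baidu.com/about.html?wd=abc', '?category=2')

print(relative, absolute, sibling, query_only, sep='\n')
```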

【6】urlencode()

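Very useful for building GET request parameters (serialization).

from urllib.parse import urlencode
params = {'name': 'germey', 'age': 22}
# The trailing '?' separates the base URL from the query string
base_url = 'http://www.baidu.com?'
new_url = base_url + urlencode(params)
print(new_url)

Output:
http://www.baidu.com?name=germey&age=22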

【7】parse_qs()

Converts GET request parameters back into a dictionary (deserialization).

from urllib.parse import parse_qs
query='name=germy&age=22'
print(parse_qs(query))

Output:
{'name': ['germy'], 'age': ['22']}

【8】parse_qsl()

Converts GET request parameters into a list of tuples.

from urllib.parse import parse_qsl
query = 'name=germy&age=22'
print(parse_qsl(query))

Output:
[('name', 'germy'), ('age', '22')]
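
The values returned by parse_qs() are lists because a key may appear more than once in a query string. The sketch below uses a made-up parameter set with a repeated key and passes doseq=True to urlencode() so that the list is expanded into repeated keys.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

# doseq=True turns a list value into one key=value pair per item
query = urlencode({'name': 'germey', 'tag': ['a', 'b']}, doseq=True)

as_dict = parse_qs(query)     # values are grouped into lists per key
as_pairs = parse_qsl(query)   # every pair is kept, in order

print(query)
print(as_dict)
print(as_pairs)
```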

【9】quote()

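Converts content into URL-encoded format.

A URL containing Chinese characters may end up garbled; this method converts the Chinese characters into URL-encoded format.

from urllib.parse import quote
keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Output:
https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8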

【10】unquote()

Decodes URL-encoded content.

from urllib.parse import unquote
url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))

Output:
https://www.baidu.com/s?wd=壁纸
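
quote() and unquote() are inverses of each other, which the following sketch checks. It also shows the safe parameter of quote(), which by default leaves '/' unencoded; the sample strings are made up.

```python
from urllib.parse import quote, unquote

keyword = '壁纸'
encoded = quote(keyword)     # UTF-8 bytes, percent-encoded
decoded = unquote(encoded)   # back to the original text

default_safe = quote('a b/c')        # '/' is kept by default
no_safe = quote('a b/c', safe='')    # '/' is encoded as well

print(encoded, decoded)
print(default_safe, no_safe)
```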
