A Detailed Guide to the Python 3 urllib Library

YOUNGBC

Category: Python web crawlers

Article link: https://blog.csdn.net/qq_43645530/article/details/104088069

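1. Introduction to urllib

1.1 What is urllib

urllib is the package in the Python standard library for working with URLs and making network requests.
It consists of four modules:

urllib.request      request module
urllib.error        exception-handling module
urllib.parse        URL-parsing module
urllib.robotparser  robots.txt-parsing module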

1.2 Changes from Python 2

Python 2

import urllib2
response = urllib2.urlopen('http://www.baidu.com')

Python 3

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

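2. Sending requests

2.1 urlopen()

The urllib.request module provides the most basic way to construct an HTTP request. With it you can simulate a browser issuing a request, and it also handles authentication, redirection, browser cookies, and more.
Taking the official Python website as an example, let's fetch the page:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
# read() returns bytes, so decode them to get the HTML text
print(response.read().decode('utf-8'))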

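2.2 The timeout parameter: setting a request timeout

Some requests never get a response because of network problems, so it is useful to set a timeout manually. The value is given in seconds. When a request times out, urlopen() raises an exception, and we can decide what to do next, for example drop the request or try it again.

import urllib.request

url = "http://tieba.baidu.com"
# Give up if no response arrives within 1 second
response = urllib.request.urlopen(url, timeout=1)
print(response.read().decode('utf-8'))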
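The "try it again" option mentioned above can be sketched roughly as follows. This is only an illustration, not part of urllib: the helper name fetch_with_retry and its retries and opener parameters are made up for this example. It assumes a timeout can show up in two ways, either as a URLError whose reason is a socket.timeout (while connecting) or as a bare socket.timeout (while reading).

```python
import socket
import urllib.error
import urllib.request


def fetch_with_retry(url, timeout=1.0, retries=2, opener=urllib.request.urlopen):
    """Hypothetical helper: return the body as bytes, or None if every attempt timed out.

    `opener` is any callable with the same calling convention as urlopen();
    it is a parameter only so the function can be tried without network access.
    """
    for _attempt in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.URLError as e:
            # A timeout while connecting is wrapped in URLError;
            # any other failure is re-raised unchanged.
            if not isinstance(e.reason, socket.timeout):
                raise
        except socket.timeout:
            # A timeout while reading the body is raised directly.
            pass
    return None
```

A call such as fetch_with_retry("http://tieba.baidu.com", timeout=1, retries=2) would then make up to three attempts before giving up and returning None.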

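2.3 The data parameter

The data parameter is optional. If it is passed, the request is sent as a POST instead of a GET.

import urllib.parse
import urllib.request

# urlencode() turns the dict into the string 'word=hello'; bytes() encodes it
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Here we pass one parameter, word, whose value is hello. It has to be converted to the bytes type (a byte stream). The conversion uses bytes(), whose first argument must be a str, so we first call urlencode() from the urllib.parse module to turn the parameter dict into a string. The second argument specifies the encoding, here utf-8.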
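The encoding steps and the GET-to-POST switch can be checked without sending anything. The following minimal sketch builds two Request objects (a class from urllib.request that this article does not otherwise cover) and asks each one which HTTP method it would use; the variable names are chosen only for this illustration.

```python
import urllib.parse
import urllib.request

form = {"word": "hello"}

# Step 1: dict -> URL-encoded string
encoded = urllib.parse.urlencode(form)
# Step 2: string -> bytes, the type that the data parameter expects
data = encoded.encode("utf-8")

# Constructing a Request does not touch the network
get_request = urllib.request.Request("http://httpbin.org/post")
post_request = urllib.request.Request("http://httpbin.org/post", data=data)

print(encoded)                     # word=hello
print(data)                        # b'word=hello'
print(get_request.get_method())    # GET
print(post_request.get_method())   # POST
```

Calling encoded.encode("utf-8") is equivalent to the bytes(encoded, encoding="utf-8") form used above.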

3. Advanced usage

3.1 Authentication

Some websites pop up a dialog as soon as you open them, asking for a username and password, and only show the page after the credentials are verified. HTTPBasicAuthHandler can handle this case:

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Register the credentials for this URL; None means "any realm"
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
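To see what the password manager in the example actually stores, it can be queried directly with find_user_password(), which is the lookup the handler performs when the server asks for credentials. This is a small sketch that needs no server; the address http://localhost:5000/ and the credentials are placeholders, and the realm name "Some Realm" is invented for the illustration.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

url = "http://localhost:5000/"  # placeholder address of a protected site

p = HTTPPasswordMgrWithDefaultRealm()
# Passing None as the realm registers a default that matches any realm
p.add_password(None, url, "username", "password")

# Any realm name and any path below the registered URL fall back to the default entry
creds_same_site = p.find_user_password("Some Realm", url + "private/page")
# A different host has no registered credentials
creds_other_site = p.find_user_password(None, "http://example.com/")

print(creds_same_site)   # ('username', 'password')
print(creds_other_site)  # (None, None)
```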

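3.2 Proxies

When writing a crawler you will sooner or later need a proxy. One can be added like this:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# Map each protocol to the proxy that should carry its traffic
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Here a proxy has been set up locally, listening on port 9743. ProxyHandler takes a dict as its argument: each key is a protocol type (such as http or https) and each value is the proxy URL, so several proxies can be configured at once. We then use this handler and build_opener() to construct an opener, and send the request through it.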

3.3 Cookies

Handling cookies also requires a dedicated handler.

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# The jar now holds every cookie the server set in its response
for item in cookie:
    print(item.name + "=" + item.value)

First we create a CookieJar object. Next we use HTTPCookieProcessor to build a handler from it, then build an opener with build_opener(), and finally call its open() method.

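4. Exception handling

4.1 URLError

The URLError class comes from the error module of urllib. It inherits from OSError and is the base class of the exceptions in the error module, so any exception raised by the request module can be handled by catching it.
It has one attribute, reason, which gives the cause of the error.
Here is an example:

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)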

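4.2 HTTPError

HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication. It has the following three attributes:

code: the HTTP status code, for example 404 when the page does not exist or 500 for an internal server error.
reason: the cause of the error, as in the parent class.
headers: the headers of the HTTP response that caused the error.

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')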
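Because HTTPError is a subclass of URLError, a common pattern is to catch HTTPError first and fall back to URLError for everything else. The sketch below illustrates that ordering without network access: describe_failure, raise_http_error and raise_url_error are hypothetical names invented for this example, and the two exceptions are constructed by hand with made-up URLs and messages instead of coming from a real request.

```python
import io
from urllib import error


def describe_failure(action):
    """Hypothetical helper: run `action` and report which kind of error occurred."""
    try:
        action()
    except error.HTTPError as e:
        # Must come first: an HTTPError would also match the URLError clause
        return f"HTTP error {e.code}: {e.reason}"
    except error.URLError as e:
        return f"URL error: {e.reason}"
    return "ok"


def raise_http_error():
    # Hand-built stand-in for a 404 response (illustrative values only)
    raise error.HTTPError("http://example.invalid/missing", 404, "Not Found", {}, io.BytesIO())


def raise_url_error():
    # Hand-built stand-in for a connection failure (illustrative value only)
    raise error.URLError("connection refused")


print(describe_failure(raise_http_error))  # HTTP error 404: Not Found
print(describe_failure(raise_url_error))   # URL error: connection refused
print(describe_failure(lambda: None))      # ok
```

In real code the action would be a call such as request.urlopen(url); if the two except clauses were swapped, the HTTPError branch would never run.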

5. URL parsing

This section covers the most common functions; see the official documentation for the full list.

5.1 urlparse()

This function identifies the parts of a URL and splits it into components.

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

Here we parse a URL with urlparse(). The code first prints the type of the result and then the result itself.
The output is:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

As you can see, the result is a ParseResult object with 6 parts: scheme, netloc, path, params, query and fragment.
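ParseResult is a named tuple, so its six parts can be read by attribute name, by index, or by unpacking. A minimal sketch using the same URL as above:

```python
from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")

# Access by attribute name or by position gives the same value
print(result.scheme, result[0])   # http http
print(result.netloc, result[1])   # www.baidu.com www.baidu.com

# Tuple unpacking yields all six parts in order
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)  # /index.html user id=5 comment
```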

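5.2 urlunparse()

urlparse() has a counterpart, urlunparse(). It accepts an iterable whose length must be exactly 6; otherwise it raises an error about too few or too many values. Here is an example:

from urllib.parse import urlunparse

# scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

The output is:

http://www.baidu.com/index.html;user?a=6#comment

This is how a URL can be constructed from its parts.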
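The two functions are inverses of each other, which makes it easy to change one part of an existing URL. The sketch below parses a URL, rebuilds it unchanged, and then rebuilds it with a different query string using _replace(), the standard named-tuple method; the new query value id=6 is just an illustrative choice.

```python
from urllib.parse import urlparse, urlunparse

url = "http://www.baidu.com/index.html;user?id=5#comment"
parts = urlparse(url)

# Round trip: parsing and then unparsing gives back the same URL
rebuilt = urlunparse(parts)
print(rebuilt == url)  # True

# Swap a single component and build a new URL from the result
changed = urlunparse(parts._replace(query="id=6"))
print(changed)  # http://www.baidu.com/index.html;user?id=6#comment
```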

5.3 urlsplit()

This function is very similar to urlparse(), except that it does not parse params separately and returns only 5 parts. The params from the example above are merged into path.

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

The output is:

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
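The difference between the two functions is easiest to see by running both on the same URL. A short side-by-side sketch:

```python
from urllib.parse import urlparse, urlsplit

url = "http://www.baidu.com/index.html;user?id=5#comment"
parsed = urlparse(url)
split = urlsplit(url)

# urlparse() separates params from path; urlsplit() leaves them inside path
print(len(parsed), parsed.path, parsed.params)  # 6 /index.html user
print(len(split), split.path)                   # 5 /index.html;user

# The remaining parts are identical
print(parsed.query == split.query, parsed.fragment == split.fragment)  # True True
```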

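5.4 urlunsplit()

Like urlunparse(), this function combines the parts of a URL into a complete URL. Its argument is also an iterable, such as a list or tuple; the only difference is that the length must be 5.

from urllib.parse import urlunsplit

# scheme, netloc, path, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

The output is:

http://www.baidu.com/index.html?a=6#comment

For the remaining functions, see the official documentation: https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse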
