I. The urllib Library: Python's Built-in HTTP Request Library
urllib.request      request module (simulates sending a request)
urllib.error        exception handling module
urllib.parse        URL parsing module (provides many URL handling functions, such as splitting and joining)
urllib.robotparser  robots.txt parsing module (a short usage sketch follows this list)
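urllib.robotparser is not demonstrated later in this post, so here is a minimal sketch of how it is typically used; the target site is only an example.

import urllib.robotparser

# Fetch and parse a site's robots.txt, then ask whether a given URL may be crawled
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()  # downloads and parses robots.txt

print(rp.can_fetch('*', 'http://www.baidu.com/index.html'))  # True or False depending on the rules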
II. Usage Examples
1. urlopen
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The first three parameters are used most often; the remaining ones are rarely needed.
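The later keyword arguments relate to SSL/TLS handling and will not come up again in this post, but as a rough sketch of how they might be used, an ssl context can be passed like this (the target URL is only an example):

import ssl
import urllib.request

# Build a default SSL context and hand it to urlopen for HTTPS verification
ctx = ssl.create_default_context()
response = urllib.request.urlopen('https://www.baidu.com', context=ctx)
print(response.status)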
(1) The first parameter: url
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')  # a GET request
print(response.read().decode('utf-8'))
'''
You can also construct a Request object first:
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
'''
The output is the HTML source of the Baidu homepage.
(2) The second parameter: data
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)  # passing data makes urlopen send a POST request
print(response.read())
The output is:
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "world": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "11", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7", \n "X-Amzn-Trace-Id": "Root=1-5e426274-a3f0622e47583590491e9caa"\n }, \n "json": null, \n "origin": "223.88.90.230", \n "url": "http://httpbin.org/post"\n}\n'
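response.read() returns raw bytes. Since httpbin.org replies with JSON, the body can be decoded into a Python dict; a minimal sketch:

import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)

body = json.loads(response.read())  # parse the JSON response body
print(body['form'])                 # {'world': 'hello'}: httpbin echoes back the form data we sent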
(3) The third parameter: timeout
# Example 1: no timeout
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)  # 1 second; if the response arrives in time it prints normally
print(response.read())
# Example 2: timeout
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # 0.1 seconds; a timeout triggers the except block
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # if the cause was a timeout, print TIME OUT
        print('TIME OUT')
Example 2 prints:
TIME OUT
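In a crawler, a timed-out request is often simply retried. As a rough sketch (the helper name and retry count are arbitrary choices, not part of urllib):

import socket
import urllib.error
import urllib.request

def fetch_with_retry(url, timeout=1, retries=3):
    """Try a request up to `retries` times, retrying only on timeouts."""
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=timeout).read()
        except urllib.error.URLError as e:
            if not isinstance(e.reason, socket.timeout):
                raise  # non-timeout errors are re-raised immediately
    return None  # every attempt timed out

print(fetch_with_retry('http://httpbin.org/get'))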
(4) About the response
You can inspect the response type, status code, and response headers:
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))          # response type
print(response.status)         # status code
print(response.getheaders())   # response headers
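urlopen returns an http.client.HTTPResponse object, so a single header can also be read directly; a small sketch:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.getheader('Server'))  # value of one specific response header, e.g. the server software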
(5) Building a POST request: use a Request object to add headers and form data
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
'''
Headers can also be added with req.add_header():
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')
'''
response = request.urlopen(req)
print(response.read().decode('utf-8'))
2. Cookies
Cookies preserve login state, which makes it possible to crawl pages that require authentication.
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)  # handler that stores cookies in the CookieJar
opener = urllib.request.build_opener(handler)         # build an opener from the handler
response = opener.open('http://www.baidu.com')
for item in cookie:                                   # print the cookies that were set
    print(item.name + '=' + item.value)
The output is:
BAIDUID=DD72491D2B581558F3408ACDEC22350C:FG=1
BIDUPSID=DD72491D2B581558D20BEA6F906A04C3
H_PS_PSSID=1461_21105
PSTM=1581496829
delPer=0
BDSVRTM=0
BD_HOME=0
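Cookies can also be persisted to disk and reused in a later session, for example with http.cookiejar.MozillaCookieJar; a minimal sketch (the filename cookies.txt is an arbitrary choice):

import http.cookiejar
import urllib.request

# Save the cookies received during a request...
cookie = http.cookiejar.MozillaCookieJar('cookies.txt')
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# ...and load them again later so the new opener reuses them
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')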
3. Exception Handling
When urlopen cannot handle the server's response, it raises a URLError. HTTPError is a subclass of URLError.
HTTPError carries more detail than URLError: it lets you examine the exact failure and exposes the corresponding HTTP status code, reason, and response headers.
# URLError example
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index_html')
except error.URLError as e:
    print(e.reason)  # prints: Not Found
# HTTPError vs. URLError
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index_html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')  # printed only if no exception was raised
The output is:
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Wed, 12 Feb 2020 09:03:53 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: PHPSESSID=94j0c7n2t6l640b7hn6or3g680; path=/
Pragma: no-cache
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: ; rel="https://api.w.org/"
4. URL Parsing
(1) urlparse: split a URL string into its components
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
# Parse a URL
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')
The output is:
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
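ParseResult is a named tuple, so its pieces can be read either by attribute or by index; a small sketch:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result[0])     # 'https' twice: attribute access and index access are equivalent
print(result.netloc, result.query)  # 'www.baidu.com' and 'id=5'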
(2) urlunparse: build a URL from a sequence of exactly 6 components
from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))  # prints: https://www.baidu.com/index.html;user?a=6#comment
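The module description above also mentioned joining URLs. urllib.parse.urljoin resolves a (possibly relative) link against a base URL; a minimal sketch:

from urllib.parse import urljoin

# A relative second argument is resolved against the base URL
print(urljoin('https://www.baidu.com', 'FAQ.html'))  # https://www.baidu.com/FAQ.html
# An absolute second argument overrides the base
print(urljoin('https://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))  # https://cuiqingcai.com/FAQ.html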
(3) urlencode: convert a dict into query-string parameters
from urllib.parse import urlencode

params = {
    'name': 'jack',
    'age': 20
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # prints: https://www.baidu.com?name=jack&age=20
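The reverse direction is also available: urllib.parse.parse_qs (or parse_qsl) turns a query string back into Python data; a small sketch:

from urllib.parse import parse_qs, parse_qsl

query = 'name=jack&age=20'
print(parse_qs(query))   # {'name': ['jack'], 'age': ['20']}
print(parse_qsl(query))  # [('name', 'jack'), ('age', '20')]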