I. The urllib Library: Python's Built-in HTTP Request Library
urllib.request      request module (simulates sending a request)
urllib.error        exception handling module
urllib.parse        URL parsing module (provides many URL handling functions, such as splitting and joining)
urllib.robotparser  robots.txt parsing module (a short usage sketch follows this list)
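urllib.robotparser is not demonstrated later in this post, so here is a minimal sketch of how it is typically used; the target site is only an example.

import urllib.robotparser

# Fetch and parse a site's robots.txt, then ask whether a given URL may be crawled
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()  # downloads and parses robots.txt

print(rp.can_fetch('*', 'http://www.baidu.com/index.html'))  # True or False depending on the rules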
II. Usage Examples
1. urlopen
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The first three parameters are used most often; the remaining ones are rarely needed.
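The later keyword arguments relate to SSL/TLS handling and will not come up again in this post, but as a rough sketch of how they might be used, an ssl context can be passed like this (the target URL is only an example):

import ssl
import urllib.request

# Build a default SSL context and hand it to urlopen for HTTPS verification
ctx = ssl.create_default_context()
response = urllib.request.urlopen('https://www.baidu.com', context=ctx)
print(response.status)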
(1) The first parameter: url
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')  # a GET request
print(response.read().decode('utf-8'))
'''
You can also construct a Request object first:
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
'''
The output is the HTML source of the Baidu homepage.
(2) The second parameter: data
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)  # passing data makes urlopen send a POST request
print(response.read())
The output is:
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "world": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "11", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7", \n "X-Amzn-Trace-Id": "Root=1-5e426274-a3f0622e47583590491e9caa"\n }, \n "json": null, \n "origin": "223.88.90.230", \n "url": "http://httpbin.org/post"\n}\n'
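response.read() returns raw bytes. Since httpbin.org replies with JSON, the body can be decoded into a Python dict; a minimal sketch:

import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)

body = json.loads(response.read())  # parse the JSON response body
print(body['form'])                 # {'world': 'hello'}: httpbin echoes back the form data we sent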
(3) The third parameter: timeout
# Example 1: no timeout
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)  # 1 second; if the response arrives in time it prints normally
print(response.read())
# Example 2: timeout
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # 0.1 seconds; a timeout triggers the except block
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # if the cause was a timeout, print TIME OUT
        print('TIME OUT')
Example 2 prints:
TIME OUT
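In a crawler, a timed-out request is often simply retried. As a rough sketch (the helper name and retry count are arbitrary choices, not part of urllib):

import socket
import urllib.error
import urllib.request

def fetch_with_retry(url, timeout=1, retries=3):
    """Try a request up to `retries` times, retrying only on timeouts."""
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=timeout).read()
        except urllib.error.URLError as e:
            if not isinstance(e.reason, socket.timeout):
                raise  # non-timeout errors are re-raised immediately
    return None  # every attempt timed out

print(fetch_with_retry('http://httpbin.org/get'))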
(4) About the response
You can inspect the response type, status code, and response headers:
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))          # response type
print(response.status)         # status code
print(response.getheaders())   # response headers
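urlopen returns an http.client.HTTPResponse object, so a single header can also be read directly; a small sketch:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.getheader('Server'))  # value of one specific response header, e.g. the server software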
(5) Building a POST request: use a Request object to add headers and form data
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
'''
Headers can also be added with req.add_header():
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')
'''
response = request.urlopen(req)
print(response.read().decode('utf-8'))
2. Cookies
Cookies preserve login state, which makes it possible to crawl pages that require authentication.
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)  # handler that stores cookies in the CookieJar
opener = urllib.request.build_opener(handler)         # build an opener from the handler
response = opener.open('http://www.baidu.com')
for item in cookie:                                   # print the cookies that were set
    print(item.name + '=' + item.value)
The output is:
BAIDUID=DD72491D2B581558F3408ACDEC22350C:FG=1
BIDUPSID=DD72491D2B581558D20BEA6F906A04C3
H_PS_PSSID=1461_21105
PSTM=1581496829
delPer=0
BDSVRTM=0
BD_HOME=0
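Cookies can also be persisted to disk and reused in a later session, for example with http.cookiejar.MozillaCookieJar; a minimal sketch (the filename cookies.txt is an arbitrary choice):

import http.cookiejar
import urllib.request

# Save the cookies received during a request...
cookie = http.cookiejar.MozillaCookieJar('cookies.txt')
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# ...and load them again later so the new opener reuses them
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')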
3. Exception Handling
When urlopen cannot handle the server's response, it raises a URLError. HTTPError is a subclass of URLError.
HTTPError carries more detail than URLError: it lets you examine the exact failure and exposes the corresponding HTTP status code, reason, and response headers.
# URLError example
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index_html')
except error.URLError as e:
    print(e.reason)  # prints: Not Found
# HTTPError vs. URLError
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index_html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')  # printed only if no exception was raised
The output is:
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Wed, 12 Feb 2020 09:03:53 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: PHPSESSID=94j0c7n2t6l640b7hn6or3g680; path=/
Pragma: no-cache
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: ; rel="https://api.w.org/"
4. URL Parsing
(1) urlparse: split a URL string into its components
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
# Parse a URL
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')
The output is:
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
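ParseResult is a named tuple, so its pieces can be read either by attribute or by index; a small sketch:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result[0])     # 'https' twice: attribute access and index access are equivalent
print(result.netloc, result.query)  # 'www.baidu.com' and 'id=5'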
(2) urlunparse: build a URL from a sequence of exactly 6 components
from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))  # prints: https://www.baidu.com/index.html;user?a=6#comment
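The module description above also mentioned joining URLs. urllib.parse.urljoin resolves a (possibly relative) link against a base URL; a minimal sketch:

from urllib.parse import urljoin

# A relative second argument is resolved against the base URL
print(urljoin('https://www.baidu.com', 'FAQ.html'))  # https://www.baidu.com/FAQ.html
# An absolute second argument overrides the base
print(urljoin('https://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))  # https://cuiqingcai.com/FAQ.html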
(3) urlencode: convert a dict into query-string parameters
from urllib.parse import urlencode

params = {
    'name': 'jack',
    'age': 20
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # prints: https://www.baidu.com?name=jack&age=20
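The reverse direction is also available: urllib.parse.parse_qs (or parse_qsl) turns a query string back into Python data; a small sketch:

from urllib.parse import parse_qs, parse_qsl

query = 'name=jack&age=20'
print(parse_qs(query))   # {'name': ['jack'], 'age': ['20']}
print(parse_qsl(query))  # [('name', 'jack'), ('age', '20')]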