Concepts:
urllib is Python's built-in HTTP request library. It contains four modules:
request: sends requests
error: handles exceptions
parse: utility module for URL handling
robotparser: parses robots.txt to determine which pages may be crawled (rarely used)
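As a quick illustration of the rarely used robotparser module, the sketch below parses a hypothetical robots.txt (the rules and URLs are invented for the demo) fed in as text, so no network access is needed:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as text instead of
# being fetched over the network
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Rules are matched in order: /private/ pages are blocked,
# everything else is allowed
print(rp.can_fetch('*', 'http://example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://example.com/private/a'))   # False
```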
Sending Requests
-
1.1 urlopen()
```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))
```
With just two lines of code, we printed the source of the page, including its links, image addresses, and text.
Next, let's use type() to see what urlopen() returns:

```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))
```

Output:

<class 'http.client.HTTPResponse'>
This is an HTTPResponse object. It provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), as well as attributes such as msg, status, reason, debuglevel, and closed, so we can read different pieces of information by calling the appropriate method or attribute.
```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Strict-Transport-Security'))
```

Output:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48940'), ('Accept-Ranges', 'bytes'), ('Date', 'Tue, 15 Jan 2019 10:28:43 GMT'), ('Via', '1.1 varnish'), ('Age', '212'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2134-IAD, cache-tyo19929-TYO'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 404'), ('X-Timer', 'S1547548123.453413,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
max-age=63072000; includeSubDomains
The above is a basic urllib request. We can also pass extra parameters to urlopen(). Its full signature is:

```python
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
```
-
The data parameter
Here we pass a parameter word whose value is hello. The data must be a bytes object, so we use urlencode() to convert the dict into a query string and then encode it as UTF-8.

```python
import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen(url, data=data)
print(response.read())
```

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "json": null,
  "origin": "119.4.133.18",
  "url": "http://httpbin.org/post"
}
-
The timeout parameter
Sets a timeout in seconds: the maximum time to wait for the page to respond before an exception is raised.
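A sketch of the timeout behaviour, using a throwaway local server that deliberately responds too slowly (the server, port, and delay are all invented for this demo), so no external site is involved. Depending on where the timeout strikes, urllib raises either urllib.error.URLError or socket.timeout:

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)              # respond slower than the client's timeout
        try:
            self.send_response(200)
            self.end_headers()
        except OSError:
            pass                   # client already gave up
    def log_message(self, *args):  # silence request logging
        pass

# Throwaway local server on a random free port, in a daemon thread
server = http.server.HTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

timed_out = False
try:
    urllib.request.urlopen(url, timeout=0.5)
except (urllib.error.URLError, socket.timeout):
    timed_out = True
print('TIME OUT' if timed_out else 'OK')  # TIME OUT
```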
-
1.2 Request
Basic usage:

```python
import urllib.request

request = urllib.request.Request('http://python.org')
response = urllib.request.urlopen(request)
print(response.read())
```
We still send the request with urlopen(), but pass a Request object instead of a bare URL. The Request constructor has the following signature:

```python
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
```
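A sketch of constructing a Request with data, headers, and method, without actually sending it (the httpbin.org URL and the header values are just placeholders):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Form data must be a bytes object, as with urlopen()
data = bytes(urlencode({'name': 'germey'}), encoding='utf-8')
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Host': 'httpbin.org'
}
req = Request('http://httpbin.org/post', data=data, headers=headers, method='POST')

print(req.get_method())              # POST
print(req.get_header('User-agent'))  # Mozilla/5.0 (header keys are stored capitalized)
print(req.data)                      # b'name=germey'
```

Passing the finished req to urlopen(req) would then send it exactly as in the example above.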
Exception Handling
- URLError
Has a reason attribute giving the cause of the error.
-
HTTPError
A subclass of URLError that handles HTTP request errors specifically.
It has three attributes: code (the HTTP status code), reason, and headers (the response headers).
To sum up, here is a practical example (catch the subclass error first, then the parent class):

```python
from urllib import request, error

try:
    response = request.urlopen('http://www.baidu.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('request successfully')
```
URL Parsing
-
urlparse()
Identifies a URL and splits it into its parts:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=6#comment')
print(result)
```

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=6', fragment='comment')
The standard format of a URL (6 parts):

scheme://netloc/path;params?query#fragment

scheme: protocol
netloc: domain name
path: access path
params: parameters for the last path segment
query: query conditions
fragment: anchor, used to jump to a position within the page
For example: http://www.baidu.com/index.html;user?id=6#comment
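Since urlparse() returns a named tuple (ParseResult), each of the six parts can be read either by attribute name or by index:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=6#comment')

# By attribute and by position - both refer to the same field
print(result.scheme, result[0])  # http http
print(result.netloc, result[1])  # www.baidu.com www.baidu.com
print(result.query)              # id=6
print(result.fragment)           # comment
```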
-
urlunparse()
The inverse of urlparse(): it builds a URL from a sequence of its 6 parts.

```python
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=6', 'comment']
result = urlunparse(data)
print(result)
```

Output:

http://www.baidu.com/index.html;user?id=6#comment
-
urlencode()
This method is very useful for building GET requests.
For example:

```python
from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
```

Output:

http://www.baidu.com?name=germey&age=22

We first declare the parameters as a dict, then use urlencode() to serialize them into a GET query string.
- quote
Converts content to URL-encoded form, e.g. turning Chinese characters into percent-encoded URL text.
```python
from urllib.parse import quote

keyword = '美食'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
```

Output:

https://www.baidu.com/s?wd=%E7%BE%8E%E9%A3%9F
- unquote
Decodes URL-encoded content. Let's decode the output from above:
```python
from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E7%BE%8E%E9%A3%9F'
print(unquote(url))
```

Output:

https://www.baidu.com/s?wd=美食
References
Cui Qingcai, Python 3 Web Crawler Development in Practice (《Python3网络爬虫开发实战》)