Python: urllib

垃圾桶随意收

Article link: https://blog.csdn.net/han_qing1213/article/details/105740997

urllib

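A URL-handling package that provides a set of modules for working with URLs.
urllib.request: opens and reads URLs; the most basic request-handling module.
urllib.error: the exception-handling module, containing the exceptions raised by urllib.request. Catching them when a request fails lets the remaining requests carry on normally.
urllib.parse: parses URLs and provides many URL-handling functions, such as splitting, parsing, and joining.
urllib.robotparser: parses robots.txt files (rarely used).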

I. Sending requests: request

The urllib.request module provides the most basic way to build an HTTP request. With it you can simulate the request a browser would make; it also handles authentication, redirections, cookies, and more.

1. request.urlopen()

Signature:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Parameters:
url: the URL to open; either a string or a Request object.
data: must be a byte stream, i.e. of type bytes, which you can produce with bytes() or str.encode(). For HTTP requests, passing data makes the request a POST instead of a GET.

from urllib import parse

data = bytes(parse.urlencode({"name": "han", "sex": "女"}), encoding='utf-8')
# data is now: b'name=han&sex=%E5%A5%B3'

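timeout: the timeout in seconds. If no response arrives within this time an exception is raised; if omitted, the global default timeout is used. It applies to HTTP, HTTPS, and FTP requests. Wrap the call in try/except to handle it: a connection timeout surfaces as a URLError whose reason is a socket.timeout, while a timeout while waiting for or reading the response can raise socket.timeout directly, so it is safest to catch both.
context: must be an ssl.SSLContext instance; it specifies the SSL settings.
cafile and capath: specify a CA certificate file and a directory of CA certificates, which is useful for HTTPS requests. Both have been deprecated since Python 3.6 in favor of context.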
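
The timeout and context parameters described above can be combined in a small helper. This is only a sketch: fetch_status and strict_tls_context are hypothetical names that do not come from urllib, and the URL in the usage note is a placeholder.

```python
import socket
import ssl
from urllib import error, request


def strict_tls_context():
    """Return an SSLContext with Python's default certificate verification."""
    return ssl.create_default_context()


def fetch_status(url, timeout=3.0, context=None):
    """Return the HTTP status code, or None if the request fails or times out.

    fetch_status is a hypothetical helper, not part of urllib.
    """
    try:
        with request.urlopen(url, timeout=timeout, context=context) as response:
            return response.status
    except (error.URLError, socket.timeout) as exc:
        # URLError covers connection failures, including connect timeouts;
        # socket.timeout covers a timeout while waiting for the response.
        print(f"request failed: {exc!r}")
        return None
```

A call such as fetch_status('https://www.baidu.com', timeout=1, context=strict_tls_context()) should return the status code on success and None on any of the failures handled above.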

Example:

from urllib import request

def urllib_request():
    response = request.urlopen('http://www.baidu.com')
    print(type(response))   # <class 'http.client.HTTPResponse'>
    # print(response.read().decode('utf-8'))
    print(response.status)  # 200
    print(response.reason)  # OK
    print(response.code)    # 200
    print(response.getheader('server')) # BWS/1.1
    print(response.msg) # OK
    print(response.version) # 11
    print(response.closed)  # False
    print(response.debuglevel)  # 0

urllib_request()

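urlopen() returns an HTTPResponse object. Its main methods are read(), readinto(), getheader(name), getheaders(), and fileno(), and its main attributes are msg, version, status, reason, debuglevel, and closed. Of these:

read(): reads the response body (the page content)
getheader(name): returns the value of the named response header
getheaders(): returns all response headers as a list of (name, value) tuples
status: the HTTP status code of the response
code: the status code (same value as status)
reason: the reason phrase returned by the server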

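2. The Request object

Constructor:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The first parameter, url, is the request URL. It is the only required parameter; the rest are optional.
The second parameter, data, must be of type bytes if given. If you have a dict, encode it first with urlencode() from urllib.parse.
The third parameter, headers, is a dict of request headers. You can pass it when constructing the Request, or add headers later by calling add_header() on the instance. The most common use is changing User-Agent to imitate a browser; the default User-Agent is Python-urllib followed by the Python version.
The fourth parameter, origin_req_host, is the host name or IP address of the party that originated the request.
The fifth parameter, unverifiable, says whether the request is unverifiable; it defaults to False. A request is unverifiable when the user had no opportunity to approve it. For example, if the request is for an image in an HTML document and the user had no option to approve the automatic fetching of that image, unverifiable should be True.
The sixth parameter, method, is a string naming the HTTP method to use, such as GET, POST, or PUT.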

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form_data = {  # named form_data rather than dict to avoid shadowing the built-in
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
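
A Request can be inspected before it is sent, which makes the effect of data and add_header() easy to see. The sketch below sends nothing over the network; build_request is a hypothetical helper, and the User-Agent string is only an illustrative value.

```python
from urllib import parse, request


def build_request(url, fields=None, user_agent="Mozilla/5.0 (illustrative)"):
    """Build (but do not send) a Request; hypothetical helper for illustration."""
    data = None
    if fields is not None:
        # urlencode() returns str, so encode it to bytes for the data parameter.
        data = parse.urlencode(fields).encode("utf-8")
    req = request.Request(url, data=data)
    # add_header() is the alternative to passing headers= to the constructor.
    req.add_header("User-Agent", user_agent)
    return req


get_req = build_request("http://httpbin.org/get")
post_req = build_request("http://httpbin.org/post", {"name": "Germey"})

print(get_req.get_method())               # GET
print(post_req.get_method())              # POST
print(post_req.data)                      # b'name=Germey'
print(post_req.get_header("User-agent"))  # Mozilla/5.0 (illustrative)
```

Request stores header names in str.capitalize() form, which is why the lookup uses 'User-agent' rather than 'User-Agent'.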

3. parse

parse.urlunparse(): builds a URL from six components, in the order scheme, netloc, path, params, query, fragment

data = ['https','baidu','com','index.html','user','sex=nv']
print(parse.urlunparse(data))   #https://baidu/com;index.html?user#sex=nv

parse.urlparse(url): splits a URL into its components

result = parse.urlparse('http://www.baidu.com/index.html;user?id=1#comment',allow_fragments=False)
print(result)   #ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=1#comment', fragment='')
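
The parsed result is a named tuple, so each component can be read by name and the whole thing can be fed back into urlunparse(). A short sketch using the same URL, this time with fragments allowed (the default):

```python
from urllib.parse import parse_qs, urlparse, urlunparse

url = "http://www.baidu.com/index.html;user?id=1#comment"
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.baidu.com
print(parts.path)      # /index.html
print(parts.params)    # user
print(parts.query)     # id=1
print(parts.fragment)  # comment

# parse_qs() turns the query string into a dict of lists.
print(parse_qs(parts.query))     # {'id': ['1']}

# urlunparse() is the inverse of urlparse() for this URL.
print(urlunparse(parts) == url)  # True
```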

parse.urljoin(): joins a base URL and another URL

print(parse.urljoin('http://www.bai.com', 'index.html'))    #http://www.bai.com/index.html
print(parse.urljoin('http://www.baicu.com', 'https://www.thanlon.cn/index.html'))   # the second argument is an absolute URL, so it wins: https://www.thanlon.cn/index.html
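
urljoin() resolves the second argument against the base the way a browser resolves a link on a page. A few more cases, using a placeholder base URL on example.com that is not from the original article:

```python
from urllib.parse import urljoin

base = "http://example.com/docs/guide/intro.html"  # placeholder URL

# A relative path replaces the last segment of the base path.
print(urljoin(base, "setup.html"))         # http://example.com/docs/guide/setup.html
# ".." climbs one directory.
print(urljoin(base, "../api/index.html"))  # http://example.com/docs/api/index.html
# A leading "/" restarts from the site root.
print(urljoin(base, "/about.html"))        # http://example.com/about.html
# A bare query string keeps the base path.
print(urljoin(base, "?page=2"))            # http://example.com/docs/guide/intro.html?page=2
# A scheme-relative URL keeps only the base scheme.
print(urljoin(base, "//cdn.example.com/app.js"))  # http://cdn.example.com/app.js
```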

parse.urlencode(): converts a dict into a GET query string

from urllib.parse import urlencode

params = {
    'name': 'Thanlon',
    'age': 22
}
baseUrl = 'http://www.thanlon.cn?'
url = baseUrl + urlencode(params)
print(url)
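
urlencode() also percent-encodes non-ASCII text and can expand list values. The following sketch shows that together with the related parse_qs(), quote(), and unquote() functions; the parameter names and values are made up for illustration.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

params = {"name": "han", "sex": "女", "tag": ["a b", "c"]}

# doseq=True expands a list value into repeated keys instead of str(list).
query = urlencode(params, doseq=True)
print(query)            # name=han&sex=%E5%A5%B3&tag=a+b&tag=c

# parse_qs() reverses the encoding; every value comes back as a list.
print(parse_qs(query))  # {'name': ['han'], 'sex': ['女'], 'tag': ['a b', 'c']}

# quote()/unquote() percent-encode a single string, e.g. a path segment.
print(quote("a b/女"))       # a%20b/%E5%A5%B3
print(unquote("%E5%A5%B3"))  # 女
```

Note that urlencode() writes a space as "+", while quote() writes it as "%20".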

4. cookie

Getting cookies:

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
res = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)

Output:

BAIDUID=051D781715AD99E2B1AE02CA9FF34C8F:FG=1
BIDUPSID=051D781715AD99E29399C176404993B4
H_PS_PSSID=31355_30963_1420_21124_31422_31341_31228_30824_26350_31164
PSTM=1587797201
BDSVRTM=0
BD_HOME=1

Saving cookies

Saving cookies in MozillaCookieJar(filename) format

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
res = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Saving cookies in LWPCookieJar(filename) format:

# coding:utf8
import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
res = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Loading cookies

# Load the saved cookies and send them with the request to fetch the logged-in page
# coding:utf8
import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
resp = opener.open('http://www.baidu.com')
print(resp.read().decode('utf-8'))
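
The save and load steps can be tried without any network access by putting a hand-made cookie into the jar. In this sketch make_session_cookie and save_and_reload are hypothetical helpers, and the cookie name, value, and domain are made up; in real use the cookies come from the server's response.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_session_cookie(name, value, domain):
    """Build a minimal session cookie by hand (hypothetical helper)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def save_and_reload(cookies, path):
    """Save cookies in LWP format, then load them back into a fresh jar."""
    jar = LWPCookieJar(path)
    for cookie in cookies:
        jar.set_cookie(cookie)
    # Session cookies have discard=True, so ignore_discard is needed to keep them.
    jar.save(ignore_discard=True, ignore_expires=True)

    loaded = LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return {cookie.name: cookie.value for cookie in loaded}


with tempfile.TemporaryDirectory() as tmp:
    cookie_path = os.path.join(tmp, "cookie.txt")
    reloaded = save_and_reload(
        [make_session_cookie("token", "abc123", "example.com")], cookie_path
    )
    print(reloaded)  # {'token': 'abc123'}
```

The jar class used for loading should match the one used for saving: a file written by MozillaCookieJar is not in the format LWPCookieJar.load() expects, so the loading example above assumes cookie.txt was written by LWPCookieJar.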

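II. Exception handling module

The two exceptions you will usually meet are URLError and HTTPError.
HTTPError has three attributes:
code: the HTTP status code, for example 404 (page not found) or 500 (internal server error).
reason: the cause of the error, as in the parent class.
headers: the response headers of the failed request.
Because URLError is the parent class of HTTPError, catch the subclass (HTTPError) first and the parent class (URLError) afterwards.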
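
The catch order can be sketched as follows. So that the example runs offline, the errors are raised by hand instead of by a real request; run_request, raise_404, and raise_unreachable are hypothetical names, and in real code the callable would wrap urllib.request.urlopen().

```python
from email.message import Message
from urllib import error


def run_request(do_request):
    """Call do_request() and classify the outcome.

    do_request is a hypothetical zero-argument callable, for example
    lambda: urllib.request.urlopen(url, timeout=5).status
    """
    try:
        return ("ok", do_request())
    except error.HTTPError as exc:
        # Subclass first: the server answered, but with an error status.
        return ("http_error", exc.code, exc.reason)
    except error.URLError as exc:
        # Parent class second: no HTTP response at all (DNS failure, refused connection).
        return ("url_error", str(exc.reason))


def raise_404():
    # Simulate a server answering 404 so the example needs no network.
    raise error.HTTPError("http://example.com/missing", 404, "Not Found", Message(), None)


def raise_unreachable():
    # Simulate a failure that happens before any HTTP response exists.
    raise error.URLError("connection refused")


print(run_request(lambda: 200))        # ('ok', 200)
print(run_request(raise_404))          # ('http_error', 404, 'Not Found')
print(run_request(raise_unreachable))  # ('url_error', 'connection refused')
```

If the two except clauses were swapped, the URLError clause would also catch every HTTPError and the status code would never be reported.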
