Python built-in library: urllib

snistty


Article link: https://blog.csdn.net/Intro21/article/details/84899856


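Introduction

urllib is part of the Python standard library, so it needs no installation and can be imported directly.
It provides the following features:

Sending web requests
Reading responses
Configuring proxies and cookies
Handling exceptions
Parsing URLs

Almost everything a crawler needs can be found in urllib. Learning this standard library also gives a deeper understanding of the more convenient requests library, which builds on the same ideas.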

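Relationship between urllib, urllib2 and urllib.request
Python 2 had two separate libraries, urllib and urllib2, with urllib2 being an upgrade of urllib. Each handled different tasks, their relationship was complicated, and there were encoding problems. Deciding which of the two to use in practice was a frequent headache.
Python 3 merged urllib and urllib2 into a single urllib package.
In Python 3, urllib consists of the following four submodules:

urllib.request: sending requests
urllib.error: exception handling
urllib.parse: URL parsing
urllib.robotparser: parsing robots.txt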

urllib.request

The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

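According to the official documentation, urllib.request is mainly used to open HTTP URLs, and it also supports FTP, local file and data URLs.
HTTPS URLs can be opened as well, but certificate verification may fail in some environments. For ways to handle this, see:
"python3 爬虫https的坑 – 已解决" (Pitfalls of HTTPS in Python 3 crawlers, solved)
"python中requests和https使用简单示例" (A simple example of using requests with HTTPS in Python)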

urllib.request.urlopen: opening a URL

request.urlopen() opens a URL. The url argument can be either a string or a Request object.

Syntax

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

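Explanation

Common parameters:
url: the address to request
data: leave it as None for a GET request. When data is supplied the request is sent as a POST, and the value must be of type bytes (for example, an encoded form body)
timeout: timeout in seconds
Other parameters:
cafile: a CA certificate file, used when requesting HTTPS URLs (deprecated since Python 3.6 in favor of context)
capath: a directory of CA certificates, used when requesting HTTPS URLs (deprecated since Python 3.6 in favor of context)
context: must be an ssl.SSLContext instance; it specifies the SSL settings

Response object returned by urlopen

read(), readline(), readlines(), fileno(), close(): file-like operations on the HTTPResponse object.
info(): returns an HTTPMessage object holding the headers sent back by the remote server.
getcode(): returns the HTTP status code.
geturl(): returns the URL of the resource actually retrieved, which may differ from the requested URL after a redirect.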
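
The response methods listed above can be tried without network access by opening a data: URL, which urlopen() also accepts. This is only a minimal sketch: the URL content is made up for illustration, and getcode() is left out because a data: URL carries no HTTP status.

```python
import urllib.request

# Illustrative data: URL (not from the original article); no network access is needed.
url = "data:text/plain;charset=utf-8,hello%20urllib"

with urllib.request.urlopen(url) as response:
    headers = response.info()                  # message object holding the response headers
    content_type = headers.get_content_type()
    final_url = response.geturl()              # URL of the resource that was retrieved
    body = response.read()                     # raw bytes; decode them to get text

print(content_type)
print(final_url)
print(body.decode("utf-8"))
```

For an HTTP URL the same calls apply, and getcode() additionally returns the status code such as 200.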

Example

# request: GET
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# request: POST
# HTTP testing service: http://httpbin.org/
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

# Setting a timeout
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get',timeout=1)
print(response.read())

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')

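urllib.request.ProxyHandler: setting a proxy

Two functions are needed:
urllib.request.build_opener([handler, …]) builds an opener from the given handlers, for example a ProxyHandler
urllib.request.install_opener(opener) installs that opener as the global default, so that later urlopen() calls go through it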

Example

import urllib.request

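# Assumes an HTTP proxy listening on localhost:1080. The keys are URL schemes ('http', 'https');
# a key such as 'sock5' is never matched for http:// URLs, and ProxyHandler has no built-in SOCKS support.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})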
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
a = urllib.request.urlopen("http://www.zhihu.com/").read().decode("utf8")
print(a)
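
If a global proxy is not wanted, the opener can be kept as a local object instead of being passed to install_opener(). The following is a sketch under the assumption of a hypothetical proxy at localhost:1080; it only builds and inspects the opener and sends no request.

```python
import urllib.request

# Hypothetical proxy address, used only for illustration.
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://localhost:1080",
    "https": "http://localhost:1080",
})
opener = urllib.request.build_opener(proxy_handler)

# The opener is not installed globally, so urllib.request.urlopen() is unaffected;
# only requests made through opener.open(...) would be routed via the proxy.
handler_names = [type(h).__name__ for h in opener.handlers]
print(proxy_handler.proxies)
print("ProxyHandler" in handler_names)
```

To send a request through the proxy, call opener.open(url) in place of urllib.request.urlopen(url).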

urllib.request.Request

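urllib.request.urlopen() is enough for a simple request, but in real scenarios its few parameters cannot describe a complete request.
To add headers and similar information, first build a Request object with urllib.request.Request() and then pass it to urlopen().

Syntax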

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

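Explanation

url: the URL to open; it must be a str
data: additional data to send with the request (bytes)

headers: request headers, for example to imitate a browser
Headers can also be added afterwards with add_header():

req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5;Windows NT)')

origin_req_host: the host name or IP address of the party that originated the request
unverifiable: whether the request is unverifiable; the default is False. An unverifiable request is one the user had no option to approve. For example, when the request is for an image inside an HTML document and the user had no chance to approve the automatic fetching of that image, unverifiable should be True.
method: the HTTP method to use, such as POST, GET or PUT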

Example

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Host':'httpbin.org'
}
# Build the POST form data (named form to avoid shadowing the built-in dict)
form = {
    'name':'Germey'
}
data = bytes(parse.urlencode(form),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
response = request.urlopen(req)
print(response.read())

# Alternatively, add the header after creating the Request
from urllib import request, parse
url = 'http://httpbin.org/post'
form = {
    'name':'Germey'
}
# Encode the form as bytes before passing it as data
data = bytes(parse.urlencode(form),encoding='utf8')
req = request.Request(url=url,data=data,method='POST')
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
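
A Request object can also be inspected before it is sent, which makes the effect of data, method and add_header() visible without any network traffic. This is a small sketch; the header value demo-client/0.1 is an arbitrary illustrative string.

```python
from urllib import parse, request

form = parse.urlencode({"name": "Germey"}).encode("utf-8")

get_req = request.Request("http://httpbin.org/get")
post_req = request.Request("http://httpbin.org/post", data=form)
put_req = request.Request("http://httpbin.org/put", data=form, method="PUT")

# Illustrative header value; header names are stored in capitalized form ("User-agent").
post_req.add_header("User-Agent", "demo-client/0.1")

print(get_req.get_method())                 # GET: no data and no explicit method
print(post_req.get_method())                # POST: data is present
print(put_req.get_method())                 # PUT: explicit method wins
print(post_req.get_header("User-agent"))    # demo-client/0.1
print(post_req.data)                        # b'name=Germey'
```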

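urllib.request.HTTPCookieProcessor: working with cookies

Cookies identify the user and keep login state across requests. The author has not yet worked through the detailed usage.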
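
As a starting point, one common pattern is to combine HTTPCookieProcessor with a CookieJar from http.cookiejar and send requests through the resulting opener. The sketch below is not from the original article: it starts a throwaway local server (the CookieEchoHandler class and the session=abc123 cookie are hypothetical) so that the behavior can be observed without an external site.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieEchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: sets a cookie and echoes the Cookie header it received."""

    def do_GET(self):
        body = (self.headers.get("Cookie") or "").encode("utf-8")
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), CookieEchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),           # ignore proxy settings for this local demo
    urllib.request.HTTPCookieProcessor(jar),   # store received cookies and resend them
)

with opener.open(url, timeout=5) as first:
    first.read()
stored = {cookie.name: cookie.value for cookie in jar}

with opener.open(url, timeout=5) as second:
    echoed = second.read().decode("utf-8")     # Cookie header seen by the server on request 2

server.shutdown()
server.server_close()
print(stored)
print(echoed)
```

Because the jar lives on the opener, every request made with opener.open() shares the same cookies, which is how a login session would be carried from one request to the next.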

urllib.error

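The error module is mainly used for catching exceptions.
There are two kinds of error: URLError (carries an error message) and HTTPError (carries an HTTP status code).
HTTPError is a subclass of URLError.

URLError

URLError has a single attribute, reason, so when catching it only the error message can be printed.

HTTPError

HTTPError has three attributes: code, reason and headers.

code: the HTTP status code, such as 404 or 403
reason: the error message
headers: the response headers

Example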

from urllib import request,error
try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    # Not Found
    print(e.code)
    # 404
    print(e.headers)
    '''Date: Sat, 08 Dec 2018 13:58:56 GMT
	Server: Apache
	Vary: Accept-Encoding
	Content-Length: 207
	Connection: close
	Content-Type: text/html; charset=iso-8859-1
	'''
except error.URLError as e:
    # Reached for non-HTTP failures such as a DNS error or a refused connection
    print(e.reason)
else:
    print("request successful")
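
Since HTTPError is a subclass of URLError, the order of the except clauses matters: the subclass has to be listed first, otherwise the URLError clause would catch everything. The sketch below illustrates this offline by constructing the two exceptions by hand; in real code they are raised by urlopen(), and the URL and messages here are made up.

```python
import io
from email.message import Message
from urllib import error


def classify(exc):
    """Re-raise exc and report which except clause handles it."""
    try:
        raise exc
    except error.HTTPError as e:     # subclass first: the server replied with an error status
        return f"HTTPError {e.code}: {e.reason}"
    except error.URLError as e:      # base class: no HTTP response was obtained
        return f"URLError: {e.reason}"


# Exceptions built by hand for illustration only.
not_found = error.HTTPError(
    "http://example.invalid/missing", 404, "Not Found", Message(), io.BytesIO(b"")
)
unreachable = error.URLError("connection refused")

print(issubclass(error.HTTPError, error.URLError))  # True
print(classify(not_found))                          # HTTPError 404: Not Found
print(classify(unreachable))                        # URLError: connection refused
```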

urllib.parse

parse is mainly a utility module for working with URLs.

urllib.parse.urlparse: splitting a URL

Syntax:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Explanation:

parse.urlparse() splits a URL into six components:
scheme, netloc, path, params, query, fragment

Attribute	Index	Value	Value if not present
scheme	0	URL scheme specifier	scheme parameter
netloc	1	Network location part	empty string
path	2	Hierarchical path	empty string
params	3	Parameters for last path element	empty string
query	4	Query component	empty string
fragment	5	Fragment identifier	empty string
username		User name	None
password		Password	None
hostname		Host name (lower case)	None
port		Port number as integer, if present	None

Example

>>> from urllib.parse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o   # doctest: +NORMALIZE_WHITESPACE
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
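
The example above leaves params, query and fragment empty. The following sketch uses a made-up URL that fills in every component, and also shows what the scheme argument does when the URL has no scheme of its own.

```python
from urllib.parse import urlparse

# Illustrative URL (not from the original article) that fills in every component.
parts = urlparse("https://user:pw@Example.com:8443/path/index.html;type=a?x=1&y=2#top")

print(tuple(parts))
print(parts.params, parts[3])          # attribute access and index access agree
print(parts.username, parts.password)
print(parts.hostname, parts.port)      # hostname is lower-cased, port is an int

# Without a leading "//" the host name is treated as part of the path,
# and the scheme argument is used only because the URL itself has none.
bare = urlparse("www.example.com/index.html", scheme="https")
print(bare.scheme, repr(bare.netloc), bare.path)
```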

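urllib.parse.urlunparse: assembling a URL

parse.urlunparse() joins the components of a URL back together; it is the inverse of urlparse. It expects an iterable of exactly six items.

Example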

from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))
# http://www.baidu.com/index.html;user?a=6#comment
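
Because urlunparse() is the inverse of urlparse(), the two can be combined to change a single component of an existing URL. A brief sketch, where the replacement query page=2 is an arbitrary illustrative value:

```python
from urllib.parse import urlparse, urlunparse

original = "http://www.baidu.com/index.html;user?a=6#comment"
parts = urlparse(original)

# Parsing and re-assembling yields the original URL again.
round_trip = urlunparse(parts)

# ParseResult is a named tuple, so _replace() returns a copy with one field changed.
updated = urlunparse(parts._replace(query="page=2"))

print(round_trip == original)   # True
print(updated)                  # http://www.baidu.com/index.html;user?page=2#comment
```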

urllib.parse.urljoin: joining two URLs

Syntax

urllib.parse.urljoin(base, url, allow_fragments=True)

Explanation

Construct a full (“absolute”) URL by combining a “base URL” (base) with another URL (url). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL.

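urljoin combines the base URL with the second URL to form a new absolute URL.
Components missing from the second URL are filled in from the base URL; where both supply the same component, the second URL takes precedence over the base.

Example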

from urllib.parse import urljoin

print(urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html'))
# 'http://www.cwi.nl/%7Eguido/FAQ.html'
print(urljoin('http://www.cwi.nl/%7Eguido/Python.html','//www.python.org/%7Eguido'))
# 'http://www.python.org/%7Eguido'
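
A few more cases help show which parts of the base survive. This sketch reuses the base URL from the example above; the relative references are made up for illustration.

```python
from urllib.parse import urljoin

base = "http://www.cwi.nl/%7Eguido/Python.html"

absolute_path = urljoin(base, "/docs/index.html")         # replaces the whole path
parent_dir = urljoin(base, "../FAQ.html")                 # resolved relative to the base directory
query_only = urljoin(base, "?q=1")                        # keeps the base path, adds a query
full_url = urljoin(base, "https://www.python.org/")       # an absolute URL overrides the base

print(absolute_path)   # http://www.cwi.nl/docs/index.html
print(parent_dir)      # http://www.cwi.nl/FAQ.html
print(query_only)      # http://www.cwi.nl/%7Eguido/Python.html?q=1
print(full_url)        # https://www.python.org/
```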

urllib.parse.urlencode: converting a dict into URL parameters

Example

from urllib.parse import urlencode

params = {
    "name":"zhaofan",
    "age":23,
}
base_url = "http://www.baidu.com?"

url = base_url+urlencode(params)
print(url)
# http://www.baidu.com?name=zhaofan&age=23
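
urlencode() also escapes characters that would otherwise break the query string, and with doseq=True it expands list values into repeated keys. The sketch below uses made-up parameter names; parse_qs is the standard-library function that reverses the encoding.

```python
from urllib.parse import parse_qs, urlencode

# Spaces become "+", and "&" inside a value is percent-encoded.
query = urlencode({"q": "a b&c", "page": 1})

# doseq=True turns a list value into one key=value pair per element.
multi = urlencode({"tag": ["python", "urllib"]}, doseq=True)

# For a POST request the encoded string must be converted to bytes.
post_body = urlencode({"word": "hello"}).encode("utf-8")

# parse_qs reverses the encoding; every value comes back as a list of strings.
decoded = parse_qs(query)

print(query)       # q=a+b%26c&page=1
print(multi)       # tag=python&tag=urllib
print(post_body)   # b'word=hello'
print(decoded)     # {'q': ['a b&c'], 'page': ['1']}
```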

