An In-Depth Guide to the urllib Library

慎铭

Last modified: 2022-02-17 09:23:27


First published: 2022-02-10 12:10:55

Article link: https://blog.csdn.net/qq_52828510/article/details/122832346


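What is the urllib library

urllib is Python's built-in HTTP request library, so nothing extra needs to be installed. It consists of four main modules:

urllib.request  request module
urllib.error  exception handling module
urllib.parse  URL parsing module
urllib.robotparser  robots.txt parsing module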

urllib.request

urllib.request.urlopen()

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

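- url: the URL to open.
- data: additional data to send to the server. It must be passed as a byte stream, i.e. as bytes. The default is None, in which case the request uses the GET method; if it is not None, the request uses the POST method.
- timeout: the access timeout, in seconds (s).
- cafile and capath: cafile is a file containing CA certificates and capath is a directory of CA certificate files; they are used for HTTPS. Both have been deprecated since Python 3.6 in favor of context.
- cadefault: deprecated and ignored.
- context: an ssl.SSLContext instance that specifies the SSL settings.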

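from urllib.request import urlopen

# The response body is a stream that can only be consumed once, so each call
# below continues from where the previous one stopped.
response = urlopen("https://www.baidu.com/")
print(response.read(20))        # read the first 20 bytes (bytes, not lines)
print(response.readline())      # read up to the end of the current line

lines = response.readlines()    # read the remaining lines into a list of bytes
for line in lines:
    print(line.decode("utf-8")) # decode each line as UTF-8

# response.read() with no argument would read the whole remaining body at once;
# response.read().decode("utf-8") would also decode it to str.
response.close()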
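The timeout and context parameters described above can be combined in a small helper. This is only a sketch: fetch_text is a name made up for illustration, and it assumes the server declares its charset in the Content-Type header (falling back to UTF-8 otherwise).

```python
import ssl
from urllib.request import urlopen


def fetch_text(url, timeout=5.0):
    """Fetch url and return the body as str (hypothetical helper, not part of urllib)."""
    # Default context: verifies certificates against the system CA store
    context = ssl.create_default_context()
    # The with-block closes the response even if decoding fails
    with urlopen(url, timeout=timeout, context=context) as response:
        # Prefer the charset from the Content-Type header, else assume UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Example call (needs network access):
# html = fetch_text("https://www.baidu.com/")
```

Using the response as a context manager avoids having to call close() by hand.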

Using the data parameter

import urllib.parse
import urllib.request

data = {"Word":"Hello"}

data = urllib.parse.urlencode(data).encode('utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data = data)
html = response.readlines()

for line in html:
	print(line)
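To see what the data parameter changes without sending anything, the request can be built and inspected locally. A minimal sketch; the URLs are the httpbin.org ones used above and no network access is needed:

```python
from urllib import parse, request

form = {"Word": "Hello"}
# urlencode() returns str; encode() turns it into the bytes that data requires
body = parse.urlencode(form).encode("utf-8")
print(body)                         # b'Word=Hello'

req_get = request.Request("http://httpbin.org/get")
req_post = request.Request("http://httpbin.org/post", data=body)

# Without data the request is a GET; with data it becomes a POST
print(req_get.get_method())         # GET
print(req_post.get_method())        # POST
```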

Using the timeout parameter

import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen("http://httpbin.org", timeout=0.1)
except urllib.error.URLError as e:
    # A connection timeout is wrapped in URLError; its reason is socket.timeout
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT!")
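A timeout does not always arrive wrapped in URLError: when it happens while reading the body, socket.timeout may be raised directly. The following sketch covers both cases; is_timeout is a hypothetical helper name, not something urllib provides.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc is a timeout, whether or not it is wrapped in URLError."""
    if isinstance(exc, urllib.error.URLError):
        # Connection-phase timeouts are stored in the reason attribute
        return isinstance(exc.reason, socket.timeout)
    # Read-phase timeouts can surface as a bare socket.timeout
    return isinstance(exc, socket.timeout)


# Possible usage (needs network access):
# try:
#     urllib.request.urlopen("http://httpbin.org", timeout=0.1).read()
# except OSError as e:   # URLError and socket.timeout are both OSError subclasses
#     if is_timeout(e):
#         print("TIME OUT!")
```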

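Response

Status code
When scraping pages we often need to check whether a page can be accessed normally. The getcode() method returns the status code of the response: 200 means the page is fine and 404 means the page does not exist. Note that urlopen() raises HTTPError for error statuses such as 404, so in that case the code has to be read from the exception: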

import urllib.request
import urllib.error

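try:
    response = urllib.request.urlopen("http://www.baidu.com")
    print(response.getcode())       # 200 when the page is reachable
except urllib.error.HTTPError as e:
    # response is never assigned when urlopen raises, so use the exception
    if e.code == 404:
        print(e.code)               # 404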

Response headers

import urllib.request

response = urllib.request.urlopen("http://httpbin.org")
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader("Server"))

Output:

<class 'http.client.HTTPResponse'>
200
[('Date', 'Wed, 09 Feb 2022 04:20:20 GMT'), ('Content-Type', 'text/html; charset=utf-8'), ('Content-Length', '9593'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]
gunicorn/19.9.0

The urllib.request.Request class

When scraping pages we usually need to simulate the headers (the HTTP request header information). For this we use the urllib.request.Request class:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

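- url: the URL to request.
- data: additional data to send to the server; the default is None.
- headers: the HTTP request headers, as a dictionary.
- origin_req_host: the host of the original request, as an IP address or a domain name.
- unverifiable: rarely used. It indicates whether the request is unverifiable, i.e. whether the user had no chance to approve it; the default is False.
- method: the request method, such as GET, POST, DELETE or PUT.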

from urllib import request, parse

url = "http://httpbin.org/post"
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.9 Safari/537.36"
}

data = {"name":"Germer"}
data = parse.urlencode(data).encode("utf-8")

req = request.Request(url, data=data, headers=headers, method="POST")
req.add_header("Host", "httpbin.org")       # add a request header
response = request.urlopen(req)
lines = response.readlines()

for line in lines:
    print(line.decode("utf-8"))

Output:

{

  "args": {}, 

  "data": "", 

  "files": {}, 

  "form": {

    "name": "Germer"

  }, 

  "headers": {

    "Accept-Encoding": "identity", 

    "Content-Length": "11", 

    "Content-Type": "application/x-www-form-urlencoded", 

    "Host": "httpbin.org", 

    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.9 Safari/537.36", 

    "X-Amzn-Trace-Id": "Root=1-620347ea-463c469d1cc6e37114f8f842"

  }, 

  "json": null, 

  "origin": "120.219.4.162", 

  "url": "http://httpbin.org/post"

}
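A Request object can also be inspected before it is sent, which is handy for checking what urlopen() is about to transmit. A small sketch; the User-Agent string and the /put path are made up for illustration and nothing is sent over the network:

```python
from urllib import parse, request

req = request.Request(
    "http://httpbin.org/put",
    data=parse.urlencode({"name": "Germer"}).encode("utf-8"),
    headers={"User-Agent": "example-client/0.1"},   # hypothetical UA string
    method="PUT",                                   # overrides the POST implied by data
)
req.add_header("Accept", "application/json")

print(req.get_method())                 # PUT
print(req.full_url)                     # http://httpbin.org/put
print(req.data)                         # b'name=Germer'
# Request stores header names capitalized, e.g. "User-Agent" becomes "User-agent"
print(req.get_header("User-agent"))     # example-client/0.1
print(sorted(req.header_items()))
```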

Handler

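Proxies
If we keep requesting pages of the same website from the same IP, the server may eventually block that IP. To avoid this we can send requests through a proxy IP; a proxy here simply means a proxy server. When a request goes through a proxy, the server sees the proxy's address, and if that address gets blocked we can switch to another proxy IP and keep crawling. Setting a proxy is therefore one way to keep a crawler from being blocked by anti-scraping measures.

Using a proxy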

proxy_support = urllib.request.ProxyHandler({})

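The argument is a dictionary. Each key is a proxy type, such as http, ftp or https, and each value is the proxy's IP address and port. The proxy address must be prefixed with a protocol, i.e. http or https. When the requested link uses http, ProxyHandler uses the http proxy; when it uses https, it uses the https proxy.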

import urllib.request
proxy_id = "58.240.53.196:8080"
proxy_handler = urllib.request.ProxyHandler(
    {"http": "http://" + proxy_id,
     "https": "https://" + proxy_id}
)
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
html = response.read().decode("utf-8")
print(html)

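Creating an opener
An opener can be thought of as a custom-built request tool: we can give it special headers or a specific proxy IP. The build_opener() function creates such an opener for us. Once it has been built with the proxy handler, the proxy is already configured, and we can call the opener's open() method directly to visit the link we want.

opener = urllib.request.build_opener(proxy_handler)

A plain urlopen() call does not go through this opener (unless it has been installed with urllib.request.install_opener()), so open the page with opener.open() instead.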
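As a rough sketch of such customization, the opener below gets both a proxy and a default header. The proxy address 127.0.0.1:8080 and the User-Agent string are placeholders chosen for illustration; nothing is sent until open() is called.

```python
from urllib import request

# Placeholder proxy address, assumed for illustration only
proxy_handler = request.ProxyHandler({"http": "http://127.0.0.1:8080"})
opener = request.build_opener(proxy_handler)

# addheaders is a list of (name, value) pairs added to every request from this opener
opener.addheaders = [("User-Agent", "example-crawler/0.1")]

print(proxy_handler in opener.handlers)     # True: the proxy handler is part of the chain
print(proxy_handler.proxies)                # {'http': 'http://127.0.0.1:8080'}

# Optional: make urlopen() use this opener globally from now on
# request.install_opener(opener)
```

Keeping the opener as a local object limits the proxy to the calls that need it, while install_opener() changes the behavior of every later urlopen() call in the process.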

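The following example uses an IP pool and picks a random proxy IP for each request. It assumes that all proxy IPs are listed in the file IP.txt, one per line.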

from urllib import request, error
import random
import socket

url = "http://ip.tool.chinaz.com"
proxy_iplist = []

with open("IP.txt", "r") as f:          # open for reading; "w" would truncate the file
    for line in f:
        ip = line.strip()
        if ip:                          # skip blank lines
            proxy_iplist.append(ip)

while True:
    proxy_ip = random.choice(proxy_iplist)
    proxy_handler = request.ProxyHandler(
        {
            "http": "http://" + proxy_ip,
            "https": "https://" + proxy_ip
        })
    opener = request.build_opener(proxy_handler)
    try:
        response = opener.open(url, timeout=1)
        print(response.read().decode("utf-8"))
    except error.HTTPError as e1:       # must come first: HTTPError subclasses URLError
        if e1.code == 404:
            print("404 ERROR!")
    except error.URLError as e2:
        if isinstance(e2.reason, socket.timeout):
            print("TIME OUT!")
    # Ask whether to try again with another random proxy
    flag = input("Y/N")
    if flag == 'N' or flag == 'n':
        break

Proxies that require authentication

proxy = 'username:password@58.240.53.196:8080'

Only the proxy variable needs to change: put the proxy's username and password in front of the address.
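If the username or password contains characters such as "@" or ":", they need to be percent-encoded before being placed in the proxy URL. The sketch below shows one way to do that; build_proxy_url is a hypothetical helper and the credentials are invented. As far as I know, ProxyHandler decodes the credentials again before building the Proxy-Authorization header, but treat that as an assumption to verify for your Python version.

```python
from urllib.parse import quote
from urllib.request import ProxyHandler, build_opener


def build_proxy_url(username, password, host_port, scheme="http"):
    """Build scheme://user:password@host:port with percent-encoded credentials."""
    user = quote(username, safe="")     # safe="" also encodes "/", "@" and ":"
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host_port}"


# Invented credentials; the address is the one used in the examples above
proxy = build_proxy_url("username", "p@ss:word", "58.240.53.196:8080")
print(proxy)    # http://username:p%40ss%3Aword@58.240.53.196:8080

opener = build_opener(ProxyHandler({"http": proxy}))
```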

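Cookie
We use the classes of the http.cookiejar module to work with cookies.
The CookieJar class has several subclasses: FileCookieJar, MozillaCookieJar and LWPCookieJar.

CookieJar: an object that manages HTTP cookie values, stores the cookies generated by HTTP requests and adds cookies to outgoing HTTP requests. All cookies are kept in memory, so they are lost once the CookieJar instance is garbage collected.
FileCookieJar (filename,delayload=None,policy=None): derived from CookieJar. It creates a FileCookieJar instance that can load cookies from a file and store cookies to a file. filename is the name of the file in which the cookies are stored. When delayload is True, file access is deferred: the file is read, or data is written to it, only when needed.
MozillaCookieJar (filename,delayload=None,policy=None): derived from FileCookieJar. It creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
LWPCookieJar (filename,delayload=None,policy=None): derived from FileCookieJar. It creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 file format.

Code examples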

# This snippet fetches the cookies, stores them in a CookieJar object and prints them
#===============================================================================================================================================

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)

# Print the cookies
for item in cookie:
    print(item.name + "=" + item.value)

BAIDUID=069F91E0E5A0B7E85F7FDFE97194CA18:FG=1
BIDUPSID=069F91E0E5A0B7E87C083ED4D88287F6
H_PS_PSSID=35105_31660_34584_35490_35245_35796_35316_26350_35765_35746
PSTM=1644458154
BDSVRTM=0
BD_HOME=1

# Save the received cookies to the file cookie.txt
#===============================================================================================================================================
# Without the load() method

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
filename = "cookie.txt"

cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
cookie.save()

# Load the cookies saved in cookie.txt and send them with the request
#===============================================================================================================================================
# With the load() method
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
filename = "cookie.txt"

cookie = http.cookiejar.MozillaCookieJar()
cookie.load(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)

The cookie.txt file:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	1675995432	BAIDUID	139CA77B6F46CA597186A3F1F6FCF790:FG=1
.baidu.com	TRUE	/	FALSE	3791943079	BIDUPSID	139CA77B6F46CA5980C5AB053579F5CF
.baidu.com	TRUE	/	FALSE	3791943079	PSTM	1644459432
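One detail worth knowing about save() and load(): by default they skip session cookies (those marked discard) and expired cookies. The sketch below demonstrates this offline with hand-built cookies; make_cookie, the cookie names and the domain .example.com are all made up for illustration.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, expires):
    """Build a Cookie by hand; expires=None marks it as a session cookie."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=".example.com", domain_specified=True, domain_initial_dot=True,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "cookie.txt")
    jar = MozillaCookieJar(filename)
    jar.set_cookie(make_cookie("session_id", "abc", expires=None))
    jar.set_cookie(make_cookie("theme", "dark", expires=4102444800))  # 2100-01-01 UTC

    jar.save()                              # default: session cookies are not written
    loaded = MozillaCookieJar()
    loaded.load(filename)
    default_names = sorted(c.name for c in loaded)

    jar.save(ignore_discard=True, ignore_expires=True)   # keep everything
    loaded_all = MozillaCookieJar()
    loaded_all.load(filename, ignore_discard=True, ignore_expires=True)
    all_names = sorted(c.name for c in loaded_all)

print(default_names)    # ['theme']
print(all_names)        # ['session_id', 'theme']
```

If a login session relies on session cookies, passing ignore_discard=True to both save() and load() is usually what keeps it alive across runs.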

urllib.error

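The urllib.error module defines the exception classes for exceptions raised by urllib.request. The base exception class is URLError. The two main exception classes in the module are URLError and HTTPError.

URLError is a subclass of OSError. It (or an exception derived from it) is raised when a handler runs into a problem. Its reason attribute holds the cause of the error.

HTTPError is a subclass of URLError and is used for HTTP-specific errors, such as authentication requests. Its code attribute is the HTTP status code, reason is the cause of the error, and headers holds the HTTP response headers of the request that caused the HTTPError.

Fetching a page that does not exist and handling the exception: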

import urllib.request
import urllib.error

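try:
    # httpbin.org/status/404 always answers with a 404 status
    response = urllib.request.urlopen("http://httpbin.org/status/404")
    print(response.getcode())
except urllib.error.HTTPError as e:
    # response is never assigned when urlopen raises, so read the code from e
    if e.code == 404:
        print(e.code)       # 404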
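Because HTTPError is a subclass of URLError, the more specific class has to be checked (or caught) first. The sketch below shows this with exceptions constructed by hand, so it runs without network access; describe is a hypothetical helper and the example URL is invented.

```python
import io
import urllib.error


def describe(exc):
    """Summarize an exception raised by urllib.request (illustrative helper)."""
    if isinstance(exc, urllib.error.HTTPError):     # subclass first
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    return f"other error: {exc!r}"


print(issubclass(urllib.error.HTTPError, urllib.error.URLError))    # True
print(issubclass(urllib.error.URLError, OSError))                   # True

# HTTPError(url, code, msg, hdrs, fp) built manually for demonstration
not_found = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", None, io.BytesIO())
print(describe(not_found))                                  # HTTP 404: Not Found
print(describe(urllib.error.URLError("no route to host")))  # URL error: no route to host
```

The same ordering applies to except clauses: an "except URLError" placed before "except HTTPError" would swallow every HTTPError.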

urllib.parse

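urllib.parse is used to parse URLs. The signature of urlparse is:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
urlstring is the URL as a string.
scheme is the default protocol type, used only when the URL does not specify one.
If allow_fragments is False, fragment identifiers are not recognized. Instead they are parsed as part of the path, parameters or query component, and fragment is set to the empty string in the return value.

Note: when urlstring already states a protocol, the scheme argument has no effect and the protocol in urlstring is used. If urlstring has no protocol, the scheme argument is used.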

import urllib.parse

result1 =urllib.parse.urlparse("https://www.csdn.net/?spm=1001.2101.3001.4476")
result2 = urllib.parse.urlparse("www.csdn.net/?spm=1001.2101.3001.4476",scheme = "https")
result3 = urllib.parse.urlparse("https://www.csdn.net/?spm=1001.2101.3001.4476", scheme="http")

print(result1)
print(result2)
print(result3)

ParseResult(scheme='https', netloc='www.csdn.net', path='/', params='', query='spm=1001.2101.3001.4476', fragment='')
ParseResult(scheme='https', netloc='', path='www.csdn.net/', params='', query='spm=1001.2101.3001.4476', fragment='')
ParseResult(scheme='https', netloc='www.csdn.net', path='/', params='', query='spm=1001.2101.3001.4476', fragment='')

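As the output shows, the result is a tuple of six strings: scheme, network location, path, parameters, query and fragment. In the second result the host ends up in path and netloc is empty, because urlparse only recognizes a network location when it is introduced by "//".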
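Two behaviors of urlparse are easy to trip over: a URL without a scheme needs a leading "//" for the host to land in netloc, and allow_fragments=False folds the fragment into the preceding component. A short sketch (the #top fragment is made up for illustration):

```python
from urllib.parse import urlparse

# Without "//" the host is treated as part of the path
bare = urlparse("www.csdn.net/?spm=1001.2101.3001.4476", scheme="https")
print(repr(bare.netloc), repr(bare.path))       # '' 'www.csdn.net/'

# With a leading "//" the host is recognized and the default scheme is applied
prefixed = urlparse("//www.csdn.net/?spm=1001.2101.3001.4476", scheme="https")
print(prefixed.scheme, prefixed.netloc, prefixed.path)   # https www.csdn.net /

# allow_fragments=False keeps "#top" inside the path and leaves fragment empty
no_frag = urlparse("https://www.csdn.net/index.html#top", allow_fragments=False)
print(repr(no_frag.path), repr(no_frag.fragment))        # '/index.html#top' ''
```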

We can also read the attributes directly:

from urllib.parse import urlparse

result = urlparse("https://www.runoob.com/?s=python+%E6%95%99%E7%A8%8B")
print(result.scheme)

https

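Attribute	Index	Value	Value if not present
scheme	0	URL scheme	scheme parameter
netloc	1	Network location part	empty string
path	2	Hierarchical path	empty string
params	3	Parameters for the last path element	empty string
query	4	Query component	empty string
fragment	5	Fragment identifier	empty string
username		User name	None
password		Password	None
hostname		Host name (lower case)	None
port		Port number as an integer, if present	None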
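The attributes in the table can be read by name or, for the first six, by index. A sketch with an invented URL that fills every field, plus parse_qs to turn the query string into a dict:

```python
from urllib.parse import parse_qs, urlparse

# Invented URL that exercises every attribute in the table
result = urlparse(
    "https://User:secret@WWW.Example.com:8443/docs/page;v=2?s=python&page=3#intro")

print(result[0], result.scheme)     # https https  (index 0 and attribute agree)
print(result.netloc)                # User:secret@WWW.Example.com:8443
print(result.path)                  # /docs/page
print(result.params)                # v=2
print(result.fragment)              # intro
print(result.username, result.password)     # User secret
print(result.hostname)              # www.example.com  (lower-cased)
print(result.port)                  # 8443  (an int)

# parse_qs maps each key to a list of values
print(parse_qs(result.query))       # {'s': ['python'], 'page': ['3']}
```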

urlunparse
In addition, urlunparse does the reverse and assembles the six components back into a URL:

from urllib.parse import urlunparse

data = ["http", "www.baidu.com", "index.html", "user", "a=6", "comment"]
print(urlunparse(data))

http://www.baidu.com/index.html;user?a=6#comment
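urlparse and urlunparse are inverses of each other, which makes it easy to change one component of a URL and rebuild it. A small sketch using the URL produced above; the replacement values are arbitrary:

```python
from urllib.parse import urlparse, urlunparse

url = "http://www.baidu.com/index.html;user?a=6#comment"
parts = urlparse(url)

# Round trip: parsing and unparsing gives back the same URL
print(urlunparse(parts) == url)     # True

# ParseResult is a named tuple, so _replace returns a modified copy
changed = parts._replace(scheme="https", query="a=7")
print(urlunparse(changed))          # https://www.baidu.com/index.html;user?a=7#comment
print(changed.geturl())             # same result via the geturl() method
```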

urljoin

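urljoin(base, url, allow_fragments=True)
base: the base URL
url: the URL to be combined with base into an absolute URL
allow_fragments: whether to recognize fragment identifiers

urljoin() combines base and url into a single URL. If url is already a complete URL, url takes precedence.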

from urllib import parse

url1 = parse.urljoin("https://www.baidu.com", "index.html")
url2 = parse.urljoin("https://www.baidu.com", "https://www.jianshu.com/p/20065f9b39bb")

print(url1)
print(url2)

https://www.baidu.com/index.html
https://www.jianshu.com/p/20065f9b39bb
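urljoin resolves relative links the same way a browser does, so the result depends on the form of the second argument and on whether the base ends with a slash. A sketch with made-up paths:

```python
from urllib.parse import urljoin

base = "https://www.baidu.com/a/b/c.html"    # made-up page path for illustration

print(urljoin(base, "d.html"))       # https://www.baidu.com/a/b/d.html  (sibling file)
print(urljoin(base, "../d.html"))    # https://www.baidu.com/a/d.html    (one level up)
print(urljoin(base, "/d.html"))      # https://www.baidu.com/d.html      (site root)
print(urljoin(base, "?page=2"))      # https://www.baidu.com/a/b/c.html?page=2
print(urljoin(base, "//cdn.example.com/x.js"))   # https://cdn.example.com/x.js

# A trailing slash on the base decides whether the last segment is replaced
print(urljoin("https://www.baidu.com/a/b", "c"))     # https://www.baidu.com/a/c
print(urljoin("https://www.baidu.com/a/b/", "c"))    # https://www.baidu.com/a/b/c
```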

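urlencode
GET parameters are separated by "&", whereas the items of a Python dict are separated by ",". urlencode converts a dict into "&"-separated key=value pairs that can be passed as a query string: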

from urllib import parse

data = {
    "keyword":"Python",
    "id":"3252525",
    "page":"3"
}

base_url = "http://www.example.com"
url = base_url + "?" + parse.urlencode(data)    # "?" separates the URL from the query string
print(url)

http://www.example.com?keyword=Python&id=3252525&page=3
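urlencode also takes care of escaping, which matters as soon as a value contains spaces, non-ASCII text or characters like "&". A brief sketch; the keys and values are invented:

```python
from urllib.parse import parse_qs, quote, urlencode

# Spaces become "+", non-ASCII text is percent-encoded as UTF-8
print(urlencode({"s": "python 教程"}))       # s=python+%E6%95%99%E7%A8%8B

# quote() encodes a single string and uses %20 for spaces instead
print(quote("python 教程"))                   # python%20%E6%95%99%E7%A8%8B

# Reserved characters inside a value are escaped so they cannot split the query
print(urlencode({"q": "a&b=c"}))              # q=a%26b%3Dc

# doseq=True expands a list into repeated keys
print(urlencode({"tag": ["a", "b"]}, doseq=True))     # tag=a&tag=b

# parse_qs reverses the encoding
print(parse_qs("s=python+%E6%95%99%E7%A8%8B"))        # {'s': ['python 教程']}
```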

urllib.robotparser

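urllib.robotparser is used to parse robots.txt files.

robots.txt (always lower case) is a file placed in the root directory of a website that implements the robots protocol. It is typically used to tell search engines and other crawlers the crawling rules for the site.

urllib.robotparser provides the RobotFileParser class:

class urllib.robotparser.RobotFileParser(url='')

The class provides the following methods for reading and parsing robots.txt files:

set_url(url) - sets the URL of the robots.txt file.
read() - reads the robots.txt URL and feeds it to the parser.
parse(lines) - parses the lines argument.
can_fetch(useragent, url) - returns True if useragent is allowed to fetch url according to the rules in the parsed robots.txt file.
mtime() - returns the time the robots.txt file was last fetched. This is useful for long-running web crawlers that need to check for updates to robots.txt periodically.
modified() - sets the time the robots.txt file was last fetched to the current time.
crawl_delay(useragent) - returns the value of the Crawl-delay parameter from robots.txt for the given useragent. Returns None if the parameter does not exist, does not apply to the given useragent, or its robots.txt entry has invalid syntax.
request_rate(useragent) - returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). Returns None if the parameter does not exist, does not apply to the given useragent, or its robots.txt entry has invalid syntax.
site_maps() - returns the contents of the Sitemap parameter from robots.txt as a list. Returns None if the parameter does not exist or its robots.txt entry has invalid syntax.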

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
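The same checks can be tried without network access by feeding the rules to parse() directly. The robots.txt content below is invented so that it reproduces the values from the session above; it is not the real file of that site, and site_maps() assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules, chosen to match the interactive session above
rules = """\
User-agent: *
Crawl-delay: 6
Request-rate: 3/20
Disallow: /cgi-bin/
Sitemap: http://www.musi-cal.com/sitemap.xml
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)                 # parse() takes an iterable of lines, no download needed

rate = rp.request_rate("*")
print(rate.requests, rate.seconds)      # 3 20
print(rp.crawl_delay("*"))              # 6
print(rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco"))  # False
print(rp.can_fetch("*", "http://www.musi-cal.com/"))                                   # True
print(rp.site_maps())                   # ['http://www.musi-cal.com/sitemap.xml']
```

A crawler could call can_fetch() before each request and sleep for crawl_delay() seconds between requests.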
