Python_urllib

苦涩2020

Category: Python. Tags: urllib, urllib.request, urllib.parse, urllib.error, urllib.robotparser

Article link: https://blog.csdn.net/UserPython/article/details/83188963

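Introduction

urllib is the package in the Python standard library for working with URLs and making network requests. It bundles four modules:

urllib.request: opens URLs, sends requests, and returns the responses
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses and manipulates URLs
urllib.robotparser: parses a site's robots.txt file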
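
The rest of this article does not come back to urllib.robotparser, so here is a minimal sketch of it. The rules are handed to the parser with parse() instead of being downloaded; the rule lines, the "MyCrawler" agent name, and the www.example.com URLs are all made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied inline instead of being downloaded
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_lines)

# can_fetch(useragent, url) reports whether the rules allow the fetch
allowed_public = parser.can_fetch("MyCrawler", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://www.example.com/private/data.html")

print(allowed_public)   # True
print(allowed_private)  # False
```

For a live site you would normally call set_url() and read() on the parser instead of parse().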

urllib.request

urllib.request.urlopen()

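urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Purpose: open the given URL
Parameters:
url: the location of the target resource; either a URL string or a Request object
data: the data object to send to the server (for HTTP, the POST body); defaults to None. If supplied, it must be of type bytes, which you can produce with bytes() or by calling encode("utf-8") on a string
timeout: optional; how long to wait for blocking operations such as the connection attempt, in seconds. If the server has not responded within this time, an exception is raised. If omitted, the global default timeout is used
cafile, capath, cadefault: deprecated since Python 3.6 in favor of context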
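
Since timeout and failed requests surface as exceptions, it can help to wrap urlopen() in a small helper. The sketch below is only one way to do it: fetch() is a hypothetical helper name, the data: URL stands in for a real page so that nothing touches the network, and "nosuchscheme" is a made-up scheme used to trigger URLError.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404
        print("HTTP error:", exc.code)
    except urllib.error.URLError as exc:
        # No usable response: unknown scheme, DNS failure, refused
        # connection, or a timeout while connecting
        print("URL error:", exc.reason)
    return None


# A data: URL is handled locally, so this call needs no network access
ok_body = fetch("data:text/plain;charset=utf-8,hello")
# "nosuchscheme" is a made-up scheme that no handler supports
failed_body = fetch("nosuchscheme://www.example.com/")

print(ok_body)      # b'hello'
print(failed_body)  # None
```

HTTPError is a subclass of URLError, so it has to be caught first.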

Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
>>> import urllib.request
>>> response = urllib.request.urlopen("http://www.baidu.com")
>>> print(response)
<http.client.HTTPResponse object at 0x000001FA7EE26128>

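For HTTP and HTTPS URLs, urlopen() returns an http.client.HTTPResponse object, which provides these methods:
read(), readline(), readlines(), close(): read from and close the response, as with a file object
info(): returns an HTTPMessage object holding the headers sent by the remote server
getcode(): returns the HTTP status code, for example 200 when the request succeeded. Error statuses such as 404 (not found) are raised by urlopen() as urllib.error.HTTPError rather than returned
geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect

read() returns the page content as undecoded bytes, so you normally call decode() on the result and pass the encoding the page uses.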
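
Here is a small sketch of reading and decoding a response. To keep it runnable without a network connection, a data: URL stands in for a real website (that substitution is an assumption of this example); the calls on the response object are the same ones described above.

```python
import urllib.request

# Stand-in for a real page: a data: URL that carries UTF-8 text inline
url = "data:text/plain;charset=utf-8,hello%20urllib"

with urllib.request.urlopen(url) as response:
    raw = response.read()                             # undecoded bytes
    charset = response.info().get_content_charset()  # from the Content-Type header
    final_url = response.geturl()

# Fall back to UTF-8 when the headers do not declare a charset
text = raw.decode(charset or "utf-8")
print(type(raw).__name__, charset, text)
```

Taking the charset from the response headers avoids hard-coding "utf-8" for pages that use another encoding.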

Using data in practice

import urllib.request
import urllib.parse

url = "http://home.51cto.com/index?reback=http://www.51cto.com/"
# POST data
postdata = {"username":"hhhh", "password":"fdfsfd"}
# The POST body must be bytes; one option is to convert it with bytes()
# data = bytes(urllib.parse.urlencode(postdata), encoding="utf-8")
# The other option is to convert it with encode()
data = urllib.parse.urlencode(postdata).encode("utf-8")
# Build the request
req = urllib.request.Request(url, data)
# Send it and receive the response
response = urllib.request.urlopen(req)
# Decode as UTF-8; otherwise the bytes literal is printed
print(response.read().decode("utf-8"))
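
One detail worth seeing for yourself: supplying data is what turns the request into a POST. The sketch below builds two Request objects without sending them; the www.example.com endpoint and the form fields are hypothetical.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, for illustration only
url = "http://www.example.com/login"
form = {"username": "alice", "password": "s3cret"}

body = urllib.parse.urlencode(form).encode("utf-8")

get_request = urllib.request.Request(url)
post_request = urllib.request.Request(url, data=body)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
print(post_request.data)          # b'username=alice&password=s3cret'
```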

urllib.request.Request()

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

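Parameters:
url: the URL to request
data: used the same way as the data parameter of urlopen()
headers: a dictionary of HTTP request headers. Besides passing them to Request(), you can add headers by calling add_header() on the Request instance
origin_req_host: the host name or IP address of the origin of the request
unverifiable: whether the request is unverifiable; defaults to False. An unverifiable request is one whose URL the user had no option to approve. For example, if the request is for an image in an HTML document and the user had no option to approve the automatic fetching of the image, this should be True
method: the HTTP request method, such as GET, POST, DELETE, or PUT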
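
To make the parameters concrete, the following sketch builds a Request and inspects it without sending anything. The endpoint, body, and "demo-client/1.0" agent string are invented for the example.

```python
import urllib.request

# Hypothetical REST-style endpoint, for illustration only
req = urllib.request.Request(
    "http://www.example.com/api/items/1",
    data=b'{"name": "demo"}',
    headers={"User-Agent": "demo-client/1.0"},
    method="PUT",
)

print(req.get_method())      # PUT, taken from method= rather than inferred from data
print(req.host)              # www.example.com
print(req.origin_req_host)   # www.example.com, the default when not given
print(req.unverifiable)      # False
# Request stores header names via str.capitalize(), so the key becomes "User-agent"
print(req.get_header("User-agent"))  # demo-client/1.0
```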

Using headers in practice

import urllib.request
import urllib.parse

url = "http://home.51cto.com/index?reback=http://www.51cto.com/"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
# Tells the server which page the request was linked from
referer = "https://www.google.com.hk/"
# POST data
postdata = {"username":"hhhh", "password":"fdfsfd"}
# Put user_agent and referer into the request headers
headers = {"User-Agent" : user_agent, "Referer" : referer}
# The POST body must be bytes, so convert it with bytes()
data = bytes(urllib.parse.urlencode(postdata), encoding="utf-8")
# Build the request with the headers attached
req = urllib.request.Request(url, data, headers)
# Send it and receive the response
response = urllib.request.urlopen(req)
# Decode as UTF-8; otherwise the bytes literal is printed
print(response.read().decode("utf-8"))

You can also add request headers with add_header()

import urllib.request
import urllib.parse

url = "http://home.51cto.com/index?reback=http://www.51cto.com/"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
# Tells the server which page the request was linked from
referer = "https://www.google.com.hk/"  # usually the address of the page being opened
# POST data
postdata = {"username":"hhhh", "password":"fdfsfd"}
# The POST body must be bytes, so convert it with bytes()
data = bytes(urllib.parse.urlencode(postdata), encoding="utf-8")

# Build the request, then add the headers one by one
req = urllib.request.Request(url, data)
req.add_header("User-Agent", user_agent)
req.add_header("Referer", referer)
# Send it and receive the response
response = urllib.request.urlopen(req)
# Decode as UTF-8; otherwise the bytes literal is printed
print(response.read().decode("utf-8"))

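Pay particular attention to a few headers that servers tend to check:
User-Agent: some servers and proxies use this value to decide whether the request comes from a browser
Content-Type: with a REST interface, the server checks this value to decide how to parse the HTTP body. When you call a RESTful or SOAP service, a wrong Content-Type can make the server reject the request. Common values are application/xml (for XML RPC calls, such as RESTful/SOAP calls), application/json (for JSON RPC calls), and application/x-www-form-urlencoded (for web form submissions)
Referer: servers sometimes check this value for hotlink protection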
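
As a rough sketch of the Content-Type point, here is how a JSON body could be prepared. The endpoint and payload are hypothetical and the request is only built, not sent. If you send data without setting Content-Type yourself, urllib adds application/x-www-form-urlencoded for you, which is wrong for JSON.

```python
import json
import urllib.request

# Hypothetical JSON API endpoint and payload
payload = {"title": "urllib notes", "tags": ["python", "http"]}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request("http://www.example.com/api/posts", data=body)
# Tell the server how to parse the body
req.add_header("Content-Type", "application/json; charset=utf-8")

print(req.get_method())                # POST
# Header names are stored capitalized, so look them up as "Content-type"
print(req.get_header("Content-type"))  # application/json; charset=utf-8
print(req.has_header("Content-type"))  # True
```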

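Handlers and custom openers

An opener is an instance of urllib.request.OpenerDirector. The urlopen() function we have been using so far relies on a default opener that the module builds for us. Calling urlopen() this way has limitations: for example, it cannot enable debug output, and by itself it cannot route requests through a proxy of your choice or carry cookies between requests. For those features we need to define our own handlers and opener.

First, use the relevant Handler class to create a handler object with the feature you need;
then pass the handler objects to urllib.request.build_opener() to create a custom opener object;
finally, call the custom opener's open() method to send the request; it works the same way as urlopen().

If every request in the program should use the custom opener, call urllib.request.install_opener() to make it the global opener, so that every later call to urlopen() uses it (choose whichever approach fits your needs).

urllib provides a number of handler classes, including HTTPHandler, HTTPSHandler, ProxyHandler, HTTPCookieProcessor, BaseHandler, AbstractHTTPHandler, FileHandler, and FTPHandler, which handle HTTP, HTTPS, proxies, cookies, and so on.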
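
The three steps, plus install_opener(), can be sketched as follows. The handler here turns on the debug output mentioned above; the "demo-client/1.0" agent string is invented, and data: URLs are used as stand-ins for real pages so the sketch runs without a network (which also means no debug output is printed, since that only happens for http:// requests).

```python
import urllib.request

# debuglevel=1 makes http.client print the request and response headers
# for every http:// request sent through this handler
http_handler = urllib.request.HTTPHandler(debuglevel=1)

opener = urllib.request.build_opener(http_handler)
# Headers in addheaders are sent with every request made through the opener
opener.addheaders = [("User-Agent", "demo-client/1.0")]

# Make this opener the one that urlopen() uses from now on
urllib.request.install_opener(opener)

# A data: URL is resolved locally, so both calls work without a network
via_opener = opener.open("data:text/plain,via%20opener").read()
via_urlopen = urllib.request.urlopen("data:text/plain,via%20urlopen").read()

print(via_opener)   # b'via opener'
print(via_urlopen)  # b'via urlopen'
```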

Task 1: define an opener with an HTTPHandler, send a request to a site, and read the response

import urllib.request

# Step 1: build an HTTPHandler object, which handles HTTP requests
http_handler = urllib.request.HTTPHandler()

# Step 2: call urllib.request.build_opener() to create an opener that uses the handler
opener = urllib.request.build_opener(http_handler)

# Step 3: build the Request
request = urllib.request.Request("http://www.baidu.com")

# Step 4: call the custom opener's open() method to send the request
response = opener.open(request)

# Step 5: read the server's response
print(response.read().decode("utf-8"))

This gives the same result as sending the HTTP request with urllib.request.urlopen(); the only difference is that we built the opener ourselves. The equivalent code is:

response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))

Task 2: define an opener with an HTTPCookieProcessor and print the cookies of the requested page

import urllib.request
import http.cookiejar  # this module stores and manages the page's cookies

# Step 1: build a CookieJar instance to hold the cookies
cookiejar = http.cookiejar.CookieJar()

# Step 2: create the cookie handler with HTTPCookieProcessor(), passing in the CookieJar object
cookie_handler = urllib.request.HTTPCookieProcessor(cookiejar)

# Step 3: call urllib.request.build_opener() to create an opener that handles cookies
opener = urllib.request.build_opener(cookie_handler)

# Step 4: build the Request
request = urllib.request.Request("http://www.baidu.com")

# Step 5: call the custom opener's open() method to send the request;
# the cookies set by the page are saved into cookiejar automatically
response = opener.open(request)

# Step 6: read the server's response
print(response.read().decode("utf-8"))

# Step 7: print the saved cookies
for item in cookiejar:
    print(item.name + "=" + item.value)

Cookie handling

Proxy
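
The original leaves this section empty. As a tentative sketch based on the handler list above, a proxy opener could be built like this. The proxy address 127.0.0.1:8080 is hypothetical, and nothing is sent in this example.

```python
import urllib.request

# Hypothetical local proxy; replace with a proxy you actually run
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)

# Requests made with opener.open(...) for http:// and https:// URLs would now
# be routed through the proxy. An empty mapping disables proxies entirely:
no_proxy_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

print(proxy_handler.proxies)
```

When ProxyHandler is created without an argument, it picks up the proxy settings from the environment instead.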

urllib.parse

urllib.parse.urlparse()

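urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Parses a URL into six components and returns a 6-item named tuple that matches the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each item is a string, possibly empty. The components are not broken up into smaller parts (for example, netloc stays a single string). The items of the returned tuple are:

Element	Value	Value if not present
scheme	URL scheme specifier	the scheme argument (empty string by default)
netloc	network location part	empty string
path	hierarchical path	empty string
params	parameters for the last path element	empty string
query	query component	empty string
fragment	fragment identifier	empty string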

Example:

import urllib.parse

print(urllib.parse.urlparse("https://blog.csdn.net/UserPython/article/details/83214161"))

####################
ParseResult(scheme='https', netloc='blog.csdn.net', path='/UserPython/article/details/83214161', params='', query='', fragment='')
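
The URL in the example above leaves params, query, and fragment empty. The sketch below uses a made-up URL that fills all six components, and also shows the scheme argument acting as the default.

```python
import urllib.parse

# Made-up URL that exercises every component
url = "https://www.example.com:8443/docs/page;v=2?lang=en&id=7#intro"
parts = urllib.parse.urlparse(url)

# Access by attribute name or by tuple index
print(parts.scheme, parts[0])   # https https
print(parts.netloc)             # www.example.com:8443
print(parts.path)               # /docs/page
print(parts.params)             # v=2
print(parts.query)              # lang=en&id=7
print(parts.fragment)           # intro

# netloc stays one string; hostname and port are derived attributes
print(parts.hostname, parts.port)   # www.example.com 8443

# Without a scheme in the URL, the scheme argument supplies the default
default_scheme = urllib.parse.urlparse("//www.example.com/a", scheme="http").scheme
print(default_scheme)           # http
```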

urllib.parse.parse_qs()

urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

This function parses the query component of a URL and returns a dictionary that maps each parameter name to a list of its values

import urllib.parse

print(urllib.parse.parse_qs("username=UserPython&password=passwd"))

####################
{'username': ['UserPython'], 'password': ['passwd']}
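
The values are lists because a name can appear more than once. A short sketch with a made-up URL, which also shows what keep_blank_values changes:

```python
import urllib.parse

url = "http://www.example.com/search?tag=a&tag=b&page=2&empty="
query = urllib.parse.urlparse(url).query

# Repeated names are collected into one list; blank values are dropped by default
default_result = urllib.parse.parse_qs(query)
print(default_result)
# {'tag': ['a', 'b'], 'page': ['2']}

# keep_blank_values=True keeps the parameter that has no value
with_blanks = urllib.parse.parse_qs(query, keep_blank_values=True)
print(with_blanks)
# {'tag': ['a', 'b'], 'page': ['2'], 'empty': ['']}
```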

urllib.parse.parse_qsl()

urllib.parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

This function does the same job as urllib.parse.parse_qs(); the only difference is that it returns a list of (name, value) tuples

import urllib.parse

print(urllib.parse.parse_qsl("username=UserPython&password=passwd"))

####################
[('username', 'UserPython'), ('password', 'passwd')]
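
Because the list keeps the original order and any repeated names, it round-trips through urlencode(). A small sketch with an invented query string:

```python
import urllib.parse

qs = "a=1&b=2&a=3"
pairs = urllib.parse.parse_qsl(qs)

print(pairs)                          # [('a', '1'), ('b', '2'), ('a', '3')]
print(urllib.parse.urlencode(pairs))  # a=1&b=2&a=3

# Converting to a dict keeps only the last value of a repeated name
print(dict(pairs))                    # {'a': '3', 'b': '2'}
```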

urllib.parse.urlunparse(parts)

Reassembles the tuple produced by urlparse() into a URL

import urllib.parse

parsed = urllib.parse.urlparse("https://blog.csdn.net/UserPython/article/details/83214161")

print(parsed)
# ParseResult(scheme='https', netloc='blog.csdn.net', path='/UserPython/article/details/83214161', params='', query='', fragment='')

print(urllib.parse.urlunparse(parsed))
# https://blog.csdn.net/UserPython/article/details/83214161
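
A common use is to change one component and put the URL back together. The sketch below does that with the named tuple's _replace() method, and also builds a URL from a plain 6-item tuple; the www.example.com URL is made up.

```python
import urllib.parse

parts = urllib.parse.urlparse("https://www.example.com/list?page=1#top")

# Swap the query and drop the fragment, then reassemble
next_page = urllib.parse.urlunparse(parts._replace(query="page=2", fragment=""))
print(next_page)  # https://www.example.com/list?page=2

# Any 6-item iterable works, not only a ParseResult
built = urllib.parse.urlunparse(("https", "www.example.com", "/list", "", "page=3", ""))
print(built)      # https://www.example.com/list?page=3
```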

urllib.parse.urlsplit()

urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)

This is similar to urlparse(); the only difference is that it does not split params out of the URL. The returned tuple therefore has five items: the same ones as in the urlparse() table, minus params

import urllib.parse

parsed = urllib.parse.urlsplit("https://blog.csdn.net/UserPython/article/details/83214161")

print(parsed)

#SplitResult(scheme='https', netloc='blog.csdn.net', path='/UserPython/article/details/83214161', query='', fragment='')
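
The difference only shows up when the path contains ";" parameters. A sketch with a made-up URL that has them on two path segments:

```python
import urllib.parse

url = "https://www.example.com/a;x=1/b;y=2?q=1"

parsed = urllib.parse.urlparse(url)
split = urllib.parse.urlsplit(url)

# urlparse() splits off the parameters of the last path segment only
print(parsed.path, parsed.params)  # /a;x=1/b y=2
# urlsplit() leaves every parameter inside path
print(split.path)                  # /a;x=1/b;y=2
print(len(parsed), len(split))     # 6 5
```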

urllib.parse.urlunsplit(parts)

Combines the items of the tuple returned by urlsplit() into a complete URL string; it is the counterpart of urlunparse()

import urllib.parse

parsed = urllib.parse.urlsplit("https://blog.csdn.net/UserPython/article/details/83214161")

print(parsed)
#SplitResult(scheme='https', netloc='blog.csdn.net', path='/UserPython/article/details/83214161', query='', fragment='')

print(urllib.parse.urlunsplit(parsed))
# https://blog.csdn.net/UserPython/article/details/83214161

urllib.parse.urljoin()

urllib.parse.urljoin(base, url, allow_fragments=True)

Builds a complete URL by combining a base URL with another URL; in other words, it turns a relative URL into an absolute one

import urllib.parse

print(urllib.parse.urljoin('http://www.example.com/path/file.html', 'anotherfile.html'))
print(urllib.parse.urljoin('http://www.example.com/path/', 'anotherfile.html'))
print(urllib.parse.urljoin('http://www.example.com/path/file.html', '../anotherfile.html'))
print(urllib.parse.urljoin('http://www.example.com/path/file.html', '/anotherfile.html'))

# http://www.example.com/path/anotherfile.html
# http://www.example.com/path/anotherfile.html
# http://www.example.com/anotherfile.html
# http://www.example.com/anotherfile.html

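If the URL being joined starts with "//" or with "scheme://", its host name (and, in the second case, its scheme as well) replaces that of the base URL in the result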

import urllib.parse

parsed = urllib.parse.urljoin("https://blog.csdn.net/UserPython/article/details/83214161", "//python.html")

print(parsed)
# https://python.html
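
To show both cases side by side, here is a short sketch; the cdn.example.com and www.example.org addresses are made up.

```python
import urllib.parse

base = "https://blog.csdn.net/UserPython/article/details/83214161"

# Starts with "//": keeps the base scheme, replaces host and path
same_scheme = urllib.parse.urljoin(base, "//cdn.example.com/lib.js")
print(same_scheme)   # https://cdn.example.com/lib.js

# Starts with "scheme://": replaces the base URL entirely
other_scheme = urllib.parse.urljoin(base, "http://www.example.org/index.html")
print(other_scheme)  # http://www.example.org/index.html
```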

urllib.parse.urldefrag(url)

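If the URL contains a fragment identifier, this returns the URL without the fragment, plus the fragment as a separate string; if the URL has no fragment identifier, it returns the URL unchanged and an empty string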

import urllib.parse

print(urllib.parse.urldefrag("https://blog.csdn.net/UserPython/article/#hhh"))
print(urllib.parse.urldefrag("https://blog.csdn.net/UserPython/article/"))

# DefragResult(url='https://blog.csdn.net/UserPython/article/', fragment='hhh')
# DefragResult(url='https://blog.csdn.net/UserPython/article/', fragment='')

urllib.parse.urlencode()

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

urlencode() converts a dict (or a sequence of two-item tuples) into a valid, percent-encoded query string

import urllib.parse

query_data = {"username" : "UserH", "password" : "passwd"}

print(urllib.parse.urlencode(query_data))
print(urllib.parse.urlencode(query_data).encode("utf-8"))

# username=UserH&password=passwd
# b'username=UserH&password=passwd'
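
The example above only uses plain ASCII values. The sketch below, with invented values, shows what happens to spaces and non-ASCII text, and what the doseq and quote_via parameters change.

```python
import urllib.parse

# Spaces become "+" and non-ASCII text is percent-encoded as UTF-8
basic = urllib.parse.urlencode({"q": "hello world", "kw": "中文"})
print(basic)       # q=hello+world&kw=%E4%B8%AD%E6%96%87

# doseq=True expands a list value into one name=value pair per item
multi = urllib.parse.urlencode({"tag": ["a", "b"]}, doseq=True)
print(multi)       # tag=a&tag=b

# quote_via=quote encodes spaces as %20 instead of "+"
percent20 = urllib.parse.urlencode({"q": "hello world"}, quote_via=urllib.parse.quote)
print(percent20)   # q=hello%20world
```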
