Python Web Scraping_urllib_ZHOU125disorder_

zjing125

Original article: https://blog.csdn.net/UserPython/article/details/83188963

`urllib`

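Introduction to urllib

 - urllib is the package in the Python standard library for making network requests. It consists of four modules:

urllib.request: sends requests and retrieves the response data
urllib.error: defines the exceptions raised by urllib.request
urllib.parse: parses and manipulates URLs
urllib.robotparser: parses robots.txt files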
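
Of the four modules, urllib.robotparser is not covered further in this article, so here is a minimal sketch of what it does. The robots.txt rules and the www.example.com URLs are made up for illustration, and the rules are parsed from a list in memory rather than downloaded.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied as a list of lines
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_lines)

# Ask whether a crawler with any user agent ("*") may fetch each URL
print(parser.can_fetch("*", "https://www.example.com/index.html"))
print(parser.can_fetch("*", "https://www.example.com/private/data.html"))
```

For a real site, set_url() followed by read() downloads and parses the site's own robots.txt instead.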

urllib.request

`urllib.request.urlopen()`

from urllib.request import urlopen	# import urlopen()
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

urllib.request.urlopen() opens a URL. Its main parameters are:

data: the data to send to the server in a POST request; defaults to None.
If data is supplied it must be of type bytes. Convert a string with bytes(), or by calling encode("utf-8") on it.

timeout: optional; the timeout for the request, in seconds.
If no response arrives within this time an exception is raised. If timeout is not specified, the global default timeout is used.
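
The timeout behaviour can be handled roughly as in the sketch below. The helper name fetch_text is hypothetical, and the demonstration uses a data: URL and a made-up scheme so that it runs without network access; with a real site you would pass an http or https URL.

```python
import socket
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404
        print("HTTP error:", exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # No usable answer: unknown scheme, DNS failure, refused connection, ...
        print("URL error:", exc.reason)
    except socket.timeout:
        # The connection was made, but reading the body took too long
        print("timed out after", timeout, "seconds")
    return None


# Offline demonstration: one URL that succeeds and one that fails
ok_text = fetch_text("data:text/plain;charset=utf-8,hello%20urllib", timeout=5)
bad_text = fetch_text("nosuchscheme://example.invalid/", timeout=5)
print(ok_text, bad_text)
```

HTTPError is caught before URLError because it is a subclass of URLError.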

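About urllib.request.urlopen()

import urllib.request
response = urllib.request.urlopen('https://www.baidu.com/')
print(response)
# <http.client.HTTPResponse object at 0x000001FA50B23760>

For HTTP and HTTPS URLs, urlopen() returns an http.client.HTTPResponse object, which provides these methods:

read(), readline(), readlines(), close(): read from or close the response
info(): returns an HTTPMessage object holding the headers sent by the remote server
getcode(): returns the HTTP status code; for an HTTP request, 200 means the request succeeded and 404 means the page was not found
geturl(): returns the URL of the resource that was actually retrieved

The body returned by read() is raw bytes that have not been decoded,
so read() is usually followed by decode(), called with the encoding used by the page.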
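
Instead of hard-coding the encoding, one option is to take it from the response's Content-Type header. This is a sketch, not the article's method: the helper name read_text is hypothetical, and a data: URL carrying GBK-encoded bytes stands in for a real page so that the example runs offline.

```python
import urllib.request


def read_text(response, default="utf-8"):
    """Decode the body using the charset declared in the Content-Type header."""
    charset = response.headers.get_content_charset() or default
    return response.read().decode(charset)


# The data: URL declares charset=gbk and carries four GBK-encoded bytes
with urllib.request.urlopen("data:text/plain;charset=gbk,%C4%E3%BA%C3") as response:
    print(response.geturl())
    print(response.info().get_content_type())
    text = read_text(response)

print(text)
```

If the server does not declare a charset, the helper falls back to UTF-8, which may still be wrong for some pages.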

import urllib.request
import urllib.parse
url = "https://www.baidu.com/"

# POST data
postdata = {"username": "love", "password": "123456"}
# POST data must be bytes; one option is to convert it with bytes()
# data = bytes(urllib.parse.urlencode(postdata), encoding="utf-8")

# The other option is to convert it with encode()
data = urllib.parse.urlencode(postdata).encode("utf-8")

# Build the request
req = urllib.request.Request(url, data)

# Send the request and receive the response
response = urllib.request.urlopen(req)

# Decode as UTF-8; otherwise a bytes object is printed
print(response.read().decode("utf-8"))

`urllib.request.Request()`

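urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

headers: the headers of the HTTP request, given as a dictionary. Besides passing them to Request, headers can also be added by calling add_header() on the Request instance.

origin_req_host: the host name or IP address of the party that originated the request.

unverifiable: whether the request is unverifiable; defaults to False. An unverifiable request is one whose URL the user had no option to approve.
For example, if the request is for an image in an HTML document and the user had no option to approve the automatic fetching of the image, unverifiable should be set to True.

method: the HTTP method of the request, such as GET, POST, DELETE or PUT.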
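
To see how data and method interact, a Request can be built and inspected without sending it. In this sketch the URL is a placeholder; nothing is transmitted.

```python
import urllib.parse
import urllib.request

url = "https://www.example.com/api"  # placeholder URL, never contacted
body = urllib.parse.urlencode({"username": "love"}).encode("utf-8")

get_req = urllib.request.Request(url)                            # no data
post_req = urllib.request.Request(url, data=body)                # data only
put_req = urllib.request.Request(url, data=body, method="PUT")   # explicit method

for req in (get_req, post_req, put_req):
    print(req.get_method(), req.full_url, req.data)
```

When method is not given, get_method() returns GET for a request without data and POST for a request with data.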

import urllib.request
import urllib.parse

url = "https://www.baidu.com/"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

# Tell the server which page the request was linked from
referer = "https://www.baidu.com/s?wd=%E9%9E%A0%E5%A9%A7%E7%A5%8E"

# POST data
postdata = {"username": "love", "password": "123456"}

# Put user_agent and referer into the request headers
headers = {"User-Agent": user_agent, "Referer": referer}

# POST data must be bytes, so convert it with bytes()
data = bytes(urllib.parse.urlencode(postdata), encoding="utf-8")

# Build the request with the headers attached
req = urllib.request.Request(url, data, headers)

# Send the request and receive the response
response = urllib.request.urlopen(req)

# Decode as UTF-8; otherwise a bytes object is printed
print(response.read().decode("utf-8"))

`add_header`

Adding request headers with add_header()

import urllib.request
import urllib.parse

url = "https://www.baidu.com/"
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

# Tell the server which page the request was linked from
# (usually set to the address of the page being opened)
referer = "https://www.baidu.com/s?wd=%E9%9E%A0%E5%A9%A7%E7%A5%8E"

# POST data
postdata = {"username": "love", "password": "123456"}

# POST data must be bytes, so convert it with bytes()
data = bytes(urllib.parse.urlencode(postdata), encoding="utf-8")

# Build the request, then add the headers one by one
req = urllib.request.Request(url, data)
req.add_header("User-Agent", user_agent)
req.add_header("Referer", referer)

# Send the request and receive the response
response = urllib.request.urlopen(req)

# Decode as UTF-8; otherwise a bytes object is printed
print(response.read().decode("utf-8"))
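
Headers set through the headers argument and through add_header() end up in the same place, which can be checked on the Request object before anything is sent. The URL and the "demo-agent/1.0" value below are placeholders.

```python
import urllib.request

req = urllib.request.Request(
    "https://www.example.com/",                  # placeholder URL
    headers={"User-Agent": "demo-agent/1.0"},    # placeholder value
)
req.add_header("Referer", "https://www.example.com/start")

print(req.header_items())
# Request stores header names via str.capitalize(), so the lookup key
# is "User-agent" rather than "User-Agent"
print(req.get_header("User-agent"))
print(req.has_header("Referer"))
```

The capitalisation only affects lookups on the Request object; HTTP header names themselves are case-insensitive.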

`header`

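User-Agent: some servers and proxies use this value to decide whether the request comes from a browser.

Content-Type: when a REST interface is used, the server checks this value to decide how to parse the HTTP body.
When calling a RESTful or SOAP service, a wrong Content-Type can cause the server to reject the request.
Common values are:
		application/xml (used for XML RPC calls, such as RESTful/SOAP calls), application/json (used for JSON RPC calls),
		application/x-www-form-urlencoded (used when a browser submits a web form)

Referer: servers sometimes check this value for hotlink protection.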
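
As an illustration of Content-Type, the following sketch prepares a JSON POST request. The endpoint URL is hypothetical and the request is only built and inspected, not sent.

```python
import json
import urllib.request

payload = {"username": "love", "password": "123456"}

req = urllib.request.Request(
    "https://www.example.com/api/login",         # hypothetical endpoint
    data=json.dumps(payload).encode("utf-8"),    # JSON text as bytes
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.get_method())
print(req.get_header("Content-type"))
print(json.loads(req.data))
```

For a form submission the body would instead be produced by urllib.parse.urlencode() and sent with application/x-www-form-urlencoded, which is what urllib uses by default when data is present and no Content-Type is set.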

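Handlers and custom openers

An opener is an instance of urllib.request.OpenerDirector. The urlopen() function used so far relies on a default opener that the module builds for us.
Requests made through urlopen() are somewhat limited: enabling debug output, sending requests through a proxy, or sending requests with cookies is not possible with it directly.
To do any of these, we define our own handlers and opener:

First, use the relevant Handler classes to create handler objects with the required features.
Then pass these handler objects to urllib.request.build_opener() to create a custom opener object.
Finally, call the custom opener's open() method to send the request; it works the same way as urlopen().

If every request in the program should use the custom opener, call urllib.request.install_opener() to make it the global opener,
so that every later call to urlopen() uses it (choose according to your needs).

urllib provides a number of handler classes,
such as HTTPHandler, HTTPSHandler, ProxyHandler, HTTPCookieProcessor, BaseHandler, AbstractHTTPHandler, FileHandler and FTPHandler,
   for handling HTTP, HTTPS, proxies, cookies and so on.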
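
The proxy and debug cases mentioned above might look like the following sketch. The proxy address 127.0.0.1:8888 and the "demo-agent/1.0" string are assumptions for illustration; the opener is only constructed here, and no request is sent.

```python
import urllib.request

# Hypothetical local proxy; replace with a proxy you actually run
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
})

# debuglevel=1 makes the handler print the HTTP exchange to stdout
debug_handler = urllib.request.HTTPHandler(debuglevel=1)

opener = urllib.request.build_opener(proxy_handler, debug_handler)

# Headers added here are sent with every request made through this opener
opener.addheaders = [("User-Agent", "demo-agent/1.0")]

print([type(handler).__name__ for handler in opener.handlers])

# Optional: make urlopen() use this opener everywhere
# urllib.request.install_opener(opener)
```

build_opener() still adds the default handlers for anything not supplied, which is why the printed list is longer than the two handlers passed in.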

`urllib_实战`

Create an opener with an HTTPHandler, send a request to a site and read the response

import urllib.request

# Step 1: create an HTTPHandler object, which handles HTTP requests
http_handler = urllib.request.HTTPHandler()

# Step 2: call urllib.request.build_opener() to create an opener that uses the handler
opener = urllib.request.build_opener(http_handler)

# Step 3: build the Request
request = urllib.request.Request("https://www.baidu.com")

# Step 4: call the custom opener's open() method to send the request
response = opener.open(request)

# Step 5: read the server's response
print(response.read().decode("utf-8"))

A request sent this way gives the same result as one sent with urllib.request.urlopen(); the only difference is that we built the opener ourselves. The urlopen() version is:

import urllib.request
response = urllib.request.urlopen("https://www.baidu.com")
print(response.read().decode("utf-8"))

Create an opener with an HTTPCookieProcessor and print the cookies of the requested page

import urllib.request
import http.cookiejar
# http.cookiejar is used to collect the page's cookies

# Step 1: create a CookieJar instance to hold the cookies
cookiejar = http.cookiejar.CookieJar()

# Step 2: create a cookie handler with HTTPCookieProcessor(), passing the CookieJar object
cookie_handler = urllib.request.HTTPCookieProcessor(cookiejar)

# Step 3: call urllib.request.build_opener() to create an opener that handles cookies
opener = urllib.request.build_opener(cookie_handler)

# Step 4: build the Request
request = urllib.request.Request("https://www.baidu.com")

# Step 5: call the custom opener's open() method to send the request;
# the cookies set by the page are saved into cookiejar automatically
response = opener.open(request)

# Step 6: read the server's response
print(response.read().decode("utf-8"))

# Step 7: print the saved cookies
for item in cookiejar:
    print(item.name + "=" + item.value)

Cookie handling
Proxy
urllib.parse
urllib.parse.urlparse()

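urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Parses a URL into six components and returns a 6-item named tuple that corresponds to the general structure of a URL:
	scheme://netloc/path;parameters?query#fragment
Each item is a string, possibly empty. The components are not broken up into smaller parts.
The items of the returned tuple are:

Attribute	Value						Value if not present
scheme		URL scheme (protocol)				the scheme argument (empty string by default)
netloc		network location				empty string
path		hierarchical path				empty string
params		parameters of the last path element		empty string
query		query component					empty string
fragment	fragment identifier				empty string

Example:

import urllib.parse

print(urllib.parse.urlparse("https://blog.csdn.net/ZHOU125disorder/article/details/113438283"))

ParseResult(scheme='https', netloc='blog.csdn.net', path='/ZHOU125disorder/article/details/113438283', params='', query='', fragment='')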
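
The example URL above leaves params, query and fragment empty. The sketch below uses a made-up www.example.com URL that fills all six components, and also shows that the result can be read by attribute name or by index.

```python
import urllib.parse

parsed = urllib.parse.urlparse(
    "https://www.example.com:8080/path/page;type=a?x=1&y=2#section"
)

print(parsed.scheme, parsed.netloc, parsed.path)
print(parsed.params, parsed.query, parsed.fragment)

# Convenience attributes derived from netloc
print(parsed.hostname, parsed.port)

# The result is a named tuple, so indexing works as well
print(parsed[0] == parsed.scheme)
```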

`urllib.parse.parse_qs()`

urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

This function parses the parameters in the query component of a URL and returns them as a dictionary. Each key maps to a list of values, because a parameter name can appear more than once.

import urllib.parse

print(urllib.parse.parse_qs("username=ZHOU125disorder&password=passwd"))

{'username': ['ZHOU125disorder'], 'password': ['passwd']}
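
parse_qs() returns a dictionary whose values are lists. The sketch below, using a made-up query string, shows why: repeated names are collected together, and blank values are dropped unless keep_blank_values is set.

```python
import urllib.parse

query = "tag=python&tag=urllib&page=1&debug="

default_result = urllib.parse.parse_qs(query)
with_blanks = urllib.parse.parse_qs(query, keep_blank_values=True)

print(default_result)
print(with_blanks)

# Typical use: pull the query out of a full URL first
url = "https://www.example.com/search?" + query
params = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
print(params["tag"], params["page"][0])
```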

`urllib.parse.parse_qsl()`

urllib.parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

This function does the same job as urllib.parse.parse_qs(); the only difference is that it returns a list of (name, value) pairs.

import urllib.parse

print(urllib.parse.parse_qsl("username=ZHOU125disorder&password=passwd"))

[('username', 'ZHOU125disorder'), ('password', 'passwd')]
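
Because parse_qsl() keeps the original order and any repeated names, its result can be passed straight back to urlencode(). A short sketch with a made-up query string:

```python
import urllib.parse

pairs = urllib.parse.parse_qsl("tag=python&tag=urllib&page=1")
print(pairs)

# Round trip: the list of pairs rebuilds the same query string
rebuilt = urllib.parse.urlencode(pairs)
print(rebuilt)

# Converting to a dict keeps only the last value of a repeated name
as_dict = dict(pairs)
print(as_dict)
```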

`urllib.parse.urlunparse(parts)`

Combines the six items returned by urlparse() back into a complete URL string.

import urllib.parse

parsed = urllib.parse.urlparse("https://blog.csdn.net/ZHOU125disorder/article/details/113793457")

print(parsed)
# ParseResult(scheme='https', netloc='blog.csdn.net', path='/ZHOU125disorder/article/details/113793457', params='', query='', fragment='')

print(urllib.parse.urlunparse(parsed))
# https://blog.csdn.net/ZHOU125disorder/article/details/113793457
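
urlunparse() also accepts any plain six-item sequence, and since the parse result is a named tuple, one component can be swapped with _replace() before reassembling. The www.example.com URL here is a placeholder.

```python
import urllib.parse

# Build a URL from six explicit parts:
# (scheme, netloc, path, params, query, fragment)
parts = ("https", "www.example.com", "/search", "", "q=urllib", "top")
built = urllib.parse.urlunparse(parts)
print(built)

# Change only the query and fragment of an existing URL
parsed = urllib.parse.urlparse(built)
changed = urllib.parse.urlunparse(parsed._replace(query="q=requests", fragment=""))
print(changed)
```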

`urllib.parse.urlsplit()`

urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)

This is similar to urlparse(); the only difference is that it does not split the params out of the URL.
The returned tuple therefore has five items instead of six: the same ones listed in the urlparse() table, minus params.

import urllib.parse

parsed = urllib.parse.urlsplit("https://blog.csdn.net/ZHOU125disorder/article/details/113793457")

print(parsed)

# SplitResult(scheme='https', netloc='blog.csdn.net', path='/ZHOU125disorder/article/details/113793457', query='', fragment='')
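
The difference from urlparse() only becomes visible when the URL actually contains path parameters. A sketch with a made-up URL:

```python
import urllib.parse

url = "https://www.example.com/path/page;type=a?x=1#top"

parsed = urllib.parse.urlparse(url)
split = urllib.parse.urlsplit(url)

# urlparse() separates ";type=a" into params
print(parsed.path, parsed.params)

# urlsplit() leaves it attached to the path
print(split.path)

# Six items versus five
print(len(parsed), len(split))
```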

`urllib.parse.urlunsplit(parts)`

Combines the items of the tuple returned by urlsplit() into a complete URL string; it works like urlunparse().

import urllib.parse

parsed = urllib.parse.urlsplit("https://blog.csdn.net/ZHOU125disorder/article/details/113793457")

print(parsed)
# SplitResult(scheme='https', netloc='blog.csdn.net', path='/ZHOU125disorder/article/details/113793457', query='', fragment='')

print(urllib.parse.urlunsplit(parsed))
# https://blog.csdn.net/ZHOU125disorder/article/details/113793457

`urllib.parse.urljoin()`

urllib.parse.urljoin(base, url, allow_fragments=True)

Builds an absolute URL by combining base with url, which may be relative.

import urllib.parse

print(urllib.parse.urljoin('http://www.example.com/path/file.html', 'anotherfile.html'))
print(urllib.parse.urljoin('http://www.example.com/path/', 'anotherfile.html'))
print(urllib.parse.urljoin('http://www.example.com/path/file.html', '../anotherfile.html'))
print(urllib.parse.urljoin('http://www.example.com/path/file.html', '/anotherfile.html'))

# http://www.example.com/path/anotherfile.html
# http://www.example.com/path/anotherfile.html
# http://www.example.com/anotherfile.html
# http://www.example.com/anotherfile.html

If the URL being joined starts with "//" or "scheme://", its own host name (and its scheme, if it has one) is used in the result instead of the base's.

import urllib.parse

parsed = urllib.parse.urljoin("https://blog.csdn.net/ZHOU125disorder/article/details/113793457", "//python_study.html")

print(parsed)
# https://python_study.html
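
In a crawler, urljoin() is typically used to turn the href values found on a page into absolute URLs. The page address and links below are invented for illustration.

```python
import urllib.parse

page_url = "https://www.example.com/docs/index.html"  # page the links came from
hrefs = [
    "page2.html",                   # relative to the current directory
    "/about",                       # relative to the site root
    "//cdn.example.com/app.js",     # keeps the scheme, changes the host
    "https://other.example.org/x",  # already absolute
    "#top",                         # fragment on the same page
]

absolute = [urllib.parse.urljoin(page_url, href) for href in hrefs]
for link in absolute:
    print(link)
```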

`urllib.parse.urldefrag(url)`

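If the URL contains a fragment identifier, the function returns the URL without the fragment, and the fragment as a separate string.
If the URL contains no fragment identifier, it returns the URL and an empty string.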

import urllib.parse

print(urllib.parse.urldefrag("https://blog.csdn.net/UserPython/article/#卡卡西"))
print(urllib.parse.urldefrag("https://blog.csdn.net/UserPython/article/"))

# DefragResult(url='https://blog.csdn.net/UserPython/article/', fragment='卡卡西')
# DefragResult(url='https://blog.csdn.net/UserPython/article/', fragment='')
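
One use for urldefrag() in a crawler is to avoid fetching the same page twice when links differ only in their fragment. A minimal sketch with placeholder URLs:

```python
import urllib.parse

links = [
    "https://www.example.com/page#intro",
    "https://www.example.com/page#usage",
    "https://www.example.com/other",
]

# Strip the fragment from each link and keep each page only once
unique_pages = sorted({urllib.parse.urldefrag(link).url for link in links})
print(unique_pages)
```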

`urllib.parse.urlencode()`

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

The urlencode() function converts a dict into a valid query string.

import urllib.parse

query_data = {"username": "love", "password": "123456"}

print(urllib.parse.urlencode(query_data))
print(urllib.parse.urlencode(query_data).encode("utf-8"))

# username=love&password=123456
# b'username=love&password=123456'
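
For a GET request the encoded string is appended to the URL after a "?" rather than sent as data. The sketch below uses a placeholder URL and also shows doseq, which expands a list value into repeated parameters.

```python
import urllib.parse

base = "https://www.example.com/search"  # placeholder URL
params = {"q": "python urllib", "tag": ["a", "b"]}

# doseq=True turns the list into tag=a&tag=b
query = urllib.parse.urlencode(params, doseq=True)
url = base + "?" + query
print(url)

# urlencode() uses quote_plus() by default, so spaces become "+";
# quote() encodes them as %20 instead
print(urllib.parse.quote("python urllib"))
print(urllib.parse.unquote_plus("python+urllib"))
```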

Reposted from: https://blog.csdn.net/UserPython/article/details/83188963
