5-1 Request Libraries: urllib

Article link: https://blog.csdn.net/qq_35249586/article/details/115630163

The urllib Library and URLError Exception Handling

1. The Four Modules

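request: the most basic HTTP request module, used to simulate sending requests.
error: the exception-handling module. If a request fails, you can catch the exception and retry or take other action, so the program does not terminate unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: parses a site's robots.txt file to determine which pages may be crawled and which may not. It is used less often.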
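As a rough orientation, the sketch below touches each of the four modules once without sending anything over the network. The URLs and the robots.txt rules are made up for illustration; they are not taken from a real site.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# parse: split a URL into its components and join it back together.
parts = urllib.parse.urlparse('https://www.baidu.com/s?wd=python#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)
rebuilt = urllib.parse.urlunparse(parts)
print(rebuilt)

# request: build a request object (nothing is sent until urlopen() is called).
req = urllib.request.Request('https://www.baidu.com/')
print(req.get_method(), req.host)

# error: HTTPError is a subclass of URLError, which is a subclass of OSError.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(issubclass(urllib.error.URLError, OSError))

# robotparser: feed illustrative rules directly instead of downloading robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
blocked = rp.can_fetch('*', 'https://example.com/private/page.html')
allowed = rp.can_fetch('*', 'https://example.com/index.html')
print(blocked, allowed)
```

In a real crawler you would normally call rp.set_url(...) and rp.read() to download the site's robots.txt rather than passing rules by hand.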

2. Sending Requests

The urllib.request module provides the most basic way to construct HTTP requests and simulate sending them. It can also handle authentication, redirection, browser cookies, and more.
Usage is as follows:

2.1 urlopen

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

2.1.1 The first parameter: url

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com/')
print(response.read().decode('utf-8'))
print(type(response))

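response is an object of type http.client.HTTPResponse (<class 'http.client.HTTPResponse'>). Its methods include read(), readinto(), getheader(name), getheaders() (the response headers), and fileno(). Its attributes include msg, version, status (the response status code), reason, debuglevel, and closed.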

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com/')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

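response.getheader('Server') takes the header name Server as its argument and returns the value of the Server response header, which identifies the server software.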

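2.1.2 The data parameter: the data to send when accessing the URL; when it is supplied, the request method is POST
The data argument must be a byte stream: encode the dictionary with urlencode() and convert the result with bytes().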

import urllib.parse
import urllib.request

value= {'Hello':'World'}
data= bytes(urllib.parse.urlencode(value),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())
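To make the two steps visible without contacting httpbin.org, the sketch below prints the intermediate values and checks how the presence of data changes the request method. The second dictionary entry ('q': 'a b') is an added illustration of how spaces are encoded.

```python
import urllib.parse
import urllib.request

value = {'Hello': 'World', 'q': 'a b'}

# Step 1: urlencode() turns the dict into a query string (str).
query = urllib.parse.urlencode(value)
print(query)

# Step 2: the string is encoded to bytes, which is what the data parameter expects.
data = query.encode('utf-8')
print(type(data))

# A Request without data defaults to GET; supplying data switches it to POST.
without_data = urllib.request.Request('http://httpbin.org/post')
with_data = urllib.request.Request('http://httpbin.org/post', data=data)
print(without_data.get_method())
print(with_data.get_method())
```

query.encode('utf-8') is equivalent to the bytes(..., encoding='utf-8') call used above; either form works.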

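2.1.3 The timeout parameter: defaults to socket._GLOBAL_DEFAULT_TIMEOUT
Sets the timeout in seconds. If the request receives no response within this time, an exception is raised. If the parameter is omitted, the global default timeout is used.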

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com/',timeout=1)
print(response.read())

urllib.error.URLError: <urlopen error timed out>

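A URLError exception is raised. It belongs to the urllib.error module, and the cause of the error is a timeout.
This makes it possible to skip a page that takes too long to respond, using try/except.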

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com/',timeout=0.001)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('Time out')

isinstance() checks whether e.reason is a socket.timeout, which confirms that the error was caused by a timeout.
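The skip-on-timeout idea can be wrapped in a small helper. This is a minimal sketch, not a complete retry policy: fetch_or_skip is a hypothetical name, and the second except clause is there because a timeout that occurs while waiting for the response (rather than while connecting) can surface as a bare socket.timeout instead of a URLError.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=1.0):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts while connecting are wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeouts while waiting for or reading the response may arrive unwrapped.
        return None


if __name__ == '__main__':
    for page in ['https://www.baidu.com/']:
        body = fetch_or_skip(page, timeout=1.0)
        print(page, 'skipped (timed out)' if body is None else len(body))
```

Errors other than timeouts are re-raised on purpose, so a wrong host name or a refused connection is not silently treated as a slow page.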

2.2 Request

If the request needs headers or other extra information, build it with the Request class.
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

import urllib.request

request = urllib.request.Request('https://www.baidu.com/')
print(type(request))
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
print(type(response))

Turning the request into a standalone Request object makes it possible to configure its parameters more flexibly.
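Parameters:
url: the URL to request; the only required parameter.
data: the request body, which must be bytes; a dictionary has to be encoded with urllib.parse.urlencode() first.
headers: a dictionary of request headers; headers can also be added later with add_header().
origin_req_host: the host name or IP address of the original request.
unverifiable: whether the request is unverifiable, meaning the user had no option to approve it; defaults to False.
method: a string naming the HTTP method, such as GET, POST, or PUT.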
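A short sketch of a Request built with several of these parameters at once. Nothing is sent; the object is only inspected. The form field name and the PUT variant are illustrative assumptions.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org',
}
data = urllib.parse.urlencode({'name': 'Germey'}).encode('utf-8')

# All configuration lives on the Request object; urlopen() would send it.
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
print(req.full_url)
print(req.get_method())
print(req.data)
print(req.header_items())

# method overrides the default that would otherwise be inferred from data.
put_req = urllib.request.Request(url=url, data=data, method='PUT')
print(put_req.get_method())
```

Passing req to urllib.request.urlopen(req) would then send it exactly as configured.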
2.2.2 Sending data with POST and GET

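GET: the request is made through the link itself, and the URL contains all the parameter values.
POST: the parameters are not shown in the URL; they travel in the request body. This keeps sensitive values such as a login username and password out of the address bar, although they are only protected in transit when HTTPS is used.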

import urllib.parse
import urllib.request


values = {"username": "your_username", "password": "your_password"}  # placeholder credentials
data = bytes(urllib.parse.urlencode(values),encoding='utf-8')
url = "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib.request.Request(url=url,data=data)
response = urllib.request.urlopen(request)
print(response.read())

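This code may fail, because CSDN's login form also expects a serial-number field that is not set here. The full form is fairly complex, so it is left out; the example only illustrates how a login request works. Most site logins follow this general pattern.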

[Figure: reference for passing parameters]
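For the GET case described above, the parameters are appended to the URL as a query string. A minimal sketch, assuming an illustrative endpoint (http://httpbin.org/get) and made-up parameter names:

```python
import urllib.parse
import urllib.request

params = {'name': 'germey', 'age': 22}

# GET: the encoded parameters become part of the URL after '?'.
url = 'http://httpbin.org/get?' + urllib.parse.urlencode(params)
print(url)

# No data argument, so the request method stays GET.
req = urllib.request.Request(url)
print(req.get_method())

# The query string can be parsed back into a dict of lists.
parsed = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
print(parsed)
```

Because the values are visible in the URL, this form suits search terms and paging parameters rather than credentials.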

2.3 Advanced Usage

What should you do when you need to set cookies or a proxy, for example?
2.3.1 Setting Headers

headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }

headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  ,'Referer':'http://www.zhihu.com/articles' }

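Referer: used to get past hotlink protection. The server checks whether the Referer in the headers points to its own site; if it does not, some servers will not respond. So we can also add a Referer to the headers.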

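The following headers deserve particular attention:

User-Agent: some servers or proxies use this value to decide whether the request was sent by a browser.
Content-Type: when a REST interface is used, the server checks this value to decide how to parse the content of the HTTP body. Common values:
application/xml: used for XML RPC calls, such as RESTful/SOAP calls.
application/json: used for JSON RPC calls.
application/x-www-form-urlencoded: used when a browser submits a web form.
When using a RESTful or SOAP service provided by a server, a wrong Content-Type can cause the server to refuse the request.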
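To illustrate how Content-Type and the body have to agree, the sketch below builds the same payload once as JSON and once as a form. The endpoint and payload are illustrative, and nothing is sent.

```python
import json
import urllib.parse
import urllib.request

payload = {'Hello': 'World'}
url = 'http://httpbin.org/post'

# JSON body: serialize with json.dumps() and declare application/json.
json_req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)

# Form body: urlencode() the dict and declare the form content type.
# If Content-Type is left out, urllib adds this form type itself when sending data.
form_req = urllib.request.Request(
    url,
    data=urllib.parse.urlencode(payload).encode('utf-8'),
    headers={'Content-Type': 'application/x-www-form-urlencoded'},
)

# Request stores header names capitalized as 'Content-type'.
print(json_req.get_header('Content-type'), json_req.data)
print(form_req.get_header('Content-type'), form_req.data)
```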

Use add_header() to add headers
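A minimal sketch of add_header(), reusing the User-Agent and Referer values from the dictionaries above. The request is only built and inspected here.

```python
import urllib.request

req = urllib.request.Request('https://www.baidu.com/')

# add_header(key, value) sets one header at a time on an existing Request.
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
req.add_header('Referer', 'http://www.zhihu.com/articles')

# Header names are stored capitalized ('User-agent'), so look them up in that form.
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
print(req.get_header('Referer'))
print(req.header_items())
```

This is handy when headers are decided after the Request has been created; passing a headers dictionary to the constructor has the same effect.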

2.3.2 Handler

Authentication: some sites require a username and password before a page can be viewed.
HTTPBasicAuthHandler can be used for this.
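A minimal sketch of basic authentication with HTTPBasicAuthHandler. The URL, username, and password are hypothetical placeholders; the network call is wrapped in a function so that nothing is sent when the block is run as is.

```python
import urllib.request

# Hypothetical values for illustration only.
url = 'http://localhost:5000/'
username = 'username'
password = 'password'

# The password manager maps a URL (and optional realm) to credentials.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, username, password)

# The handler answers 401 challenges using the stored credentials.
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)

# Check what the manager would supply for this URL.
print(password_mgr.find_user_password(None, url))


def fetch_protected():
    """Open the protected page; requires a server listening at url."""
    with opener.open(url) as response:
        return response.read().decode('utf-8')
```

The steps mirror the cookie example later in this section: create the helper object, wrap it in a handler, build an opener, then call open().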
Proxy
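A minimal sketch of routing requests through a proxy with ProxyHandler. The proxy address 127.0.0.1:9743 is an assumption for illustration; replace it with a proxy you actually run. The network call is again kept inside a function.

```python
import urllib.request

# Map each URL scheme to the proxy that should handle it (hypothetical address).
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
})
opener = urllib.request.build_opener(proxy_handler)
print(proxy_handler.proxies)


def fetch_via_proxy(url, timeout=5):
    """Open url through the configured proxy; requires the proxy to be running."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode('utf-8')
```

If the proxy is not reachable, opener.open() raises urllib.error.URLError, which can be handled as shown in the exception-handling section.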
Cookie

Getting a site's cookies

import urllib.request
import http.cookiejar

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com/')
for item in cookie:
    print(item.name +'='+item.value)

First create a CookieJar object, then build a handler with HTTPCookieProcessor(), then build an opener with build_opener(), and finally call open().

BAIDUID=5E61BA88BC7966CD6109C54457AC98E2:FG=1
BIDUPSID=5E61BA88BC7966CDE8B11DD53AB1F110
PSTM=1618301431
BD_NOT_HTTPS=1

Saving the retrieved cookies to a file

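To write a file, CookieJar has to be replaced with MozillaCookieJar. It is a subclass of CookieJar that handles cookie-related file operations such as reading and saving, and it stores cookies in the cookie format used by Mozilla browsers.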

import urllib.request
import http.cookiejar

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com/')
cookie.save(ignore_discard=True,ignore_expires=True)

LWPCookieJar can also read and save cookies, but its file format differs from MozillaCookieJar's: it saves cookies in the libwww-perl (LWP) format.

import urllib.request
import http.cookiejar

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com/')
cookie.save(ignore_discard=True,ignore_expires=True)

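Once the cookie file has been generated, how do you read it back and use it? Load it with the same class that saved it, because the two file formats are not interchangeable.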

LWPCookieJar

import urllib.request
import http.cookiejar


cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com/')
print(response.read().decode('utf-8'))

MozillaCookieJar

import urllib.request
import http.cookiejar


cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)
req = urllib.request.Request('https://www.baidu.com/')
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(req)
print(response.read())
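Both save examples above write to cookie.txt, so the file holds whichever format was saved last. The sketch below shows, offline and with an empty jar in a temporary directory, what happens when the loader does not match the file format.

```python
import http.cookiejar
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie.txt')

    # Save an empty jar in LWP format; the file still gets its format header.
    http.cookiejar.LWPCookieJar(path).save(ignore_discard=True, ignore_expires=True)
    with open(path, encoding='utf-8') as f:
        first_line = f.readline().strip()
    print(first_line)

    # Loading with the matching class works.
    lwp = http.cookiejar.LWPCookieJar()
    lwp.load(path, ignore_discard=True, ignore_expires=True)

    # Loading with the other class fails because the header does not match.
    try:
        http.cookiejar.MozillaCookieJar().load(
            path, ignore_discard=True, ignore_expires=True)
        mismatch_detected = False
    except http.cookiejar.LoadError as e:
        mismatch_detected = True
        print('LoadError:', e)
```

Using a different file name for each format, for example one file per jar class, avoids this mix-up.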

3. Exception Handling
3.1 URLError
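The URLError class comes from the error module of the urllib library. It inherits from OSError and is the base class of the error module's exceptions, so exceptions raised by the request module can be handled by catching it.

It has a reason attribute, which holds the cause of the error.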

from urllib import request,error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)
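Since URLError is the base class, its subclass HTTPError (raised for HTTP error status codes) is usually caught first. The sketch below is one possible arrangement; try_open is a hypothetical helper name. To stay offline, it provokes a URLError by connecting to a local port that nothing is listening on.

```python
import socket
import urllib.error
import urllib.request


def try_open(url, timeout=2.0):
    """Return (True, status) on success, or (False, reason) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, response.status
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return False, 'HTTP %s: %s' % (e.code, e.reason)
    except urllib.error.URLError as e:
        return False, e.reason


# Find a local port with nothing listening on it, so the connection fails.
probe = socket.socket()
probe.bind(('127.0.0.1', 0))
port = probe.getsockname()[1]
probe.close()

ok, detail = try_open('http://127.0.0.1:%d/' % port)
print(ok, type(detail).__name__, detail)
```

Note that reason is not always a string: for connection problems it is typically an exception object such as ConnectionRefusedError, which is why the example prints its type as well.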