Web Scraping: Using the urllib Basic Library

此生小会

Category: Web scraping. Tags: urllib

Article link: https://blog.csdn.net/cckavin/article/details/86686316

urllib comprises four modules for working with URLs: request (opening and reading URLs), error (the exceptions raised by request), parse (parsing URLs), and robotparser (parsing robots.txt files).
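As a quick illustration of the two modules that need no network access, here is a minimal sketch using parse and robotparser. The URLs, query values, and robots.txt rules are made-up examples, not taken from any real site.

```python
from urllib import parse, robotparser

# Split a URL into its components with urllib.parse.
parts = parse.urlparse('https://httpbin.org/get?name=urllib&page=1#top')
query = parse.parse_qs(parts.query)  # dict mapping each key to a list of values
print(parts.scheme, parts.netloc, parts.path, query)

# Build a query string from a dict; non-ASCII values are percent-encoded as UTF-8.
encoded = parse.urlencode({'q': '爬虫', 'page': 1})
print(encoded)

# Parse robots.txt rules from in-memory lines (hypothetical rules) instead of fetching them.
rules = robotparser.RobotFileParser()
rules.parse([
    'User-agent: *',
    'Disallow: /private/',
])
allowed = rules.can_fetch('*', 'https://example.com/index.html')
blocked = rules.can_fetch('*', 'https://example.com/private/data.html')
print(allowed, blocked)
```

In a real crawler you would normally call set_url() and read() on the RobotFileParser to download the site's actual robots.txt; parse() is used here only to keep the sketch offline.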

I. Sending requests

1. urlopen()

Use urllib.request.urlopen() to send a request:

https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen

Sending a request returns an HTTPResponse object. Its methods and attributes give access to the details of the response:

https://docs.python.org/3/library/http.client.html#httpresponse-objects

Code example:

# -*- coding:utf-8 -*-
from urllib import request, error, parse, robotparser
import socket

# GET request
url = 'https://wx.zsxq.com/dweb/#/login'  # login page of 知识星球 (Zhishi Xingqiu)
res = request.urlopen(url)  # urllib.request sends the request and returns an HTTPResponse object
web_server = res.getheader('Server')  # server software reported in the Server response header
print(web_server)   # Tengine (see http://tengine.taobao.org/)

# POST request
data = bytes(parse.urlencode({'data': '请求的数据'}), encoding='utf-8')  # urllib.parse encodes the form data
try:
    res = request.urlopen('https://httpbin.org/post', data=data, timeout=0.01)  # timeout of 0.01 s
except error.URLError as e:  # urllib.error
    if isinstance(e.reason, socket.timeout):
        print('Timed out')
except socket.timeout:  # a timeout while reading the response is not wrapped in URLError
    print('Timed out')
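To try the HTTPResponse methods and attributes without depending on an external site, the sketch below starts a throwaway local server and queries it with urlopen(). DemoHandler, its fixed response body, and the loopback address are assumptions made for illustration only.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a fixed body."""

    def do_GET(self):
        body = b'hello urllib'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(('127.0.0.1', 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Ignore any proxy configured in the environment so the loopback request goes direct.
request.install_opener(request.build_opener(request.ProxyHandler({})))

url = 'http://127.0.0.1:%d/' % server.server_port
with request.urlopen(url, timeout=5) as res:
    status = res.status                           # HTTP status code
    content_type = res.getheader('Content-Type')  # a single header value
    all_headers = res.getheaders()                # list of (name, value) tuples
    text = res.read().decode('utf-8')             # body bytes decoded to str

server.shutdown()
server.server_close()
print(status, content_type, text)
```

Using the response as a context manager closes the connection once the block exits, which is worth doing whenever the response is not read to the end.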

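2. Request

The arguments accepted by urlopen() alone cannot describe a complete request (for example, urlopen() has no parameter for request headers or the HTTP method), so urllib provides the Request object: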

https://docs.python.org/3/library/urllib.request.html#request-objects

To build a Request object, call the urllib.request.Request() constructor:

https://docs.python.org/3/library/urllib.request.html#urllib.request.Request

Code example:

# -*- coding:utf-8 -*-
from urllib import request, parse

# The arguments passed to urlopen() alone cannot describe a complete request, hence the urllib.request.Request object
url = 'https://httpbin.org/post'
data = bytes(parse.urlencode({'data': '请求数据'}), encoding='utf-8')
headers = {
    'Host': 'httpbin.org'
}
req = request.Request(url=url, data=data, headers=headers)
res = request.urlopen(req)
print(res.read().decode('utf-8'))
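A Request object can also be inspected before it is sent, which makes it easy to see what the constructor arguments do. The following is a minimal sketch; the User-Agent string and the /put endpoint are illustrative assumptions, and nothing is actually sent over the network.

```python
from urllib import parse, request

data = parse.urlencode({'data': 'request data'}).encode('utf-8')

# With data and no explicit method, the request defaults to POST.
post_req = request.Request('https://httpbin.org/post', data=data)
default_method = post_req.get_method()

# The method argument overrides that default; headers can be set up front or added later.
req = request.Request(
    url='https://httpbin.org/put',
    data=data,
    headers={'User-Agent': 'demo-crawler/0.1'},  # hypothetical UA string
    method='PUT',
)
req.add_header('Accept', 'application/json')

method = req.get_method()
user_agent = req.get_header('User-agent')  # Request stores header names capitalized, e.g. 'User-agent'
host = req.host
print(default_method, method, user_agent, host)
```

Passing req to request.urlopen() would then send it exactly as the example above does.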