Web Scraping Basics: urllib

Latest recommended article published on 2024-09-02 10:00:00

顽皮的橙子


Category: web scraping. Tags: web scraping, python

Article link: https://blog.csdn.net/demonscg/article/details/122780876


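This article walks through Python's urllib library: using the urlopen function, the API of the Request class, handling login authentication, proxies, and cookies, and the URL-parsing functions in the parse module. It also covers the Robots protocol and how to parse it, as groundwork for basic scraping tasks.

Summary generated automatically by CSDN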

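Structure of the urllib library

The urllib library consists of four modules:

request: the basic HTTP request module, used to open URLs
error: the exceptions raised by the request module
parse: utilities for splitting, joining, and encoding URLs
robotparser: a parser for robots.txt files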
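To make the division of labor concrete, here is a minimal sketch that touches each of the four modules without sending any network traffic. The example.com URLs and the robots.txt rules are made up for illustration.

```python
from urllib import error, parse, request, robotparser

# parse: split a URL into its components
parts = parse.urlparse("https://www.python.org/downloads/?lang=en")
print(parts.netloc, parts.path, parts.query)

# request: build a request object (nothing is sent until urlopen is called)
req = request.Request("https://www.python.org")
print(req.get_method())

# error: HTTPError is a subclass of URLError, so catching URLError covers both
print(issubclass(error.HTTPError, error.URLError))

# robotparser: feed robots.txt rules as lines (hypothetical rules for this sketch)
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
allowed = rp.can_fetch("*", "https://example.com/index.html")
blocked = rp.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```

In a real crawler the robots.txt rules would normally be loaded from the site itself rather than passed in as literal lines.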

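The urlopen function

urlopen sends simple requests.

API

urllib.request.urlopen(url, data=None, [timeout,] *, cafile=None, capath=None, cadefault=False, context=None)

url: the URL to request
data: the request body, which must be bytes; when this argument is set, the request is sent as POST instead of GET
timeout: timeout in seconds; a timeout while connecting is reported as a URLError whose reason is socket.timeout, while a timeout during a later read() may raise socket.timeout directly
cafile: path to a single file containing a bundle of CA certificates
capath: path to a directory of CA certificate files
cadefault: ignored, defaults to False
context: the SSL settings to use; the value must be an ssl.SSLContext instance

Note that cafile, capath, and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; pass context instead.

urlopen also accepts a Request object in place of the URL string, as described later in this article.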
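As a rough illustration of the timeout and context parameters, the sketch below wraps urlopen in a small helper. The helper name fetch is hypothetical, and the data: URL is used only so the example runs without network access; with an https URL the same context would govern certificate verification.

```python
import ssl
from urllib.request import Request, urlopen


def fetch(target, timeout=10.0):
    """Open a URL string or a Request object and return the raw body bytes."""
    # Default context: loads the system CA certificates and verifies host names.
    context = ssl.create_default_context()
    with urlopen(target, timeout=timeout, context=context) as resp:
        return resp.read()


# urlopen accepts either a plain URL string or a Request object.
body_from_str = fetch("data:text/plain;charset=utf-8,hello")
body_from_req = fetch(Request("data:text/plain;charset=utf-8,hello"))
print(body_from_str, body_from_req)
```

Creating the context explicitly is the replacement for the older cafile and capath arguments.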

Sending a GET request

from urllib.request import urlopen

url = 'https://www.python.org'
resp = urlopen(url=url)
print(resp.read().decode('utf-8'))  # read() returns bytes, so decode them manually
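Hard-coding 'utf-8' works for many pages but not all. One possible refinement, sketched below, is to read the charset from the response's Content-Type header and fall back to a default. The helper name read_text is hypothetical, and the data: URL stands in for a real page so the sketch runs offline.

```python
from urllib.request import urlopen


def read_text(url, default_charset="utf-8"):
    """Fetch a URL and decode the body using the charset the response declares."""
    # The with-block closes the response even if decoding fails.
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or default_charset
        return resp.read().decode(charset)


text = read_text("data:text/plain;charset=utf-8,hello%20urllib")
print(text)  # hello urllib
```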

Sending a POST request

from urllib.request import urlopen
from urllib.parse import urlencode

url = 'https://www.httpbin.org/post'
data = {'name': 'germey'}
# urlencode turns the dict into a query string; bytes() then encodes it to bytes
data = bytes(urlencode(data), encoding='utf-8')
# Because data is supplied, the request is sent as POST
resp = urlopen(url=url, data=data)
print(resp.read().decode('utf-8'))
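The encoding step and the GET-to-POST switch can both be inspected without sending anything. The sketch below is illustrative only; the extra city field is made up to show how spaces are encoded.

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode produces a str; the request body must be bytes
form = {"name": "germey", "city": "New York"}
body = urlencode(form).encode("utf-8")
print(body)  # b'name=germey&city=New+York'

# The HTTP method is derived from whether a body is present
get_req = Request("https://www.httpbin.org/get")
post_req = Request("https://www.httpbin.org/post", data=body)
print(get_req.get_method(), post_req.get_method())  # GET POST
```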

Handling timeouts

import socket
from urllib.request import urlopen
from urllib.error import URLError

url = 'https://www.httpbin.org/get'
try:
    resp = urlopen(url=url, timeout=0.1)  # timeout is in seconds
    html = resp.read().decode('utf-8')
    print(html)
except URLError as e:  # a connection timeout is wrapped in URLError
    if isinstance(e.reason, socket.timeout):  # check the underlying cause
        print('TIME OUT')
    else:
        raise  # not a timeout, so let the error propagate
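Because a timeout can arrive either wrapped in a URLError or as a bare socket.timeout (for example while reading the body), it can help to centralize the check. The following is a minimal sketch; the helper names is_timeout and fetch_or_none are hypothetical.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen


def is_timeout(exc):
    """Return True if exc is a timeout, either bare or wrapped in URLError."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, URLError) and isinstance(exc.reason, socket.timeout)


def fetch_or_none(url, timeout=5.0):
    """Return the body bytes, or None if the request timed out."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError as exc:  # URLError and socket.timeout both derive from OSError
        if is_timeout(exc):
            return None
        raise


# The check can be exercised with hand-built exceptions, no network needed
print(is_timeout(URLError(socket.timeout("timed out"))))
print(is_timeout(socket.timeout("timed out")))
print(is_timeout(URLError("name resolution failed")))
```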

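The Request class

The Request class lets you attach more information to a request, such as headers and the HTTP method.

API

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL to request
data: the data to send, which must be bytes
headers: the request headers, as a dict; headers can be passed through this argument or added afterwards with the Request object's add_header method
origin_req_host: the host name or IP address of the party that originated the request
unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it; defaults to False
method: the HTTP method, for example 'GET', 'POST', or 'PUT'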
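The parameters above can be explored on a Request object before anything is sent. This is a small offline sketch; the User-Agent value demo-client/0.1 is a placeholder. One detail worth knowing is that Request stores header names in capitalized form, so 'User-Agent' is looked up as 'User-agent'.

```python
from urllib.request import Request

req = Request(
    "https://www.httpbin.org/put",
    data=b"name=germey",
    headers={"User-Agent": "demo-client/0.1"},  # placeholder value
    method="PUT",
)
# Headers can also be added after construction
req.add_header("Accept", "application/json")

print(req.get_method())               # PUT
print(req.headers)                    # keys are stored via str.capitalize()
print(req.get_header("User-agent"))   # demo-client/0.1
print(req.host, req.origin_req_host)  # both default to the URL's host
```

An explicit method overrides the default of POST when data is present and GET otherwise.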

Usage

from urllib.request import Request
from urllib.request import urlopen
from urllib.parse import urlencode

url = 'https://www.httpbin.org/post'
data = bytes(urlencode({'name': 'germey'}), encoding='utf-8')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.81 Safari/537.36',
    'Host': 'www.httpbin.org',
}
req = Request(url=url, data=data, headers=headers, method='POST')
resp = urlopen(req)  # still sent with urlopen, passing the Request object as the argument
print(resp.read().decode('utf-8'))
