Python's urllib library
urllib — the URL handling library
The urllib package handles URL-related tasks and consists of four modules:
1 urllib.request — open and read URLs
2 urllib.error — exceptions raised by urllib.request
3 urllib.parse — parse URLs
4 urllib.robotparser — parse robots.txt files
(from https://docs.python.org/3.8/library/urllib.html)
This article covers only the simplest use cases; module 4 is not used.
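Before moving on to urllib.request, a quick sketch of what urllib.parse does is useful: it splits a URL into its named components. The URL below is just an illustrative example.

```python
from urllib.parse import urlparse

# Break a URL into scheme, host, path, query, and fragment
parts = urlparse("http://httpbin.org/post?x=1#top")
print(parts.scheme)    # "http"
print(parts.netloc)    # "httpbin.org"
print(parts.path)      # "/post"
print(parts.query)     # "x=1"
print(parts.fragment)  # "top"
```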
First, after creating the project and its .py file, import the required modules. urllib.error is also imported here because the exception-handling example below needs it:
import urllib.request
import urllib.parse
import urllib.error
Next, we need to understand GET and POST:
- GET requests the resource identified by the URL
- POST sends data to the server (for example, a comment)
Here is how to use each of them:
# Read a URL with a GET request
response_get = urllib.request.urlopen("http://www.baidu.com")
print(response_get.read().decode("utf-8"))
# Send a POST request; the payload must be bytes
url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({1: 2}), encoding="utf-8")
response_post = urllib.request.urlopen(url=url, data=data)
print(response_post.read().decode("utf-8"))
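What urlencode actually produces is a standard query string, and encoding it to UTF-8 bytes is what urlopen requires for a POST body. A small offline sketch (the parameter names here are made up for illustration):

```python
from urllib.parse import urlencode, parse_qs

params = {"q": "python urllib", "page": 3}
body = urlencode(params)     # spaces become "+", values are stringified
print(body)                  # "q=python+urllib&page=3"
data = body.encode("utf-8")  # urlopen() only accepts bytes for the data argument
print(parse_qs(body))        # parses back into a dict of lists
```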
# Exception handling: an absurdly short timeout forces an error
try:
    url = "http://httpbin.org/post"
    data = bytes(urllib.parse.urlencode({1: 2}), encoding="utf-8")
    response_post = urllib.request.urlopen(url=url, data=data, timeout=0.01)
    print(response_post.read().decode("utf-8"))
except urllib.error.URLError as e:
    print("time out!", e.reason)
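Note that URLError is not necessarily a timeout: DNS failures and refused connections raise it too, and HTTPError (a subclass of URLError) covers error status codes returned by the server. A sketch that distinguishes the cases; the helper name is illustrative, and the ".invalid" hostname is a reserved TLD that can never resolve, so the lookup fails without any live server:

```python
import urllib.request
import urllib.error

def fetch(url, timeout=2):
    """Return (status, detail); status is None when no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        # HTTPError subclasses URLError, so it must be caught first.
        return e.code, None
    except urllib.error.URLError as e:
        # No response at all: DNS failure, refused connection, timeout, ...
        return None, str(e.reason)

status, detail = fetch("http://example.invalid/")
print(status, detail)
```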
# Get the status code, headers, and other response metadata
url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({1: 2}), encoding="utf-8")
response_post = urllib.request.urlopen(url=url, data=data)
print(response_post.status)
print(response_post.getheaders())
print(response_post.getheader("Server"))  # returns None if the header is absent
# Disguise the request as a normal browser by setting the User-Agent header
url = "http://httpbin.org/post"
head = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33"
}
data = bytes(urllib.parse.urlencode({1: 2}), encoding="utf-8")
req = urllib.request.Request(url=url, headers=head, data=data, method='POST')
response_post = urllib.request.urlopen(req)
print(response_post.read().decode("utf-8"))
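The same Request object works for GET as well: leave data as None and append the query string to the URL instead. The target URL and parameter below are illustrative, and the actual network call is left commented out so the snippet runs offline:

```python
import urllib.parse
import urllib.request

base = "http://httpbin.org/get"
query = urllib.parse.urlencode({"keyword": "urllib"})
req = urllib.request.Request(
    url=base + "?" + query,
    headers={"User-Agent": "Mozilla/5.0"},
)
print(req.get_method())    # "GET" -- the default method when data is None
print(req.get_full_url())  # "http://httpbin.org/get?keyword=urllib"
# response = urllib.request.urlopen(req)  # performs the actual network call
```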