Python's urllib library
urllib — the URL handling library
The urllib package handles URL-related tasks and consists of four modules:
1 urllib.request — open and read URLs
2 urllib.error — exceptions raised by urllib.request
3 urllib.parse — parse URLs
4 urllib.robotparser — parse robots.txt files
(from https://docs.python.org/3.8/library/urllib.html)
This article covers only the simplest use cases; module 4 is not used.
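Before moving on to urllib.request, a quick sketch of what urllib.parse does is useful: it splits a URL into its named components. The URL below is just an illustrative example.

```python
from urllib.parse import urlparse

# Break a URL into scheme, host, path, query, and fragment
parts = urlparse("http://httpbin.org/post?x=1#top")
print(parts.scheme)    # "http"
print(parts.netloc)    # "httpbin.org"
print(parts.path)      # "/post"
print(parts.query)     # "x=1"
print(parts.fragment)  # "top"
```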
First, after creating the project and its .py file, import the required modules. urllib.error is also imported here because the exception-handling example below needs it:
import urllib.request
import urllib.parse
import urllib.error
Next, we need to understand GET and POST:
- GET requests the resource identified by the URL
- POST sends data to the server (for example, a comment)
Here is how to use each of them:
# Read a URL with a GET request
response_get = urllib.request.urlopen("http://www.baidu.com")
print(response_get.read().decode("utf-8"))
# Send a POST request; the payload must be bytes
url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({1: 2}), encoding="utf-8")
response_post = urllib.request.urlopen(url=url, data=data)
print(response_post.read().decode("utf-8"))
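What urlencode actually produces is a standard query string, and encoding it to UTF-8 bytes is what urlopen requires for a POST body. A small offline sketch (the parameter names here are made up for illustration):

```python
from urllib.parse import urlencode, parse_qs

params = {"q": "python urllib", "page": 3}
body = urlencode(params)     # spaces become "+", values are stringified
print(body)                  # "q=python+urllib&page=3"
data = body.encode("utf-8")  # urlopen() only accepts bytes for the data argument
print(parse_qs(body))        # parses back into a dict of lists
```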
# Exception handling: an absurdly short timeout forces an error
try:
    url = "http://httpbin.org/post"
    data = bytes(urllib.parse.urlencode({1: 2}), encoding="utf-8")
    response_post = urllib.request.urlopen(url=url, data=data, timeout=0.01)
    print(response_post.read().decode("utf-8"))
except urllib.error.URLError as e:
    print("time out!", e.reason)
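Note that URLError is not necessarily a timeout: DNS failures and refused connections raise it too, and HTTPError (a subclass of URLError) covers error status codes returned by the server. A sketch that distinguishes the cases; the helper name is illustrative, and the ".invalid" hostname is a reserved TLD that can never resolve, so the lookup fails without any live server:

```python
import urllib.request
import urllib.error

def fetch(url, timeout=2):
    """Return (status, detail); status is None when no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        # HTTPError subclasses URLError, so it must be caught first.
        return e.code, None
    except urllib.error.URLError as e:
        # No response at all: DNS failure, refused connection, timeout, ...
        return None, str(e.reason)

status, detail = fetch("http://example.invalid/")
print(status, detail)
```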
# Get the status code, headers, and other response metadata
url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({1: 2}), encoding="utf-8")
response_post = urllib.request.urlopen(url=url, data=data)
print(response_post.status)
print(response_post.getheaders())
print(response_post.getheader("Server"))  # returns None if the header is absent
# Disguise the request as a normal browser by setting the User-Agent header
url = "http://httpbin.org/post"
head = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33"
}
data = bytes(urllib.parse.urlencode({1: 2}), encoding="utf-8")
req = urllib.request.Request(url=url, headers=head, data=data, method='POST')
response_post = urllib.request.urlopen(req)
print(response_post.read().decode("utf-8"))
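The same Request object works for GET as well: leave data as None and append the query string to the URL instead. The target URL and parameter below are illustrative, and the actual network call is left commented out so the snippet runs offline:

```python
import urllib.parse
import urllib.request

base = "http://httpbin.org/get"
query = urllib.parse.urlencode({"keyword": "urllib"})
req = urllib.request.Request(
    url=base + "?" + query,
    headers={"User-Agent": "Mozilla/5.0"},
)
print(req.get_method())    # "GET" -- the default method when data is None
print(req.get_full_url())  # "http://httpbin.org/get?keyword=urllib"
# response = urllib.request.urlopen(req)  # performs the actual network call
```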