Python Web Scraping: Basic Usage of urllib

Latest recommended article published on 2024-03-23 16:13:03

而又何羡乎

Category: Python Web Crawlers; Tags: python

Article link: https://blog.csdn.net/qq_44285092/article/details/106369472


1. What is urllib
2. urlopen
3. Request (can mimic a browser)
4. Checking whether the response succeeded

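1. What is urllib

urllib is Python's built-in HTTP request library. It is part of the standard library, so nothing extra has to be installed once Python itself is in place. Its main modules are: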

Module	Description
urllib.request	Sends requests
urllib.error	Exception handling
urllib.parse	URL parsing
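
As a quick illustration of the urllib.parse module from the table, the sketch below splits a URL into parts and builds a query string. The URL and the query values are made up for the example, and nothing is sent over the network.

```python
from urllib import parse

# Example URL chosen for illustration only
url = "http://httpbin.org/get?word=hello&page=2"

# Split the URL into its components
parts = parse.urlparse(url)
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Turn the query string into a dict that maps each key to a list of values
query = parse.parse_qs(parts.query)
print(query)

# Build a query string from a dict; values are escaped as needed
encoded = parse.urlencode({"word": "web scraping", "page": 2})
print(encoded)

# quote() and unquote() escape and restore a single URL component
escaped = parse.quote("web scraping")
restored = parse.unquote(escaped)
print(escaped, restored)
```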

2. urlopen

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

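url: the address to request.
data (default None): data submitted along with the url (for example, form data to POST). It must be of type bytes; when it is given, the HTTP request changes from "GET" to "POST".
timeout: the timeout in seconds. With timeout=1, for example, an attempt that takes longer than one second is aborted with an exception.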
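
Because data has to be bytes, a dictionary of form fields is usually URL-encoded into a string first and then encoded to bytes. A minimal sketch of that conversion, using the field name word as example input:

```python
from urllib import parse

fields = {"word": "hello"}

# urlencode() returns a str such as "word=hello" ...
query_string = parse.urlencode(fields)

# ... which has to be converted to bytes before it is passed as data
data = query_string.encode("utf-8")

print(type(query_string).__name__, type(data).__name__)
print(data, len(data))
```

Writing bytes(query_string, encoding="utf8") gives the same result as calling .encode("utf-8").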

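Scenario 1: only the url argument, which makes this a GET request

from urllib import request
url = "http://www.baidu.com"
# Only the url argument is given
response = request.urlopen(url)
# read() returns the body of the response as bytes, so it has to be decoded
a = response.read()
# decode(encoding, "ignore"): "ignore" skips bytes that cannot be decoded
b = a.decode("utf-8", "ignore")
# In practice the steps above are usually written as one line. A response can
# only be read once, so the url is opened again here before reading.
c = request.urlopen(url).read().decode("utf-8", "ignore")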
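
The read-once behaviour of a response can be tried without a network connection. This sketch assumes a data: URL as a stand-in for a real website (urlopen accepts that scheme) and takes the charset from the response headers instead of hard-coding it; the fallback to UTF-8 is an assumption of the example.

```python
from urllib import parse, request

# A data: URL stands in for a real site so the example runs offline
url = "data:text/plain;charset=utf-8," + parse.quote("hello urllib")

with request.urlopen(url) as response:
    # Charset declared by the response, falling back to UTF-8
    charset = response.headers.get_content_charset() or "utf-8"
    first = response.read()   # the whole body, as bytes
    second = response.read()  # nothing is left to read

text = first.decode(charset, "ignore")
print(text)
print(second)
```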

Scenario 2: url and data, which makes this a POST request

from urllib import parse, request
# Build the data; note that it must be of type bytes
data = bytes(parse.urlencode({"word": "hello"}), encoding="utf8")
response = request.urlopen("http://httpbin.org/post", data=data)
print(response.read())

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6", \n    "X-Amzn-Trace-Id": "Root=1-5ecd40e7-d4beebdce8d18eef9f517b8a"\n  }, \n  "json": null, \n  "origin": "183.229.25.96", \n  "url": "http://httpbin.org/post"\n}\n'
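
The body returned by httpbin.org is JSON, so once decoded it can be parsed with the standard json module. The sketch below uses a shortened copy of the output above, stored in raw, in place of a live response.read() call.

```python
import json

# Shortened copy of the bytes printed above; in a real run this would be response.read()
raw = b'{\n  "args": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "url": "http://httpbin.org/post"\n}\n'

# Decode the bytes, then parse the JSON text into a dict
payload = json.loads(raw.decode("utf-8"))

print(payload["form"])
print(payload["url"])
```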

Scenario 3: setting a timeout

from urllib import request
url = "http://www.baidu.com"
# The url plus a timeout of one second
response = request.urlopen(url, timeout=1)
# Read and decode in one step
c = response.read().decode("utf-8", "ignore")

Scenario 4: catching errors

from urllib import request, error
url = "http://www.baidu.com"
# A deliberately tiny timeout, so that the request fails
try:
    response = request.urlopen(url, timeout=0.01)
except error.URLError as e:
    print(e)

<urlopen error timed out>
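
When catching errors it often helps to tell an HTTP error status apart from a connection problem or a timeout. The sketch below is one possible way to do that; describe_error is a hypothetical helper name, and the two exceptions are built by hand so that the example runs without a network.

```python
import io
import socket
from urllib import error


def describe_error(exc):
    """Return a short label for an exception raised by urlopen()."""
    # HTTPError is a subclass of URLError, so it has to be checked first
    if isinstance(exc, error.HTTPError):
        return f"HTTP error {exc.code}"
    if isinstance(exc, error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return "timed out"
        return f"connection problem: {exc.reason}"
    return f"unexpected: {exc!r}"


# Hand-built exceptions stand in for real failures
timeout_error = error.URLError(socket.timeout("timed out"))
not_found = error.HTTPError(
    "http://httpbin.org/status/404", 404, "Not Found", {}, io.BytesIO()
)

print(timeout_error)                  # same text as the output above
print(describe_error(timeout_error))
print(describe_error(not_found))
```

In real code the helper would be called inside the except block, for example print(describe_error(e)).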

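3. Request (can mimic a browser)

In the examples above, the argument to urlopen() was just a url. For anything more involved, such as adding HTTP headers, a Request instance has to be created and passed to urlopen(); the url to visit then becomes an argument of the Request instance.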

request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
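url: the address to request.
data: data submitted along with the url (for example, form data to POST); when it is given, the HTTP request changes from "GET" to "POST".
headers (default empty): a dictionary of the HTTP header key-value pairs to send.
method: the HTTP method to use; for example, method="POST" sends a POST request.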
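
How data and method decide the HTTP method can be checked on the Request object itself, before anything is sent. A small sketch, with the httpbin.org address and the PUT method picked only for illustration:

```python
from urllib import parse, request

url = "http://httpbin.org/post"
data = parse.urlencode({"word": "hello"}).encode("utf-8")

# No data: the request defaults to GET
plain = request.Request(url)
# With data: the request defaults to POST
with_data = request.Request(url, data=data)
# An explicit method overrides the default choice
explicit = request.Request(url, data=data, method="PUT")

for req in (plain, with_data, explicit):
    print(req.get_method(), req.full_url, req.data)
```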

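User-Agent: Sending requests to a website straight from urllib is a little abrupt. Every house has a door, and walking in as an anonymous passer-by is not polite. Some sites also dislike being visited by programs rather than people and may refuse the request. A request that presents a recognizable identity is far more likely to be accepted, so the code should carry one: the User-Agent header.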


In the examples here, headers holds just the User-Agent, written in the program as a dictionary.

Scenario 1: only the url

from urllib import request
res = request.Request("http://python.org")
response = request.urlopen(res)
result = response.read().decode("utf-8","ignore")

Scenario 2: url and headers. The headers make the request look like it comes from a browser, which is a way to get past anti-scraping checks

from urllib import request
header = {
    "User-Agent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Mobile Safari/537.36"
}
res = request.Request("http://python.org",headers=header)
response = request.urlopen(res)
result = response.read().decode("utf-8","ignore")

Scenario 3: url, headers, and data, which makes this a POST request

from urllib import parse,request
# Build the data; note that it must be of type bytes
data = bytes(parse.urlencode({"word":"hello"}),encoding = "utf8")
header = {
    "User-Agent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Mobile Safari/537.36"
}
res = request.Request("http://httpbin.org/post",data = data ,headers=header)
response = request.urlopen(res)
result = response.read().decode("utf-8","ignore")

headers can also be added with the add_header() method:

from urllib import parse,request
# Build the data; note that it must be of type bytes
data = bytes(parse.urlencode({"word":"hello"}),encoding = "utf8")
res = request.Request("http://httpbin.org/post",data = data ,method="POST")
res.add_header("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Mobile Safari/537.36"
)
response = request.urlopen(res)
result = response.read().decode("utf-8","ignore")
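
The two ways of adding headers should leave the Request in the same state, which can be checked without sending anything. The sketch below uses a shortened, made-up User-Agent string purely for readability.

```python
from urllib import request

# Shortened, hypothetical User-Agent string used only for this example
user_agent = "Mozilla/5.0 (example)"

# Way 1: pass a dictionary through the headers argument
via_dict = request.Request("http://httpbin.org/get", headers={"User-Agent": user_agent})

# Way 2: create the Request first, then call add_header()
via_method = request.Request("http://httpbin.org/get")
via_method.add_header("User-Agent", user_agent)

# Request stores the header name capitalized, as "User-agent"
print(via_dict.header_items())
print(via_method.header_items())
```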

A timeout can be added as well; it is passed to urlopen, not to Request.

4. Checking whether the response succeeded

from urllib import request
res = request.Request("http://python.org")
response = request.urlopen(res)

print(response.info())  # response headers
# Check whether the request succeeded
print(response.getcode())  # status code: 2XX 3XX 4XX 5XX
# The url that actually answered (after any redirects)
print(response.geturl())

Connection: close
Content-Length: 48882
Server: nginx
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Via: 1.1 vegur
Via: 1.1 varnish
Accept-Ranges: bytes
Date: Tue, 26 May 2020 17:23:47 GMT
Via: 1.1 varnish
Age: 1215
X-Served-By: cache-bwi5127-BWI, cache-hkg17930-HKG
X-Cache: HIT, HIT
X-Cache-Hits: 3, 1151
X-Timer: S1590513827.165994,VS0,VE0
Vary: Cookie
Strict-Transport-Security: max-age=63072000; includeSubDomains


200
https://www.python.org/

# Getting the page content
result = response.read().decode("utf-8", "ignore")
# print(result)
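
To make the status code classes (2XX, 3XX, 4XX, 5XX) concrete, the sketch below turns a numeric code into a readable label with the standard http.HTTPStatus enum. describe_status and STATUS_CLASSES are hypothetical names for this example. Note that urlopen normally follows redirects on its own and raises error.HTTPError for 4XX and 5XX responses, so getcode() on a returned response is usually a 2XX value.

```python
from http import HTTPStatus

# First digit of the status code mapped to its class
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a label such as '200 OK (success)' for a numeric status code."""
    kind = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not defined in HTTPStatus
        phrase = "Unknown"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

With a live response this would be called as describe_status(response.getcode()).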

For more usage, such as setting a proxy or cookies, see other references.