Data Crawling (2): The urllib Library for Python Crawlers Explained, with parse and request Usage

Latest recommended article published on 2024-07-20 03:52:27

Raybra

Category: Python crawlers. Tags: urllib, urllib.parse, urllib.request

Article link: https://blog.csdn.net/Byweiker/article/details/79234824

I. urllib.request: the request module

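The urllib.request module provides the most basic methods for building HTTP requests (and requests for other protocols such as FTP). With it you can simulate the way a browser issues a request and fetch a URL over different protocols. Some of its interfaces also handle basic authentication, redirections (HTTP redirects), cookies (browser cookies), and similar cases. These interfaces are provided by handler and opener objects.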
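To make the handler/opener relationship concrete, here is a minimal sketch that builds an opener with cookie support. The User-Agent string is a made-up placeholder, and no request is sent; the code only inspects the handler chain.

```python
import http.cookiejar
import urllib.request

# A CookieJar stores cookies set by servers; the processor plugs it into an opener.
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener() returns an OpenerDirector that chains the default handlers
# (redirects, HTTP errors, ...) with the handlers passed in.
opener = urllib.request.build_opener(cookie_handler)
opener.addheaders = [("User-Agent", "example-crawler/0.1")]  # hypothetical UA string

# opener.open(url) works like urlopen(url) but goes through this handler chain.
# To make it the default used by urlopen(), one could call:
# urllib.request.install_opener(opener)
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

The printed list should include HTTPCookieProcessor next to the default handlers such as HTTPRedirectHandler, which is where the redirect and cookie behaviour mentioned above comes from.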

(1) urlopen:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Parameters:

url: the URL to open.

data: the data to submit via POST; defaults to None. When data is not None, urlopen() sends the request as a POST.

timeout: the timeout, in seconds, for the request.
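The effect of data on the request method can be checked without sending anything. The sketch below uses urllib.request.Request, the object urlopen builds internally from a URL string; the URL and the form field are placeholders chosen for illustration.

```python
import urllib.parse
import urllib.request

url = "https://httpbin.org/anything"  # placeholder URL; no request is sent here

# Without data, the request is a GET.
get_req = urllib.request.Request(url)

# With data (which must be bytes), the same kind of request becomes a POST.
payload = urllib.parse.urlencode({"q": "urllib"}).encode("utf-8")
post_req = urllib.request.Request(url, data=payload)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST

# Actually sending it with a 10-second timeout would look like this:
# with urllib.request.urlopen(post_req, timeout=10) as response:
#     body = response.read()
```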

Note: calling urlopen from the urllib.request module fetches the page directly. In the example below, page is of type bytes and is converted to a string with decode. The output also shows that urlopen returns an HTTPResponse object.
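The bytes-to-string step can be sketched on its own. The helper name decode_page is hypothetical, and the sample bytes stand in for what response.read() would return; with a real response, the charset could be taken from response.headers.get_content_charset().

```python
from typing import Optional


def decode_page(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode a response body, falling back to UTF-8 when no charset is known."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")


raw = "<title>爬虫</title>".encode("utf-8")  # stand-in for response.read()
print(type(raw))          # <class 'bytes'>
print(decode_page(raw))   # <title>爬虫</title>
```

Hard-coding 'utf-8' is fine for many sites, but pages served in another encoding such as GBK need that charset passed in explicitly.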

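urlopen returns a file-like object that provides the following methods:

read(), readline(), readlines(), fileno(), close(): these work the same way as on a file object.

info(): returns an http.client.HTTPMessage object (httplib.HTTPMessage in Python 2) holding the headers sent back by the remote server; see Quick Reference to Http Headers for a list of HTTP headers.

getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found.

geturl(): returns the real URL of the retrieved page. This helps when urlopen (or an opener object) has followed a redirect, because the URL of the page retrieved is not necessarily the URL that was requested.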

import urllib.request

# Fetch the page; this needs network access.
response = urllib.request.urlopen('https://python.org/')
print("Type of response:", type(response))
print("Response object:", response)
print("Headers 1 (HTTP headers):\n", response.info())
print("Headers 2 (HTTP headers):\n", response.getheaders())
print("Value of the Server header:", response.getheader("Server"))
print("Status 1 (HTTP status):\n", response.status)
print("Status 2 (HTTP status):\n", response.getcode())
print("Response URL:\n", response.geturl())
# read() returns bytes; decode them to get a str.
page = response.read()
print("Page source:", page.decode('utf-8'))
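The example above needs network access to python.org. As a sketch that also runs offline, the same response methods can be tried against a throwaway local server. HelloHandler and its page content are invented for this demo, and proxies are disabled so the request stays on the local machine.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a small HTML page."""

    def do_GET(self):
        body = "<h1>hello</h1>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An empty ProxyHandler bypasses any proxy configured in the environment.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# The with-statement calls close() on the response automatically.
with opener.open(url, timeout=5) as response:
    status = response.status
    content_type = response.getheader("Content-Type")
    final_url = response.geturl()
    page = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(status, content_type, final_url, page)
```

Using the response as a context manager is a small design choice: it guarantees close() is called even if reading or decoding raises.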

(2) POST data:

import urllib.parse
import urllib.request

url = 'https://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
    'Referer': 'https://httpbin.org/post',
    'Connection': 'keep-alive'
}
# Simulate a form submission
form_data = {
    'name': 'MIka',
    'old': 18
}
# data must be bytes; a dict is first encoded with urllib.parse.urlencode().
data = urllib.parse.urlencode(form_data).encode('utf-8')
req = urllib.request.Request(url = url,data = data,headers &#