Python 3 Modules, Part 1: urllib and urllib.request

Author: 种子选手


Categories: web crawlers, python, python libraries. Tags: python, urlli

Article link: https://blog.csdn.net/qq_36148847/article/details/79135213


What is the urllib library?

urllib is Python's built-in library for making HTTP requests. Its high-level interface makes reading data over the web or FTP feel much like reading a local file.

It consists of the following modules:
1. urllib.request: sending requests
2. urllib.error: exception handling
3. urllib.parse: URL parsing
4. urllib.robotparser: robots.txt parsing
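
As a rough taste of two of the modules listed above, the following sketch uses urllib.parse and urllib.robotparser without touching the network. The URL and the robots.txt rules are made up for illustration and do not come from the original article.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Split a URL into its components (illustrative URL)
parts = urlparse("http://httpbin.org/get?name=Mika")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Parse robots.txt rules from memory instead of downloading them
# (hypothetical rules, for demonstration only)
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed_public = robots.can_fetch("*", "http://example.com/index.html")
allowed_private = robots.can_fetch("*", "http://example.com/private/data")
print(allowed_public, allowed_private)
```
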

Basic usage of these modules is described below, beginning with urllib.request.

urllib.request

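About urllib.request: the urllib.request module provides the most basic way to construct HTTP requests (or requests over other protocols such as FTP), so it can be used to simulate the way a browser issues a request and to fetch a URL over different protocols. Some of its interfaces also handle basic authentication, HTTP redirections, and browser cookies. These interfaces are provided by handler and opener objects.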
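
To make the handler and opener idea concrete, here is a minimal sketch that builds an opener with cookie support and lists the handlers it contains. Nothing is sent over the network; the variable names are illustrative, not part of the original article.

```python
import http.cookiejar
import urllib.request

# A CookieJar stores cookies; HTTPCookieProcessor is the handler that uses it
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener() combines the default handlers with the ones passed in
opener = urllib.request.build_opener(cookie_handler)

# Inspect which handlers the opener will consult when opening a URL
handler_names = [type(handler).__name__ for handler in opener.handlers]
print(handler_names)
```

Calling opener.open(url) would then behave like urlopen(url), except that cookies set by the server are kept in cookie_jar and sent back on later requests made through the same opener.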

1. urlopen

 urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

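Parameters:

url: the URL to open (a string or a Request object)
data: the data to submit in a POST request; defaults to None. For HTTP requests, urlopen() sends a POST when data is not None and a GET otherwise.
timeout: how long to wait for the site to respond, in seconds

Note that cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; on current versions, pass an ssl.SSLContext through context instead.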

Here is a simple request:

import urllib.request
# "from urllib import request" works as well; then call request.urlopen(...)
response = urllib.request.urlopen('http://www.baidu.com')
print("Type of the response object: ", type(response))
page = response.read()
print(page.decode('utf-8'))

Output:

Type of the response object:  <class 'http.client.HTTPResponse'>
<!DOCTYPE html>
<!--STATUS OK-->

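Explanation: the page is fetched directly with the urlopen function of the urllib.request module. The value returned by read(), page, is of type bytes, and decode() converts it to a str. The output also shows that urlopen returns an HTTPResponse object.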
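
The bytes-to-str step above hard-codes UTF-8. A slightly more defensive sketch reads the charset from the Content-Type header and falls back to UTF-8 only when none is declared. To keep it runnable offline, a data: URL stands in for a real page here; that substitution is an assumption of this sketch, and with a data: URL the response object is not an HTTPResponse, although it offers the same read() and headers interface.

```python
import urllib.parse
import urllib.request

# A data: URL stands in for a real web page so the sketch runs offline
url = "data:text/plain;charset=utf-8," + urllib.parse.quote("你好, urllib")

with urllib.request.urlopen(url) as response:
    raw = response.read()  # bytes
    # Charset declared in the Content-Type header; fall back to UTF-8 if absent
    charset = response.headers.get_content_charset() or "utf-8"
    text = raw.decode(charset)  # str

print(type(raw).__name__, type(text).__name__)
print(text)
```
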

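urlopen returns a file-like object that provides the following methods:

read(), readline(), readlines(), fileno(), close(): these are used the same way as on a file object;
info(): returns an http.client.HTTPMessage object holding the headers sent back by the remote server; see Quick Reference to Http Headers for a list of HTTP headers.
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found;
geturl(): returns the URL of the page actually retrieved. This is useful when urlopen (or an opener object) may have followed a redirect, since the URL retrieved is not necessarily the URL requested.

Since Python 3.9, info(), getcode(), and geturl() are deprecated in favor of the headers, status, and url attributes.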
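
The difference between the requested URL and the retrieved URL is easiest to see with a redirect. The sketch below starts a throwaway local server (the RedirectingHandler class and its /old and /new paths are hypothetical, invented for this example) and reads the status, final URL, and headers from the response.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /old redirects to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
requested_url = f"http://127.0.0.1:{server.server_port}/old"

# An empty ProxyHandler keeps the loopback request from being sent
# through any proxy configured in the environment
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(requested_url, timeout=5) as response:
    status = response.status  # same value as response.getcode()
    final_url = response.url  # same value as response.geturl()
    content_type = response.headers["Content-Type"]  # headers object, as from info()
    body = response.read()

server.shutdown()
server.server_close()

print(requested_url, "->", final_url, status, content_type)
```
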

Usage example:

import urllib.request
response = urllib.request.urlopen('http://python.org/')
print("Type of the response:", type(response))
print("Response object:", response)
print("Headers, method 1 (HTTP headers):\n", response.info())
print("Headers, method 2 (HTTP headers):\n", response.getheaders())
print("A single header value:", response.getheader("Server"))
print("Status, method 1 (HTTP status):\n", response.status)
print("Status, method 2 (HTTP status):\n", response.getcode())
print("Response URL:\n", response.geturl())
page = response.read()
print("Page source:", page.decode('utf-8'))

Output: omitted for brevity; run the example to see it.

About POST data

Here is a POST example:

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
    'Referer': 'http://httpbin.org/post',
    'Connection': 'keep-alive'
}
# Simulate a form submission (named form_data to avoid shadowing the built-in dict)
form_data = {
    'name': 'MIka',
    'old': 18
}
# data must be bytes; a dict is first encoded with urllib.parse.urlencode()
data = urllib.parse.urlencode(form_data).encode('utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers)
response = urllib.request.urlopen(req)
page = response.read().decode('utf-8')
print(page)

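Notes:
As the example shows, when data is not None (here it is passed through the Request object), urlopen() submits the request as a POST. urllib.parse.urlencode() converts the parameter dict to a string, which encode() then turns into bytes.
The request goes to httpbin.org, a service for testing HTTP requests. The address http://httpbin.org/post is for testing POST requests: it echoes back information about the request, including the data we sent.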
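
The rule that data switches the request method can be checked without sending anything. This small sketch, using form fields similar to the example above, shows what urlencode() produces and what method a Request object reports with and without data.

```python
import urllib.parse
import urllib.request

form_data = {"name": "MIka", "old": 18}

# urlencode() turns the dict into a query string; encode() turns it into bytes
encoded = urllib.parse.urlencode(form_data)
body = encoded.encode("utf-8")

get_request = urllib.request.Request("http://httpbin.org/post")
post_request = urllib.request.Request("http://httpbin.org/post", data=body)

print(encoded)                     # name=MIka&old=18
print(get_request.get_method())    # GET
print(post_request.get_method())   # POST
```
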

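About the timeout parameter
The timeout parameter sets a timeout in seconds: if no response arrives within that time, an exception is raised. If it is not specified, the global default timeout is used. It applies to HTTP, HTTPS, and FTP requests.

Example: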

import urllib.request
response = urllib.request.urlopen("http://httpbin.org/get",timeout=1)
print(response.read().decode("utf-8"))

Output:

  "args": {},

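When the timeout does expire, the exception needs to be caught. The sketch below is one way to do that, under stated assumptions: a local socket that accepts connections but never answers plays the role of a slow server (a hypothetical stand-in, not something from the original article), and an opener is used so that the loopback request bypasses any configured proxy. opener.open() takes the same timeout argument as urlopen().

```python
import socket
import urllib.error
import urllib.request

# Hypothetical slow server: accepts TCP connections but never sends a reply
silent_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent_server.bind(("127.0.0.1", 0))
silent_server.listen(1)
url = "http://127.0.0.1:%d/" % silent_server.getsockname()[1]

# An empty ProxyHandler sends the request directly, ignoring proxy settings
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

try:
    opener.open(url, timeout=0.5)
    outcome = "response received"
except urllib.error.URLError as error:
    # Failures while connecting or sending are wrapped in URLError
    outcome = f"request failed: {error.reason}"
except socket.timeout:
    # A timeout while waiting for the response is raised directly
    outcome = "timed out"
finally:
    silent_server.close()

print(outcome)
```

A timeout can surface either wrapped in URLError (while connecting) or as a bare socket.timeout (while waiting for the response), which is why the sketch handles both.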