Python 3 Modules in Depth: The urllib.request Submodule of urllib, a Veteran's Tool

Latest recommended article published on 2024-08-25 23:24:39

郑小源

Article link: https://blog.csdn.net/zly412934578/article/details/77773310

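The most commonly used part of the urllib package is urllib.request. I will cover it in some depth, because its usage changed after Python 3.3.0, which is something to watch out for during development: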

The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

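In other words, the official documentation positions urllib.request as a set of functions and classes that help open URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies, and more.

The definition says "mostly HTTP". The schemes covered here are HTTP, FTP, local files, and data URLs. HTTPS requests, which urllib.request also handles when Python is built with SSL support, are left for a separate article.

The urllib.request module defines the following functions: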

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

This function opens a URL. The url argument can be either a string or a Request object.
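Since url may be a Request object, here is a minimal sketch of building one without sending it. The endpoint, the form field, and the User-Agent value are made up for illustration and do not come from the original article.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form data, used only for illustration.
form = urllib.parse.urlencode({"q": "python"}).encode("ascii")
req = urllib.request.Request(
    "http://www.example.com/search",
    data=form,  # a bytes body; its presence makes this a POST
    headers={"User-Agent": "demo-client/0.1"},
)

print(req.full_url)      # the URL the request targets
print(req.get_method())  # POST because data is set, GET otherwise
print(req.get_header("User-agent"))  # header names are stored capitalized

# Passing the object to urlopen would send it (requires network access):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     body = resp.read()
```

Building the Request separately is handy when a request needs custom headers or a body, since urlopen itself only accepts url, data, and a few connection options.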

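It always returns an object that can be used as a context manager and that provides the following methods:

geturl() -- returns the URL of the resource retrieved, commonly used to determine whether a redirect was followed;

info() -- returns the meta-information of the page, such as headers;

getcode() -- returns the HTTP status code of the response;

read() -- returns the page content as bytes.

The code looks like this: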

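import urllib.request
page = urllib.request.urlopen("http://www.zhihu.com/")
print(page.info())     # response headers
print(page.geturl())   # final URL (the method name is all lowercase)
print(page.getcode())  # HTTP status code
print(page.read())     # response body as bytes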
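The same methods can be tried offline, because data URLs are among the supported schemes. The sketch below is only an illustration under that assumption: the data: URL and its payload are invented, and the with statement shows the returned object acting as a context manager.

```python
import urllib.request

# A data: URL carries its payload inline, so this runs without a network.
url = "data:text/plain;charset=utf-8,hello%20urllib"

with urllib.request.urlopen(url) as resp:  # closed automatically on exit
    final_url = resp.geturl()              # URL of the resource retrieved
    content_type = resp.info().get_content_type()
    body = resp.read()                     # bytes, not str

print(final_url)
print(content_type)
print(body.decode("utf-8"))
```

For a real HTTP URL the calls are the same, and getcode() additionally reports the status code.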

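urllib.request.install_opener(opener) and urllib.request.build_opener([handler, …])

These two functions are mainly used to set up a proxy. Little needs to be said about why: anyone who writes crawlers ends up using proxy IPs sooner or later. The steps and code are as follows.

The usual steps are: (1) prepare the proxy IP or the request headers;

(2) wrap the proxy IP or request headers with urllib.request.build_opener();

(3) install the opener globally with urllib.request.install_opener();

(4) access the page with urlopen().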

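import urllib.request
# ProxyHandler maps the scheme of the requested URL to a proxy URL.
# A 'sock5' key is never matched for http:// requests, so the key must be 'http';
# an HTTP proxy listening on localhost:1080 is assumed here.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)  # make this opener the global default

a = urllib.request.urlopen("http://www.zhihu.com/").read().decode("utf8")
print(a)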
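Step (1) also mentions request headers. As a rough sketch of that variant, the opener's addheaders list can carry headers for every request. The proxy address and the User-Agent string below are placeholders, not values from the original article.

```python
import urllib.request

# Placeholder proxy address and User-Agent string; replace with real values.
proxy_handler = urllib.request.ProxyHandler({"http": "http://localhost:1080"})
opener = urllib.request.build_opener(proxy_handler)

# Headers listed here are added to requests made through this opener,
# unless a request already sets the same header itself.
opener.addheaders = [("User-Agent", "Mozilla/5.0 (demo)")]

# After this call, plain urlopen() goes through the opener above.
urllib.request.install_opener(opener)

has_proxy = any(
    isinstance(h, urllib.request.ProxyHandler) for h in opener.handlers
)
print(has_proxy)
```

If the proxy should apply to a single call only, opener.open(url) can be used directly instead of installing the opener globally.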

urllib.request.pathname2url(path) and urllib.request.url2pathname(path)

I have not studied these two functions thoroughly yet, so I will not cover them here.

urllib.request.getproxies()

This helper function returns a dictionary that maps each scheme to its proxy server URL.
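A small sketch of what that mapping looks like. It assumes the proxy is configured through the http_proxy environment variable, and the address is invented for the demonstration.

```python
import os
import urllib.request

# Assumed proxy address, set only for this demonstration.
os.environ["http_proxy"] = "http://localhost:3128"

# Keys are schemes such as "http"; values are proxy URLs.
proxies = urllib.request.getproxies()
print(proxies.get("http"))
```

The returned dictionary has the same shape that ProxyHandler accepts, so the two can be combined.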

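The following functions were ported from Python 2. According to the official documentation, they might become deprecated at some point in the future: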

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

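This is probably the function you will use most when writing crawlers, especially when scraping web pages. It copies a network object denoted by a URL to a local file, and the second argument specifies where that local file is saved: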

import re
import urllib.request

def getImage(html):
    # Match src attributes that point at .jpg files
    imgre = re.compile(r'src="(.+?\.jpg)"')
    html = html.decode('utf-8')  # bytes from read() must be decoded in Python 3
    imglist = imgre.findall(html)
    for x, imgurl in enumerate(imglist):
        # Raw string keeps the backslashes in the Windows path literal
        urllib.request.urlretrieve(imgurl, r'H:\picture\%s.jpg' % x)
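The reporthook parameter in the signature is not used above. As a hedged illustration, the sketch below passes a callback that records progress. The data: URL stands in for a remote image so that it runs offline, and the report function and file names are hypothetical.

```python
import os
import tempfile
import urllib.request

progress = []

def report(block_num, block_size, total_size):
    # Called once when the transfer starts and once after each block read.
    progress.append((block_num, block_size, total_size))

# A data: URL stands in for a remote image so the sketch runs offline.
url = "data:text/plain,sample%20payload"
target = os.path.join(tempfile.mkdtemp(), "sample.txt")

saved_path, headers = urllib.request.urlretrieve(url, target, reporthook=report)

with open(saved_path, "rb") as f:
    content = f.read()

print(saved_path)
print(len(content), progress[0])
```

urlretrieve returns the local file name together with the response headers, so the caller does not need to track the path separately.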

urllib.request.urlcleanup()

This function cleans up temporary files that may have been left behind by earlier calls to urlretrieve().
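A minimal sketch of that behavior, assuming urlretrieve is called without a filename so that it creates a temporary file. The data: URL is again an offline stand-in rather than anything from the original article.

```python
import os
import urllib.request

# With no filename, urlretrieve stores the data in a temporary file.
tmp_path, _ = urllib.request.urlretrieve("data:text/plain,temporary")
existed_before = os.path.exists(tmp_path)

urllib.request.urlcleanup()  # removes temporary files left by urlretrieve
exists_after = os.path.exists(tmp_path)

print(existed_before, exists_after)
```

Files saved to an explicit filename are not touched by urlcleanup(), only the temporary ones.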

Those are the commonly used functions of this submodule. I will find time later to go through its classes.