A First Look at the urlretrieve() Function in Python's urllib Module


Together_CZ


Article link: https://blog.csdn.net/Together_CZ/article/details/56480579


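While experimenting with Python's urllib library today, I came across the urlretrieve() function and found it interesting, probably because I had never used such a compact function before. The urllib module makes it easy to fetch the HTML of a web page, and the regular expressions in the re module can then extract the information you need. When you need to download data or other material from the web, however, you normally have to write dedicated code to save the files.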

urlretrieve() can download a file from the internet and save it directly to a local path. Its usage is briefly introduced below.

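In the Python shell (Python 2, where urlretrieve() lives at the top level of the urllib module), enter:

import urllib
help(urllib.urlretrieve)

This prints the basic usage information for the function: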

Help on function urlretrieve in module urllib:

urlretrieve(url, filename=None, reporthook=None, data=None)
(END)
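As a minimal sketch of the call itself, assuming Python 3 (where the function lives in `urllib.request` rather than at the top level of `urllib`), the following downloads one URL to a chosen local path. To keep it runnable without network access, the URL is a `file://` URL pointing at a temporary file; `fetch_to` and the file names are hypothetical and chosen only for illustration.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch_to(url, target_path):
    """Download `url` to `target_path` and return the path that was written."""
    saved_path, _headers = urlretrieve(url, target_path)
    return saved_path


# Build a stand-in "remote" file so the example runs offline.
work_dir = tempfile.mkdtemp()
source_file = Path(work_dir) / "source.html"
source_file.write_text("<html><body>hello</body></html>", encoding="utf-8")

saved = fetch_to(source_file.as_uri(), os.path.join(work_dir, "copy.html"))
print(saved)
```

With a real web page, the same call would take an `http://` or `https://` URL in place of the `file://` one.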

The meaning of each parameter is described below, based on the official Python documentation for the urllib module:

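The url parameter is the address to download from.
The filename parameter specifies the local path to save to (if it is omitted, urllib creates a temporary file to hold the data).
The reporthook parameter is a callback function. It is triggered once when the connection to the server is established and again each time a block of data has been transferred, so it can be used to display the current download progress.
The data parameter is the data to POST to the server.
The function returns a two-element tuple (filename, headers): filename is the local path the data was saved to, and headers holds the server's response headers.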
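The reporthook arguments and the return value can be observed with a short sketch. It assumes Python 3 (`urllib.request.urlretrieve`) and, to stay offline, downloads from a `file://` URL built from a temporary file; `record_progress`, `calls`, and the file names are hypothetical. The `data` parameter is not exercised here.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

calls = []  # every (block_count, block_size, total_size) the hook receives


def record_progress(block_count, block_size, total_size):
    """reporthook: called once when the transfer starts, then once per block."""
    calls.append((block_count, block_size, total_size))


# A 20000-byte stand-in for a remote file.
work_dir = tempfile.mkdtemp()
source_file = Path(work_dir) / "data.bin"
source_file.write_bytes(b"x" * 20000)

target = os.path.join(work_dir, "data_copy.bin")
filename, headers = urlretrieve(
    source_file.as_uri(), target, reporthook=record_progress
)

print(filename)                   # the local path that was written
print(headers["Content-Length"])  # headers can be read like a mapping
print(calls[0], calls[-1])        # first and last progress reports
```

The first report has a block count of 0, which is why a progress callback typically prints 0% before any data arrives.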

The following example program, borrowed from a Python web-scraping course, shows urlretrieve() in use:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
from bs4 import BeautifulSoup

try:  # Python 3
    from urllib.request import urlopen, urlretrieve
except ImportError:  # Python 2, as used when this post was written
    from urllib import urlopen, urlretrieve

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    """Turn a src value into an absolute URL; return None for other sites."""
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        # Strip the "www." prefix, then add the scheme.
        url = "http://" + source[4:]
    else:
        # Relative path: resolve it against the site root.
        url = baseUrl + "/" + source
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    """Map an absolute URL to a local path, creating directories as needed."""
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)

    if not os.path.exists(directory):
        os.makedirs(directory)

    return path

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
# Collect every tag that has a src attribute (scripts, images, iframes, ...).
downloadList = bsObj.findAll(src=True)
print('_' * 96)
print(downloadList)
print(len(downloadList))

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print('*' * 80)
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

The output is as follows (the separator lines printed by the program are omitted):

[<script src="http://www.pythonscraping.com/misc/jquery.js?v=1.4.4" type="text/javascript"></script>, <script src="http://www.pythonscraping.com/misc/jquery.once.js?v=1.2" type="text/javascript"></script>, <script src="http://www.pythonscraping.com/misc/drupal.js?nhx1dd" type="text/javascript"></script>, <script src="http://www.pythonscraping.com/sites/all/themes/skeletontheme/js/jquery.mobilemenu.js?nhx1dd" type="text/javascript"></script>, <script src="http://www.pythonscraping.com/sites/all/modules/google_analytics/googleanalytics.js?nhx1dd" type="text/javascript"></script>, <img alt="Home" src="http://www.pythonscraping.com/sites/default/files/lrg_0.jpg"/>, <iframe frameborder="0" height="500px" scrolling="no" src="http://www.oreilly.com/authors/widgets/669.html" width="200px"></iframe>, <img alt="" src="http://pythonscraping.com/img/lrg%20(1).jpg" style="height:394px; width:300px"/>]
8
http://pythonscraping.com/misc/jquery.js?v=1.4.4
http://pythonscraping.com/misc/jquery.once.js?v=1.2
http://pythonscraping.com/misc/drupal.js?nhx1dd
http://pythonscraping.com/sites/all/themes/skeletontheme/js/jquery.mobilemenu.js?nhx1dd
http://pythonscraping.com/sites/all/modules/google_analytics/googleanalytics.js?nhx1dd
http://pythonscraping.com/sites/default/files/lrg_0.jpg
http://pythonscraping.com/img/lrg%20(1).jpg

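After the run, a downloaded directory appears in the same directory as the script, and opening it shows that all the files were saved successfully. The program is simple, and interested readers are welcome to run it themselves.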
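The hand-written prefix checks in getAbsoluteURL() can also be expressed with the standard library's URL helpers. The following is only a sketch of an alternative, not the course's method. It assumes Python 3's `urllib.parse`; `to_same_site_url`, `local_path_for`, and the sample path `img/logo.jpg` are hypothetical. It resolves relative `src` values with `urljoin`, treats `www.` and non-`www.` hosts as the same site, and ignores the query string when building the local path.

```python
import os
from urllib.parse import urljoin, urlparse


def _bare_host(netloc):
    """Lower-case the host and drop a leading 'www.' for comparison."""
    host = netloc.lower()
    return host[4:] if host.startswith("www.") else host


def to_same_site_url(base_url, source):
    """Resolve `source` against `base_url`; return None for other sites."""
    if source.startswith("www."):
        source = "http://" + source  # absolute link written without a scheme
    absolute = urljoin(base_url + "/", source)
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None
    if _bare_host(parts.netloc) != _bare_host(urlparse(base_url).netloc):
        return None
    return absolute


def local_path_for(absolute_url, download_directory):
    """Map a URL to a path under `download_directory`, ignoring the query."""
    url_path = urlparse(absolute_url).path.lstrip("/")
    return os.path.join(download_directory, *url_path.split("/"))


base = "http://pythonscraping.com"
print(to_same_site_url(base, "http://www.pythonscraping.com/misc/jquery.js?v=1.4.4"))
print(to_same_site_url(base, "img/logo.jpg"))
print(to_same_site_url(base, "http://www.oreilly.com/authors/widgets/669.html"))
print(local_path_for("http://pythonscraping.com/misc/jquery.js?v=1.4.4", "downloaded"))
```

Unlike getAbsoluteURL(), this sketch keeps the `www.` prefix in the returned URL and only strips it when comparing hosts.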

Next, urlretrieve() is used to download several web pages and save them to a specified path. The program is as follows:

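#!/usr/bin/python
# -*- coding: utf-8 -*-

import os

try:  # Python 3
    from urllib.request import urlretrieve
except ImportError:  # Python 2, as used when this post was written
    from urllib import urlretrieve

def callback(dbnum, dbsize, size):
    '''Progress callback (reporthook).
    dbnum: number of data blocks transferred so far
    dbsize: size of one data block, in bytes
    size: size of the remote file, or -1 if the server does not report it
    '''
    percent = 100.0 * dbnum * dbsize / size
    if percent > 100:
        percent = 100
    print("%.2f%%" % percent)

# Normalize the input URL; a URL without a scheme cannot be opened.
def url_handle(url):
    if not url.startswith(('http://', 'https://')):
        url = 'http://www.' + url
    return url

if __name__ == '__main__':
    if not os.path.exists('downloaddata'):
        os.mkdir('downloaddata')
    url_list = ['http://www.baidu.com', 'taobao.com', 'http://www.vmall.com']
    for one_url in url_list:
        print('*****************************downloading************************************')
        url = url_handle(one_url)
        # For example, http://www.baidu.com is saved as downloaddata/baidu.html
        local = os.path.join('downloaddata', url.split('.')[-2] + '.html')
        urlretrieve(url, local, callback)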

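The output is as follows. The percentages are negative because size was -1 in all three downloads, which is the value urlretrieve() passes to the callback when the server does not send a Content-Length header: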

*****************************downloading************************************
-0.00%
-819200.00%
-1638400.00%
-2457600.00%
-3276800.00%
-4096000.00%
-4915200.00%
-5734400.00%
-6553600.00%
-7372800.00%
-8192000.00%
-9011200.00%
-9830400.00%
-10649600.00%
*****************************downloading************************************
-0.00%
-819200.00%
-1638400.00%
-2457600.00%
-3276800.00%
-4096000.00%
-4915200.00%
-5734400.00%
-6553600.00%
-7372800.00%
-8192000.00%
-9011200.00%
-9830400.00%
-10649600.00%
-11468800.00%
-12288000.00%
-13107200.00%
-13926400.00%
*****************************downloading************************************
-0.00%
-819200.00%
-1638400.00%
-2457600.00%
-3276800.00%
-4096000.00%
-4915200.00%
-5734400.00%
-6553600.00%
-7372800.00%
-8192000.00%
-9011200.00%
-9830400.00%
-10649600.00%
-11468800.00%
-12288000.00%
-13107200.00%
-13926400.00%
-14745600.00%
-15564800.00%
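One way to avoid the negative percentages shown above is to check for an unknown total size before dividing. A minimal sketch, written for Python 3; `format_progress` and `safe_callback` are hypothetical names, and falling back to a byte count is just one possible design choice:

```python
def format_progress(block_count, block_size, total_size):
    """Return a progress string; fall back to bytes if the size is unknown."""
    received = block_count * block_size
    if total_size <= 0:
        # No usable Content-Length: a percentage cannot be computed.
        return "%d bytes received" % received
    percent = min(100.0, 100.0 * received / total_size)
    return "%.2f%%" % percent


def safe_callback(block_count, block_size, total_size):
    """reporthook-compatible wrapper that prints the formatted progress."""
    print(format_progress(block_count, block_size, total_size))


# Simulated reports: unknown size first, then a known 20000-byte file.
safe_callback(0, 8192, -1)
safe_callback(3, 8192, -1)
safe_callback(1, 8192, 20000)
safe_callback(3, 8192, 20000)
```

It can be passed as the third argument in place of callback, for example `urlretrieve(url, local, safe_callback)`.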