Downloading Web Files with Python: A Summary of Methods (Regular URLs and Redirected URLs)

Latest recommended article published on 2024-07-03 18:08:13

weixin_39867125

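Introduction

Python offers several ways to download files from the web (PDFs, documents, images, videos, and so on). This article covers the following:

Downloading regular files;

Downloading files behind redirects;

Downloading large files;

Multithreaded downloads;

Downloading with a progress bar.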

The requests method

The requests module provides a simple way to download the file behind a URL.

For example:

import requests

url = 'https://www.python.org/static/img/python-logo@2x.png'
file = requests.get(url)
with open('logo.png', 'wb') as f:
    f.write(file.content)

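Here we download the logo image from the Python website. We first import the requests module and call its get method, which by default downloads the whole response (headers and body) and returns it. We store the result in the variable file, then write file.content to a file named logo.png in the current directory.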
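The example above writes whatever the server returns, including an error page. As a minimal sketch (the function name `download` and the 10-second timeout are assumptions made for illustration, not part of the original example), the request can be wrapped so that an HTTP error raises an exception instead of being saved to disk:

```python
import requests


def download(url, path, timeout=10):
    """Fetch url and save the body to path; return the number of bytes written.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of saving an error page.
    response.raise_for_status()
    with open(path, 'wb') as f:
        f.write(response.content)
    return len(response.content)
```

Calling `download('https://www.python.org/static/img/python-logo@2x.png', 'logo.png')` would save the logo in the same way as the example above.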

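The wget method

Anyone familiar with Linux commands will know wget. A wget module is also available for Python (as a third-party package), and it makes downloading a file from a URL very convenient.

The code is as follows: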

import wget

url = 'https://www.python.org/static/img/python-logo@2x.png'

wget.download(url, './logo.png')

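This also downloads the logo from the Python website. Here the URL and the local save path are passed directly to wget.download. Note that wget is not included in Anaconda and must be installed with pip install wget.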

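Downloading files behind redirects

When downloading a file, the URL you have often does not point at the file directly; the web server redirects to the real address. requests can follow such redirects.

The code is as follows: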

import requests

url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
file = requests.get(url, allow_redirects=True)
with open('finthon.pdf', 'wb') as f:
    f.write(file.content)

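Here we again use the get method of the requests module to fetch the content of the URL. Passing allow_redirects=True to get allows redirects to be followed (this is already the default for get, so the argument only makes the intent explicit). The content at the redirect target is stored in the variable file and finally saved to finthon.pdf.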
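To see which redirects were actually followed, the response object records them. The sketch below relies only on the `history`, `status_code` and `url` attributes of a requests response; the helper names `redirect_chain` and `fetch_with_chain` and the timeout value are hypothetical, chosen for illustration.

```python
import requests


def redirect_chain(response):
    """Return [(status_code, url), ...] for each hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def fetch_with_chain(url, timeout=10):
    """Follow redirects (the default for GET) and return (response, chain)."""
    response = requests.get(url, allow_redirects=True, timeout=timeout)
    return response, redirect_chain(response)
```

If the chain has more than one entry, the server redirected at least once, and the last entry is the address the file was really served from.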

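Downloading large files in chunks

By default, requests downloads the response body immediately after the request is made, which can exhaust memory when the data is large. Setting stream=True on the request avoids reading the content into memory right away: only the headers are fetched and the connection stays open, and downloading the body is deferred until you access the Response.content attribute or iterate over the body. With stream=False (the default), the data is returned as a single block.

The code is as follows: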

import requests

url = 'https://www.python.org/static/img/python-logo@2x.png'
with open('logo.png', 'wb') as f:
    with requests.get(url, stream=True) as r:
        for chunk in r.iter_content(chunk_size=20):
            if chunk:
                f.write(chunk)

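If you set stream=True in a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can make connection use inefficient. If you find yourself only partially reading the body (or not reading it at all) while using stream=True, consider sending the request inside a with statement, which guarantees that the connection is always closed.

Here chunk_size makes each read return a block of 20 bytes, and each block is written to logo.png as soon as it is read. A value this small is only for demonstration; a block size of several kilobytes means far fewer write calls.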
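The same streaming pattern can be packaged as a reusable function. This is a minimal sketch under stated assumptions: the name `stream_to_file`, the 8192-byte default chunk size, and the timeout are illustrative choices, not values from the original example.

```python
import requests


def stream_to_file(url, path, chunk_size=8192, timeout=10):
    """Stream url to path in chunks; return the number of bytes written."""
    written = 0
    # The with statement closes the connection even if writing fails midway.
    with requests.get(url, stream=True, timeout=timeout) as r:
        # Check the status before opening the file, so an HTTP error
        # does not leave an empty or misleading file behind.
        r.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)
                    written += len(chunk)
    return written
```

Because only one chunk is held in memory at a time, memory use stays roughly constant regardless of file size.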

Downloading multiple files in parallel

import requests

from time import time

from multiprocessing.pool import ThreadPool

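We import the time function from the time module to measure how long the downloads take. ThreadPool lets you run multiple tasks concurrently in a pool of threads.

First, create a simple function that writes the response to a file in chunks: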

def get_url(url, name):
    with open(name, 'wb') as f:
        with requests.get(url, stream=True) as r:
            for chunk in r.iter_content(chunk_size=20):
                if chunk:
                    f.write(chunk)

Next, we define the batch of URLs to download:

urls = [('event1', 'https://www.python.org/events/python-events/805'),
        ('event2', 'https://www.python.org/events/python-events/801'),
        ('event3', 'https://www.python.org/events/python-events/790'),
        ('event4', 'https://www.python.org/events/python-events/798'),
        ('event5', 'https://www.python.org/events/python-events/807'),
        ('event6', 'https://www.python.org/events/python-events/807'),
        ('event7', 'https://www.python.org/events/python-events/757'),
        ('event8', 'https://www.python.org/events/python-events/816')]

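Each of these URLs is passed to requests.get, and the page content is written to a file whose name is the first element of the tuple.

We can call this function for each URL one at a time, or for all URLs concurrently. Let's first call it for each URL in a for loop, and note the timer: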

start = time()
for x in urls:
    get_url(x[1], x[0])
end = time()
print('time to download: ', end - start)

Output:

time to download: 10.102667570114136

Now consider the following code:

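p = ThreadPool(8)
start = time()
for x in urls:
    p.apply_async(get_url, args=(x[1], x[0]))
# close() must be called before join(), otherwise an error is raised.
# After close(), no new tasks can be submitted to the pool;
# join() then waits for all worker threads to finish.
p.close()
p.join()
end = time()
print('time to download: ', end - start)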

Output:

time to download: 2.9488866329193115

As you can see, the program runs faster with multiple threads.
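One caveat with apply_async: an exception raised inside a worker is stored in the returned result object and is only re-raised when get() is called on that object, so a failed download can go unnoticed. The following is a minimal sketch that collects the outcome of every task; the helper name `run_all` and the shape of its return value are assumptions made for illustration.

```python
from multiprocessing.pool import ThreadPool


def run_all(func, jobs, workers=8):
    """Run func(url, name) for every (name, url) pair in jobs using a thread pool.

    Returns a dict mapping each name to None on success,
    or to the exception that the worker raised.
    """
    outcomes = {}
    with ThreadPool(workers) as pool:
        pending = [(name, pool.apply_async(func, args=(url, name)))
                   for name, url in jobs]
        for name, result in pending:
            try:
                result.get()  # blocks until done; re-raises the worker's exception
                outcomes[name] = None
            except Exception as exc:
                outcomes[name] = exc
    return outcomes
```

With the function and list from this article, `run_all(get_url, urls)` would return a dictionary in which any entry that is not None marks a download that failed.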

Downloading with a progress bar

The progress bar is a UI component of the clint module. Run the following command to install clint:

pip install clint

Consider the following code:

import requests
from clint.textui import progress

url = 'https://do1.dr-chuck.com/pythonlearn/EN_us/pythonlearn.pdf'
chunk_size = 200000
r = requests.get(url, stream=True)
with open('lp.pdf', 'wb') as f:
    total_length = int(r.headers.get('content-length'))
    # expected_size is the number of chunks the loop will yield,
    # so it must be derived from the same chunk_size used for reading.
    for x in progress.bar(r.iter_content(chunk_size=chunk_size), expected_size=(total_length / chunk_size) + 1):
        if x:
            f.write(x)

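In this code we first import the requests module, then import the progress component from clint.textui. The only difference from the chunked download is in the for loop: while writing the content to the file, we wrap the chunk iterator in the bar method of the progress module, whose expected_size argument is the number of chunks the bar should expect.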
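The value passed as expected_size is simple arithmetic: the reported size divided by the chunk size, rounded up. As a sketch of that calculation (the helper name `expected_chunks` is hypothetical, and the result is only an estimate when the server compresses the response, since Content-Length then counts the compressed bytes):

```python
import math


def expected_chunks(content_length, chunk_size):
    """Estimate how many chunks iter_content(chunk_size) will yield.

    content_length is the raw Content-Length header value (a string),
    or None when the server did not send one; None is returned in that case.
    """
    if content_length is None:
        return None
    return max(1, math.ceil(int(content_length) / chunk_size))
```

For instance, `expected_chunks(r.headers.get('content-length'), 200000)` could replace the inline expression. When it returns None, the size is unknown and the caller would need to handle that case, for example by writing the chunks without a progress bar.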

Summary

This article covered several ways to download web files with Python.
