Translated from: Download large file in python with requests
Requests is a really nice library. I'd like to use it to download big files (>1GB). The problem is that it's not possible to keep the whole file in memory; I need to read it in chunks. And that is the problem with the following code:
import requests

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=512 * 1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
    f.close()
    return
For some reason it doesn't work this way: it still loads the whole response into memory before saving it to a file.
UPDATE
If you need a small client (Python 2.x/3.x) that can download big files from FTP, you can find it here. It supports multithreading & reconnects (it monitors connections), and it also tunes socket parameters for the download task.
#1
Reference: https://stackoom.com/question/1836h/使用请求在python中下载大文件
#2
Your chunk size could be too large; have you tried dropping it, maybe to 1024 bytes at a time? (Also, you could use with to tidy up the syntax.)
def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return
Incidentally, how are you deducing that the response has been loaded into memory?
It sounds as if Python isn't flushing the data to the file. Based on other SO questions, you could try f.flush() and os.fsync() to force the file write and free memory:
import os

with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
#3
With the following streaming code, Python memory usage is restricted regardless of the size of the downloaded file:
def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    # f.flush()
    return local_filename
Note that the number of bytes returned using iter_content is not exactly the chunk_size; it's expected to be a random number that is often far bigger, and it is expected to be different in every iteration.
See http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow for further reference.
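As a rough illustration of that body-content workflow: with stream=True, requests fetches only the headers up front, so you can inspect something like Content-Length and decide whether to download at all before the body is read. The URL, filename, and size cut-off below are made-up placeholders, not part of the original answer:

import requests

url = 'http://example.com/big-file.iso'  # hypothetical URL
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    # Only the headers have been read at this point; the body is still on the wire
    size = int(r.headers.get('Content-Length', 0))  # 0 if the server omits the header
    if size > 10 * 1024 ** 3:  # arbitrary 10 GB cut-off for this sketch
        print('File too large, skipping download')
    else:
        with open('big-file.iso', 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)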
#4
It's much easier if you use Response.raw and shutil.copyfileobj():
import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename
This streams the file to disk without using excessive memory, and the code is simple.
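One caveat worth noting with this approach: r.raw is the undecoded urllib3 stream, so if the server applies gzip or deflate Content-Encoding, the bytes copied by copyfileobj stay compressed on disk. A minimal sketch of one common workaround, assuming urllib3's decode_content flag behaves as documented:

import requests
import shutil

def download_file_decoded(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Ask urllib3 to decode any Content-Encoding (gzip/deflate) while reading
        r.raw.decode_content = True
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename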
#5
Not exactly what the OP was asking, but... it's ridiculously easy to do that with urllib:
from urllib.request import urlretrieve
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
dst = 'ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, dst)
Or this way, if you want to save it to a temporary file:
from urllib.request import urlopen
from shutil import copyfileobj
from tempfile import NamedTemporaryFile
url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
with urlopen(url) as fsrc, NamedTemporaryFile(delete=False) as fdst:
    copyfileobj(fsrc, fdst)
I watched the process:
watch 'ps -p 18647 -o pid,ppid,pmem,rsz,vsz,comm,args; ls -al *.iso'
And I saw the file growing, but memory usage stayed at 17 MB. Am I missing something?
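If you'd rather verify this from inside the script instead of with watch, one rough sketch uses the standard-library resource module (Unix only; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS):

import resource
from urllib.request import urlretrieve

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, 'ubuntu-16.04.2-desktop-amd64.iso')

# Peak resident set size of this process so far
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS reported by getrusage:', peak)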