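Uploading files in Django is easy, especially with request.FILES. Performance, however, is a different story.
For the record, I am running Django svn trunk revision 6635. When a large file is uploaded, both memory usage and CPU load climb very high. Looking through the source, uploaded files are extracted by the parse_file_upload function in django/http/__init__.py, which parses the data the client POSTs and splits it into two collections, POST and FILES. The code is as follows: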
def parse_file_upload(header_dict, post_data):
    "Returns a tuple of (POST QueryDict, FILES MultiValueDict)"
    import email, email.Message
    from cgi import parse_header
    # The request headers and the entire request body are joined into one string.
    raw_message = '\r\n'.join(['%s:%s' % pair for pair in header_dict.items()])
    raw_message += '\r\n\r\n' + post_data
    msg = email.message_from_string(raw_message)
    POST = QueryDict('', mutable=True)
    FILES = MultiValueDict()
    for submessage in msg.get_payload():
        if submessage and isinstance(submessage, email.Message.Message):
            name_dict = parse_header(submessage['Content-Disposition'])[1]
            # name_dict is something like {'name': 'file', 'filename': 'test.txt'} for file uploads
            # or {'name': 'blah'} for POST fields
            # We assume all uploaded files have a 'filename' set.
            if 'filename' in name_dict:
                assert type([]) != type(submessage.get_payload()), "Nested MIME messages are not supported"
                if not name_dict['filename'].strip():
                    continue
                # IE submits the full path, so trim everything but the basename.
                # (We can't use os.path.basename because that uses the server's
                # directory separator, which may not be the same as the
                # client's one.)
                filename = name_dict['filename'][name_dict['filename'].rfind("\\") + 1:]
                FILES.appendlist(name_dict['name'], FileDict({
                    'filename': filename,
                    'content-type': 'Content-Type' in submessage and submessage['Content-Type'] or None,
                    'content': submessage.get_payload(),
                }))
            else:
                POST.appendlist(name_dict['name'], submessage.get_payload())
    return POST, FILES
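Even a quick read shows how bad this is: the entire upload is held in memory (raw_message), and the email parser then builds its own copy of every payload on top of that. If someone uploads a large file, the server will most likely be brought to its knees. Both modpython.py and wsgi.py call parse_file_upload to populate FILES.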
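To make the memory problem easier to see, here is a minimal sketch of the same parsing approach written against the Python 3 standard library. It is not Django's code: the function name parse_multipart_in_memory and the sample data are made up for illustration. The comments mark each point where another full copy of the upload ends up in memory.

```python
import email


def parse_multipart_in_memory(content_type, body):
    """Parse a multipart/form-data body entirely in memory (illustration only)."""
    # Copy 1: the header and the complete body are concatenated into one object.
    raw_message = (
        b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    )
    # Copy 2: the email parser builds a message tree that holds every payload.
    msg = email.message_from_bytes(raw_message)
    fields, files = {}, {}
    for part in msg.get_payload():
        name = part.get_param("name", header="content-disposition")
        filename = part.get_filename()
        # Copy 3: the payload is materialised again as bytes for the caller.
        content = part.get_payload(decode=True)
        if filename:
            files[name] = (filename, content)
        else:
            fields[name] = content.decode("utf-8")
    return fields, files


# Hypothetical request body with one plain field and one file.
sample_body = (
    b"--XX\r\n"
    b'Content-Disposition: form-data; name="title"\r\n\r\n'
    b"hello\r\n"
    b"--XX\r\n"
    b'Content-Disposition: form-data; name="file"; filename="test.txt"\r\n'
    b"Content-Type: text/plain\r\n\r\n"
    b"file body\r\n"
    b"--XX--\r\n"
)
fields, files = parse_multipart_in_memory(
    "multipart/form-data; boundary=XX", sample_body
)
```

With a body of a few bytes none of this matters, but with a upload of several hundred megabytes each of those copies costs that much memory again, which matches the behaviour I was seeing.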
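I searched around and found quite a few people asking about the same problem. Some suggest that in production you can limit the size of client POST requests at the front-end web server (Apache, for example). Others propose improving Django's upload mechanism so that large uploads are handled through temporary files. The second approach seems more sound to me. Ticket 2070 on Django's Trac has been working on exactly this since last year; the most recent update I saw is dated October 27, so it looks close to done. If you are interested, take a look:
There is also this document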
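The temporary-file idea can be sketched in a few lines. This is only my own illustration of the general technique, not the patch from ticket 2070: the function name spool_to_tempfile and the size limits are assumptions, and it handles only the raw request body, not the multipart parsing. The body is read in fixed-size chunks and written to a tempfile.SpooledTemporaryFile, which keeps small uploads in memory and moves to a file on disk once max_size is exceeded.

```python
import io
import tempfile


def spool_to_tempfile(stream, chunk_size=64 * 1024, max_memory=1024 * 1024):
    """Copy a file-like request body into a spooled temporary file, chunk by chunk."""
    # Data stays in memory until it grows past max_memory, then goes to disk.
    spooled = tempfile.SpooledTemporaryFile(max_size=max_memory)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Only one chunk at a time is held by this loop.
        spooled.write(chunk)
    spooled.seek(0)  # rewind so the caller can read from the start
    return spooled


# Hypothetical usage: a 3000-byte body with deliberately tiny limits.
body = io.BytesIO(b"x" * 3000)
upload = spool_to_tempfile(body, chunk_size=1024, max_memory=2048)
data = upload.read()
upload.close()
```

The point of the design is that peak memory is bounded by chunk_size plus max_memory rather than by the size of the upload, at the cost of some disk I/O for large files.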