一、讲一个亲身经历的例子
本人前些日子给公司HR部门做了一个面试管理系统,基于Django框架,采用xadmin扩展。其中一个模块 简历管理模块, 主要功能是上传简历、识别简历文本/关键信息、以及存储简历文件。
本来表结构已经设计的好好的,其中文件存储方式采用django默认的文件存储方式:数据库表只存放简历文件的存储路径,简历实体文件则存放在服务器硬盘上。某一天呢,我们不太懂技术的项目经理突然过目项目,觉得这样的存储方式太low,不安全,他觉得存储在硬盘上的话,岂不是谁登录这台服务器都能看到简历了,某个人手闲删了简历文件怎么办呢。于是乎就要求将简历文件存储在数据库。当时我是百般解释存储在数据库的弊端,比如性能问题阿,数据库备份问题阿。可是人家就是不听你的,非要存到数据库,那就存吧,数据库新加了两个字段:cv_name(简历名称)、cv_data(简历二进制数据)。
就这样,项目上线了一段时间,问题终于出现了。HR访问简历列表页的时候非常慢,大概需要5-6秒才能刷出来,此时的简历大概也只有3000多份左右。经调查cv_data这个字段就是问题文原因之一。
二、性能测试
首先,通过操作UI画面的方式进行测试,从宏观上感受一下cv_data这个字段对网站性能的影响(测试数据量为1332条):
所测试表结构:
不对cv_data字段做任何修改,且字段值不为空(存储60kb左右的文件):
打开list页面至加载完成平均耗时:12s
不对cv_data字段做任何修改,且字段值设置为空:
打开list页面至加载完成平均耗时:4.7s
将cv_data字段由BinaryField改为普通CharField类型字段:
打开list页面至加载完成平均耗时:2.6s
可以看到cv_data字段即使为空值,对性能影响也非常大。
三、性能分析
下面我将从两个方面对性能进行分析,并结合代码、ORM、数据库进行详细解释:
1、先看分页器
django自带了分页器功能,在django/core/paginator.py下class Paginator(object),访问列表页,django-xadmin会进行分页查询,初始化分页器并接收queryset参数,返回分页结果,以下是django和xadmin源码:
class Paginator(object):
"""分页器"""
def __init__(self, object_list, per_page, orphans=0,
allow_empty_first_page=True):
self.object_list = object_list
self._check_object_list_is_ordered()
self.per_page = int(per_page)
self.orphans = int(orphans)
self.allow_empty_first_page = allow_empty_first_page
def validate_number(self, number):
"""
Validates the given 1-based page number.
"""
try:
number = int(number)
except (TypeError, ValueError):
raise PageNotAnInteger(_('That page number is not an integer'))
if number < 1:
raise EmptyPage(_('That page number is less than 1'))
if number > self.num_pages:
if number == 1 and self.allow_empty_first_page:
pass
else:
raise EmptyPage(_('That page contains no results'))
return number
def page(self, number):
"""
Returns a Page object for the given 1-based page number.
"""
number = self.validate_number(number)
bottom = (number - 1) * self.per_page
top = bottom + self.per_page
if top + self.orphans >= self.count:
top = self.count
return self._get_page(self.object_list[bottom:top], number, self)
def _get_page(self, *args, **kwargs):
"""
Returns an instance of a single page.
This hook can be used by subclasses to use an alternative to the
standard :cls:`Page` object.
"""
return Page(*args, **kwargs)
@cached_property
def count(self):
"""
Returns the total number of objects, across all pages.
"""
try:
return self.object_list.count()
except (AttributeError, TypeError):
# AttributeError if object_list has no count() method.
# TypeError if object_list.count() requires arguments
# (i.e. is of type list).
return len(self.object_list)
@cached_property
def num_pages(self):
"""
Returns the total number of pages.
"""
if self.count == 0 and not self.allow_empty_first_page:
return 0
hits = max(1, self.count - self.orphans)
return int(ceil(hits / float(self.per_page)))
@property
def page_range(self):
"""
Returns a 1-based range of pages for iterating through within
a template for loop.
"""
return six.moves.range(1, self.num_pages + 1)
def _check_object_list_is_ordered(self):
"""
Warn if self.object_list is unordered (typically a QuerySet).
"""
ordered = getattr(self.object_list, 'ordered', None)
if ordered is not None and not ordered:
obj_list_repr = (
'{} {}'.format(self.object_list.model, self.object_list.__class__.__name__)
if hasattr(self.object_list, 'model')
else '{!r}'.format(self.object_list)
)
warnings.warn(
'Pagination may yield inconsistent results with an unordered '
'object_list: {}.'.format(obj_list_repr),
UnorderedObjectListWarning,
stacklevel=3
)
分页器初始化后进行数据总行数的计算,self.object_list.count(),object_list是初始化分页器传进来的queryset,这条ORM(202行len(self.result_list)也一样)对应的查询语句则是:
SELECT COUNT(*) FROM (SELECT DISTINCT `hr_hruser`.`id` AS Col1, `hr_hruser`.`username` AS Col2, `hr_hruser`.`cv_name` AS Col3, `hr_hruser`.`cv_data` AS Col4, `hr_hruser`.`uploader_id` AS Col5, `hr_hruser`.`skill_tags` AS Col6, `hr_hruser`.`cv_status` AS Col7, `hr_hruser`.`gender` AS Col8, `hr_hruser`.`education` AS Col9, `hr_hruser`.`language` AS Col10, `hr_hruser`.`language_level` AS Col11, `hr_hruser`.`age` AS Col12, `hr_hruser`.`work_life` AS Col13, `hr_hruser`.`mail` AS Col14, `hr_hruser`.`phone` AS Col15, `hr_hruser`.`work_type` AS Col16, `hr_hruser`.`first_impression` AS Col17, `hr_hruser`.`alltxt` AS Col18, `hr_hruser`.`comment` AS Col19, `hr_hruser`.`create_time` AS Col20, `hr_hruser`.`update_time` AS Col21 FROM `hr_hruser`)
共耗时:7.1s
手动去掉cv_data这个字段时:
SELECT COUNT(*) FROM (SELECT DISTINCT `hr_hruser`.`id` AS Col1, `hr_hruser`.`username` AS Col2, `hr_hruser`.`cv_name` AS Col3, `hr_hruser`.`uploader_id` AS Col5, `hr_hruser`.`skill_tags` AS Col6, `hr_hruser`.`cv_status` AS Col7, `hr_hruser`.`gender` AS Col8, `hr_hruser`.`education` AS Col9, `hr_hruser`.`language` AS Col10, `hr_hruser`.`language_level` AS Col11, `hr_hruser`.`age` AS Col12, `hr_hruser`.`work_life` AS Col13, `hr_hruser`.`mail` AS Col14, `hr_hruser`.`phone` AS Col15, `hr_hruser`.`work_type` AS Col16, `hr_hruser`.`first_impression` AS Col17, `hr_hruser`.`alltxt` AS Col18, `hr_hruser`.`comment` AS Col19, `hr_hruser`.`create_time` AS Col20, `hr_hruser`.`update_time` AS Col21 FROM `hr_hruser`)
共耗时:0.4s
为做对比,在贴几个精简的查询方式:
SELECT COUNT(*) FROM `hr_hruser`
共耗时:0.02s
SELECT cv_data FROM `hr_hruser`
共耗时:3.2s
SELECT count(cv_data) FROM `hr_hruser`
共耗时:3.2s
绿色部分是queryset惰性特性,也就是说获取queryset时并不会去查询,只有做循环、if判断、查值、计算等才会真正去数据库查询。
结论:
django分页器进行总数量查询时,ORM转SQL时会产生一个包含所有字段的子查询臃肿语句(也就是上面绿色的sql文),并且BinaryField类型的字段会对SQL性能造成非常巨大影响。
2、再看数据渲染
xadmin渲染列表页数据时,会经历以下步骤:
1、首先拿到所有数据的queryset
2、由分页器获取某一页要展示数据的queryset,比如第一页,展示前20条数据
3、遍历这20条数据,对queryset进行加工
queryset做遍历时,才会真正去数据库做查询,对应的sql语句为:
SELECT DISTINCT `hr_hruser`.`id`, `hr_hruser`.`username`, `hr_hruser`.`cv_name`, `hr_hruser`.`cv_data`, `hr_hruser`.`uploader_id`, `hr_hruser`.`skill_tags`, `hr_hruser`.`cv_status`, `hr_hruser`.`gender`, `hr_hruser`.`education`, `hr_hruser`.`language`, `hr_hruser`.`language_level`, `hr_hruser`.`age`, `hr_hruser`.`work_life`, `hr_hruser`.`mail`, `hr_hruser`.`phone`, `hr_hruser`.`work_type`, `hr_hruser`.`first_impression`, `hr_hruser`.`alltxt`, `hr_hruser`.`comment`, `hr_hruser`.`create_time`, `hr_hruser`.`update_time`, `auth_user`.`id`, `auth_user`.`password`, `auth_user`.`last_login`, `auth_user`.`is_superuser`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`date_joined` FROM `hr_hruser` INNER JOIN `auth_user` ON (`hr_hruser`.`uploader_id` = `auth_user`.`id`) ORDER BY `hr_hruser`.`create_time` DESC, `hr_hruser`.`id` DESC LIMIT 20
共耗时:3.75s
以上sql的执行顺序为先拿到数据库所有数据(1332条),并做排序处理,然后在拿到前20条数据。
我尝试了对sql做一些优化测试:
仅去掉cv-data字段:
SELECT DISTINCT `hr_hruser`.`id`, `hr_hruser`.`username`, `hr_hruser`.`cv_name`,`hr_hruser`.`uploader_id`, `hr_hruser`.`skill_tags`, `hr_hruser`.`cv_status`, `hr_hruser`.`gender`, `hr_hruser`.`education`, `hr_hruser`.`language`, `hr_hruser`.`language_level`, `hr_hruser`.`age`, `hr_hruser`.`work_life`, `hr_hruser`.`mail`, `hr_hruser`.`phone`, `hr_hruser`.`work_type`, `hr_hruser`.`first_impression`, `hr_hruser`.`alltxt`, `hr_hruser`.`comment`, `hr_hruser`.`create_time`, `hr_hruser`.`update_time`, `auth_user`.`id`, `auth_user`.`password`, `auth_user`.`last_login`, `auth_user`.`is_superuser`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`date_joined` FROM `hr_hruser` INNER JOIN `auth_user` ON (`hr_hruser`.`uploader_id` = `auth_user`.`id`) ORDER BY `hr_hruser`.`create_time` DESC, `hr_hruser`.`id` DESC LIMIT 20
共耗时:0.26s
仅去掉排序条件:
SELECT DISTINCT `hr_hruser`.`id`, `hr_hruser`.`username`, `hr_hruser`.`cv_name`, `hr_hruser`.`cv_data`, `hr_hruser`.`uploader_id`, `hr_hruser`.`skill_tags`, `hr_hruser`.`cv_status`, `hr_hruser`.`gender`, `hr_hruser`.`education`, `hr_hruser`.`language`, `hr_hruser`.`language_level`, `hr_hruser`.`age`, `hr_hruser`.`work_life`, `hr_hruser`.`mail`, `hr_hruser`.`phone`, `hr_hruser`.`work_type`, `hr_hruser`.`first_impression`, `hr_hruser`.`alltxt`, `hr_hruser`.`comment`, `hr_hruser`.`create_time`, `hr_hruser`.`update_time`, `auth_user`.`id`, `auth_user`.`password`, `auth_user`.`last_login`, `auth_user`.`is_superuser`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`date_joined` FROM `hr_hruser` INNER JOIN `auth_user` ON (`hr_hruser`.`uploader_id` = `auth_user`.`id`) LIMIT 20
共耗时:0.03s
最后,我又尝试对create_time、id字段创建索引,结果对性能提升也是微乎其微。
4、将加工的数据传到模板中进行渲染
结论:
当同时存在BinaryField类型字段和排序条件时,mysql的性能会非常差。即使对排序字段创建索引和联合索引也无济于事。
四、解决办法
1、针对分页器的优化
思路:ORM具有惰性,在ORM转成sql真正去数据库查询前进行拦截,获取并修改sql,去掉无关字段再执行sql。
自定义分页器,重写count方法:
1、新建paginator.py文件:
from django.core.paginator import Paginator
from django.utils.functional import cached_property
from utils.ORM_performance import ORM
class HRMSPaginator(Paginator):
@cached_property
def count(self):
"""
Returns the total number of objects, across all pages.
"""
try:
return ORM.count(self.object_list)
except (AttributeError, TypeError):
return len(self.object_list)
2、新建ORM_performance.py文件:
from django.db import connection
class ORM(object):
_bloated_field = ["`hr_hruser`.`cv_data`", "`hr_interview`.`cv_safe_data`", "`hr_interview`.`evaluation_1_data`",
"`hr_interview`.`evaluation_2_data`", "`hr_interview`.`evaluation_3_data`",
"`hr_interview`.`offer_data`"
] #这些字段都是BinaryField类型
@classmethod
def count(cls, queryset):
"""receive a queryset and return the length of the queryset"""
cursor = connection.cursor()
sql = queryset.query.__str__()
#去掉无关字段
for field in cls._bloated_field:
if field in sql:
sql = sql.replace(field, "")
new_sql = sql.replace(", ,", ",")
try:
#new_sql中当包含以字符串为参数的查询时,execute会报错,因为new_sql中的字符串并没有被引号引起来。而sql语句要求字符串必须要有引号的。
return cursor.execute(new_sql)
except:
try:
return queryset.count()
except (AttributeError, TypeError):
return len(queryset)
finally:
cursor.close()
3、修改\admin\views\list.py
class ListAdminView(ModelAdminView):
"""
Display models objects view. this class has ordering and simple filter features.
"""
list_display = ('__str__',)
list_display_links = ()
list_display_links_details = False
list_select_related = None
list_per_page = 50
list_max_show_all = 200
list_exclude = ()
search_fields = ()
paginator_class = HRMSPaginator #修改一:使用自定义paginator
ordering = None
def make_result_list(self):
# Get search parameters from the query string.
self.base_queryset = self.queryset()
self.list_queryset = self.get_list_queryset()
self.ordering_field_columns = self.get_ordering_field_columns()
self.paginator = self.get_paginator()
# Get the number of objects, with admin filters applied.
self.result_count = self.paginator.count
self.can_show_all = self.result_count <= self.list_max_show_all
self.multi_page = self.result_count > self.list_per_page
# Get the list of objects to display on this page.
if (self.show_all and self.can_show_all) or not self.multi_page:
self.result_list = self.list_queryset._clone()
else:
try:
self.result_list = self.paginator.page(
self.page_num + 1).object_list
except InvalidPage:
if ERROR_FLAG in self.request.GET.keys():
return SimpleTemplateResponse('xadmin/views/invalid_setup.html', {
'title': _('Database error'),
})
return HttpResponseRedirect(self.request.path + '?' + ERROR_FLAG + '=1')
self.has_more = self.result_count > (
self.list_per_page * self.page_num + ORM.count(self.result_list)) #修改二:使用自定义方法查询queryset长度
2、针对数据渲染的优化
如果排序需求不是很大,可以先去掉排序。
也可考虑将cv_data这样的字段单独存储到一张表,以外键关联,专门存储二进制或其他较大的文本数据。
五、结论
1、数据库最好不用存放大文件,严格说不要使用BinaryField类型字段,如果必须,那就用单独一张表保存,以外键与业务表关联。
2、User.objects.all()、User.objects.filter()、User.objects.get()等类似的ORM转化成SQL时,通常都会产生一个类似select field1, field2...所有字段 from .. where .. 这样的臃肿语句,即使你仅仅只想查找一个字段。所以尽量使用User.objects.values()、User.objects.values_list()来避免以上问题。并且包含BinaryField类型的字段时,会对SQL性能造成非常巨大影响。
3、select * form hr_hruser order by id limit 0,20这样的语句执行顺序为先取出表中所有数据,然后进行排序,最后拿到前20条。如果数据量非常大,那排序将是非常耗时的。当select username, cv_data .. form hr_hruser order by id limit 0,20这样的语句同时存在BinaryField和排序时,即使数据量很小,查询也会非常耗时。去掉任意一个条件(BinaryField或者order by)即可改善。