记一次关于BinaryField类型字段对 Mysql | Django分页 | Xadmin列表页 性能影响的大调查

18 篇文章 0 订阅
17 篇文章 4 订阅

 

一、讲一个亲身经历的例子

本人前些日子给公司HR部门做了一个面试管理系统,基于Django框架,采用xadmin扩展。其中一个模块  简历管理模块,  主要功能是上传简历、识别简历文本/关键信息、以及存储简历文件。

本来表结构已经设计的好好的,其中文件存储方式采用django默认的文件存储方式:数据库表只存放简历文件的存储路径,简历实体文件则存放在服务器硬盘上。某一天呢,我们不太懂技术的项目经理突然过目项目,觉得这样的存储方式太low,不安全,他觉得存储在硬盘上的话,岂不是谁登录这台服务器都能看到简历了,某个人手闲删了简历文件怎么办呢。于是乎就要求将简历文件存储在数据库。当时我是百般解释存储在数据库的弊端,比如性能问题阿,数据库备份问题阿。可是人家就是不听你的,非要存到数据库,那就存吧,数据库新加了两个字段:cv_name(简历名称)、cv_data(简历二进制数据)。

就这样,项目上线了一段时间,问题终于出现了。HR访问简历列表页的时候非常慢,大概需要5-6秒才能刷出来,此时的简历大概也只有3000多份左右。经调查cv_data这个字段就是问题文原因之一。

 

二、性能测试

首先,通过操作UI画面的方式进行测试,从宏观上感受一下cv_data这个字段对网站性能的影响(测试数据量为1332条):

所测试表结构:

不对cv_data字段做任何修改,且字段值不为空(存储60kb左右的文件):

打开list页面至加载完成平均耗时:12s

不对cv_data字段做任何修改,且字段值设置为空:

打开list页面至加载完成平均耗时:4.7s

将cv_data字段由BinaryField改为普通CharField类型字段:

打开list页面至加载完成平均耗时:2.6s

可以看到cv_data字段即使为空值,对性能影响也非常大。

 

三、性能分析

下面我将从两个方面对性能进行分析,并结合代码、ORM、数据库进行详细解释:

1、先看分页器

django自带了分页器功能,在django/core/paginator.py下class Paginator(object),访问列表页,django-xadmin会进行分页查询,初始化分页器并接收queryset参数,返回分页结果,以下是django和xadmin源码:

class Paginator(object):
    """分页器"""
    def __init__(self, object_list, per_page, orphans=0,
                 allow_empty_first_page=True):
        self.object_list = object_list
        self._check_object_list_is_ordered()
        self.per_page = int(per_page)
        self.orphans = int(orphans)
        self.allow_empty_first_page = allow_empty_first_page

    def validate_number(self, number):
        """
        Validates the given 1-based page number.
        """
        try:
            number = int(number)
        except (TypeError, ValueError):
            raise PageNotAnInteger(_('That page number is not an integer'))
        if number < 1:
            raise EmptyPage(_('That page number is less than 1'))
        if number > self.num_pages:
            if number == 1 and self.allow_empty_first_page:
                pass
            else:
                raise EmptyPage(_('That page contains no results'))
        return number

    def page(self, number):
        """
        Returns a Page object for the given 1-based page number.
        """
        number = self.validate_number(number)
        bottom = (number - 1) * self.per_page
        top = bottom + self.per_page
        if top + self.orphans >= self.count:
            top = self.count
        return self._get_page(self.object_list[bottom:top], number, self)

    def _get_page(self, *args, **kwargs):
        """
        Returns an instance of a single page.

        This hook can be used by subclasses to use an alternative to the
        standard :cls:`Page` object.
        """
        return Page(*args, **kwargs)

    @cached_property
    def count(self):
        """
        Returns the total number of objects, across all pages.
        """
        try:
            return self.object_list.count()
        except (AttributeError, TypeError):
            # AttributeError if object_list has no count() method.
            # TypeError if object_list.count() requires arguments
            # (i.e. is of type list).
            return len(self.object_list)

    @cached_property
    def num_pages(self):
        """
        Returns the total number of pages.
        """
        if self.count == 0 and not self.allow_empty_first_page:
            return 0
        hits = max(1, self.count - self.orphans)
        return int(ceil(hits / float(self.per_page)))

    @property
    def page_range(self):
        """
        Returns a 1-based range of pages for iterating through within
        a template for loop.
        """
        return six.moves.range(1, self.num_pages + 1)

    def _check_object_list_is_ordered(self):
        """
        Warn if self.object_list is unordered (typically a QuerySet).
        """
        ordered = getattr(self.object_list, 'ordered', None)
        if ordered is not None and not ordered:
            obj_list_repr = (
                '{} {}'.format(self.object_list.model, self.object_list.__class__.__name__)
                if hasattr(self.object_list, 'model')
                else '{!r}'.format(self.object_list)
            )
            warnings.warn(
                'Pagination may yield inconsistent results with an unordered '
                'object_list: {}.'.format(obj_list_repr),
                UnorderedObjectListWarning,
                stacklevel=3
            )

 

分页器初始化后进行数据总行数的计算,self.object_list.count(),object_list是初始化分页器传进来的queryset,这条ORM(202行len(self.result_list)也一样)对应的查询语句则是:

SELECT COUNT(*) FROM (SELECT DISTINCT `hr_hruser`.`id` AS Col1, `hr_hruser`.`username` AS Col2, `hr_hruser`.`cv_name` AS Col3, `hr_hruser`.`cv_data` AS Col4, `hr_hruser`.`uploader_id` AS Col5, `hr_hruser`.`skill_tags` AS Col6, `hr_hruser`.`cv_status` AS Col7, `hr_hruser`.`gender` AS Col8, `hr_hruser`.`education` AS Col9, `hr_hruser`.`language` AS Col10, `hr_hruser`.`language_level` AS Col11, `hr_hruser`.`age` AS Col12, `hr_hruser`.`work_life` AS Col13, `hr_hruser`.`mail` AS Col14, `hr_hruser`.`phone` AS Col15, `hr_hruser`.`work_type` AS Col16, `hr_hruser`.`first_impression` AS Col17, `hr_hruser`.`alltxt` AS Col18, `hr_hruser`.`comment` AS Col19, `hr_hruser`.`create_time` AS Col20, `hr_hruser`.`update_time` AS Col21 FROM `hr_hruser`

共耗时:7.1s

手动去掉cv_data这个字段时:

SELECT COUNT(*) FROM (SELECT DISTINCT `hr_hruser`.`id` AS Col1, `hr_hruser`.`username` AS Col2, `hr_hruser`.`cv_name` AS Col3,  `hr_hruser`.`uploader_id` AS Col5, `hr_hruser`.`skill_tags` AS Col6, `hr_hruser`.`cv_status` AS Col7, `hr_hruser`.`gender` AS Col8, `hr_hruser`.`education` AS Col9, `hr_hruser`.`language` AS Col10, `hr_hruser`.`language_level` AS Col11, `hr_hruser`.`age` AS Col12, `hr_hruser`.`work_life` AS Col13, `hr_hruser`.`mail` AS Col14, `hr_hruser`.`phone` AS Col15, `hr_hruser`.`work_type` AS Col16, `hr_hruser`.`first_impression` AS Col17, `hr_hruser`.`alltxt` AS Col18, `hr_hruser`.`comment` AS Col19, `hr_hruser`.`create_time` AS Col20, `hr_hruser`.`update_time` AS Col21 FROM `hr_hruser`

共耗时:0.4s

为做对比,在贴几个精简的查询方式:

SELECT COUNT(*)  FROM `hr_hruser`

共耗时:0.02s

SELECT cv_data  FROM `hr_hruser`

共耗时:3.2s

SELECT count(cv_data)  FROM `hr_hruser`

共耗时:3.2s

绿色部分是queryset惰性特性,也就是说获取queryset时并不会去查询,只有做循环、if判断、查值、计算等才会真正去数据库查询。

结论:

django分页器进行总数量查询时,ORM转SQL时会产生一个包含所有字段的子查询臃肿语句(也就是上面绿色的sql文),并且BinaryField类型的字段会对SQL性能造成非常巨大影响。

2、再看数据渲染

xadmin渲染列表页数据时,会经历以下步骤:

1、首先拿到所有数据的queryset

2、由分页器获取某一页要展示数据的queryset,比如第一页,展示前20条数据

3、遍历这20条数据,对queryset进行加工

queryset做遍历时,才会真正去数据库做查询,对应的sql语句为:

SELECT DISTINCT `hr_hruser`.`id`, `hr_hruser`.`username`, `hr_hruser`.`cv_name`, `hr_hruser`.`cv_data`, `hr_hruser`.`uploader_id`, `hr_hruser`.`skill_tags`, `hr_hruser`.`cv_status`, `hr_hruser`.`gender`, `hr_hruser`.`education`, `hr_hruser`.`language`, `hr_hruser`.`language_level`, `hr_hruser`.`age`, `hr_hruser`.`work_life`, `hr_hruser`.`mail`, `hr_hruser`.`phone`, `hr_hruser`.`work_type`, `hr_hruser`.`first_impression`, `hr_hruser`.`alltxt`, `hr_hruser`.`comment`, `hr_hruser`.`create_time`, `hr_hruser`.`update_time`, `auth_user`.`id`, `auth_user`.`password`, `auth_user`.`last_login`, `auth_user`.`is_superuser`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`date_joined` FROM `hr_hruser` INNER JOIN `auth_user` ON (`hr_hruser`.`uploader_id` = `auth_user`.`id`) ORDER BY `hr_hruser`.`create_time` DESC, `hr_hruser`.`id` DESC LIMIT 20

共耗时:3.75s

以上sql的执行顺序为先拿到数据库所有数据(1332条),并做排序处理,然后在拿到前20条数据。

我尝试了对sql做一些优化测试:

仅去掉cv-data字段:

SELECT DISTINCT `hr_hruser`.`id`, `hr_hruser`.`username`, `hr_hruser`.`cv_name`,`hr_hruser`.`uploader_id`, `hr_hruser`.`skill_tags`, `hr_hruser`.`cv_status`, `hr_hruser`.`gender`, `hr_hruser`.`education`, `hr_hruser`.`language`, `hr_hruser`.`language_level`, `hr_hruser`.`age`, `hr_hruser`.`work_life`, `hr_hruser`.`mail`, `hr_hruser`.`phone`, `hr_hruser`.`work_type`, `hr_hruser`.`first_impression`, `hr_hruser`.`alltxt`, `hr_hruser`.`comment`, `hr_hruser`.`create_time`, `hr_hruser`.`update_time`, `auth_user`.`id`, `auth_user`.`password`, `auth_user`.`last_login`, `auth_user`.`is_superuser`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`date_joined` FROM `hr_hruser` INNER JOIN `auth_user` ON (`hr_hruser`.`uploader_id` = `auth_user`.`id`) ORDER BY `hr_hruser`.`create_time` DESC, `hr_hruser`.`id` DESC LIMIT 20

共耗时:0.26s

仅去掉排序条件:

SELECT DISTINCT `hr_hruser`.`id`, `hr_hruser`.`username`, `hr_hruser`.`cv_name`, `hr_hruser`.`cv_data`, `hr_hruser`.`uploader_id`, `hr_hruser`.`skill_tags`, `hr_hruser`.`cv_status`, `hr_hruser`.`gender`, `hr_hruser`.`education`, `hr_hruser`.`language`, `hr_hruser`.`language_level`, `hr_hruser`.`age`, `hr_hruser`.`work_life`, `hr_hruser`.`mail`, `hr_hruser`.`phone`, `hr_hruser`.`work_type`, `hr_hruser`.`first_impression`, `hr_hruser`.`alltxt`, `hr_hruser`.`comment`, `hr_hruser`.`create_time`, `hr_hruser`.`update_time`, `auth_user`.`id`, `auth_user`.`password`, `auth_user`.`last_login`, `auth_user`.`is_superuser`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`date_joined` FROM `hr_hruser` INNER JOIN `auth_user` ON (`hr_hruser`.`uploader_id` = `auth_user`.`id`) LIMIT 20

共耗时:0.03s

最后,我又尝试对create_time、id字段创建索引,结果对性能提升也是微乎其微。

4、将加工的数据传到模板中进行渲染

 

结论:

       当同时存在BinaryField类型字段和排序条件时,mysql的性能会非常差。即使对排序字段创建索引和联合索引也无济于事。

 

四、解决办法

1、针对分页器的优化

思路:ORM具有惰性,在ORM转成sql真正去数据库查询前进行拦截,获取并修改sql,去掉无关字段再执行sql。

自定义分页器,重写count方法:

1、新建paginator.py文件:

from django.core.paginator import Paginator
from django.utils.functional import cached_property

from utils.ORM_performance import ORM

class HRMSPaginator(Paginator):

    @cached_property
    def count(self):
        """
        Returns the total number of objects, across all pages.
        """
        try:
            return ORM.count(self.object_list)
        except (AttributeError, TypeError):
            return len(self.object_list)

2、新建ORM_performance.py文件:

from django.db import connection

class ORM(object):

    _bloated_field = ["`hr_hruser`.`cv_data`", "`hr_interview`.`cv_safe_data`", "`hr_interview`.`evaluation_1_data`",
                      "`hr_interview`.`evaluation_2_data`", "`hr_interview`.`evaluation_3_data`",
                      "`hr_interview`.`offer_data`"
                      ] #这些字段都是BinaryField类型

    @classmethod
    def count(cls, queryset):
        """receive a queryset and return the length of the queryset"""
        cursor = connection.cursor()
        sql = queryset.query.__str__()
        #去掉无关字段
        for field in cls._bloated_field:
            if field in sql:
                sql = sql.replace(field, "")
        new_sql = sql.replace(", ,", ",")
        try:
            #new_sql中当包含以字符串为参数的查询时,execute会报错,因为new_sql中的字符串并没有被引号引起来。而sql语句要求字符串必须要有引号的。
            return cursor.execute(new_sql)
        except:
            try:
                return queryset.count()
            except (AttributeError, TypeError):
                return len(queryset)
        finally:
            cursor.close()

3、修改\admin\views\list.py

class ListAdminView(ModelAdminView):
    """
    Display models objects view. this class has ordering and simple filter features.
    """
    list_display = ('__str__',)
    list_display_links = ()
    list_display_links_details = False
    list_select_related = None
    list_per_page = 50
    list_max_show_all = 200
    list_exclude = ()
    search_fields = ()
    paginator_class = HRMSPaginator   #修改一:使用自定义paginator
    ordering = None


    def make_result_list(self):
        # Get search parameters from the query string.
        self.base_queryset = self.queryset()
        self.list_queryset = self.get_list_queryset()
        self.ordering_field_columns = self.get_ordering_field_columns()
        self.paginator = self.get_paginator()

        # Get the number of objects, with admin filters applied.
        self.result_count = self.paginator.count

        self.can_show_all = self.result_count <= self.list_max_show_all
        self.multi_page = self.result_count > self.list_per_page

        # Get the list of objects to display on this page.
        if (self.show_all and self.can_show_all) or not self.multi_page:
            self.result_list = self.list_queryset._clone()
        else:
            try:
                self.result_list = self.paginator.page(
                    self.page_num + 1).object_list
            except InvalidPage:
                if ERROR_FLAG in self.request.GET.keys():
                    return SimpleTemplateResponse('xadmin/views/invalid_setup.html', {
                        'title': _('Database error'),
                    })
                return HttpResponseRedirect(self.request.path + '?' + ERROR_FLAG + '=1')
        self.has_more = self.result_count > (
            self.list_per_page * self.page_num + ORM.count(self.result_list)) #修改二:使用自定义方法查询queryset长度

2、针对数据渲染的优化

如果排序需求不是很大,可以先去掉排序。

也可考虑将cv_data这样的字段单独存储到一张表,以外键关联,专门存储二进制或其他较大的文本数据。

 

五、结论

1、数据库最好不用存放大文件,严格说不要使用BinaryField类型字段,如果必须,那就用单独一张表保存,以外键与业务表关联。

2、User.objects.all()、User.objects.filter()、User.objects.get()等类似的ORM转化成SQL时,通常都会产生一个类似select  field1, field2...所有字段  from  ..  where .. 这样的臃肿语句,即使你仅仅只想查找一个字段。所以尽量使用User.objects.values()、User.objects.values_list()来避免以上问题。并且包含BinaryField类型的字段时,会对SQL性能造成非常巨大影响。

3、select * form hr_hruser order by id limit 0,20这样的语句执行顺序为先取出表中所有数据,然后进行排序,最后拿到前20条。如果数据量非常大,那排序将是非常耗时的。当select username, cv_data  .. form hr_hruser order by id limit 0,20这样的语句同时存在BinaryField和排序时,即使数据量很小,查询也会非常耗时。去掉任意一个条件(BinaryField或者order by)即可改善。

 

 

 

 

 

 

  • 4
    点赞
  • 7
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值