Django Source Code 01: How Is strip_tags, the Function That Strips HTML Tags, Implemented?

Article link: https://blog.csdn.net/gaifuxi9518/article/details/90166552

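Django has a function for stripping HTML tags called strip_tags. It lives in django.utils.html and is handy for certain string-processing tasks. I was curious how it is implemented, so I opened the corresponding Django source file to find out. Jumping straight to the function in PyCharm, we can see that strip_tags is defined like this: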

@keep_lazy_text
def strip_tags(value):
    """Return the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    value = force_text(value)
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value

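Leave the decorator at the top aside for now and look first at the function's parameter and return value. The parameter is value, which in practice is a chunk of HTML. The return value is also value, but a stripped-down version of it: the content with all HTML tags removed.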

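The input first passes through a function called force_text. We can guess that it does some preprocessing on value, but what exactly? Jumping to the definition of force_text, we find it in encoding.py in the same directory. As the file name suggests, this module mainly deals with converting between encodings: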

def force_text(s, encoding='utf-8', strings_only=False, errors='strict'):
    """
    Similar to smart_text, except that lazy instances are resolved to
    strings, rather than kept as lazy objects.

    If strings_only is True, don't convert (some) non-string-like objects.
    """
    # Handle the common case first for performance reasons.
    
    # First check whether s is already a string; if so, return it unchanged
    if issubclass(type(s), str):
        return s
    
    # Not executed here, because strings_only defaults to False
    if strings_only and is_protected_type(s):
        return s
    
    # If s is bytes, decode it with the given encoding; otherwise call str()
    # on it. Either way the result is a string
    try:
        if isinstance(s, bytes):
            s = str(s, encoding, errors)
        else:
            s = str(s)
    except UnicodeDecodeError as e:
        raise DjangoUnicodeDecodeError(s, *e.args)
    return s

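I have added comments above. As they show, the job of force_text is to guarantee that value is a string so that later processing can rely on it. It uses a few built-ins that deserve a brief explanation. issubclass(parm1, parm2) takes two arguments and checks whether the class parm1 is a subclass of the class parm2; it returns a boolean.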
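The behaviour described above can be reproduced with a few lines of standard-library Python. What follows is only a minimal sketch, not Django's code: the name to_text is hypothetical, the strings_only branch is left out, and a decoding failure raises the built-in UnicodeDecodeError rather than DjangoUnicodeDecodeError.

```python
def to_text(s, encoding='utf-8', errors='strict'):
    """Hypothetical stand-in for force_text: always return a str."""
    # Already a str (or an instance of a str subclass): return it unchanged.
    if issubclass(type(s), str):
        return s
    # bytes are decoded with the given encoding ...
    if isinstance(s, bytes):
        return str(s, encoding, errors)
    # ... and anything else is converted with str().
    return str(s)


print(to_text('django'))          # a str passes through unchanged
print(to_text(b'caf\xc3\xa9'))    # UTF-8 bytes are decoded to text
print(to_text(2019))              # other objects go through str()
```

Whatever the input type, the caller gets a str back, which is what lets strip_tags use the in operator and len() on value without further checks.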

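Both type() and isinstance() can be used to check the type of a value. type() is usually called with a single argument and returns that argument's type, for example: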

>>> type('django')
<class 'str'>

isinstance(parm1, parm2) takes two arguments and returns a boolean indicating whether parm1 is an instance of the type parm2.

>>> isinstance('django',str)
True

However, type() and isinstance() differ:

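type() does not treat an instance of a subclass as an instance of the parent class; it ignores inheritance.
isinstance() treats an instance of a subclass as an instance of the parent class; it takes inheritance into account.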

To check whether an object is of a given type, isinstance() is the recommended choice.
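The difference is easiest to see with a subclass of str. The class Tag below is hypothetical and exists only for this illustration:

```python
class Tag(str):
    """Hypothetical str subclass used only for this illustration."""


t = Tag('div')

exact_match = type(t) is str                 # type() ignores inheritance
instance_match = isinstance(t, str)          # isinstance() follows inheritance
subclass_match = issubclass(type(t), str)    # the form of check force_text uses

print(exact_match, instance_match, subclass_match)  # False True True
```

The issubclass(type(s), str) check in force_text therefore accepts str subclasses as well, just as isinstance(s, str) would here.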

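Continuing with the strip_tags source, next comes a while loop whose job is to remove the HTML tags. The loop condition checks whether value still contains both < and >; if so, another stripping pass is made.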

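The source comment explains this as well: in the typical case the loop body runs only once, meaning a single pass removes all the HTML tags. The loop condition is strictly speaking redundant, but it helps reduce the number of calls to _strip_once.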
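To see why a loop can be needed at all, here is a rough sketch of the same loop shape with a deliberately simple stripping step. The regular expression is a hypothetical stand-in for _strip_once and is not how Django removes tags; the names strip_once_sketch and strip_tags_sketch are made up for this example.

```python
import re

# Hypothetical stand-in for _strip_once: remove every "<...>" run that
# contains no nested angle brackets. Django itself uses HTMLParser instead.
TAG_RE = re.compile(r'<[^<>]*>')


def strip_once_sketch(value):
    return TAG_RE.sub('', value)


def strip_tags_sketch(value):
    """Return the stripped text and the number of passes that removed something."""
    passes = 0
    while '<' in value and '>' in value:
        new_value = strip_once_sketch(value)
        if len(new_value) >= len(value):
            # Nothing was removed, so another pass would not help.
            break
        value = new_value
        passes += 1
    return value, passes


print(strip_tags_sketch('<p>hello</p>'))        # ('hello', 1)
print(strip_tags_sketch('<<b>i>text<</b>/i>'))  # ('text', 2)
print(strip_tags_sketch('>_<'))                 # ('>_<', 0)
```

In the second call, removing the inner <b> and </b> leaves behind a fresh <i>...</i> pair, so a second pass is required. The third call shows the length check ending the loop when the text contains both characters but nothing that can be stripped.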

def strip_tags(value):
    # Preprocess the input so that value is guaranteed to be a string
    value = force_text(value)
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value

Inside the loop there is a call to _strip_once. This function is the real core, because it does the actual tag removal. Let's jump to it:

def _strip_once(value):
    """
    Internal tag stripping utility used by strip_tags.
    """
    s = MLStripper()
    try:
        s.feed(value)
    except HTMLParseError:
        return value
    try:
        s.close()
    except HTMLParseError:
        return s.get_data() + s.rawdata
    else:
        return s.get_data()

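It sits right above strip_tags. The function first instantiates an MLStripper object, then calls a few of its methods, and finally returns the data with the HTML tags removed. Let's look at that class: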

class MLStripper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        self.fed.append('&#%s;' % name)

    def get_data(self):
        return ''.join(self.fed)

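At this point it became clear to me: Django is using Python's built-in HTML parsing tool, HTMLParser, and only one part of it at that, namely the ability to extract the content between HTML tags. handle_data, handle_entityref and handle_charref override methods of HTMLParser, while get_data is a method Django defines itself to join the collected data together.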
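The same idea can be tried outside Django with nothing but the standard library. The sketch below is a hypothetical re-creation for illustration, not Django's class: the name TextOnlyParser is made up, and passing convert_charrefs=False is an assumption of this sketch so that entities are reported through handle_entityref and handle_charref.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical re-creation of the MLStripper idea for illustration."""

    def __init__(self):
        # Assumption of this sketch: with convert_charrefs=False the parser
        # reports entities via handle_entityref / handle_charref instead of
        # decoding them inside handle_data.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        # Text between tags is collected as-is.
        self.fed.append(d)

    def handle_entityref(self, name):
        # Named entities such as &amp; are written back unchanged.
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        # Numeric references such as &#39; are written back unchanged.
        self.fed.append('&#%s;' % name)

    def get_data(self):
        return ''.join(self.fed)


parser = TextOnlyParser()
parser.feed('<p>Hello <b>world</b> &amp; more</p>')
parser.close()
text = parser.get_data()
print(text)
```

Because the tag handlers are left at their do-nothing defaults, start and end tags are simply dropped, and only the text and entity callbacks contribute to the result.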

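I won't go further into how to use HTMLParser here; I'll write a dedicated article on it later. That's it for this article. To close, a few thoughts on reading source code: to understand it you really need to know object-oriented programming, because many functions in a framework call one another and many classes inherit from one another. Without a solid grasp of object orientation, it is easy to get lost among the layers.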

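Another point is that a framework considers things thoroughly and systematically, which I currently find hard to do in my own code. So it pays to read more official source code and to learn from well-written, well-structured code!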