Python HTML Parsing: The Python HTML Parser

Python's html.parser module provides the HTMLParser class, which can be subclassed to parse HTML-formatted text. The same logic can easily be adapted to process the HTML returned by an HTTP request made with an HTTP client.
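As a rough sketch of that idea, the parser below pulls the page title out of HTML text, and a small wrapper fetches the text over HTTP first. The names TitleParser, title_of, and fetch_title are hypothetical, and both the use of urllib.request and the UTF-8 decoding are assumptions made for illustration, not something this article prescribes.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text found inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_of(html_text):
    """Parse HTML that is already in memory and return its title text."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


def fetch_title(url):
    """Fetch a page and parse it; assumes the body is UTF-8 encoded."""
    with urlopen(url) as response:
        body = response.read().decode("utf-8", errors="replace")
    return title_of(body)
```

Keeping the parsing in title_of and the network access in fetch_title means the parser can be exercised on a plain string without touching the network.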

The class definition for HTMLParser looks like:

class html.parser.HTMLParser(*, convert_charrefs=True)

In this lesson, we will subclass HTMLParser to observe how its handler methods behave and experiment with them. Let’s get started.

Python HTML Parser

As we saw in the class definition of HTMLParser, when the value for convert_charrefs is True, all of the character references (except the ones in script/style elements) are converted to the respective Unicode characters.
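A minimal sketch of the difference, assuming a hypothetical TextCollector subclass that only records what handle_data receives. With convert_charrefs=True the text arrives already decoded; with convert_charrefs=False the references are left out of the data and reported to the reference handlers instead.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect(html_text, convert_charrefs):
    parser = TextCollector(convert_charrefs=convert_charrefs)
    parser.feed(html_text)
    parser.close()
    return parser.chunks


sample = "<p>1 &lt; 2 &amp; 3 &#62; 2</p>"
converted = collect(sample, True)   # references decoded inside the text
raw = collect(sample, False)        # references stripped out of the text

print(converted)
print(raw)
```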

The handler methods of this class (which we will see in the next section) are called automatically when an instance encounters start tags, end tags, text, comments, and other markup elements in the HTML string passed to it.

To use this class, we subclass it and override handler methods to provide our own functionality. Before presenting an example, here are the methods of the class that are available for customisation:

  • handle_startendtag: Called when the parser encounters an XHTML-style empty tag such as <img ... />. The default implementation simply passes control to the start-tag and end-tag handlers, as its definition shows:

        def handle_startendtag(self, tag, attrs):
            self.handle_starttag(tag, attrs)
            self.handle_endtag(tag)

  • handle_starttag: Called for each start tag. The tag name is converted to lower case, and attrs is a list of (name, value) pairs:

        def handle_starttag(self, tag, attrs):
            pass

  • handle_endtag: Called for each end tag in the HTML string:

        def handle_endtag(self, tag):
            pass

  • handle_charref: Called for numeric character references of the form &#NNN; or &#xNNN;. It is never called when convert_charrefs is True:

        def handle_charref(self, name):
            pass

  • handle_entityref: Called for named character references of the form &name;. It is never called when convert_charrefs is True:

        def handle_entityref(self, name):
            pass

  • handle_data: Called for arbitrary data, such as text nodes and the contents of script and style elements. It is one of the most important methods of the class:

        def handle_data(self, data):
            pass

  • handle_comment: Called for each comment, with the text between <!-- and --> as data:

        def handle_comment(self, data):
            pass

  • handle_pi: Called for each processing instruction in the HTML:

        def handle_pi(self, data):
            pass

  • handle_decl: Called for declarations in the HTML, such as the doctype:

        def handle_decl(self, decl):
            pass
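To illustrate the delegation done by handle_startendtag, here is a small sketch with a hypothetical EventLogger subclass that overrides only the start-tag and end-tag handlers. The self-closing <br/> still produces both events because the inherited handle_startendtag forwards to them.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records start and end tag events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed("<p>line one<br/>line two</p>")
logger.close()

for event in logger.events:
    print(event)
```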

Let’s start by writing a subclass of HTMLParser to see some of these methods in action.

Making a Subclass of HTMLParser

In this example, we will create a subclass of HTMLParser and see how its most common handler methods are called. Here is a sample program which subclasses the HTMLParser class:

    from html.parser import HTMLParser
    
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Found a start tag:", tag)
    
        def handle_endtag(self, tag):
            print("Found an end tag :", tag)
    
        def handle_data(self, data):
            print("Found some data  :", data)
    
    parser = MyHTMLParser()
    parser.feed('<title>JournalDev HTMLParser</title>'
                '<h1>Python html.parse module</h1>')

Let’s see the output for this program:

Figure: Subclassing HTMLParser class
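To check what the program prints without running it interactively, the sketch below repeats the same subclass and captures its standard output into a list of lines. The use of io.StringIO and contextlib.redirect_stdout is an addition for illustration only.

```python
import io
from contextlib import redirect_stdout
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Found a start tag:", tag)

    def handle_endtag(self, tag):
        print("Found an end tag :", tag)

    def handle_data(self, data):
        print("Found some data  :", data)


buffer = io.StringIO()
with redirect_stdout(buffer):
    parser = MyHTMLParser()
    parser.feed('<title>JournalDev HTMLParser</title>'
                '<h1>Python html.parse module</h1>')
    parser.close()

# One printed line per handler call, in document order.
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Each tag produces a start event, a data event for its text, and an end event, so the two elements yield six lines in total.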


The three handler methods shown above are available for customisation from the class, but they are not the only methods which can be overridden. In the next example, we will cover the other overridable methods.

Overriding HTMLParser Methods

In this example, we will override most of the handler methods of the HTMLParser class. Let’s look at a code snippet of the class:

    from html.parser import HTMLParser
    from html.entities import name2codepoint
    
    class JDParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Start tag:", tag)
            for attr in attrs:
                print("     attr:", attr)
    
        def handle_endtag(self, tag):
            print("End tag  :", tag)
    
        def handle_data(self, data):
            print("Data     :", data)
    
        def handle_comment(self, data):
            print("Comment  :", data)
    
        def handle_entityref(self, name):
            c = chr(name2codepoint[name])
            print("Named ent:", c)
    
        def handle_charref(self, name):
            if name.startswith('x'):
                c = chr(int(name[1:], 16))
            else:
                c = chr(int(name))
            print("Num ent  :", c)
    
        def handle_decl(self, data):
            print("Decl     :", data)
    
    parser = JDParser()

We will now use this class to parse various pieces of HTML, beginning with a doctype declaration:

    parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
                '"https://www.w3.org/TR/html4/strict.dtd">')

Let’s see the output for this program:

Figure: HTMLParser Doctype Parsing
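As a small sketch of what handle_decl receives, the hypothetical DeclCollector below stores each declaration. The handler is given the contents of the declaration without the surrounding <! and > markers.

```python
from html.parser import HTMLParser


class DeclCollector(HTMLParser):
    """Hypothetical helper: records the text of every declaration."""

    def __init__(self):
        super().__init__()
        self.decls = []

    def handle_decl(self, decl):
        self.decls.append(decl)


collector = DeclCollector()
collector.feed("<!DOCTYPE html><p>Hello</p>")
collector.close()

print(collector.decls)
```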

Let’s look at a code snippet which passes an img tag:

    parser.feed('<img src="https://cdn.journaldev.com/wp-content/uploads/2014/05/Final-JD-Logo.png" alt="The Python logo">')

Let’s see the output for this program:

Figure: Parsing an img tag

Notice that the tag is broken down into its parts: the tag name is reported first, and each attribute of the tag is extracted as a (name, value) pair.
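Because attrs is a list of (name, value) tuples, it converts directly into a dictionary. The sketch below assumes a hypothetical ImageCollector subclass and a made-up file name; it also shows that tag and attribute names are lower-cased and that an attribute without a value is reported as None.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper: gathers the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) tuples, so dict() works directly.
            self.images.append(dict(attrs))


collector = ImageCollector()
collector.feed('<IMG SRC="logo.png" ALT="The Python logo" hidden>')
collector.close()

print(collector.images)
```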

Let’s try the script and style tags as well, whose contents are not parsed as markup:

    parser.feed('<script type="text/javascript">'
                'alert("<strong>JournalDev Python</strong>");</script>')
    parser.feed('<style type="text/css">#python { color: green }</style>')

The contents of both elements are passed to handle_data as raw text, so the <strong> inside the script is not reported as a tag.
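This behaviour can be checked with a short sketch. The hypothetical ScriptProbe subclass records which start tags were seen and what data arrived; the data may be delivered in one or several chunks, so the chunks are joined before use.

```python
from html.parser import HTMLParser


class ScriptProbe(HTMLParser):
    """Hypothetical helper: records start tags and raw data chunks."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


probe = ScriptProbe()
probe.feed('<script type="text/javascript">'
           'alert("<strong>JournalDev Python</strong>");</script>')
probe.close()

# handle_data may be called more than once, so join the chunks.
script_text = "".join(probe.chunks)
print(probe.tags)
print(script_text)
```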

Parsing comments is also possible with this instance:

    parser.feed('<!-- This marks the beginning of samples. -->'
                '<!--[if IE 9]>IE-specific content<![endif]-->')

Both the ordinary comment and the IE conditional comment are passed to handle_comment as plain text, so a parser can detect IE-specific markup on a page by inspecting the comment body:

Figure: Parsing Comments
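A minimal sketch of that inspection, assuming a hypothetical CommentSorter subclass and the simple heuristic that a conditional comment body starts with "[if ":

```python
from html.parser import HTMLParser


class CommentSorter(HTMLParser):
    """Hypothetical helper: separates IE conditional comments from the rest."""

    def __init__(self):
        super().__init__()
        self.plain = []
        self.conditional = []

    def handle_comment(self, data):
        # Assumption: conditional comments begin with "[if " right after <!--.
        if data.startswith("[if "):
            self.conditional.append(data)
        else:
            self.plain.append(data.strip())


sorter = CommentSorter()
sorter.feed('<!-- This marks the beginning of samples. -->'
            '<!--[if IE 9]>IE-specific content<![endif]-->')
sorter.close()

print(sorter.plain)
print(sorter.conditional)
```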

Parsing Named and Numeric References

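Character references can be parsed and converted to the corresponding characters as well. Because JDParser() was created with the default convert_charrefs=True, references are converted automatically and delivered to handle_data; to have them routed through handle_entityref and handle_charref instead, create the parser with convert_charrefs=False. The three references below are all equivalent to '>':

    # Disable automatic conversion so the reference handlers are called.
    parser = JDParser(convert_charrefs=False)
    parser.feed('&gt;&#62;&#x3E;')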

Let’s see the output for this program:

Figure: Parsing Character references
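The following self-contained sketch decodes one named, one decimal, and one hexadecimal reference. RefDecoder is a hypothetical name; unlike the class above, it also accepts an upper-case X prefix in hexadecimal references.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefDecoder(HTMLParser):
    """Hypothetical helper: decodes each character reference it is given."""

    def __init__(self):
        # The reference handlers are only called when conversion is off.
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_entityref(self, name):
        self.chars.append(chr(name2codepoint[name]))

    def handle_charref(self, name):
        if name[0] in "xX":
            self.chars.append(chr(int(name[1:], 16)))
        else:
            self.chars.append(chr(int(name)))


decoder = RefDecoder()
decoder.feed("&gt;&#62;&#x3E;")
decoder.close()

print(decoder.chars)
```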

Parsing Invalid HTML

To an extent, we can also pass invalid HTML to the feed function. Here is a sample program in which the closing tags are mis-nested, with </h1> appearing before </a>:

    parser.feed('<h1><a class="link" href="#main">Invalid HTML</h1></a>')

Let’s see the output for this program:

Figure: Parsing Invalid HTML
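The parser reports tags in the order they appear and does not repair the nesting, so any validation is left to the subclass. As a rough sketch, the hypothetical NestingChecker below keeps a stack of open tags and flags any end tag that does not match the most recently opened one. It is deliberately simple and does not account for void elements such as <br>.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper: flags end tags that close out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # The end tag does not match the innermost open element.
            self.problems.append(tag)


checker = NestingChecker()
checker.feed('<h1><a class="link" href="#main">Invalid HTML</h1></a>')
checker.close()

print(checker.problems)
print(checker.open_tags)
```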

That’s all for parsing HTML data in Python using the html.parser module.

Reference: API Doc

Translated from: https://www.journaldev.com/19931/python-html-parser
