Python HTML Parsing
The Python html.parser module provides the HTMLParser class, which can be subclassed to parse HTML-formatted text. The same logic can easily be adapted to process the HTML returned by an HTTP request, for example one made with an HTTP client.
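As a rough sketch of the HTTP case, the parser can simply be fed the decoded response body. The names TitleParser, fetch_html and extract_title are hypothetical helpers introduced for illustration, and the sketch assumes the server is reachable over HTTPS and declares its charset (falling back to UTF-8 otherwise).

```python
import http.client
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def fetch_html(host, path="/"):
    """Fetch a page with http.client and return its body as text (assumes HTTPS)."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
    finally:
        conn.close()


def extract_title(html_text):
    """Feed HTML text to TitleParser and return the collected title."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

With a reachable host of your choice, the two helpers combine as extract_title(fetch_html("www.python.org")); the host name here is only an example.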
The class definition for HTMLParser looks like this:
class html.parser.HTMLParser(*, convert_charrefs=True)
In this lesson, we will subclass HTMLParser to observe how its handler methods behave and experiment with them. Let’s get started.
Python HTML Parser
As the class definition of HTMLParser shows, when convert_charrefs is True (the default), all character references, except those inside script/style elements, are automatically converted to the corresponding Unicode characters.
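The effect of convert_charrefs can be checked with a small sketch like the one below. DataCollector and collect_text are hypothetical names used only for this illustration; the chunks are joined because the parser may deliver text in more than one handle_data call.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Records every chunk passed to handle_data (hypothetical helper)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_text(html_text, convert_charrefs=True):
    """Return all text data seen by the parser, joined into one string."""
    parser = DataCollector(convert_charrefs=convert_charrefs)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks)


# Default behaviour: the reference is converted before handle_data sees it.
in_paragraph = collect_text("<p>a &amp; b</p>")

# Inside script/style the text is passed through untouched.
in_script = collect_text("<script>x &amp; y</script>")

# With convert_charrefs=False the reference goes to handle_entityref
# (a no-op here), so it is missing from the collected data.
unconverted = collect_text("<p>a &amp; b</p>", convert_charrefs=False)

print(repr(in_paragraph), repr(in_script), repr(unconverted))
```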
The handler methods of this class (covered in the next section) are called automatically when an instance encounters start tags, end tags, text, comments, and other markup elements in the HTML string fed to it.
To use this class, we subclass it and override the handlers to provide our own behaviour. Before presenting an example, here are the methods of the class that are available for customisation:
handle_startendtag: This method is called for XHTML-style empty tags such as <img ... />. Its default implementation simply passes control to the start-tag and end-tag handlers, as its definition shows:
def handle_startendtag(self, tag, attrs):
    self.handle_starttag(tag, attrs)
    self.handle_endtag(tag)
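A minimal sketch of this delegation follows. EventRecorder, SelfClosingRecorder and record are hypothetical names chosen for the illustration; the first class relies on the default handle_startendtag, while the second overrides it.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Records start and end tag events; keeps the default handle_startendtag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class SelfClosingRecorder(EventRecorder):
    """Overrides handle_startendtag, so self-closing tags get their own event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


def record(parser_cls, html_text):
    """Parse html_text with a fresh parser_cls instance and return its events."""
    parser = parser_cls()
    parser.feed(html_text)
    parser.close()
    return parser.events


default_events = record(EventRecorder, "<br/>")
custom_events = record(SelfClosingRecorder, "<br/>")
plain_events = record(EventRecorder, "<br>")
print(default_events, custom_events, plain_events)
```

Note that a plain <br> without the trailing slash only triggers handle_starttag, since the parser does not know which elements are void.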
handle_starttag: This method is called whenever a start tag is encountered; attrs is a list of (name, value) pairs:
def handle_starttag(self, tag, attrs):
    pass
handle_endtag: This method is called whenever an end tag is encountered in the HTML string:
def handle_endtag(self, tag):
    pass
handle_charref: This method handles numeric character references of the form &#NNN; or &#xNNN;. It is never called when convert_charrefs is True. Its definition is:
def handle_charref(self, name):
    pass
handle_entityref: This method handles named entity references of the form &name;. Like handle_charref, it is never called when convert_charrefs is True. Its definition is:
def handle_entityref(self, name):
    pass
handle_data: This method receives the text data in the HTML string (including the contents of script and style elements) and is one of the most important methods of this class. Its definition is:
def handle_data(self, data):
    pass
handle_comment: This method is called when a comment is encountered in the HTML. Its definition is:
def handle_comment(self, data):
    pass
handle_pi: This method is called when a processing instruction is encountered in the HTML. Its definition is:
def handle_pi(self, data):
    pass
handle_decl: This method is called for declarations in the HTML, such as the doctype. Its definition is:
def handle_decl(self, decl):
    pass
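The less common handlers in the list above can be exercised with a short sketch. MarkupRecorder is a hypothetical name, and the sample markup is made up for the illustration.

```python
from html.parser import HTMLParser


class MarkupRecorder(HTMLParser):
    """Records declarations, processing instructions and comments."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


markup_parser = MarkupRecorder()
markup_parser.feed("<!DOCTYPE html><?proc color='red'><!-- note -->")
markup_parser.close()

for kind, payload in markup_parser.events:
    print(kind, repr(payload))
```

Each handler receives the content without the surrounding delimiters, so the comment handler sees only the text between <!-- and -->.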
Let’s get started by writing a subclass of HTMLParser to see some of these methods in action.
Making a Subclass of HTMLParser
In this example, we will create a subclass of HTMLParser and see how the most common handler methods of this class are called. Here is a sample program which subclasses the HTMLParser class:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Found a start tag:", tag)

    def handle_endtag(self, tag):
        print("Found an end tag :", tag)

    def handle_data(self, data):
        print("Found some data :", data)

parser = MyHTMLParser()
parser.feed('<title>JournalDev HTMLParser</title>'
            '<h1>Python html.parse module</h1>')
Let’s see the output for this program:
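In practice, handlers often accumulate state instead of printing. The sketch below is a variation on the program above that groups the text by the tag it appeared in; TextByTag is a hypothetical name, and the simple stack it uses assumes well-nested markup.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collects text data keyed by the innermost open tag (assumes good nesting)."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.text = {}

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Data may arrive in several chunks, so concatenate per tag.
        if self._stack:
            current = self._stack[-1]
            self.text[current] = self.text.get(current, "") + data


text_parser = TextByTag()
text_parser.feed('<title>JournalDev HTMLParser</title>'
                 '<h1>Python html.parse module</h1>')
text_parser.close()
print(text_parser.text)
```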
The three handler methods shown above are the ones most commonly customised, but they are not the only methods which can be overridden. In the next example, we will cover the other overridable methods.
Overriding HTMLParser Methods
In this example, we will override most of the handler methods of the HTMLParser class. Let’s look at a code snippet of the class:
from html.parser import HTMLParser
from html.entities import name2codepoint

class JDParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

    def handle_comment(self, data):
        print("Comment :", data)

    # The two reference handlers below are only called when the
    # parser is created with convert_charrefs=False.
    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent :", c)

    def handle_decl(self, data):
        print("Decl :", data)

parser = JDParser()
We will now use this class to parse various pieces of HTML. Let’s begin with a doctype string:
parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ' '"https://www.w3.org/TR/html4/strict.dtd">')
Let’s see the output for this program:
Let’s look at a code snippet which passes an img tag:
parser.feed('<img src="https://cdn.journaldev.com/wp-content/uploads/2014/05/Final-JD-Logo.png" alt="The Python logo">')
Let’s see the output for this program:
Notice how the tag was broken down and its attributes were extracted as (name, value) pairs.
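Since attrs is a list of (name, value) tuples, it converts directly into a dictionary. The sketch below collects the attributes of every img tag; ImageCollector and the sample markup are hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Stores the attributes of each <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Attribute names are lowercased by the parser;
            # attributes without a value are reported as None.
            self.images.append(dict(attrs))


img_parser = ImageCollector()
img_parser.feed('<img SRC="logo.png" alt="The Python logo" hidden>')
img_parser.close()
print(img_parser.images)
```

Because the default handle_startendtag delegates to handle_starttag, the same collector also picks up self-closing <img ... /> tags.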
Let’s try the script/style tags as well, whose contents are not parsed as markup:
parser.feed('<script type="text/javascript">'
            'alert("<strong>JournalDev Python</strong>");</script>')
parser.feed('<style type="text/css">#python { color: green }</style>')
Let’s see the output for this program:
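To confirm that markup inside a script element reaches handle_data as raw text rather than as tags, a sketch along these lines can be used. ScriptProbe is a hypothetical name; the data chunks are joined because the parser may split the text across several calls.

```python
from html.parser import HTMLParser


class ScriptProbe(HTMLParser):
    """Records which start tags were seen and all text data."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


probe = ScriptProbe()
probe.feed('<script type="text/javascript">'
           'alert("<strong>hi</strong>");</script>')
probe.close()

script_text = "".join(probe.chunks)
print(probe.start_tags, script_text)
```

The strong tag inside the script never shows up as a start tag; it is simply part of the script text.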
Parsing comments is also possible with this instance:
parser.feed('<!-- This marks the beginning of samples. -->' '<!--[if IE 9]>IE-specific content<![endif]-->')
With this approach, we can also detect IE conditional comments in a web page, since their contents are passed to handle_comment as ordinary comment text.
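A small sketch of this idea: collect every comment and pick out the ones that look like IE conditional comments. CommentCollector is a hypothetical name, and the "[if " prefix test is a simplifying assumption rather than a full parser for conditional-comment syntax.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every comment in the document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


comment_parser = CommentCollector()
comment_parser.feed('<!-- This marks the beginning of samples. -->'
                    '<!--[if IE 9]>IE-specific content<![endif]-->')
comment_parser.close()

# Assumption: a conditional comment starts with "[if " right after "<!--".
conditional = [c for c in comment_parser.comments if c.startswith("[if ")]
print(conditional)
```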
Parsing Named and Numeric References
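Here is a sample with which we can parse character references and convert them to the correct character at runtime. The three references used here are all equivalent to '>'. Note that handle_entityref and handle_charref are only called when the parser is created with convert_charrefs=False; with the default settings, the converted text is passed to handle_data instead:
parser.feed('&gt;&#62;&#x3E;')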
Let’s see the output for this program:
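To see the reference handlers themselves being called, the parser has to be created with convert_charrefs=False. The sketch below records each reference together with the character it decodes to; RefRecorder is a hypothetical name.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefRecorder(HTMLParser):
    """Records named and numeric references along with their decoded character."""

    def __init__(self):
        # The reference handlers are only used when conversion is disabled.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_entityref(self, name):
        self.refs.append(("named", name, chr(name2codepoint[name])))

    def handle_charref(self, name):
        # Hexadecimal references arrive with a leading "x" (or "X").
        if name.lower().startswith("x"):
            char = chr(int(name[1:], 16))
        else:
            char = chr(int(name))
        self.refs.append(("numeric", name, char))


ref_parser = RefRecorder()
ref_parser.feed('&gt;&#62;&#x3E;')
ref_parser.close()
print(ref_parser.refs)
```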
Parsing Invalid HTML
To an extent, we can also pass invalid HTML to the feed function. Here is a sample in which the closing h1 and a tags appear in the wrong order:
parser.feed('<h1><a class="link" href="#main">Invalid HTML</h1></a>')
Let’s see the output for this program:
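HTMLParser reports tags in the order it meets them and does not validate nesting, so any such check has to be written by hand. The sketch below flags end tags that do not match the innermost open tag; NestingChecker and find_nesting_problems are hypothetical names, and the check deliberately ignores void elements such as br, which would need special handling.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Flags end tags that do not close the innermost open tag."""

    def __init__(self):
        super().__init__()
        self._open = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()
        else:
            # Either nothing is open or a different tag is innermost.
            self.problems.append(tag)


def find_nesting_problems(html_text):
    """Return the list of end tags that were out of place."""
    checker = NestingChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems


bad = find_nesting_problems(
    '<h1><a class="link" href="#main">Invalid HTML</h1></a>')
good = find_nesting_problems(
    '<h1><a class="link" href="#main">Valid HTML</a></h1>')
print(bad, good)
```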
That’s all for parsing HTML data in Python using the html.parser module.
Reference: API Doc