lxml python
Python lxml is a feature-rich and easy-to-use library for processing XML and HTML data. Python scripts are often written for tasks like web scraping and parsing XML. In this lesson, we will study the Python lxml library and how we can use it to parse XML data and perform web scraping.
Python lxml library
Python lxml is an easy-to-use, feature-rich library for processing and parsing XML and HTML documents. Its API covers most of what is needed to work with these two types of data. The two main points which make lxml stand out are:
- Ease of use: its syntax is simpler than that of many other libraries
- Performance: even large XML files are processed quickly
Python lxml install
We can start using lxml by installing it as a Python package with the pip tool:
pip install lxml
Once the library is installed, we can get started with simple examples.
Creating HTML Elements
With lxml, we can create HTML elements as well. The elements are also called nodes. Let's create the basic structure of an HTML page using just the library:
from lxml import etree
root_elem = etree.Element('html')
etree.SubElement(root_elem, 'head')
etree.SubElement(root_elem, 'title')
etree.SubElement(root_elem, 'body')
print(etree.tostring(root_elem, pretty_print=True).decode("utf-8"))
When we run this script, we can see the HTML elements, or nodes, being formed. The pretty_print parameter makes tostring() return an indented version of the HTML document.
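As a small variation on the script above, the sketch below nests title under head, which is where a real HTML page would keep it. The variable names root, head and markup are chosen for illustration only.

```python
from lxml import etree

# SubElement() creates a new element and appends it to the given parent.
root = etree.Element("html")
head = etree.SubElement(root, "head")
etree.SubElement(head, "title")   # child of head, not of html
etree.SubElement(root, "body")

# tostring() returns bytes by default, so decode before printing.
markup = etree.tostring(root, pretty_print=True).decode("utf-8")
print(markup)
```

Because the parent is passed explicitly to SubElement(), the nesting of the printed document follows the order and parents used in the calls.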
An element behaves like a list of its child elements, so we can access the children by index:
html = root_elem[0]
print(html.tag)
This prints head, as that is the first tag inside the html tag. We can also print all elements inside the root tag:
for element in root_elem:
    print(element.tag)
This prints the tags head, title and body, one per line.
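Since an element acts like a list of its children, the usual list operations apply to it. The following is a minimal sketch that rebuilds the same three children and tries len(), negative indexing and slicing; the variable names are illustrative.

```python
from lxml import etree

root = etree.Element("html")
for tag in ("head", "title", "body"):
    etree.SubElement(root, tag)

child_count = len(root)                          # number of direct children
last_tag = root[-1].tag                          # negative indexing works
first_two = [child.tag for child in root[:2]]    # slicing yields a list of elements
all_tags = [child.tag for child in root]         # plain iteration over children

print(child_count, last_tag, first_two, all_tags)
```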
Checking validity of HTML Elements
With the iselement() function, we can check whether a given object is a valid element object (it does not validate the HTML itself):
print(etree.iselement(root_elem))
Here we reuse root_elem from the script above. Since root_elem is an element, this prints True.
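To make the check more concrete, the sketch below calls iselement() on an element and on a plain string that merely looks like markup. The names elem, is_elem and is_str are illustrative.

```python
from lxml import etree

elem = etree.Element("html")

is_elem = etree.iselement(elem)       # an actual Element object
is_str = etree.iselement("<html/>")   # just a string, never parsed

print(is_elem, is_str)
```

A string is not an element until it has been parsed, for example with fromstring().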
Using attributes with HTML Elements
We can add metadata to each HTML element we construct by adding attributes to the elements we make:
from lxml import etree
html_elem = etree.Element("html", lang="en_GB")
print(etree.tostring(html_elem))
When we run this, we see the serialized html element carrying its lang attribute. We can now access the attribute as:
print(html_elem.get("lang"))
The value, en_GB, is printed to the console. Note that if the attribute doesn't exist for the given HTML element, we will get None as output.
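The lookup behaviour described above can be sketched as follows. The dir attribute and the "ltr" fallback are made up for illustration; get() accepts an optional default, and attrib exposes the attributes as a dictionary-like object.

```python
from lxml import etree

html_elem = etree.Element("html", lang="en_GB")

lang = html_elem.get("lang")             # existing attribute
missing = html_elem.get("dir")           # absent attribute gives None
fallback = html_elem.get("dir", "ltr")   # absent attribute with a default
attrs = dict(html_elem.attrib)           # copy of all attributes

print(lang, missing, fallback, attrs)
```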
We can also set attributes for an HTML element as:
html_elem.set("best", "JournalDev")
print(html_elem.get("best"))
When we print the value, we get the expected result, JournalDev.
Sub-Elements with values
The sub-elements we constructed above were empty, and that is no fun! Let's make some sub-elements and put some values in them using the lxml library.
from lxml import etree
html = etree.Element("html")
etree.SubElement(html, "head").text = "Head of HTML"
etree.SubElement(html, "title").text = "I am the title!"
etree.SubElement(html, "body").text = "Here is the body"
print(etree.tostring(html, pretty_print=True).decode('utf-8'))
This looks like some healthy data. In the output, each sub-element now appears with its text between its opening and closing tags.
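Once sub-elements carry text, it can be read back as well. The sketch below rebuilds the same tree and reads the values with find() and findtext(); the variable names are illustrative.

```python
from lxml import etree

html = etree.Element("html")
etree.SubElement(html, "head").text = "Head of HTML"
etree.SubElement(html, "title").text = "I am the title!"
etree.SubElement(html, "body").text = "Here is the body"

title_text = html.find("title").text                # first matching child, then its text
body_text = html.findtext("body")                   # shortcut for the same lookup
texts = {child.tag: child.text for child in html}   # tag -> text for every child

print(title_text, body_text, texts)
```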
Feeding RAW XML for Serialisation
We can pass raw XML data directly to etree, which parses it into a tree of elements:
from lxml import etree
html = etree.XML('<html><head>Head of HTML</head><title>I am the title!</title><body>Here is the body</body></html>')
print(etree.tostring(html, pretty_print=True).decode('utf-8'))
The output shows the parsed document, indented, with the same elements and text we passed in. If you want the output to include the XML declaration, that is possible as well:
from lxml import etree
html = etree.XML('<html><head>Head of HTML</head><title>I am the title!</title><body>Here is the body</body></html>')
print(etree.tostring(html, xml_declaration=True).decode('utf-8'))
The output now begins with an XML declaration, followed by the document.
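The encoding named in the declaration can be chosen through the encoding parameter of tostring(). The following is a minimal sketch with a shortened document; passing encoding="unicode" instead returns a str with no declaration.

```python
from lxml import etree

html = etree.XML("<html><body>Here is the body</body></html>")

# Bytes output with a declaration that names the chosen encoding.
serialized = etree.tostring(html, xml_declaration=True, encoding="UTF-8")
print(serialized.decode("utf-8"))

# Text output: a str is returned and no declaration is written.
as_text = etree.tostring(html, encoding="unicode")
print(as_text)
```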
Python lxml etree parse() function
The parse() function can be used to parse from files and file-like objects:
from lxml import etree
from io import StringIO
title = StringIO("<title>Title Here</title>")
tree = etree.parse(title)
print(etree.tostring(tree))
This prints the serialized title element as a bytes object.
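parse() returns an ElementTree rather than a single element, so the root has to be fetched with getroot(). The sketch below assumes a small in-memory document wrapped in a made-up page element, and uses BytesIO as the file-like object.

```python
from io import BytesIO
from lxml import etree

# A file-like object holding bytes, standing in for an opened file.
source = BytesIO(b"<page><title>Title Here</title></page>")

tree = etree.parse(source)   # an ElementTree wrapping the whole document
root = tree.getroot()        # the top-level element

print(root.tag, root[0].tag, root[0].text)
```

A path to an XML file on disk can be passed to parse() in place of the file-like object.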
Python lxml etree fromstring() function
The fromstring() function can be used to parse strings:
from lxml import etree
title = "<title>Title Here</title>"
root = etree.fromstring(title)
print(root.tag)
This prints title, the tag of the root element.
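When the string is not well-formed XML, fromstring() raises etree.XMLSyntaxError. The helper below, parse_or_none, is a hypothetical name used only in this sketch; it returns None instead of letting the error propagate.

```python
from lxml import etree


def parse_or_none(markup):
    """Return the root element, or None if the markup is not well-formed."""
    try:
        return etree.fromstring(markup)
    except etree.XMLSyntaxError:
        return None


good = parse_or_none("<title>Title Here</title>")
bad = parse_or_none("<title>Title Here")   # closing tag is missing

print(good.tag, bad)
```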
Python lxml etree XML() function
The XML() function can be used to write XML literals directly into the source:
from lxml import etree
title = etree.XML("<title>Title Here</title>")
print(title.tag)
print(etree.tostring(title))
This prints the tag, title, followed by the serialized element as a bytes object.
Reference: LXML Documentation.