lxml python_python lxml

lxml python

Python lxml is the most feature-rich and easy-to-use library for processing XML and HTML data. Python scripts are written to perform many tasks like Web scraping and parsing XML. In this lesson, we will study about python lxml library and how we can use it to parse XML data and perform web scraping as well.

Python lxml是功能最丰富且易于使用的库,用于处理XML和HTML数据。 编写Python脚本可以执行许多任务,例如Web抓取和解析XML。 在本课程中,我们将研究python lxml库以及如何使用它解析XML数据并执行Web抓取。

Python LXML库 (Python lxml library)

python lxml

Python lxml is an easy to use and feature rich library to process and parse XML and HTML documents. lxml is really nice API as it provides literally everything to process these 2 types of data. The two main points which make lxml stand out are:


Python lxml是易于使用且功能丰富的库,用于处理和解析XML和HTML文档。 lxml是一个非常不错的API,因为它提供了处理这两种类型数据的所有内容。 使lxml脱颖而出的两个要点是:

  • Ease of use: It has very easy syntax than any other library present

    易于使用 :它比现有的任何其他库都具有非常简单的语法
  • Performance: Processing even large XML files takes very less time

    性能 :处理大型XML文件所需的时间非常少

Python lxml安装 (Python lxml install)

We can start using lxml by installing it as a python package using pip tool:

我们可以通过使用pip工具将lxml作为python软件包安装来开始使用lxml:

pip install lxml

Once we are done with installing this tool, we can get started with simple examples.

安装完此工具后,我们就可以从简单的示例开始。

创建HTML元素 (Creating HTML Elements)

With lxml, we can create HTML elements as well. The elements can also be calles as the Nodes. Let’s create basic structure of an HTML page using just the library:

使用lxml,我们也可以创建HTML元素。 元素也可以称为节点。 让我们仅使用库来创建HTML页面的基本结构:

from lxml import etree

root_elem = etree.Element('html')
etree.SubElement(root_elem, 'head')
etree.SubElement(root_elem, 'title')
etree.SubElement(root_elem, 'body')
print(etree.tostring(root_elem, pretty_print=True).decode("utf-8"))

When we run this script, we can see the HTML elements being formed:

python lxml example

We can see HTML elements or nodes being made. The pretty_print parameter helps to print indented version of HTML document.

运行此脚本时,我们可以看到正在形成HTML元素:

我们可以看到正在制作HTML元素或节点。 pretty_print参数有助于打印HTML文档的缩进版本。

These HTML elements are basically a list. We can access this list normally:

这些HTML元素基本上是一个列表 。 我们可以正常访问此列表:

html = root_elem[0]
print(html.tag)

And this will just print head as that is the tag present right inside html tag. We can also print all elements inside the root tag:

这将只是打印 head因为那是html标签内的标签。 我们还可以打印root标记内的所有元素:

for element in root_elem:
    print(element.tag)

This will print all tags:

python lxml nodes

这将打印所有标签:

检查HTML元素的有效性 (Checking validity of HTML Elements)

With iselement() function, we can even check if given element is a valid HTML element:

使用iselement()函数,我们甚至可以检查给定的元素是否为有效HTML元素:

print(etree.iselement(root_elem))

We just used the last script we wrote. This will give a simple output:

python lxml iselement

我们只是使用了最后编写的脚本。 这将给出一个简单的输出:

在HTML元素中使用属性 (Using attributes with HTML Elements)

We can add metadata to each HTML element we construct by adding attributes to the elements we make:

通过将属性添加到我们制作的元素中,我们可以将元数据添加到我们构造的每个HTML元素中:

from lxml import etree

html_elem = etree.Element("html", lang="en_GB")
print(etree.tostring(html_elem))

When we run this, we see:

python lxml attributes

We can now access these attributes as:

运行此命令时,我们看到:

现在,我们可以按以下方式访问这些属性:

print(html_elem.get("lang"))

Value is printed to the console:

python lxml element

Note that is the attribute doesn’t exist for given HTML element, we will get None as output.

值将打印到控制台:

注意,对于给定HTML元素,该属性不存在,我们将获得None作为输出。

We can also set attributes for an HTML element as:

我们还可以将HTML元素的属性设置为:

html_elem.set("best", "JournalDev")
print(html_elem.get("best"))

When we print the value, we get the expected results:

python lxml set attribute value

当我们打印值时,我们得到了预期的结果:

带有值的子元素 (Sub-Elements with values)

Sub-elements we constructed above were empty and that is no fun! Let’s make some sub-elements and put some values in it using lxml library.

我们上面构造的子元素是空的,这没什么好玩的! 让我们使用lxml库制作一些子元素并将一些值放入其中。

from lxml import etree

html = etree.Element("html")
etree.SubElement(html, "head").text = "Head of HTML"
etree.SubElement(html, "title").text = "I am the title!"
etree.SubElement(html, "body").text = "Here is the body"

print(etree.tostring(html, pretty_print=True).decode('utf-8'))

This looks like some healthy data. Let’s see the output:

python lxml etree SubElement text

这看起来像一些健康的数据。 让我们看一下输出:

馈送RAW XML以进行序列化 (Feeding RAW XML for Serialisation)

We can provide RAW XML data directly to etree and parse it as well as it completely understands what is passed to it.

我们可以直接将RAW XML数据提供给etree并对其进行解析,也可以完全理解传递给它的内容。

from lxml import etree

html = etree.XML('<html><head>Head of HTML</head><title>I am the title!</title><body>Here is the body</body></html>')
print(etree.tostring(html, pretty_print=True).decode('utf-8'))

Let’s see the output:

python lxml serialization

If you want the data to include the root XML tag declaration, even that is possible:

让我们看一下输出:

如果您希望数据包括根XML标签声明,那么甚至可以:

from lxml import etree

html = etree.XML('<html><head>Head of HTML</head><title>I am the title!</title><body>Here is the body</body></html>')
print(etree.tostring(html, xml_declaration=True).decode('utf-8'))

Let’s see the output now:

python lxml serialization xml data

现在看一下输出:

Python LXML etree parse()函数 (Python lxml etree parse() function)

The parse() function can be used to parse from files and file-like objects:

parse()函数可用于从文件和类似文件的对象中进行解析:

from lxml import etree
from io import StringIO

title = StringIO("<title>Title Here</title>")
tree = etree.parse(title)

print(etree.tostring(tree))

Let’s see the output now:

python lxml etree parse function

现在看一下输出:

Python LXML etree fromstring()函数 (Python lxml etree fromstring() function)

The fromstring() function can be used to parse Strings:

fromstring()函数可用于解析字符串:

from lxml import etree

title = "<title>Title Here</title>"
root = etree.fromstring(title)
print(root.tag)

Let’s see the output now:

python lxml etree fromstring function

现在看一下输出:

Python lxml etree XML()函数 (Python lxml etree XML() function)

The fromstring() function can be used to write XML literals directly into the source:

fromstring()函数可用于将XML文字直接写入源代码:

from lxml import etree

title = etree.XML("<title>Title Here</title>")
print(title.tag)
print(etree.tostring(title))

Let’s see the output now:

python lxml etree XML function

现在看一下输出:

Reference: LXML Documentation.

参考: LXML文档

翻译自: https://www.journaldev.com/18043/python-lxml

lxml python

  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值