Mastering lxml in Python (Part 1)

1. What is lxml?

1.1 Official introduction

  • lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards compliant way.
  • lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

1.2 Official documentation

https://lxml.de/api/lxml-module.html

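1.3 Overview of lxml

  • lxml has several modules, such as etree, html and cssselect, and it can also work with BeautifulSoup (through lxml.html.soupparser).
  • lxml provides a Pythonic API that is almost fully compatible with the ElementTree API.
  • lxml is a Python parsing library that supports HTML and XML parsing as well as XPath queries, and it parses very efficiently.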

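1.4 Overview of XPath

  • XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents.
  • XPath's selection capabilities are powerful: it offers concise path expressions, and its later versions provide more than 100 built-in functions for matching strings, numbers and times and for processing nodes and sequences. Almost any node we want to locate can be selected with XPath. Note that lxml implements XPath 1.0, whose function library is smaller.
  • XPath became a W3C standard on November 16, 1999. It was designed for use by XSLT, XPointer and other XML processing software. More documentation is available on its official site: https://www.w3.org/TR/xpath/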
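
To make the path expressions described above concrete, here is a minimal sketch that runs a few XPath 1.0 queries through lxml. The library document and its tag names are made up for illustration only.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.fromstring(
    "<library>"
    "<book id='1'><title>Dune</title></book>"
    "<book id='2'><title>Emma</title></book>"
    "</library>"
)

titles = doc.xpath("//book/title/text()")     # text of every <title>
ids = doc.xpath("//book/@id")                 # value of every id attribute
second = doc.xpath("/library/book[2]/title")  # positional predicates are 1-based
count = doc.xpath("count(//book)")            # XPath numbers come back as float

print(titles, ids, second[0].text, count)
```

Node-set queries return Python lists, while XPath functions such as count() return plain values.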

1.5 Scope

This series focuses on the etree module of lxml.
The lxml.etree module implements the extended ElementTree API for XML.

2. Main functions and classes of the etree module

2.1 The Element function

  • Element(_tag, attrib=None, nsmap=None, **_extra)
  • Element factory. This function returns an object implementing the Element interface.
  • Also look at the _Element.makeelement() and _BaseParser.makeelement() methods, which provide a faster way to create an Element within a specific document or parser context.
from lxml import etree
test = etree.Element('root', attrib={'Test': 'Try'})  # returns an Element object
print(test)
<Element root at 0x54dfb08>

2.2 The SubElement function

  • SubElement(_parent, _tag, attrib=None, nsmap=None, **_extra)
  • Subelement factory. This function creates an element instance, and appends it to an existing element.
  • nsmap parameter: Namespace prefix->URI mapping known in the context of this Element. This includes all namespace declarations of the parents.
  • To create a child node with SubElement, pass the parent node (an Element object) as the first argument and the child's tag name as the second.
a = etree.SubElement(test, 'a', attrib={'x': '123'})
print(a)
<Element a at 0x5041e88>

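2.3 The tostring function

  • tostring(element_or_tree, encoding=None, method="xml", xml_declaration=None, pretty_print=False, with_tail=True, standalone=None, doctype=None, exclusive=False, inclusive_ns_prefixes=None, with_comments=True, strip_text=False)
  • Serializes an Element or an ElementTree to its string form.
  • Some of the optional parameters:
    pretty_print=False controls whether the output is indented for readability;
    method="xml" selects the output format, and each choice serializes differently. The options are xml, html, text (plain text content without tags, including tail text) and c14n (canonical XML);
    encoding=None selects the output encoding. Without an XML declaration the default is ASCII; pass a different encoding argument to change it. If the chosen encoding is not UTF-8 compatible, an XML declaration is enabled by default.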
print(etree.tostring(test, pretty_print=True))  # pretty-printed output for readability
res = etree.tostring(test)
print(res)
print('type(res) = ', type(res))  # etree.tostring() returns a str in Python 2 and <class 'bytes'> in Python 3; decode() converts it to str
print(res.decode('utf-8'))
print("type(res.decode('utf-8')) = ", type(res.decode('utf-8')))
b'<root Test="Try">\n  <a x="123"/>\n</root>\n'
b'<root Test="Try"><a x="123"/></root>'
type(res) =  <class 'bytes'>
<root Test="Try"><a x="123"/></root>
type(res.decode('utf-8')) =  <class 'str'>

2.4 The dump function

  • dump(elem, pretty_print=True, with_tail=True)
  • Writes an element tree or element structure to sys.stdout. This function should be used for debugging only.
etree.dump(test)
<root Test="Try">
  <a x="123"/>
</root>
etree.dump(test, pretty_print=True)
<root Test="Try">
  <a x="123"/>
</root>

2.5 The iselement function

  • iselement(element)
  • Checks if an object appears to be a valid element object.
etree.iselement(test)  # check whether test is an element object
True

2.6 The get_default_parser function

  • get_default_parser()
  • Returns the default parser used by etree.
etree.get_default_parser()
<lxml.etree.XMLParser at 0x4fecb90>

2.7 The set_default_parser function

  • set_default_parser(parser=None)

  • Sets the default parser.

  • Set a default parser for the current thread. This parser is used globally whenever no parser is supplied to the various parse functions of the lxml API. If this function is called without a parser (or if it is None), the default parser is reset to the original configuration.

  • Note that the pre-installed default parser is not thread-safe. Avoid the default parser in multi-threaded environments. You can create a separate parser for each thread explicitly or use a parser pool.
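
Given the thread-safety note above, one option is to leave the default parser alone and pass a dedicated parser explicitly. This is only a sketch: parse_compact is a hypothetical helper name, and remove_blank_text=True is just one example of a parser option.

```python
from lxml import etree


def parse_compact(text):
    """Parse XML with a dedicated parser that drops whitespace-only text nodes."""
    parser = etree.XMLParser(remove_blank_text=True)  # a fresh parser for each call
    return etree.fromstring(text, parser=parser)


source = "<root>\n  <a/>\n  <b/>\n</root>"
compact = parse_compact(source)
default = etree.fromstring(source)  # the default parser keeps the whitespace

print(etree.tostring(compact))
print(etree.tostring(default))
```

Because the parser is created inside the function, no state is shared between callers.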

2.8 The fromstring function

  • fromstring(text, parser=None, base_url=None)
  • Parses text (a string) into an Element.
  • Parses an XML document or fragment from a string. Returns the root node (or the result returned by a parser target).
  • To override the default parser with a different parser you can pass it to the parser keyword argument.
  • The base_url keyword argument allows setting the original base URL of the document to support relative paths when looking up external entities (DTD, XInclude, …).
xml_str = """
      <root>
        <a x='123'>aText
            <b/>
            <c/>
            <b/>
        </a>hello
        <a y='3'>Text
            <b/>
            <c/>
            <b/>
        </a>
      </root>
"""
root_xml = etree.fromstring(xml_str)  # returns the root node
print(root_xml)
print(type(root_xml))
print(etree.iselement(root_xml))  # check whether root_xml is an element object
<Element root at 0x54dfa88>
<class 'lxml.etree._Element'>
True
root_xml.tag
'root'
sub_elem = root_xml.find('a')
sub_elem
<Element a at 0x54f00c8>
sub_elem.text
'aText\n            '
sub_elem.tail
'hello\n        '
sub_elem.attrib
{'x': '123'}
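
The text and tail values above are easy to mix up, so here is a smaller sketch of the same idea on a one-line document (the markup is invented for illustration): text is what follows an element's start tag, tail is what follows its end tag.

```python
from lxml import etree

root = etree.fromstring("<p>Hello <b>bold</b> world<i/>!</p>")
b = root.find("b")
i = root.find("i")

print(repr(root.text))  # text before the first child of <p>
print(repr(b.text))     # text inside <b>
print(repr(b.tail))     # text after </b>, up to the next element
print(repr(i.text))     # an empty element has no text
print(repr(i.tail))     # text after <i/>

full_text = "".join(root.itertext())  # all text and tail pieces in document order
print(full_text)
```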

2.9 The ElementTree function

  • ElementTree(element=None, file=None, parser=None)
  • ElementTree wrapper class.
# etree.XMLParser(remove_blank_text=True) drops whitespace-only text at parse time, which lets pretty_print re-indent a parsed document.
parser = etree.XMLParser(remove_blank_text=True) 
my_et = etree.ElementTree(element=test, parser=parser)
my_et
<lxml.etree._ElementTree at 0x54f0548>

2.10 The HTML function

  • HTML(text, parser=None, base_url=None)
  • Parses an HTML document from a string constant. Returns the root node (or the result returned by a parser target). This function can be used to embed "HTML literals" in Python code.
html = etree.HTML(xml_str)
html
<Element html at 0x5036c08>
etree.dump(html)
<html>
  <body><root>
        <a x="123">aText
            <b/>
            <c/>
            <b/>
        </a>hello
        <a y="3">Text
            <b/>
            <c/>
            <b/>
        </a>
      </root>
</body>
</html>
etree.iselement(html)
True

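2.10.1 Differences and relationships between etree.HTML(), etree.fromstring() and etree.tostring()

Reading the comparison table:

  • Return type: etree.HTML() and etree.fromstring() both return the same class, Element, which supports xpath. etree.tostring() returns bytes, which cannot be queried with xpath.

  • Root node: etree.HTML() turns the document into an HTML document, so the root node is the html tag (this is standard HTML structure).
    etree.fromstring() keeps the root node of the original document, so parsing this way does not change the overall structure. I recommend this approach, because it makes absolute XPath paths easier to use.
    etree.tostring() has no root node at all, because its result is of type bytes. It may help to read tostring as "to_bytes".

  • Encoding: in these tests, the argument passed to etree.HTML() and etree.fromstring() was UTF-8 encoded. In the table, X stands for the original document content obtained by calling read().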
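
The three points above can be checked directly. The sketch below parses one small snippet (invented for this example) with both functions and then serializes it, printing the return type and root tag in each case.

```python
from lxml import etree

snippet = "<root><a x='1'>text</a></root>"

as_html = etree.HTML(snippet)        # wrapped in <html><body>...</body></html>
as_xml = etree.fromstring(snippet)   # original root is kept
serialized = etree.tostring(as_xml)  # bytes, not an Element

print(type(as_html).__name__, as_html.tag)
print(type(as_xml).__name__, as_xml.tag)
print(type(serialized).__name__)

# Both Element results support xpath; the bytes result does not.
html_hits = as_html.xpath("//a/@x")
xml_hits = as_xml.xpath("/root/a/text()")
print(html_hits, xml_hits, hasattr(serialized, "xpath"))
```

Note how the absolute path /root/a only works on the fromstring() result, since etree.HTML() adds html and body above root.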

2.11 The XML function

  • XML(text, parser=None, base_url=None)

  • Parses an XML document or fragment from a string constant. Returns the root node (or the result returned by a parser target). This function can be used to embed "XML literals" in Python code.

  • To override the parser with a different XMLParser you can pass it to the parser keyword argument.

  • The base_url keyword argument allows setting the original base URL of the document to support relative paths when looking up external entities (DTD, XInclude, …).

xml_test = etree.XML("<root><test/></root>")
xml_test
<Element root at 0x54f0b08>
etree.dump(xml_test)
<root>
  <test/>
</root>
etree.iselement(xml_test)
True

2.12 The parse function

  • parse(source, parser=None, base_url=None)

  • Return an ElementTree object loaded with source elements. If no parser is provided as second argument, the default parser is used.

  • The source can be any of the following:
    a file name/path
    a file object
    a file-like object
    a URL using the HTTP or FTP protocol

  • To parse from a string, use the fromstring() function instead.

  • Note that it is generally faster to parse from a file path or URL than from an open file object or file-like object. Transparent decompression from gzip compressed sources is supported (unless explicitly disabled in libxml2).

  • The base_url keyword allows setting a URL for the document when parsing from a file-like object. This is needed when looking up external entities (DTD, XInclude, …) with relative paths.

test_parse = etree.parse('./sample.xml')  # returns an ElementTree object
print(test_parse)
print(etree.iselement(test_parse))  # an ElementTree is not an element object
<lxml.etree._ElementTree object at 0x00000000054F0C88>
False
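
Since sample.xml is not included here, the following sketch shows the same parse() call on an in-memory file-like object instead. The config document is a made-up stand-in.

```python
import io

from lxml import etree

# io.BytesIO stands in for an open file; the content is a hypothetical example.
source = io.BytesIO(b"<config><item name='a'/><item name='b'/></config>")

tree = etree.parse(source)  # returns an ElementTree, not an Element
root = tree.getroot()       # the root Element of that tree
names = [item.get("name") for item in root.iter("item")]

print(etree.iselement(tree), etree.iselement(root), names)
```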

2.13 The strip_attributes function

  • strip_attributes(tree_or_element, *attribute_names)

  • Delete all attributes with the provided attribute names from an Element (or ElementTree) and its descendants.

  • Attribute names can contain wildcards as in _Element.iter.
    Example usage:
    strip_attributes(root_element,
        'simpleattr',
        '{http://some/ns}attrname',
        '{http://other/ns}*')

root_elem = test_parse.getroot()
print(root_elem)  
print(etree.iselement(root_elem))   # check whether root_elem is an element object
<Element TradingAccounts at 0x54fc1c8>
True
etree.dump(root_elem)
<TradingAccounts>
    <Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
    <Strategies>
        <Strategy name="CTA01" trade="true" commission="flase"/>
        <Strategy name="CTA02" trade="true" commission="flase"/>
        <Strategy name="ALPHA"/>
    </Strategies>
    <Accounts>
        <Account name="RB" max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
            <Strategy name="CTA01" num="3" prior="1" id="997"/>first strategy
            <Strategy name="CTA02" num="10" prior="2" id="998"/>
        </Account>
        <Account name="i" max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
            <Strategy name="CTA01" num="2" prior="1" id="999">this is text
                <Type id="10" name="FOF"/>
                same text
            </Strategy>
            <Strategy name="CTA02" num="5" prior="2" id="1000"/>
        </Account>
        <Account name="IC" max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
            <Strategy name="CTA01" num="5" prior="1" id="1001">
                <Commission id="20" rate="0.01"/>
                <Slip param="1"/>
            </Strategy>
            <Strategy name="CTA02" num="6" prior="2" id="1002"/>last strategy
        </Account>
    </Accounts>
</TradingAccounts>
etree.strip_attributes(root_elem, 'commission', 'name')
etree.dump(root_elem)
<TradingAccounts>
    <Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
    <Strategies>
        <Strategy trade="true"/>
        <Strategy trade="true"/>
        <Strategy/>
    </Strategies>
    <Accounts>
        <Account max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
            <Strategy num="3" prior="1" id="997"/>first strategy
            <Strategy num="10" prior="2" id="998"/>
        </Account>
        <Account max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
            <Strategy num="2" prior="1" id="999">this is text
                <Type id="10"/>
                same text
            </Strategy>
            <Strategy num="5" prior="2" id="1000"/>
        </Account>
        <Account max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
            <Strategy num="5" prior="1" id="1001">
                <Commission id="20" rate="0.01"/>
                <Slip param="1"/>
            </Strategy>
            <Strategy num="6" prior="2" id="1002"/>last strategy
        </Account>
    </Accounts>
</TradingAccounts>
# No error is raised if an attribute name to delete is not found
etree.strip_attributes(root_elem, 'xxyyzz')
etree.dump(root_elem)
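
For readers without sample.xml, here is a self-contained sketch of strip_attributes, including the namespace wildcard form from the usage note above. The document and the http://example.com/ns namespace are invented for illustration.

```python
from lxml import etree

root = etree.fromstring(
    '<root xmlns:n="http://example.com/ns" id="1" n:flag="y">'
    '<child id="2" keep="yes" n:other="z"/>'
    '</root>'
)

# Remove the plain "id" attribute and every attribute in the example namespace,
# on the element itself and on all of its descendants.
etree.strip_attributes(root, "id", "{http://example.com/ns}*")

root_attrs = dict(root.attrib)
child_attrs = dict(root[0].attrib)
print(root_attrs, child_attrs)
```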

2.14 The strip_elements function

  • strip_elements(tree_or_element, with_tail=True, *tag_names)

  • Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

  • Tag names can contain wildcards as in _Element.iter.

  • Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.
    Example usage:
    strip_elements(some_element,
        'simpletagname',             # non-namespaced tag
        '{http://some/ns}tagname',   # namespaced tag
        '{http://some/other/ns}*',   # any tag from a namespace
        lxml.etree.Comment           # comments
        )

root_elem1 = test_parse.getroot()
etree.strip_elements(root_elem1, 'Strategy')
etree.dump(root_elem1)
<TradingAccounts>
    <Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
    <Strategies>
        </Strategies>
    <Accounts>
        <Account max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
            </Account>
        <Account max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
            </Account>
        <Account max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
            </Account>
    </Accounts>
</TradingAccounts>
# No error is raised if the given tag does not exist
etree.strip_elements(root_elem1, 'hahaha')
etree.dump(root_elem1)
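
The effect of the with_tail option is not visible in the run above, so here is a minimal sketch that compares both settings on the same made-up input.

```python
from lxml import etree

xml = "<root><keep/>A<drop>inner</drop>tail text<keep/></root>"

first = etree.fromstring(xml)
etree.strip_elements(first, "drop")  # default: the tail text goes away too

second = etree.fromstring(xml)
etree.strip_elements(second, "drop", with_tail=False)  # the tail text stays in place

with_tail_removed = etree.tostring(first)
with_tail_kept = etree.tostring(second)
print(with_tail_removed)
print(with_tail_kept)
```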

2.15 The strip_tags function

  • strip_tags(tree_or_element, *tag_names)

  • Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their attributes, but not their text/tail content or descendants. Instead, it will merge the text content and children of the element into its parent.

  • Tag names can contain wildcards as in _Element.iter.

  • Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants.

Example usage:
strip_tags(some_element,
    'simpletagname',             # non-namespaced tag
    '{http://some/ns}tagname',   # namespaced tag
    '{http://some/other/ns}*',   # any tag from a namespace
    Comment                      # comments (including their text!)
    )

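root_elem2 = test_parse.getroot()  # same tree as above; its Strategy elements were already removed in 2.14
etree.strip_tags(root_elem2, 'Strategy')  # no Strategy elements remain, so the tree is unchanged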
etree.dump(root_elem2)
<TradingAccounts>
    <Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
    <Strategies>
        </Strategies>
    <Accounts>
        <Account max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
            </Account>
        <Account max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
            </Account>
        <Account max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
            </Account>
    </Accounts>
</TradingAccounts>
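
Because the tree above no longer contains any Strategy elements, the merge behaviour of strip_tags is easier to see on a fresh input. The sketch below (with an invented paragraph) contrasts it with strip_elements.

```python
from lxml import etree

xml = "<p>Hello <b>bold <i>nested</i></b> world</p>"

unwrapped = etree.fromstring(xml)
etree.strip_tags(unwrapped, "b")     # drops the <b> tag, keeps its text and children

removed = etree.fromstring(xml)
etree.strip_elements(removed, "b")   # drops <b>, its subtree and its tail

after_strip_tags = etree.tostring(unwrapped)
after_strip_elements = etree.tostring(removed)
print(after_strip_tags)
print(after_strip_elements)
```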

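2.16 The Element class

  • Element is the core class for XML processing. An Element object can be thought of as an XML node, and most XML node handling revolves around this class. It covers three areas: operations on nodes, on node attributes, and on node text. These are covered in more detail below through create, delete, update and query operations on XML.

Attributes: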

  • attrib
    Element attribute dictionary. Where possible, use get(), set(), keys(), values() and items() to access element attributes.

  • base
    The base URI of the Element (xml:base or HTML base URL). None if the base URI is unknown.

  • nsmap
    Namespace prefix->URI mapping known in the context of this Element. This includes all namespace declarations of the parents.

  • prefix
    Namespace prefix or None.

  • sourceline
    Original line number as found by the parser or None if unknown.

  • tag
    Element tag

  • tail
    Text after this element’s end tag, but before the next sibling element’s start tag. This is either a string or the value None, if there was no text.

  • text
    Text before the first subelement. This is either a string or the value None, if there was no text.

Methods:

  • __contains__(self, element)

  • __copy__(self)

  • __deepcopy__(self, memo)

  • __delitem__(self, x)
    Deletes the given subelement or a slice.

  • __getitem__(…)
    Returns the subelement at the given position or the requested slice.

  • __iter__(self)

  • __len__(self)
    Returns the number of subelements.

  • __new__(T, S, …)

  • __nonzero__(x)
    x != 0

  • __repr__(self)
    repr(x)

  • __reversed__(self)

  • __setitem__(self, x, value)
    Replaces the given subelement index or slice.

  • _init(self)
    Called after object initialisation. Custom subclasses may override this if they recursively call _init() in the superclasses.

  • addnext(self, element)
    Adds the element as a following sibling directly after this element.

  • addprevious(self, element)
    Adds the element as a preceding sibling directly before this element.

  • append(self, element)
    Adds a subelement to the end of this element.

  • clear(self, keep_tail=False)
    Resets an element. This function removes all subelements, clears all attributes and sets the text and tail properties to None.

  • cssselect(…)
    Run the CSS expression on this element and its children, returning a list of the results.

  • extend(self, elements)
    Extends the current children by the elements in the iterable.

  • find(self, path, namespaces=None)
    Finds the first matching subelement, by tag name or path.

  • findall(self, path, namespaces=None)
    Finds all matching subelements, by tag name or path.

  • findtext(self, path, default=None, namespaces=None)
    Finds text for the first matching subelement, by tag name or path.

  • get(self, key, default=None)
    Gets an element attribute.

  • getchildren(self)
    Returns all direct children. The elements are returned in document order.

  • getiterator(self, tag=None, *tags)
    Returns a sequence or iterator of all elements in the subtree in document order (depth first pre-order), starting with this element.

  • getnext(self)
    Returns the following sibling of this element or None.

  • getparent(self)
    Returns the parent of this element or None for the root element.

  • getprevious(self)
    Returns the preceding sibling of this element or None.

  • getroottree(self)
    Return an ElementTree for the root node of the document that contains this element.

  • index(self, child, start=None, stop=None)
    Find the position of the child within the parent.

  • insert(self, index, element)
    Inserts a subelement at the given position in this element

  • items(self)
    Gets element attributes, as a sequence. The attributes are returned in an arbitrary order.

  • iter(self, tag=None, *tags)
    Iterate over all elements in the subtree in document order (depth first pre-order), starting with this element.

  • iterancestors(self, tag=None, *tags)
    Iterate over the ancestors of this element (from parent to parent).

  • iterchildren(self, tag=None, reversed=False, *tags)
    Iterate over the children of this element.

  • iterdescendants(self, tag=None, *tags)
    Iterate over the descendants of this element in document order.

  • iterfind(self, path, namespaces=None)
    Iterates over all matching subelements, by tag name or path.

  • itersiblings(self, tag=None, preceding=False, *tags)
    Iterate over the following or preceding siblings of this element.

  • itertext(self, tag=None, with_tail=True, *tags)
    Iterates over the text content of a subtree.

  • keys(self)
    Gets a list of attribute names. The names are returned in an arbitrary order (just like for an ordinary Python dictionary).

  • makeelement(self, _tag, attrib=None, nsmap=None, **_extra)
    Creates a new element associated with the same document.

  • remove(self, element)
    Removes a matching subelement. Unlike the find methods, this method compares elements based on identity, not on tag value or contents.

  • replace(self, old_element, new_element)
    Replaces a subelement with the element passed as second argument.

  • set(self, key, value)
    Sets an element attribute.

  • values(self)
    Gets element attribute values as a sequence of strings. The attributes are returned in an arbitrary order.

  • xpath(self, _path, namespaces=None, extensions=None, smart_strings=True, **_variables)
    Evaluate an xpath expression using the element as context node.

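2.17 The ElementTree class

  • The parse(source, parser=None, base_url=None) function introduced above returns an ElementTree object. An ElementTree object shares many of its methods with Element objects.
    The details follow:

  • ElementTree object methods: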

  • find(self, path, namespaces=None)
    Finds the first toplevel element with given tag. Same as tree.getroot().find(path).

  • findall(self, path, namespaces=None)
    Finds all elements matching the ElementPath expression. Same as getroot().findall(path).

  • findtext(self, path, default=None, namespaces=None)
    Finds the text for the first element matching the ElementPath expression. Same as getroot().findtext(path).

  • getelementpath(self, element)
    Returns a structural, absolute ElementPath expression to find the element. This path can be used in the .find() method to look up the element, provided that the elements along the path and their list of immediate children were not modified in between.

  • getiterator(self, tag=None, *tags)
    Returns a sequence or iterator of all elements in document order (depth first pre-order), starting with the root element.

  • getpath(self, element)
    Returns a structural, absolute XPath expression to find the element.

  • getroot(self)
    Gets the root element for this tree.

  • iter(self, tag=None, *tags)
    Creates an iterator for the root element. The iterator loops over all elements in this tree, in document order. Note that siblings of the root element (comments or processing instructions) are not returned by the iterator.

  • iterfind(self, path, namespaces=None)
    Iterates over all elements matching the ElementPath expression. Same as getroot().iterfind(path).

  • parse(self, source, parser=None, base_url=None)
    Updates self with the content of source and returns its root.

  • relaxng(self, relaxng)
    Validate this document using other document.

  • write(self, file, encoding=None, method="xml", pretty_print=False, xml_declaration=None, with_tail=True, standalone=None, doctype=None, compression=0, exclusive=False, inclusive_ns_prefixes=None, with_comments=True, strip_text=False)
    Write the tree to a filename, file or file-like object.
    This method is specific to ElementTree. It writes the ElementTree to a file, a file-like object, or a URL (via FTP PUT or HTTP POST). Its optional parameters are similar to those of etree.tostring(), with some differences.

  • write_c14n(self, file, exclusive=False, with_comments=True, compression=0, inclusive_ns_prefixes=None)
    C14N write of document. Always writes UTF-8.

  • xinclude(self)
    Process the XInclude nodes in this document and include the referenced XML fragments.

  • xmlschema(self, xmlschema)
    Validate this document using other document.

  • xpath(self, _path, namespaces=None, extensions=None, smart_strings=True, **_variables)
    XPath evaluate in context of document.

  • xslt(self, _xslt, extensions=None, access_control=None, **_kw)
    Transform this document using other document.
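
A few of the ElementTree methods listed above are shown in this short sketch: getpath(), getelementpath(), xpath() and find() used as a round trip. The document is a made-up example.

```python
from lxml import etree

root = etree.fromstring("<root><a><b/><b/></a></root>")
tree = root.getroottree()       # the ElementTree that owns this root
second_b = root.find("a")[1]    # second <b> under <a>

xpath_expr = tree.getpath(second_b)            # absolute XPath expression
element_path = tree.getelementpath(second_b)   # ElementPath expression for find()

found_by_xpath = tree.xpath(xpath_expr)[0]
found_by_find = tree.find(element_path)

print(xpath_expr, element_path)
print(found_by_xpath is second_b, found_by_find is second_b)
```

As the description of getelementpath() says, the generated paths stay valid only as long as the tree structure along the path is not modified.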

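3. Code walkthrough

The code below demonstrates the functions and classes introduced above and combines them in practice.

3.1 Node operations

3.1.1 Creating an Element object

Use the Element function; its argument is the node name.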

from __future__ import print_function
from lxml import etree
root = etree.Element('root')  # create an Element object with the Element function; the methods and attributes of the Element class can then be used to add, delete, modify and query it
root
<Element root at 0x55a3a08>

3.1.2 Getting the node name

Use the tag attribute to get the name of a node.

root.tag
'root'

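3.1.3 Adding child nodes with etree.SubElement

To create a child node with SubElement, pass the parent node (an Element object) as the first argument and the child's tag name as the second.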

child1 = etree.SubElement(root, 'child1')
child2 = etree.SubElement(root, 'child2')

3.1.4 Adding child nodes with the extend method of the Element class

root.extend([etree.Element('child3'), etree.Element('child4')])
etree.dump(root)
<root>
  <child1/>
  <child2/>
  <child3/>
  <child4/>
</root>    

3.1.5 getparent()

  • getparent()
  • Returns the parent of this element or None for the root element.
print(root.getparent())
None
child1.getparent()
<Element root at 0x55a3a08>

3.1.6 index: position of a child node

  • index(self, child, start=None, stop=None)
  • Find the position of the child within the parent.
root.index(child2)
1

3.1.7 getchildren()

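  • getchildren() returns a list of all direct children, in document order. The method is deprecated in lxml; list(element) gives the same result.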
all_direct_children = root.getchildren()
print(all_direct_children)
print(type(all_direct_children))
[<Element child1 at 0x55a36c8>, <Element child2 at 0x558ac48>, <Element child3 at 0x558a1c8>, <Element child4 at 0x558ab08>]
<class 'list'>

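3.1.8 Working with child nodes as a list

  • The children of an Element object can be treated as a list for many operations:
# access by index
child = root[0]  # the same element as root.find('child1')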
child.tag
'child1'
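
Index access is only one of the list-style operations. The sketch below builds a separate throwaway tree (so it does not disturb root above) and shows negative indexes, slicing, del and len.

```python
from lxml import etree

parent = etree.Element("parent")
for name in ("a", "b", "c", "d"):
    etree.SubElement(parent, name)

first_tag = parent[0].tag                      # first child
last_tag = parent[-1].tag                      # negative indexes work as in lists
middle_tags = [el.tag for el in parent[1:3]]   # a slice returns a list of elements

del parent[0]                                  # remove the first child
remaining_tags = [el.tag for el in parent]     # iterating yields the direct children

print(first_tag, last_tag, middle_tags, remaining_tags, len(parent))
```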

3.1.9 insert: inserting a node

root.insert(0, etree.Element('child0', attrib={'name': 'ch1'})) # insert at position 0 among the direct children of root
child0 = root[0]
child0.insert(0, etree.Element('grandson0', attrib={'name': 'gson', 'age': '3', 'type': 'insert'}))
etree.dump(root)
<root>
  <child0 name="ch1">
    <grandson0 age="3" name="gson" type="insert"/>
  </child0>
  <child1/>
  <child2/>
  <child3/>
  <child4/>
</root>   

3.1.10 append: adding a node at the end

root.append(etree.Element('append_child', attrib={'id': '1'})) # append at the end
root.append(etree.Element('append_child', attrib={'id': '2'})) # append at the end
child0 = root[0]
child0.append(etree.Element('append_grandson', attrib={'name': 'gson', 'age': '5', 'type': 'append'}))
etree.dump(root)
<root>
  <child0 name="ch1">
    <grandson0 age="3" name="gson" type="insert"/>
    <append_grandson age="5" name="gson" type="append"/>
  </child0>
  <child1/>
  <child2/>
  <child3/>
  <child4/>
  <append_child id="1"/>
  <append_child id="2"/>
</root>

3.1.11 addnext: adding an element as the following sibling directly after this element

  • addnext(element)
  • Adds the element as a following sibling directly after this element.
add_elem = root.find('child4') # or add_elem = root[4], since child0 now sits at index 0
add_elem.addnext(etree.Element('add_cute_child', attrib={'name': 'add', 'kind': 'cute'}))
etree.dump(root)
<root>
  <child0 name="ch1">
    <grandson0 age="3" name="gson" type="insert"/>
    <append_grandson age="5" name="gson" type="append"/>
  </child0>
  <child1/>
  <child2/>
  <child3/>
  <child4/>
  <add_cute_child kind="cute" name="add"/>
  <append_child id="1"/>
  <append_child id="2"/>
</root>

3.1.12 addprevious: adding an element as the preceding sibling directly before this element

  • addprevious(self, element)
  • Adds the element as a preceding sibling directly before this element.
add_sibling = root.find('child4') 
add_sibling.addprevious(etree.Element('add_preceding_sibling', attrib={'name': 'add', 'kind': 'sibling', 'site': 'preceding'}))
etree.dump(root)
<root>
  <child0 name="ch1">
    <grandson0 age="3" name="gson" type="insert"/>
    <append_grandson age="5" name="gson" type="append"/>
  </child0>
  <child1/>
  <child2/>
  <child3/>
  <add_preceding_sibling kind="sibling" name="add" site="preceding"/>
  <child4/>
  <add_cute_child kind="cute" name="add"/>
  <append_child id="1"/>
  <append_child id="2"/>
</root>

Getting an element attribute
get(self, key, default=None)
Gets an element attribute.
Note: this is covered further in section 3.2.

r1 = root.find('add_preceding_sibling').get('kind')  # get the value of the kind attribute of the add_preceding_sibling element
r1
'sibling'
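
Besides get(), the attribute helpers listed in section 2.16 (set, keys, items, attrib) work together as in this small sketch. The item element and its attributes are arbitrary examples.

```python
from lxml import etree

el = etree.Element("item", name="x")  # keyword arguments become attributes
el.set("kind", "demo")                # add or overwrite an attribute

name = el.get("name")
missing = el.get("missing")                   # None when the attribute is absent
fallback = el.get("missing", "fallback")      # or the supplied default
all_names = sorted(el.keys())
as_dict = dict(el.attrib)                     # attrib behaves like a dictionary

del el.attrib["kind"]                         # remove an attribute
after_delete = dict(el.items())

print(name, missing, fallback, all_names, as_dict, after_delete)
```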

3.1.13 find

  • find(self, path, namespaces=None)
  • Finds the first matching subelement, by tag name or path.
  • The optional namespaces argument accepts a prefix-to-namespace mapping that allows the usage of XPath prefixes in the path expression.
root.find('child0')
<Element child0 at 0x504d108>
root.find('child0').find('grandson0')
<Element grandson0 at 0x55a3788>
root.find('child0/grandson0')
<Element grandson0 at 0x55a3788>

3.1.14 findall

  • findall(self, path, namespaces=None)
  • Finds all matching subelements, by tag name or path.
  • The optional namespaces argument accepts a prefix-to-namespace mapping that allows the usage of XPath prefixes in the path expression.
root.findall('child0')
[<Element child0 at 0x504d108>]
root.findall('append_child')
[<Element append_child at 0x55a1e48>, <Element append_child at 0x55a17c8>]
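
The namespaces argument mentioned for find and findall is not used in the examples above, so here is a minimal sketch of it. The feed document and the http://example.com/meta namespace are invented for illustration.

```python
from lxml import etree

feed = etree.fromstring(
    '<feed xmlns:m="http://example.com/meta">'
    '<m:entry id="1"/><m:entry id="2"/><entry id="3"/>'
    '</feed>'
)
ns = {"m": "http://example.com/meta"}  # prefix-to-namespace mapping

first_id = feed.find("m:entry", namespaces=ns).get("id")
prefixed_ids = [e.get("id") for e in feed.findall("m:entry", namespaces=ns)]
# The same lookup without a mapping, using {namespace}tag notation:
clark_ids = [e.get("id") for e in feed.findall("{http://example.com/meta}entry")]
# A bare tag name only matches elements that are in no namespace:
plain_ids = [e.get("id") for e in feed.findall("entry")]

print(first_id, prefixed_ids, clark_ids, plain_ids)
```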

3.1.15 getprevious: returns the preceding sibling of this element

  • getprevious(self)
  • Returns the preceding sibling of this element or None.
child1 = root[1]
print('child1 = ', child1)
print(child1.getprevious())
child1 =  <Element child1 at 0x55a36c8>
<Element child0 at 0x504d108>

3.1.16 getnext: returns the following sibling of this element

  • getnext(self)
  • Returns the following sibling of this element or None.
child1.getnext()
<Element child2 at 0x558ac48>

3.1.17 getparent: getting the parent node

  • Use the getparent method to get the parent node.
child1.getparent().tag
'root'

3.1.18 getchildren: getting all direct children

  • getchildren(self)
  • Returns all direct children. The elements are returned in document order.
root.getchildren()
[<Element child0 at 0x504d108>,
 <Element child1 at 0x55a36c8>,
 <Element child2 at 0x558ac48>,
 <Element child3 at 0x558a1c8>,
 <Element add_preceding_sibling at 0x558a188>,
 <Element child4 at 0x558ab08>,
 <Element add_cute_child at 0x55a16c8>,
 <Element append_child at 0x55a1e48>,
 <Element append_child at 0x55a17c8>]

3.1.19 getiterator: returns an iterator over all nodes in the subtree

  • getiterator(self, tag=None, *tags)
  • Returns a sequence or iterator of all elements in the subtree in document order (depth first pre-order), starting with this element.
  • This method is deprecated in lxml; iter(), shown in the next section, is the recommended replacement.
root_iterator = root.getiterator()
root_iterator
<lxml.etree.ElementDepthFirstIterator at 0x5574dc8>
for i in root_iterator:
    print(i)
<Element root at 0x55a3a08>
<Element child0 at 0x504d108>
<Element grandson0 at 0x55a3788>
<Element append_grandson at 0x54e9608>
<Element child1 at 0x55a36c8>
<Element child2 at 0x558ac48>
<Element child3 at 0x558a1c8>
<Element add_preceding_sibling at 0x558a188>
<Element child4 at 0x558ab08>
<Element add_cute_child at 0x55a16c8>
<Element append_child at 0x55a1e48>
<Element append_child at 0x55a17c8>

3.1.20 iter: returns an iterator over all nodes in the subtree

  • iter(self, tag=None, *tags)
  • Iterate over all elements in the subtree in document order (depth first pre-order), starting with this element.
root_iter = root.iter()
root_iter
<lxml.etree.ElementDepthFirstIterator at 0x55a5558>
for i in root_iter:
    print(i)
<Element root at 0x55a3a08>
<Element child0 at 0x504d108>
<Element grandson0 at 0x55a3788>
<Element append_grandson at 0x55a1808>
<Element child1 at 0x55a36c8>
<Element child2 at 0x558ac48>
<Element child3 at 0x558a1c8>
<Element add_preceding_sibling at 0x558a188>
<Element child4 at 0x558ab08>
<Element add_cute_child at 0x55a16c8>
<Element append_child at 0x55a1e48>
<Element append_child at 0x55a17c8>
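
The tag and *tags parameters in the signature above restrict the iteration to matching elements. A minimal sketch on a separate, made-up tree:

```python
from lxml import etree

sample = etree.fromstring("<root><a><b/><c/></a><b/><c><b/></c></root>")

all_tags = [el.tag for el in sample.iter()]         # whole subtree, document order
only_b = list(sample.iter("b"))                     # only <b> elements, at any depth
a_or_c = [el.tag for el in sample.iter("a", "c")]   # several tag names at once

print(all_tags, len(only_b), a_or_c)
```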

3.1.21 iterancestors: returns an iterator over all ancestors of this node

  • iterancestors(self, tag=None, *tags)
  • Iterate over the ancestors of this element (from parent to parent).
iterancestors = root.find('child0').find('grandson0').iterancestors()  # iterator over all ancestors of grandson0
print(type(iterancestors))
iterancestors
<class 'lxml.etree.AncestorsIterator'>
<lxml.etree.AncestorsIterator at 0x55a5828>
for i in iterancestors:
    print(i)
<Element child0 at 0x504d108>
<Element root at 0x55a3a08>

3.1.22 iterchildren: returns an iterator over all children of this node

  • iterchildren(self, tag=None, reversed=False, *tags)
  • Iterate over the children of this element.
iterchildren = root.find('child0').iterchildren()  # iterator over all direct children of child0
iterchildren
<lxml.etree.ElementChildIterator at 0x55a59d8>
for i in iterchildren:
    print(i)
<Element grandson0 at 0x55a3788>
<Element append_grandson at 0x5593cc8>
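
The reversed keyword and the tag filter can be tried on a small made-up tree (an assumption for illustration, not the tree used above):

```python
from lxml import etree

# Hypothetical tree, used only for this illustration.
root = etree.XML("<root><a/><b/><a/><c/></root>")

# Default: direct children in document order.
forward = [el.tag for el in root.iterchildren()]

# reversed=True walks the same children from last to first.
backward = [el.tag for el in root.iterchildren(reversed=True)]

# A tag name restricts the iteration to children with that tag.
only_a = [el.tag for el in root.iterchildren("a")]

print(forward, backward, only_a)
```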

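3.1.23 iterdescendants: iterate over all descendants of this element in document order

  • iterdescendants(self, tag=None, *tags)
  • Iterate over the descendants of this element in document order.
    Unlike iter(), the element itself is not included.
iterdescendants = root.iterdescendants()  # iterator over all descendants of root, in document order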
iterdescendants
<lxml.etree.ElementDepthFirstIterator at 0x55a5c60>
for i in iterdescendants:
    print(i)
<Element child0 at 0x504d108>
<Element grandson0 at 0x55a3788>
<Element append_grandson at 0x5593cc8>
<Element child1 at 0x55a36c8>
<Element child2 at 0x558ac48>
<Element child3 at 0x558a1c8>
<Element add_preceding_sibling at 0x558a188>
<Element child4 at 0x558ab08>
<Element add_cute_child at 0x55a16c8>
<Element append_child at 0x55a1e48>
<Element append_child at 0x55a17c8>

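3.1.24 itersiblings: return an iterator over the siblings of this node

  • itersiblings(self, tag=None, *tags, preceding=False)
  • Iterate over the following or preceding siblings of this element.
    By default the following siblings are returned; pass preceding=True for the preceding ones.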
itersiblings = root.find('child0').find('grandson0').itersiblings()
itersiblings
<lxml.etree.SiblingsIterator at 0x55a5d38>
for i in itersiblings:
    print(i)
<Element append_grandson at 0x559dfc8>
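
A minimal sketch of both directions, using a made-up flat tree (names are assumptions for illustration):

```python
from lxml import etree

# Hypothetical tree, used only for this illustration.
root = etree.XML("<root><a/><b/><c/><d/></root>")
c = root.find("c")

# Default: siblings that follow the element, in document order.
following = [el.tag for el in c.itersiblings()]

# preceding=True: siblings before the element, nearest one first.
preceding = [el.tag for el in c.itersiblings(preceding=True)]

# getnext() and getprevious() return only the immediate neighbours.
neighbours = (c.getprevious().tag, c.getnext().tag)

print(following, preceding, neighbours)
```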

3.1.25 iterfind: return an iterator over all nodes matching a tag name or path

  • iterfind(self, path, namespaces=None)
  • Iterates over all matching subelements, by tag name or path.
iterfind = root.iterfind('child0/')  # the trailing '/' is shorthand for 'child0/*', i.e. all children of child0
child0_child = [i for i in iterfind]
child0_child
[<Element grandson0 at 0x516a7c8>, <Element append_grandson at 0x4eb1f48>]
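
The namespaces argument is not exercised above. The following sketch shows how it might be used; the namespace URI and the prefix "s" are invented for the example.

```python
from lxml import etree

# Hypothetical document with one namespaced and one plain <item> tag.
root = etree.XML(
    '<root xmlns:s="http://example.com/s">'
    '<s:item id="1"/><item id="2"/><s:item id="3"/>'
    '</root>'
)
ns = {"s": "http://example.com/s"}

# A prefix in the path is resolved through the namespaces mapping.
prefixed_ids = [el.get("id") for el in root.iterfind("s:item", namespaces=ns)]

# The same lookup in Clark notation needs no mapping.
clark_ids = [el.get("id") for el in root.iterfind("{http://example.com/s}item")]

# A bare tag name matches only the element without a namespace.
plain_ids = [el.get("id") for el in root.iterfind("item")]

print(prefixed_ids, clark_ids, plain_ids)
```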

3.1.26 getroottree: return the ElementTree

  • getroottree(self)
  • Return an ElementTree for the root node of the document that contains this element.
root.getroottree()
<lxml.etree._ElementTree at 0x55a7fc8>

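3.1.27 Iterating, slicing and indexing nodes

len(root)  # number of direct children
9
root.index(child2)  # position of child2 among root's children
2
for child in root:  # iterate over the direct children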
    print(child.tag)
child0
child1
child2
child3
add_preceding_sibling
child4
add_cute_child
append_child
append_child
start = root[1:]  # slicing returns a list of child elements
start[0].tag
'child1'
end = root[-1:]
end[0].tag
'append_child'

3.1.28 replace: replace a node

  • replace(self, old_element, new_element)
  • Replaces a subelement with the element passed as second argument.
    old_element must be a direct child of the element on which replace() is called.
root.replace(root.find('child2'), etree.Element('replace_child2', attrib={'type': 'replace'}))
etree.dump(root)
<root>
  <child0 name="ch1">
    <grandson0 age="3" name="gson" type="insert"/>
    <append_grandson age="5" name="gson" type="append"/>
  </child0>
  <child1/>
  <replace_child2 type="replace"/>
  <child3/>
  <add_preceding_sibling kind="sibling" name="add" site="preceding"/>
  <child4/>
  <add_cute_child kind="cute" name="add"/>
  <append_child id="1"/>
  <append_child id="2"/>
</root>

3.1.29 remove | clear: delete nodes

  • Deleting child nodes
    remove() deletes the given child node and takes an Element object as its argument. clear() resets the element: it removes all child nodes, attributes and text.
  • remove(self, element)
    Removes a matching subelement. Unlike the find methods, this method compares elements based on identity, not on tag value or contents.
  • clear(self, keep_tail=False)
    Resets an element. This function removes all subelements, clears all attributes and sets the text and tail properties to None.
root.remove(child1)  # remove the given child node
etree.dump(root)
<root>
  <child0 name="ch1">
    <grandson0 age="3" name="gson" type="insert"/>
    <append_grandson age="5" name="gson" type="append"/>
  </child0>
  <replace_child2 type="replace"/>
  <child3/>
  <add_preceding_sibling kind="sibling" name="add" site="preceding"/>
  <child4/>
  <add_cute_child kind="cute" name="add"/>
  <append_child id="1"/>
  <append_child id="2"/>
</root>
root.clear()  # remove all children, attributes and text
etree.dump(root)
<root/>
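
Two points from the descriptions above are worth trying out: remove() matches by identity, and clear() drops more than the children. A small sketch on a made-up tree (the tag names are assumptions):

```python
from lxml import etree

# Hypothetical tree, used only for this illustration.
root = etree.XML("<root k='v'>text<keep/><drop/><keep/><drop/></root>")

# findall() returns a list, so removing elements while looping over it is safe.
for el in root.findall("drop"):
    root.remove(el)
remaining = [child.tag for child in root]

# remove() compares by identity: a new element that merely looks like a
# child is not a child, and lxml raises ValueError.
try:
    root.remove(etree.Element("keep"))
    rejected = False
except ValueError:
    rejected = True

# clear() drops children, attributes and text in one call.
root.clear()

print(remaining, rejected, len(root), dict(root.attrib), root.text)
```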

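3.2 Working with attributes

Attributes are stored as key-value pairs, much like a dictionary.

3.2.1 Creating attributes

  • Attributes can be created together with the Element object by passing them as keyword arguments (attribute name=attribute value):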
root = etree.Element('root', interesting='totally')
etree.dump(root)
<root interesting="totally"/>
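  • The set method adds an attribute to an existing Element object, or overwrites it if it already exists. Its two arguments are the attribute name and the attribute value.
root.set('hello', 'Huhu')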
etree.dump(root)
<root interesting="totally" hello="Huhu"/>

3.2.2 items: get the attributes

  • items(self)
  • Gets element attributes, as a sequence. The attributes are returned in an arbitrary order.
    Each item is a (name, value) tuple.
root.items()
[('interesting', 'totally'), ('hello', 'Huhu')]

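3.2.3 makeelement: create an Element object

  • Creates a new Element that belongs to the same document as an existing Element
  • makeelement(self, _tag, attrib=None, nsmap=None, **_extra)
  • Creates a new element associated with the same document.
    The new element is not attached to the tree, so root is unchanged in the output that follows.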
xxx = root.makeelement('make_element', attrib={'att': 'make'})
xxx
<Element make_element at 0x559d5c8>
etree.dump(xxx)
<make_element att="make"/>
etree.dump(root)
<root interesting="totally" hello="Huhu"/>

3.2.4 get: get an attribute

  • get(self, key, default=None)
  • Gets an element attribute.
    It works like dict.get: the lookup is by attribute name. The examples show it directly.
# get returns the value of a single attribute
root.get('interesting')
'totally'
root.get('xyz', default='123')
'123'

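Asking for an attribute that does not exist does not raise an error. As with dict.get, a missing key simply returns None.

root.get('xyz')  # returns None
my_dic = {'a': 1, 'b': 2}
my_dic.get('xxx')  # returns None as well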

The tag of any node, root or not, can be reassigned. Assigning to tag renames the node in place; it does not add a new node.

root.tag = 'rootxuy'
etree.dump(root)
<rootxuy interesting="totally" hello="Huhu"/>
child = etree.SubElement(root, 'child', attrib={"a": '123'})
child.tag = 'great_child'
etree.dump(root)
<rootxuy interesting="totally" hello="Huhu">
  <great_child a="123"/>
</rootxuy>
root.tag
'rootxuy'

3.2.5 keys: get all attribute names

  • keys(self)
    Gets a list of attribute names. The names are returned in an arbitrary order (just like for an ordinary Python dictionary).
sorted(root.keys())
['hello', 'interesting']

3.2.6 items: get all key-value pairs

# items returns all (name, value) pairs
for name, value in sorted(root.items()):
    print('%s = %r' % (name, value))
hello = 'Huhu'
interesting = 'totally'

The attrib property also gives access to all attributes and their values at once, as a dictionary-like object.

attributes = root.attrib
attributes
{'hello': 'Huhu', 'interesting': 'totally'}
attributes['good'] = 'Bye'  # changes made through the dictionary are applied to the node
root.get('good')
'Bye'
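
Because attrib is a live view of the element, a copy is needed when an independent dictionary is wanted. A short sketch of the difference (the attribute names "good" and "extra" are only examples):

```python
from lxml import etree

root = etree.Element("root", interesting="totally")

# root.attrib is a live view: writing to it changes the element.
root.attrib["good"] = "Bye"
via_get = root.get("good")

# dict(...) takes an independent snapshot; changing it leaves the element alone.
snapshot = dict(root.attrib)
snapshot["extra"] = "1"
extra_on_element = root.get("extra")

# Deleting a key from attrib removes the attribute from the element.
del root.attrib["good"]
still_there = "good" in root.attrib

print(via_get, extra_on_element, still_there)
```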

3.2.7 values: get the attribute values of a node

  • values(self)
  • Gets element attribute values as a sequence of strings. The attributes are returned in an arbitrary order.
root.values()
['totally', 'Huhu', 'Bye']

3.3 Working with text

With tags and their attributes covered, what remains is the text inside the tags. It can be accessed through the text and tail properties or through XPath.

3.3.1 The text and tail properties

In most cases the text of a tag is available through the Element's text property.

  • text
    Text before the first subelement. This is either a string or the value None, if there was no text.

  • tail
    Text after this element’s end tag, but before the next sibling element’s start tag. This is either a string or the value None, if there was no text.

root = etree.parse('./sample.xml')
xml_root = root.getroot()
etree.dump(xml_root)
<TradingAccounts>
    <Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
    <Strategies>
        <Strategy name="CTA01" trade="true" commission="flase"/>
        <Strategy name="CTA02" trade="true" commission="flase"/>
        <Strategy name="ALPHA"/>
    </Strategies>
    <Accounts>
        <Account name="RB" max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
            <Strategy name="CTA01" num="3" prior="1" id="997"/>first strategy
            <Strategy name="CTA02" num="10" prior="2" id="998"/>
        </Account>
        <Account name="i" max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
            <Strategy name="CTA01" num="2" prior="1" id="999">this is text
                <Type id="10" name="FOF"/>
                same text
            </Strategy>
            <Strategy name="CTA02" num="5" prior="2" id="1000"/>
        </Account>
        <Account name="IC" max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
            <Strategy name="CTA01" num="5" prior="1" id="1001">
                <Commission id="20" rate="0.01"/>
                <Slip param="1"/>
            </Strategy>
            <Strategy name="CTA02" num="6" prior="2" id="1002"/>last strategy
        </Account>
    </Accounts>
</TradingAccounts>
xml_root.text = 'Hello, World!\n'
xml_root.find('Constants').text = 'this is Constants'
xml_root.text
'Hello, World!\n'
etree.dump(xml_root)
<TradingAccounts>Hello, World!
<Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10">this is Constants</Constants>
    <Strategies>
        <Strategy name="CTA01" trade="true" commission="flase"/>
        <Strategy name="CTA02" trade="true" commission="flase"/>
        <Strategy name="ALPHA"/>
    </Strategies>
    <Accounts>
        <Account name="RB" max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
            <Strategy name="CTA01" num="3" prior="1" id="997"/>first strategy
            <Strategy name="CTA02" num="10" prior="2" id="998"/>
        </Account>
        <Account name="i" max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
            <Strategy name="CTA01" num="2" prior="1" id="999">this is text
                <Type id="10" name="FOF"/>
                same text
            </Strategy>
            <Strategy name="CTA02" num="5" prior="2" id="1000"/>
        </Account>
        <Account name="IC" max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
            <Strategy name="CTA01" num="5" prior="1" id="1001">
                <Commission id="20" rate="0.01"/>
                <Slip param="1"/>
            </Strategy>
            <Strategy name="CTA02" num="6" prior="2" id="1002"/>last strategy
        </Account>
    </Accounts>
</TradingAccounts>
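
On a large file it is hard to see which string ends up in text and which in tail. A one-line document makes the split easier to follow; this snippet is a made-up example, not part of sample.xml.

```python
from lxml import etree

# Hypothetical mixed-content snippet, used only for this illustration.
root = etree.XML("<p>Hello <b>bold</b> world<i/>!</p>")
b = root.find("b")
i = root.find("i")

# text: what sits between an element's start tag and its first child.
# tail: what sits between an element's end tag and the next tag.
print(repr(root.text))  # text of <p>, before <b>
print(repr(b.text))     # text inside <b>
print(repr(b.tail))     # text after </b>, before <i/>
print(repr(i.text))     # <i/> is empty, so there is no text
print(repr(i.tail))     # text after <i/>
```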

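3.3.2 itertext: return an iterator over the text content of the subtree

  • itertext(self, tag=None, *tags, with_tail=True)
  • Iterates over the text content of a subtree.
    Both the text and the tail of the nodes inside the subtree are included unless with_tail=False is passed.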
itertext = xml_root.itertext()
itertext
<lxml.etree.ElementTextIterator at 0x559af60>
for i in itertext:
    if str.strip(i):
        print('str.strip(i) = ', str.strip(i), '---------->', len(str.strip(i)))
        
str.strip(i) =  Hello, World! ----------> 13
str.strip(i) =  this is Constants ----------> 17
str.strip(i) =  first strategy ----------> 14
str.strip(i) =  this is text ----------> 12
str.strip(i) =  same text ----------> 9
str.strip(i) =  last strategy ----------> 13
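
The effect of with_tail is easier to see on a tiny document. The snippet is a made-up example, assuming nothing beyond the itertext signature given above.

```python
from lxml import etree

# Hypothetical mixed-content snippet, used only for this illustration.
root = etree.XML("<p>Hello <b>bold</b> world</p>")

# Default: text and tail strings, in document order.
with_tails = list(root.itertext())

# with_tail=False: only the .text of each element, tails are skipped.
without_tails = list(root.itertext(with_tail=False))

# Joining the pieces gives the full text content of the subtree.
joined = "".join(root.itertext())

print(with_tails, without_tails, joined)
```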

3.3.3 findtext

  • Returns the .text content of the first matching element. If there is a match but it has no .text content, an empty string is returned. If no element matches, None is returned, or the value of the default argument if one is given.
  • findtext(self, path, default=None, namespaces=None)
  • Finds text for the first matching subelement, by tag name or path.
  • The optional namespaces argument accepts a prefix-to-namespace mapping that allows the usage of XPath prefixes in the path expression.
text = xml_root.findtext('Accounts')  # text of the first element matching Accounts
print('text = ', text)
print('len(text)= ', len(text))
print(type(text))
text =  
        
len(text)=  9
<class 'str'>
text = xml_root.findtext('Accounts/Account/Strategy')
print('text = ', text)
print('len(text)= ', len(text))
print(type(text))
text =  
len(text)=  0
<class 'str'>
text = xml_root.findtext('Constants')
print('text = ', text)
print('len(text)= ', len(text))
print(type(text))
text =  this is Constants
len(text)=  17
<class 'str'>
print(xml_root.xpath('Accounts/Account/Strategy//text()'))
['this is text\n                ', '\n                same text\n            ', '\n                ', '\n                ', '\n            ']
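
The three possible outcomes of findtext (text found, empty match, no match) can be put side by side. The element names in this sketch are invented for the example.

```python
from lxml import etree

# Hypothetical document, used only for this illustration.
root = etree.XML("<cfg><name>demo</name><empty/></cfg>")

found = root.findtext("name")                     # matching element with text
empty = root.findtext("empty")                    # matching element without text
missing = root.findtext("missing")                # no matching element
fallback = root.findtext("missing", default="n/a")  # default replaces None

print(repr(found), repr(empty), repr(missing), repr(fallback))
```

Testing the result with `is None` rather than for truthiness keeps "no such element" apart from "element without text".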

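3.3.4 The tail property makes the text after a standalone tag accessible

  • XML tags normally come in pairs, an opening tag and a closing tag, but HTML in particular may contain standalone tags, such as the <br/> in this snippet:

<html><body>Text<br/>Tail</body></html>
  • The Element class provides the tail property to get the text that follows such a tag.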
html = etree.Element('html')
body = etree.SubElement(html, 'body')
body.text = 'Text'
etree.dump(html)
<html>
  <body>Text</body>
</html>
br = etree.SubElement(body, 'br')
etree.dump(html)
<html>
  <body>Text<br/></body>
</html>
# tail only appends text after this tag
br.tail = 'Tail'
etree.dump(br)
<br/>Tail
etree.tostring(html)
b'<html><body>Text<br/>Tail</body></html>'
# Passing method='text' to tostring drops the tags and outputs all the text
etree.tostring(html, method='text')  # method defaults to 'xml'
b'TextTail'

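3.3.5 Using XPath

# Option 1: drop the tags and return the text as a single string
html.xpath('string()')
'TextTail'
# Option 2: return a list, split at the standalone tag
html.xpath('//text()')
['Text', 'Tail']
# Each item in the list from option 2 carries information about the node it belongs to and about its text type: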
texts = html.xpath('//text()')
texts[0]
'Text'
type(texts[0])
lxml.etree._ElementUnicodeResult
etree.iselement(texts[0])  # check whether it is an Element object
False
# The node it belongs to
parent = texts[0].getparent() 
parent.tag
'body'
print(texts[1], texts[1].getparent().tag)
Tail br
# Text type: regular text or tail text
print(texts[0].is_text)
True
print(texts[1].is_text)
False
print(texts[1].is_tail)
True

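3.4 Parsing, output and writing an ElementTree

This part covers how to parse XML into Element objects and how to write Element objects back out as XML.

3.4.1 Parsing

The three functions commonly used for parsing are fromstring, XML and HTML. All of them take a string as their argument.

xml_data = '<root>data</root>'
  • fromstring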
root1 = etree.fromstring(xml_data)
root1.tag
'root'
etree.tostring(root1)
b'<root>data</root>'
  • XML, which works essentially the same as fromstring
root2 = etree.XML(xml_data)
print(root2.tag)
root
print(etree.tostring(root2))
b'<root>data</root>'
  • HTML, which adds the <html> and <body> tags automatically if they are missing
root3 = etree.HTML(xml_data)
print(root3.tag)
html
print(etree.tostring(root3))
b'<html><body><root>data</root></body></html>'
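
What happens with malformed input is not covered above. As a sketch: the XML parser raises etree.XMLSyntaxError, while the HTML parser tries to repair the markup. The input strings are made up for the example.

```python
from lxml import etree

# Bytes input works as well as str input.
root = etree.fromstring(b"<root>data</root>")

# Malformed XML (here <a> is never closed) raises etree.XMLSyntaxError.
try:
    etree.fromstring("<root><a></root>")
    error = None
except etree.XMLSyntaxError as exc:
    error = exc

# The HTML parser is tolerant: the unclosed <p> tags are repaired.
html = etree.HTML("<p>one<p>two")
paragraphs = [p.text for p in html.findall(".//p")]

print(root.text, type(error).__name__, paragraphs)
```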

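3.4.2 Output

Output is simply the tostring function used throughout. Two more of its parameters are xml_declaration, which adds the XML declaration, and encoding, which sets the output encoding.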

root = etree.XML('<root><a><b/></a></root>')
print(etree.tostring(root))
b'<root><a><b/></a></root>'
# XML declaration
print(etree.tostring(root, xml_declaration=True))
b"<?xml version='1.0' encoding='ASCII'?>\n<root><a><b/></a></root>"
# Set the encoding (an encoding other than ASCII or UTF-8 also adds the XML declaration by default)
print(etree.tostring(root, encoding='iso-8859-1'))
b"<?xml version='1.0' encoding='iso-8859-1'?>\n<root><a><b/></a></root>"
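
tostring has a few more commonly used options. The sketch below shows encoding='unicode' (returns str instead of bytes) and pretty_print; it reuses the same small document as above.

```python
from lxml import etree

root = etree.XML("<root><a><b/></a></root>")

# encoding='unicode' returns a str instead of bytes, without a declaration.
as_str = etree.tostring(root, encoding="unicode")

# pretty_print=True adds line breaks and indentation.
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")

# Declaration and encoding can be combined explicitly.
with_decl = etree.tostring(root, xml_declaration=True, encoding="UTF-8")

print(as_str)
print(pretty)
print(with_decl)
```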

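3.4.3 Writing an ElementTree

et = etree.parse('./sample.xml')
# Note: etree.ElementTree().parse('./sample.xml') also loads the file, but it returns the root Element rather than the ElementTree.
# root = etree.ElementTree().parse('./sample.xml')
print(type(et))
et.getroot().set('add_root_attrib', 'attrib_value')  # add or update an attribute on the root node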
etree.dump(et.getroot())
<class 'lxml.etree._ElementTree'>
<TradingAccounts add_root_attrib="attrib_value">
    <Constants ProjectName="DOTA" path="/home/DOTA/Trade" cpu="10"/>
    <Strategies>
        <Strategy name="CTA01" trade="true" commission="flase"/>
        <Strategy name="CTA02" trade="true" commission="flase"/>
        <Strategy name="ALPHA"/>
    </Strategies>
    <Accounts>
        <Account name="RB" max="25" diff="0.01" ip="192.168.1.1" path="/home/RB">
            <Strategy name="CTA01" num="3" prior="1" id="997"/>first strategy
            <Strategy name="CTA02" num="10" prior="2" id="998"/>
        </Account>
        <Account name="i" max="15" diff="0.02" ip="192.168.1.1" path="/home/i">
            <Strategy name="CTA01" num="2" prior="1" id="999">this is text
                <Type id="10" name="FOF"/>
                same text
            </Strategy>
            <Strategy name="CTA02" num="5" prior="2" id="1000"/>
        </Account>
        <Account name="IC" max="3" diff="0.02" ip="192.168.1.2" path="/home/IC">
            <Strategy name="CTA01" num="5" prior="1" id="1001">
                <Commission id="20" rate="0.01"/>
                <Slip param="1"/>
            </Strategy>
            <Strategy name="CTA02" num="6" prior="2" id="1002"/>last strategy
        </Account>
    </Accounts>
</TradingAccounts>
print(type(et))
<class 'lxml.etree._ElementTree'>
et.write('./update_XML.xml')  # write out a new XML file
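
write() accepts the same kind of serialization options as tostring. The sketch below writes to an in-memory buffer instead of a file so that it runs without sample.xml; the document and attribute name are made up.

```python
import io
from lxml import etree

# Parse from an in-memory buffer (stands in for a file on disk).
tree = etree.parse(io.BytesIO(b"<root><a><b/></a></root>"))
tree.getroot().set("added", "yes")

# write() takes a file name or a file-like object.
buffer = io.BytesIO()
tree.write(buffer, encoding="UTF-8", xml_declaration=True, pretty_print=True)
data = buffer.getvalue()

# Reading the result back confirms the change was serialized.
reloaded = etree.parse(io.BytesIO(data))

print(data.decode("utf-8"))
```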

4. A final small example

xml = etree.parse('./sample.xml')  # parse the XML; returns an ElementTree object
print(xml)
print(type(xml))
<lxml.etree._ElementTree object at 0x000000000559D3C8>
<class 'lxml.etree._ElementTree'>
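# Find the root element
print(xml.getroot())
print(xml.getroot().tag)
print(xml.find('TradingAccounts'))  # find() on the ElementTree searches below the root, so the root itself is not found this way
print(xml.getroot())  # this is how to get the root element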
<Element TradingAccounts at 0x5510c48>
TradingAccounts
None
<Element TradingAccounts at 0x5510c48>
# The two calls below are equivalent
print(xml.find('Constants'))  
print(xml.getroot().find('Constants'))
print(xml.find('Constants').tag)
print(xml.getroot().find('Constants').tag)
# Difference between xml and xml.getroot():
print(type(xml), '  <-----VS----->  ', type(xml.getroot()))
# Both ElementTree and Element objects have find and findall methods
<Element Constants at 0x55b8ac8>
<Element Constants at 0x55b8ac8>
Constants
Constants
<class 'lxml.etree._ElementTree'>   <-----VS----->   <class 'lxml.etree._Element'>
# attrib returns the attributes as a dict-like mapping of name-value (key-value) pairs
print(xml.find('Constants').attrib)
print(xml.getroot().find('Constants').attrib)
{'path': '/home/DOTA/Trade', 'cpu': '10', 'ProjectName': 'DOTA'}
{'path': '/home/DOTA/Trade', 'cpu': '10', 'ProjectName': 'DOTA'}
# find(): returns the first matching element, searching the direct children
first_elem = xml.find('Constants')
print('first_elem= ', first_elem)
print(first_elem.tag)
first_elem = xml.find('Strategy')  # no direct child is named Strategy, so None is returned
print('first_elem= ', first_elem)
first_elem=  <Element Constants at 0x55b8208>
Constants
first_elem=  None
search_first_elem = xml.find('.//Strategy')  # first Strategy element anywhere in the tree
print('search_first_elem= ', search_first_elem)
print(search_first_elem.tag)
print('search_first_elem.attrib = ', search_first_elem.attrib)  # attrib returns a dict-like mapping
search_first_elem=  <Element Strategy at 0x55b8a48>
Strategy
search_first_elem.attrib =  {'name': 'CTA01', 'trade': 'true', 'commission': 'flase'}
# First Strategy element anywhere under Accounts; // selects descendants of the current node, / selects direct children
search_elem = xml.find('./Accounts//Strategy')  
print('search_elem= ', search_elem)
print(search_elem.tag)
print('search_elem.attrib = ', search_elem.attrib)
search_elem=  <Element Strategy at 0x55b8ac8>
Strategy
search_elem.attrib =  {'id': '997', 'name': 'CTA01', 'num': '3', 'prior': '1'}
# Value of the name attribute of the first Strategy element under the direct child Strategies
print(xml.find('Strategies').find('Strategy').attrib.get('name'))
CTA01
# findall(): returns a list of all matching elements
all_strategies = xml.findall('.//Strategy')  # every Strategy element in the tree (named so as not to shadow the built-in all)
print('all= ', all_strategies)
print('len(all)=', len(all_strategies))
all_names_1 = [i.get('name') for i in all_strategies]  # i is an Element object
all_names_2 = [i.attrib.get('name') for i in all_strategies]  # i.attrib.get('name') is equivalent to i.get('name')
print('all_names_1 = ', all_names_1)
print('all_names_2 = ', all_names_2)
all=  [<Element Strategy at 0x55b8a48>, <Element Strategy at 0x55b8cc8>, <Element Strategy at 0x55b8fc8>, <Element Strategy at 0x55b8ac8>, <Element Strategy at 0x55b8ec8>, <Element Strategy at 0x55b8f48>, <Element Strategy at 0x55b8dc8>, <Element Strategy at 0x55b8f08>, <Element Strategy at 0x55b8d08>]
len(all)= 9
all_names_1 =  ['CTA01', 'CTA02', 'ALPHA', 'CTA01', 'CTA02', 'CTA01', 'CTA02', 'CTA01', 'CTA02']
all_names_2 =  ['CTA01', 'CTA02', 'ALPHA', 'CTA01', 'CTA02', 'CTA01', 'CTA02', 'CTA01', 'CTA02']
# All direct children of the Strategy elements under Account under the direct child Accounts
child_all_elem = xml.findall('Accounts/Account/Strategy/')
print('child_all_elem = ', child_all_elem)
child_all_elem_tags = [i.tag for i in child_all_elem]
print('child_all_elem_tags = ', child_all_elem_tags)
child_all_elem =  [<Element Type at 0x558ae48>, <Element Commission at 0x558ab88>, <Element Slip at 0x558a9c8>]
child_all_elem_tags =  ['Type', 'Commission', 'Slip']
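
find and findall paths can also filter on attribute values. The sketch below uses a cut-down, made-up version of the Accounts part of sample.xml so that it runs on its own.

```python
from lxml import etree

# Reduced stand-in for the Accounts section of sample.xml (assumed for illustration).
root = etree.XML(
    '<Accounts>'
    '<Account name="RB">'
    '<Strategy name="CTA01" id="997"/><Strategy name="CTA02" id="998"/>'
    '</Account>'
    '<Account name="IC"><Strategy name="CTA01" id="1001"/></Account>'
    '</Accounts>'
)

# [@attr='value'] keeps only elements whose attribute has that value.
cta01_ids = [s.get("id") for s in root.findall(".//Strategy[@name='CTA01']")]

# The same predicate works with find() for a single element.
rb = root.find("Account[@name='RB']")

# xpath() supports full XPath 1.0 and can return attribute values directly.
rb_ids = root.xpath("//Account[@name='RB']/Strategy/@id")

print(cta01_ids, rb.get("name"), rb_ids)
```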

Appendix

XML basics

Mastering lxml in Python (2): wrapping lxml
