Dive Into Python 3 (11): XML

csu_xiji

Category: Python learning

Article link: https://blog.csdn.net/xiji333/article/details/110739688

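Table of Contents

0. Abstract
1. A 5-Minute Crash Course in XML
2. The Structure of an Atom Feed
3. Parsing XML
- 3.1 Elements Are Lists
- 3.2 Attributes Are Dictionaries
4. Searching for Nodes Within an XML Document
5. Going Further with lxml
6. Generating XML
7. Parsing Broken XML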

0. Abstract

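This chapter mostly covers background knowledge about XML. Personally I do not find it especially useful; it can wait until a concrete need comes up.
Most chapters of this book are built around a piece of sample code. The XML chapter is not; it is built around data. The most common use of XML is syndication feeds, which list the latest content of blogs, forums, and other frequently updated websites. Most blog software generates and updates a feed automatically whenever a new article, discussion thread, or blog post is published.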

<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
  <link rel='self' type='application/atom+xml' href='http://diveintomark.org/feed/'/>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Dive into history, 2009 edition</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
    <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
    <updated>2009-03-27T21:56:07Z</updated>
    <published>2009-03-27T17:20:42Z</published>
    <category scheme='http://diveintomark.org' term='diveintopython'/>
    <category scheme='http://diveintomark.org' term='docbook'/>
    <category scheme='http://diveintomark.org' term='html'/>
    <summary type='html'>Putting an entire chapter on one page sounds
    bloated, but consider this &amp;mdash; my longest chapter so far
    would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
    On dialup.</summary>
  </entry>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Accessibility is a harsh mistress</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'/>
    <id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
    <updated>2009-03-22T01:05:37Z</updated>
    <published>2009-03-21T20:09:28Z</published>
    <category scheme='http://diveintomark.org' term='accessibility'/>
    <summary type='html'>The accessibility orthodoxy does not permit people to
      question the value of features that are rarely useful and rarely used.</summary>
  </entry>
  <entry>
    <author>
      <name>Mark</name>
    </author>
    <title>A gentle introduction to video encoding, part 1: container formats</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'/>
    <id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
    <updated>2009-01-11T19:39:22Z</updated>
    <published>2008-12-18T15:54:22Z</published>
    <category scheme='http://diveintomark.org' term='asf'/>
    <category scheme='http://diveintomark.org' term='avi'/>
    <category scheme='http://diveintomark.org' term='encoding'/>
    <category scheme='http://diveintomark.org' term='flv'/>
    <category scheme='http://diveintomark.org' term='GIVE'/>
    <category scheme='http://diveintomark.org' term='mp4'/>
    <category scheme='http://diveintomark.org' term='ogg'/>
    <category scheme='http://diveintomark.org' term='video'/>
    <summary type='html'>These notes will eventually become part of a
      tech talk on video encoding.</summary>
  </entry>
</feed>

1. A 5-Minute Crash Course in XML

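XML is a general-purpose way to describe hierarchically structured data. An XML document contains one or more elements, each delimited by a start tag and an end tag. A start tag `<foo>` on one line followed by its end tag `</foo>` on the next is already a complete (if rather empty) XML document.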

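The first line is the start tag of the `foo` element.
The second line is the matching end tag of the `foo` element. Just as parentheses must balance in prose, mathematics, or code, every start tag must be closed by a matching end tag.
Elements can be nested to any depth. An element `bar` placed inside `foo` is called a child element of `foo`.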
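The rules about balanced tags and nesting can be checked with Python's standard `xml.etree.ElementTree` module. This is only a minimal sketch: the sample document and the helper name `is_well_formed` are assumptions made for illustration, not taken from the original figures.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: a root element foo with one child element bar.
NESTED = "<foo>\n  <bar></bar>\n</foo>"

def is_well_formed(text):
    """Return True if text parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

root = ET.fromstring(NESTED)
children = [child.tag for child in root]   # iterating an element yields its children

print(root.tag, children)
print(is_well_formed(NESTED))              # balanced tags parse
print(is_well_formed("<foo><bar></foo>"))  # bar is never closed, so parsing fails
```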

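The first element in an XML document is called the root element, and every XML document can have only one root element. Two elements placed side by side at the top level, with nothing enclosing them, therefore do not form an XML document, because there would be two "root elements".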
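A short sketch of the single-root rule, again with `xml.etree.ElementTree`. The input strings and the helper name `parse_error_or_none` are hypothetical; the exact wording of the error message comes from the underlying parser and is not assumed here.

```python
import xml.etree.ElementTree as ET

# Assumed sample input: two top-level elements, so there is no single root.
TWO_ROOTS = "<foo></foo>\n<bar></bar>"

def parse_error_or_none(text):
    """Return the parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None

error = parse_error_or_none(TWO_ROOTS)
print(error)    # a ParseError message reported for the second top-level element

# Wrapping both elements in one enclosing element gives the document a single root.
wrapped = parse_error_or_none("<root>" + TWO_ROOTS + "</root>")
print(wrapped)  # None, because the wrapped document parses
```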

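Elements can have attributes, which are name-value pairs. Attributes are listed inside the start tag of an element, separated by whitespace. An attribute name cannot be repeated within a single element, and attribute values must be quoted; either single or double quotes will do.

Consider a `foo` element whose start tag carries `lang='en'`, containing a `bar` element whose start tag carries both an `id` attribute and `lang="fr"`.
The `foo` element has one attribute named `lang`, whose value is `en`.
The `bar` element has two attributes, `id` and `lang`. Its `lang` attribute has the value `fr`, which does not conflict with the `lang` attribute of `foo`: each element has its own independent set of attributes.
If an element has more than one attribute, the order in which they are written does not matter. An element's attributes form an unordered set of key-value pairs, much like a Python dictionary, where values are looked up by key rather than by position. There is also no limit on the number of attributes an element may have.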
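The dictionary analogy can be made concrete with `xml.etree.ElementTree`, which exposes an element's attributes through its `attrib` mapping. The sample document below is an assumed reconstruction, and the `id` value in particular is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample document; the id value is invented for this sketch.
DOC = """<foo lang='en'>
  <bar id='example-id' lang="fr"></bar>
</foo>"""

root = ET.fromstring(DOC)
bar = root[0]            # first child of foo

print(root.attrib)       # attributes of foo as a dict
print(bar.attrib)        # bar has its own, independent attribute set
print(bar.get("lang"))   # look up one attribute by name

# Attribute order carries no meaning: both spellings yield equal dicts.
reordered = ET.fromstring("<bar lang='fr' id='example-id'/>")
print(reordered.attrib == bar.attrib)

# Repeating an attribute name inside one start tag is rejected by the parser.
try:
    ET.fromstring("<foo lang='en' lang='fr'/>")
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)
```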
Elements can have text content, that is, character data placed between the start tag and the end tag.
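In `xml.etree.ElementTree`, text content is available through the `text` attribute of an element. A minimal sketch, assuming a small sample document whose text value is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Assumed sample document: bar carries text content, foo only holds a child.
DOC = "<foo lang='en'><bar lang='fr'>hello</bar></foo>"

root = ET.fromstring(DOC)
bar = root.find("bar")   # first direct child named bar

print(bar.text)          # the character data between <bar> and </bar>
print(root.text)         # None: no text precedes foo's first child
```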

An element that has neither text content nor child elements is called an empty element.

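There is a shorthand for writing empty elements. By putting a `/` character at the end of the start tag, you can omit the end tag altogether. An empty element such as `<foo></foo>`, for instance, can be written as `<foo/>`.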
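That the two spellings describe the same element can be verified with `xml.etree.ElementTree`. The element name `foo` and the helper `is_empty` are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

# Assumed element name foo; both spellings describe the same empty element.
long_form = ET.fromstring("<foo></foo>")
short_form = ET.fromstring("<foo/>")

def is_empty(element):
    """True when an element has no text content and no child elements."""
    return not element.text and len(element) == 0

print(is_empty(long_form), is_empty(short_form))

# By default ElementTree serializes an empty element using the short form,
# so both parsed elements produce identical output.
print(ET.tostring(long_form))
print(ET.tostring(long_form) == ET.tostring(short_form))
```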

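Just as Python functions can be declared in different modules, XML elements can be declared in different namespaces. XML namespaces usually look like URLs. A default namespace is defined with an `xmlns` declaration. A namespace declaration looks much like an attribute, but it serves a different purpose.

Suppose a `feed` element declares `xmlns='http://www.w3.org/2005/Atom'` and contains a `title` element.
The `feed` element is in the namespace `http://www.w3.org/2005/Atom`.
So is the `title` element: a namespace declaration applies not only to the element that declares it but also to all of the elements nested inside it.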
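A brief sketch of how a default namespace shows up when parsing with `xml.etree.ElementTree`, which spells a namespaced tag as `{namespace}localname`. The one-line sample document is an assumed reduction of the Atom feed shown earlier.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Assumed sample document: a default namespace declared on the root element.
DOC = "<feed xmlns='http://www.w3.org/2005/Atom'><title>dive into mark</title></feed>"

root = ET.fromstring(DOC)
title = root[0]

# ElementTree spells a namespaced tag as {namespace}localname.
print(root.tag)
print(title.tag)      # the child inherits the default namespace

# The xmlns declaration is not reported as an ordinary attribute.
print(root.attrib)

# Searching must use the qualified name; the bare local name finds nothing.
print(root.find("{%s}title" % ATOM_NS) is title)
print(root.find("title"))
```

Because the namespace is part of every tag name, code that searches such a document has to include it in each lookup.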
You can also use an `xmlns:prefix` declaration to define a namespace and associate it with the name `prefix`. Every element in that namespace must then explicitly use this prefix (prefix