Running the following commands installed lxml successfully:
apt-get install python2.6-dev
apt-get install libxml2-dev
apt-get install libxslt1-dev
easy_install lxml
In addition, install Python IDLE:
apt-get install idle
- Example: dblp.xml (a fragment of the DBLP data)
- <?xml version='1.0' encoding='utf-8'?>
- <dblp>
- <article mdate="2012-11-28" key="journals/entropy/BellucciFMY08">
- <author>Stefano Bellucci</author>
- <author>Sergio Ferrara</author>
- <author>Alessio Marrani</author>
- <author>Armen Yeranyan</author>
- <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
- <pages>507-555</pages>
- <year>2008</year>
- <volume>10</volume>
- <journal>Entropy</journal>
- <number>4</number>
- <ee>http://dx.doi.org/10.3390/e10040507</ee>
- <url>db/journals/entropy/entropy10.html#BellucciFMY08</url>
- </article>
- <article mdate="2013-03-04" key="journals/entropy/Knuth13">
- <author>Kevin H. Knuth</author>
- <title><i>Entropy</i> Best Paper Award 2013.</title>
- <pages>698-699</pages>
- <year>2013</year>
- <volume>15</volume>
- <journal>Entropy</journal>
- <number>2</number>
- <ee>http://dx.doi.org/10.3390/e15020698</ee>
- <url>db/journals/entropy/entropy15.html#Knuth13</url>
- </article>
- </dblp>
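To parse the XML into a tree structure and get the root of that tree, do the following:
- #!/usr/bin/python
- #-*-coding:utf-8-*-
- from lxml import etree  # import the lxml library
- tree = etree.parse("dblp.xml")  # parse the XML into a tree structure
- root = tree.getroot()  # get the root of the tree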
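As a minimal sketch of the same steps without a file on disk, the snippet below parses a small XML string with `etree.fromstring`, which returns the root element directly instead of a tree object. The sample string and variable names are made up for illustration, the code is written for Python 3 (unlike the Python 2 listings in this article), and the fallback to the standard library's `xml.etree.ElementTree` is an assumption added here so the sketch still runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this article
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

# Hypothetical two-record sample in the shape of dblp.xml
SAMPLE = (
    "<dblp>"
    '<article mdate="2013-03-04" key="journals/entropy/Knuth13">'
    "<author>Kevin H. Knuth</author><year>2013</year>"
    "</article>"
    '<article mdate="2002-01-03" key="persons/Codd71a">'
    "<author>E. F. Codd</author><year>1971</year>"
    "</article>"
    "</dblp>"
)

root = etree.fromstring(SAMPLE)  # fromstring returns the root element itself
child_tags = [child.tag for child in root]

print(root.tag)    # name of the root element
print(len(root))   # number of direct children (article elements)
print(child_tags)  # tag name of each child
```

With a file such as dblp.xml, `etree.parse(...)` followed by `.getroot()` gives the same kind of root element.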
- Example of an XML file that contains a DTD declaration:
- hadoop@hadoop:~/20130722dblpxml$ head -15 dblp.xml
- <?xml version="1.0" encoding="ISO-8859-1"?>
- <!DOCTYPE dblp SYSTEM "dblp.dtd">
- <dblp>
- <article mdate="2002-01-03" key="persons/Codd71a">
- <author>E. F. Codd</author>
- <title>Further Normalization of the Data Base Relational Model.</title>
- <journal>IBM Research Report, San Jose, California</journal>
- <volume>RJ909</volume>
- <month>August</month>
- <year>1971</year>
- <cdrom>ibmTR/rj909.pdf</cdrom>
- <ee>db/labs/ibm/RJ909.html</ee>
- </article>
- </dblp>
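In this case, to parse the XML data into a tree structure and get the root of that tree, the parser has to load the DTD, as follows:
- #!/usr/bin/python
- #-*-coding:utf-8-*-
- from lxml import etree  # import the lxml library
- parser = etree.XMLParser(load_dtd=True)  # first create a parser that loads the DTD (note: the DTD file must be in the same directory as the XML file)
- tree = etree.parse("dblp.xml", parser)  # parse the XML into a tree structure with that parser
- root = tree.getroot()  # get the root of the tree
- for article in root:  # iterate over all children of the root element (the article elements here)
-     print "Element name:", article.tag  # .tag gives the name of the child element
-     for field in article:  # iterate over all children of the article element (author, title, volume, year, and so on)
-         print field.tag, ":", field.text  # likewise, .tag gives the element name and .text gives its content
-     mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute of the article element
-     key = article.get("key")
-     print "mdate:", mdate
-     print "key:", key
-     print ""  # blank line to separate article elements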
3. Example of parsing XML data
The following code parses the dblp.xml data shown at the beginning of this article.
- #!/usr/bin/python
- #-*-coding:utf-8-*-
- from lxml import etree  # import the lxml library
- tree = etree.parse("dblp.xml")  # parse the XML into a tree structure
- root = tree.getroot()  # get the root of the tree
- for article in root:  # iterate over all children of the root element (the article elements here)
-     print "Element name:", article.tag  # .tag gives the name of the child element
-     for field in article:  # iterate over all children of the article element (author, title, volume, year, and so on)
-         print field.tag, ":", field.text  # likewise, .tag gives the element name and .text gives its content
-     mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute of the article element
-     key = article.get("key")
-     print "mdate:", mdate
-     print "key:", key
-     print ""  # blank line to separate article elements
The output is:
- Element name: article
- author : Stefano Bellucci
- author : Sergio Ferrara
- author : Alessio Marrani
- author : Armen Yeranyan
- title : ES
- pages : 507-555
- year : 2008
- volume : 10
- journal : Entropy
- number : 4
- ee : http://dx.doi.org/10.3390/e10040507
- url : db/journals/entropy/entropy10.html#BellucciFMY08
- mdate: 2012-11-28
- key: journals/entropy/BellucciFMY08
- Element name: article
- author : Kevin H. Knuth
- title : None
- pages : 698-699
- year : 2013
- volume : 15
- journal : Entropy
- number : 2
- ee : http://dx.doi.org/10.3390/e15020698
- url : db/journals/entropy/entropy15.html#Knuth13
- mdate: 2013-03-04
- key: journals/entropy/Knuth13
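Instead of printing each field, the same traversal can collect every article into a dictionary, which is often easier to process further. The following is only a sketch under assumptions: the helper name `article_to_record`, the choice of fields, and the shortened sample string are illustrative, the code targets Python 3, and the fallback import of `xml.etree.ElementTree` is added so that it runs without lxml.

```python
try:
    from lxml import etree  # the library used in this article
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

# Shortened, hypothetical excerpt of the dblp.xml fragment shown earlier
SAMPLE = """<dblp>
<article mdate="2012-11-28" key="journals/entropy/BellucciFMY08">
<author>Stefano Bellucci</author>
<author>Sergio Ferrara</author>
<pages>507-555</pages>
<year>2008</year>
</article>
<article mdate="2013-03-04" key="journals/entropy/Knuth13">
<author>Kevin H. Knuth</author>
<pages>698-699</pages>
<year>2013</year>
</article>
</dblp>"""


def article_to_record(article):
    """Collect the attributes and a few child fields of one article element."""
    return {
        "key": article.get("key"),      # attribute value, None if absent
        "mdate": article.get("mdate"),
        "authors": [a.text for a in article.findall("author")],  # repeated field
        "year": article.findtext("year"),  # text of the first year child
    }


root = etree.fromstring(SAMPLE)
records = [article_to_record(article) for article in root]
for record in records:
    print(record)
```

Using `findall` for `author` keeps all authors of an article together, whereas a flat loop over the children treats each one as a separate field.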
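4. Handling elements that contain both sub-elements and text
As the example above shows, the content printed for the title elements is wrong. A title element can contain both sub-elements and text (see below), and .text returns only the text that precedes the first sub-element, so using .text alone does not return the full content of the title. In the example above, the title of the first article came out as just ES, and the title of the second article came out as None, because its content starts with a sub-element.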
- <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
- <title><i>Entropy</i> Best Paper Award 2013.</title>
For title elements, serialize the whole element with etree.tostring so that the text and the sub-elements are printed together:
- #!/usr/bin/python
- #-*-coding:utf-8-*-
- from lxml import etree  # import the lxml library
- tree = etree.parse("dblp.xml")  # parse the XML into a tree structure
- root = tree.getroot()  # get the root of the tree
- for article in root:  # iterate over all children of the root element (the article elements here)
-     print "Element name:", article.tag  # .tag gives the name of the child element
-     for field in article:  # iterate over all children of the article element (author, title, volume, year, and so on)
-         if field.tag == "title":
-             print field.tag, ":", etree.tostring(field, encoding='utf-8', pretty_print=False)  # print the element's text together with its sub-elements
-         else:
-             print field.tag, ":", field.text  # likewise, .tag gives the element name and .text gives its content
-     mdate = article.get("mdate")  # .get("attribute name") returns the value of that attribute of the article element
-     key = article.get("key")
-     print "mdate:", mdate
-     print "key:", key
-     print ""  # blank line to separate article elements
The output is:
- Element name: article
- author : Stefano Bellucci
- author : Sergio Ferrara
- author : Alessio Marrani
- author : Armen Yeranyan
- title : <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
- pages : 507-555
- year : 2008
- volume : 10
- journal : Entropy
- number : 4
- ee : http://dx.doi.org/10.3390/e10040507
- url : db/journals/entropy/entropy10.html#BellucciFMY08
- mdate: 2012-11-28
- key: journals/entropy/BellucciFMY08
- Element name: article
- author : Kevin H. Knuth
- title : <title><i>Entropy</i> Best Paper Award 2013.</title>
- pages : 698-699
- year : 2013
- volume : 15
- journal : Entropy
- number : 2
- ee : http://dx.doi.org/10.3390/e15020698
- url : db/journals/entropy/entropy15.html#Knuth13
- mdate: 2013-03-04
- key: journals/entropy/Knuth13
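If the plain text of a title is wanted rather than its markup, one possible alternative to `etree.tostring` is to join the fragments returned by `itertext()`. The sketch below first shows where each piece of a mixed-content title is stored (`.text` on the parent, `.text` and `.tail` on the child) and then joins them. The helper name `full_text` is hypothetical, the code is written for Python 3, and the fallback import of `xml.etree.ElementTree` is an assumption so that it also runs without lxml.

```python
try:
    from lxml import etree  # the library used in this article
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

title = etree.fromstring(
    "<title>ES<sup>2</sup>: A cloud data storage system "
    "for supporting both OLTP and OLAP.</title>"
)
sup = title[0]  # the <sup> sub-element

print(repr(title.text))  # text before the first sub-element
print(repr(sup.text))    # text inside <sup>
print(repr(sup.tail))    # text after </sup>, stored on the child as .tail


def full_text(element):
    """Join every text fragment below element, dropping the markup."""
    return "".join(element.itertext())


print(full_text(title))

award = etree.fromstring("<title><i>Entropy</i> Best Paper Award 2013.</title>")
print(repr(award.text))  # no text before <i>, so .text is None
print(full_text(award))
```

Whether to keep the markup (`tostring`) or drop it (`itertext`) depends on how the titles are used afterwards; note that dropping the markup turns `ES<sup>2</sup>` into `ES2`.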