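Over the past few days I have been working with the Scrapy crawler framework, which requires parsing and processing the data it returns. Python has good support for XML data; here I use the third-party library lxml to parse XML documents, and along the way I picked up the basics of XPath. XPath uses path expressions to select a node or a set of nodes in an XML document, and these expressions look very much like the paths of an ordinary file system. For beginners I recommend http://www.runoob.com/xpath. There are plenty of other resources online, but this is the tutorial I started with. I will not cover the introduction to XPath or its syntax here, since free tutorials are easy to find. Without further ado, here is the code from a little over two hours of practice. It is all simple, but it should cover the commonly used XPath syntax and functions fairly systematically. I am writing it down today so that I can look it up again when I need it.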
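To make the file-system analogy concrete before the full script, here is a minimal sketch. The one-book document is made up for this illustration and is not the book.xml used below; only `etree.XML` and `xpath` from lxml are used.

```python
from lxml import etree

# A tiny inline document, invented for this illustration only.
doc = etree.XML("<bookstore><book><title>Learning XML</title></book></bookstore>")

# Read the expression like a file path: root element / child / grandchild.
titles = doc.xpath("/bookstore/book/title")

# xpath() returns a list of Element objects; .text holds each element's text.
title_texts = [t.text for t in titles]
print(title_texts)
```

Each step between slashes names an element one level deeper, much like directories in a path.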
#!/usr/bin/python
#-*-coding:utf-8-*-
from lxml import etree
'''
Contents of book.xml:
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
<book category="CHINA">
<title lang="ZN">Learning XML</title>
<author>Chen</author>
<year>2003</year>
<price>36.95</price>
</book>
<book category="JAPAN">
<title>Killing Myself</title>
<author>SSSBBB</author>
<year>2003</year>
<price>36.95</price>
</book>
</bookstore>
'''
def xpath_test(path='book.xml'):
    # Read the file as bytes and let lxml work out the encoding
    with open(path, 'rb') as fp:
        book = fp.read()
    book_source = etree.XML(book)

    # Select every title node in the document
    all_title = book_source.xpath('/bookstore/book/title')
    print('all_title', all_title)
    print('________________________________________________________________________')

    # Select the title of the first book under bookstore
    first_title = book_source.xpath('/bookstore/book[1]/title')
    print('first_title', first_title)

    # Use text() to select the text of every price node
    all_text = book_source.xpath('/bookstore/book/price/text()')
    print('all_text', all_text)

    # Select the price nodes whose value is greater than 35
    price_beyond_35 = book_source.xpath("/bookstore/book[price>35]/price")
    print('price_beyond_35', price_beyond_35)

    # Select the lang attribute of every title
    title_attribute = book_source.xpath('//title/@lang')
    print('title_attribute', title_attribute)

    # Select the titles whose lang attribute equals 'en'
    title_attribute_en = book_source.xpath("//title[@lang = 'en']")
    print('title_attribute_en', title_attribute_en)

    # Select the category of every book
    all_category = book_source.xpath("//book/@category")
    print('all_category', all_category)
    print('________________________________________________________________________')

    # Select every author, written three equivalent ways
    all_author = book_source.xpath("//author")
    print('all_author', all_author)
    all_author2 = book_source.xpath("//book/author")
    print('all_author2', all_author2)
    all_author3 = book_source.xpath("/bookstore/book/author")
    print('all_author3', all_author3)
    print('________________________________________________________________________')

    # Select the category of the book written by J K. Rowling
    one_author = book_source.xpath("/bookstore/book[author='J K. Rowling']/@category")
    print('one_author', one_author)

    # Select the authors and categories of the books published in 2003.
    # Note: category is an attribute, not a child element, so the /category
    # step matches nothing and only the authors come back. Writing /@category
    # instead would select the attribute.
    author_category = book_source.xpath("//book[year=2003]/author | //book[year=2003]/category")
    print('author_category', author_category)

    # The same query for 2005 (the same note about /category applies)
    author_category1 = book_source.xpath("//book[year=2005]/author | //book[year=2005]/category")
    print('author_category1', author_category1)

    # The same query for 2013; no book matches, so the result is empty
    author_category2 = book_source.xpath("//book[year=2013]/author | //book[year=2013]/category")
    print('author_category2', author_category2)

    # Use contains() to select the category of books whose title lang contains 'Z'
    print('________________________________________________________________________')
    countain = book_source.xpath("//book[title[contains(@lang, 'Z')]]/@category")
    print('countain', countain)

    # Use starts-with() to select the titles of books whose category starts with 'J'
    print('________________________________________________________________________')
    start = book_source.xpath("//book[starts-with(@category, 'J')]/title")
    print('start', start)

    # Use not() to select the titles of books whose title has no lang attribute
    start2 = book_source.xpath("/bookstore/book[title[not(@lang)]]/title")
    print('start2', start2)
    print('________________________________________________________________________')

    # Use not() to select the authors of books whose title has no lang attribute
    not_list = book_source.xpath("/bookstore/book[title[not(@lang)]]/author")
    print('not_list', not_list)


if __name__ == '__main__':
    xpath_test()
Below is the output from running the script in a virtual machine:
all_title [<Element title at 0x7f507c271830>, <Element title at 0x7f507c271998>, <Element title at 0x7f507c2719e0>, <Element title at 0x7f507c271908>, <Element title at 0x7f507c271a28>, <Element title at 0x7f507c271bd8>]
________________________________________________________________________
first_title [<Element title at 0x7f507c271830>]
all_text ['30.00', '29.99', '49.99', '39.95', '36.95', '36.95']
price_beyond_35 [<Element price at 0x7f507c271d40>, <Element price at 0x7f507c271ab8>, <Element price at 0x7f507c271c68>, <Element price at 0x7f507c271e18>]
title_attribute ['en', 'en', 'en', 'en', 'ZN']
title_attribute_en [<Element title at 0x7f507c271830>, <Element title at 0x7f507c271998>, <Element title at 0x7f507c2719e0>, <Element title at 0x7f507c271908>]
all_category ['COOKING', 'CHILDREN', 'WEB', 'WEB', 'CHINA', 'JAPAN']
________________________________________________________________________
all_author [<Element author at 0x7f507c2710e0>, <Element author at 0x7f507c271cf8>, <Element author at 0x7f507c271290>, <Element author at 0x7f507c271320>, <Element author at 0x7f507c271368>, <Element author at 0x7f507c271518>, <Element author at 0x7f507c271440>, <Element author at 0x7f507c2714d0>, <Element author at 0x7f507c271560>, <Element author at 0x7f507c2715a8>]
all_author2 [<Element author at 0x7f507c2710e0>, <Element author at 0x7f507c271cf8>, <Element author at 0x7f507c271290>, <Element author at 0x7f507c271320>, <Element author at 0x7f507c271368>, <Element author at 0x7f507c271518>, <Element author at 0x7f507c271440>, <Element author at 0x7f507c2714d0>, <Element author at 0x7f507c271560>, <Element author at 0x7f507c2715a8>]
all_author3 [<Element author at 0x7f507c2710e0>, <Element author at 0x7f507c271cf8>, <Element author at 0x7f507c271290>, <Element author at 0x7f507c271320>, <Element author at 0x7f507c271368>, <Element author at 0x7f507c271518>, <Element author at 0x7f507c271440>, <Element author at 0x7f507c2714d0>, <Element author at 0x7f507c271560>, <Element author at 0x7f507c2715a8>]
________________________________________________________________________
one_author ['CHILDREN']
author_category [<Element author at 0x7f507c271290>, <Element author at 0x7f507c271320>, <Element author at 0x7f507c271368>, <Element author at 0x7f507c271518>, <Element author at 0x7f507c271440>, <Element author at 0x7f507c2714d0>, <Element author at 0x7f507c271560>, <Element author at 0x7f507c2715a8>]
author_category1 [<Element author at 0x7f507c2710e0>, <Element author at 0x7f507c271cf8>]
author_category2 []
________________________________________________________________________
countain ['CHINA']
________________________________________________________________________
start [<Element title at 0x7f507c271bd8>]
start2 [<Element title at 0x7f507c271bd8>]
________________________________________________________________________
not_list [<Element author at 0x7f507c2715a8>]
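Two things stand out in the output above: elements print as `<Element ... at 0x...>` rather than as their text, and the `author_category` results contain only authors. The second point is most likely because `category` is an attribute of `book`, not a child element. The sketch below is one way to handle both, assuming a shortened two-book copy of the data (the `SAMPLE` string and the variable names are mine, not part of the original script).

```python
from lxml import etree

# Shortened copy of two books from book.xml, inlined so the sketch is self-contained.
SAMPLE = b"""<bookstore>
<book category="COOKING"><title lang="en">Everyday Italian</title><author>Giada De Laurentiis</author><year>2005</year></book>
<book category="CHILDREN"><title lang="en">Harry Potter</title><author>J K. Rowling</author><year>2005</year></book>
</bookstore>"""

root = etree.XML(SAMPLE)

# Elements come back as objects; read .text to get something readable.
authors = root.xpath("//book[year=2005]/author")
author_names = [a.text for a in authors]

# A step named "category" looks for a child element, and there is none.
no_match = root.xpath("//book[year=2005]/category")

# Prefixing the name with @ selects the attribute instead.
categories = root.xpath("//book[year=2005]/@category")

print(author_names)
print(no_match)
print(categories)
```

Attribute results come back as strings, so they print directly without needing `.text`.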
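In some places I extracted the same thing in several different ways. I will explain one of these examples here and skip the rest:
start = book_source.xpath("//book[starts-with(@category, 'J')]/title")
print('start', start)
start2 = book_source.xpath("/bookstore/book[title[not(@lang)]]/title")
print('start2', start2)
These two expressions return the same result. The first uses the starts-with() function to select the title of every book whose category attribute starts with 'J'. The second selects the title of every book whose title child has no lang attribute. The book.xml file used in this experiment is pasted at the top of the code. Looking at it, only the last book (category JAPAN, title without a lang attribute) satisfies either condition, and in the output both results show the same title element at the same address, which confirms that the two expressions select the same node.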
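Rather than comparing memory addresses by eye, the check can be done in code. This is a minimal sketch under my own assumptions: the `SAMPLE` string is a shortened copy of the last two books, and I compare the nodes by the path that `getpath()` reports for them, which is just one possible way to do it.

```python
from lxml import etree

# Shortened copy of the last two books from book.xml, inlined for this sketch.
SAMPLE = b"""<bookstore>
<book category="CHINA"><title lang="ZN">Learning XML</title><author>Chen</author></book>
<book category="JAPAN"><title>Killing Myself</title><author>SSSBBB</author></book>
</bookstore>"""

root = etree.XML(SAMPLE)
tree = root.getroottree()

by_prefix = root.xpath("//book[starts-with(@category, 'J')]/title")
by_missing_lang = root.xpath("/bookstore/book[title[not(@lang)]]/title")

# getpath() gives the absolute location of an element within its tree,
# so equal path lists mean both queries selected the same nodes.
paths_by_prefix = [tree.getpath(el) for el in by_prefix]
paths_by_missing_lang = [tree.getpath(el) for el in by_missing_lang]

same_nodes = paths_by_prefix == paths_by_missing_lang
print(same_nodes, [el.text for el in by_prefix])
```

The two expressions agree only because of how this particular data is laid out. A book with a 'J' category and a title carrying a lang attribute would make them differ.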