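I've been writing a crawler recently and picked up the basics of XPath along the way, so here is a small example as practice.
First, the XPath path syntax.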
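As a rough illustration of the basic path expressions (".", "child", "//", "..", and attribute predicates), here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs without extra installs; ElementTree only supports a subset of XPath, and the inlined XML is a shortened, hypothetical version of the sample document used in this post.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the sample document, inlined for the demo.
XML = """
<superhero>
  <class>
    <name lang="en">Tony stark</name>
    <alias>Iron man</alias>
  </class>
  <class>
    <name lang="en">Steven Rogers</name>
    <alias>Captain America</alias>
  </class>
</superhero>
"""

root = ET.fromstring(XML)  # root is the <superhero> element

# "." selects the current node
current = root.find(".")

# "./child" selects direct children with that tag name
classes = root.findall("./class")

# "//" selects matching descendants at any depth
names = root.findall(".//name")

# "[@attr='value']" keeps only nodes whose attribute has that value
english_names = root.findall(".//name[@lang='en']")

# ".." steps up to the parent: here, every element that has an <alias> child
parents_of_alias = root.findall(".//alias/..")

print(current.tag)
print(len(classes))
print([n.text for n in names])
print([n.get("lang") for n in english_names])
print([p.tag for p in parents_of_alias])
```

The same expressions can be passed to Scrapy's Selector.xpath(), which additionally understands things ElementTree does not, such as text() and selecting attributes with @lang.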
Next, write an XML file to use as the document to extract from:
<superhero>
<class>
	<name lang="en">Tony stark</name>
	<alias>Iron man</alias>
	<sex>male</sex>
	<birthday>1969</birthday>
	<age>47</age>
</class>
<class>
	<name lang="en">Peter Benjamin Parker</name>
	<alias>Spider Man</alias>
	<sex>male</sex>
	<birthday>unknown</birthday>
	<age>unknown</age>
</class>
<class>
	<name lang="en">Steven Rogers</name>
	<alias>Captain America</alias>
	<sex>male</sex>
	<birthday>19200704</birthday>
	<age>96</age>
</class>
</superhero>
Now try a fairly rough XPath script against it:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector

# Read the whole XML file into a string
with open('./superHero.xml', 'r') as fp:
    body = fp.read()

# Selector(text=...) parses the text as HTML by default,
# which is why the result below is wrapped in <body>
content = Selector(text=body).xpath('./*').extract()
print(content)
print("#######################")
# Content of the first class
content = Selector(text=body).xpath("//class[1]").extract()
print(content)
print("#######################")
# Content of the last class
content = Selector(text=body).xpath("//class[last()]").extract()
print(content)
print("#######################")
# Collect the name nodes whose lang attribute is "en"
content = Selector(text=body).xpath("//name[@lang='en']").extract()
print(content)
print("#######################")
# Text of the second class's name node
# (last()-1 is the second-to-last class, which is the second of the three here)
content = Selector(text=body).xpath("//class[last()-1]/name/text()").extract()
print(content)
print("#######################")
Output:
['<body><superhero>\n<class>\n\t<name lang="en">Tony stark</name>\n\t<alias>Iron man</alias>\n\t<sex>male</sex>\n\t<birthday>1969</birthday>\n\t<age>47</age>\n</class>\n<class>\n\t<name lang="en">Peter Benjamin Parker</name>\n\t<alias>Spider Man</alias>\n\t<sex>male</sex>\n\t<birthday>unknown</birthday>\n\t<age>unknown</age>\n</class>\n<class>\n\t<name lang="en">Steven Rogers</name>\n\t<alias>Captain America</alias>\n\t<sex>male</sex>\n\t<birthday>19200704</birthday>\n\t<age>96</age>\n</class>\n</superhero></body>']
#######################
['<class>\n\t<name lang="en">Tony stark</name>\n\t<alias>Iron man</alias>\n\t<sex>male</sex>\n\t<birthday>1969</birthday>\n\t<age>47</age>\n</class>']
#######################
['<class>\n\t<name lang="en">Steven Rogers</name>\n\t<alias>Captain America</alias>\n\t<sex>male</sex>\n\t<birthday>19200704</birthday>\n\t<age>96</age>\n</class>']
#######################
['<name lang="en">Tony stark</name>', '<name lang="en">Peter Benjamin Parker</name>', '<name lang="en">Steven Rogers</name>']
#######################
['Peter Benjamin Parker']
#######################
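The positional predicates used above ([1], [last()], [last()-1]) are 1-based, and the same selection pattern extends naturally to turning each class node into a record. Below is a minimal sketch of both ideas. It again uses xml.etree.ElementTree rather than Scrapy so that it is self-contained, the XML is a trimmed, hypothetical copy of the sample file, and to_record is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the sample file (only three fields per hero).
XML = """
<superhero>
  <class>
    <name lang="en">Tony stark</name>
    <alias>Iron man</alias>
    <age>47</age>
  </class>
  <class>
    <name lang="en">Peter Benjamin Parker</name>
    <alias>Spider Man</alias>
    <age>unknown</age>
  </class>
  <class>
    <name lang="en">Steven Rogers</name>
    <alias>Captain America</alias>
    <age>96</age>
  </class>
</superhero>
"""

root = ET.fromstring(XML)


def to_record(node):
    """Collect the child fields of one <class> element into a dict."""
    return {child.tag: (child.text or "").strip() for child in node}


# Select every <class> node, then read its children relative to that node.
records = [to_record(node) for node in root.findall("./class")]

# XPath positions start at 1, so [1] is the first match, not the second.
first = root.find("./class[1]/alias").text
last = root.find("./class[last()]/alias").text
# last()-1 counts back from the end: the second-to-last <class>.
second_to_last = root.find("./class[last()-1]/alias").text

for record in records:
    print(record)
print(first, last, second_to_last)
```

Note that last()-1 only happens to mean "the second class" because the document has exactly three of them; with more entries, class[2] would be the safer way to ask for the second one.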