Using XPath
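XPath is widely used in web scrapers to parse HTML. Its syntax is simple and it runs fast, which makes it popular. Scrapers commonly parse HTML in one of three ways: XPath, regular expressions (re), or the bs4 module. Of the three, XPath is often considered the most efficient.
In an XPath expression, / separates levels of the hierarchy, and a leading / starts at the root node.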
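As a minimal sketch of how the leading / anchors a path at the root, the snippet below uses a tiny made-up document (the elements a, b and c and the variable doc are hypothetical, chosen only for illustration) and assumes lxml is installed.

```python
from lxml import etree

# Hypothetical two-level document, used only for illustration.
doc = etree.XML("<a><b>x</b><c><b>y</b></c></a>")

# The leading / starts at the document root, so the first step
# must name the root element <a>.
print(doc.xpath("/a/b/text()"))    # ['x']
print(doc.xpath("/a/c/b/text()"))  # ['y']

# <b> is not the root element, so this absolute path matches nothing.
print(doc.xpath("/b"))             # []
```

Each additional / descends exactly one level, which is why /a/b finds only the direct child and not the b nested inside c.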
The examples below use XPath to parse the following sample document.
<book>
<id>1</id>
<name>野花追地香</name>
<price>1.23</price>
<nick>臭豆腐</nick>
<author>
<nick id="10086">周大强</nick>
<nick id="10010">周芷若</nick>
<nick class="joy">周杰伦</nick>
<nick class="jolin">蔡依林</nick>
<div>
<nick>热1</nick>
</div>
<span>
<nick>热2</nick>
</span>
</author>
<partner>
<nick id='ppc'>盼盼东</nick>
<nick id='ppbc'>你好</nick>
</partner>
</book>
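from lxml import etree
xml = """..."""  # the XML document shown above, as a string
tree = etree.XML(xml)  # parse the string; returns the root <book> element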
result = tree.xpath('/book/name')  # returns a list of matching Element objects
>>> [<Element name at 0x11b7f8c30>]
result1 = tree.xpath('/book/name/text()')  # text() extracts the node's text
>>> ['野花追地香']
result2 = tree.xpath('/book/author/nick/text()')
>>> ['周大强','周芷若','周杰伦','蔡依林']
result3 = tree.xpath('/book/author//nick/text()')  # // matches descendants at any depth
>>> ['周大强','周芷若','周杰伦','蔡依林','热1','热2']
result4 = tree.xpath("/book/author/*/nick/text()")  # * is a wildcard matching any element
>>> ['热1','热2']
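The sample document also carries id and class attributes that the queries above never touch. The following is a rough sketch of how they could be used, assuming lxml is installed; the attribute predicates and the variable names by_id, by_class and partner_ids are additions for illustration, not part of the original examples. The document is inlined so the script runs on its own.

```python
from lxml import etree

# The sample document, inlined so this script is self-contained.
xml = """<book>
    <id>1</id>
    <name>野花追地香</name>
    <price>1.23</price>
    <nick>臭豆腐</nick>
    <author>
        <nick id="10086">周大强</nick>
        <nick id="10010">周芷若</nick>
        <nick class="joy">周杰伦</nick>
        <nick class="jolin">蔡依林</nick>
        <div>
            <nick>热1</nick>
        </div>
        <span>
            <nick>热2</nick>
        </span>
    </author>
    <partner>
        <nick id="ppc">盼盼东</nick>
        <nick id="ppbc">你好</nick>
    </partner>
</book>"""

tree = etree.XML(xml)

# [@attr='value'] keeps only the nodes whose attribute equals the value.
by_id = tree.xpath("/book/author/nick[@id='10086']/text()")
by_class = tree.xpath("/book/author/nick[@class='joy']/text()")

# A trailing /@attr returns the attribute values instead of the text.
partner_ids = tree.xpath("/book/partner/nick/@id")

print(by_id)        # ['周大强']
print(by_class)     # ['周杰伦']
print(partner_ids)  # ['ppc', 'ppbc']
```

Predicates like these are handy when several sibling nodes share a tag name, as the nick elements do here, and position alone would be a fragile way to pick one.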