Python web scraping: getting the XPath of every node in a DOM tree

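When you scrape web pages with Python and parse them into a DOM tree, you sometimes need the XPath of every node in that tree. The code below shows two ways to get it: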

1. Generating the XPath of each DOM node

Method 1 (lxml's built-in getpath()):

from lxml import etree

doc='''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html' id="xxx">Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <h5>test</h5>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   <a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
  </div>
 </body>
</html>
'''
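html = etree.HTML(doc)         # parse the HTML string; returns the root <html> element
tree = html.getroottree()      # the ElementTree that owns the root; it provides getpath()
all_nodes = html.xpath('//*')  # every element, in document order
xpath = [tree.getpath(node) for node in all_nodes]
for node, path in zip(all_nodes, xpath):
    print("{}:{}".format(node.tag, path))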
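A quick way to check the paths from Method 1 is to evaluate each one against the tree and confirm that it selects exactly the element it was generated from. The sketch below assumes a shortened version of the sample document; the name path_to_node is only illustrative and does not come from the original code.

```python
from lxml import etree

# Shortened sample document (an assumption made for this sketch).
doc = """
<html>
 <body>
  <div id="images">
   <a href="image1.html">Name: My image 1 <br/><img src="image1_thumb.jpg"/></a>
   <h5>test</h5>
   <a href="image2.html">Name: My image 2 <br/><img src="image2_thumb.jpg"/></a>
  </div>
 </body>
</html>
"""

root = etree.HTML(doc)
tree = root.getroottree()

# Map every generated path to its element. Comments and processing
# instructions have a non-string tag, so they are skipped.
path_to_node = {
    tree.getpath(node): node
    for node in root.iter()
    if isinstance(node.tag, str)
}

# Round trip: each path must select exactly the element it came from.
for path, node in path_to_node.items():
    matches = tree.xpath(path)
    assert len(matches) == 1 and matches[0] is node

print(sorted(path_to_node))
```

If every assertion passes, the generated paths are unambiguous for this document, which is what you want before storing them as selectors.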

Method 2 (manual breadth-first traversal):

import collections
from lxml import etree

doc='''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html' id="xxx">Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <h5>test</h5>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   <a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
  </div>
 </body>
</html>
'''
html = etree.HTML(doc)
all_nodes = html.xpath('/html')  # holds every node of the DOM tree, in breadth-first order
start = 0
end = len(all_nodes)
xpath = ['/' + str(all_nodes[0].tag)]  # xpath[i] is the path of all_nodes[i]
while start < end:
    # all_nodes[start:end] is the current tree level
    for i in range(start, end):
        c_nodes = list(all_nodes[i])  # child nodes of all_nodes[i]
        # Tags that occur more than once among the children get a positional index, starting at 1
        tmp_tag_count = {key: 1 for key, val in collections.Counter(node.tag for node in c_nodes).items()
                         if val > 1}
        all_nodes.extend(c_nodes)
        tmp_xpath = xpath[i]
        for node in c_nodes:
            if node.tag in tmp_tag_count:
                xpath.append(tmp_xpath + '/' + node.tag + '[' + str(tmp_tag_count[node.tag]) + ']')
                tmp_tag_count[node.tag] += 1
            else:
                xpath.append(tmp_xpath + '/' + node.tag)
    # Move on to the level that was just appended
    start = end
    end = len(all_nodes)

for node, path in zip(all_nodes, xpath):
    print("{} {}".format(node.tag, path))

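The result of Method 2 is shown below (Method 1 yields the same paths, listed in document order and separated from the tag by a colon):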

html /html
head /html/head
body /html/body
base /html/head/base
title /html/head/title
div /html/body/div
a /html/body/div/a[1]
h5 /html/body/div/h5
a /html/body/div/a[2]
a /html/body/div/a[3]
a /html/body/div/a[4]
a /html/body/div/a[5]
a /html/body/div/a[6]
br /html/body/div/a[1]/br
img /html/body/div/a[1]/img
br /html/body/div/a[2]/br
img /html/body/div/a[2]/img
br /html/body/div/a[3]/br
img /html/body/div/a[3]/img
br /html/body/div/a[4]/br
img /html/body/div/a[4]/img
br /html/body/div/a[5]/br
img /html/body/div/a[5]/img
span /html/body/div/a[6]/span
br /html/body/div/a[6]/br
img /html/body/div/a[6]/img
h5 /html/body/div/a[6]/span/h5
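
The traversal in Method 2 can also be packaged as a reusable function and checked against getpath() from Method 1. The following is a minimal sketch, not the original author's code: the function name bfs_xpaths and the small test document are assumptions made for illustration. It additionally skips comment nodes, whose tag is not a string and would otherwise break the string concatenation.

```python
import collections
from lxml import etree

def bfs_xpaths(root):
    """Return (element, xpath) pairs in breadth-first order (hypothetical helper)."""
    result = [(root, '/' + root.tag)]
    queue = collections.deque(result)
    while queue:
        parent, parent_path = queue.popleft()
        # Keep only real elements; comments and processing instructions
        # have a non-string tag.
        children = [c for c in parent if isinstance(c.tag, str)]
        totals = collections.Counter(c.tag for c in children)
        seen = collections.Counter()
        for child in children:
            seen[child.tag] += 1
            path = parent_path + '/' + child.tag
            # Add a 1-based index only when the tag repeats among siblings.
            if totals[child.tag] > 1:
                path += '[{}]'.format(seen[child.tag])
            pair = (child, path)
            result.append(pair)
            queue.append(pair)
    return result

# Small test document (an assumption made for this sketch).
doc = ("<html><body><div><a>1</a><!-- note --><h5>t</h5>"
       "<a><span><h5>x</h5></span>2</a></div></body></html>")
root = etree.HTML(doc)
tree = root.getroottree()
pairs = bfs_xpaths(root)

for node, path in pairs:
    # The hand-built path should agree with lxml's own answer.
    assert path == tree.getpath(node)
    print(node.tag, path)
```

Using a queue gives the same level-by-level order as the start/end bookkeeping in Method 2, and the comparison with getpath() serves as a cheap regression check if you later change how the indices are built.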

References
  1. https://blog.csdn.net/together_cz/article/details/74015599