Locating Elements with XPath


牧夫


Tags: xml, xpath, web crawler

Article link: https://blog.csdn.net/bob71/article/details/78500803


XPath is a powerful tool for locating elements in XML documents, and it is used heavily in web crawlers.

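Take the following document as an example (a table of company details; the first row holds the company name, 公司名称, and the company type, 公司类型):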

<table width="98%" cellpadding="3" cellspacing="1" align="center"> 
  <tbody>
    <tr> 
      <td width="90" class="f_b">公司名称：</td>  
      <td width="260">广州xxx有限公司</td>  
      <td width="90" class="f_b">公司类型：</td>  
      <td width="260">个体经营 (服务商)</td>
    </tr>  
    <tr> 
      <td class="f_b">所 在 地：</td>  
      <td>广东/广州市</td>  
      <td class="f_b">公司规模：</td>  
      <td/>
    </tr>  
    <tr> 
      <td class="f_b">注册资本：</td>  
      <td>3000万</td>  
      <td class="f_b">注册年份：</td>  
      <td>2004</td>
    </tr>
  </tbody>
</table>

The company name can be read with the following XPath expression:

//tr[1]/td[2]/text()
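As a rough sketch of how this looks in code, the snippet below runs the same path against a trimmed copy of the table using Python's standard-library xml.etree.ElementTree. ElementTree implements only a subset of XPath and has no text() step, so the example selects the <td> element and reads its .text attribute instead. The helper name cell_text is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table above (layout attributes dropped).
TABLE = """
<table>
  <tbody>
    <tr>
      <td class="f_b">公司名称：</td>
      <td>广州xxx有限公司</td>
      <td class="f_b">公司类型：</td>
      <td>个体经营 (服务商)</td>
    </tr>
    <tr>
      <td class="f_b">所 在 地：</td>
      <td>广东/广州市</td>
      <td class="f_b">公司规模：</td>
      <td/>
    </tr>
  </tbody>
</table>
""".strip()

root = ET.fromstring(TABLE)


def cell_text(root, row, col):
    """Return the text of the col-th <td> in the row-th <tr> (1-based), or None."""
    cell = root.find(f".//tr[{row}]/td[{col}]")
    return None if cell is None else cell.text


company_name = cell_text(root, 1, 2)   # same cell as //tr[1]/td[2]/text()
company_size = cell_text(root, 2, 4)   # the empty <td/> has no text
print(company_name)
```

Note that the empty cell comes back as None rather than an empty string, so a crawler should be ready for missing values.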

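For a well-structured tree like this one, where the elements also carry attribute values you can use, XPath expressions are easy to write. For a flat tree, where a single level holds many elements and, worse, the elements have no attributes, locating an element is much harder. Take the following document: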

<?xml version="1.0" encoding="utf-8"?>

<form name="Standard" method="POST" action="process_request.php" id="Standard" enctype="multipart/form-data"> 
  <h1>Compact device for low-cost, real-time monitoring of blood coagulation</h1>  
  <br/>  
  <h2>Stanford Reference:</h2>  
  <h3>14-045</h3>  
  <br/>  
  <h2>Abstract</h2>  
  <hr size="1"/>  
  <div id="wrap">Engineers in Prof. James Harris’ laboratory ...</div>  
  <br/>  
  <h3>Applications</h3>
  <br/>  
  <ul>
    <li>
      <b>Patient monitoring</b>- continuous real-time monitoring of blood proteins related to coagulation, with end-user applications such as heart surgery or dialysis
    </li>
    <li>
      <b>Drug delivery</b>- potential for microfluidic system to be adapted to alter dosage of anti-coagulants or other drugs in response to activity or conditions, such as coronary artery thrombosis
    </li>
  </ul>
  <br/>  
  <h3>Advantages</h3>
  <br/>  
  <ul>
    <li>
      <b>Compact and inexpensive:</b>
      <ul/>
    </li>
    <li>on-chip analysis with small, low-cost VCSEL, photodetector and microfluidic chip instead of bulky optics imaging system</li>
    <li>external circuit small enough to attach to standard blood tubes</li>
    <li>enables treatment at point-of-care</li>
  </ul>
  <ul>
    <li>
      <b>Real-time monitoring</b>- microfluidic system enables faster analysis than conventional techniques that typically require at least half an hour for sample preparation and processing
    </li>
    <li>
      <b>Compatible with wireless communication</b>
    </li>
    <li>
      <b>Streamlined data collection and analysis</b>
    </li>
  </ul>
  <br/>  
  <h3>Innovators &amp; Portfolio</h3>
  <br/>  
  <ul>
    <li>James Harris</li>
    <li>Meredith Lee</li>
    <li>Jelena Levi</li>
    <li>James Zehnder</li>
  </ul>
  <br/>  
  <h3>Date Released</h3>11/6/2017 
  <br/>  
  <br/>  
  <h3>Licensing Contact</h3>
  <br/>Scott Elrod, Associate Director 
  <br/>650.725.9409 (Business) 
  <br/>
  <a href="mailto:scott.elrod@stanford.edu">Request Info</a>
  <br/>
  <br/> 
</form>

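Suppose we want to read the release date. How should the XPath be written? There are too many elements on the same level, so counting positions with something like /form/*[position()=10] is very error-prone. Is there a simpler way? Yes: use axes.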

The table below describes the axes.

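Axis name	Result
ancestor	Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self	Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself.
attribute	Selects all attributes of the current node.
child	Selects all children of the current node.
descendant	Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self	Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself.
following	Selects everything in the document after the closing tag of the current node.
following-sibling	Selects all siblings after the current node.
namespace	Selects all namespace nodes of the current node.
parent	Selects the parent of the current node.
preceding	Selects all nodes that appear before the current node in the document, except its ancestors, attribute nodes and namespace nodes.
preceding-sibling	Selects all siblings before the current node.
self	Selects the current node.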
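To make the axis names a bit more concrete, the sketch below emulates three of them on a tiny made-up tree with Python's standard-library xml.etree.ElementTree. ElementTree has no axis syntax, so the helpers (ancestors, descendants_or_self, children, all hypothetical names) only illustrate which element nodes each axis would return. They are not how an XPath engine works internally.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<form><ul><li><b>x</b></li><li>y</li></ul><br/></form>")

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in doc.iter() for child in parent}


def ancestors(elem):
    """ancestor axis: parent, grandparent, ... (nearest first)."""
    result = []
    while elem in parent_of:
        elem = parent_of[elem]
        result.append(elem)
    return result


def descendants_or_self(elem):
    """descendant-or-self axis, element nodes only, in document order."""
    return list(elem.iter())


def children(elem):
    """child axis, element nodes only."""
    return list(elem)


b = doc.find(".//b")
ul = doc.find("ul")
ancestor_tags = [e.tag for e in ancestors(b)]
descendant_tags = [e.tag for e in descendants_or_self(ul)]
child_tags = [e.tag for e in children(doc)]
print(ancestor_tags, descendant_tags, child_tags)
```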

Axes make locating much simpler. For example, the following expression reads the applications and the advantages:

//h3[text()='Applications'] | //h3[text()='Applications']/following-sibling::*[position() < 3] | //h3[text()='Advantages'] | //h3[text()='Advantages']/following-sibling::*[position() < 3]
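Here is a minimal sketch of what that expression selects, written with Python's standard-library xml.etree.ElementTree on a simplified copy of the document (the <b> wrappers inside the list items are dropped). ElementTree does not support the following-sibling axis, so the sibling step is emulated by slicing the parent's child list. A full XPath 1.0 engine, for example the third-party lxml library, should accept the expression as written. The helper names heading_with_following and list_items are made up.

```python
import xml.etree.ElementTree as ET

FORM = """
<form>
  <h3>Applications</h3>
  <br/>
  <ul><li>Patient monitoring</li><li>Drug delivery</li></ul>
  <br/>
  <h3>Advantages</h3>
  <br/>
  <ul><li>Compact and inexpensive</li></ul>
  <ul><li>Real-time monitoring</li></ul>
  <br/>
</form>
""".strip()

form = ET.fromstring(FORM)


def heading_with_following(form, title, below):
    """Emulate  //h3[text()=title] | //h3[text()=title]/following-sibling::*[position() < below]
    for an <h3> that is a direct child of form (element nodes only)."""
    kids = list(form)
    heading = form.find(f"h3[.='{title}']")
    if heading is None:
        return []
    start = kids.index(heading)
    # The heading itself plus the (below - 1) nearest following sibling elements.
    return kids[start:start + below]


def list_items(nodes):
    """Collect the <li> texts found inside the selected nodes."""
    return [li.text for node in nodes for li in node.iter("li")]


applications = heading_with_following(form, "Applications", 3)
advantages = heading_with_following(form, "Advantages", 3)
print([e.tag for e in applications], list_items(applications))
print([e.tag for e in advantages], list_items(advantages))
```

One thing this makes visible: with position() < 3 only the <br/> and the first <ul> after each heading are picked up. In the sample document the Advantages heading is followed by two <ul> lists, so the second one is the third sibling and is left out. Raising the bound to position() < 4 for that heading would include it.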

The following expression reads the inventors:

//h3[text()='Innovators & Portfolio']/following-sibling::*[position() = 2]
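The same idea can be sketched with the standard-library xml.etree.ElementTree, again emulating following-sibling by indexing into the parent's child list because ElementTree has no axis support. The example also shows a detail that is easy to trip over: inside the XML source the ampersand has to be written as &amp;, while the parsed heading text (and the string literal in the XPath expression) contains a plain &.

```python
import xml.etree.ElementTree as ET

# In the XML source the ampersand is escaped as &amp;.
FORM = """
<form>
  <h3>Innovators &amp; Portfolio</h3>
  <br/>
  <ul>
    <li>James Harris</li>
    <li>Meredith Lee</li>
    <li>Jelena Levi</li>
    <li>James Zehnder</li>
  </ul>
  <br/>
  <h3>Date Released</h3>11/6/2017
</form>
""".strip()

form = ET.fromstring(FORM)
kids = list(form)

# After parsing, the heading text contains a plain "&".
heading = form.find("h3[.='Innovators & Portfolio']")

# following-sibling::*[position() = 2]: skip the <br/>, land on the <ul>.
target = kids[kids.index(heading) + 2]
inventors = [li.text for li in target.findall("li")]
print(inventors)
```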

The following expression reads the release date:

//h3[text()='Date Released']/following-sibling::node()[position()=1]

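Note that you must use node() here, not *. The * test matches element nodes only; it does not match text, comment, or processing-instruction nodes. To include those nodes as well, use the node() node test.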
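The difference between node() and * can be sketched with the standard-library xml.etree.ElementTree, with one caveat: ElementTree does not model text as separate nodes. The text that follows an element's closing tag is stored on that element as its .tail attribute, so .tail plays the role of following-sibling::node()[1] here. This is an emulation under that assumption, not a real XPath evaluation.

```python
import xml.etree.ElementTree as ET

FORM = """
<form>
  <h3>Date Released</h3>11/6/2017
  <br/>
  <br/>
  <h3>Licensing Contact</h3>
</form>
""".strip()

form = ET.fromstring(FORM)
kids = list(form)
heading = form.find("h3[.='Date Released']")

# following-sibling::node()[1]: the text node right after </h3>.
# ElementTree keeps that text in the heading's .tail attribute.
release_date = (heading.tail or "").strip()

# following-sibling::*[1]: the first following *element*; the text is skipped.
first_element = kids[kids.index(heading) + 1]

print(release_date)
print(first_element.tag)
```

With * the date would be skipped entirely and the lookup would land on the <br/> element, which is why the expression above uses node().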

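One more thing to watch: on both preceding-sibling and following-sibling, position()=1 is the node closest to the current node. In other words, preceding-sibling counts upward from the current node (reverse document order), while following-sibling counts downward (document order).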
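The counting direction is easy to get wrong, so here is a small sketch on a made-up row of five sibling elements. It uses plain list slicing on the parent's children (standard-library xml.etree.ElementTree has no sibling axes) to list the siblings in the order each axis numbers them.

```python
import xml.etree.ElementTree as ET

row = ET.fromstring("<r><a/><b/><c/><d/><e/></r>")
kids = list(row)
current = row.find("c")
i = kids.index(current)

# following-sibling::* in axis order: document order, nearest first.
following = kids[i + 1:]
# preceding-sibling::* in axis order: reverse document order, nearest first.
preceding = kids[:i][::-1]

following_tags = [e.tag for e in following]
preceding_tags = [e.tag for e in preceding]

# position() = 1 on either axis is the direct neighbour of <c>.
print(following_tags[0], preceding_tags[0])
```

So from <c>, preceding-sibling::*[1] is <b> and preceding-sibling::*[2] is <a>, while following-sibling::*[1] is <d>.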
