XPath Data Extraction (Java Edition) (A Common Technique for Web Crawlers)

盛者无名

Last modified: 2022-03-27 21:14:45

First published: 2022-03-27 21:07:54

Article link: https://blog.csdn.net/weixin_41489136/article/details/123781586

XML

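XML is a markup language used to mark up electronic documents so that they have structure.

Although HTML and XML share a common origin, there are important differences between the two:

Unlike HTML, XML is case-sensitive. For example, <H1> and <h1> are different XML tags.
In HTML, an end tag (</p> or </li>) can be omitted when the context makes it clear where a paragraph or list item ends. In XML, the end tag can never be omitted.
In XML, an element that consists of a single tag with no matching end tag must end with "/", as in <img src="coffeecup.png"/>, so that the parser does not go looking for an end tag.
In XML, attribute values must be enclosed in quotation marks, whereas in HTML the quotation marks are optional. For example, <applet code="MyApplet.class" width=300 height=300> is legal HTML but not legal XML; XML requires the quotes, that is, width="300" height="300".
In HTML, an attribute name may appear without a value, as in <input type="radio" name="language" value="Java" checked>. In XML every attribute must have a value, such as checked="true" or checked="checked".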
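
The differences listed above can be observed directly with the JDK's own XML parser. The following is a minimal sketch, not part of the original article: the class name WellFormedCheck and the helper isWellFormed are hypothetical names chosen for illustration, and the sample strings are small fragments modeled on the examples in the list.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Hypothetical helper: true if the text parses as well-formed XML, false if the parser rejects it.
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness error
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Tag names are case-sensitive, so <H1> is not closed by </h1>.
        System.out.println(isWellFormed("<H1>title</h1>"));
        // End tags such as </li> cannot be omitted.
        System.out.println(isWellFormed("<ul><li>a<li>b</ul>"));
        // An element without an end tag must end with "/".
        System.out.println(isWellFormed("<img src=\"coffeecup.png\"/>"));
        // Attribute values must be quoted.
        System.out.println(isWellFormed("<applet code=\"MyApplet.class\" width=300 height=300></applet>"));
        // Every attribute needs a value.
        System.out.println(isWellFormed("<input type=\"radio\" checked/>"));
        System.out.println(isWellFormed("<input type=\"radio\" checked=\"checked\"/>"));
    }
}
```

Only the third and the last fragment should be accepted; the other four each violate one of the rules above.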

An XML document should begin with a document header:

<?xml version="1.0"?> or <?xml version="1.0" encoding="UTF-8"?>

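XPath is a language for finding information in an XML document; it can be used to traverse the elements and attributes of an XML document. It uses path expressions to select nodes or node sets in the document, and nodes are selected by following a path or steps.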

XPath Syntax

Example (the sample document and syntax tables are taken from RUNOOB.COM):

<?xml version="1.0" encoding="UTF-8"?>
 
<bookstore>
 
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
 
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
 
</bookstore>

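Dependency for parsing XML with XPath in Java (the javax.xml.parsers and javax.xml.xpath packages also ship with the JDK itself, so on a current JDK this dependency is usually not needed):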

<dependency>
    <groupId>javax.xml</groupId>
    <artifactId>jaxp-api</artifactId>
    <version>1.4.2</version>
</dependency>

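Selecting Nodes

Expression	Description
nodename	Selects all child elements named nodename of the current node.
/	Selects from the root node (between steps, selects child nodes)
//	Selects matching nodes in the document starting from the current node, regardless of their position (selects descendant nodes)
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes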
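
The expressions ".", ".." and "@" only make sense relative to a current (context) node. As a rough sketch of that idea, the code below first picks a title element and then evaluates further expressions with that element as the context. The class name ContextNodeDemo and the helper fromFirstTitle are hypothetical, and the sample bookstore document is inlined as a string without whitespace between elements so that the printed text stays compact.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class ContextNodeDemo {
    // The sample document, inlined without whitespace between elements (an assumption of this sketch).
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<bookstore>"
            + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
            + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
            + "</bookstore>";

    // Evaluates the expression with the first <title> element as context node
    // and describes the first node it selects as "name: text".
    static String fromFirstTitle(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            // XPathConstants.NODE asks for a single node instead of a node set.
            Node title = (Node) xPath.evaluate("/bookstore/book/title", doc, XPathConstants.NODE);
            Node hit = (Node) xPath.evaluate(expression, title, XPathConstants.NODE);
            return hit == null ? "(no match)" : hit.getNodeName() + ": " + hit.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromFirstTitle("."));        // the title element itself
        System.out.println(fromFirstTitle(".."));       // its parent, a book element
        System.out.println(fromFirstTitle("@lang"));    // its lang attribute
        System.out.println(fromFirstTitle("../price")); // a sibling reached through the parent
        System.out.println(fromFirstTitle("price"));    // title has no price child
    }
}
```

The same expression can therefore select different nodes depending on which node it is evaluated against, which is why the second argument of evaluate matters.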

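Expressions and results:

Path expression	Result
bookstore	Selects all child elements named bookstore of the current node (with the document as the current node, this is the root bookstore element)
/bookstore	Selects the root element bookstore. A path that starts with "/" always represents an absolute path to an element
/bookstore/book	Selects all book elements that are children of bookstore
//book	Selects all book elements, no matter where they are in the document
bookstore//book	Selects all book elements that are descendants of the bookstore element, no matter where they sit beneath bookstore
//@lang	Selects all attributes named lang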

Java code example:

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) {
        try
        {
            /* I/O part: read the whole file (a single read() into a 1024-char buffer could truncate it) */
            String testtext=new String(Files.readAllBytes(Paths.get("test.xml")),StandardCharsets.UTF_8);

            /* XPath part: parse the text into a DOM document and create an XPath evaluator */
            DocumentBuilderFactory dbf=DocumentBuilderFactory.newDefaultInstance();
            dbf.setValidating(false);
            InputStream inputStream=new ByteArrayInputStream(testtext.getBytes(StandardCharsets.UTF_8));
            DocumentBuilder db=dbf.newDocumentBuilder();
            Document doc=db.parse(inputStream);
            XPathFactory factory=XPathFactory.newInstance();
            XPath xPath=factory.newXPath();

            /* "bookstore": relative path, evaluated with the document as the current node */
            NodeList relativeBookstore=(NodeList) xPath.evaluate("bookstore",doc, XPathConstants.NODESET);
            for(int i=0;i<relativeBookstore.getLength();i++)
            {
                System.out.println(relativeBookstore.item(i).getTextContent());
            }

            /* "/bookstore": absolute path to the root element; print its name (toString() output is implementation-specific) */
            NodeList rootBookstore=(NodeList) xPath.evaluate("/bookstore",doc, XPathConstants.NODESET);
            for(int i=0;i<rootBookstore.getLength();i++)
            {
                System.out.println(rootBookstore.item(i).getNodeName());
            }

            /* "/bookstore/book": book elements that are children of bookstore */
            NodeList childBooks=(NodeList) xPath.evaluate("/bookstore/book",doc, XPathConstants.NODESET);
            for(int i=0;i<childBooks.getLength();i++)
            {
                System.out.println(childBooks.item(i).getTextContent());
            }

            /* "//book": book elements anywhere in the document */
            NodeList allBooks=(NodeList) xPath.evaluate("//book",doc, XPathConstants.NODESET);
            for(int i=0;i<allBooks.getLength();i++)
            {
                System.out.println(allBooks.item(i).getTextContent());
            }

            /* "bookstore//book": book elements that are descendants of bookstore */
            NodeList descendantBooks=(NodeList) xPath.evaluate("bookstore//book",doc, XPathConstants.NODESET);
            for(int i=0;i<descendantBooks.getLength();i++)
            {
                System.out.println(descendantBooks.item(i).getTextContent());
            }

            /* "//@lang": all attributes named lang; getTextContent() returns the attribute value */
            NodeList langAttributes=(NodeList) xPath.evaluate("//@lang",doc, XPathConstants.NODESET);
            for(int i=0;i<langAttributes.getLength();i++)
            {
                System.out.println(langAttributes.item(i).getTextContent());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

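Predicates

Predicates are used to find a specific node, or a node that contains a specific value. A predicate is written inside square brackets:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore
/bookstore/book[last()]	Selects the last book element that is a child of bookstore
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore
//title[@lang]	Selects all title elements that have an attribute named lang
//title[@lang='eng']	Selects all title elements that have a lang attribute with the value eng
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00
/bookstore/book[price>35.00]//title	Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00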
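
To make the predicates in the table concrete, here is a small sketch that runs several of them against the two-book sample document and collects the text of the selected nodes. The class name PredicateDemo and the helper select are hypothetical, the document is inlined as a string for self-containment, and "/title" is appended to the book expressions so that only the titles are printed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class PredicateDemo {
    // The sample document, inlined without whitespace between elements (an assumption of this sketch).
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<bookstore>"
            + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
            + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
            + "</bookstore>";

    // Returns the text content of every node selected by the expression.
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select("/bookstore/book[1]/title"));            // first book
        System.out.println(select("/bookstore/book[last()]/title"));       // last book
        System.out.println(select("/bookstore/book[last()-1]/title"));     // second-to-last book
        System.out.println(select("/bookstore/book[position()<3]/title")); // first two books
        System.out.println(select("//title[@lang='eng']"));                // titles with lang="eng"
        System.out.println(select("/bookstore/book[price>35.00]/title"));  // books priced above 35.00
    }
}
```

Note that positions in XPath start at 1, not 0, so book[1] is the first book. With only two books in the sample, book[last()-1] and book[1] happen to select the same element.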

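Selecting Unknown Nodes

Wildcard	Description
*	Matches any element node
@*	Matches any attribute node
node()	Matches any node of any kind

Expressions and results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element
//*	Selects all elements in the document
//title[@*]	Selects all title elements that have at least one attribute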
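
The wildcards can be tried out in the same way. The sketch below returns the names of the selected nodes rather than their text, which makes it easier to see what each wildcard matched. The class name WildcardDemo and the helper names are hypothetical, and the sample document is again inlined without whitespace between elements, so node() does not pick up whitespace-only text nodes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class WildcardDemo {
    // The sample document, inlined without whitespace between elements (an assumption of this sketch).
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<bookstore>"
            + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
            + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
            + "</bookstore>";

    // Returns the node name of every node selected by the expression.
    static List<String> names(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeName());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(names("/bookstore/*"));        // the two book elements
        System.out.println(names("//*").size());          // 7 elements: bookstore, 2 book, 2 title, 2 price
        System.out.println(names("//title[@*]"));         // titles that carry any attribute
        System.out.println(names("//@*"));                // every attribute in the document
        System.out.println(names("//title/node()"));      // the text nodes inside the titles, named "#text"
    }
}
```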

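Selecting Several Paths

Using the "|" operator in a path expression selects several paths at once.

Path expression	Result
//book/title|//book/price	Selects all title and price elements of book elements
//title|//price	Selects all title and price elements in the document
/bookstore/book/title|//price	Selects all title elements of the book elements of bookstore, plus all price elements in the document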
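
As a brief illustration of the "|" operator, the sketch below evaluates a union expression and a single path side by side. The class name UnionDemo and the helper select are hypothetical, and the sample document is inlined as a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class UnionDemo {
    // The sample document, inlined without whitespace between elements (an assumption of this sketch).
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<bookstore>"
            + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
            + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
            + "</bookstore>";

    // Returns the text content of every node selected by the expression.
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One path alone: only the two titles.
        System.out.println(select("//book/title"));
        // Union of two paths: the two titles and the two prices, four nodes in total.
        System.out.println(select("//book/title|//book/price"));
        // A node matched by both sides of the union is still returned only once.
        System.out.println(select("//price|/bookstore/book/price").size());
    }
}
```

A union is a set of nodes, so overlapping paths do not produce duplicates; the last call returns 2, not 4.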

Java code example:

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) {
        try
        {
            /* I/O part: read the whole file (a single read() into a 1024-char buffer could truncate it) */
            String testtext=new String(Files.readAllBytes(Paths.get("test.xml")),StandardCharsets.UTF_8);

            /* XPath part: parse the text into a DOM document and create an XPath evaluator */
            DocumentBuilderFactory dbf=DocumentBuilderFactory.newDefaultInstance();
            dbf.setValidating(false);
            InputStream inputStream=new ByteArrayInputStream(testtext.getBytes(StandardCharsets.UTF_8));
            DocumentBuilder db=dbf.newDocumentBuilder();
            Document doc=db.parse(inputStream);
            XPathFactory factory=XPathFactory.newInstance();
            XPath xPath=factory.newXPath();

            /* Replace this with any path expression or wildcard from the tables above */
            String expression="//title";
            NodeList nodeList=(NodeList) xPath.evaluate(expression,doc, XPathConstants.NODESET);
            for(int i=0;i<nodeList.getLength();i++)
            {
                System.out.println(nodeList.item(i).getTextContent());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
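
All of the examples above request a node set. When a crawler only needs one value, the same evaluate method can also return a string, a number, or a boolean directly. The following is a minimal sketch under that assumption; the class name ScalarResultDemo and its helper methods are hypothetical, and the sample document is inlined as a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class ScalarResultDemo {
    // The sample document, inlined without whitespace between elements (an assumption of this sketch).
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<bookstore>"
            + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
            + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
            + "</bookstore>";

    static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
    }

    // The two-argument evaluate returns the result converted to a string.
    static String firstTitle() {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            return xPath.evaluate("/bookstore/book[1]/title", parse());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // XPathConstants.NUMBER yields a Double; count() is an XPath function.
    static double bookCount() {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            return (Double) xPath.evaluate("count(//book)", parse(), XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // XPathConstants.BOOLEAN yields a Boolean; a non-empty node set converts to true.
    static boolean hasBookPricedAbove35() {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            return (Boolean) xPath.evaluate("boolean(//book[price>35.00])", parse(), XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTitle());           // Harry Potter
        System.out.println(bookCount());            // 2.0
        System.out.println(hasBookPricedAbove35()); // true
    }
}
```

Requesting a scalar avoids the loop over a NodeList when exactly one value is expected, at the cost of silently getting an empty string or 0.0 when nothing matches.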