Parsing HTML and XML files with Jsoup and JsoupXpath

Nigtunt

Category: xml

Article link: https://blog.csdn.net/weixin_44462294/article/details/104410755

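Introduction to Jsoup

jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. It can also be used to parse XML files.

Main features

Parse HTML from a URL, a file, or a string
Find and extract data using DOM traversal or CSS selectors
Manipulate HTML elements, attributes, and text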
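
As a rough illustration of the features listed above, the following sketch parses HTML from a string, extracts data with a CSS selector, and then modifies an element. The sample markup, the class name `JsoupFeatureSketch`, and both helper methods are made up for this example; only the jsoup calls (`Jsoup.parse`, `select`, `selectFirst`, `attr`, `text`, `outerHtml`) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class JsoupFeatureSketch {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<div id=\"menu\"><a href=\"/a\">First</a><a href=\"/b\">Second</a></div>"
        + "</body></html>";

    /** Parses HTML from a string and collects the text of every link inside #menu. */
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("#menu a[href]");
        List<String> texts = new ArrayList<>();
        for (Element link : links) {
            texts.add(link.text());
        }
        return texts;
    }

    /** Changes the href and text of the first link in #menu and returns its outer HTML. */
    static String rewriteFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst("#menu a");
        first.attr("href", "/home").text("Home");
        return first.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(HTML));
        System.out.println(rewriteFirstLink(HTML));
    }
}
```

Working on a string keeps the sketch independent of the network; the same `select` and `attr` calls apply to a `Document` obtained from a URL or a file.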

1. Parsing XML with Jsoup

<?xml version="1.0"?>
<students>
    <student id="1" sex="男">
        <name>小民</name>
        <age>93</age>
    </student>

    <student id="2" sex="男">
        <name>小泽</name>
        <age>13</age>
    </student>
</students>

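import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class XmlDemo { // class name is illustrative
    public static void main(String[] args) throws IOException {
        // Jsoup.parse(File, String) uses the HTML parser, which is lenient enough for this simple file
        Document parse = Jsoup.parse(new File("src/Jsoup/student.xml"), "utf-8");
        // Get every <student> element
        Elements students = parse.getElementsByTag("student");
        for (Element student : students) {
            // Print each attribute of the element as key=value
            Attributes attributes = student.attributes();
            for (Attribute attribute : attributes) {
                System.out.println(attribute.getKey() + "=" + attribute.getValue());
            }
            // Print the tag name and text of each child element
            Elements children = student.children();
            for (Element c : children) {
                System.out.println(c.tag().getName() + ":" + c.text());
            }
        }
    }
}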

Result

[Screenshot: console output]
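
The file-based `Jsoup.parse` overload used above goes through jsoup's HTML parser, which wraps the content in `html`/`body` elements. When the input should be treated strictly as XML, jsoup also offers `Parser.xmlParser()`. The sketch below shows the difference on an in-memory string; the sample data, the class name `XmlParserSketch`, and the helper methods are assumptions made for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.LinkedHashMap;
import java.util.Map;

class XmlParserSketch {
    // Hypothetical sample data shaped like the student.xml file above.
    static final String XML =
        "<?xml version=\"1.0\"?>"
        + "<students>"
        + "<student id=\"1\"><name>Alice</name><age>20</age></student>"
        + "<student id=\"2\"><name>Bob</name><age>13</age></student>"
        + "</students>";

    /** Maps each student's id attribute to the text of its name child. */
    static Map<String, String> namesById(String xml) {
        // The XML parser keeps the tree as written: no html/head/body wrapper is added.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Map<String, String> result = new LinkedHashMap<>();
        for (Element student : doc.select("student")) {
            result.put(student.attr("id"), student.selectFirst("name").text());
        }
        return result;
    }

    /** Reports whether the parsed tree contains a body element. */
    static boolean hasBody(String xml, boolean useXmlParser) {
        Document doc = useXmlParser
            ? Jsoup.parse(xml, "", Parser.xmlParser())
            : Jsoup.parse(xml);
        return !doc.select("body").isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(namesById(XML));
        System.out.println("HTML parser adds body: " + hasBody(XML, false));
        System.out.println("XML parser adds body: " + hasBody(XML, true));
    }
}
```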
2. Parsing HTML with Jsoup

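import java.io.IOException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlDemo { // class name is illustrative
    public static void main(String[] args) throws IOException {
        System.out.println("Method 1: pass a URL object to parse()");
        URL url = new URL("http://www.baidu.com");
        Document parse = Jsoup.parse(url, 3000); // 3000 ms timeout
        // Print the title of the fetched page
        System.out.println(parse.title());
        System.out.println("Method 2: connect directly with connect()");
        Connection connect = Jsoup.connect("http://www.baidu.com");
        System.out.println(connect.get().title());
    }
}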

Result

[Screenshot: console output]
3. Jsoup CSS selectors

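import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo { // class name is illustrative
    public static void main(String[] args) throws IOException {
        Document document = Jsoup.connect("http://www.baidu.com").get();
        // Select every node that has an href attribute
        System.out.println(document.select("[href]"));
        System.out.println("----------------------------------------------------------");
        // Select every node whose href value matches the regex "baidu", i.e. contains "baidu"
        System.out.println(document.select("[href~=baidu]"));
    }
}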

Partial result

[Screenshot: console output]
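
To see what these selectors match without depending on a live page, here is a small offline sketch that counts the matches for a few selectors against a fixed snippet. The markup, the class name `SelectorSketch`, and the `count` helper are assumptions for this illustration; the selector syntax itself (`[attr]`, `[attr~=regex]`, `[attr*=text]`, `tag.class`, `parent > child`) is jsoup's.

```java
import org.jsoup.Jsoup;

class SelectorSketch {
    // Hypothetical sample markup: two links with href, one anchor without.
    static final String HTML =
        "<div id=\"nav\">"
        + "<a class=\"ext\" href=\"http://www.baidu.com/more\">More</a>"
        + "<a href=\"http://example.com/\">Example</a>"
        + "<a name=\"top\">Anchor without href</a>"
        + "</div>";

    /** Returns how many elements of the sample markup match the given CSS query. */
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "[href]",         // elements that have an href attribute
            "[href~=baidu]",  // href value matches the regular expression "baidu"
            "[href*=baidu]",  // href value contains the substring "baidu"
            "a.ext",          // <a> elements with class "ext"
            "#nav > a"        // <a> elements that are direct children of #nav
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```

For a plain substring test, `[href*=baidu]` states the intent more directly than the regex form `[href~=baidu]`, although both match the same elements here.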

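XPath queries with Jsoup

XPath

XPath is a language for finding information in an XML document.

XPath tutorial for reference: https://www.runoob.com/xpath/xpath-tutorial.html

Four jar packages are required:
[Image: list of the four required jar files]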

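Path expression	Result
bookstore	Selects all bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, no matter where they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, no matter where they are located below bookstore.
//@lang	Selects all attributes named lang.
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore that have a price element with a value greater than 35.00.
/bookstore/book[price>35.00]//title	Selects all title elements of the book elements of bookstore that have a price element with a value greater than 35.00.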
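
The expressions in the table can be tried out with a short program. Note that this sketch deliberately uses the JDK's built-in `javax.xml.xpath` API rather than JsoupXpath, so that it runs without extra jars; the bookstore data, the class name `XPathTableSketch`, and the `select` helper are made up for this illustration.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class XPathTableSketch {
    // Hypothetical sample data modelled on the bookstore example in the table.
    static final String XML =
        "<bookstore>"
        + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
        + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
        + "<book><title lang=\"fr\">Le Petit Prince</title><price>45.00</price></book>"
        + "</bookstore>";

    /** Evaluates an XPath expression against XML and returns the text of each selected node. */
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Parsing or XPath syntax problems are rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "/bookstore/book[1]/title",
            "/bookstore/book[last()]/title",
            "/bookstore/book[position()<3]/title",
            "//title[@lang='eng']",
            "//@lang",
            "/bookstore/book[price>35.00]//title"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + select(expression));
        }
    }
}
```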

Usage test

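import java.io.File;
import java.io.IOException;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.seimicrawler.xpath.JXDocument; // assumes JsoupXpath 2.x; older releases use another package
import org.seimicrawler.xpath.JXNode;

public class XpathDemo { // class name is illustrative
    public static void main(String[] args) throws IOException {
        Document document = Jsoup.parse(new File("src/Jsoup/student.xml"), "utf-8");
        JXDocument jxDocument = JXDocument.create(document);
        // 1. Select every student node
        List<JXNode> jxNodes = jxDocument.selN("//student");
        System.out.println(jxNodes);
        System.out.println("++++++++++++++++++++++++++++++");
        // 2. Select the name of the student whose id is 1
        JXNode jxNode = jxDocument.selNOne("//student[@id='1']/name");
        System.out.println(jxNode.asElement().text());
    }
}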

Result

[Screenshot: console output]
