Java Web Crawler Basics (2): Parsing Data with Jsoup

Introduction to Jsoup

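For someone like me who needs to scrape information from web pages, jsoup takes a lot of the burden away. Its API makes it quick and convenient to extract exactly the content we want.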

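jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.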

Its main features are:

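Parse HTML from a URL, a file, or a string (needed for now)
Find and extract data using DOM methods or CSS selectors (needed for now)
Manipulate HTML elements, attributes, and text (optional)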
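
The three features above can be sketched in a few lines. This is a minimal illustration rather than a complete crawler: the class name `JsoupFeatureSketch` and the sample markup are assumptions made up for the demo, and it only uses `Jsoup.parse(String)`, `select`, `getElementById`, `attr`, and `text`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupFeatureSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><head><title>Demo page</title></head>"
      + "<body><p id=\"intro\" class=\"lead\">Hello <b>jsoup</b></p></body></html>";

    // 1. Parse HTML from a string and read the page title.
    static String title() {
        Document doc = Jsoup.parse(HTML);
        return doc.title();
    }

    // 2. Find an element with a CSS selector and extract its text.
    static String introText() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("p#intro").first().text();
    }

    // 3. Change an attribute and the text of an element,
    //    then report the new class name and text.
    static String modified() {
        Document doc = Jsoup.parse(HTML);
        Element p = doc.getElementById("intro");
        p.attr("class", "changed");
        p.text("Updated");
        return p.className() + ":" + p.text();
    }

    public static void main(String[] args) {
        System.out.println(title());
        System.out.println(introText());
        System.out.println(modified());
    }
}
```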

pom.xml (including the Jsoup dependency)

    <dependencies>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>

        </dependency>


        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.4-alpha1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4-alpha1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/junit/junit -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.7</version>
        </dependency>

    </dependencies>


Parsing HTML with Jsoup

    // Requires: java.io.File, java.net.URL, org.apache.commons.io.FileUtils,
    //           org.jsoup.Jsoup, org.jsoup.nodes.Document, org.junit.Test

    @Test
    public void testUrl() throws Exception {
        // Parse a URL: the first argument is the URL to fetch,
        // the second is the timeout in milliseconds
        Document doc = Jsoup.parse(new URL("http://www.itcast.cn/"), 1000);
        String text = doc.getElementsByTag("title").first().text();
        System.out.println(text);
    }

    @Test
    public void testString() throws Exception {
        // Read the file into a string: the first argument is the File, the second is the charset
        String content = FileUtils.readFileToString(new File("D:\\javaCode\\HttpClient\\src\\main\\java\\test.html"), "utf8");
        // Parse the string directly
        Document doc = Jsoup.parse(content);
        // Get and print the text of the first element with the given tag
        String text = doc.getElementsByTag("title").first().text();
        System.out.println(text);
    }

    @Test
    public void testFile() throws Exception {
        // Parse a File: the second argument is the charset
        Document doc = Jsoup.parse(new File("D:\\javaCode\\HttpClient\\src\\main\\java\\test.html"), "utf8");
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }
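
The tests above depend on a local path and a live site, so here is a rough, self-contained sketch of the string and file variants that runs anywhere. The class name `ParseSourcesSketch` and the sample page are hypothetical; the temporary file stands in for test.html. It also uses `Document.title()`, which returns an empty string when there is no `<title>`, whereas `getElementsByTag("title").first()` returns null in that case.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseSourcesSketch {
    // Hypothetical sample page standing in for test.html.
    static final String HTML =
        "<html><head><title>Sample title</title></head><body></body></html>";

    // Parse an in-memory string.
    static String titleFromString() {
        Document doc = Jsoup.parse(HTML);
        return doc.title();
    }

    // Write the page to a temporary file, then parse the File with an explicit charset.
    static String titleFromFile() {
        try {
            Path tmp = Files.createTempFile("jsoup-sketch", ".html");
            try {
                Files.write(tmp, HTML.getBytes(StandardCharsets.UTF_8));
                Document doc = Jsoup.parse(tmp.toFile(), "UTF-8");
                return doc.title();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleFromString());
        System.out.println(titleFromFile());
    }
}
```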

Finding Elements with DOM Methods

    // Requires: java.io.File, org.jsoup.Jsoup, org.jsoup.nodes.Document,
    //           org.jsoup.nodes.Element, org.jsoup.select.Elements, org.junit.Test

    @Test
    public void testDOM() throws Exception {
        // Parse the file first to obtain the Document
        Document doc = Jsoup.parse(new File("D:\\javaCode\\HttpClient\\src\\main\\resources\\test.html"), "utf8");

        // By id: getElementById
        //Element element = doc.getElementById("city_bj");
        // By tag name: getElementsByTag
        //Element element = doc.getElementsByTag("span").last();
        // By attribute name: getElementsByAttribute
        //Elements elements = doc.getElementsByAttribute("href");
        // By attribute name and value: getElementsByAttributeValue
        Element element = doc.getElementsByAttributeValue("href", "http://yun.itheima.com/map/24.html").first();
        System.out.println("Content found: " + element.text());
    }
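
Since test.html is not included in this post, the following sketch runs the same four lookups against a small inline document. The markup, the example.com links, and the class name `DomLookupSketch` are all assumptions for illustration; only the `city_bj` id and `city_con` class are borrowed from the code above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DomLookupSketch {
    // Hypothetical markup loosely modelled on a city list.
    static final String HTML =
        "<div class=\"city_con\"><ul>"
      + "<li id=\"city_bj\"><a href=\"http://example.com/bj.html\">Beijing</a></li>"
      + "<li><a href=\"http://example.com/sh.html\">Shanghai</a></li>"
      + "<li><span>Guangzhou</span></li>"
      + "</ul></div>";

    static final Document DOC = Jsoup.parse(HTML);

    // By id: returns a single Element (or null if the id does not exist).
    static String byId() {
        return DOC.getElementById("city_bj").text();
    }

    // By tag name: returns Elements; last() picks the final match.
    static String lastSpan() {
        return DOC.getElementsByTag("span").last().text();
    }

    // By attribute name: every element that carries an href attribute.
    static int hrefCount() {
        return DOC.getElementsByAttribute("href").size();
    }

    // By attribute name and value: elements whose href equals the given value.
    static String byAttributeValue() {
        return DOC.getElementsByAttributeValue("href", "http://example.com/sh.html")
                  .first().text();
    }

    public static void main(String[] args) {
        System.out.println(byId());
        System.out.println(lastSpan());
        System.out.println(hrefCount());
        System.out.println(byAttributeValue());
    }
}
```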

Jsoup Selectors (Jsoup's most powerful feature)

Selectors make parsing HTML and extracting data from it much simpler, and they are the reason many people prefer Jsoup.

Basic selector operations

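Select by	select() example	Note
Tag name	select("span")	Write the tag name as-is
id	select("#myspan")	Prefix the id with #
class	select(".myclass")	Prefix the class name with .
Attribute name and value	select("span[class01=value01][class02=value02]")	The pattern is tag[attribute=value]; the tag name is optional, and each additional attribute gets its own []
Attribute name prefix	select("span[^cl]")	Matches span elements that have an attribute whose name starts with cl
Attribute value matching a regular expression	select("span[class~=^AB]")	Matches span elements whose class value starts with AB; do not put a space after ~=, since it would become part of the regular expression
Text content	select("span:contains(3)")	The pattern is tag:contains(text)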
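
To see each row of the table in action, here is a small sketch that applies the selectors to an inline document. The markup, the `data-x` attribute, and the class name `SelectorTableSketch` are assumptions invented for the demo. The prefix example uses `[^data-]` rather than `[^cl]` so that it matches only one element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorTableSketch {
    // Hypothetical markup; ids, classes and attributes are made up for the demo.
    static final String HTML =
        "<span id=\"myspan\" class=\"myclass\">one</span>"
      + "<span class=\"ABC\" data-x=\"1\">two</span>"
      + "<span class=\"other\">item 3</span>"
      + "<p class=\"myclass\">para</p>";

    static final Document DOC = Jsoup.parse(HTML);

    // Number of elements matched by a selector.
    static int count(String css) {
        return DOC.select(css).size();
    }

    // Combined text of all elements matched by a selector.
    static String text(String css) {
        return DOC.select(css).text();
    }

    public static void main(String[] args) {
        System.out.println(count("span"));                    // by tag name
        System.out.println(text("#myspan"));                  // by id
        System.out.println(count(".myclass"));                // by class (any tag)
        System.out.println(text("span[class=ABC][data-x=1]")); // by attribute values
        System.out.println(count("span[^data-]"));            // by attribute name prefix
        System.out.println(text("span[class~=^AB]"));         // by regex on attribute value
        System.out.println(text("span:contains(3)"));         // by text content
    }
}
```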

Combining selectors

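    // Requires: java.io.File, org.jsoup.Jsoup, org.jsoup.nodes.Document,
    //           org.jsoup.nodes.Element, org.jsoup.select.Elements, org.junit.Test

    @Test
    public void testSelector() throws Exception {
        Document doc = Jsoup.parse(new File("D:\\javaCode\\HttpClient\\src\\main\\resources\\test.html"), "gbk");
        // Tag + id: a#abc matches an <a> element whose id is abc
        //Element element = doc.select("a#abc").first();
        // Tag + class: a.class_b matches an <a> element with class class_b
        //Element element = doc.select("a.class_b").first();
        // Tag + attribute name: a[href]; a value can be added too, e.g. a[class=value]
        //Element element = doc.select("a[href]").first();
        // Descendants: .city_con li matches every li anywhere under the element with class city_con
        // Direct children: .city_con>ul>li matches li elements that are direct children
        // of a ul that is itself a direct child of .city_con
        Elements elements = doc.select(".city_con>ul>li");
        for (Element e : elements) {
            System.out.println(e.text());
        }
        // Elements.text() returns the combined text of all matched elements
        System.out.println(elements.text());
    }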
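
The difference between the descendant selector and the child selector is easiest to see on a nested list. The sketch below assumes a made-up document with one nested `ul`; the class name `CombinatorSketch` and the city names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CombinatorSketch {
    // Hypothetical markup: the second li contains a nested list.
    static final String HTML =
        "<div class=\"city_con\"><ul>"
      + "<li>Beijing</li>"
      + "<li>Shanghai<ul><li>Pudong</li></ul></li>"
      + "</ul></div>";

    static final Document DOC = Jsoup.parse(HTML);

    // ".city_con li" matches every li at any depth under .city_con.
    static int descendantCount() {
        return DOC.select(".city_con li").size();
    }

    // ".city_con>ul>li" matches only li elements directly under the first-level ul.
    static int childCount() {
        return DOC.select(".city_con>ul>li").size();
    }

    // ownText() returns only the element's own text, excluding nested children.
    static String directChildNames() {
        List<String> names = new ArrayList<>();
        for (Element li : DOC.select(".city_con>ul>li")) {
            names.add(li.ownText());
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        System.out.println(descendantCount());
        System.out.println(childCount());
        System.out.println(directChildNames());
    }
}
```

Note that `text()` on the second `li` would also include the nested item's text, which is why this sketch uses `ownText()` when listing the direct children.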