Jsoup: an HTML parser for Java

Latest recommended article published on 2024-08-11 22:04:26

ra1nlove

Category: Java

Article link: https://blog.csdn.net/qq_31600497/article/details/50352040

==================================Official site====================================

URL: http://jsoup.org/

The site hosts the documentation and the download links.

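===================================Introduction====================================

jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It offers a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

Main features:

1、Parse HTML from a URL, a file, or a string

2、Find and extract data using DOM traversal or CSS selectors

3、Manipulate HTML elements, attributes, and text

jsoup is released under the MIT license, so it can safely be used in commercial projects.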
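
The three features listed above can be sketched in one short program. This is only a minimal illustration, assuming jsoup is on the classpath; the class name `JsoupOverview` and the sample HTML are made up for this example and do not come from the jsoup documentation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupOverview {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
            "<html><head><title>Demo</title></head>"
          + "<body><div id='content'><a href='/docs'>Docs</a></div></body></html>";

    // Features 1 and 2: parse HTML from a string, then find an element with a CSS selector.
    static String firstLinkText() {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("#content a[href]").first();
        return link.text();
    }

    // Feature 3: set an attribute on an element and read it back.
    static String relAfterUpdate() {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("a").first();
        link.attr("rel", "nofollow");
        return link.attr("rel");
    }

    public static void main(String[] args) {
        System.out.println(firstLinkText());
        System.out.println(relAfterUpdate());
    }
}
```

Parsing from a string keeps the sketch independent of network access; the same selector and attribute calls apply to documents loaded from a URL or a file.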

================================Maven dependency====================================

<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.3</version>
</dependency>

==============================input========================================

1、Parse a document from a String

            String html = "<html><head><title>First parse</title></head>"
                        + "<body><p>Parsed HTML into a doc.</p></body></html>";
            Document doc = Jsoup.parse(html);

2、Parsing a body fragment

            String html = "<div><p>Lorem ipsum.</p>";
            Document doc = Jsoup.parseBodyFragment(html);
            Element body = doc.body();
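
The fragment above leaves its `<div>` unclosed on purpose. As a rough sketch of what `parseBodyFragment` does with such input, the following program checks that the parser closes the `<div>` and places it inside a generated `<body>`. The class and method names are hypothetical, and the exact pretty-printed output of `body().html()` may vary between jsoup versions, so it is printed rather than asserted.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentSketch {
    // Parse the same unclosed fragment used in the snippet above.
    static Document parse() {
        return Jsoup.parseBodyFragment("<div><p>Lorem ipsum.</p>");
    }

    // Number of element children directly under the generated <body>.
    static int bodyChildCount() {
        return parse().body().children().size();
    }

    // Tag name of the first child of <body>.
    static String firstChildTag() {
        return parse().body().child(0).tagName();
    }

    // Text of the paragraph nested directly inside the repaired <div>.
    static String paragraphText() {
        return parse().select("div > p").text();
    }

    public static void main(String[] args) {
        System.out.println(bodyChildCount());
        System.out.println(firstChildTag());
        System.out.println(paragraphText());
        System.out.println(parse().body().html());
    }
}
```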

3、Load a Document from a URL

get

           Document doc = Jsoup.connect("http://example.com/").get();
           String title = doc.title();

post

           Document doc = Jsoup.connect("http://example.com")
              .data("query", "Java")
              .userAgent("Mozilla")
              .cookie("auth", "token")
              .timeout(3000)
              .post();

4、Load a Document from a File

           File input = new File("/tmp/input.html");
           Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

==================================Extracting-data================================

1、Use DOM methods to navigate a document

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

2、Use selector-syntax to find elements

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png

Element masthead = doc.select("div.masthead").first(); // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // a that is a direct child of an h3 with class=r
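
The selectors above depend on a local file that may not exist on the reader's machine. A self-contained sketch with inline markup makes it easier to see what each selector matches. The HTML string and the helper names `count` and `firstText` are invented for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {
    // Hypothetical markup standing in for /tmp/input.html.
    static final String HTML =
            "<div class='masthead'><img src='logo.png'><img src='photo.jpg'></div>"
          + "<h3 class='r'><a href='/first'>First</a></h3>"
          + "<h3 class='r'><span><a href='/nested'>Nested</a></span></h3>"
          + "<a name='anchor'>No href</a>";

    // How many elements a CSS selector matches in the sample markup.
    static int count(String css) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(css).size();
    }

    // Text of the first match, or an empty string when nothing matches.
    static String firstText(String css) {
        Document doc = Jsoup.parse(HTML);
        Element first = doc.select(css).first();
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        System.out.println(count("a[href]"));        // anchors that carry an href attribute
        System.out.println(count("img[src$=.png]")); // images whose src ends in .png
        System.out.println(count("div.masthead"));   // divs with class masthead
        System.out.println(firstText("h3.r > a"));   // only a direct child of h3.r qualifies
    }
}
```

The second `h3` wraps its link in a `span`, so the child combinator `>` skips it, while the plain `a[href]` selector still finds both links.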

3、Extract attributes, text, and HTML from elements

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

4、Working with URLs

Document doc = Jsoup.connect("http://jsoup.org").get();

Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
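
The `abs:` prefix returns the attribute value resolved against the document's base URI. The idea of relative URL resolution can be illustrated with the standard library alone. The sketch below uses `java.net.URI` and is not jsoup's internal implementation; the class name and the `example.com` and `other.org` addresses are assumptions made for the example.

```java
import java.net.URI;

public class AbsoluteUrlSketch {
    // Resolve a possibly relative href against a base URI.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // A root-relative href replaces the whole path of the base.
        System.out.println(resolve("http://jsoup.org", "/")); // http://jsoup.org/
        // A relative path is merged with the base path and normalized.
        System.out.println(resolve("http://example.com/a/b.html", "../c.png"));
        // An href that is already absolute is returned unchanged.
        System.out.println(resolve("http://example.com/a/", "http://other.org/x"));
    }
}
```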

5、Example program: list links

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}

================================Modifying data====================================

1、Set attribute values

doc.select("div.comments a").attr("rel", "nofollow");

2、Set the HTML of an element

Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
div.prepend("<p>First</p>");
div.append("<p>Last</p>");
// now: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

Element span = doc.select("span").first(); // <span>One</span>
span.wrap("<li><a href='http://example.com/'></a></li>");
// now: <li><a href="http://example.com/"><span>One</span></a></li>

3、Setting the text content of elements

Element div = doc.select("div").first(); // <div></div>
div.text("five > four"); // <div>five &gt; four</div>
div.prepend("First ");
div.append(" Last");
// now: <div>First five &gt; four Last</div>
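
The snippets in this section assume a `doc` that already exists. A runnable sketch that builds its own document and applies the same calls might look like the following. The markup, the `note` class, and all class and method names are hypothetical and chosen only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModifySketch {
    // Build a small document and apply the modifications discussed above.
    static Document modified() {
        Document doc = Jsoup.parseBodyFragment(
                "<div class='comments'><a href='/u/1'>user</a></div>"
              + "<div class='note'></div>"
              + "<span>One</span>");

        // Set an attribute on every matched element.
        doc.select("div.comments a").attr("rel", "nofollow");

        // Append parsed HTML as the last child of the comments div.
        doc.select("div.comments").first().append("<p>Last</p>");

        // Wrap the span in new parent elements.
        doc.select("span").first().wrap("<li><a href='http://example.com/'></a></li>");

        // Set plain text; jsoup treats the value as text, not as markup.
        Element note = doc.select("div.note").first();
        note.text("five > four");
        return doc;
    }

    static String relOfCommentLink() {
        return modified().select("div.comments a").attr("rel");
    }

    static String lastParagraphText() {
        return modified().select("div.comments > p").last().text();
    }

    static String wrappedSpanText() {
        return modified().select("li > a > span").text();
    }

    static String noteText() {
        return modified().select("div.note").first().text();
    }

    public static void main(String[] args) {
        System.out.println(modified().body().html());
    }
}
```

Reading the value back with `text()` returns the unescaped string, while the serialized HTML printed by `main` shows how jsoup escapes it on output.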