jsoup Java HTML Parser

jsoup is an open source Java HTML parser that we can use to parse HTML and extract useful information from it. You can also think of jsoup as a web page scraping tool for the Java programming language.

jsoup


The jsoup API can fetch HTML from a URL, or parse it from an HTML string or an HTML file.

Some of the useful features of the jsoup API are:

  1. Scrape HTML from a URL, or read it from a String or a file.
  2. Extract data from HTML through DOM-based traversal or CSS-like selectors.
  3. The jsoup API can be used to edit HTML too.
  4. The jsoup API is self-contained; we don’t need any other jars to use it.

You can download the jsoup jar from its website or, if you are using Maven, add the dependency below.

<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.8.1</version>
</dependency>

Let’s look at the different jsoup features one by one.

jsoup example to load an HTML document from a URL

We can do it with a one-liner, as shown below.

org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect("https://www.journaldev.com").get();
System.out.println(doc.html()); // prints HTML data

jsoup example to parse an HTML document from a String

If we have the HTML data as a String, we can use the code below to parse it.

String source = "<html><head><title>Jsoup Example</title></head>"
		+ "<body><h1>Welcome to JournalDev!!</h1><br />"
		+ "</body></html>";

Document doc = Jsoup.parse(source);
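
Once the string is parsed, the resulting Document can be queried right away. The following is a minimal sketch of that idea; the class name ParseStringSketch and the helper summarize are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringSketch {

	// Hypothetical helper: parses the markup and returns "<title> | <text of the h1 elements>"
	static String summarize(String html) {
		Document doc = Jsoup.parse(html);
		return doc.title() + " | " + doc.select("h1").text();
	}

	public static void main(String[] args) {
		String source = "<html><head><title>Jsoup Example</title></head>"
				+ "<body><h1>Welcome to JournalDev!!</h1><br />"
				+ "</body></html>";
		System.out.println(summarize(source)); // Jsoup Example | Welcome to JournalDev!!
	}
}
```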

jsoup example to load a document from a file

If the HTML data is saved in a file, we can load it using the code below.

Document doc = Jsoup.parse(new File("data.html"), "UTF-8");
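
Jsoup.parse(File, String) throws IOException when the file cannot be read, so real code has to handle that. Here is a self-contained sketch that writes a temporary file first so it can run anywhere; the class name, the helper name and the temporary file prefix are assumptions made for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileSketch {

	// Hypothetical helper: writes the markup to a temporary file, parses it back,
	// and returns the page title
	static String titleFromTempFile(String html) {
		try {
			File file = File.createTempFile("jsoup-sketch", ".html");
			try {
				Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));
				// The second argument is the character set used to decode the file
				Document doc = Jsoup.parse(file, "UTF-8");
				return doc.title();
			} finally {
				file.delete();
			}
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		System.out.println(titleFromTempFile(
				"<html><head><title>From File</title></head><body></body></html>"));
	}
}
```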

Parsing HTML Body Fragment

One of the best features of jsoup is that if we supply an HTML body fragment, it tries hard to generate valid HTML for us, as shown in the example below.

String html = "<div><p>Test Data</p>";
Document doc1 = Jsoup.parseBodyFragment(html);
System.out.println(doc1.html());

The above code prints the following HTML.

<html>
 <head></head>
 <body>
  <div>
   <p>Test Data</p>
  </div>
 </body>
</html>
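
Because the fragment ends up inside a complete document, it can be queried like any other page. A small sketch under that assumption (the class and method names are illustrative only):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentSketch {

	// Hypothetical helper: parses a fragment and returns the text of
	// every <p> that is a direct child of a <div>
	static String paragraphText(String fragment) {
		Document doc = Jsoup.parseBodyFragment(fragment);
		// The fragment is placed inside <body>, so body() is the natural starting point
		return doc.body().select("div > p").text();
	}

	public static void main(String[] args) {
		// The <div> is never closed in the input; jsoup still builds a usable tree
		System.out.println(paragraphText("<div><p>Test Data</p>")); // Test Data
	}
}
```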

Let’s now look at the different methods to extract data from HTML.

Jsoup DOM Methods

Much like a browser builds a DOM, jsoup parses HTML into a Document. A document consists of different elements, and there are many useful methods that we can use to find elements. Some of these methods in Document are:

  1. getElementById(String id)
  2. getElementsByTag(String tag)
  3. getElementsByClass(String className)
  4. getElementsByAttribute(String key)
  5. siblingElements(), firstElementSibling(), lastElementSibling() etc.
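
The finder methods listed above can be tried on a small in-memory document. This is only a sketch: the markup, ids and class names are invented for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomFinderSketch {

	// Made-up markup used only for this illustration
	static final String HTML = "<div id=\"menu\">"
			+ "<a class=\"nav\" href=\"/home\">Home</a>"
			+ "<a class=\"nav\" href=\"/about\">About</a>"
			+ "<span title=\"note\">Info</span>"
			+ "</div>";

	public static void main(String[] args) {
		Document doc = Jsoup.parse(HTML);

		Element menu = doc.getElementById("menu");             // one element, or null if absent
		Elements anchors = menu.getElementsByTag("a");          // the two <a> elements
		Elements navItems = doc.getElementsByClass("nav");      // elements with class "nav"
		Elements titled = doc.getElementsByAttribute("title");  // elements that have a title attribute

		System.out.println(anchors.size());             // 2
		System.out.println(navItems.first().text());    // Home
		System.out.println(titled.first().tagName());   // span
		// Sibling navigation: the last sibling of the first anchor is the <span>
		System.out.println(anchors.first().lastElementSibling().tagName()); // span
	}
}
```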

An Element has attributes and content, so there are methods for element data too.

  1. attr(String key) to get and attr(String key, String value) to set attributes
  2. id(), className() and classNames()
  3. text() to get and text(String value) to set the text content
  4. html() to get and html(String value) to set the inner HTML content
  5. tag() and tagName()
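
As a rough illustration of these accessors, the sketch below reads and writes data on a single paragraph element. The markup and the names ElementDataSketch and intro are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataSketch {

	// Hypothetical helper: returns the <p id="intro"> element of a made-up snippet
	static Element intro() {
		Document doc = Jsoup.parse("<p id=\"intro\" class=\"lead big\">Hello <b>jsoup</b></p>");
		return doc.getElementById("intro");
	}

	public static void main(String[] args) {
		Element p = intro();

		System.out.println(p.id());          // intro
		System.out.println(p.className());   // lead big
		System.out.println(p.classNames());  // the same classes as a Set<String>
		System.out.println(p.tagName());     // p
		System.out.println(p.text());        // Hello jsoup
		System.out.println(p.html());        // inner HTML, including the <b> element

		// attr(key, value) sets an attribute, attr(key) reads it back
		p.attr("title", "greeting");
		System.out.println(p.attr("title")); // greeting
	}
}
```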

There are some methods for manipulating HTML data as well.

  1. append(String html), prepend(String html)
  2. appendText(String text), prependText(String text)
  3. appendElement(String tagName), prependElement(String tagName)
  4. html(String value)
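
To see how these manipulation methods differ, here is a short sketch that fills an empty list. The markup and the names ManipulationSketch and buildList are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulationSketch {

	// Hypothetical helper: starts from an empty <ul> and fills it with three items
	static Element buildList() {
		Document doc = Jsoup.parseBodyFragment("<ul id=\"items\"></ul>");
		Element list = doc.getElementById("items");

		list.append("<li>second</li>");             // parsed as HTML, added as the last child
		list.prepend("<li>first</li>");             // parsed as HTML, added as the first child
		list.appendElement("li").text("third");     // creates an empty <li>, then sets its text
		list.select("li").first().appendText("!");  // added as a text node, not parsed as HTML
		return list;
	}

	public static void main(String[] args) {
		for (Element item : buildList().select("li")) {
			System.out.println(item.text());
		}
		// prints first!, second and third on separate lines
	}
}
```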

Below is a simple example where I use jsoup DOM methods to parse my website home page and list all the links.

package com.journaldev.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExtractLinks {

	public static void main(String[] args) throws IOException {

		Document doc = Jsoup.connect("https://www.journaldev.com").get();
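		Element content = doc.getElementById("content");
		// getElementById returns null when the page has no element with that id
		if (content == null) {
			System.out.println("No element with id=\"content\" found");
			return;
		}
		Elements links = content.getElementsByTag("a");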
		for (Element link : links) {
		  String linkHref = link.attr("href");
		  String linkText = link.text();
		  System.out.println("Text::"+linkText+", URL::"+linkHref);
		}
	}

}

The above program produces the following output.

Text::jQuery Popup and Tooltip Window Animation Effects, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::Jobin Bennett, URL::https://www.journaldev.com/author/jobin
Text::March 7, 2015, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::jQuery, URL::https://www.journaldev.com/dev/jquery
Text::jQuery Plugins, URL::https://www.journaldev.com/dev/jquery/jquery-plugins
Text::Permalink, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::Apache HttpClient Example to send GET/POST HTTP Requests, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 6, 2015, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Java, URL::https://www.journaldev.com/dev/java
Text::Permalink, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Java HttpURLConnection Example to send HTTP GET/POST Requests, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 6, 2015, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::Java, URL::https://www.journaldev.com/dev/java
Text::Permalink, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::How to integrate Google reCAPTCHA in Java Web Application, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 4, 2015, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::Java EE, URL::https://www.journaldev.com/dev/java/j2ee
Text::Permalink, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::JSF Spring Hibernate Integration Example Tutorial, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 3, 2015, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::Hibernate, URL::https://www.journaldev.com/dev/hibernate
Text::JSF, URL::https://www.journaldev.com/dev/jsf
Text::Spring, URL::https://www.journaldev.com/dev/spring
Text::Permalink, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::JSF Spring Integration Example Tutorial, URL::https://www.journaldev.com/7112/spring-jsf-integration
Text::Oracle Webcenter Portal Framework Application – Modifying Home Page And Login/Logout Target Pages & Deploying Your Application Into Custom Portal Managed Server Instance, URL::https://www.journaldev.com/6938/oracle-webcenter-portal-framework-application-modifying-home-page-and-loginlogout-target-pages-deploying-your-application-into-custom-portal-managed-server-instance
Text::JSF and JDBC Integration Example Tutorial, URL::https://www.journaldev.com/7068/jsf-database-example-mysql-jdbc
Text::Count the Number of Triangles in Given Picture – Programmatic Solution, URL::https://www.journaldev.com/7064/count-the-number-of-triangles-in-given-picture-programmatic-solution
Text::JSF Expression Language (EL) Example Tutorial, URL::https://www.journaldev.com/7058/jsf-expression-language-jsf-el
Text::Read all Articles →, URL::https://www.journaldev.com/page/2
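
The links in this output happen to be absolute already, but many pages use relative href values. jsoup can resolve them against the document's base URI with absUrl. Here is a minimal offline sketch, assuming a made-up snippet and base URI; when a page is fetched with Jsoup.connect(...).get(), the base URI is taken from the fetched URL.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinksSketch {

	// Hypothetical helper: returns every href in the markup resolved against baseUri
	static List<String> absoluteLinks(String html, String baseUri) {
		Document doc = Jsoup.parse(html, baseUri);
		List<String> urls = new ArrayList<>();
		for (Element link : doc.select("a[href]")) {
			urls.add(link.absUrl("href")); // same result as link.attr("abs:href")
		}
		return urls;
	}

	public static void main(String[] args) {
		String html = "<a href=\"/dev/java\">Java</a><a href=\"https://jsoup.org/\">jsoup</a>";
		for (String url : absoluteLinks(html, "https://www.journaldev.com/")) {
			System.out.println(url);
		}
		// prints https://www.journaldev.com/dev/java and https://jsoup.org/
	}
}
```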

Jsoup selector syntax

We can also use CSS or jQuery-like syntax to find and manipulate HTML elements. Both Document and Element provide a select(String cssQuery) method that we can use for this.

Some quick examples are:

  1. doc.select("a"): returns all "a" tag elements from the HTML.
  2. doc.select("c|if"): finds <c:if> elements.
  3. doc.select("#id1"): returns all elements with id="id1".
  4. doc.select(".cl1"): returns all elements with class="cl1".
  5. doc.select("[href]"): returns all elements with an href attribute.

We can combine selectors too; you can find more details in the Selector API documentation.
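
A few combined selectors, tried against a made-up snippet, give a feel for how the pieces compose. The class name, the helper count and the markup are assumptions for this sketch.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorComboSketch {

	// Made-up markup used only for this illustration
	static final String HTML = "<div class=\"cl1\">"
			+ "<a href=\"https://a.example/\">A</a>"
			+ "<a name=\"x\">no href</a>"
			+ "</div>"
			+ "<div class=\"cl2\"><a href=\"/local\">B</a></div>";

	// Hypothetical helper: number of elements matched by the CSS query
	static int count(String cssQuery) {
		Document doc = Jsoup.parse(HTML);
		return doc.select(cssQuery).size();
	}

	public static void main(String[] args) {
		System.out.println(count("div.cl1 a[href]"));   // 1: links with href inside div.cl1
		System.out.println(count("a[href^=https]"));    // 1: href starts with "https"
		System.out.println(count("div.cl1, div.cl2"));  // 2: either selector matches
		System.out.println(count("div > a"));           // 3: anchors that are direct children of a div
	}
}
```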

Let’s now look at an example where I fetch my Google+ author URL from my website using both the DOM and the Selector API.

package com.journaldev.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupFindAuthor {

	public static void main(String[] args) throws IOException {
		//journaldev.com posts have author set as below
		//<div class="g-person" data-width="350" data-href="//plus.google.com/u/0/118104420597648001532"
		//data-layout="landscape" data-rel="author"></div>
		
		findAuthorUsingDOM();

		findAuthorUsingSelector();
	}

	private static void findAuthorUsingSelector() throws IOException {
		Document doc = Jsoup.connect("https://www.journaldev.com").get();
		Elements authors = doc.select("div.g-person"); //selector combination
		for(Element author : authors){
			System.out.println("Selector:: Author Google+ URL::"+author.attr("data-href"));
		}
	}

	private static void findAuthorUsingDOM() throws IOException {

		Document doc = Jsoup.connect("https://www.journaldev.com").get();
		Elements authors = doc.getElementsByClass("g-person");
		for(Element author : authors){
			System.out.println("DOM:: Author Google+ URL::"+author.attr("data-href"));
		}
	}

}

The above program prints the following output.

DOM:: Author Google+ URL:://plus.google.com/u/0/118104420597648001532
Selector:: Author Google+ URL:://plus.google.com/u/0/118104420597648001532

jsoup example to modify HTML

Let’s now look at a jsoup example where I parse input HTML and manipulate it.

package com.journaldev.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupModifyHTML {

	public static final String SOURCE_HTML = "<html><head><title>Jsoup Example</title></head>"
			+ "<body><h1>Welcome to JournalDev!!</h1><br />"
			+ "<div id=\"id1\">Hello</div>"
			+ "<div class=\"class1\">Pankaj</div>"
			+ "<a href=\"https://journaldev.com\">Home</a>"
			+ "<a href=\"https://wikipedia.org\">Wikipedia</a>"
			+ "</body></html>";

	public static void main(String[] args) {
		Document doc = Jsoup.parse(SOURCE_HTML);
		System.out.println("Title="+doc.title());
		
		//let's add attribute target="_blank" to all the links
		doc.select("a[href]").attr("target", "_blank");
		//System.out.println(doc.html());
		
		//change div class="class1" to class2
		doc.select("div.class1").attr("class", "class2");
		//System.out.println(doc.html());
		
		//change the HTML value of first h1 element
		doc.select("h1").first().html("Welcome to JournalDev.com");
		doc.select("h1").first().append("!!");
		//System.out.println(doc.html());
		
		//let's make Home link bold
		doc.select("a[href]").first().html("<strong>Home</strong>");
		System.out.println(doc.html());
		
	}

}

Please look at the above program carefully to understand what modifications are made to the input HTML string. Also compare it with the final document shown in the output below.

Title=Jsoup Example
<html>
 <head>
  <title>Jsoup Example</title>
 </head>
 <body>
  <h1>Welcome to JournalDev.com!!</h1>
  <br>
  <div id="id1">
   Hello
  </div>
  <div class="class2">
   Pankaj
  </div>
  <a href="https://journaldev.com" target="_blank"><strong>Home</strong></a>
  <a href="https://wikipedia.org" target="_blank">Wikipedia</a>
 </body>
</html>
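
Setting the class attribute directly replaces every class on the element. When an element carries several classes and only one should change, removeClass and addClass are a gentler alternative. A small sketch, with invented markup and names:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ClassSwapSketch {

	// Hypothetical helper: swaps class1 for class2 and leaves other classes untouched
	static Element swapClass() {
		Document doc = Jsoup.parseBodyFragment("<div class=\"class1 highlight\">Pankaj</div>");
		doc.select("div.class1").removeClass("class1").addClass("class2");
		return doc.select("div").first();
	}

	public static void main(String[] args) {
		Element div = swapClass();
		System.out.println(div.hasClass("class1"));    // false
		System.out.println(div.hasClass("class2"));    // true
		System.out.println(div.hasClass("highlight")); // true
	}
}
```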

jsoup example to parse Google Search Page and Find out Results

Before I conclude this post, here is an example where I parse the first page of Google search results and fetch all the links.

package com.journaldev.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParsingGoogleSearch {

	public static void main(String[] args) throws IOException {
		Document doc = Jsoup.connect("https://www.google.com/search?q=java").userAgent("Mozilla/5.0").get();
		//System.out.println(doc.html());
		Elements resultsH3 = doc.select("h3.r > a");

		for (Element result : resultsH3) {
			String linkHref = result.attr("href");
			String linkText = result.text();
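			// hrefs have the form /url?q=<target>&...; offset 6 keeps the '=' that precedes the target (7 would drop it)
			System.out.println("Text::" + linkText + ", URL::" + linkHref.substring(6, linkHref.indexOf("&")));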
		}
	}

}

It prints the following output.

Text::Download Free Java Software, URL::=https://java.com/download
Text::java.com: Java + You, URL::=https://www.java.com/
Text::Oracle Technology Network for Java Developers | Oracle ..., URL::=https://www.oracle.com/technetwork/java/
Text::Java (software platform) - Wikipedia, the free encyclopedia, URL::=https://en.wikipedia.org/wiki/Java_(software_platform)
Text::Java (programming language) - Wikipedia, the free encyclopedia, URL::=https://en.wikipedia.org/wiki/Java_(programming_language)
Text::Java Tutorial - TutorialsPoint, URL::=https://www.tutorialspoint.com/java/
Text::Welcome to JavaWorld.com, URL::=https://www.javaworld.com/
Text::Java.net: Welcome, URL::=https://www.java.net/
Text::News for java, URL::h?q=java
Text::Javalobby | The heart of the Java developer community, URL::=https://java.dzone.com/

Note that at the time of writing, Google search results are inside h3 tags with class “r”, and the link itself is the “a” element. If this changes in the future, for example if the h3 class name changes, the program won’t work properly and we will have to adjust the selector by looking at the source HTML structure.
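
The leading "=" in the URLs printed above, and the odd "h?q=java" entry, come from the fixed substring offset. The sketch below shows one way to pull out the target more defensively, assuming result links have the form /url?q=<target>&... as the output suggests. It uses only the standard library; the class and method names are made up, and percent-decoding of the value is left out.

```java
public class GoogleHrefSketch {

	// Hypothetical helper: returns the value of the q parameter from an href such as
	// "/url?q=<target>&sa=U", or the href unchanged when it does not have that shape
	static String targetUrl(String href) {
		String prefix = "/url?q=";
		if (!href.startsWith(prefix)) {
			return href;
		}
		int end = href.indexOf('&', prefix.length());
		// No '&' means the target runs to the end of the string
		return end == -1 ? href.substring(prefix.length()) : href.substring(prefix.length(), end);
	}

	public static void main(String[] args) {
		System.out.println(targetUrl("/url?q=https://java.com/download&sa=U")); // https://java.com/download
		System.out.println(targetUrl("/url?q=https://www.java.com/"));          // https://www.java.com/
		System.out.println(targetUrl("/search?q=java&tbm=nws"));                // /search?q=java&tbm=nws
	}
}
```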

That’s all for the jsoup example tutorial. I hope it helps you parse HTML data easily when required.

Reference: Official Website

Translated from: https://www.journaldev.com/7144/jsoup-java-html-parser
