HTML Parsing Tools

浅灰色、邂逅

Categories: 随手记, Jsoup. Tags: java

Article link: https://blog.csdn.net/z_hongchang/article/details/108801597

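Jsoup: An HTML Parsing Tool for Java Web Crawlers

Introduction to Jsoup

Tools for parsing HTML documents in a Java crawler include htmlparser and Jsoup. This article walks through Jsoup in detail, so that you can get HTML parsing for a Java crawler working in about ten minutes.

Jsoup can parse a URL or a string of HTML directly, and it offers a rich API for working with the DOM tree. If you have used jQuery, it will feel very familiar.

Jsoup's strongest feature is arguably its CSS selector support, for example: document.select("div.content > div#image > ul > li:eq(2)").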

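Adding the Dependency

Maven

Add the dependency declaration below. The latest version at the time of writing was 1.12.1, while the snippet pins 1.11.3; substitute the version you need.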

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.11.3</version>
</dependency>

Gradle

// jsoup HTML parser library @ https://jsoup.org/
// Older build scripts use `compile` here; that configuration was removed in Gradle 7
implementation 'org.jsoup:jsoup:1.11.3'

Installing from Source

You can also download the jar directly from https://jsoup.org/download, or build it from source:

# Get the code with git
git clone https://github.com/jhy/jsoup.git
cd jsoup
mvn install

# Or download the source archive instead
curl -Lo jsoup.zip https://github.com/jhy/jsoup/archive/master.zip
unzip jsoup.zip
cd jsoup-master
mvn install

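Parsing with Jsoup

Jsoup can build a Document from four kinds of input:

Parse a string
Parse a body fragment
Parse from a URL
Parse from a file

Parsing a String

Jsoup.parse treats the string as a complete HTML document. The string does not have to contain head and body elements; if they are missing, Jsoup creates them.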

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
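
To see how tolerant Jsoup.parse is, here is a minimal sketch that parses a string with no html, head, or body tags and an unclosed p element. The class name ParseStringDemo and the helper parseLoose are made up for this illustration; only Jsoup.parse, title(), select() and body() come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class ParseStringDemo {
    // Hypothetical helper: parses any HTML string as a full document.
    // Jsoup supplies the html, head and body elements if they are missing.
    static Document parseLoose(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        // No html, head, or body tags, and the p element is never closed.
        Document doc = parseLoose("<title>First parse</title><p>Parsed HTML into a doc.");
        System.out.println(doc.title());                               // title text
        System.out.println(doc.select("html > head > title").size());  // title sits in a generated head
        System.out.println(doc.body().select("p").text());             // paragraph text from the generated body
    }
}
```

The title ends up inside a generated head and the paragraph inside a generated body, so the rest of your code can rely on doc.head() and doc.body() regardless of how incomplete the input was.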

Parsing an HTML Fragment

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
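
The fragment above leaves its div unclosed. As a rough sketch of what parseBodyFragment does with it, the example below inspects the resulting body; FragmentDemo and parseFragment are illustrative names, not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class FragmentDemo {
    // Hypothetical helper: parses a body fragment and returns the body
    // element of the document shell that Jsoup builds around it.
    static Element parseFragment(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.body();
    }

    public static void main(String[] args) {
        Element body = parseFragment("<div><p>Lorem ipsum.</p>"); // note the unclosed div
        System.out.println(body.children().size());       // number of top-level elements in body
        System.out.println(body.select("div > p").text()); // text of the nested paragraph
        System.out.println(body.html());                   // normalized markup with the div closed
    }
}
```

This is handy for user-supplied snippets: body.html() gives you back balanced markup even when the input was not.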

Parsing from a URL

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

You can also send cookies and other request parameters:

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

Parsing from a File

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
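
The last argument in the call above is the base URI, which Jsoup uses to resolve relative links. The sketch below shows the effect on an in-memory string instead of a file so that it runs without any input.html; BaseUriDemo and firstAbsoluteLink are hypothetical names, and the /about link is made-up sample data.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class BaseUriDemo {
    // Hypothetical helper: resolves the first link's href against the
    // base URI supplied at parse time. Returns "" if there is no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a>";
        // attr("href") would return the raw value "/about";
        // absUrl("href") resolves it against the base URI.
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
    }
}
```

With the base URI http://example.com/, the relative link /about resolves to http://example.com/about. Without a base URI, absUrl returns an empty string for relative links.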

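Traversing the DOM Tree with Jsoup

Using Standard DOM Methods

Jsoup wraps and implements the element traversal methods commonly used with the DOM:

Find an element by id: getElementById(String id)
Find elements by tag: getElementsByTag(String tag)
Find elements by class: getElementsByClass(String className)
Find elements by attribute: getElementsByAttribute(String key)
Sibling traversal: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Moving between levels: parent(), children(), child(int index)

For example, to parse the 博客园 (cnblogs) home page: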

// Parse the page with Jsoup (html is assumed to hold the page source fetched earlier)
Document document = Jsoup.parse(html);
// As in JavaScript, look up the title by tag name
System.out.println(document.getElementsByTag("title").first());
// As in JavaScript, look up the post list element by id
Element postList = document.getElementById("post_list");
// As in JavaScript, get every post under the list by class
Elements postItems = postList.getElementsByClass("post_item");
// Process each post
for (Element postItem : postItems) {
    // As with a jQuery selector, get the post title element
    Elements titleEle = postItem.select(".post_item_body a[class='titlelnk']");
    System.out.println("Title: " + titleEle.text());
    System.out.println("URL: " + titleEle.attr("href"));
    // As with a jQuery selector, get the post author element
    Elements footEle = postItem.select(".post_item_foot a[class='lightblue']");
    System.out.println("Author: " + footEle.text());
    System.out.println("Author page: " + footEle.attr("href"));
    System.out.println("*********************************");
}
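
The sibling and parent methods from the list above are not used in that example, so here is a small sketch of them on a made-up three-item menu. TraversalDemo, findCurrent and the sample markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class TraversalDemo {
    // Made-up sample markup: a list with one item marked as current.
    static final String HTML =
        "<ul id=\"menu\"><li>One</li><li class=\"current\">Two</li><li>Three</li></ul>";

    // Hypothetical helper: returns the first element with class "current",
    // or null if the markup has none.
    static Element findCurrent(String html) {
        Document doc = Jsoup.parse(html);
        return doc.getElementsByClass("current").first();
    }

    public static void main(String[] args) {
        Element current = findCurrent(HTML);
        System.out.println(current.previousElementSibling().text()); // One
        System.out.println(current.nextElementSibling().text());     // Three
        System.out.println(current.firstElementSibling().text());    // One
        System.out.println(current.lastElementSibling().text());     // Three
        System.out.println(current.siblingElements().size());        // 2
        System.out.println(current.parent().id());                   // menu
        System.out.println(current.parent().child(0).text());        // One
    }
}
```

Note that previousElementSibling() and nextElementSibling() return null at the ends of the list, so real code should check before chaining.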

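These methods return Element or Elements objects, which expose the following accessors:

attr(String key): gets the value of an attribute
attributes(): gets all attributes of the node
id(): gets the node's id
className(): gets the value of the class attribute as a single string
classNames(): gets the set of all class names on the node
text(): gets the combined text of the node and its descendants
html(): gets the node's inner HTML
outerHtml(): gets the node's outer HTML
data(): gets the node's data content, for script or style tags and the like
tag(): gets the Tag object
tagName(): gets the node's tag name

With these APIs, working with the DOM is as convenient as it is in jQuery.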
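
As a quick illustration of how these accessors differ, the sketch below collects several of them for one link and one script element. AccessorDemo, describeFirst and the sample markup are hypothetical; the accessor methods themselves are the ones listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class AccessorDemo {
    // Made-up sample markup: one link with two classes, plus a script.
    static final String HTML =
        "<a id=\"home\" class=\"nav active\" href=\"/index.html\">Home <b>page</b></a>"
        + "<script>var x = 1;</script>";

    // Hypothetical helper: gathers accessor results for the first element
    // matching cssQuery. Returns an empty map if nothing matches.
    static Map<String, String> describeFirst(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element el = doc.select(cssQuery).first();
        Map<String, String> out = new LinkedHashMap<>();
        if (el == null) {
            return out;
        }
        out.put("tagName", el.tagName());
        out.put("id", el.id());
        out.put("className", el.className());
        out.put("classCount", String.valueOf(el.classNames().size()));
        out.put("href", el.attr("href"));
        out.put("text", el.text());   // text nodes only, markup stripped
        out.put("html", el.html());   // inner HTML, markup kept
        out.put("data", el.data());   // script/style content
        return out;
    }

    public static void main(String[] args) {
        describeFirst(HTML, "a").forEach((k, v) -> System.out.println(k + " = " + v));
        System.out.println("---");
        describeFirst(HTML, "script").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The point to notice is the split between text() and data(): for the link, text() yields "Home page", while for the script element text() is empty and the code is only available through data().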

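Powerful CSS Selector Support

You might point out that htmlparser supports XPath, which makes it easy to locate an element without walking the DOM tree level by level. Jsoup offers the same convenience through CSS selectors, which are used as follows:

document.select(String selector): selects the elements matching the selector and returns an Elements object
document.selectFirst(String selector): selects the first element matching the selector and returns a single Element
element.select(String selector): the same selection can be run directly on an Element

Jsoup supports CSS selector syntax thoroughly, which is a real boon for developers with front-end experience, since there is no need to learn XPath syntax. Take this XPath, for example:

//*[@id="docs"]/div[1]/h4/a can be written as the equivalent CSS selector: document.select("#docs > div:nth-of-type(1) > h4 > a").attr("href");. Note that XPath positions are 1-based, while Jsoup's :eq(n) is a 0-based index counted across all sibling elements, so div:eq(1) is not equivalent to div[1].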
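
The indexing difference is easy to get wrong, so here is a minimal sketch that runs three candidate selectors against made-up markup with two div children under #docs. SelectorIndexDemo, hrefOf and the sample links are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class SelectorIndexDemo {
    // Made-up sample markup: #docs contains exactly two div children.
    static final String HTML =
        "<div id=\"docs\">"
        + "<div><h4><a href=\"/first\">First</a></h4></div>"
        + "<div><h4><a href=\"/second\">Second</a></h4></div>"
        + "</div>";

    // Hypothetical helper: href of the first link matched by cssQuery,
    // or "" if nothing matches.
    static String hrefOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).attr("href");
    }

    public static void main(String[] args) {
        // Same target as the XPath //*[@id="docs"]/div[1]/h4/a
        System.out.println(hrefOf(HTML, "#docs > div:nth-of-type(1) > h4 > a")); // /first
        // :eq(1) is the element at sibling index 1, i.e. the second child here
        System.out.println(hrefOf(HTML, "#docs > div:eq(1) > h4 > a"));          // /second
        // :eq(0) matches the first div only because no other element precedes it
        System.out.println(hrefOf(HTML, "#docs > div:eq(0) > h4 > a"));          // /first
    }
}
```

If #docs started with some other element, such as a heading, div:eq(0) would match nothing, whereas div:nth-of-type(1) would still find the first div.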

Consider the following example:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://baidu.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png

Element masthead = doc.select("div.masthead").first(); // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

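Some commonly used selectors:

Tag (e.g. div): tag
Id (e.g. #logo): #id
Class (e.g. .head): .class
Attribute present (e.g. [href]): [attribute]
Attribute value: [attr=value]
Attribute name prefix: [^attrPrefix]
Attribute value matching (prefix, suffix, substring, regex): [attr^=value], [attr$=value], [attr*=value], [attr~=regex]

Selectors can also be combined:

element#id: (div#logo: the div element whose id is logo)
element.class: (div.content: div elements whose class includes content)
element[attr]: (a[href]: a elements that have an href attribute)
ancestor child: (div p: all p elements descended from a div)
parent > child: (p > span: span elements that are direct children of a p)
siblingA + siblingB: (div.head + div: the div immediately following div.head)
siblingA ~ siblingX: (h1 ~ p: all p siblings that come after an h1)
el, el, el: (div.content, div.footer: both div.content and div.footer)

Pseudo-selectors are supported as well (indexes are 0-based sibling indexes):

:lt(n): (div#logo > li:lt(2): the first two li children of div#logo, i.e. those at index 0 and 1)
:gt(n): elements whose sibling index is greater than n
:eq(n): elements whose sibling index equals n
:has(selector): elements that contain at least one element matching the selector
:not(selector): elements that do not match the selector
:contains(text): elements that contain the given text (case-insensitive)

For details, see the official selector syntax reference: https://jsoup.org/cookbook/extracting-data/selector-syntax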
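
The combinators and pseudo-selectors above can be tried out on a small in-memory document. The following is a sketch under made-up sample markup; PseudoSelectorDemo and the texts helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

final class PseudoSelectorDemo {
    // Made-up sample markup: a heading, two paragraphs and a four-item list.
    static final String HTML =
        "<h1>Title</h1><p>intro</p>"
        + "<ul id=\"list\"><li>zero</li><li class=\"x\">one</li>"
        + "<li>two <b>bold</b></li><li>three</li></ul>"
        + "<p>outro</p>";

    // Hypothetical helper: text of every element matched by cssQuery, in document order.
    static List<String> texts(String html, String cssQuery) {
        List<String> out = new ArrayList<>();
        for (Element el : Jsoup.parse(html).select(cssQuery)) {
            out.add(el.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(texts(HTML, "#list > li:lt(2)"));  // [zero, one]
        System.out.println(texts(HTML, "#list > li:gt(1)"));  // [two bold, three]
        System.out.println(texts(HTML, "#list > li:eq(0)"));  // [zero]
        System.out.println(texts(HTML, "li:has(b)"));         // [two bold]
        System.out.println(texts(HTML, "li:not(.x)"));        // [zero, two bold, three]
        System.out.println(texts(HTML, "li:contains(thr)"));  // [three]
        System.out.println(texts(HTML, "h1 + p"));            // [intro]
        System.out.println(texts(HTML, "h1 ~ p"));            // [intro, outro]
    }
}
```

Running one selector at a time against a known snippet like this is a cheap way to confirm a selector before pointing it at a live page.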

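Modifying the DOM Tree with Jsoup

Jsoup also supports modifying the DOM tree, again in a very jQuery-like way.

// Set an attribute
doc.select("div.comments a").attr("rel", "nofollow");

// Set an attribute and add a class
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");

The following methods manipulate the DOM tree directly:

text(String value): sets the element's text content
html(String value): replaces the element's inner HTML
append(String html): parses the HTML and adds it inside the element, after its existing children
prepend(String html): parses the HTML and adds it inside the element, before its existing children
appendText(String text), prependText(String text): add a text node as the last or first child
appendElement(String tagName), prependElement(String tagName): create a new element as the last or first child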
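
To make the effect of these methods concrete, the sketch below applies several of them to a small fragment and returns the modified element. ModifyDemo, decorate and the div.box markup are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ModifyDemo {
    // Hypothetical helper: edits the first div.box in the fragment and returns it.
    static Element decorate(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        Element box = doc.select("div.box").first();
        if (box == null) {
            throw new IllegalArgumentException("fragment has no div.box");
        }
        box.select("p").text("new <text>");       // set as text, so the angle brackets are escaped
        box.prepend("<h2>Title</h2>");            // becomes the first child
        box.append("<span>tail</span>");          // becomes the last child
        box.appendElement("em").text("added");    // new element after the span
        box.attr("title", "jsoup").addClass("round-box");
        return box;
    }

    public static void main(String[] args) {
        Element box = decorate("<div class=\"box\"><p>old</p></div>");
        System.out.println(box.outerHtml());
    }
}
```

After the call, the box contains h2, p, span and em in that order. Because text(String) escapes its argument, it is the safer choice over html(String) whenever the value comes from user input.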

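One More Jsoup Feature Worth Mentioning

You have probably run into this: when someone types HTML into a text box on your page, the saved content is quite likely to wreck the page layout when it is displayed again, and as the onclick attribute below shows, it can carry script as well. Jsoup.clean filters such input against a whitelist of allowed tags and attributes.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // named Safelist in newer jsoup releases

public class CleanDemo {
    public static void main(String[] args) {
        // The href is a placeholder URL
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>博客园</a></p>";
        System.out.println("unsafe: " + unsafe);
        String safe = Jsoup.clean(unsafe, Whitelist.basic());
        System.out.println("safe: " + safe);
    }
}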