What Is Jsoup?

Program Debug

Published 2022-06-27 15:59:36

Article link: https://blog.csdn.net/weixin_46990454/article/details/125485849

Translated from the official site: https://jsoup.org/

If you need more detail, consult the official documentation wherever possible.

jsoup: Java HTML Parser

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

jsoup features:
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safelist, to prevent XSS attacks
output tidy HTML
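
The features listed above can be sketched in a few lines each. This is a minimal illustration, not code from the official site: the class name, method names, and sample HTML are made up for the example, and it assumes jsoup 1.14.1 or later, where the safelist class is named `org.jsoup.safety.Safelist`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class for illustration only.
public class JsoupFeaturesSketch {

    // Parse an HTML string and return the text of the first element
    // matching a CSS selector, or "" when nothing matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.selectFirst(cssQuery); // null when there is no match
        return match == null ? "" : match.text();
    }

    // Manipulate attributes: set rel="nofollow" on every link that has an href,
    // then return the resulting body HTML.
    static String addNofollow(String htmlFragment) {
        Document doc = Jsoup.parseBodyFragment(htmlFragment);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    // Clean untrusted markup against jsoup's built-in "basic" safelist,
    // which permits simple text formatting and links but not script elements.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "<div id=\"post\"><h1>Hello</h1>"
                + "<p>See <a href=\"https://jsoup.org/\">jsoup</a></p></div>";
        System.out.println(firstText(html, "#post h1"));
        System.out.println(addNofollow("<a href=\"https://jsoup.org/\">jsoup</a>"));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Which safelist to use depends on how much markup you want to allow; `Safelist.basic()` is only one of the presets jsoup ships.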

jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating markup to invalid tag soup; in every case jsoup will create a sensible parse tree.
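
To see what "a sensible parse tree" can look like in practice, the sketch below feeds jsoup a fragment with unclosed tags and inspects the result. The class name, method names, and input string are hypothetical, chosen only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class for illustration only.
public class TagSoupSketch {

    // Parse possibly malformed markup and return the body HTML
    // as jsoup serializes it after building the DOM.
    static String normalize(String messyHtml) {
        Document doc = Jsoup.parse(messyHtml);
        return doc.body().html();
    }

    // Count the <p> elements jsoup created from the input.
    static int countParagraphs(String messyHtml) {
        return Jsoup.parse(messyHtml).select("p").size();
    }

    public static void main(String[] args) {
        // Neither <p> is closed, and <b> is left open at the end of input.
        String messy = "<p>First<p>Second <b>bold";
        System.out.println(countParagraphs(messy));
        System.out.println(normalize(messy));
    }
}
```

The second `<p>` implicitly closes the first, as the HTML5 parsing rules require, and the dangling `<b>` is closed at the end of input, so the serialized output is well-formed even though the input was not.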

Example
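Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the page over HTTP; get() throws IOException if the request fails
Document doc = Jsoup.connect("https://en.wikipedia.org").get();
System.out.println(doc.text());
// Select links inside bold text within the element whose id is mp-itn
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
    String newsTitle = headline.attr("title");
    System.out.println("News headline - " + newsTitle);
}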
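
The example above needs network access and depends on Wikipedia's current page layout. As an offline variant, the sketch below applies the same selector to an inline HTML string and also resolves each link to an absolute URL. The class name, method name, and sample markup are invented for illustration and only mimic the structure the selector expects.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
public class HeadlineSketch {

    // Return "title -> absolute URL" for each link matched by "#mp-itn b a".
    // baseUri is used by absUrl() to resolve relative href values.
    static List<String> headlineLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("#mp-itn b a")) {
            out.add(link.attr("title") + " -> " + link.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the "In the news" box.
        String html = "<div id=\"mp-itn\"><ul><li><b>"
                + "<a href=\"/wiki/Example_story\" title=\"Example story\">Example</a>"
                + "</b> happens.</li></ul></div>";
        for (String line : headlineLinks(html, "https://en.wikipedia.org/")) {
            System.out.println(line);
        }
    }
}
```

Passing a base URI to `Jsoup.parse` is what lets `absUrl("href")` turn a relative link into a full URL; when a page is fetched with `Jsoup.connect(...).get()`, the base URI is taken from the fetched location.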

Open source
jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.