Adapted from the official documentation: https://jsoup.org/
Consult the official documentation whenever you need more detail.
jsoup: Java HTML Parser
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
jsoup features:
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safelist, to prevent XSS attacks
output tidy HTML
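The features above can be sketched in a few lines. This is a minimal illustration rather than an official example: the class name `FeatureSketch`, its helper method names, and the sample HTML are all made up for this note, and it assumes a jsoup version recent enough to ship `org.jsoup.safety.Safelist` (older releases used `Whitelist`).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

import java.util.List;

// Hypothetical helper class, written only to illustrate the feature list.
final class FeatureSketch {

    // Parse an HTML string and extract the text of every link with a CSS selector.
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[href]").eachText();
    }

    // Manipulate attributes: mark every link rel="nofollow", then output the body HTML.
    static String addNofollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    // Clean untrusted input against the built-in "basic" safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "<p><a href='/a'>First</a> and <a href='/b'>Second</a></p>";
        System.out.println(linkTexts(html));
        System.out.println(addNofollow(html));
        System.out.println(sanitize("<p onclick='steal()'>Hi<script>alert(1)</script></p>"));
    }
}
```

Parsing from a string keeps the sketch free of network access; `Jsoup.connect(url).get()` and `Jsoup.parse(File, String)` cover the URL and file cases mentioned in the list.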
jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
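To make the tag-soup claim concrete, here is a small sketch that feeds jsoup a fragment with unclosed `<p>` tags and no `<html>` or `<body>` wrapper. The class name `TagSoupSketch` and the sample input are invented for illustration; only `Jsoup.parse`, `select`, and `eachText` come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

// Hypothetical demo class: shows how malformed input still yields a usable tree.
final class TagSoupSketch {

    // Parse possibly invalid HTML and return the text of each paragraph found.
    static List<String> paragraphTexts(String messyHtml) {
        Document doc = Jsoup.parse(messyHtml);
        return doc.select("p").eachText();
    }

    public static void main(String[] args) {
        // Neither <p> is closed, and there is no <html>/<head>/<body>.
        String messy = "<p>Lorem <p>Ipsum";
        System.out.println(paragraphTexts(messy));
        // The parser supplies the missing document structure.
        System.out.println(Jsoup.parse(messy).body() != null);
    }
}
```

The second `<p>` implicitly closes the first, so the selector sees two sibling paragraphs rather than one nested inside the other.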
Example
Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:
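import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch and parse the page (get() throws IOException on network failure)
Document doc = Jsoup.connect("https://en.wikipedia.org").get();
System.out.println(doc.text());
// Select the links inside bold text in the "In the news" box
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
    String newsTitle = headline.attr("title");
    System.out.println("News headline - " + newsTitle);
}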
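The same selection can be tried without network access by parsing a string with a base URI. The sketch below is an assumption-laden stand-in: the class name `HeadlineSketch`, the method `headlineLinks`, and the sample markup are invented here and only imitate the shape of the `#mp-itn` box; the real page markup may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical offline variant of the headline example.
final class HeadlineSketch {

    // Return "title -> absolute URL" for each headline link in the given HTML.
    static List<String> headlineLinks(String html, String baseUri) {
        // The base URI lets absUrl() resolve relative hrefs.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element headline : doc.select("#mp-itn b a")) {
            result.add(headline.attr("title") + " -> " + headline.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup that only mimics the structure the selector expects.
        String html = "<div id='mp-itn'><b><a href='/wiki/Foo' title='Foo'>Foo</a></b></div>";
        System.out.println(headlineLinks(html, "https://en.wikipedia.org/"));
    }
}
```

Passing the base URI matters only for `absUrl`; `attr("href")` would return the relative path unchanged.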
Open source
jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.