1. Introduction
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
2. Usage
2.1 Adding the dependency (Gradle)
implementation 'org.jsoup:jsoup:1.13.1'
2.2 Parsing an HTML string
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
2.3 Parsing a body fragment
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
2.4 Loading a document from a URL
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
A POST request works the same way:
Document doc = Jsoup.connect("http://example.com")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000) // milliseconds
    .post();
2.5 Loading a document from a file
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
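The third argument to Jsoup.parse above is the base URI, which jsoup uses to resolve relative links in the document. A minimal sketch of that behavior, assuming an inline HTML string in place of the file (the markup and the class name BaseUriDemo are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriDemo {
    // Parses the HTML with the given base URI and returns the absolute URL
    // of the first link, or "" if there is no link or it cannot be resolved.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Illustrative markup with a relative link
        String html = "<p><a href=\"/about\">About</a></p>";
        System.out.println(resolveFirstLink(html, "http://example.com/"));
    }
}
```

With the base URI http://example.com/, the relative href /about resolves to http://example.com/about. Calling link.attr("href") instead would return the raw value /about.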
3. Extracting data
3.1 Navigating the document with DOM methods
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
3.2 Finding elements with selectors
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
// a elements that have an href attribute
Elements links = doc.select("a[href]");
// img elements whose src ends with .png
Elements pngs = doc.select("img[src$=.png]");
// first div with class "masthead"
Element masthead = doc.select("div.masthead").first();
// a elements that are direct children of an h3 with class "r"
Elements resultLinks = doc.select("h3.r > a");
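To see what each of these selectors matches without needing /tmp/input.html, here is a small self-contained sketch. The HTML string and the class name SelectorDemo are invented for illustration. The markup deliberately includes an anchor without href, a non-PNG image, and a link nested one level deeper inside the h3.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorDemo {
    // Illustrative markup, not taken from any real page.
    static final String HTML =
        "<div class=\"masthead\"><img src=\"logo.png\"><img src=\"photo.jpg\"></div>"
        + "<h3 class=\"r\"><a href=\"/first\">First</a>"
        + "<span><a href=\"/nested\">Nested</a></span></h3>"
        + "<a name=\"anchor\">No href</a>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Elements links = doc.select("a[href]");          // skips the anchor without href
        Elements pngs = doc.select("img[src$=.png]");    // skips photo.jpg
        Element masthead = doc.select("div.masthead").first();
        Elements resultLinks = doc.select("h3.r > a");   // direct children only

        System.out.println(links.size());                 // 2
        System.out.println(pngs.size());                  // 1
        System.out.println(masthead != null);             // true
        System.out.println(resultLinks.first().text());   // First
    }
}
```

The link inside the span is matched by a[href] but not by h3.r > a, because > selects direct children only. Writing h3.r a (with a space) would match both links.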
4. Example
// doc is a Document loaded as in section 2. TopicListBean, parseId, parseImg
// and parseTime are application code that is not shown here.
List<TopicListBean> mList = new ArrayList<>();
Elements itemElements = doc.select("div.cell.item"); // root element of each list item
for (Element item : itemElements) {
    Elements titleElements = item.select("div.cell.item table tr td span.item_title > a"); // title
    Elements imgElements = item.select("div.cell.item table tr td img.avatar"); // avatar
    Elements commentElements = item.select("div.cell.item table tr a.count_livid"); // comment count
    Elements nodeElements = item.select("div.cell.item table tr span.small.fade a.node"); // node
    Elements nameElements = item.select("div.cell.item table tr span.small.fade strong a"); // author and last replier
    Elements timeElements = item.select("div.cell.item table tr span.small.fade"); // update time
    TopicListBean bean = new TopicListBean();
    if (titleElements.size() > 0) {
        bean.setTitle(titleElements.get(0).text());
        bean.setTopicId(parseId(titleElements.get(0).attr("href")));
    }
    if (imgElements.size() > 0) {
        bean.setImgUrl(parseImg(imgElements.get(0).attr("src"))); // http:
    }
    if (nodeElements.size() > 0) {
        bean.setNode(nodeElements.get(0).text());
    }
    if (nameElements.size() > 0) {
        bean.setName(nameElements.get(0).text());
    }
    // The last replier, comment count, and update time may be missing
    if (nameElements.size() > 1) {
        bean.setLastUser(nameElements.get(1).text());
    }
    if (commentElements.size() > 0) {
        bean.setCommentNum(Integer.valueOf(commentElements.get(0).text()));
    }
    if (timeElements.size() > 1) {
        bean.setUpdateTime(parseTime(timeElements.get(1).text()));
    }
    mList.add(bean);
}
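The example calls parseId and parseImg without showing them. The sketch below is one hypothetical way they could look, not the original implementation. It assumes the title href has the shape /t/<digits> (optionally followed by a fragment such as #reply7), that the avatar src may be protocol-relative (starting with //), and that both helpers return String. parseTime is left out because the format of its input text is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopicParsers {
    // Assumed href shape: "/t/<digits>", optionally followed by "#reply<n>".
    private static final Pattern TOPIC_ID = Pattern.compile("/t/(\\d+)");

    // Returns the numeric topic id as a string, or "" if the href does not match.
    static String parseId(String href) {
        Matcher m = TOPIC_ID.matcher(href);
        return m.find() ? m.group(1) : "";
    }

    // Prefixes protocol-relative URLs ("//host/path") with "http:".
    // Any other value is returned unchanged.
    static String parseImg(String src) {
        return src.startsWith("//") ? "http:" + src : src;
    }

    public static void main(String[] args) {
        // Hypothetical inputs for illustration
        System.out.println(parseId("/t/123456#reply7"));              // 123456
        System.out.println(parseImg("//cdn.example.com/avatar.png")); // http://cdn.example.com/avatar.png
    }
}
```

Returning an empty string for an unrecognized href keeps the loop in the example from throwing on unexpected markup. Whether that or an exception is the better choice depends on how the caller uses the id.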