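The previous section started our Java crawler journey with a small demo: it used HttpClient to request a resource and did some preliminary processing of the response. But how is the fetched page actually parsed? This section looks at Jsoup.
Jsoup is an HTML parser for Java with a convenient API. It can parse HTML from a URL, a file, or a string, and then lets you pick out page elements and extract data using either DOM methods or selectors. For example: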
// Parse HTML that is already held in a string
String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
// Or fetch a page over HTTP and parse the response body
Document page = Jsoup.connect("http://www.zhihu.com").get();
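Either call gives you a parsed HTML Document, which you can then traverse with DOM methods or query with selectors to find elements. The rest of this section is a small demo that extracts every link on the Zhihu home page.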
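To make the two query styles concrete, here is a minimal sketch that reads the same paragraph once through DOM methods and once through a selector. The class name `DomVsSelector`, the extra `intro` class and link added to the markup, and the base URI passed to `Jsoup.parse` are assumptions made for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class DomVsSelector {
    // Illustrative markup: the earlier snippet plus a class attribute and one link.
    static final String HTML = "<html><head><title>First parse</title></head>"
            + "<body><p class=\"intro\">Parsed HTML into a doc.</p>"
            + "<p><a href=\"/explore\">Explore</a></p></body></html>";

    // Assumed base URI; it is only needed so relative links can be made absolute.
    static final String BASE_URI = "https://www.zhihu.com/";

    static Document parse() {
        return Jsoup.parse(HTML, BASE_URI);
    }

    // DOM style: walk from <body> to its <p> descendants and take the first one.
    static String paragraphViaDom() {
        Element firstParagraph = parse().body().getElementsByTag("p").first();
        return firstParagraph.text();
    }

    // Selector style: describe the target element with a CSS-like query.
    static String paragraphViaSelector() {
        return parse().select("p.intro").first().text();
    }

    // Selectors can also filter on attributes; "abs:href" resolves against BASE_URI.
    static String firstLinkTarget() {
        return parse().select("a[href]").first().attr("abs:href");
    }

    public static void main(String[] args) {
        System.out.println(paragraphViaDom());
        System.out.println(paragraphViaSelector());
        System.out.println(firstLinkTarget());
    }
}
```

Both styles return the same element here. Selectors tend to be shorter once the query involves classes or attributes, which is why the demo that follows relies on them.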
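Printing the Document shows the content parsed from the HTML. Inspecting it, the links on the Zhihu home page appear in three forms: <a href> hyperlinks, <link href> imports, and elements that carry a src attribute (scripts and images). The program below therefore extracts each form separately.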
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Extracts every link on a page such as the Zhihu home page
 * (including images and links to other pages).
 */
public class GetPageLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "Usage: supply url to fetch");
        String url = args[0];
        System.out.println("Fetching ..." + url);

        // Fetch the page and parse the HTML into a Document
        Document document = Jsoup.connect(url).get();
        // System.out.println(document);

        // <a> elements that have an href attribute
        Elements links = document.select("a[href]");
        // System.out.println(links);
        // Images, scripts and similar resources are usually embedded via src
        Elements medias = document.select("[src]");
        // System.out.println(medias);
        // <link> elements such as stylesheets and icons
        Elements imports = document.select("link[href]");
        // System.out.println(imports);

        print("Links: (%d)", links.size());
        for (Element link : links) {
            // The "abs:" prefix returns the attribute as an absolute URL
            print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }

        print("Medias: (%d)", medias.size());
        for (Element media : medias) {
            print(" * %s: <%s>", media.tagName(), media.attr("abs:src"));
        }

        print("Imports: (%d)", imports.size());
        for (Element imp : imports) {
            print(" * %s <%s> (%s)", imp.tagName(), imp.attr("abs:href"), imp.attr("rel"));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shortens str to at most width characters, marking the cut with a trailing "."
    private static String trim(String str, int width) {
        if (str.length() > width) {
            return str.substring(0, width - 1) + ".";
        }
        return str;
    }
}
Running it against the Zhihu home page produces the following output:
Fetching ...http://www.zhihu.com
Links: (21)
* a: <https://www.zhihu.com/#signup> (注册)
* a: <https://www.zhihu.com/#signin> (登录)
* a: <https://www.zhihu.com/#> (无法登录?)
* a: <https://www.zhihu.com/#> ()
* a: <https://www.zhihu.com/#> ()
* a: <https://www.zhihu.com/#> ()
* a: <https://www.zhihu.com/app/> (知乎 App)
* a: <https://www.zhihu.com/terms> (《知乎协议》)
* a: <https://zhuanlan.zhihu.com> (知乎专栏)
* a: <https://www.zhihu.com/roundtable> (知乎圆桌)
* a: <https://www.zhihu.com/explore> (发现)
* a: <https://www.zhihu.com/app> (移动应用)
* a: <https://www.zhihu.com/org/signin> (使用机构帐号登录)
* a: <https://www.zhihu.com/contact> (联系我们)
* a: <https://www.zhihu.com/careers> (来知乎工作)
* a: <http://www.miibeian.gov.cn/> (京 ICP 证 110745 号)
* a: <http://zhstatic.zhihu.com/assets/zhihu/publish-license.jpg> (出版物经营许可证)
* a: <https://zhuanlan.zhihu.com/p/28561671> (侵权投诉)
* a: <http://www.12377.cn> (网上有害信息举报专区)
* a: <https://www.zhihu.com/jubao> (儿童色情信息举报专区)
* a: <https://credit.szfw.org/CX20170607038331320388.html> ()
Medias: (7)
* script: <https://static.zhihu.com/static/revved/-/js/instant.14757a4a.js>
* img: <https://www.zhihu.com/static/img/spinner/grey-loading.gif>
* img: <https://static.zhihu.com/static/revved/img/index/chengxing_logo@2x.65dc76e8.png>
* script: <https://static.zhihu.com/static/revved/-/js/vendor.cb14a042.js>
* script: <https://static.zhihu.com/static/revved/-/js/closure/base.ba831c49.js>
* script: <https://static.zhihu.com/static/revved/-/js/closure/common.f088f26f.js>
* script: <https://static.zhihu.com/static/revved/-/js/closure/page-index.1f1461c3.js>
Imports: (14)
* link <https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png> (apple-touch-icon)
* link <https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png> (apple-touch-icon)
* link <https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png> (apple-touch-icon)
* link <https://static.zhihu.com/static/revved/img/ios/touch-icon-60.9911cffb.png> (apple-touch-icon)
* link <https://static.zhihu.com/static/favicon.ico> (shortcut icon)
* link <https://www.zhihu.com/p1.zhimg.com> (dns-prefetch)
* link <https://www.zhihu.com/p2.zhimg.com> (dns-prefetch)
* link <https://www.zhihu.com/p3.zhimg.com> (dns-prefetch)
* link <https://www.zhihu.com/p4.zhimg.com> (dns-prefetch)
* link <https://www.zhihu.com/comet.zhihu.com> (dns-prefetch)
* link <https://www.zhihu.com/static.zhihu.com> (dns-prefetch)
* link <https://www.zhihu.com/upload.zhihu.com> (dns-prefetch)
* link <https://static.zhihu.com/static/revved/-/css/pages/unlogin-index/main.4df360a5.css> (stylesheet)
* link <http://www.zhihu.com> (canonical)
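Every URL in the output is absolute because the program reads attributes with the `abs:` prefix, which asks Jsoup to resolve the value against the document's base URI. The sketch below illustrates that kind of resolution with the standard `java.net.URI` class. It is not Jsoup's own implementation, and the class name `AbsUrlSketch` and the sample attribute values are assumptions chosen to mirror the output above.

```java
import java.net.URI;

final class AbsUrlSketch {
    // Resolves an href/src value against the page's base URI.
    // URI.create throws IllegalArgumentException for strings that are not
    // valid URIs (for example, values containing spaces).
    static String resolve(String baseUri, String attributeValue) {
        return URI.create(baseUri).resolve(attributeValue).toString();
    }

    public static void main(String[] args) {
        String base = "https://www.zhihu.com/";
        String[] samples = {"/explore", "#signup", "p1.zhimg.com", "http://www.12377.cn"};
        for (String value : samples) {
            System.out.println(value + " -> " + resolve(base, value));
        }
    }
}
```

Note how a bare host name such as `p1.zhimg.com` is treated as a relative path and ends up under the page's own host. That would account for the dns-prefetch entries above, presumably because the page wrote those hrefs without a scheme.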
References:
1. Jsoup guide (Chinese): http://www.open-open.com/jsoup/