Java Crawler Series, Part 2: Web Page Parsing [Crawling the Zhihu Home Page]

行者小朱

Categories: Crawler, Web Crawlers

Article link: https://blog.csdn.net/u012050154/article/details/77499296

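The previous post started our Java crawling journey with a small demo: it used HttpClient to request a resource and did some preliminary processing of the response. But how do we parse the page that comes back? This post looks at how to use Jsoup.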

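Jsoup is an HTML parser for Java with a very convenient API. It can parse HTML from a URL, a file, or a string, and then lets you use DOM methods or selectors to pick out page elements and extract their data. For example: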

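// Parse HTML that is already held in a string
String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// Or, as a separate snippet, fetch a page over HTTP and parse it (get() throws IOException)
Document doc = Jsoup.connect("http://www.zhihu.com").get();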

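Either Jsoup.parse(...) or Jsoup.connect(...).get() gives you a parsed HTML document, which you can then traverse with DOM methods or query with selectors. As a small demonstration, the rest of this post extracts all the links on the Zhihu home page.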
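To make the difference between the two styles concrete, here is a minimal sketch that reads the same paragraph from the sample HTML string twice, once with DOM-style methods and once with a selector. The class name DomVsSelector and its helper methods are made up for illustration, and the sketch assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the original post.
public class DomVsSelector {
    // The same sample markup used earlier in this post.
    static final String HTML = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>";

    // Read the <title> text through the Document convenience accessor.
    static String titleOf(String html) {
        return Jsoup.parse(html).title();
    }

    // DOM style: start at <body> and look up descendants by tag name.
    static String paragraphViaDom(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.body().getElementsByTag("p").first();
        return p == null ? "" : p.text();
    }

    // Selector style: a CSS-like query for <p> elements directly under <body>.
    static String paragraphViaSelector(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("body > p").first();
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        System.out.println(titleOf(HTML));
        System.out.println(paragraphViaDom(HTML));
        System.out.println(paragraphViaSelector(HTML));
    }
}
```

Both lookups return the same text here. Selectors tend to be shorter once a query involves attributes or nesting, which is why the link extraction in this post relies on select().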

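Printing the Document shows the content produced by parsing the HTML. Looking through it, the links on the Zhihu home page come in three forms: <a href> hyperlinks, <link href> elements, and elements that carry a src attribute, such as scripts and images. The code below extracts each of the three forms in turn.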

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Extracts all links on the Zhihu home page (including links to images and to other pages).
 */

public class GetPageLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "Usage: supply url to fetch");
        String url = args[0];
        System.out.println("Fetching ..." + url);

        Document document = Jsoup.connect(url).get();
//        System.out.println(document); // the document produced by parsing the HTML

        Elements links = document.select("a[href]");
//        System.out.println(links);  // the selected a[href] elements

        Elements medias = document.select("[src]");
//        System.out.println(medias);  // images, scripts and the like are usually embedded via src

        Elements imports = document.select("link[href]");
//        System.out.println(imports);

        print("Links: (%d)", links.size());
        for(Element link : links){
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }

        print("Medias: (%d)", medias.size());
        for(Element media : medias){
             print(" * %s: <%s>", media.tagName(), media.attr("abs:src"));
        }

        print("Imports: (%d)", imports.size());
        for(Element imp : imports){
            print(" * %s <%s> (%s)", imp.tagName(),imp.attr("abs:href"), imp.attr("rel"));
        }
    }

    private static void print(String msg, Object... args){
        System.out.println(String.format(msg, args));
    }

    // Shorten str to at most width characters, ending with "." when it was cut.
    private static String trim(String str, int width){
        if(str.length() > width)
            return str.substring(0, width-1) + ".";
        else return str;
    }
}

The output is as follows:

Fetching ...http://www.zhihu.com
Links: (21)
 * a: <https://www.zhihu.com/#signup>  (注册)
 * a: <https://www.zhihu.com/#signin>  (登录)
 * a: <https://www.zhihu.com/#>  (无法登录？)
 * a: <https://www.zhihu.com/#>  ()
 * a: <https://www.zhihu.com/#>  ()
 * a: <https://www.zhihu.com/#>  ()
 * a: <https://www.zhihu.com/app/>  (知乎 App)
 * a: <https://www.zhihu.com/terms>  (《知乎协议》)
 * a: <https://zhuanlan.zhihu.com>  (知乎专栏)
 * a: <https://www.zhihu.com/roundtable>  (知乎圆桌)
 * a: <https://www.zhihu.com/explore>  (发现)
 * a: <https://www.zhihu.com/app>  (移动应用)
 * a: <https://www.zhihu.com/org/signin>  (使用机构帐号登录)
 * a: <https://www.zhihu.com/contact>  (联系我们)
 * a: <https://www.zhihu.com/careers>  (来知乎工作)
 * a: <http://www.miibeian.gov.cn/>  (京 ICP 证 110745 号)
 * a: <http://zhstatic.zhihu.com/assets/zhihu/publish-license.jpg>  (出版物经营许可证)
 * a: <https://zhuanlan.zhihu.com/p/28561671>  (侵权投诉)
 * a: <http://www.12377.cn>  (网上有害信息举报专区)
 * a: <https://www.zhihu.com/jubao>  (儿童色情信息举报专区)
 * a: <https://credit.szfw.org/CX20170607038331320388.html>  ()
Medias: (7)
 * script: <https://static.zhihu.com/static/revved/-/js/instant.14757a4a.js>
 * img: <https://www.zhihu.com/static/img/spinner/grey-loading.gif>
 * img: <https://static.zhihu.com/static/revved/img/index/chengxing_logo@2x.65dc76e8.png>
 * script: <https://static.zhihu.com/static/revved/-/js/vendor.cb14a042.js>
 * script: <https://static.zhihu.com/static/revved/-/js/closure/base.ba831c49.js>
 * script: <https://static.zhihu.com/static/revved/-/js/closure/common.f088f26f.js>
 * script: <https://static.zhihu.com/static/revved/-/js/closure/page-index.1f1461c3.js>
Imports: (14)
 * link <https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png> (apple-touch-icon)
 * link <https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png> (apple-touch-icon)
 * link <https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png> (apple-touch-icon)
 * link <https://static.zhihu.com/static/revved/img/ios/touch-icon-60.9911cffb.png> (apple-touch-icon)
 * link <https://static.zhihu.com/static/favicon.ico> (shortcut icon)
 * link <https://www.zhihu.com/p1.zhimg.com> (dns-prefetch)
 * link <https://www.zhihu.com/p2.zhimg.com> (dns-prefetch)
 * link <https://www.zhihu.com/p3.zhimg.com> (dns-prefetch)
 * link <https://www.zhihu.com/p4.zhimg.com> (dns-prefetch)
 * link <https://www.zhihu.com/comet.zhihu.com> (dns-prefetch)
 * link <https://www.zhihu.com/static.zhihu.com> (dns-prefetch)
 * link <https://www.zhihu.com/upload.zhihu.com> (dns-prefetch)
 * link <https://static.zhihu.com/static/revved/-/css/pages/unlogin-index/main.4df360a5.css> (stylesheet)
 * link <http://www.zhihu.com> (canonical)
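The listing reads attributes as abs:href and abs:src, which asks Jsoup to resolve each value against the document's base URI. That also suggests why the dns-prefetch rows look odd: one plausible explanation, which is an assumption and not something verified against the page source, is that those href values were written without a scheme or a leading //, so they were resolved as relative paths. The sketch below uses the JDK's java.net.URI to imitate that kind of resolution. It is not Jsoup's own implementation, and the class name AbsUrlSketch is hypothetical.

```java
import java.net.URI;

// Hypothetical demo class, not part of the original post.
public class AbsUrlSketch {
    // Resolve a possibly relative reference against a base URL,
    // similar in spirit to reading an attribute with the abs: prefix.
    static String resolve(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        // Assumed base: the home page URL after the redirect to HTTPS.
        String base = "https://www.zhihu.com/";
        String[] refs = {"#signup", "/app/", "p1.zhimg.com", "//p1.zhimg.com"};
        for (String ref : refs) {
            System.out.println(ref + " -> " + resolve(base, ref));
        }
    }
}
```

With this base, a bare p1.zhimg.com resolves to https://www.zhihu.com/p1.zhimg.com, matching the shape of the dns-prefetch rows, while the protocol-relative //p1.zhimg.com resolves to https://p1.zhimg.com.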

References:

1. Jsoup Chinese guide: http://www.open-open.com/jsoup/
