jsoup – Basic web crawler example

Latest recommended article published on 2020-09-16 21:21:58

cyan20115

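A web crawler is a program that navigates the Web and finds new or updated pages for indexing. The crawler starts from seed websites or a wide range of popular URLs (also known as the frontier) and searches in depth and width for hyperlinks to extract.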

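A web crawler must be kind and robust. Kindness means that the crawler respects the rules set by robots.txt and avoids visiting a website too frequently. Robustness refers to the ability to avoid spider traps and other malicious behavior. Other good attributes for a web crawler are the ability to be distributed across multiple machines, scalability, continuity, and the ability to prioritize based on page quality.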
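To make the "do not visit too frequently" half of kindness concrete, here is a minimal sketch of a per-host delay check. The class name PolitenessGate, the method tryAcquire and the one-second delay are assumptions made for illustration; they are not part of jsoup, and the sketch does not read robots.txt at all. The current time is passed in as a parameter so that the behavior is easy to follow.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of jsoup): enforces a minimum delay between requests to one host.
class PolitenessGate {
    private final long minDelayMillis;
    // host -> time of the last permitted request, in milliseconds
    private final Map<String, Long> lastVisit = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Returns true and records the visit if the host of the URL may be contacted at nowMillis.
    boolean tryAcquire(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false; // no host in the URL, so there is nothing to fetch
        }
        Long last = lastVisit.get(host);
        if (last != null && nowMillis - last < minDelayMillis) {
            return false; // too soon for this host
        }
        lastVisit.put(host, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        PolitenessGate gate = new PolitenessGate(1000); // assumed delay: 1 second per host
        System.out.println(gate.tryAcquire("http://www.mkyong.com/", 0));          // true
        System.out.println(gate.tryAcquire("http://www.mkyong.com/page/2", 500));  // false
        System.out.println(gate.tryAcquire("http://www.mkyong.com/page/2", 1000)); // true
    }
}
```

A crawler could call tryAcquire(url, System.currentTimeMillis()) before each fetch and postpone the URL when it returns false.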

1. Steps to create a web crawler

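The basic steps to write a web crawler are:

1. Pick a URL from the frontier
2. Fetch the HTML code
3. Parse the HTML to extract links to other URLs
4. Check if you have already crawled the URLs and/or if you have seen the same content before
   - If not, add it to the index
5. For each extracted URL
   - Confirm that it agrees to be checked (robots.txt, crawling frequency)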
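The "seen the same content before" part of step 4 is not implemented by the examples in this article. One possible way to do it, shown here only as a sketch, is to keep a fingerprint of every page body and compare new pages against that set. The class name SeenContent and its methods are hypothetical, and a SHA-256 hash only catches byte-for-byte identical pages, not near-duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers a SHA-256 fingerprint of every page body seen so far.
class SeenContent {
    private final Set<String> fingerprints = new HashSet<>();

    // Returns true the first time a page body is seen, false for an exact repeat.
    boolean markIfNew(String html) {
        return fingerprints.add(sha256Hex(html));
    }

    // Hashes the text with SHA-256 and returns the digest as 64 lowercase hex characters.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SeenContent seen = new SeenContent();
        System.out.println(seen.markIfNew("<html>same page</html>"));  // true
        System.out.println(seen.markIfNew("<html>same page</html>"));  // false
        System.out.println(seen.markIfNew("<html>other page</html>")); // true
    }
}
```

In a crawler, markIfNew could be called with the fetched HTML right after the download, and the page would be indexed only when it returns true.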

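Truth be told, developing and maintaining one web crawler across all pages on the Internet is... difficult, if not impossible, considering that there are over 1 billion websites online right now. If you are reading this article, chances are you are not looking for a guide to creating a web crawler but a web scraper. Why is the article called "Basic web crawler" then? Well... because it is catchy... really! Few people know the difference between crawlers and scrapers, so we all tend to use the word "crawling" for everything, even for offline data scraping. Also, because to build a web scraper you need a crawl agent too. And finally, because this article intends to inform as well as provide a viable example.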

2. Skeleton of a crawler

For HTML parsing we will use jsoup. The examples below were developed using jsoup version 1.10.2.

pom.xml

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>

So let's start with the basic code for a web crawler.

BasicWebCrawler.java

package com.mkyong.basicwebcrawler;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URLs 
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }

                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");

                //5. For each extracted URL... go back to Step 4.
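                for (Element page : linksOnPage) {
                    //abs:href is empty when the link cannot be resolved to an absolute URL;
                    //Jsoup.connect("") would throw an IllegalArgumentException, so skip those
                    String next = page.attr("abs:href");
                    if (!next.isEmpty()) {
                        getPageLinks(next);
                    }
                }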
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("http://www.mkyong.com/");
    }

}

Note
Don't let this code run for too long. It can take hours to finish.

Sample output:

http://www.mkyong.com/
[The remaining output lines were lost in extraction. Only these link titles survive: Android Tutorial, Java I/O Tutorial, Java XML Tutorial, Java JSON Tutorial, Java Regular Expression Tutorial, JDBC Tutorial]

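As we mentioned before, a web crawler searches in width and depth for links. If we imagine the links on a website as a tree-like structure, the root node or level zero is the link we start with, the next level is all the links we found on level zero, and so on.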
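The level idea can be sketched without any network access. The example below walks a small, made-up link graph breadth-first and records the level at which each page is first found. The class LinkLevels and the paths in the graph are assumptions for illustration only. Note that this is a breadth-first walk, whereas the crawlers in this article follow links recursively, that is, depth-first.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical helper: assigns a level to every page reachable from the start page.
class LinkLevels {
    // Breadth-first walk: maps each reachable page to the level at which it is first found.
    static Map<String, Integer> levels(Map<String, List<String>> linkGraph, String start) {
        Map<String, Integer> level = new LinkedHashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        level.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            // Pages without an entry in the graph are treated as having no links
            for (String next : linkGraph.getOrDefault(page, Collections.emptyList())) {
                if (!level.containsKey(next)) {
                    level.put(next, level.get(page) + 1);
                    queue.add(next);
                }
            }
        }
        return level;
    }

    public static void main(String[] args) {
        // Made-up site: "/" links to "/a" and "/b", "/a" links to "/b" and "/c", "/c" links back to "/"
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("/", Arrays.asList("/a", "/b"));
        graph.put("/a", Arrays.asList("/b", "/c"));
        graph.put("/c", Arrays.asList("/"));
        System.out.println(levels(graph, "/")); // {/=0, /a=1, /b=1, /c=2}
    }
}
```

The link from "/c" back to "/" does not change the result, because a page keeps the level at which it was first seen.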

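3. Taking crawling depth into account

We will modify the previous example to set the depth of link extraction. Notice that the only real difference between this example and the previous one is that the recursive getPageLinks() method has an integer argument that represents the depth of the link, which is also added as a condition in the if statement. Because the condition is depth < MAX_DEPTH, a MAX_DEPTH of 2 fetches the start page (depth 0) and the pages it links to (depth 1).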

WebCrawlerWithDepth.java

package com.mkyong.depthwebcrawler;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class WebCrawlerWithDepth {
    private static final int MAX_DEPTH = 2;
    private HashSet<String> links;

    public WebCrawlerWithDepth() {
        links = new HashSet<>();
    }

    public void getPageLinks(String URL, int depth) {
        if ((!links.contains(URL) && (depth < MAX_DEPTH))) {
            System.out.println(">> Depth: " + depth + " [" + URL + "]");
            try {
                links.add(URL);

                Document document = Jsoup.connect(URL).get();
                Elements linksOnPage = document.select("a[href]");

                depth++;
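                for (Element page : linksOnPage) {
                    //Skip links that cannot be resolved to an absolute URL (abs:href is empty)
                    String next = page.attr("abs:href");
                    if (!next.isEmpty()) {
                        getPageLinks(next, depth);
                    }
                }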
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        new WebCrawlerWithDepth().getPageLinks("http://www.mkyong.com/", 0);
    }
}

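Note
Feel free to run the above code. It only took a few minutes on my laptop with depth set to 2. Please keep in mind that the higher the depth, the longer it will take to finish.

Sample output: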

...
>> Depth: 1 [https://docs.gradle.org/current/userguide/userguide.html]
>> Depth: 1 [http://hibernate.org/orm/]
>> Depth: 1 [https://jax-ws.java.net/]
>> Depth: 1 [http://tomcat.apache.org/tomcat-8.0-doc/index.html]
>> Depth: 1 [http://www.javacodegeeks.com/]
For 'http://www.javacodegeeks.com/': HTTP error fetching URL
>> Depth: 1 [http://beust.com/weblog/]
>> Depth: 1 [https://dzone.com]
>> Depth: 1 [https://wordpress.org/]
>> Depth: 1 [http://www.liquidweb.com/?RID=mkyong]
>> Depth: 1 [http://www.mkyong.com/privacy-policy/]

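4. Data scraping vs. data crawling

So far so good for a theoretical approach to the matter. The fact is that you will hardly ever build a generic crawler, and if you want a "real" one, you should use tools that already exist. Most of what the average developer does is extract specific information from specific websites, and even though that includes building a web crawler, it is actually called web scraping.

There is a very good article by Arpan Jha for PromptCloud on data scraping vs. data crawling, which personally helped me a lot to understand the distinction, and I would suggest reading it.

To summarize it with a table taken from that article:

Data scraping	Data crawling
Involves extracting data from various sources, including the web	Refers to downloading pages from the web
Can be done at any scale	Mostly done at a large scale
Deduplication is not necessarily a part	Deduplication is an essential part
Needs a crawl agent and a parser	Needs only a crawl agent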

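As promised in the introduction, it is time to get out of theory and into a viable example. Let's imagine a scenario in which we want to get all the URLs of the articles related to Java 8 from mkyong.com. Our goal is to retrieve that information in the shortest time possible and thus avoid crawling through the whole website. Crawling the whole site would waste not only the server's resources but also our time.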

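5. Case study – Extract all articles for "Java 8" on mkyong.com

5.1 The first thing we should do is look at the code of the website. Taking a quick look at mkyong.com, we can easily notice the paging on the front page and that each page follows a /page/xx pattern.

That brings us to the realization that the information we are looking for is easily accessed by retrieving all the links that include /page/. So instead of running through the whole website, we will limit our search using document.select("a[href^=\"http://www.mkyong.com/page/\"]"). With this CSS selector we collect only the links that start with http://www.mkyong.com/page/.

5.2 The next thing we notice is that the titles of the articles (which is what we want) are wrapped in <h2></h2> and <a href=""></a> tags.

So to extract the article titles we will access that specific information using a CSS selector that restricts our select method to exactly that information: document.select("h2 a[href^=\"http://www.mkyong.com/\"]");

5.3 Finally, we keep only the links whose title contains "Java 8" and save them to a file.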
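The title check in step 5.3 can be tried in isolation. The sketch below is a variant, not the article's own code: it uses a case-insensitive java.util.regex.Pattern, so it accepts any mix of upper and lower case, while the listing in this article accepts only the spellings "Java 8", "java 8" and "JAVA 8". The class name TitleFilter and the sample titles are made up for illustration. Like the article's regex, it also matches longer numbers that begin with 8.

```java
import java.util.regex.Pattern;

// Hypothetical helper: decides whether an article title mentions Java 8.
class TitleFilter {
    // Case-insensitive search for "java 8" anywhere in the title
    private static final Pattern JAVA_8 = Pattern.compile("java 8", Pattern.CASE_INSENSITIVE);

    static boolean mentionsJava8(String title) {
        // find() looks for the pattern anywhere in the text, so no leading or trailing .* is needed
        return JAVA_8.matcher(title).find();
    }

    public static void main(String[] args) {
        // The titles below are invented for the example
        System.out.println(mentionsJava8("Java 8 Streams filter examples")); // true
        System.out.println(mentionsJava8("JAVA 8 forEach examples"));        // true
        System.out.println(mentionsJava8("Java 9 modules"));                 // false
    }
}
```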

Extractor.java

package com.mkyong.extractor;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class Extractor {
    private HashSet<String> links;
    private List<List<String>> articles;

    public Extractor() {
        links = new HashSet<>();
        articles = new ArrayList<>();
    }

    //Find all URLs that start with "http://www.mkyong.com/page/" and add them to the HashSet
    public void getPageLinks(String URL) {
        //links.add returns false if this URL was already visited
        if (links.add(URL)) {
            //Comment out the line below if you do not want to see the progress in the console
            System.out.println(URL);
            try {
                Document document = Jsoup.connect(URL).get();
                Elements otherLinks = document.select("a[href^=\"http://www.mkyong.com/page/\"]");

                //Follow every paging link found on this page
                for (Element page : otherLinks) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println(e.getMessage());
            }
        }
    }

    //Connect to each link saved in the article and find all the articles in the page
    public void getArticles() {
        links.forEach(x -> {
            Document document;
            try {
                document = Jsoup.connect(x).get();
                Elements articleLinks = document.select("h2 a[href^=\"http://www.mkyong.com/\"]");
                for (Element article : articleLinks) {
                    //Only retrieve the titles of the articles that contain Java 8
                    if (article.text().matches("^.*?(Java 8|java 8|JAVA 8).*$")) {
                        //Remove the comment from the line below if you want to see it running on your editor, 
                        //or wait for the File at the end of the execution
                        //System.out.println(article.attr("abs:href"));

                        ArrayList<String> temporary = new ArrayList<>();
                        temporary.add(article.text()); //The title of the article
                        temporary.add(article.attr("abs:href")); //The URL of the article
                        articles.add(temporary);
                    }
                }
            } catch (IOException e) {
                System.err.println(e.getMessage());
            }
        });
    }

    public void writeToFile(String filename) {
        //try-with-resources closes the writer even if a write fails
        try (FileWriter writer = new FileWriter(filename)) {
            for (List<String> a : articles) {
                String temp = "- Title: " + a.get(0) + " (link: " + a.get(1) + ")\n";
                //display to console
                System.out.println(temp);
                //save to file
                writer.write(temp);
            }
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }

    public static void main(String[] args) {
        Extractor bwc = new Extractor();
        bwc.getPageLinks("http://www.mkyong.com");
        bwc.getArticles();
        bwc.writeToFile("Java 8 Articles");
    }
}

Output: