WebCollector Crawler Analysis

Most recently recommended article published on 2021-02-13 15:28:08

jtf111

Category: WebCollector analysis. Tags: java

Article link: https://blog.csdn.net/jtf19901223/article/details/107407467

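This article introduces Crawler, the core class of the WebCollector crawler framework, with particular attention to AutoParseCrawler, an auto-parsing crawler that implements the Executor, Visitor, and Requester interfaces and fetches pages from URLs. It also mentions crawlers built on different DBManager implementations, such as BreadthCrawler and RamCrawler.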

The core crawler class in WebCollector is cn.edu.hfut.dmic.webcollector.crawler.Crawler. Its main fields are:

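    protected int status; // current run state
    public final static int RUNNING = 1; // run state: running
    public final static int STOPED = 2; // run state: stopped
    protected boolean resumable = false; // whether an interrupted crawl can be resumed; if false, previously crawled URLs are crawled again after a restart
    protected int threads = 50; // number of crawling threads


    protected CrawlDatums seeds = new CrawlDatums(); // seed URLs to crawl
    protected CrawlDatums forcedSeeds = new CrawlDatums(); // seed URLs that are force-injected
    protected Fetcher fetcher; // runs the crawl tasks; manages and schedules the crawler threads
    protected int maxExecuteCount = -1; // maximum number of times each crawl task may be executed. A fetch or parse failure counts as a failed execution. A failed task is executed again in a later iteration; once its failures exceed this maximum, the task generator ignores it

    protected Executor executor = null; // does the actual fetching and processing of content
    protected NextFilter nextFilter = null; // filters the URLs discovered while crawling a page
    protected DBManager dbManager; // manages storage of the crawled URLs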
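The retry behaviour described for maxExecuteCount can be illustrated with a small standalone model. This is only a sketch and not WebCollector's own code: the class RetrySketch, the method crawl, and the fetchSucceeds predicate are hypothetical names, and the sketch assumes an explicit positive limit and that a task is skipped once its execute count reaches that limit.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Simplified model of iteration-based retries; NOT WebCollector's actual implementation.
class RetrySketch {

    /**
     * Runs `depth` iterations over `urls` and returns how many times each URL was executed.
     * `fetchSucceeds` stands in for the real fetch-and-parse step (hypothetical).
     */
    static Map<String, Integer> crawl(List<String> urls, int depth, int maxExecuteCount,
                                      Predicate<String> fetchSucceeds) {
        Map<String, Integer> executeCount = new LinkedHashMap<>();
        Map<String, Boolean> done = new LinkedHashMap<>();
        for (String url : urls) {
            executeCount.put(url, 0);
            done.put(url, false);
        }
        for (int i = 0; i < depth; i++) {
            for (String url : urls) {
                // "Generator" step: skip finished tasks and tasks that used up their attempts.
                // Assumption: the limit is reached when executeCount >= maxExecuteCount.
                if (done.get(url) || executeCount.get(url) >= maxExecuteCount) {
                    continue;
                }
                executeCount.merge(url, 1, Integer::sum);
                if (fetchSucceeds.test(url)) {
                    done.put(url, true);
                }
            }
        }
        return executeCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = crawl(
                Arrays.asList("http://ok.example/", "http://broken.example/"),
                5, 3, url -> url.contains("ok"));
        System.out.println(counts);
    }
}
```

With five iterations and a limit of three, the URL that always fails is attempted in the first three iterations and ignored afterwards, while the successful URL is executed only once.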

The most important method is start, which launches the crawl:

    /**
     * Start crawling and run for depth iterations.
     *
     * @param depth number of iterations
     * @throws Exception if the crawl fails
     */
    public void start(int depth) throws Exception {
        LOG.info(this.toString());

        //register dbmanager conf
        ConfigurationUtils.setTo(this, dbManager, executor, nextFilter);

        registerOtherConfigurations();

        // If the crawler is not resumable, clear any existing data held by the DB manager
        if (!resumable) {
            if (dbManager.isDBExists()) {