A demo of the open-source web crawler WebCollector

1. Environment: JDK 7 + Eclipse Mars

2. WebCollector project page: https://github.com/CrawlScript/WebCollector

      Download webcollector-2.26-bin.zip, unzip it, and add all the JAR files in the extracted folder to your project's build path.
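
      If you use Maven instead of adding JARs manually, WebCollector is also published to Maven Central. A dependency along the following lines should work (coordinates as listed in the project README; verify the exact version against the release you downloaded):

      <dependency>
          <groupId>cn.edu.hfut.dmic.webcollector</groupId>
          <artifactId>WebCollector</artifactId>
          <version>2.26</version>
      </dependency>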

3. Demo source code:

package com;

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Document;

/**
 * Demo of web crawling with WebCollector.
 * @author fjs
 */
public class Demo extends BreadthCrawler {

    /**
     * @param crawlPath path of the directory that maintains this crawler's state
     * @param autoParse if true, BreadthCrawler automatically extracts links
     *                  matching the regex rules from each page
     */
    public Demo(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /* start page */
        this.addSeed("http://guangzhou.qfang.com");
        /* fetch every URL that matches this regex */
        this.addRegex(".*");
        /* a leading "-" marks a negative rule: do not fetch jpg/png/gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs containing # */
        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        Document doc = page.getDoc();
        System.out.println(url);
        System.out.println(doc.title());

        /* To crawl additional URLs, add them to next. */
        /* WebCollector automatically filters links that have already been fetched. */
        /* If autoParse is true and a link added to next does not match the regex
           rules, that link will also be filtered. */
        //next.add("http://gz.house.163.com/");
    }

    public static void main(String[] args) throws Exception {
        Demo crawler = new Demo("path", true);
        crawler.setThreads(50);
        crawler.setTopN(100);
        //crawler.setResumable(true);
        /* start crawling with depth 3 */
        crawler.start(3);
    }
}

4. In real applications, you would parse each Page inside visit() to extract the content you actually need from the crawled pages; a minimal sketch follows.
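
      As a concrete illustration of step 4, the visit() method below runs jsoup CSS selectors against page.getDoc() to pull structured fields out of each page. The selectors (div.listing, a.title) are hypothetical placeholders, not the actual markup of guangzhou.qfang.com; inspect the target pages and substitute the real selectors.

package com;

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseDemo extends BreadthCrawler {

    public ParseDemo(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        this.addSeed("http://guangzhou.qfang.com");
        this.addRegex(".*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        Document doc = page.getDoc();
        /* Hypothetical selectors; replace with the target site's real markup. */
        for (Element item : doc.select("div.listing")) {
            String title = item.select("a.title").text();
            /* "abs:href" resolves relative links against the page URL. */
            String link = item.select("a.title").attr("abs:href");
            System.out.println(title + " -> " + link);
        }
    }

    public static void main(String[] args) throws Exception {
        ParseDemo crawler = new ParseDemo("parse_path", true);
        crawler.setThreads(10);
        crawler.start(2);
    }
}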
