WebCollector Crawler Configuration Options (Proxy, Resumable Crawling, etc.)

Latest recommended article published on 2023-03-16 18:15:27

AJAXHu

Tags: crawler, java

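BreadthCrawler is one of the most commonly used crawlers in WebCollector; it relies on the file system to store crawl information. Taking BreadthCrawler as an example, this post describes WebCollector's crawl configuration: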

import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;
import java.net.InetSocketAddress;
import java.net.Proxy;


public class MyCrawler extends BreadthCrawler{

    /* Define your own processing in the visit method */
    @Override
    public void visit(Page page) {
        System.out.println("URL:"+page.getUrl());
        System.out.println("Content-Type:"+page.getResponse().getContentType());
        /* Print the HTTP status code (the original printed the content type twice) */
        System.out.println("Code:"+page.getResponse().getCode());
        System.out.println("-----------------------------");
    }
    
    public static void main(String[] args) throws Exception{
        MyCrawler crawler=new MyCrawler();
        
        /* Configure the crawl of the Hefei University of Technology website */
        crawler.addSeed("http://www.hfut.edu.cn/ch/");
        crawler.addRegex("http://.*hfut\\.edu\\.cn/.*");
        
        /* Set the folder where crawl records are saved */
        crawler.setCrawlPath("crawl_hfut");
        
        /* Set the number of threads */
        crawler.setThreads(50);
        
        /* Set whether the crawler runs in resumable (checkpoint) mode */
        crawler.setResumable(false);
        
        /* Set the proxy server */
        Proxy proxy=new Proxy(Proxy.Type.HTTP, new InetSocketAddress("14.18.16.67",80));
        crawler.setProxy(proxy);
        
        /* Set the User-Agent */
        crawler.setUseragent("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) Gecko/20100101 Firefox/26.0");
        
        /* Set the Cookie */
        crawler.setCookie("......");
        
        /* Crawl to a depth of 5 */
        crawler.start(5);
    }
  
}
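
The pattern passed to addRegex in the listing above decides which discovered links the crawler follows. The sketch below uses only java.util.regex to show what that pattern accepts. It assumes the whole URL must match the pattern, which is an assumption about WebCollector's matching semantics rather than something this post states. The class UrlRegexDemo and its accepts method are hypothetical helpers, not part of WebCollector.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebCollector: tests URLs against the
// pattern from the listing, assuming full-match semantics.
class UrlRegexDemo {
    // Same pattern string as in crawler.addRegex(...).
    static final Pattern FOLLOW = Pattern.compile("http://.*hfut\\.edu\\.cn/.*");

    // Returns true when the entire URL matches the pattern.
    static boolean accepts(String url) {
        return FOLLOW.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.hfut.edu.cn/ch/",               // seed URL
            "http://news.hfut.edu.cn/list.htm",         // subdomain (made-up path)
            "https://www.hfut.edu.cn/ch/",              // https, not http
            "http://www.example.com/index.html",        // unrelated site
            "http://www.example.com/?ref=hfut.edu.cn/"  // domain appears only in the query
        };
        for (String url : urls) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

Under that assumption, the https URL is rejected because the pattern starts with a literal http://, while the last URL is accepted because the leading .* can match across the host. If either behavior is unwanted, a tighter pattern such as one anchored on the host part may be worth considering.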

A note on the above: setCrawlPath is specific to BreadthCrawler. It sets the folder where crawl records are stored; if it is not specified, a folder named crawl is used by default.

In resumable mode, make sure every run of the same crawler uses the same CrawlPath, because the crawl records are stored under CrawlPath.
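
To make the point about reusing CrawlPath concrete, here is a toy model of a file-backed crawl record. It is not WebCollector's actual storage format: the class CrawlRecordDemo, the visited.txt file name, and the rule that a non-resumable run starts from an empty record are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only, NOT WebCollector's real storage: one visited URL per line
// in <crawlPath>/visited.txt (hypothetical file name).
class CrawlRecordDemo {
    private final Path recordFile;
    private final Set<String> visited = new LinkedHashSet<>();

    CrawlRecordDemo(Path crawlPath, boolean resumable) throws IOException {
        Files.createDirectories(crawlPath);
        recordFile = crawlPath.resolve("visited.txt");
        if (resumable && Files.exists(recordFile)) {
            // Resume: load what an earlier run already recorded.
            visited.addAll(Files.readAllLines(recordFile, StandardCharsets.UTF_8));
        } else {
            // Assumed behavior: a fresh crawl discards any earlier record.
            Files.deleteIfExists(recordFile);
        }
    }

    // Returns true if the URL is new and was recorded, false if already visited.
    boolean markVisited(String url) throws IOException {
        if (!visited.add(url)) {
            return false;
        }
        Files.write(recordFile, Collections.singletonList(url), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("crawl_demo");
        Path crawlPath = base.resolve("crawl_hfut");
        String url = "http://www.hfut.edu.cn/ch/";

        // First run records the URL.
        new CrawlRecordDemo(crawlPath, false).markVisited(url);
        // Same path, resumable: the earlier record is found, so the URL is skipped.
        System.out.println(new CrawlRecordDemo(crawlPath, true).markVisited(url));
        // Different path, resumable: nothing to resume from, so the URL counts as new.
        System.out.println(new CrawlRecordDemo(base.resolve("other"), true).markVisited(url));
    }
}
```

In this model, a resumed run pointed at a different folder has no way to know what was already fetched, which is the same reason the post asks for a consistent CrawlPath.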
