Heritrix取消robots.txt

      Robots.txt是一种用于限制网络爬虫的文件,如果在构建网站时,在站点内放置一个Robots.txt文件,在其中可以声明不希望搜索引擎访问的部分。然而,这也是Heritrix爬虫在抓取网页时花费过多的时间去判断该Robots.txt文件是否存在。。。好在这个协议本身是一种附加协议,完全可以不遵守。

    在Heritrix的org.archive.crawler.prefetch.PreconditionEnforcer类中定义了获取Robots.txt的方法,我的选择是无论Robots.txt是否存在,都返回不存在,修改方法如下

    private boolean considerRobotsPreconditions(CrawlURI curi) {
    	//此处为提高抓取效率,将忽略Robots.txt协议
    	return false;
        // treat /robots.txt fetches specially
        /*UURI uuri = curi.getUURI();
        try {
            if (uuri != null && uuri.getPath() != null &&
                    curi.getUURI().getPath().equals("/robots.txt")) {
                // allow processing to continue
                curi.setPrerequisite(true);
                return false;
            }
        }
        catch (URIException e) {
            logger.severe("Failed get of path for " + curi);
        }
        // require /robots.txt if not present
        if (isRobotsExpired(curi)) {
        	// Need to get robots
            if (logger.isLoggable(Level.FINE)) {
                logger.fine( "No valid robots for " +
                    getController().getServerCache().getServerFor(curi) +
                    "; deferring " + curi);
            }

            // Robots expired - should be refetched even though its already
            // crawled.
            try {
                String prereq = curi.getUURI().resolve("/robots.txt").toString();
                curi.markPrerequisite(prereq,
                    getController().getPostprocessorChain());
            }
            catch (URIException e1) {
                logger.severe("Failed resolve using " + curi);
                throw new RuntimeException(e1); // shouldn't ever happen
            }
            return true;
        }
        // test against robots.txt if available
        CrawlServer cs = getController().getServerCache().getServerFor(curi);
        if(cs.isValidRobots()){
            String ua = getController().getOrder().getUserAgent(curi);
            if(cs.getRobots().disallows(curi, ua)) {
                if(((Boolean)getUncheckedAttribute(curi,ATTR_CALCULATE_ROBOTS_ONLY)).booleanValue() == true) {
                    // annotate URI as excluded, but continue to process normally
                    curi.addAnnotation("robotExcluded");
                    return false; 
                }
                // mark as precluded; in FetchHTTP, this will
                // prevent fetching and cause a skip to the end
                // of processing (unless an intervening processor
                // overrules)
                curi.setFetchStatus(S_ROBOTS_PRECLUDED);
                curi.putString("error","robots.txt exclusion");
                logger.fine("robots.txt precluded " + curi);
                return true;
            }
            return false;
        }
        // No valid robots found => Attempt to get robots.txt failed
        curi.skipToProcessorChain(getController().getPostprocessorChain());
        curi.setFetchStatus(S_ROBOTS_PREREQUISITE_FAILURE);
        curi.putString("error","robots.txt prerequisite failed");
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("robots.txt prerequisite failed " + curi);
        }*/
        //return true;
    }
 

 

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值