Heritrix notes: rewriting HostnameQueueAssignmentPolicy

My previous post solved downloading pages under a specified path, but Heritrix hashes URLs into work queues keyed by host, so even with 100 threads configured only one thread actually runs: by default, Heritrix takes one URL from a queue, fetches it, and only then takes the next. Since the URLs under a specified path almost all live on a single host, the crawl degenerates into a single-threaded fetch and becomes very slow.
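The single-queue effect described above can be sketched with a minimal stand-in for the default policy (hypothetical class name `DefaultKeyDemo`, using `java.net.URI` rather than Heritrix's own `UURI`): when the queue key is just the host, every URL on that host maps to the same key and therefore the same queue.

```java
import java.net.URI;

public class DefaultKeyDemo {
    // Simplified version of the default behavior: the queue key is the
    // authority (host[:port]) of the URL, so all URLs on one host share
    // one queue and are fetched one after another.
    public static String hostKey(String uri) {
        return URI.create(uri).getAuthority();
    }

    public static void main(String[] args) {
        // Both calls return the same key, so both URLs land in one queue.
        System.out.println(hostKey("http://example.com/a.html"));
        System.out.println(hostKey("http://example.com/b.html"));
    }
}
```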

With no better option I kept modifying the code, this time changing HostnameQueueAssignmentPolicy, the system default. I had earlier tried subclassing AbstractFrontier, but after much configuration I still could not get multithreaded downloading working, so I resorted to editing HostnameQueueAssignmentPolicy directly. The key method is getClassKey, which generates the queue key; I replaced its host-based logic with an ELFHash of the URL.

The code is as follows:

/* HostnameQueueAssignmentPolicy
*
* $Id: HostnameQueueAssignmentPolicy.java 3838 2005-09-21 23:00:47Z gojomo $
*
* Created on Oct 5, 2004
*
* Copyright (C) 2004 Internet Archive.
*
* This file is part of the Heritrix web crawler (crawler.archive.org).
*
* Heritrix is free software; you can redistribute it and/or modify
* it under the terms of the GNU Lesser Public License as published by
* the Free Software Foundation; either version 2.1 of the License, or
* any later version.
*
* Heritrix is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
* GNU Lesser Public License for more details.
*
* You should have received a copy of the GNU Lesser Public License
* along with Heritrix; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
*/ 
package org.archive.crawler.frontier;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

/**
 * QueueAssignmentPolicy based on the hostname:port evident in the given
 * CrawlURI.
 * 
 * @author gojomo
 */
// Policy class that produces the queue key (by default, the hostname)
public class HostnameQueueAssignmentPolicy extends QueueAssignmentPolicy {
    private static final Logger logger = Logger
        .getLogger(HostnameQueueAssignmentPolicy.class.getName());
    /**
     * When neat host-based class-key fails us
     */
    private static String DEFAULT_CLASS_KEY = "default...";
    
    private static final String DNS = "dns";
    // Multithreading-friendly key: hash the full URI so that URLs from a
    // single host are spread across up to 100 queues.
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        return Long.toString(hash % 100);
    }
    // ELFHash string hash algorithm
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }

    /*public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String scheme = cauri.getUURI().getScheme();
        String candidate = null;
        try {
            if (scheme.equals(DNS)) { // DNS lookup URI
                if (cauri.getVia() != null) {
                    // Special handling for DNS: treat as being
                    // of the same class as the triggering URI.
                    // When a URI includes a port, this ensures 
                    // the DNS lookup goes atop the host:port
                    // queue that triggered it, rather than 
                    // some other host queue
                	UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                    candidate = viaUuri.getAuthorityMinusUserinfo();
                    // adopt scheme of triggering URI
                    scheme = viaUuri.getScheme();
                } else {
                    candidate= cauri.getUURI().getReferencedHost();
                }
            } else {
                candidate =  cauri.getUURI().getAuthorityMinusUserinfo();
            }
            
            if(candidate == null || candidate.length() == 0) {
                candidate = DEFAULT_CLASS_KEY;
            }
        } catch (URIException e) {
            logger.log(Level.INFO,
                    "unable to extract class key; using default", e);
            candidate = DEFAULT_CLASS_KEY;
        }
        if (scheme != null && scheme.equals(UURIFactory.HTTPS)) {
            // If https and no port specified, add default https port to
            // distinguish https from http server without a port.
            if (!candidate.matches(".+:[0-9]+")) {
                candidate += UURIFactory.HTTPS_PORT;
            }
        }
        // Ensure classKeys are safe as filenames on NTFS
        return candidate.replace(':','#'); // usually just the hostname
    }*/

}
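To see how the rewritten key spreads one host's URLs across queues, here is a standalone sketch (hypothetical class name `ELFHashDemo`; the hash function is copied verbatim from the policy above):

```java
public class ELFHashDemo {
    // Same ELFHash as in the modified policy.
    public static long elfHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    // Queue key = hash modulo 100, so there are at most 100 distinct queues,
    // and different pages on one host generally get different keys.
    public static String classKey(String uri) {
        return Long.toString(elfHash(uri) % 100);
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://example.com/page1.html",
            "http://example.com/page2.html",
            "http://example.com/page3.html",
        };
        for (String u : uris) {
            System.out.println(u + " -> queue " + classKey(u));
        }
    }
}
```

Because the key is derived from the whole URL rather than the host, same-host URLs no longer serialize behind one queue, which is exactly what allows the configured worker threads to run in parallel.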


In practice the download speed improved immediately and dramatically: it is at least no longer single-threaded, and generally holds at around 300 KB/s.

That's it for now; with this in place, I could later write a crawler of my own.
