The previous article solved downloading pages under a specified path, but Heritrix's URL queues are keyed by a hash of the host. As a result, even with 100 threads configured, only one thread actually runs: by default Heritrix takes a single URL from one queue, fetches it, and only then takes the next. Since the URLs under a specified path almost all live on one host, the crawl degenerates into a single-threaded one and becomes extremely slow.
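To see why host-based keying serializes a single-host crawl, here is a minimal standalone sketch (not Heritrix code; the class and method names are invented for illustration): every URL from the same site produces the same queue key, so they all land in one work queue.

```java
import java.net.URI;

// Minimal illustration of host-based queue keying. With the default
// policy idea, the key is the hostname, so a crawl confined to one
// host collapses onto a single queue and thus a single worker thread.
public class HostKeyDemo {
    // Hypothetical helper: key = hostname, mimicking the default behavior
    static String hostKey(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        String k1 = hostKey("http://example.com/page1.html");
        String k2 = hostKey("http://example.com/deep/page2.html");
        // Same host, therefore the same queue key
        System.out.println(k1 + " == " + k2 + " ? " + k1.equals(k2));
    }
}
```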
With no better option I kept modifying the code, this time changing HostnameQueueAssignmentPolicy, which is the system default. I had earlier tried subclassing AbstractFrontier, but after a lot of configuration I still could not get multithreaded downloads working, so I edited HostnameQueueAssignmentPolicy directly. The key method is getClassKey, which generates the queue key; I hash the URL with the ELFHash algorithm.
The code is as follows:
/* HostnameQueueAssignmentPolicy
*
* $Id: HostnameQueueAssignmentPolicy.java 3838 2005-09-21 23:00:47Z gojomo $
*
* Created on Oct 5, 2004
*
* Copyright (C) 2004 Internet Archive.
*
* This file is part of the Heritrix web crawler (crawler.archive.org).
*
* Heritrix is free software; you can redistribute it and/or modify
* it under the terms of the GNU Lesser Public License as published by
* the Free Software Foundation; either version 2.1 of the License, or
* any later version.
*
* Heritrix is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser Public License for more details.
*
* You should have received a copy of the GNU Lesser Public License
* along with Heritrix; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
package org.archive.crawler.frontier;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;
/**
* QueueAssignmentPolicy based on the hostname:port evident in the given
* CrawlURI.
*
* @author gojomo
*/
// Class that produces the queue key (by default based on the hostname)
public class HostnameQueueAssignmentPolicy extends QueueAssignmentPolicy {
private static final Logger logger = Logger
.getLogger(HostnameQueueAssignmentPolicy.class.getName());
/**
* When neat host-based class-key fails us
*/
private static String DEFAULT_CLASS_KEY = "default...";
private static final String DNS = "dns";
    // Hash the URL to spread keys (and therefore queues) out,
    // enabling multithreaded fetching even within a single host
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        // ELFHash returns a non-negative value, so hash % 100
        // yields one of 100 queue keys: "0" .. "99"
        return Long.toString(hash % 100);
    }
    // ELFHash string hashing algorithm
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
/*public String getClassKey(CrawlController controller, CandidateURI cauri) {
String scheme = cauri.getUURI().getScheme();
String candidate = null;
try {
if (scheme.equals(DNS)){ // a DNS URI
if (cauri.getVia() != null) {
// Special handling for DNS: treat as being
// of the same class as the triggering URI.
// When a URI includes a port, this ensures
// the DNS lookup goes atop the host:port
// queue that triggered it, rather than
// some other host queue
UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
candidate = viaUuri.getAuthorityMinusUserinfo();
// adopt scheme of triggering URI
scheme = viaUuri.getScheme();
} else {
candidate= cauri.getUURI().getReferencedHost();
}
} else {
candidate = cauri.getUURI().getAuthorityMinusUserinfo();
}
if(candidate == null || candidate.length() == 0) {
candidate = DEFAULT_CLASS_KEY;
}
} catch (URIException e) {
logger.log(Level.INFO,
"unable to extract class key; using default", e);
candidate = DEFAULT_CLASS_KEY;
}
if (scheme != null && scheme.equals(UURIFactory.HTTPS)) {
// If https and no port specified, add default https port to
// distinguish https from http server without a port.
if (!candidate.matches(".+:[0-9]+")) {
candidate += UURIFactory.HTTPS_PORT;
}
}
// Ensure classKeys are safe as filenames on NTFS
return candidate.replace(':','#'); // essentially the hostname
}*/
}
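The hashing idea above can be tried out in isolation. The following is a minimal sketch with invented class and method names; the hash body is the same ELFHash as in the policy above. URLs from the same host now scatter across up to 100 queue keys instead of sharing one.

```java
// Standalone sketch (assumed names, not Heritrix code) of the
// URL-hashing queue assignment: ELFHash the full URL, then take
// hash % 100 as the queue key.
public class ElfHashDemo {
    // Same ELFHash as in the modified policy
    static long elfHash(String str) {
        long hash = 0;
        long x;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF; // always non-negative
    }

    // Hypothetical helper mirroring getClassKey: one of 100 keys
    static String classKey(String url) {
        return Long.toString(elfHash(url) % 100);
    }

    public static void main(String[] args) {
        // Two pages on the same host usually land in different queues now
        System.out.println(classKey("http://example.com/a.html"));
        System.out.println(classKey("http://example.com/b.html"));
    }
}
```

Because the hash is masked with 0x7FFFFFFF before the modulo, the result is never negative, so the key is always a valid string from "0" to "99".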
In practice, download speed improved dramatically right away: the crawl is no longer single-threaded, and throughput generally stays around 300 KB/s.
That's it for now; with this in place, you could go on to write a crawler of your own.