The previous article solved downloading pages under a specified path, but Heritrix's URL queues are keyed by a hash of the host. As a result, even with 100 threads configured, only one thread actually runs: by default Heritrix takes a single URL from one queue, fetches it, and only then takes the next. Since the URLs under a specified path almost all live on one host, the crawl degenerates into a single-threaded one and becomes extremely slow.
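To see why host-based keying serializes a single-host crawl, here is a minimal standalone sketch (not Heritrix code; the class and method names are invented for illustration): every URL from the same site produces the same queue key, so they all land in one work queue.

```java
import java.net.URI;

// Minimal illustration of host-based queue keying. With the default
// policy idea, the key is the hostname, so a crawl confined to one
// host collapses onto a single queue and thus a single worker thread.
public class HostKeyDemo {
    // Hypothetical helper: key = hostname, mimicking the default behavior
    static String hostKey(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        String k1 = hostKey("http://example.com/page1.html");
        String k2 = hostKey("http://example.com/deep/page2.html");
        // Same host, therefore the same queue key
        System.out.println(k1 + " == " + k2 + " ? " + k1.equals(k2));
    }
}
```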
With no better option I kept modifying the code, this time changing HostnameQueueAssignmentPolicy, which is the system default. I had earlier tried subclassing AbstractFrontier, but after a lot of configuration I still could not get multithreaded downloads working, so I edited HostnameQueueAssignmentPolicy directly. The key method is getClassKey, which generates the queue key; I hash the URL with the ELFHash algorithm.
The code is as follows:
/* HostnameQueueAssignmentPolicy
*
* $Id: HostnameQueueAssignmentPolicy.java 3838 2005-09-21 23:00:47Z gojomo $
*
* Created on Oct 5, 2004
*
* Copyright (C) 2004 Internet Archive.
*
* This file is part of the Heritrix web crawler (crawler.archive.org).
*
* Heritrix is free software; you can redistribute it and/or modify
* it under the terms of the GNU Lesser Public License as published by
* the Free Software Foundation; either version 2.1 of the License, or
* any later version.
*
* Heritrix is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser Public License for more details.
*
* You should have received a copy of the GNU Lesser Public License
* along with Heritrix; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*/
package org.archive.crawler.frontier;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;
/**
* QueueAssignmentPolicy based on the hostname:port evident in the given
* CrawlURI.
*
* @author gojomo
*/
// Class that produces the queue key (by default based on the hostname)
public class HostnameQueueAssignmentPolicy extends QueueAssignmentPolicy {
private static final Logger logger = Logger
.getLogger(HostnameQueueAssignmentPolicy.class.getName());
/**
* When neat host-based class-key fails us
*/
private static String DEFAULT_CLASS_KEY = "default...";
private static final String DNS = "dns";
    // Hash the URL to spread keys (and therefore queues) out,
    // enabling multithreaded fetching even within a single host
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        long hash = ELFHash(uri);
        // ELFHash returns a non-negative value, so hash % 100
        // yields one of 100 queue keys: "0" .. "99"
        return Long.toString(hash % 100);
    }
    // ELFHash string hashing algorithm
    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
/*public String getClassKey(CrawlController controller, CandidateURI cauri) {
String scheme = cauri.getUURI().getScheme();
String candidate = null;
try {
if (scheme.equals(DNS)){ // a DNS URI
if (cauri.getVia() != null) {
// Special handling for DNS: treat as being
// of the same class as the triggering URI.
// When a URI includes a port, this ensures
// the DNS lookup goes atop the host:port
// queue that triggered it, rather than
// some other host queue
UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
candidate = viaUuri.getAuthorityMinusUserinfo();
// adopt scheme of triggering URI
scheme = viaUuri.getScheme();
} else {
candidate= cauri.getUURI().getReferencedHost();
}
} else {
candidate = cauri.getUURI().getAuthorityMinusUserinfo();
}
if(candidate == null || candidate.length() == 0) {
candidate = DEFAULT_CLASS_KEY;
}
} catch (URIException e) {
logger.log(Level.INFO,
"unable to extract class key; using default", e);
candidate = DEFAULT_CLASS_KEY;
}
if (scheme != null && scheme.equals(UURIFactory.HTTPS)) {
// If https and no port specified, add default https port to
// distinguish https from http server without a port.
if (!candidate.matches(".+:[0-9]+")) {
candidate += UURIFactory.HTTPS_PORT;
}
}
// Ensure classKeys are safe as filenames on NTFS
return candidate.replace(':','#'); // essentially the hostname
}*/
}
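The hashing idea above can be tried out in isolation. The following is a minimal sketch with invented class and method names; the hash body is the same ELFHash as in the policy above. URLs from the same host now scatter across up to 100 queue keys instead of sharing one.

```java
// Standalone sketch (assumed names, not Heritrix code) of the
// URL-hashing queue assignment: ELFHash the full URL, then take
// hash % 100 as the queue key.
public class ElfHashDemo {
    // Same ELFHash as in the modified policy
    static long elfHash(String str) {
        long hash = 0;
        long x;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF; // always non-negative
    }

    // Hypothetical helper mirroring getClassKey: one of 100 keys
    static String classKey(String url) {
        return Long.toString(elfHash(url) % 100);
    }

    public static void main(String[] args) {
        // Two pages on the same host usually land in different queues now
        System.out.println(classKey("http://example.com/a.html"));
        System.out.println(classKey("http://example.com/b.html"));
    }
}
```

Because the hash is masked with 0x7FFFFFFF before the modulo, the result is never negative, so the key is always a valid string from "0" to "99".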
In practice, download speed improved dramatically right away: the crawl is no longer single-threaded, and throughput generally stays around 300 KB/s.
That's it for now; with this in place, you could go on to write a crawler of your own.