Heritrix supports two main strategies for topic-focused crawling: link-based and content-based.
They map to two extension points: extending FrontierScheduler (which decides whether a URI is kept as a candidate URL; each candidate URL gets its own processing thread) and extending Extractor (which decides what to extract from a page's content).
I. Extending FrontierScheduler
1. Create org.archive.crawler.postprocessor.MyFrontierScheduler
Extend the FrontierScheduler class and override its schedule method, filtering so that only URLs containing "news" are kept:
package org.archive.crawler.postprocessor;

import org.archive.crawler.datamodel.CandidateURI;

public class MyFrontierScheduler extends FrontierScheduler {

    private static final long serialVersionUID = -1074778906898000967L;

    /**
     * @param name Name of this filter.
     */
    public MyFrontierScheduler(String name) {
        super(name);
    }

    @Override
    protected void schedule(CandidateURI caUri) {
        // Only schedule URIs whose string form contains "news".
        if (caUri.toString().contains("news")) {
            System.out.println(caUri.toString());
            getController().getFrontier().schedule(caUri);
        }
    }
}
2. Add this class to the process.options file under conf.
3. Configure the crawl job in the Heritrix web UI, and make sure the seed URL contains "news".
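For step 2, the entry added to the options file is a single line in the `fully.qualified.ClassName|DisplayName` form that Heritrix 1.x options files use (the display name chosen here is an assumption):

```
org.archive.crawler.postprocessor.MyFrontierScheduler|MyFrontierScheduler
```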
II. Extending Extractor
1. Add MyExtractor under org.archive.extractor.
2. Override the extract method, which receives a CrawlURI parameter.
CrawlURI is a wrapper around a candidate URI (it bundles the HttpRecorder, Link objects, and so on). Through its HttpRecorder you can obtain the page content as a CharSequence. The API documentation describes the CrawlURI class as follows:
Represents a candidate URI and the associated state it collects as it is crawled.
Core state is in instance variables but a flexible attribute list is also available. Use this 'bucket' to carry custom processing extracted data and state across CrawlURI processing. See the CandidateURI.putString(String, String), CandidateURI.getString(String), etc.
Using CrawlURI, we can first obtain the page text, then search that text for the links we want, and finally append those links to the crawl queue.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

public class MyExtractor extends Extractor {

    private static final long serialVersionUID = -963034874191929396L;

    // Matches anchor tags such as <a href="XXXX" ... >
    private String HREF = "<a(.*)href\\s*=\\s*(\"([^\"]*)\"|([^\\s>]*))(.*)>";
    // Matches the target news pages, e.g.
    // http://news.sina.com.cn/c/nd/2015-12-08/doc-ifxmhqac0214384.shtml
    // http://mil.news.sina.com.cn/jqs/2015-12-08/doc-ifxmnurf8411968.shtml
    private String sinaUrl = "http://(.*)news.sina.com.cn(.*).shtml";

    public MyExtractor(String name, String description) {
        super(name, description);
    }

    public MyExtractor(String name) {
        super(name, "sina extractor");
    }

    @Override
    protected void extract(CrawlURI curi) {
        String url = "";
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if (null == hr) {
                throw new Exception("httprecorder is null");
            }
            ReplayCharSequence rc = hr.getReplayCharSequence();
            if (null == rc) {
                return;
            }
            String context = rc.toString();
            Pattern pattern = Pattern.compile(HREF, Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(context);
            while (matcher.find()) {
                url = matcher.group(2);          // href value, possibly quoted
                url = url.replace("\"", "");
                if (url.matches(sinaUrl)) {
                    System.out.println(url);
                    // Queue the discovered link as a navigation-link hop.
                    curi.createAndAddLinkRelativeToBase(url, context, Link.NAVLINK_HOP);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
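The two regular expressions carry the whole extraction logic, so it helps to see them in isolation: group 2 of the href pattern captures the attribute value (quotes included, hence the replace), and the sinaUrl pattern then whitelists the link. A minimal standalone sketch, using a made-up HTML snippet and a hypothetical helper name `firstHref`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexDemo {
    static final String HREF = "<a(.*)href\\s*=\\s*(\"([^\"]*)\"|([^\\s>]*))(.*)>";
    static final String SINA_URL = "http://(.*)news.sina.com.cn(.*).shtml";

    // Return the href value of the first anchor tag in html, quotes stripped,
    // or null when no anchor is found.
    static String firstHref(String html) {
        Matcher m = Pattern.compile(HREF, Pattern.CASE_INSENSITIVE).matcher(html);
        return m.find() ? m.group(2).replace("\"", "") : null;
    }

    public static void main(String[] args) {
        String html = "<a class=\"t\" href=\"http://news.sina.com.cn/c/nd/doc-x.shtml\">t</a>";
        String url = firstHref(html);
        System.out.println(url);
        System.out.println(url.matches(SINA_URL));
    }
}
```

Note that `String.matches` anchors the whole string, which is why the pattern needs the leading and trailing `(.*)` groups.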
PS: the MyExtractor(String) constructor must be present, because Heritrix uses that constructor by default.
3. Add this class to conf/process.options, then select it in the extract section of the web UI; note that this class must be placed below HTTP in the processor chain.
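By analogy with section I, the registration entry would be a single `ClassName|DisplayName` line; the package follows step 1 above and the display name is an assumption:

```
org.archive.extractor.MyExtractor|MyExtractor
```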