On Heritrix Performance

Kuiiiiiiie

1. Heritrix can use any URL as a seed. As long as the seed page contains links to other URLs, the crawl keeps going until every discovered URL has been fetched.
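
The "keep following links until nothing new is left" behaviour in point 1 can be pictured with a small breadth-first traversal. This is only a toy model, not Heritrix code: the class name SeedCrawlSketch, the in-memory linkGraph (which stands in for fetching a page and extracting its links), and the example.org URLs are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a seed-driven crawl: follow links until no unseen URL remains. */
final class SeedCrawlSketch {

    /**
     * Breadth-first traversal starting from the seed.
     * linkGraph maps a URL to the URLs found on that page (hypothetical stand-in
     * for the fetch and extract steps of a real crawler).
     */
    static Set<String> crawl(String seed, Map<String, List<String>> linkGraph) {
        Set<String> seen = new LinkedHashSet<>();     // URLs already scheduled, in discovery order
        Deque<String> frontier = new ArrayDeque<>();  // URLs still waiting to be processed
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                 // true only the first time a URL is seen
                    frontier.add(link);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "http://example.org/", List.of("http://example.org/a", "http://example.org/b"),
            "http://example.org/a", List.of("http://example.org/b", "http://example.org/"),
            "http://example.org/b", List.of("http://example.org/c"));
        System.out.println(crawl("http://example.org/", graph));
    }
}
```

The crawl ends when the frontier is empty, which is why a seed with no outgoing links finishes immediately and a seed on a well-connected site can run for a very long time.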

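2. Vertical search needs tighter control: you fetch only the URLs you care about and extract only the content you need. Heritrix is highly extensible and can help with both. You can subclass Frontier, Extractor, or Writer, or define your own Rules.

(1) Extractor: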

package org.archive.crawler.extractor;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

/**
 * Extracts <a href="..."> links for the CCER site. Logging is omitted.
 *
 * @author Administrator
 */
public class CCERExtractor extends Extractor {

    /**
     * Link selection rule:
     * if the URL starts with "http": keep it only if it is under http://www.pku.edu.cn;
     * else if it is a mailto or javascript href: skip it;
     * else: treat it as a relative link and keep it.
     */
    public static final String pattern_ahref = "<[aA] href=\"([^\"]+)\""; // group(1) is the href value

    // Compiled once instead of on every call to extract().
    private static final Pattern A_HREF = Pattern.compile(pattern_ahref);

    public CCERExtractor(String name) {
        super(name, "CCER Extractor");
    }

    public CCERExtractor(String name, String description) {
        super(name, description);
    }

    @Override
    protected void extract(CrawlURI curi) {
        HttpRecorder hr = curi.getHttpRecorder();
        if (hr == null) {
            return; // nothing was recorded for this URI
        }
        ReplayCharSequence cs = null;
        try {
            cs = hr.getReplayCharSequence();
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (cs == null) {
            return;
        }
        try {
            Matcher matcher = A_HREF.matcher(cs.toString());
            while (matcher.find()) {
                String newUrl = matcher.group(1);
                String lower = newUrl.toLowerCase();
                if (newUrl.startsWith("http")) {
                    // Case 1: absolute URL, accepted only under www.pku.edu.cn.
                    if (newUrl.startsWith("http://www.pku.edu.cn")) {
                        createAndAddLinkRelativeToBase(curi, newUrl, Link.NAVLINK_HOP);
                    }
                } else if (!lower.startsWith("mailto") && !lower.startsWith("javascript")) {
                    // Case 2: relative URL. mailto and javascript hrefs are ignored.
                    newUrl = newUrl.trim();
                    if (newUrl.startsWith("/")) {
                        newUrl = newUrl.substring(1).trim();
                    }
                    // Prefix the site root so that the link becomes absolute.
                    newUrl = "http://www.ccer.pku.edu.cn/cn/" + newUrl;
                    createAndAddLinkRelativeToBase(curi, newUrl, Link.NAVLINK_HOP);
                }
            }
        } finally {
            // Release the replay buffer once the content has been scanned.
            try {
                cs.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    private void createAndAddLinkRelativeToBase(CrawlURI curi, String newUrl, char hopType) {
        try {
            curi.createAndAddLinkRelativeToBase(newUrl, "", hopType);
        } catch (URIException e) {
            e.printStackTrace();
        }
    }
}
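
The selection rule used by the extractor can also be tried outside Heritrix. The sketch below is a standalone variation, not the author's code: the class HrefFilterSketch and its allowedPrefix parameter are hypothetical, the regular expression is loosened so that href does not have to be the first attribute, and relative links are resolved with java.net.URI.resolve instead of string concatenation. One consequence of that swap is that an href beginning with "/" resolves against the host root rather than against /cn/ as in the extractor above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone sketch of the link-selection rule, without any Heritrix classes. */
final class HrefFilterSketch {

    // Same idea as pattern_ahref, but tolerant of attributes between "<a" and "href".
    private static final Pattern A_HREF =
        Pattern.compile("<a\\b[^>]*?\\bhref=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URLs under allowedPrefix that are linked from the page. */
    static List<String> selectLinks(String pageUrl, String html, String allowedPrefix) {
        URI base = URI.create(pageUrl);
        List<String> kept = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            String lower = href.toLowerCase(Locale.ROOT);
            if (lower.startsWith("mailto:") || lower.startsWith("javascript:")) {
                continue;                                  // not a fetchable page
            }
            String absolute;
            try {
                absolute = base.resolve(href).toString();  // handles "/x", "x" and "../x"
            } catch (IllegalArgumentException e) {
                continue;                                  // malformed href, skip it
            }
            if (absolute.startsWith(allowedPrefix)) {
                kept.add(absolute);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/cn/news.html\">News</a>"
            + "<a class=\"nav\" href=\"about.html\">About</a>"
            + "<a href=\"mailto:someone@example.com\">Mail</a>"
            + "<a href=\"http://other.example.com/\">Elsewhere</a>";
        System.out.println(selectLinks(
            "http://www.ccer.pku.edu.cn/cn/index.html", html, "http://www.ccer.pku.edu.cn/"));
    }
}
```

Keeping the rule in a plain static method like this makes it easy to check against sample pages before wiring it into a processor.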

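P.S. Important: add the new extractor to Processor.options under conf/modules so that it shows up as a choice when you configure a crawl job. Note that the crawler runs its processors strictly in the configured order, so in the job configuration CCERExtractor must come after ExtractorHTTP. Its position inside Processor.options does not matter, as long as it is listed among the extractors.

(2) Frontier: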

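FrontierScheduler is a class in the org.archive.crawler.postprocessor package. Its job is to add the links discovered by the Extractors to the Frontier so that they can be processed later. In its innerProcess(CrawlURI) function it first checks whether the current set of links contains any high-priority ones. If it does, those are handed off for processing immediately. If not, it iterates over all the links and calls the Frontier's schedule() method to put them in the queue. The code is shown in Figure 20.

Figure 20. The innerProcess() and schedule() functions in the FrontierScheduler class

As the code shows, innerProcess() does not call the Frontier's schedule() method directly. It calls the class's own schedule() method, which in turn calls the Frontier's schedule() method. FrontierScheduler's schedule() simply adds the current candidate link to the crawl queue without any check. This indirection gives FrontierScheduler a convenient extension point.

Here we need a subclass of FrontierScheduler, FrontierSchedulerForBjfu. It overrides schedule(CandidateURI caUri) so that only URIs containing "bjfu" are scheduled, which keeps the crawl inside the Beijing Forestry University (BJFU) site. The code of the subclass is shown in Figure 21.

Figure 21. The FrontierSchedulerForBjfu subclass

Then add the line "org.archive.crawler.postprocessor.FrontierSchedulerForBjfu|FrontierSchedulerForBjfu" to Processor.options in the modules folder. After that, the extended org.archive.crawler.postprocessor.FrontierSchedulerForBjfu option can be selected in the crawler's WebUI, as shown in Figure 22.

Figure 22. Replacing FrontierScheduler with FrontierSchedulerForBjfu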
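
Since the figures are not reproduced in this text, here is a rough sketch of the override pattern they describe. It deliberately does not use the Heritrix classes: SchedulerSketch and SchedulerForBjfuSketch are hypothetical stand-ins, URIs are plain strings, the queue field stands in for the Frontier, and the high-priority check in innerProcess() is left out. In a real job the subclass would extend org.archive.crawler.postprocessor.FrontierScheduler and override schedule(CandidateURI caUri) as described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for the scheduling hook. These are NOT the Heritrix classes. */
class SchedulerSketch {
    final List<String> queue = new ArrayList<>();   // stand-in for the Frontier's queue

    /** Walks all candidate links and passes each one through the overridable hook. */
    final void innerProcess(List<String> outLinks) {
        for (String link : outLinks) {
            schedule(link);
        }
    }

    /** Default hook: enqueue every candidate without any check. */
    protected void schedule(String candidateUri) {
        queue.add(candidateUri);
    }
}

/** Subclass that only lets URIs containing "bjfu" reach the queue. */
class SchedulerForBjfuSketch extends SchedulerSketch {
    @Override
    protected void schedule(String candidateUri) {
        if (candidateUri.contains("bjfu")) {
            super.schedule(candidateUri);
        }
    }

    public static void main(String[] args) {
        SchedulerSketch scheduler = new SchedulerForBjfuSketch();
        scheduler.innerProcess(List.of(
            "http://www.bjfu.edu.cn/news/",
            "http://www.example.com/",
            "http://lib.bjfu.edu.cn/"));
        System.out.println(scheduler.queue);
    }
}
```

Because the filter lives in the overridden hook, the loop in innerProcess() stays untouched, which is the extension point the text refers to. A substring test such as contains("bjfu") is coarse: it would also accept an unrelated host whose path happens to contain that string.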

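3. Heritrix's support for Chinese is currently incomplete. For example, a seed URL cannot contain Chinese characters, and URLs that contain Chinese characters are missed when new links are extracted. I have not found other problems so far. Fixing this means changing part of the source code (mainly the regular expressions).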
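
For the seed side of point 3, one possible workaround that avoids touching the Heritrix source is to percent-encode the Chinese characters before the URL is used as a seed. The sketch below uses only java.net.URI from the JDK. The class name ChineseUrlSketch and the example host are assumptions, and it presumes the target server expects UTF-8 in the path, which is not true of every site.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Sketch: turn a URL with Chinese characters in its path into a pure-ASCII URL. */
final class ChineseUrlSketch {

    /** Percent-encodes non-ASCII path characters as UTF-8 and leaves ASCII parts as they are. */
    static String toAsciiUrl(String scheme, String host, String path) {
        try {
            // The multi-argument constructor quotes illegal characters;
            // toASCIIString() then encodes the remaining non-ASCII ones.
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build a URI from the given parts", e);
        }
    }

    public static void main(String[] args) {
        // "\u641c\u7d22" is the two-character Chinese word for "search".
        System.out.println(toAsciiUrl("http", "www.example.com", "/\u641c\u7d22/index.html"));
    }
}
```

This only helps with seeds you supply yourself. Links with Chinese characters that are missed during extraction would still need the changes to the extraction patterns mentioned above.
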
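4. You want to crawl IT job postings from Zhaopin (智联招聘). Judging by the long URL you posted above, it has already been processed by the browser: most likely you typed your search criteria into Zhaopin's search box and clicked Search, and the browser produced that URL. So look at the page source instead and build the URL yourself from its FORM. A URL built that way has not been processed by the browser.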
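
Building the URL from the form, as point 4 suggests, amounts to joining the form's action with its field names and values. The sketch below shows that for a GET form. Everything specific in it is hypothetical: the class FormUrlSketch, the action URL, and the field names are placeholders rather than Zhaopin's real ones, and UTF-8 is assumed even though a given site's form may expect another charset such as GBK.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Sketch: assemble a GET search URL from a form's action and its fields. */
final class FormUrlSketch {

    /** Joins URL-encoded name=value pairs with '&' and appends them to the action URL. */
    static String buildGetUrl(String action, Map<String, String> fields) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            query.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return action + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();  // keeps the form's field order
        fields.put("keyword", "java developer");              // hypothetical field names
        fields.put("city", "\u5317\u4eac");                   // "Beijing" in Chinese
        System.out.println(buildGetUrl("http://jobs.example.com/search", fields));
    }
}
```

The resulting string can then be used as a seed. If the form submits with POST instead of GET, this approach does not apply directly.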
