Heritrix Crawling: Advanced Topics

wukjong_1988

Category: Network Information Architecture. Tags: JavaScript, Apache, Blog, ViewUI

Article link: https://blog.csdn.net/yuanbohan/article/details/83765031

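Crawling web pages with Heritrix is straightforward: after half a day spent reading my earlier blog posts, you should be able to run a crawl job without trouble. During a crawl, however, you may run into the following issues:
[b]1 Crawling only pages of a specific format or meeting specific requirements[/b]
The right approach depends on the specific site, mainly on how the out-links in its pages are written. If a link looks like <a href="http://www.xxx.xxx.xx...." ..> and points directly at a concrete URL, then that complete URL is what should be added to the URIs. If the link omits the leading http://www... part and simply points to a page within the same site, remember to prepend that prefix when adding it, so that it becomes a complete page URL. Based on the page content of the CCER site, I wrote CCERExtractor.java to do the filtering, so that only URLs meeting the conditions are crawled.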


package org.archive.crawler.extractor;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

/**
 * Extracts links from CCER pages and keeps only those that pass the filter.
 * Logging is omitted here.
 * @author Administrator
 *
 */
public class CCERExtractor extends Extractor{

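	/**
	 * Matches anchor tags; group(1) is the href value. Filtering rules applied in extract():
	 * href starts with http				----		kept only if it is under http://www.pku.edu.cn
	 * href starts with mailto or javascript	----		dropped
	 * anything else (site-relative link)	----		kept, prefixed with http://www.ccer.pku.edu.cn/cn/
	 */
	public static final String pattern_ahref = "<[aA] href=\"([^\"]+)\"";// group(1) captures the href value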

	public CCERExtractor(String name){
		super(name,"CCER Extractor");
	}

	public CCERExtractor(String name, String description) {
		super(name, description);
	}

	@Override
	protected void extract(CrawlURI curi) {
		HttpRecorder hr = curi.getHttpRecorder();
		if(hr == null){// nothing was recorded for this URI
			return;
		}
		ReplayCharSequence cs = null;
		try {
			cs = hr.getReplayCharSequence();
		} catch (IOException e) {
			e.printStackTrace();
		}
		if(cs == null){
			return;
		}

		try {
			String content = cs.toString();
			Matcher matcher = Pattern.compile(CCERExtractor.pattern_ahref).matcher(content);
			while(matcher.find()){
				String newUrl = matcher.group(1).trim();
				String lower = newUrl.toLowerCase();
				if(newUrl.startsWith("http")){// absolute link
					// case 1: keep only links under http://www.pku.edu.cn
					// (absolute links to http://www.ccer.pku.edu.cn do not match this prefix)
					if(newUrl.startsWith("http://www.pku.edu.cn")){
						createAndAddLinkRelativeToBase(curi, newUrl, Link.NAVLINK_HOP);
					}
				}else if(!lower.startsWith("mailto") && !lower.startsWith("javascript")){// case 2: site-relative link; mailto and javascript hrefs are ignored
					if(newUrl.startsWith("/")){
						newUrl = newUrl.substring(1).trim();
					}
					newUrl = "http://www.ccer.pku.edu.cn/cn/" + newUrl;// prepend the site prefix to get a complete URL
					createAndAddLinkRelativeToBase(curi, newUrl, Link.NAVLINK_HOP);
				}
			}
		} finally {
			try {
				cs.close();// release the replay buffer
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}

	private void createAndAddLinkRelativeToBase(CrawlURI curi, String newUrl, char hopType){
		try {
			curi.createAndAddLinkRelativeToBase(newUrl, "", hopType);
		} catch (URIException e) {
			e.printStackTrace();
		}
	}
}
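
To try the filtering rules without starting a crawl, they can be pulled out into a small standalone class. The following is only a sketch under a few assumptions: HrefFilterSketch and its method names are made up for illustration and are not part of Heritrix, and the href value is trimmed before it is checked. It reuses the same regular expression and the same two prefixes as CCERExtractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone sketch of the CCERExtractor filtering rules (hypothetical class, not part of Heritrix). */
final class HrefFilterSketch {
    // Same regular expression as CCERExtractor.pattern_ahref.
    private static final Pattern A_HREF = Pattern.compile("<[aA] href=\"([^\"]+)\"");
    private static final String ALLOWED_ABSOLUTE_PREFIX = "http://www.pku.edu.cn";
    private static final String RELATIVE_BASE = "http://www.ccer.pku.edu.cn/cn/";

    /** Returns the URL to crawl for one href value, or null when the link is filtered out. */
    static String filter(String href) {
        String url = href.trim(); // assumption: surrounding whitespace is not significant
        String lower = url.toLowerCase();
        if (url.startsWith("http")) {
            // Absolute link: keep it only when it is under the allowed prefix.
            return url.startsWith(ALLOWED_ABSOLUTE_PREFIX) ? url : null;
        }
        if (lower.startsWith("mailto") || lower.startsWith("javascript")) {
            return null; // not a page that can be fetched
        }
        if (url.startsWith("/")) {
            url = url.substring(1).trim();
        }
        return RELATIVE_BASE + url; // site-relative link made complete
    }

    /** Applies filter() to every href found in the given HTML text. */
    static List<String> extract(String html) {
        List<String> kept = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            String url = filter(m.group(1));
            if (url != null) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/news/1.html\">n</a> <a href=\"mailto:x@pku.edu.cn\">m</a> "
                + "<A href=\"http://www.pku.edu.cn/a\">p</A> <a href=\"http://example.com/\">e</a>";
        System.out.println(extract(html));
    }
}
```

Of the four links in the sample, only the site-relative one and the one under http://www.pku.edu.cn are kept; the mailto link and the external link are dropped.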

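Add the new extractor to Processor.options under the modules directory, and it will then show up as an option during configuration. Note, however: [b]the crawler follows the configuration strictly when crawling, so CCERExtractor must be placed after ExtractorHTTP[/b]. Its position inside Processor.options does not matter as long as it sits among the extractors; no particular order is required there.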

[img]http://dl.iteye.com/upload/attachment/349208/3ca33e78-97b8-38cb-ba6f-d188bb3a8fd6.jpg[/img]

[b]2 The single-thread problem[/b] (see my earlier post: [url]http://hanyuanbo.iteye.com/blog/788177[/url])

[b]3 Machine memory limits[/b] (in the run configuration, set the VM arguments to -Xmx512m; see the attachment)
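
A quick way to confirm that the -Xmx setting actually reached the JVM is to print the maximum heap size from a tiny class started with the same VM arguments. This is just a sketch; HeapCheck is a hypothetical helper name, not something shipped with Heritrix.

```java
/** Prints the maximum heap the current JVM may use (hypothetical helper, not part of Heritrix). */
final class HeapCheck {
    /** Maximum heap size in MiB as reported by the running JVM. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

With -Xmx512m the printed figure should be close to 512; depending on the JVM it may be somewhat lower.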

[b]4 The robots.txt restriction[/b]
robots.txt is a "gentlemen's agreement". Ignoring it can make the crawl more efficient (although some sites may not want to be crawled so unrestrainedly). The change is simple: in

org.archive.crawler.prefetch.PreconditionEnforcer

change

private boolean considerRobotsPreconditions(CrawlURI curi) {
         ...
}

to

private boolean considerRobotsPreconditions(CrawlURI curi) {
        return false;// never take the robots.txt restriction into account
}
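
If hard-coding the bypass feels too blunt, one possible variation is to make it switchable, so that the original check still runs unless a flag is set. The sketch below only shows that decision logic in isolation: the class name and the system property crawler.ignoreRobots are hypothetical and are not Heritrix settings, and originalResult stands for whatever the unmodified method would have returned.

```java
/** Sketch of a switchable robots.txt bypass (hypothetical names, not part of Heritrix). */
final class RobotsSwitchSketch {
    /** Hypothetical system property; set with -Dcrawler.ignoreRobots=true. */
    static final String IGNORE_ROBOTS_PROPERTY = "crawler.ignoreRobots";

    /**
     * @param originalResult what the unmodified robots check would have returned
     * @return false when robots.txt is to be ignored, otherwise originalResult
     */
    static boolean considerRobotsPreconditions(boolean originalResult) {
        if (Boolean.getBoolean(IGNORE_ROBOTS_PROPERTY)) {
            return false; // same effect as the hard-coded "return false"
        }
        return originalResult;
    }

    public static void main(String[] args) {
        // Pretend the original check wanted to hold the URI back.
        System.out.println(considerRobotsPreconditions(true));
    }
}
```

Started without the property, the class behaves like the original check; started with -Dcrawler.ignoreRobots=true, it behaves like the modified method.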

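[b]5 Sometimes the drop-down menus of selectable options are missing during configuration![/b] (In the Heritrix run configuration, go to the classpath settings, select User Entries, click Advanced on the right, choose the external folder option, and select the conf directory. See the attachment.)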

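With that, the crawl can begin. Configure the job as described in my earlier posts, and remember to select CCERExtractor when choosing extractors and to place it after ExtractorHTML. The crawl proceeds as follows: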

[img]http://dl.iteye.com/upload/attachment/350111/fc2a7fa9-3e20-3924-9aaf-5d925d888db8.jpg[/img]

[img]http://dl.iteye.com/upload/attachment/350113/aa8ce58b-cb3d-3fa9-a127-9970b2213515.jpg[/img]

[img]http://dl.iteye.com/upload/attachment/350115/4bf745ac-32c7-39e6-8aee-79f863fb3b4b.jpg[/img]

[img]http://dl.iteye.com/upload/attachment/350117/8050e119-d8e9-3a69-82bc-f410b67fc600.jpg[/img]
