Heritrix: Choosing Seeds and Extending the Crawl

Latest recommended article published on 2024-09-16 14:55:55

dufei07

Category: Search Engines. Tags: JSP, Java, search engine, lucene, multithreading

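A search engine first needs a crawler to fetch web pages. I am using Heritrix, mainly because I have a copy of the book 《Heritrix+lucene构建自己的搜索引擎》 (Building Your Own Search Engine with Heritrix + Lucene) on hand. The more reference material I have, the fewer difficulties I should run into.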

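For the past few days I have been thinking about which topic to pick for a vertical search engine, and I finally settled on cars. I have little experience and not much time left, so the plan for the first phase is to support searching detailed car specifications.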

The site I chose is PCauto (太平洋汽车网).

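First I found a page from which every car can be reached: "http://price.pcauto.com.cn/serial_config.jsp?sid=3178". Its left sidebar contains a tree that lists every brand and model. I saved this page locally so that all the useful URLs in it could be extracted.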

The code that processes the page is as follows:

package tool;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetUrl {

	// Compiled once; the dots are escaped so that they match literally.
	private static final Pattern SERIAL_URL = Pattern
			.compile("http://price\\.pcauto\\.com\\.cn/serial\\.jsp\\?sid=\\d+");

	public static void main(String[] args) {
		// try-with-resources closes the reader even if reading fails
		try (BufferedReader br = new BufferedReader(
				new FileReader("d:\\aa.html"))) {
			String line;
			while ((line = br.readLine()) != null) {
				parse(line);
			}
		} catch (IOException e) {
			// also covers FileNotFoundException, a subclass of IOException
			e.printStackTrace();
		}
	}

	// Print every series URL found on the given line.
	private static void parse(String line) {
		Matcher m = SERIAL_URL.matcher(line);
		while (m.find()) {
			System.out.println(m.group());
		}
	}

}

It uses a regular expression to extract every link of the form http://price.pcauto.com.cn/serial.jsp?sid=<digits>. The output is:

http://price.pcauto.com.cn/serial.jsp?sid=374

http://price.pcauto.com.cn/serial.jsp?sid=93

http://price.pcauto.com.cn/serial.jsp?sid=3178

http://price.pcauto.com.cn/serial.jsp?sid=1603

http://price.pcauto.com.cn/serial.jsp?sid=2163

http://price.pcauto.com.cn/serial.jsp?sid=3104

......................................................................

There are 866 links in total.
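Since these links are about to be used as seeds, the extraction step can also collect them into a list with repeats removed instead of printing each match as it is found. The following is only a minimal sketch under assumptions: the class name SeedCollector and the method collect are hypothetical names made up for illustration, the sample HTML in main is invented, and whether the 866 links above contain repeats is not stated.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
public class SeedCollector {

    // Same URL shape as in GetUrl, with the dots escaped.
    private static final Pattern SERIAL_URL = Pattern.compile(
            "http://price\\.pcauto\\.com\\.cn/serial\\.jsp\\?sid=\\d+");

    // Returns each distinct series URL once, in order of first appearance.
    public static List<String> collect(String html) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = SERIAL_URL.matcher(html);
        while (m.find()) {
            seen.add(m.group());
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Invented sample input; the real input would be the saved page.
        String sample =
                "<a href=\"http://price.pcauto.com.cn/serial.jsp?sid=374\">A</a>"
                + "<a href=\"http://price.pcauto.com.cn/serial.jsp?sid=93\">B</a>"
                + "<a href=\"http://price.pcauto.com.cn/serial.jsp?sid=374\">A again</a>";
        // One URL per line, a layout that can be pasted into a seed list.
        for (String url : collect(sample)) {
            System.out.println(url);
        }
    }
}
```

A LinkedHashSet is used so that the seeds keep the order in which they appear on the page, which makes the list easy to compare against the sidebar tree by eye.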

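The next step is to crawl with these 866 links as the Heritrix seeds. Some preparation is needed before crawling,

because if Heritrix does not filter the links it discovers, the later work cannot proceed.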

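1: Heritrix has several extension points. Here I extend FrontierScheduler, a PostProcessor whose job is to add the links found by the Extractors to the Frontier. I wrote CarFrontier, a class that extends FrontierScheduler, to filter the links for this project: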

package userP;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.postprocessor.FrontierScheduler;

public class CarFrontier extends FrontierScheduler {
	private static final long serialVersionUID = 1L;

	// Model pages look like http://price.pcauto.com.cn/m<digits>/ .
	// The dots are escaped so that they match literally.
	private static final Pattern MODEL_PAGE = Pattern
			.compile("http://price\\.pcauto\\.com\\.cn/m\\d+/");

	public CarFrontier(String name) {
		super(name);
	}

	// Schedule only model pages, plus the robots.txt and dns: URIs
	// that Heritrix fetches as prerequisites; drop everything else.
	protected void schedule(CandidateURI caUri) {
		String url = caUri.toString();
		Matcher m = MODEL_PAGE.matcher(url);
		if (m.matches() || url.indexOf("robots.txt") != -1
				|| url.indexOf("dns:") != -1) {
			this.getController().getFrontier().schedule(caUri);
		}
	}

}

After extending FrontierScheduler, add the newly written CarFrontier class to Processor.options, and then select CarFrontier under post processors on the Modules page of the web console.
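The filtering rule in CarFrontier.schedule() can be tried out without starting Heritrix. The sketch below copies only the condition into a plain class so that it can be run against a few URLs. The class name CarUrlFilter, the method accept, and the sample URLs (including the model id 1234) are hypothetical and chosen just for illustration, and the dots in the pattern are escaped here as an assumption about the intended match.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone copy of the CarFrontier condition.
public class CarUrlFilter {

    // Model pages look like http://price.pcauto.com.cn/m<digits>/
    private static final Pattern MODEL_PAGE = Pattern.compile(
            "http://price\\.pcauto\\.com\\.cn/m\\d+/");

    // True means the URL would be passed on to the Frontier.
    public static boolean accept(String url) {
        return MODEL_PAGE.matcher(url).matches() // the whole string must match
                || url.contains("robots.txt")
                || url.contains("dns:");
    }

    public static void main(String[] args) {
        // Invented sample URLs covering each branch of the condition.
        String[] samples = {
            "http://price.pcauto.com.cn/m1234/",
            "http://price.pcauto.com.cn/m1234/index.html",
            "http://price.pcauto.com.cn/serial.jsp?sid=374",
            "http://price.pcauto.com.cn/robots.txt",
            "dns:price.pcauto.com.cn",
        };
        for (String url : samples) {
            System.out.println(accept(url) + "  " + url);
        }
    }
}
```

Because matches() requires the entire URL to fit the pattern, a deeper path below a model page is rejected. If such pages are wanted as well, the pattern would need to be loosened, for example by allowing extra characters after the trailing slash.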

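2: After starting the crawl I found it very slow. I suspected that handling robots.txt was wasting too much time, so I removed the robots.txt restriction in the prefetcher. To do this, find org.archive.crawler.prefetch.PreconditionEnforcer.considerRobotsPreconditions(), comment out the entire method body, and end it with return false. After that, robots.txt no longer appears among the fetched files.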

I started the crawl again, and it was still quite slow.

The thread count is 1 or 0, and the average speed is about 7 to 8 KB/s.