Reposted from http://blog.csdn.net/songzhen640/archive/2008/07/16/2662443.aspx
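Running Heritrix through CrawlController (in the background): a code implementation
My understanding may not be entirely correct, but I did get it working.
In the job folder you can edit order.xml and add a seeds.txt; the crawled content is then written to that same folder.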
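As a rough illustration of the seeds.txt step, here is a minimal sketch that writes one seed URL per line into the job folder. It assumes order.xml refers to a file named seeds.txt sitting next to it, as the layout above suggests. SeedsWriter, writeSeeds, the default "jobs/car" folder and the example.com seed are hypothetical names chosen for this sketch; none of them come from Heritrix.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Heritrix: writes seed URLs to seeds.txt.
class SeedsWriter {

    // Creates the job folder if needed and writes one seed URL per line.
    static Path writeSeeds(Path jobDir, List<String> seedUrls) throws IOException {
        Files.createDirectories(jobDir);
        Path seeds = jobDir.resolve("seeds.txt");
        return Files.write(seeds, seedUrls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the job folder is passed as the first argument.
        Path jobDir = Paths.get(args.length > 0 ? args[0] : "jobs/car");
        Path written = writeSeeds(jobDir, Arrays.asList("http://example.com/"));
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```

Writing the file from code is optional; editing seeds.txt by hand works just as well.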
package main;

import java.io.File;

import javax.management.InvalidAttributeValueException;

import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.framework.exceptions.InitializationException;
import org.archive.crawler.settings.XMLSettingsHandler;

public class RunMain
{
    public static void main(String[] args)
            throws InvalidAttributeValueException, InitializationException
    {
        // Wrap order.xml in a File instance
        File order = new File(
                "F:/web/eclipse/workspace/heritrix/jobs/car/order.xml");
        // Load the crawl order (the job settings)
        XMLSettingsHandler xml = new XMLSettingsHandler(order);
        xml.initialize();
        // Create the controller and initialize it from the settings
        CrawlController crawl = new CrawlController();
        crawl.initialize(xml);
        // Start the crawl
        crawl.requestCrawlStart();
    }
}
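The listing hard-codes the path to order.xml and assumes the file is there. One possible refinement is to check the job folder before handing the file to XMLSettingsHandler. The sketch below uses only the standard library. JobFiles, requireOrderFile and the default "jobs/car" folder are hypothetical names invented for this example, not Heritrix API.

```java
import java.io.File;
import java.io.FileNotFoundException;

// Hypothetical helper, not part of Heritrix: locates order.xml in a job folder.
class JobFiles {

    // Returns the order.xml inside jobDir, or fails early with a clear message.
    static File requireOrderFile(File jobDir) throws FileNotFoundException {
        File order = new File(jobDir, "order.xml");
        if (!order.isFile()) {
            throw new FileNotFoundException(
                    "Missing order file: " + order.getAbsolutePath());
        }
        return order;
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Assumption: the job folder is passed as the first argument.
        File jobDir = new File(args.length > 0 ? args[0] : "jobs/car");
        File order = requireOrderFile(jobDir);
        // This File could then be passed to new XMLSettingsHandler(order).
        System.out.println("Using order file " + order.getAbsolutePath());
    }
}
```

Taking the folder from the command line also means the same class can be reused for other jobs without recompiling.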