Nutch Study Notes 2: Injector, the Data Download Entry Point

org.apache.nutch.crawl.Injector

public class Injector extends Configured implements Tool

From the class it extends and the interface it implements, we can see that Injector wraps Hadoop: its constructor initializes the Hadoop configuration object Configuration (for the internals of Configuration, see the post "Hadoop Study Notes 1: Configuration"). This is one of the ways Nutch encapsulates Hadoop.
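As a rough illustration of this pattern (a hypothetical tool, not the actual Nutch source), a class that extends Configured and implements Tool is normally driven through ToolRunner, which parses the generic Hadoop options, hands the Configuration to the tool via setConf(), and then calls run():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical tool, used only to illustrate the Configured/Tool pattern that Injector follows.
public class DemoTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration that ToolRunner passed in through setConf()
    Configuration conf = getConf();
    System.out.println("mapred.temp.dir = " + conf.get("mapred.temp.dir", "."));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner handles generic options such as -D key=value before calling run()
    int exitCode = ToolRunner.run(new Configuration(), new DemoTool(), args);
    System.exit(exitCode);
  }
}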

Injector has two fields:

/** metadata key reserved for setting a custom score for a specific URL */
public static String nutchScoreMDName = "nutch.score";

/** metadata key reserved for setting a custom fetchInterval for a specific URL */
public static String nutchFetchIntervalMDName = "nutch.fetchInterval";
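These two keys are reserved metadata names that a seed file can use to override the default score and fetch interval of an individual URL. As an illustrative example (the URL and values here are placeholders), a line in a seed file under urlDir may carry tab-separated name=value pairs after the URL, which InjectMapper then applies when it builds the CrawlDatum for that URL:

http://www.example.com/	nutch.score=2.5	nutch.fetchInterval=2592000

(the three fields are separated by tab characters)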

And one method:

The first parameter, crawlDb, is the path of the crawldb directory under the Nutch crawl output directory;

The second parameter, urlDir, is the directory that holds the seed URL list files to be fetched;

public void inject(Path crawlDb, Path urlDir)
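This inject() method is what the command-line inject step ends up invoking. With an assumed layout where crawl/crawldb is the crawl db path and urls is the seed directory, a typical invocation looks like:

bin/nutch inject crawl/crawldb urls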

public void inject(Path crawlDb, Path urlDir) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();

  // log the start of the injection
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: starting at " + sdf.format(start));
    LOG.info("Injector: crawlDb: " + crawlDb);
    LOG.info("Injector: urlDir: " + urlDir);
  }

  // create a randomly named temporary directory for the intermediate output
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") + "/inject-temp-"
      + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  // map text input file to a <url,CrawlDatum> file
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Converting injected urls to crawl db entries.");
  }

  // create the first job and set InjectMapper.class as its mapper
  JobConf sortJob = new NutchJob(getConf());
  sortJob.setJobName("inject " + urlDir);
  FileInputFormat.addInputPath(sortJob, urlDir);
  sortJob.setMapperClass(InjectMapper.class);

  // the map output goes to the temporary directory
  FileOutputFormat.setOutputPath(sortJob, tempDir);

  // set the job's output format, key/value classes, and the injection timestamp
  sortJob.setOutputFormat(SequenceFileOutputFormat.class);
  sortJob.setOutputKeyClass(Text.class);
  sortJob.setOutputValueClass(CrawlDatum.class);
  sortJob.setLong("injector.current.time", System.currentTimeMillis());

  // run the first job
  JobClient.runJob(sortJob);

  // merge with existing crawl db
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Merging injected urls into crawl db.");
  }

  JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
  FileInputFormat.addInputPath(mergeJob, tempDir);
  mergeJob.setReducerClass(InjectReducer.class);
  JobClient.runJob(mergeJob);
  CrawlDb.install(mergeJob, crawlDb);

  // clean up the temporary directory
  FileSystem fs = FileSystem.get(getConf());
  fs.delete(tempDir, true);

  long end = System.currentTimeMillis();
  LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: "
      + TimingUtil.elapsedTime(start, end));
}
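For completeness, here is a minimal, hedged sketch of calling inject() programmatically rather than through bin/nutch. It assumes the Configuration-taking constructor that Nutch's crawl tools commonly provide, and the crawl/crawldb and urls paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class InjectorDemo {
  public static void main(String[] args) throws Exception {
    // NutchConfiguration.create() loads nutch-default.xml and nutch-site.xml
    Configuration conf = NutchConfiguration.create();

    // assumption: Injector exposes a constructor that accepts a Configuration
    Injector injector = new Injector(conf);

    // placeholder paths: the existing crawl db and the directory of seed files
    injector.inject(new Path("crawl/crawldb"), new Path("urls"));
  }
}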
