Nutch Source Code Analysis: Injector (continued)

2021SC@SDUSC

This post continues the analysis from the previous one.
InjectMapper::map->processMetaData

private void processMetaData(String metadata, CrawlDatum datum,
    String url) {

  // metadata entries are separated by tabs
  String[] splits = metadata.split(TAB_CHARACTER);
  for (String split : splits) {
    // each entry has the form "name=value"
    int indexEquals = split.indexOf(EQUAL_CHARACTER);
    if (indexEquals == -1) {
      // skip anything without a '=' (not a metadata entry)
      continue;
    }

    String metaname = split.substring(0, indexEquals);
    String metavalue = split.substring(indexEquals + 1);

    if (metaname.equals(nutchScoreMDName)) {
      // custom initial score for this URL
      datum.setScore(Float.parseFloat(metavalue));
    } else if (metaname.equals(nutchFetchIntervalMDName)) {
      // custom fetch interval for this URL
      datum.setFetchInterval(Integer.parseInt(metavalue));
    } else if (metaname.equals(nutchFixedFetchIntervalMDName)) {
      // fixed fetch interval: also recorded in the datum's metadata
      int fixedInterval = Integer.parseInt(metavalue);
      if (fixedInterval > -1) {
        datum.getMetaData().put(Nutch.WRITABLE_FIXED_INTERVAL_KEY,
            new FloatWritable(fixedInterval));
        datum.setFetchInterval(fixedInterval);
      }
    } else {
      // everything else is stored as generic metadata on the datum
      datum.getMetaData().put(new Text(metaname), new Text(metavalue));
    }
  }
}

TAB_CHARACTER defaults to "\t" and EQUAL_CHARACTER defaults to "=". processMetaData splits the metadata string on TAB_CHARACTER into individual entries; each entry is then split at the equals sign into a property name (metaname) and a property value (metavalue), which are stored into the CrawlDatum. A small standalone sketch of this parsing is shown below.
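To make the splitting concrete, here is a minimal standalone sketch (not Nutch code) that applies the same tab/equals parsing to the metadata portion of a seed line; the key names nutch.score and nutch.fetchInterval are assumed here as the usual defaults of nutchScoreMDName and nutchFetchIntervalMDName.

import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch: parse one seed line's metadata the same way
// processMetaData does. Key names are illustrative assumptions.
public class MetaParseDemo {
  public static void main(String[] args) {
    String metadata = "nutch.score=2.5\tnutch.fetchInterval=86400\tcustom.key=hello";
    Map<String, String> parsed = new LinkedHashMap<>();
    for (String split : metadata.split("\t")) {
      int indexEquals = split.indexOf("=");
      if (indexEquals == -1) {
        continue; // skip anything without '='
      }
      parsed.put(split.substring(0, indexEquals), split.substring(indexEquals + 1));
    }
    System.out.println(parsed);
    // prints: {nutch.score=2.5, nutch.fetchInterval=86400, custom.key=hello}
  }
}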

After the map function has finished, the Hadoop framework then calls InjectReducer's reduce function to continue processing:

InjectReducer::reduce

public void reduce(Text key, Iterable<CrawlDatum> values, Context context)
    throws IOException, InterruptedException {

  boolean injectedSet = false;
  boolean oldSet = false;
  for (CrawlDatum val : values) {
    if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
      // freshly injected entry: mark it as not yet fetched
      injected.set(val);
      injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
      injectedSet = true;
    } else {
      // entry already present in the crawldb
      old.set(val);
      oldSet = true;
    }
  }

  CrawlDatum result;
  if (injectedSet && (!oldSet || overwrite)) {
    // no existing entry, or overwriting is enabled: take the injected datum
    result = injected;
  } else {
    // keep the existing entry
    result = old;

    if (injectedSet && update) {
      // merge injected metadata and any customized score/interval into it
      old.putAllMetaData(injected);
      old.setScore(injected.getScore() != scoreInjected
          ? injected.getScore() : old.getScore());
      old.setFetchInterval(injected.getFetchInterval() != interval
          ? injected.getFetchInterval() : old.getFetchInterval());
    }
  }
  context.write(key, result);
}

In short, the reduce function either replaces the CrawlDatum previously stored for a URL, or merely updates the corresponding fields of the existing CrawlDatum via putAllMetaData, setScore and setFetchInterval without rewriting it. Which path is taken depends on the overwrite and update flags, which come from the job configuration (see the setup() sketch below).
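The fields overwrite, update, interval and scoreInjected used in reduce are initialized from the job configuration. The following is a hedged sketch of what InjectReducer.setup() roughly looks like; the property names follow nutch-default.xml (db.injector.overwrite, db.injector.update, db.fetch.interval.default, db.score.injected), and the exact default values should be treated as assumptions.

// Hedged sketch of InjectReducer.setup(); field names mirror the reduce()
// excerpt above, property names follow nutch-default.xml.
@Override
protected void setup(Context context) {
  Configuration conf = context.getConfiguration();
  // replace existing crawldb entries with injected ones?
  overwrite = conf.getBoolean("db.injector.overwrite", false);
  // merge injected metadata/score/interval into existing entries?
  update = conf.getBoolean("db.injector.update", false);
  // defaults, used in reduce() to detect whether injected values were customized
  interval = conf.getInt("db.fetch.interval.default", 2592000);
  scoreInjected = conf.getFloat("db.score.injected", 1.0f);
}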

Once reduce has completed successfully, the results are written out to the HDFS file system (into the tempCrawlDb directory registered earlier). Let's take a quick look at how a CrawlDatum is written: CrawlDatum implements Hadoop's WritableComparable interface and therefore provides a write function.

CrawlDatum::write

public void write(DataOutput out) throws IOException {
  out.writeByte(CUR_VERSION); // store current version
  out.writeByte(status);
  out.writeLong(fetchTime);
  out.writeByte(retries);
  out.writeInt(fetchInterval);
  out.writeFloat(score);
  out.writeLong(modifiedTime);
  if (signature == null) {
    out.writeByte(0);
  } else {
    out.writeByte(signature.length);
    out.write(signature);
  }
  if (metaData != null && metaData.size() > 0) {
    out.writeBoolean(true);
    metaData.write(out);
  } else {
    out.writeBoolean(false);
  }
}
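Hadoop itself invokes write when it serializes each datum into the output SequenceFile, but the same path can be exercised by hand. A minimal sketch, assuming a standard Nutch dependency on the classpath:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.nutch.crawl.CrawlDatum;

// Sketch only: serialize a CrawlDatum manually to see write() in action.
public class CrawlDatumWriteDemo {
  public static void main(String[] args) throws Exception {
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, 2592000);
    datum.setScore(1.0f);

    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    datum.write(new DataOutputStream(bytes)); // version, status, fetchTime, ...
    System.out.println("serialized CrawlDatum: " + bytes.size() + " bytes");
  }
}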

Now return to CrawlDb's install function. After Hadoop has finished processing the data, this function is called to perform the final step:

public static void install(Job job, Path crawlDb) throws IOException {
  Configuration conf = job.getConfiguration();
  boolean preserveBackup = conf.getBoolean("db.preserve.backup", true);
  FileSystem fs = FileSystem.get(conf);
  Path old = new Path(crawlDb, "old");
  Path current = new Path(crawlDb, CURRENT_NAME);
  Path tempCrawlDb = org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
      .getOutputPath(job);
  FSUtils.replace(fs, old, current, true);
  FSUtils.replace(fs, current, tempCrawlDb, true);
  Path lock = new Path(crawlDb, LOCK_NAME);
  LockUtil.removeLockFile(fs, lock);
  if (!preserveBackup && fs.exists(old)) {
    fs.delete(old, true);
  }
}

public static void replace(FileSystem fs, Path current, Path replacement,
    boolean removeOld) throws IOException {

  Path old = new Path(current + ".old");
  if (fs.exists(current)) {
    fs.rename(current, old);
  }

  fs.rename(replacement, current);
  if (fs.exists(old) && removeOld) {
    fs.delete(old, true);
  }
}

public static boolean removeLockFile(FileSystem fs, Path lockFile)
    throws IOException {
  return fs.delete(lockFile, false);
}

The install function replaces the old directory with the previous current directory, replaces current with the newly produced tempCrawlDb (the "crawldb-<random number>" directory), and then removes the lock file. By default (db.preserve.backup = true) the previous data is kept under old as a backup.
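To make the double FSUtils.replace concrete, here is a trace of the renames it performs, derived directly from the code above; "crawldb-<random>" stands for the job's temporary output directory.

FSUtils.replace(fs, old, current, true)
  old              -> old.old   (only if an old/ directory already exists)
  current          -> old
  old.old          -> deleted
FSUtils.replace(fs, current, tempCrawlDb, true)
  crawldb-<random> -> current
  (current was already renamed away above, so nothing is moved to current.old)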
