Nutch Code Analysis, Part 3: crawl.Injector, the mapper

2021SC@SDUSC

The Javadoc comment on Injector

 * Injector takes a flat text file of URLs (or a folder containing text files)
 * and merges ("injects") these URLs into the CrawlDb. Useful for bootstrapping
 * a Nutch crawl. The URL files contain one URL per line, optionally followed by
 * custom metadata separated by tabs with the metadata key separated from the
 * corresponding value by '='.
 * <p>

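What does this comment mean? It says that the Injector takes a batch of URLs (a seed URL set prepared beforehand) and injects them into the CrawlDb. The inject job does the following: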

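    1. Read the seed file (url.txt).
    2. Normalize each URL.
    3. Filter each URL against the regex rules (regex-urlfilter.txt).
    4. For every URL that passes, the map step builds a <url, CrawlDatum> pair and gives the CrawlDatum an initial score. The score can influence search ranking and fetch priority.
    5. The reduce step does one thing: it checks whether the URL already exists in the CrawlDb. If it does, the existing CrawlDatum is kept; if the URL is new, it is stored with status STATUS_DB_UNFETCHED (meaning it has not been fetched yet).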
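
The merge rule in step 5 can be sketched in plain Java. This is a simplified illustration, not Nutch's reducer: the class `InjectMergeSketch`, the `Status` enum and the `merge` method are hypothetical names, and a `Map` stands in for the CrawlDb.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class InjectMergeSketch {
    // Hypothetical stand-ins for CrawlDatum status values (not Nutch's constants).
    enum Status { DB_UNFETCHED, DB_FETCHED }

    /**
     * Merge injected URLs into a copy of the existing db.
     * A URL that is already known keeps its old record;
     * a new URL is stored as DB_UNFETCHED.
     */
    static Map<String, Status> merge(Map<String, Status> crawlDb,
                                     Iterable<String> injectedUrls) {
        Map<String, Status> merged = new HashMap<>(crawlDb);
        for (String url : injectedUrls) {
            // putIfAbsent leaves an existing entry untouched
            merged.putIfAbsent(url, Status.DB_UNFETCHED);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Status> db = new HashMap<>();
        db.put("http://example.com/", Status.DB_FETCHED);

        Map<String, Status> out = merge(db,
            Arrays.asList("http://example.com/", "http://example.org/"));

        System.out.println(out.get("http://example.com/")); // DB_FETCHED
        System.out.println(out.get("http://example.org/")); // DB_UNFETCHED
    }
}
```

The already-fetched URL keeps its record, so injecting the same seed list twice does not reset crawl state.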
 

public class Injector extends Configured implements Tool {
  public static final Logger LOG = LoggerFactory.getLogger(Injector.class);
  
  /** metadata key reserved for setting a custom score for a specific URL */
  public static String nutchScoreMDName = "nutch.score";
  /** metadata key reserved for setting a custom fetchInterval for a specific URL */
  public static String nutchFetchIntervalMDName = "nutch.fetchInterval";

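  Class Injector: filters each URL and assigns it a score. The reserved metadata keys above let a seed line override the score and the fetch interval for a specific URL, that is, for whichever URL carries that key on its line (not, as I first assumed, an invalid URL).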

public static class InjectMapper implements Mapper<WritableComparable, Text, Text, CrawlDatum> {
    private URLNormalizers urlNormalizers;
    private int interval;
    private float scoreInjected;
    private JobConf jobConf;
    private URLFilters filters;
    private ScoringFilters scfilters;
    private long curTime;
 
    public void configure(JobConf job) {
      this.jobConf = job;
      urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_INJECT);
      interval = jobConf.getInt("db.fetch.interval.default", 2592000);
      filters = new URLFilters(jobConf);
      scfilters = new ScoringFilters(jobConf);
      scoreInjected = jobConf.getFloat("db.score.injected", 1.0f);
      curTime = job.getLong("injector.current.time", System.currentTimeMillis());
    }
 
    public void close() {}
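
The fallback value 2592000 passed to getInt in configure is a number of seconds. As a quick check of what that means in days, here is a tiny sketch; the class name `IntervalDefaults` and the helper `secondsToDays` are made up for this example.

```java
import java.time.Duration;

public class IntervalDefaults {
    /** Convert a fetch interval given in seconds to whole days. */
    static long secondsToDays(long seconds) {
        return Duration.ofSeconds(seconds).toDays();
    }

    public static void main(String[] args) {
        // Fallback used for db.fetch.interval.default in configure()
        System.out.println(secondsToDays(2592000)); // 30
    }
}
```

So a URL injected without a custom interval is due for re-fetch after 30 days, unless the configuration overrides db.fetch.interval.default.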

 Class InjectMapper: its map method parses each seed line, then normalizes and filters the URL.

public void map(WritableComparable key, Text value,
                    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
      String url = value.toString();              // value is line of text
      // skip comment lines, i.e. lines that start with #
      if (url != null && url.trim().startsWith("#")) {
          /* Ignore lines that start with # */
          return;
      }

      // if tabs : metadata that could be stored
      // must be name=value and separated by \t
      float customScore = -1f;
      int customInterval = interval;
      int fixedInterval = -1;
      Map<String,String> metadata = new TreeMap<String,String>();
      // split the line on tabs: the URL comes first, then name=value pairs
      if (url.indexOf("\t")!=-1){
    	  String[] splits = url.split("\t");
    	  url = splits[0];
    	  for (int s=1;s<splits.length;s++){
    		  // find separation between name and value
    		  int indexEquals = splits[s].indexOf("=");
    		  if (indexEquals==-1) {
    			  // skip anything without a =
    			  continue;		    
    		  }
                  // extract the metadata name and value
    		  String metaname = splits[s].substring(0, indexEquals);
    		  String metavalue = splits[s].substring(indexEquals+1);
                  // check whether the name is one of the reserved metadata keys
    		  if (metaname.equals(nutchScoreMDName)) {
    			  try {
    			  customScore = Float.parseFloat(metavalue);}
    			  catch (NumberFormatException nfe){}
    		  }
                  else if (metaname.equals(nutchFetchIntervalMDName)) {
                          try {
                                  customInterval = Integer.parseInt(metavalue);}
                          catch (NumberFormatException nfe){}
                  }
                  else if (metaname.equals(nutchFixedFetchIntervalMDName)) {
                          try {
                                  fixedInterval = Integer.parseInt(metavalue);}
                          catch (NumberFormatException nfe){}
                  }
                  // otherwise keep it as ordinary metadata
    		  else metadata.put(metaname,metavalue);
    	  }
      }
      try {
        // normalize the url, then filter it
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        url = filters.filter(url);             // filter the url
      } catch (Exception e) {
        if (LOG.isWarnEnabled()) { LOG.warn("Skipping " +url+":"+e); }
        url = null;
      }
      if (url == null) {
        reporter.getCounter("injector", "urls_filtered").increment(1);
      } else {                                   // if it passes
        value.set(url);                           // collect it
        // create a CrawlDatum to hold the status, metadata, score and other bookkeeping for this url
        CrawlDatum datum = new CrawlDatum();
        datum.setStatus(CrawlDatum.STATUS_INJECTED);

        // Is interval custom? Then set as meta data
        if (fixedInterval > -1) {
          // Set writable using float. Float is used by AdaptiveFetchSchedule
          datum.getMetaData().put(Nutch.WRITABLE_FIXED_INTERVAL_KEY, new FloatWritable(fixedInterval));
          datum.setFetchInterval(fixedInterval);
        } else {
          datum.setFetchInterval(customInterval);
        }

        datum.setFetchTime(curTime);
        // now add the metadata
        Iterator<String> keysIter = metadata.keySet().iterator();
        while (keysIter.hasNext()){
        	String keymd = keysIter.next();
        	String valuemd = metadata.get(keymd);
        	datum.getMetaData().put(new Text(keymd), new Text(valuemd));
        }
        if (customScore != -1) datum.setScore(customScore);
        else datum.setScore(scoreInjected);
        try {
               // let the scoring filters initialize the score
        	scfilters.injectedScore(value, datum);
        } catch (ScoringFilterException e) {
        	if (LOG.isWarnEnabled()) {
        		LOG.warn("Cannot filter injected score for url " + url
        				+ ", using default (" + e.getMessage() + ")");
        	}
        }
        reporter.getCounter("injector", "urls_injected").increment(1);
        output.collect(value, datum);
      }
    }
  }
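
To try the seed-line parsing on its own, without Hadoop or Nutch on the classpath, the first half of map can be approximated as below. This is a reduced sketch under stated assumptions: `SeedLineSketch`, `Seed` and `parse` are hypothetical names, the fixed-interval key is left out, and normalization, filtering and scoring are skipped entirely. Only the key strings "nutch.score" and "nutch.fetchInterval" come from the Injector source shown above.

```java
import java.util.Map;
import java.util.TreeMap;

public class SeedLineSketch {
    /** Hypothetical holder for the parsed pieces of one seed line. */
    static final class Seed {
        final String url;
        final float score;                  // -1 means "no custom score"
        final int interval;                 // fetch interval in seconds
        final Map<String, String> metadata;

        Seed(String url, float score, int interval, Map<String, String> metadata) {
            this.url = url;
            this.score = score;
            this.interval = interval;
            this.metadata = metadata;
        }
    }

    /** Returns null for comment lines, otherwise the parsed seed. */
    static Seed parse(String line, int defaultInterval) {
        if (line.trim().startsWith("#")) {
            return null;                    // comment line, ignored
        }
        String[] parts = line.split("\t");
        float score = -1f;
        int interval = defaultInterval;
        Map<String, String> metadata = new TreeMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq == -1) {
                continue;                   // skip anything without '='
            }
            String name = parts[i].substring(0, eq);
            String value = parts[i].substring(eq + 1);
            try {
                if (name.equals("nutch.score")) {
                    score = Float.parseFloat(value);
                } else if (name.equals("nutch.fetchInterval")) {
                    interval = Integer.parseInt(value);
                } else {
                    metadata.put(name, value);
                }
            } catch (NumberFormatException e) {
                // unparsable number: keep the default, as the mapper does
            }
        }
        return new Seed(parts[0], score, interval, metadata);
    }

    public static void main(String[] args) {
        Seed s = parse(
            "http://example.com/\tnutch.score=2.5\tnutch.fetchInterval=86400\tlang=en",
            2592000);
        System.out.println(s.url + " " + s.score + " " + s.interval + " " + s.metadata);
        // prints: http://example.com/ 2.5 86400 {lang=en}
    }
}
```

A line with no tabs yields just the URL with the default interval and an empty metadata map, which matches the branch in map that is skipped when indexOf("\t") returns -1.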
 
