Reading the Nutch2.2.3 Source: InjectorJob

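InjectorJob is the first module in a Nutch crawl. It reads the seed URL list, normalizes and filters each URL, assigns an initial score and fetch interval, and writes the resulting records to the web store.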

Nutch stores seed URLs one per line. Each line is a URL followed by optional tab-separated name=value pairs:
 http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

The program's entry point is:

  int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),args);

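NutchConfiguration is the class that builds Nutch's configuration and is not covered in detail here. The focus is the InjectorJob class. Inside it, a UrlMapper is defined that extends Mapper and implements the map function; there is no reduce function.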

 @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      urlNormalizers = new URLNormalizers(context.getConfiguration(),
      URLNormalizers.SCOPE_INJECT);
      interval = context.getConfiguration().getInt("db.fetch.interval.default",2592000);
      filters = new URLFilters(context.getConfiguration());
      scfilters = new ScoringFilters(context.getConfiguration());
      scoreInjected = context.getConfiguration().getFloat("db.score.injected",
          1.0f);
      curTime = context.getConfiguration().getLong("injector.current.time",
          System.currentTimeMillis());
    }

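This is the setup function of the Mapper, which initializes the fields that map uses later: the URL normalizers (urlNormalizers, scoped to inject), the default fetch interval (interval, read from db.fetch.interval.default; the fallback of 2592000 seconds is 30 days), the two filter chains (URLFilters and ScoringFilters), the default score for injected URLs (scoreInjected, read from db.score.injected, fallback 1.0), and the current time (curTime).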

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String url = value.toString().trim(); // value is line of text

      if (url != null && (url.length() == 0 || url.startsWith("#"))) {
        /* Ignore line that start with # */
        return;
      }

      // if tabs : metadata that could be stored
      // must be name=value and separated by \t
      float customScore = -1f;
      int customInterval = interval;
      Map<String, String> metadata = new TreeMap<String, String>();
      if (url.indexOf("\t") != -1) {
        String[] splits = url.split("\t");
        url = splits[0];
        for (int s = 1; s < splits.length; s++) {
          // find separation between name and value
          int indexEquals = splits[s].indexOf("=");
          if (indexEquals == -1) {
            // skip anything without a =
            continue;
          }
          String metaname = splits[s].substring(0, indexEquals);
          String metavalue = splits[s].substring(indexEquals + 1);
          if (metaname.equals(nutchScoreMDName)) {
            try {
              customScore = Float.parseFloat(metavalue);
            } catch (NumberFormatException nfe) {
            }
          } else if (metaname.equals(nutchFetchIntervalMDName)) {
            try {
              customInterval = Integer.parseInt(metavalue);
            } catch (NumberFormatException nfe) {
            }
          } else
            metadata.put(metaname, metavalue);
        }
      }
      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        url = filters.filter(url); // filter the url
      } catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        url = null;
      }
      if (url == null) {
        context.getCounter("injector", "urls_filtered").increment(1);
        return;
      } else { // if it passes
        String reversedUrl = TableUtil.reverseUrl(url); // collect it
        WebPage row = WebPage.newBuilder().build();
        row.setFetchTime(curTime);
        row.setFetchInterval(customInterval);

        // now add the metadata
        Iterator<String> keysIter = metadata.keySet().iterator();
        while (keysIter.hasNext()) {
          String keymd = keysIter.next();
          String valuemd = metadata.get(keymd);
          row.getMetadata().put(new Utf8(keymd),
              ByteBuffer.wrap(valuemd.getBytes()));
        }

        if (customScore != -1)
          row.setScore(customScore);
        else
          row.setScore(scoreInjected);

        try {
          scfilters.injectedScore(url, row);
        } catch (ScoringFilterException e) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Cannot filter injected score for url " + url
                + ", using default (" + e.getMessage() + ")");
          }
        }
        context.getCounter("injector", "urls_injected").increment(1);
        row.getMarkers()
            .put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
        Mark.INJECT_MARK.putMark(row, YES_STRING);
        context.write(reversedUrl, row);
      }
    }
  }

This is the main map function. It does the following.
First, it cleans up each input line, starting by trimming the whitespace on both sides:

String url = value.toString().trim();

Empty lines and lines that start with '#' are skipped:

if (url != null && (url.length() == 0 || url.startsWith("#"))) {
        /* Ignore line that start with # */
        return;
      }

Given the seed line format

 http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

the line is processed as follows:

  Map<String, String> metadata = new TreeMap<String, String>();
      if (url.indexOf("\t") != -1) {
        String[] splits = url.split("\t");
        url = splits[0];
        for (int s = 1; s < splits.length; s++) {
          // find separation between name and value
          int indexEquals = splits[s].indexOf("=");
          if (indexEquals == -1) {
            // skip anything without a =
            continue;
          }
          String metaname = splits[s].substring(0, indexEquals);
          String metavalue = splits[s].substring(indexEquals + 1);
          if (metaname.equals(nutchScoreMDName)) {
            try {
              customScore = Float.parseFloat(metavalue);
            } catch (NumberFormatException nfe) {
            }
          } else if (metaname.equals(nutchFetchIntervalMDName)) {
            try {
              customInterval = Integer.parseInt(metavalue);
            } catch (NumberFormatException nfe) {
            }
          } else
            metadata.put(metaname, metavalue);
        }
      }

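The line is first split on '\t'. The first piece is the URL; each remaining piece is split at the first '=' into a name and a value, and pieces without an '=' are skipped. The reserved names (nutch.score and nutch.fetchInterval in the example above) override the score and the fetch interval, and every other pair is stored in the metadata map. A value that fails to parse as a number is silently ignored, so the default stays in effect.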
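The parsing step can be tried outside Hadoop. The following is a minimal standalone sketch, not the Nutch code itself: the class SeedLineParser and its SeedEntry holder are hypothetical names, and the two reserved keys are taken from the example seed line. Unlike the original, it trims each name and value and applies the defaults directly instead of using a -1 sentinel for the score.

```java
import java.util.Map;
import java.util.TreeMap;

/** Standalone sketch of the seed-line parsing done in UrlMapper.map(). */
class SeedLineParser {
    // Reserved keys, as they appear in the example seed line.
    static final String SCORE_KEY = "nutch.score";
    static final String INTERVAL_KEY = "nutch.fetchInterval";

    /** Hypothetical holder for the parsed fields (not a Nutch class). */
    static final class SeedEntry {
        final String url;
        final float score;
        final int fetchInterval;
        final Map<String, String> metadata;

        SeedEntry(String url, float score, int fetchInterval,
                  Map<String, String> metadata) {
            this.url = url;
            this.score = score;
            this.fetchInterval = fetchInterval;
            this.metadata = metadata;
        }
    }

    /** Parses one seed line; returns null for blank lines and comments. */
    static SeedEntry parse(String line, float defaultScore, int defaultInterval) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        String[] parts = trimmed.split("\t");
        String url = parts[0].trim();
        float score = defaultScore;
        int interval = defaultInterval;
        Map<String, String> metadata = new TreeMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq == -1) {
                continue; // skip anything without a '='
            }
            String name = parts[i].substring(0, eq).trim();
            String value = parts[i].substring(eq + 1).trim();
            try {
                if (name.equals(SCORE_KEY)) {
                    score = Float.parseFloat(value);
                } else if (name.equals(INTERVAL_KEY)) {
                    interval = Integer.parseInt(value);
                } else {
                    metadata.put(name, value);
                }
            } catch (NumberFormatException e) {
                // malformed number: keep the default, as the original does
            }
        }
        return new SeedEntry(url, score, interval, metadata);
    }

    public static void main(String[] args) {
        SeedEntry e = parse(
            "http://www.nutch.org/\tnutch.score=10\tnutch.fetchInterval=2592000\tuserType=open_source",
            1.0f, 2592000);
        System.out.println(e.url + " score=" + e.score
            + " interval=" + e.fetchInterval + " metadata=" + e.metadata);
    }
}
```

Returning null for comments keeps the caller's control flow close to the early return in map.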

try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        url = filters.filter(url); // filter the url
      } catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        url = null;
      }
      if (url == null) {
        context.getCounter("injector", "urls_filtered").increment(1);
        return;
      } else { // if it passes
        String reversedUrl = TableUtil.reverseUrl(url); // collect it
        WebPage row = WebPage.newBuilder().build();
        row.setFetchTime(curTime);
        row.setFetchInterval(customInterval);

        // now add the metadata
        Iterator<String> keysIter = metadata.keySet().iterator();
        while (keysIter.hasNext()) {
          String keymd = keysIter.next();
          String valuemd = metadata.get(keymd);
          row.getMetadata().put(new Utf8(keymd),
              ByteBuffer.wrap(valuemd.getBytes()));
        }

        if (customScore != -1)
          row.setScore(customScore);
        else
          row.setScore(scoreInjected);

        try {
          scfilters.injectedScore(url, row);
        } catch (ScoringFilterException e) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Cannot filter injected score for url " + url
                + ", using default (" + e.getMessage() + ")");
          }
        }
        context.getCounter("injector", "urls_injected").increment(1);
        row.getMarkers()
            .put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
        Mark.INJECT_MARK.putMark(row, YES_STRING);
        context.write(reversedUrl, row);
      }

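The URL is normalized and then filtered. If either step throws an exception, or a filter rejects the URL, url ends up null; in that case the urls_filtered counter is incremented and map returns. A URL that passes goes through the remaining steps.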
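The "null means rejected" convention can be illustrated without Nutch. This is a minimal sketch under assumptions: UrlPipeline, stripFragment and httpOnly are hypothetical stand-ins for the URLNormalizers and URLFilters plugins, whose real rules come from configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Sketch of a normalize-then-filter chain where null means "rejected". */
class UrlPipeline {
    /** Applies each step in order; stops as soon as a step rejects the URL. */
    static String apply(String url, List<UnaryOperator<String>> steps) {
        String current = url;
        for (UnaryOperator<String> step : steps) {
            if (current == null) {
                return null;
            }
            try {
                current = step.apply(current);
            } catch (RuntimeException e) {
                return null; // a failing step is treated like a rejection
            }
        }
        return current;
    }

    /** Hypothetical normalizer: drops the fragment part of a URL. */
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Hypothetical filter: accepts only http and https URLs. */
    static String httpOnly(String url) {
        return (url.startsWith("http://") || url.startsWith("https://")) ? url : null;
    }

    /** The two example steps, in the same order as in map(). */
    static List<UnaryOperator<String>> defaultSteps() {
        List<UnaryOperator<String>> steps = new ArrayList<>();
        steps.add(UrlPipeline::stripFragment);
        steps.add(UrlPipeline::httpOnly);
        return steps;
    }

    public static void main(String[] args) {
        int injected = 0;
        int filtered = 0;
        String[] seeds = { "http://example.com/a#top", "ftp://example.com/file" };
        for (String seed : seeds) {
            if (apply(seed, defaultSteps()) == null) {
                filtered++;
            } else {
                injected++;
            }
        }
        System.out.println("injected=" + injected + " filtered=" + filtered);
    }
}
```

The two local counters in main play the role of the urls_injected and urls_filtered job counters.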
The first step reverses the URL with TableUtil.reverseUrl, whose signature and Javadoc read:

String org.apache.nutch.util.TableUtil.reverseUrl(String urlString) throws MalformedURLException

Reverses a url's domain. This form is better for storing in hbase. Because scans within the same domain are faster.

E.g. "http://bar.foo.com:8983/to/index.html?a=b" becomes "com.foo.bar:8983:http/to/index.html?a=b".

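reverseUrl is what is used here, and its API documentation explains it clearly. The reversed URL is the key that map later writes to the context.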
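To make the effect of the reversal concrete, here is a small standalone sketch that reproduces the example from the Javadoc quoted above. It is an illustration, not the Nutch implementation: ReverseUrlSketch is a hypothetical class, and the order of the port and protocol fields simply follows that quoted example, so the real TableUtil code may arrange them differently.

```java
import java.net.URI;

/** Sketch of host reversal, following the Javadoc example quoted above. */
class ReverseUrlSketch {
    static String reverseUrl(String urlString) {
        URI uri = URI.create(urlString);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in " + urlString);
        }
        // Reverse the dot-separated host labels: bar.foo.com -> com.foo.bar
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append('.');
            }
        }
        // Port (if present) and scheme, in the order shown in the Javadoc example.
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        sb.append(':').append(uri.getScheme());
        // Path and query are appended unchanged.
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
        System.out.println(reverseUrl("http://www.foo.com/"));
    }
}
```

Because the reversed host comes first, keys for bar.foo.com and www.foo.com share the prefix com.foo., so rows from the same domain sort next to each other, which is the scan benefit the Javadoc mentions.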

 Iterator<String> keysIter = metadata.keySet().iterator();
        while (keysIter.hasNext()) {
          String keymd = keysIter.next();
          String valuemd = metadata.get(keymd);
          row.getMetadata().put(new Utf8(keymd),
              ByteBuffer.wrap(valuemd.getBytes()));
        }

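Here the metadata TreeMap is traversed, and for each key the corresponding value is written into the row's metadata, wrapped in a ByteBuffer.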
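One detail worth noting in the loop above is that valuemd.getBytes() uses the platform default charset. The following sketch shows the same string-to-ByteBuffer conversion with an explicit UTF-8 charset. It is a variation for illustration, not what the Nutch source does: MetadataEncoding is a hypothetical class, and plain String keys stand in for the Avro Utf8 keys used by WebPage.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: encode metadata values as UTF-8 bytes wrapped in ByteBuffers. */
class MetadataEncoding {
    static Map<String, ByteBuffer> encode(Map<String, String> metadata) {
        Map<String, ByteBuffer> out = new TreeMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            byte[] bytes = e.getValue().getBytes(StandardCharsets.UTF_8);
            out.put(e.getKey(), ByteBuffer.wrap(bytes));
        }
        return out;
    }

    /** Decodes without moving the position of the stored buffer. */
    static String decode(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new TreeMap<>();
        metadata.put("userType", "open_source");
        Map<String, ByteBuffer> encoded = encode(metadata);
        System.out.println(decode(encoded.get("userType")));
    }
}
```

Iterating over entrySet() also avoids the extra metadata.get(key) lookup per key.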
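After that, the row gets its score (the custom score from the seed line if one was given, otherwise scoreInjected), and the scoring filters adjust it through injectedScore. Finally, the reversed URL and the WebPage row are written to the context.
Next comes the run() function, which is responsible for creating the job: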

public Map<String, Object> run(Map<String, Object> args) throws Exception {
    getConf().setLong("injector.current.time", System.currentTimeMillis());
    Path input;
    Object path = args.get(Nutch.ARG_SEEDDIR);
    if (path instanceof Path) {
      input = (Path) path;
    } else {
      input = new Path(path.toString());
    }
    numJobs = 1;
    currentJobNum = 0;
    currentJob = NutchJob.getInstance(getConf(), "inject " + input);
    FileInputFormat.addInputPath(currentJob, input);
    currentJob.setMapperClass(UrlMapper.class);
    currentJob.setMapOutputKeyClass(String.class);
    currentJob.setMapOutputValueClass(WebPage.class);
    currentJob.setOutputFormatClass(GoraOutputFormat.class);

    DataStore<String, WebPage> store = StorageUtils.createWebStore(
        currentJob.getConfiguration(), String.class, WebPage.class);
    GoraOutputFormat.setOutput(currentJob, store, true);

    // NUTCH-1471 Make explicit which datastore class we use
    Class<? extends DataStore<Object, Persistent>> dataStoreClass = StorageUtils
        .getDataStoreClass(currentJob.getConfiguration());
    LOG.info("InjectorJob: Using " + dataStoreClass
        + " as the Gora storage class.");

    currentJob.setReducerClass(Reducer.class);
    currentJob.setNumReduceTasks(0);

    currentJob.waitForCompletion(true);
    ToolUtil.recordJobStatus(null, currentJob, results);

    // NUTCH-1370 Make explicit #URLs injected @runtime
    long urlsInjected = currentJob.getCounters()
        .findCounter("injector", "urls_injected").getValue();
    long urlsFiltered = currentJob.getCounters()
        .findCounter("injector", "urls_filtered").getValue();
    LOG.info("InjectorJob: total number of urls rejected by filters: "
        + urlsFiltered);
    LOG.info("InjectorJob: total number of urls injected after normalization and filtering: "
        + urlsInjected);

    return results;
  }

It first resolves the input path and then creates the map job.
currentJob is a NutchJob instance, and the map/reduce settings are configured on it:

  currentJob.setMapperClass(UrlMapper.class);
    currentJob.setMapOutputKeyClass(String.class);
    currentJob.setMapOutputValueClass(WebPage.class);
    currentJob.setOutputFormatClass(GoraOutputFormat.class);

GoraOutputFormat comes from Apache Gora, an Apache project; it defines an output format that writes the job's output to a Gora data store.

  DataStore<String, WebPage> store = StorageUtils.createWebStore(
        currentJob.getConfiguration(), String.class, WebPage.class);
    GoraOutputFormat.setOutput(currentJob, store, true);

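This creates a Gora DataStore for WebPage records and then registers it as the output of the current job, which also determines where the results are stored.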

 currentJob.setReducerClass(Reducer.class);
    currentJob.setNumReduceTasks(0);

    currentJob.waitForCompletion(true);
    ToolUtil.recordJobStatus(null, currentJob, results);

This configures the job to run without a reduce phase, waits for it to complete, and records the job status.

There is one more run function, which is responsible for starting an injector job; it is not covered here.
