Nutch Source Code Analysis: generate
Following the analysis in the previous chapter, the command "bin/nutch generate crawl/crawldb crawl/segments" ultimately calls the main function of org.apache.nutch.crawl.Generator.
Generator::main
public static void main(String args[]) throws Exception {
  int res = ToolRunner
      .run(NutchConfiguration.create(), new Generator(), args);
  System.exit(res);
}
ToolRunner is a Hadoop utility class; it eventually calls the run function of the Generator class.
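For reference, ToolRunner.run roughly does the following (a simplified sketch, not the actual Hadoop source): it parses generic Hadoop options, applies them to the Configuration, injects the Configuration into the tool, and delegates to the tool's run with the remaining arguments.

// Simplified sketch of ToolRunner.run (not the actual Hadoop source):
public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
  // strip generic options (-D, -fs, -jt, ...) and apply them to conf
  GenericOptionsParser parser = new GenericOptionsParser(conf, args);
  tool.setConf(conf);
  // call Generator.run with whatever arguments remain
  return tool.run(parser.getRemainingArgs());
}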
Generator::run
public int run(String[] args) throws Exception {
  Path dbDir = new Path(args[0]);
  Path segmentsDir = new Path(args[1]);
  long curTime = System.currentTimeMillis();
  long topN = Long.MAX_VALUE;
  int numFetchers = -1;
  boolean filter = true;
  boolean norm = true;
  boolean force = false;
  String expr = null;
  int maxNumSegments = 1;
  for (int i = 2; i < args.length; i++) {
    if ("-topN".equals(args[i])) {
      topN = Long.parseLong(args[i + 1]);
      i++;
    } else if ("-numFetchers".equals(args[i])) {
      numFetchers = Integer.parseInt(args[i + 1]);
      i++;
    } else if ("-adddays".equals(args[i])) {
      long numDays = Integer.parseInt(args[i + 1]);
      curTime += numDays * 1000L * 60 * 60 * 24;
    } else if ("-noFilter".equals(args[i])) {
      filter = false;
    } else if ("-noNorm".equals(args[i])) {
      norm = false;
    } else if ("-force".equals(args[i])) {
      force = true;
    } else if ("-maxNumSegments".equals(args[i])) {
      maxNumSegments = Integer.parseInt(args[i + 1]);
    } else if ("-expr".equals(args[i])) {
      expr = args[i + 1];
    }
  }
  Path[] segs = generate(dbDir, segmentsDir, numFetchers, topN, curTime,
      filter, norm, force, maxNumSegments, expr);
  return 0; // simplified here: the actual source checks segs for errors
}
The run function sets up its variables from the arguments passed to main: dbDir is crawl/crawldb, the data directory holding the initial URLs after inject, and segmentsDir is crawl/segments, the data directory for the URLs of each fetch round. The generate function eventually creates a subdirectory for one fetch round under segmentsDir.
Generator::run->generate
Part 1
public Path[] generate(Path dbDir, Path segments, int numLists, long topN,
    long curTime, boolean filter, boolean norm, boolean force,
    int maxNumSegments, String expr) throws IOException {
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".")
      + "/generate-temp-" + java.util.UUID.randomUUID().toString());
  Path lock = new Path(dbDir, CrawlDb.LOCK_NAME);
  FileSystem fs = FileSystem.get(getConf());
  LockUtil.createLockFile(fs, lock, force);
  JobConf job = new NutchJob(getConf());
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(Selector.class);
  job.setPartitionerClass(Selector.class);
  job.setReducerClass(Selector.class);
  job.setOutputFormat(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(FloatWritable.class);
  job.setOutputKeyComparatorClass(DecreasingFloatComparator.class);
  job.setOutputValueClass(SelectorEntry.class);
  // overrides the SequenceFileOutputFormat set a few lines above
  job.setOutputFormat(GeneratorOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(dbDir, CrawlDb.CURRENT_NAME));
  FileOutputFormat.setOutputPath(job, tempDir);
  ...
  JobClient.runJob(job);
  ...
}
tempDir names the directory "<mapred.temp.dir>/generate-temp-<uuid>". Next, a .locked lock file is created under crawl/crawldb, and a Hadoop Job is set up whose Mapper, Partitioner and Reducer are all the Selector class; the Mapper reads the files under the crawl/crawldb/current directory, and the Reducer's output is saved into the temporary directory tempDir created above.
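The semantics of LockUtil.createLockFile can be sketched as follows (an approximation of its observed behavior, not the actual Nutch source): an existing lock aborts the run unless force is true.

// Approximate behavior of LockUtil.createLockFile (a sketch; the real
// implementation may differ in details):
public static void createLockFile(FileSystem fs, Path lockFile, boolean force)
    throws IOException {
  if (fs.exists(lockFile) && !force) {
    // another generate/update is (or was) running on this crawldb
    throw new IOException("lock file " + lockFile + " already exists.");
  }
  fs.createNewFile(lockFile); // creates crawl/crawldb/.locked
}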
generate then calls JobClient's runJob function to execute the Job. Note that runJob is synchronous, whereas the submitJob function executes asynchronously. The net effect of the Job is to take the CrawlDatum records of the start URLs stored under crawl/crawldb/current, process and filter them, and write the result into tempDir.
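The difference between the two calls can be illustrated as below (a minimal sketch using the old mapred API; the polling loop is illustrative only and assumes an enclosing method that throws Exception):

// runJob blocks until the job finishes (and throws if the job fails):
RunningJob done = JobClient.runJob(job);

// submitJob returns immediately; the caller must poll for completion:
JobClient client = new JobClient(job);
RunningJob handle = client.submitJob(job);
while (!handle.isComplete()) {
  Thread.sleep(1000); // do other work, report progress, ...
}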
Let us look first at Selector's map function.
Selector::map
public void map(Text key, CrawlDatum value,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  Text url = key;
  if (filter) {
    // a null result means some URLFilter rejected the URL
    if (filters.filter(url.toString()) == null)
      return;
  }
  CrawlDatum crawlDatum = value;
  if (!schedule.shouldFetch(url, crawlDatum, curTime)) {
    return; // not due for fetching yet
  }
  LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData().get(
      Nutch.WRITABLE_GENERATE_TIME_KEY);
  if (oldGenTime != null) {
    // already generated recently and that entry has not expired yet
    if (oldGenTime.get() + genDelay > curTime)
      return;
  }
  float sort = 1.0f;
  sort = scfilters.generatorSortValue(key, crawlDatum, sort);
  if (expr != null) {
    if (!crawlDatum.evaluate(expr)) {
      return;
    }
  }
  if (restrictStatus != null
      && !restrictStatus.equalsIgnoreCase(CrawlDatum
          .getStatusName(crawlDatum.getStatus())))
    return;
  // note: x != Float.NaN is always true in Java, and sort < NaN is
  // always false, so an unset (NaN) threshold filters nothing
  if (scoreThreshold != Float.NaN && sort < scoreThreshold)
    return;
  if (intervalThreshold != -1
      && crawlDatum.getFetchInterval() > intervalThreshold)
    return;
  sortValue.set(sort);
  crawlDatum.getMetaData().put(Nutch.WRITABLE_GENERATE_TIME_KEY, genTime);
  entry.datum = crawlDatum;
  entry.url = key;
  output.collect(sortValue, entry);
}
The core job of the map function is to decide whether a URL qualifies for fetching.
The records map reads from crawl/crawldb/current have the URL as key and a CrawlDatum structure wrapping the URL's metadata as value.
filter defaults to true unless the generate command was given the "-noFilter" flag, so the URL is first passed through URLFilters' filter function; a null result means some filter rejected it.
schedule defaults to DefaultFetchSchedule. Its shouldFetch decides whether the URL is due for another fetch: it returns false while the datum's next scheduled fetch time still lies in the future, which enforces a minimum interval between two successive fetches.
Selector::map->DefaultFetchSchedule::shouldFetch
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // if the scheduled fetch time lies too far in the future, clamp the
  // interval and make the URL fetchable now
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f);
    }
    datum.setFetchTime(curTime);
  }
  if (datum.getFetchTime() > curTime) {
    return false; // not due yet
  }
  return true;
}
Further down, if the URL's metadata contains a non-null value for WRITABLE_GENERATE_TIME_KEY and that previous generate has not yet expired, the URL is skipped; the expiry period genDelay defaults to 7 days.
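genDelay is read from the configuration when the Selector is set up; a one-line sketch of that initialization, assuming the property name is crawl.gen.delay (value in days, converted to milliseconds):

// Sketch of Selector.configure initializing genDelay (assumed property
// name "crawl.gen.delay", default 7 days, stored in milliseconds):
genDelay = job.getLong("crawl.gen.delay", 7L) * 3600L * 24 * 1000;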
Next, generatorSortValue assigns the URL a score, and CrawlDatum's evaluate function evaluates the Jexl expression expr against the CrawlDatum; we will not descend into that here. The map function then checks that the status, score and fetch interval satisfy the configured thresholds.
The output of Selector's map function is distributed across the reducers by Selector's getPartition function.
Selector::getPartition
public int getPartition(FloatWritable key, Writable value,
    int numReduceTasks) {
  return partitioner.getPartition(((SelectorEntry) value).url, key,
      numReduceTasks);
}
partitioner defaults to URLPartitioner, and the parameter numReduceTasks gives the number of reduce tasks. The integer returned by URLPartitioner's getPartition selects which Reducer processes a given piece of the Mapper's output.
Selector::getPartition->URLPartitioner::getPartition
public int getPartition(Text key, Writable value, int numReduceTasks) {
  String urlString = key.toString();
  URL url = null;
  int hashCode = urlString.hashCode();
  try {
    urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_PARTITION);
    url = new URL(urlString);
    hashCode = url.getHost().hashCode();
  } catch (MalformedURLException e) {
    // fall back to the hash of the raw URL string
  }
  if (mode.equals(PARTITION_MODE_DOMAIN) && url != null)
    hashCode = URLUtil.getDomainName(url).hashCode();
  else if (mode.equals(PARTITION_MODE_IP)) {
    try {
      InetAddress address = InetAddress.getByName(url.getHost());
      hashCode = address.getHostAddress().hashCode();
    } catch (UnknownHostException e) {
      // keep the host-based hash
    }
  }
  hashCode ^= seed;
  return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
}
Depending on mode, getPartition extracts the host name, the domain name or the host's IP address from the URL, hashes it, and takes the hash modulo numReduceTasks to pick a Reducer, so all URLs sharing that host, domain or IP end up on the same Reducer.
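A tiny standalone illustration of the final two lines (hypothetical host and values, not Nutch code):

// Hypothetical illustration of the host-based partitioning math:
String host = "example.com";
int seed = 12345;          // "partition.url.seed", set per job
int numReduceTasks = 4;
int hashCode = host.hashCode() ^ seed;
// mask the sign bit so the modulo result is non-negative
int reducer = (hashCode & Integer.MAX_VALUE) % numReduceTasks;
// every URL of example.com maps to the same reducer within this job

Because partitionSegment seeds "partition.url.seed" with a fresh random value per job, the same host can land on a different reducer in the next run, spreading load across runs.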
Finally, let us look at Selector's reduce function.
Selector::reduce
public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  while (values.hasNext()) {
    if (count == limit) {
      // current segment is full: open the next one, or stop
      if (currentsegmentnum < maxNumSegments) {
        count = 0;
        currentsegmentnum++;
      } else
        break;
    }
    SelectorEntry entry = values.next();
    Text url = entry.url;
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }
      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }
    } catch (Exception e) {
      continue; // malformed URL, skip it
    }
    hostordomain = hostordomain.toLowerCase();
    if (maxCount > 0) {
      // hostCount[0] = segment currently assigned to this host/domain,
      // hostCount[1] = number of its URLs already in that segment
      int[] hostCount = hostCounts.get(hostordomain);
      if (hostCount == null) {
        hostCount = new int[] { 1, 0 };
        hostCounts.put(hostordomain, hostCount);
      }
      hostCount[1]++;
      // skip segments that are already full
      while (segCounts[hostCount[0] - 1] >= limit
          && hostCount[0] < maxNumSegments) {
        hostCount[0]++;
        hostCount[1] = 0;
      }
      // this host/domain reached maxCount in its segment: spill over
      if (hostCount[1] >= maxCount) {
        if (hostCount[0] < maxNumSegments) {
          hostCount[0]++;
          hostCount[1] = 0;
        } else {
          continue;
        }
      }
      entry.segnum = new IntWritable(hostCount[0]);
      segCounts[hostCount[0] - 1]++;
    } else {
      entry.segnum = new IntWritable(currentsegmentnum);
      segCounts[currentsegmentnum - 1]++;
    }
    output.collect(key, entry);
    count++;
  }
}
The core job of the reduce function is to determine which segment each URL belongs to; URLs generated in different rounds are stored in different folders under crawl/segments.
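When maxCount > 0, the hostCounts bookkeeping spills a host's excess URLs into later segments. The following standalone toy simulation mirrors that spill logic (not Nutch code; maxCount and maxNumSegments are made-up values, and the segCounts/limit check is omitted for brevity):

import java.util.HashMap;
import java.util.Map;

// Toy simulation of the per-host spill logic above (not Nutch code).
public class HostSpillDemo {
  public static void main(String[] args) {
    int maxCount = 2, maxNumSegments = 3;   // made-up values
    Map<String, int[]> hostCounts = new HashMap<String, int[]>();
    String[] hosts = { "a.com", "a.com", "a.com", "a.com", "b.com" };
    for (String host : hosts) {
      // [0] = segment assigned to this host, [1] = URLs in that segment
      int[] hc = hostCounts.get(host);
      if (hc == null) {
        hc = new int[] { 1, 0 };
        hostCounts.put(host, hc);
      }
      hc[1]++;
      if (hc[1] >= maxCount) {        // segment quota reached for host
        if (hc[0] < maxNumSegments) {
          hc[0]++;                    // spill into the next segment
          hc[1] = 0;
        } else {
          System.out.println(host + " -> dropped");
          continue;
        }
      }
      System.out.println(host + " -> segment " + hc[0]);
    }
  }
}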
Generator::run->generate
Part 2
public Path[] generate(Path dbDir, Path segments, int numLists, long topN,
    long curTime, boolean filter, boolean norm, boolean force,
    int maxNumSegments, String expr) throws IOException {
  ...
  List<Path> generatedSegments = new ArrayList<Path>();
  FileStatus[] status = fs.listStatus(tempDir);
  for (FileStatus stat : status) {
    Path subfetchlist = stat.getPath();
    if (!subfetchlist.getName().startsWith("fetchlist-"))
      continue;
    Path newSeg = partitionSegment(fs, segments, subfetchlist, numLists);
    generatedSegments.add(newSeg);
  }
  if (getConf().getBoolean(GENERATE_UPDATE_CRAWLDB, false)) {
    Path tempDir2 = new Path(getConf().get("mapred.temp.dir", ".")
        + "/generate-temp-" + java.util.UUID.randomUUID().toString());
    job = new NutchJob(getConf());
    job.setJobName("generate: updatedb " + dbDir);
    job.setLong(Nutch.GENERATE_TIME_KEY, generateTime);
    for (Path segmpaths : generatedSegments) {
      Path subGenDir = new Path(segmpaths, CrawlDatum.GENERATE_DIR_NAME);
      FileInputFormat.addInputPath(job, subGenDir);
    }
    FileInputFormat.addInputPath(job, new Path(dbDir, CrawlDb.CURRENT_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(CrawlDbUpdater.class);
    job.setReducerClass(CrawlDbUpdater.class);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    FileOutputFormat.setOutputPath(job, tempDir2);
    JobClient.runJob(job);
    CrawlDb.install(job, dbDir);
    fs.delete(tempDir2, true);
  }
  LockUtil.removeLockFile(fs, lock);
  fs.delete(tempDir, true);
  Path[] patharray = new Path[generatedSegments.size()];
  return generatedSegments.toArray(patharray);
}
tempDir holds the output of the Reducer described above. generate iterates over the subdirectories of tempDir whose names start with "fetchlist-"; for each one, partitionSegment runs another Hadoop job that writes its contents into crawl/segments and returns the new segment path.
The following if block merges the data in the crawlDb with the data in the crawl_generate folders; this chapter does not examine it further.
Finally, the lock file and the temporary directory tempDir are deleted, and the segment paths under crawl/segments are returned, e.g. "crawl/segments/2016*", each of which contains a crawl_generate subdirectory.
Generator::run->generate->partitionSegment
private Path partitionSegment(FileSystem fs, Path segmentsDir, Path inputDir,
    int numLists) throws IOException {
  Path segment = new Path(segmentsDir, generateSegmentName());
  Path output = new Path(segment, CrawlDatum.GENERATE_DIR_NAME);
  NutchJob job = new NutchJob(getConf());
  job.setJobName("generate: partition " + segment);
  job.setInt("partition.url.seed", new Random().nextInt());
  FileInputFormat.addInputPath(job, inputDir);
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapperClass(SelectorInverseMapper.class);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(SelectorEntry.class);
  job.setPartitionerClass(URLPartitioner.class);
  job.setReducerClass(PartitionReducer.class);
  job.setNumReduceTasks(numLists);
  FileOutputFormat.setOutputPath(job, output);
  job.setOutputFormat(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  job.setOutputKeyComparatorClass(HashComparator.class);
  JobClient.runJob(job);
  return segment;
}
generateSegmentName produces a name string from the current time.
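From memory it looks roughly like the following (the exact date pattern is an assumption):

// Sketch of generateSegmentName (assumed pattern "yyyyMMddHHmmss";
// needs java.text.SimpleDateFormat and java.util.Date):
private static SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");

public static synchronized String generateSegmentName() {
  try {
    Thread.sleep(1000); // ensure successive calls yield distinct names
  } catch (Throwable t) {
  }
  return sdf.format(new Date(System.currentTimeMillis()));
}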
partitionSegment creates a folder named with that timestamp under the crawl/segments directory referenced by segmentsDir, and then a crawl_generate folder inside it.
inputDir points to one of the fetchlist-* folders under tempDir.
The job set up next is straightforward: SelectorInverseMapper's map does no real work, simply re-emitting each SelectorEntry keyed by its URL; URLPartitioner's getPartition was analyzed above; PartitionReducer's reduce extracts the CrawlDatum from each SelectorEntry and writes it into the crawl_generate folder.
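A sketch of what PartitionReducer.reduce amounts to, assuming it only unwraps each entry (the actual source may differ in details):

// Sketch of PartitionReducer.reduce: unwrap each SelectorEntry into
// (url, CrawlDatum) pairs for the final crawl_generate output.
public void reduce(Text key, Iterator<SelectorEntry> values,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  while (values.hasNext()) {
    SelectorEntry entry = values.next();
    output.collect(entry.url, entry.datum);
  }
}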
To summarize, Nutch's generate command processes and filters the URL records stored in crawl/crawldb and finally writes them into crawl/segments/<time>/crawl_generate.