[Untitled]


Article link: https://blog.csdn.net/MyosotisLPS/article/details/122181629


2021SC@SDUSC

As analyzed in the previous chapter, the command "bin/nutch fetch crawl/segments/*" ends up calling the main function of org.apache.nutch.fetcher.Fetcher.

public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new Fetcher(), args);
  System.exit(res);
}

ToolRunner's run function in turn calls Fetcher's run function.

Fetcher::run

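public int run(String[] args) throws Exception {

  // args[0] is the segment directory to fetch.
  Path segment = new Path(args[0]);
  // Default thread count comes from the configuration (10 if unset).
  int threads = getConf().getInt("fetcher.threads.fetch", 10);

  // An optional "-threads N" argument overrides the configured value.
  for (int i = 1; i < args.length; i++) {
    if (args[i].equals("-threads")) {
      threads = Integer.parseInt(args[++i]);
    }
  }

  getConf().setInt("fetcher.threads.fetch", threads);

  fetch(segment, threads);
  return 0;

}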

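run reads the number of fetcher threads, threads, from the fetcher.threads.fetch property (default 10); a "-threads" argument on the command line overrides it. segment is the path of the crawl/segments/2* directory. Finally, run calls the fetch function.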
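The argument handling above can be tried without Hadoop. The following is a minimal sketch, not Nutch code: the class ThreadArgs and the method parseThreads are hypothetical names, the defaultThreads parameter stands in for getConf().getInt("fetcher.threads.fetch", 10), and the extra bounds check before reading the value is an addition that the original loop does not have.

```java
class ThreadArgs {
    // Hypothetical helper mirroring the loop in Fetcher.run: args[0] is the
    // segment path, and "-threads N" may appear anywhere after it.
    static int parseThreads(String[] args, int defaultThreads) {
        int threads = defaultThreads;
        for (int i = 1; i < args.length; i++) {
            // The bounds check is an assumption added for safety; the
            // original code reads args[++i] unconditionally.
            if (args[i].equals("-threads") && i + 1 < args.length) {
                threads = Integer.parseInt(args[++i]);
            }
        }
        return threads;
    }

    public static void main(String[] args) {
        String[] withFlag = {"crawl/segments/2*", "-threads", "20"};
        String[] withoutFlag = {"crawl/segments/2*"};
        System.out.println(parseThreads(withFlag, 10));    // prints 20
        System.out.println(parseThreads(withoutFlag, 10)); // prints 10
    }
}
```

Because the loop starts at index 1, the segment path in args[0] is never mistaken for an option.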

Fetcher::run->fetch

public void fetch(Path segment, int threads) throws IOException {
  // Fail early if http.agent.name is not configured.
  checkConfiguration();

  JobConf job = new NutchJob(getConf());
  job.setJobName("fetch " + segment);
  job.setInt("fetcher.threads.fetch", threads);
  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
  job.setSpeculativeExecution(false);

  // Input: the crawl_generate directory inside the segment.
  FileInputFormat.addInputPath(job, new Path(segment,
      CrawlDatum.GENERATE_DIR_NAME));
  job.setInputFormat(InputFormat.class);
  // Fetcher itself is the map runner, so its run(RecordReader, ...) does the work.
  job.setMapRunnerClass(Fetcher.class);

  // Output: written back into the same segment directory.
  FileOutputFormat.setOutputPath(job, segment);
  job.setOutputFormat(FetcherOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(NutchWritable.class);

  JobClient.runJob(job);

}

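checkConfiguration checks whether the http.agent.name property is set in the configuration and throws an exception if it is not. fetch then creates a Hadoop job. Its input is the crawl_generate directory under crawl/segments/2*, which the generate command produced. The map runner is the Fetcher class itself (setMapRunnerClass), so the input is processed by Fetcher's run function. The output is also written under crawl/segments/2*, and FetcherOutputFormat defines how the results are written.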
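The check on http.agent.name can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the Nutch implementation: the class AgentNameCheck is hypothetical, a plain Map stands in for the Hadoop Configuration, and the exception type and message are chosen for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

class AgentNameCheck {
    // Hypothetical stand-in for Fetcher.checkConfiguration. The Map plays the
    // role of the Hadoop Configuration; the exception type is an assumption.
    static void checkConfiguration(Map<String, String> conf) {
        String agentName = conf.get("http.agent.name");
        if (agentName == null || agentName.trim().isEmpty()) {
            throw new IllegalArgumentException("http.agent.name is not set");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("http.agent.name", "MyTestCrawler"); // hypothetical agent name
        checkConfiguration(conf); // returns normally

        try {
            checkConfiguration(new HashMap<>()); // property missing
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Running the check before the job is created means a misconfigured crawl fails immediately instead of after the Hadoop job has been submitted.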

Fetcher::run

public void run(RecordReader<Text, CrawlDatum> input,
    OutputCollector<Text, NutchWritable> output, Reporter reporter)
    throws IOException {

  ...

  // Producer: reads the input records and fills the shared fetch queues.
  feeder = new QueueFeeder(input, fetchQueues, threadCount
      * queueDepthMuliplier);
  feeder.start();

  // Consumers: each FetcherThread takes items from the shared queues.
  for (int i = 0; i < threadCount; i++) {
    FetcherThread t = new FetcherThread(getConf(), getActiveThreads(), fetchQueues,
        feeder, spinWaiting, lastRequestStart, reporter, errors, segmentName,
        parsing, output, storingContent, pages, bytes);
    fetcherThreads.add(t);
    t.start();
  }

  ...

}

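Fetcher's run function first creates the shared queue FetchItemQueues (fetchQueues), then creates a QueueFeeder (feeder) that reads URLs and CrawlDatum records from crawl/segments/2*/crawl_generate and puts them into that shared queue.
It then creates threadCount FetcherThread instances and calls start on each one to begin fetching pages.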
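The feeder and the fetcher threads form a producer-consumer arrangement around one shared queue. The sketch below shows that shape with standard library classes only. Everything in it is an assumption made for illustration: FeederAndWorkers and crawl are hypothetical names, a LinkedBlockingQueue of strings replaces FetchItemQueues, and the workers only count items instead of fetching pages.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class FeederAndWorkers {
    // Runs one feeder thread and threadCount worker threads over a shared
    // queue and returns how many URLs the workers processed.
    static int crawl(List<String> urls, int threadCount) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger processed = new AtomicInteger();

        // Producer, playing the role of QueueFeeder.
        Thread feeder = new Thread(() -> urls.forEach(queue::add));
        feeder.start();

        // Consumers, playing the role of FetcherThread.
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    // Continue while the feeder may still add work or work remains.
                    while (feeder.isAlive() || !queue.isEmpty()) {
                        String url = queue.poll(10, TimeUnit.MILLISECONDS);
                        if (url != null) {
                            processed.incrementAndGet(); // a real worker would fetch here
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        try {
            feeder.join();
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example/", "http://b.example/",
                "http://c.example/");
        System.out.println(crawl(urls, 2)); // prints 3
    }
}
```

Passing the feeder to each worker, as the FetcherThread constructor call above also does, lets a worker tell "queue temporarily empty" apart from "no more input".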

Fetcher::run->QueueFeeder::run

public void run() {
  boolean hasMore = true;

  while (hasMore) {

    ...

    // Free slots left in the shared queues.
    int feed = size - queues.getTotalSize();
    if (feed <= 0) {
      // Queues are full: wait a second and check again.
      Thread.sleep(1000);
      continue;
    } else {
      // Read input records until the queues are full or the input ends.
      while (feed > 0 && hasMore) {
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        hasMore = reader.next(url, datum);
        if (hasMore) {
          queues.addFetchItem(url, datum);
          feed--;
        }
      }
    }
  }

}

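The feed variable is the number of free slots in the shared queue FetchItemQueues, that is, how many more URL and CrawlDatum pairs can be inserted; size is the capacity passed to the QueueFeeder constructor (threadCount * queueDepthMuliplier). If feed is less than or equal to 0, the queue is full and the feeder thread sleeps for one second before checking again. If feed is greater than 0, there is room, and the feeder calls the next function of the RecordReader (reader) to read URLs and CrawlDatum records one by one from the crawl_generate directory under crawl/segments/2*. addFetchItem wraps each pair in a FetchItem and adds it to the shared queue.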
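One pass of this refill logic can be sketched without Hadoop types. The code below is an illustration under stated assumptions, not the QueueFeeder source: FeedSketch and refill are hypothetical names, an ArrayDeque of strings replaces FetchItemQueues, and an Iterator replaces the RecordReader.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Queue;

class FeedSketch {
    // One pass of the refill logic; returns how many items were added.
    static int refill(Queue<String> queue, int capacity, Iterator<String> reader) {
        int feed = capacity - queue.size(); // free slots, as in size - getTotalSize()
        int added = 0;
        while (feed > 0 && reader.hasNext()) {
            queue.add(reader.next());
            feed--;
            added++;
        }
        return added;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        Iterator<String> reader =
                Arrays.asList("u1", "u2", "u3", "u4", "u5").iterator();

        System.out.println(refill(queue, 3, reader)); // prints 3: queue was empty
        System.out.println(refill(queue, 3, reader)); // prints 0: queue is full
        queue.poll();                                  // a consumer takes one item
        System.out.println(refill(queue, 3, reader)); // prints 1: one slot was freed
    }
}
```

A pass that adds nothing corresponds to the feed <= 0 branch, where the real feeder sleeps instead of returning.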
