Nutch1

rongrong0206


Wrestling with Nutch, Part 1

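Problem description:
When running generate with Nutch 1.0 on a crawldb containing 500 million URLs, the input is split into 64 MB blocks by default, which produces 777 map tasks. Late in the run, the following exception appears:
Could not find taskTracker/jobcache/job_200903231519_0017/attempt_200903231519_0017_r_000051_0/output/file.out in any of the configured local directories
Solution:
Reduce the number of tasks by switching to a strategy that creates one split per file in the crawldb: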

Java code

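// Nested class; needs java.io.IOException, org.apache.hadoop.fs.FileStatus,
// org.apache.hadoop.io.Writable, org.apache.hadoop.io.WritableComparable and
// org.apache.hadoop.mapred.{FileSplit, InputSplit, JobConf, SequenceFileInputFormat}.
public static class InputFormat extends SequenceFileInputFormat<WritableComparable, Writable> {
  /** Don't split inputs, to keep things polite. */
  public InputSplit[] getSplits(JobConf job, int nSplits)
      throws IOException {
    // One split per input file; the requested nSplits is deliberately ignored.
    FileStatus[] files = listStatus(job);
    InputSplit[] splits = new InputSplit[files.length];
    for (int i = 0; i < files.length; i++) {
      FileStatus cur = files[i];
      // The split covers the whole file, with no preferred hosts.
      splits[i] = new FileSplit(cur.getPath(), 0,
          cur.getLen(), (String[]) null);
    }
    return splits;
  }
}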
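To make the effect of this change concrete, the following is a minimal sketch that compares the number of splits under the two strategies. It is a simplification and not Hadoop's actual split logic (which also applies a slop factor and minimum split size); the class name SplitCountSketch, the file count, and the file sizes are hypothetical, since the post does not say how many files the crawldb contains.

```java
import java.util.Arrays;

public class SplitCountSketch {

    /** Approximate split count when every file is cut into blockSize-byte chunks. */
    public static long splitsByBlock(long[] fileLengths, long blockSize) {
        long splits = 0;
        for (long len : fileLengths) {
            splits += (len + blockSize - 1) / blockSize; // ceiling division
        }
        return splits;
    }

    /** Split count when every file becomes exactly one split. */
    public static long splitsByFile(long[] fileLengths) {
        return fileLengths.length;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;                 // 64 MB, as in the post
        long[] fileLengths = new long[10];                  // hypothetical: 10 part files
        Arrays.fill(fileLengths, 5L * 1024 * 1024 * 1024);  // hypothetical: 5 GiB each

        System.out.println("by block: " + splitsByBlock(fileLengths, blockSize));
        System.out.println("by file:  " + splitsByFile(fileLengths));
    }
}
```

With one split per file, the number of map tasks depends only on how many files the crawldb holds, so each task runs much longer than a 64 MB task would.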

This raised a new problem: several tasks showed no activity for ten minutes, which caused the whole job to fail.
Solution:
Edit hadoop-site.xml:

XML configuration

<property>
  <name>mapred.task.timeout</name>
  <value>3600000</value>
  <description>The number of milliseconds before a task will be
  terminated if it neither reads an input, writes an output, nor
  updates its status string.
  </description>
</property>
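The value is given in milliseconds, so it helps to check what it means in minutes. The short sketch below does that conversion; the class name TaskTimeoutSketch is hypothetical, and 600000 ms is used here as the ten-minute limit that the failing tasks ran into.

```java
import java.time.Duration;

public class TaskTimeoutSketch {

    /** Converts a mapred.task.timeout value (milliseconds) to whole minutes. */
    public static long toMinutes(long timeoutMillis) {
        return Duration.ofMillis(timeoutMillis).toMinutes();
    }

    public static void main(String[] args) {
        System.out.println(toMinutes(600_000L));    // the ten-minute limit that was hit
        System.out.println(toMinutes(3_600_000L));  // the new value from hadoop-site.xml
    }
}
```

The new setting therefore gives a silent task one hour instead of ten minutes before it is terminated.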

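Summary:
Large versus small, many versus few, long versus short: the right choice keeps changing with the situation. With large data volumes in particular, the settings have to be adapted flexibly to the actual conditions. As the saying goes, the art of application lies in one's own judgment.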
