Common Hadoop errors

Article link: https://blog.csdn.net/madding/article/details/7718192

XML jar conflicts:

https://issues.apache.org/jira/browse/NUTCH-964

Third-party dependencies: for testing, you can first package everything into a single jar:

<!-- Include the dependencies and build one jar -->
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>

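Reading and parsing XML files on Hadoop:

This one is genuinely awkward. In this case the FSDataInputStream object could not be handed directly to Digester.parse(InputStream input) for parsing (even though FSDataInputStream is itself an InputStream), and reading the file at all depends on a Hadoop Configuration.

One workaround is to read the whole file through FSDataInputStream first, then build an input stream or InputSource from the buffered content. The code looks like this: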

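import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

FileSystem fs = null;
FSDataInputStream fsdis = null;
try {
    fs = FileSystem.get(conf);
    Path vcPath = new Path(vcfile);
    if (!fs.exists(vcPath)) {
        logger.error("vcfile not found, filepath=" + vcfile);
    } else {
        fsdis = fs.open(vcPath);

        // Read the whole file into memory; the buffer grows as needed
        // instead of relying on a fixed 100 MB array
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 256];
        int bufferlen;
        while ((bufferlen = fsdis.read(buffer)) != -1) {
            result.write(buffer, 0, bufferlen);
        }

        // Assumption: the file is UTF-8 encoded
        StringReader sr = new StringReader(new String(result.toByteArray(), "UTF-8"));
        InputSource is = new InputSource(sr);
        vc = (XXXObject) digester.parse(is);
    }
} catch (IOException e) {
    logger.error("failed to read vcfile, filepath=" + vcfile, e);
} catch (SAXException e) {
    logger.error("failed to parse vcfile, filepath=" + vcfile, e);
} finally {
    IOUtils.closeQuietly(fsdis);
}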

A more reasonable approach:

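if (CategoryLogic.getNestTree() == null) {
    FileSystem fs = null;
    FSDataInputStream fsdis = null;
    BufferedReader bf = null;
    try {
        fs = FileSystem.get(conf);
        Path vcPath = new Path(vcfile);
        if (fs.exists(vcPath)) {
            fsdis = fs.open(vcPath);
            // Wrap the HDFS stream in a reader and parse straight from it,
            // without buffering the whole file in memory first.
            // InputStreamReader uses the platform default charset here.
            bf = new BufferedReader(new InputStreamReader(fsdis));
            InputSource is = new InputSource(bf);
            CategoryLogic.init(is);
        } else {
            // Only log; do not try to read from a stream that was never opened
            logger.error("vcfile not found, filepath=" + vcfile);
        }
    } catch (IOException e) {
        logger.error("failed to read vcfile, filepath=" + vcfile, e);
    } finally {
        // Close the reader first; this also closes the wrapped stream
        IOUtils.closeQuietly(bf);
        IOUtils.closeQuietly(fsdis);
    }
}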
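
The stream-to-InputSource idea can be tried without a cluster. The following is a minimal JDK-only sketch, not the code from this post: it swaps Digester for the standard DocumentBuilder, and a ByteArrayInputStream stands in for the stream returned by fs.open(path). The class name StreamXmlDemo and the sample XML are made up for illustration. It builds the InputSource from the byte stream rather than from a reader, which lets the parser honor the encoding declared in the XML prolog.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class StreamXmlDemo {
    // Parses XML from any InputStream and returns the root element name.
    static String rootElementName(InputStream in) {
        try {
            // Byte-stream InputSource: the parser decodes the bytes itself
            InputSource source = new InputSource(in);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source);
            return doc.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for fs.open(path); the XML content is invented for the demo
        byte[] xml = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<categories><c id=\"1\"/></categories>")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(rootElementName(new ByteArrayInputStream(xml)));
    }
}
```

Whether the same byte-stream form works with a particular Digester version on HDFS would need to be checked against that setup.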

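Some tasks of a job on Hadoop get killed, but the job result is correct:

This is speculative execution. When Hadoop notices that a task is running too slowly on one node, it starts the same task on other nodes as well. As soon as one attempt finishes, the attempts that are still running are killed. To turn this behavior off, use the following settings: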

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>

<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
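
To make the "first attempt wins, the rest are killed" behavior concrete, here is a small plain-Java analogy. It is not Hadoop code: it uses ExecutorService.invokeAny, which returns the result of one task that completed successfully and cancels the others. The class name SpeculativeDemo, the node names, and the delays are all invented for the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SpeculativeDemo {
    // One attempt of the same task; delayMillis simulates a slow or fast node.
    static Callable<String> attempt(final String node, final long delayMillis) {
        return new Callable<String>() {
            @Override
            public String call() throws Exception {
                Thread.sleep(delayMillis); // interrupted if this attempt is cancelled
                return node;
            }
        };
    }

    // Runs all attempts concurrently and returns the name of the first to finish.
    static String firstToFinish(List<Callable<String>> attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            // invokeAny returns one successful result and cancels the rest
            return pool.invokeAny(attempts);
        } catch (Exception e) {
            throw new IllegalStateException("no attempt completed", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String winner = firstToFinish(Arrays.asList(
                attempt("slow-node", 2000), attempt("fast-node", 50)));
        System.out.println("result taken from " + winner);
    }
}
```

The cancelled attempt corresponds to the killed task attempt seen in the job UI: it did nothing wrong, its result was simply no longer needed.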

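A task is terminated before it finishes running:

This happens when the task exceeds the configured task timeout, that is, it goes longer than the allowed time (in milliseconds) without reading input, writing output, or updating its status. You can raise the following parameter first, then optimize the code: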

<property>
  <name>mapred.task.timeout</name>
  <value>900000</value>
</property>
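
The timeout logic can be pictured as a clock that is reset on every progress report. The sketch below is a simplified model for illustration only, not Hadoop's implementation; the class ProgressWatchdog and its methods are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```java
class ProgressWatchdog {
    private final long timeoutMillis;
    private long lastProgressMillis;

    ProgressWatchdog(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = startMillis;
    }

    // Called whenever the task reports progress; resets the clock.
    void progress(long nowMillis) {
        lastProgressMillis = nowMillis;
    }

    // True once the task has been silent for longer than the timeout.
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastProgressMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        // 900000 ms = 15 minutes, the value used in the configuration above
        ProgressWatchdog w = new ProgressWatchdog(900000L, 0L);
        System.out.println(w.isTimedOut(600000L));  // false: 10 minutes of silence
        w.progress(600000L);                        // task reports progress
        System.out.println(w.isTimedOut(1400000L)); // false: 800 s since last report
        System.out.println(w.isTimedOut(1600000L)); // true: 1000 s since last report
    }
}
```

In a real job, a long computation inside a map or reduce call can report progress itself (Reporter.progress() in the old API, context.progress() in the new one), which is often a better fix than only raising the timeout.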

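When a MapReduce job has to load and parse an external file, the task needs a fair amount of memory. For the vc parsing, for example, the default of 512m was not enough. The following parameter adjusts the heap size of the task JVMs: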

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1G</value>
</property>
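
A quick way to confirm that the new heap setting actually reached the task JVM is to print the maximum heap from inside the task. This is a minimal sketch using only the JDK; the class name HeapCheck is made up, and where you call it from (for example a mapper's setup code) is up to you.

```java
class HeapCheck {
    // Maximum heap the current JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

The reported value is usually somewhat below the -Xmx figure, because the JVM excludes part of the heap from what it reports as usable.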

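Setting the number of map and reduce tasks:

The total work in a job is more or less fixed, so overall speed depends on how many map and reduce tasks can run at the same time. Adjusting the task counts to what the cluster can run concurrently can speed up the job as a whole. Note that mapred.map.tasks is only a hint: the actual number of map tasks is determined by the input splits.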

<property>
  <name>mapred.map.tasks</name>
  <value>200</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>60</value>
</property>
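
A rough way to reason about these numbers is to count "waves": how many rounds it takes to run all tasks when only a limited number can run at once. The helper below is a back-of-the-envelope sketch; the class TaskWaves is hypothetical, and the slot counts (80 and 60) are invented, since the cluster capacity is not given here.

```java
class TaskWaves {
    // Number of rounds needed to run `tasks` tasks when at most `slots` run at once.
    static int waves(int tasks, int slots) {
        if (tasks < 0 || slots <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and slots > 0");
        }
        return (tasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        // 200 map tasks on a cluster assumed to have 80 map slots
        System.out.println(waves(200, 80)); // 3
        // 60 reduce tasks on a cluster assumed to have 60 reduce slots
        System.out.println(waves(60, 60));  // 1
    }
}
```

If the last wave is mostly empty (for example 3 waves where the third runs only 40 of 80 slots), picking a task count that is a multiple of the slot count tends to use the cluster more evenly.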