How to Access External Jars and Data Files in MapReduce

Note: all code and configuration parameters in this article are based on a Hadoop 2.5.0-cdh5.2.0 environment.

A MapReduce (MR) program often needs to access external files, such as external jars or data files. For the former, one option is to copy the jars into Hadoop's lib directory on every node (in this article's CDH environment, the actual path is /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/lib/). This is clearly clumsy, especially on a cluster with many nodes. For the latter, one option is to serialize the file's contents into the Configuration and deserialize them in the MR code. This is not advisable either once the data grows beyond a few megabytes, because the Configuration is shipped to every task.
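For comparison, here is a minimal sketch of the Configuration route for small values; the key my.small.data and the variable smallDataAsString are made-up names:

// In the driver, before creating the Job -- the value travels to every
// task as part of the serialized job configuration:
final Configuration conf = getConf();
conf.set("my.small.data", smallDataAsString);
final Job job = Job.getInstance(conf);

// In Mapper.setup() or Reducer.setup():
final String data = context.getConfiguration().get("my.small.data");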

Both shortcomings can be remedied with one of the two methods described below. A point up front: both methods are built on Hadoop's Distributed Cache (DC) mechanism. Introductions to the DC are a quick web search away, so it is not covered here.

Method 1: the files and libjars options of GenericOptionsParser

For example: hadoop jar TestLibJar.jar -libjars sqljdbc4-1.0.jar -files mydata.dat /tmp/input /tmp/output

The options take local files. Note that these generic options are parsed by GenericOptionsParser (wired in through ToolRunner in the code below), so they must appear before the application's own arguments. On job submission the files are uploaded to HDFS under ${yarn.app.mapreduce.am.staging-dir}/${user}/.staging/${job_id}, and from there distributed and cached on each NodeManager under ${yarn.nodemanager.local-dirs}/usercache/${user}/filecache. The HDFS copies are removed automatically when the job finishes, but the cached copies on the NodeManagers may not be cleaned up right away, because their eviction is decided by several parameters acting together, e.g. yarn.nodemanager.localizer.cache.target-size-mb.

The files specified on the command line above can then be accessed from the MR code, for example:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestDistruteCache extends Configured implements Tool
{
    private static final Logger LOG = LoggerFactory.getLogger(TestDistruteCache.class);

    // Reads the first line of a cached file. Only the file name is used,
    // because the Distributed Cache symlinks cached files into the task's
    // working directory.
    public static void readFile(final String file)
    {
        try
        {
            final Path path = new Path(file);
            final BufferedReader reader = new BufferedReader(new FileReader(new File(
                path.getName())));
            LOG.info("The first line is: {}", reader.readLine());
            reader.close();
        }
        catch (final Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void operatorSqlServer()
    {
        // Placeholder: talks to a SQL Server database via JDBC,
        // which is why sqljdbc4-1.0.jar must be on the classpath.
    }

    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context) throws IOException
        {
            operatorSqlServer();    // accesses SQL Server, so it needs sqljdbc4-1.0.jar on the classpath
            readFile("mydata.dat"); // access mydata.dat from the Distributed Cache
        }

        @Override
        public void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException
        {
            context.write(key, value);
        }
    }

    public static class Reduce extends Reducer<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context)
        {
            operatorSqlServer();    // accesses SQL Server, so it needs sqljdbc4-1.0.jar on the classpath
            readFile("mydata.dat"); // access mydata.dat from the Distributed Cache
        }

        @Override
        public void reduce(final LongWritable key, final Iterable<Text> values,
                           final Context context) throws IOException, InterruptedException
        {
            for (final Text value : values)
            {
                context.write(key, value);
            }
        }
    }

    @Override
    public int run(final String[] args) throws Exception
    {
        final Configuration conf = getConf();

        final Job job = Job.getInstance(conf);
        job.setJarByClass(TestDistruteCache.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(final String[] args) throws Exception
    {
        ToolRunner.run(new Configuration(), new TestDistruteCache(), args);
    }
}

In the example above, one jar and one data file are specified on the command line, and the code accesses both. On my test cluster both files show up under /yarn/nm/usercache/root/filecache/, and while the job is running they can also be seen in HDFS under /user/root/.staging/.

To pass multiple files per option, simply separate the file names with commas, as shown below.
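For example (a.jar, b.jar, data1.dat and data2.dat are made-up file names):

hadoop jar TestLibJar.jar -libjars a.jar,b.jar -files data1.dat,data2.dat /tmp/input /tmp/output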

Method 2: no command-line options, using the DC API directly

This method takes HDFS files specified in code, and distributes and caches them on each NodeManager under ${yarn.nodemanager.local-dirs}/filecache; the cache eviction mechanism is the same as in method one. The external files are, naturally, also accessed through code. Enough talk, here is the code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestDistruteCache extends Configured implements Tool
{
    private static final Logger LOG = LoggerFactory.getLogger(TestDistruteCache.class);

    // Reads the first line of a cached file identified by its URI. Only the
    // file name is used, because the Distributed Cache symlinks cached files
    // into the task's working directory.
    public static void readFile(final URI file)
    {
        try
        {
            final Path path = new Path(file);
            final BufferedReader reader = new BufferedReader(new FileReader(new File(
                path.getName())));
            LOG.debug("The first line is: {}", reader.readLine());
            reader.close();
        }
        catch (final Exception e)
        {
            e.printStackTrace();
        }
    }

    // Reads the first line of the file cached under the alias test.dat
    // (the alias is the fragment after '#' in the URI added in run()).
    public static void readFile()
    {
        try
        {
            final Path path = new Path("test.dat");
            final BufferedReader reader = new BufferedReader(new FileReader(new File(
                path.getName())));
            LOG.debug("The first line is: {}", reader.readLine());
            reader.close();
        }
        catch (final Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void operatorSqlServer()
    {
        // Placeholder: talks to a SQL Server database via JDBC,
        // which is why sqljdbc4-1.0.jar must be on the classpath.
    }

    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context)
        {
            operatorSqlServer(); // accesses SQL Server, so it needs sqljdbc4-1.0.jar on the classpath
            readFile();          // access mydata1.dat through its alias test.dat
        }

        @Override
        public void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException
        {
            context.write(key, value);
        }
    }

    public static class Reduce extends Reducer<LongWritable, Text, LongWritable, Text>
    {
        private URI[] caches = null;

        @Override
        public void setup(final Context context)
        {
            try
            {
                operatorSqlServer(); // accesses SQL Server, so it needs sqljdbc4-1.0.jar on the classpath
                caches = context.getCacheFiles(); // URIs come back in the order they were added in run()
                readFile(caches[1]); // access mydata2.dat, the second cache file
            }
            catch (final IOException e)
            {
                e.printStackTrace();
            }
        }

        @Override
        public void reduce(final LongWritable key, final Iterable<Text> values,
                           final Context context) throws IOException, InterruptedException
        {
            for (final Text value : values)
            {
                context.write(key, value);
            }
        }
    }

    @Override
    public int run(final String[] args) throws Exception
    {
        final Configuration conf = getConf();

        final Job job = Job.getInstance(conf);
        job.setJarByClass(TestDistruteCache.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.addCacheFile(new URI("/tmp/mydata1.dat#test.dat"));
        job.addCacheFile(new URI("/tmp/mydata2.dat"));
        job.addArchiveToClassPath(new Path("/tmp/sqljdbc4-1.0.jar"));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(final String[] args) throws Exception
    {
        ToolRunner.run(new Configuration(), new TestDistruteCache(), args);
    }
}

The code is broadly similar to method one's, with a few differences:

1. run() gains three lines, which add the two data files and the jar to the DC;

2. The first data file is given an alias (test.dat, the fragment after #), so the new no-argument readFile() can access it directly by that alias.
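Because method two reads from HDFS, the files must be uploaded before the job is submitted. A possible sequence (TestDC.jar is a made-up jar name; the data files and the jar are assumed to sit in the local working directory):

hdfs dfs -put mydata1.dat mydata2.dat sqljdbc4-1.0.jar /tmp/
hadoop jar TestDC.jar /tmp/input /tmp/output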

To sum up:

1. In MR, if the external data file is smaller than a few MB, access it through the Configuration (as sketched at the top of this article); otherwise use one of the two methods described here;

2. In MR, to access an external jar, use one of the two methods described here;

3. The two methods are essentially the same, but note one detail: the files to be distributed live in different places. Method one takes local files, method two takes HDFS files.
