MapReduce Application Development

星月的雨

Categories: Big Data, Hadoop

Article link: https://blog.csdn.net/liu1390910/article/details/79132833

The Configuration API

    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");
    conf.addResource("configuration-2.xml");

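Helper Classes: GenericOptionsParser, Tool, and ToolRunner

GenericOptionsParser is a class that interprets common Hadoop command-line options and sets the corresponding values on a Configuration object as needed.

You don't usually use it directly. It is more convenient to implement the Tool interface and run the application with ToolRunner, which uses GenericOptionsParser internally.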

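import java.util.Map.Entry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {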
  
  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("yarn-default.xml");
    Configuration.addDefaultResource("yarn-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    for (Entry<String, String> entry: conf) {
      System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
    }
    return 0;
  }
  
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}

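Passing -D color=yellow on the hadoop command line (for example, hadoop ConfigurationPrinter -D color=yellow) sets the configuration property color to yellow. Values set with -D take priority over values for the same property in the configuration files.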
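The precedence rule can be illustrated without Hadoop. The following is a minimal sketch that uses java.util.Properties as a stand-in for Configuration. The class name OptionPrecedenceSketch, the resolve() helper, and the sample values are assumptions made for illustration; the real GenericOptionsParser handles many more options than this.

```java
import java.util.Properties;

/** Stand-in for the override rule: values given with -D win over values loaded from files. */
final class OptionPrecedenceSketch {

  /** Builds the effective properties: file values first, then -D overrides on top. */
  static Properties resolve(Properties fromFiles, String[] args) {
    Properties effective = new Properties();
    effective.putAll(fromFiles);
    for (int i = 0; i + 1 < args.length; i++) {
      if (args[i].equals("-D")) {
        String[] keyValue = args[i + 1].split("=", 2);
        if (keyValue.length == 2) {
          effective.setProperty(keyValue[0], keyValue[1]);
        }
        i++; // skip the key=value token that was just consumed
      }
    }
    return effective;
  }

  public static void main(String[] args) {
    Properties fromFiles = new Properties();
    fromFiles.setProperty("color", "blue"); // hypothetical value from a config file
    fromFiles.setProperty("size", "10");    // hypothetical value, not overridden
    Properties effective = resolve(fromFiles, new String[] {"-D", "color=yellow"});
    System.out.println("color=" + effective.getProperty("color"));
    System.out.println("size=" + effective.getProperty("size"));
  }
}
```

Here the file supplies color=blue, but the -D argument replaces it with yellow, while size keeps its file value because nothing overrides it.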

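Writing Unit Tests with MRUnit

MRUnit is a testing library that makes it easy to pass known inputs to a mapper or a reducer and check that the output is as expected.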

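import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws IOException, InterruptedException {
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                              // Temperature ^^^^^
    new MapDriver<LongWritable, Text, Text, IntWritable>()
      .withMapper(new MaxTemperatureMapper())
      .withInput(new LongWritable(0), value)
      .withOutput(new Text("1950"), new IntWritable(-11))
      .runTest();
  }
}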

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class MaxTemperatureReducerTest {

  @Test
  public void returnsMaximumIntegerInValues() throws IOException,
      InterruptedException {
    // The reducer should emit the largest of the values seen for a key
    new ReduceDriver<Text, IntWritable, Text, IntWritable>()
      .withReducer(new MaxTemperatureReducer())
      .withInput(new Text("1950"),
          Arrays.asList(new IntWritable(10), new IntWritable(5)))
      .withOutput(new Text("1950"), new IntWritable(10))
      .runTest();
  }
}

Running Locally on Test Data

Running a job in the local job runner

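import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {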

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }
    
    // Job.getInstance() replaces the deprecated Job constructor
    Job job = Job.getInstance(getConf(), "Max temperature");
    job.setJarByClass(getClass());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    
    return job.waitForCompletion(true) ? 0 : 1;
  }
  
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}

Testing the Driver

1. Run with the local job runner

  @Test
  public void test() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");
    conf.set("mapreduce.framework.name", "local");
    conf.setInt("mapreduce.task.io.sort.mb", 1);
    
    Path input = new Path("input/ncdc/micro");
    Path output = new Path("output");
    
    FileSystem fs = FileSystem.getLocal(conf);
    fs.delete(output, true); // delete old output
    
    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);
    
    int exitCode = driver.run(new String[] {
        input.toString(), output.toString() });
    assertThat(exitCode, is(0));
    
    checkOutput(conf, output);
  }

2. Run on a mini cluster. Hadoop provides a set of testing classes for this: MiniDFSCluster, MiniMRCluster, and MiniYARNCluster.

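Running on a Cluster

The local job runner runs a job in a single JVM.

On a distributed cluster, the job's classes must be packaged into a job JAR:

Call setJarByClass() on JobConf or Job to name a class; Hadoop locates the job JAR that contains it

Call setJar() to set an explicit JAR file by its file path

The client classpath:

The client classpath set up by hadoop jar <jar> is made up of:

The job JAR file

Any JAR files in the lib directory of the job JAR

The classpath defined by HADOOP_CLASSPATH, if set

The task classpath, which is not affected by HADOOP_CLASSPATH:

The job JAR file

Any JAR files in the lib directory of the job JAR

Any files added to the distributed cache with the -libjars option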

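Launching a Job

The -conf option specifies the cluster to run the job on, for example -conf hadoop-cluster.xml on the hadoop command line.

MapReduce job IDs are generated from the application IDs created by the YARN resource manager. An application ID has two parts:

The resource manager's start time and a counter: in application_124324234234234_003, 003 means this is the third application

Job ID: replace application with job, giving job_124324234234234_003

Task ID: task_124324234234234_003_m_00003 is the fourth map task (numbering starts at 0) of the job with ID job_124324234234234_003

Task attempt ID: attempt_124324234234234_003_m_00003_0, where the trailing 0 means the first attempt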
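The relationship between these IDs is plain string manipulation, which the sketch below spells out. The class JobIdSketch and its methods are hypothetical helpers, not part of Hadoop. The five-digit task padding simply follows the examples in this post, so check the width your cluster actually prints. The attempt_ prefix is the one Hadoop uses for task attempts.

```java
/** Derives job, task, and task attempt IDs from an application ID (illustrative only). */
final class JobIdSketch {

  /** application_<start>_<counter> becomes job_<start>_<counter>. */
  static String jobId(String applicationId) {
    String prefix = "application_";
    if (!applicationId.startsWith(prefix)) {
      throw new IllegalArgumentException("not an application ID: " + applicationId);
    }
    return "job_" + applicationId.substring(prefix.length());
  }

  /** type is "m" for a map task or "r" for a reduce task; index is 0-based. */
  static String taskId(String jobId, String type, int index) {
    String suffix = jobId.substring("job_".length());
    return "task_" + suffix + "_" + type + "_" + String.format("%05d", index);
  }

  /** attempt is 0-based, so 0 is the first attempt. */
  static String attemptId(String taskId, int attempt) {
    return "attempt_" + taskId.substring("task_".length()) + "_" + attempt;
  }

  public static void main(String[] args) {
    String job = jobId("application_124324234234234_003");
    String task = taskId(job, "m", 3); // fourth map task
    System.out.println(job);
    System.out.println(task);
    System.out.println(attemptId(task, 0));
  }
}
```

Running main prints job_124324234234234_003, task_124324234234234_003_m_00003, and attempt_124324234234234_003_m_00003_0, one per line.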

The MapReduce Web UI

Job history files are stored in HDFS by the MapReduce application master.

The directory they are stored in is set with mapreduce.jobhistory.done-dir.

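Retrieving the Results

Each reducer produces one output file. The following command gets all the files in the directory specified by the source pattern and merges them into a single file on the local filesystem: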

hadoop fs -getmerge max-temp max-temp-local

sort max-temp-local | tail

Alternatively, use the -cat option to print the output files to the console.

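Debugging a Job

When debugging a job, always ask whether a counter can give you the information you need to find out what is happening.

If the volume of log data is large, there are two options:

One is to write the information to the map's output, rather than to standard error, so that the reduce tasks can analyze and aggregate it

The other is to write a program that analyzes the logs produced by the job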

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Counter for suspiciously high readings
  enum Temperature {
    OVER_100
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      // Temperatures are in tenths of a degree, so 1000 means 100 degrees
      if (airTemperature > 1000) {
        System.err.println("Temperature over 100 degrees for input: " + value);
        context.setStatus("Detected possibly corrupt record: see logs.");
        context.getCounter(Temperature.OVER_100).increment(1);
      }
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    }
  }
}

Handling Malformed Data

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {
  
  enum Temperature {
    MALFORMED
  }

  private NcdcRecordParser parser = new NcdcRecordParser();
  
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    
    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}
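Both mapper versions depend on an NcdcRecordParser that this post does not show. The following is a minimal stdlib-only sketch of what such a parser might look like. The class name NcdcRecordParserSketch is hypothetical, it takes a String rather than a Text so that it runs without Hadoop, and the fixed-width offsets and the accepted quality codes are assumptions based on the sample record in the mapper test above.

```java
/** Sketch of a fixed-width NCDC record parser; offsets are assumed, not taken from this post. */
final class NcdcRecordParserSketch {
  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private boolean malformed;
  private String quality;

  /** Extracts year, temperature, and quality code; does not guard against short records. */
  void parse(String record) {
    year = record.substring(15, 19);
    malformed = false;
    char sign = record.charAt(87);
    if (sign == '+') {
      // Skip the plus sign and parse the four digits that follow
      airTemperature = Integer.parseInt(record.substring(88, 92));
    } else if (sign == '-') {
      airTemperature = Integer.parseInt(record.substring(87, 92));
    } else {
      malformed = true; // neither '+' nor '-' where the sign should be
    }
    quality = record.substring(92, 93);
  }

  boolean isValidTemperature() {
    return !malformed
        && airTemperature != MISSING_TEMPERATURE
        && quality.matches("[01459]");
  }

  boolean isMalformedTemperature() {
    return malformed;
  }

  String getYear() {
    return year;
  }

  int getAirTemperature() {
    return airTemperature;
  }

  public static void main(String[] args) {
    NcdcRecordParserSketch parser = new NcdcRecordParserSketch();
    parser.parse("0043011990999991950051518004+68750+023550FM-12+0382"
        + "99999V0203201N00261220001CN9999999N9-00111+99999999999");
    System.out.println(parser.getYear() + " " + parser.getAirTemperature());
  }
}
```

For the sample record, main prints 1950 -11, which matches the key and value the mapper test expects. A record whose sign position holds anything other than + or - is flagged as malformed, which is the case the second mapper counts.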

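Hadoop Logs

Log aggregation: YARN can collect the task logs of completed applications and move them to HDFS. The service is disabled by default; enable it by setting yarn.log-aggregation-enable to true on the cluster. Once it is enabled, the logs can be viewed by clicking the logs link in the web UI or from the command line.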

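import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingIdentityMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>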
  extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  
  private static final Log LOG = LogFactory.getLog(LoggingIdentityMapper.class);
  
  @Override
  @SuppressWarnings("unchecked")
  public void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // Log to stdout file
    System.out.println("Map key: " + key);
    
    // Log to syslog file
    LOG.info("Map key: " + key);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Map value: " + value);
    }
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}

The default log level is INFO, so DEBUG messages do not appear in the syslog task log file. To see them for map tasks, set mapreduce.map.log.level to DEBUG.

Remote Debugging

Tuning a Job

Number of mappers
Number of reducers
Combiners
Intermediate compression
Custom serialization
Shuffle tweaks

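Profiling Tasks

You have to run the job repeatedly, with and without a code change, and check whether the change brings a significant improvement.

Some problems (such as running out of memory) can be reproduced only on the cluster.

1. The HPROF profiler

Set mapreduce.task.profile to true to enable profiling.

It can provide valuable information about a program's CPU and heap usage.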

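MapReduce Workflows

Add more jobs, rather than adding complexity to existing jobs.

For more complex problems, consider a higher-level language such as Pig, Hive, or Spark.

JobControl

When a MapReduce workflow contains more than one job, decide whether the jobs form a linear chain or a more complex directed acyclic graph (DAG). For a linear chain, simply run each job after the previous one:

JobClient.runJob(conf1);

JobClient.runJob(conf2);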
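For a DAG rather than a linear chain, each job may start only once every job it depends on has finished. The sketch below computes such a run order with a standard topological sort. It is an illustration of the idea only and is not the JobControl API; the class WorkflowOrderSketch and the job names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Orders jobs so that each one comes after all the jobs it depends on. */
final class WorkflowOrderSketch {

  /** deps maps a job name to the names of the jobs that must finish first. */
  static List<String> runOrder(Map<String, List<String>> deps) {
    Map<String, Integer> remaining = new LinkedHashMap<>();
    Map<String, List<String>> dependents = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : deps.entrySet()) {
      remaining.put(entry.getKey(), entry.getValue().size());
      for (String dependency : entry.getValue()) {
        dependents.computeIfAbsent(dependency, k -> new ArrayList<>()).add(entry.getKey());
      }
    }
    // Jobs with no unfinished dependencies are ready to run
    Deque<String> ready = new ArrayDeque<>();
    for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
      if (entry.getValue() == 0) {
        ready.add(entry.getKey());
      }
    }
    List<String> order = new ArrayList<>();
    while (!ready.isEmpty()) {
      String job = ready.remove();
      order.add(job);
      for (String next : dependents.getOrDefault(job, Collections.<String>emptyList())) {
        if (remaining.merge(next, -1, Integer::sum) == 0) {
          ready.add(next);
        }
      }
    }
    if (order.size() != deps.size()) {
      throw new IllegalStateException("cycle or unknown dependency in workflow");
    }
    return order;
  }

  public static void main(String[] args) {
    Map<String, List<String>> deps = new LinkedHashMap<>();
    deps.put("ingest", Collections.<String>emptyList());
    deps.put("clean", Arrays.asList("ingest"));
    deps.put("stats", Arrays.asList("clean"));
    deps.put("report", Arrays.asList("clean", "stats"));
    System.out.println(runOrder(deps));
  }
}
```

A linear chain is the special case in which every job depends only on the one before it. A real workflow runner would also stop scheduling dependents when a job fails, which this sketch leaves out.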

Apache Oozie

A system for running workflows made up of jobs that depend on one another.
