Running Hadoop in Standalone Mode, from "Hadoop: The Definitive Guide"

BuerAkun1024

Category: Hadoop. Tags: hadoop

Article link: https://blog.csdn.net/qq1641070658/article/details/106555058

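Analyzing Data with Hadoop

Express the query as a MapReduce job, test it locally, then deploy it to a cluster.

Two phases:

Both phases take key-value pairs as input and produce key-value pairs as output. With plain-text input, the key passed to the map function is the offset of the beginning of the line from the beginning of the file, and the value is the line itself.

Map phase: data preparation
- Drop corrupted data: filter out records that are missing, suspect, or erroneous.
- Extract the year and the air temperature, and emit them as the output.
- The MapReduce framework processes the map output (sorting and grouping it by key) before sending it to the reduce function.
Reduce phase: the algorithm
- The input has already been sorted and grouped by key: the key is a year and the values are all the temperature readings for that year.
- Find the maximum temperature for each year.
- Output: (year, maximum temperature for that year)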
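The two phases can be sketched in plain Java without any Hadoop classes. This is only an illustration under assumptions: the class name `MaxTemperatureSketch`, its helper methods, and the sample (year, temperature) pairs are hypothetical, and the in-memory `TreeMap` merely stands in for the sort-and-group step that the MapReduce framework performs between the map and reduce functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxTemperatureSketch {

    // Stand-in for the framework's shuffle: sort by key and collect the values per key.
    static Map<String, List<Integer>> groupByYear(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: pick the maximum of all temperatures recorded for one year.
    static int maxOf(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Groups the pairs a map phase would emit, then reduces each group to its maximum.
    static Map<String, Integer> maxPerYear(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groupByYear(mapOutput).entrySet()) {
            result.put(group.getKey(), maxOf(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical map output: (year, temperature) pairs.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("1950", 0),
            Map.entry("1950", 22),
            Map.entry("1950", -11),
            Map.entry("1949", 111),
            Map.entry("1949", 78));

        // Prints one "year<TAB>max" line per year, in key order.
        maxPerYear(mapOutput).forEach((year, max) -> System.out.println(year + "\t" + max));
    }
}
```

Starting from `Integer.MIN_VALUE` rather than 0 matters here, because a year whose readings are all negative would otherwise report a maximum of 0.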

Java MapReduce

The map function, implemented by a Mapper class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs (before Java 7)
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Write an output record only if the temperature is present and
    // the quality code indicates the temperature reading is OK.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.

Input key: a long integer offset (LongWritable)

Input value: a line of text (Text)

Output key: the year (Text)

Output value: the air temperature (IntWritable)

Hadoop provides its own basic types, all of which are in the org.apache.hadoop.io package:

LongWritable --> Java Long
Text --> Java String
IntWritable --> Java Integer

The map() method is passed a key and a value. The Context instance is used to write the output.
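The fixed-width parsing inside map() can be tried in isolation, without Hadoop on the classpath. The sketch below is an assumption-laden illustration: `NcdcFieldSketch`, `syntheticLine`, `parseYear`, and `parseTemperature` are hypothetical names, and the input lines are synthetic filler built only to place the year, the signed temperature, and the quality code at the column positions the mapper reads. They are not real NCDC records.

```java
public class NcdcFieldSketch {

    private static final int MISSING = 9999;

    // Builds a synthetic 93-character line (NOT real NCDC data): filler zeros with the
    // year at columns 15-18, the signed temperature at 87-91, and the quality code at 92.
    static String syntheticLine(String year, String signedTemp, char quality) {
        StringBuilder sb = new StringBuilder("0".repeat(93));
        sb.replace(15, 19, year);       // year must be 4 characters
        sb.replace(87, 92, signedTemp); // sign plus 4 digits, e.g. "+0022"
        sb.setCharAt(92, quality);
        return sb.toString();
    }

    static String parseYear(String line) {
        return line.substring(15, 19);
    }

    // Returns the temperature, or null when the reading is missing
    // or its quality code is not one of 0, 1, 4, 5, 9.
    static Integer parseTemperature(String line) {
        int start = line.charAt(87) == '+' ? 88 : 87; // skip a leading plus sign
        int airTemperature = Integer.parseInt(line.substring(start, 92));
        String quality = line.substring(92, 93);
        boolean valid = airTemperature != MISSING && quality.matches("[01459]");
        return valid ? airTemperature : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            syntheticLine("1950", "+0022", '1'), // valid positive reading
            syntheticLine("1950", "-0011", '1'), // valid negative reading
            syntheticLine("1949", "+9999", '1'), // missing reading, would be dropped
            syntheticLine("1949", "+0050", '2')  // suspect quality code, would be dropped
        };
        for (String line : lines) {
            System.out.println(parseYear(line) + "\t" + parseTemperature(line));
        }
    }
}
```

Returning null for a rejected record is a choice made for this sketch only; the real mapper simply does not call context.write() for such records.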

The reduce function

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

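The Reducer class also takes four formal type parameters. Its input types must match the output types of the map function: Text and IntWritable.

Code to run the job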

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

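public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      // Both an input path and an output path must be given.
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = Job.getInstance(); // replaces the deprecated new Job() constructor
    // Hadoop uses the class passed here to locate the JAR file that contains it.
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    // The input path can be a single file or a directory; call addInputPath()
    // more than once to read from multiple paths.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // The output directory must not exist before the job runs, which
    // protects against accidentally overwriting earlier results.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);   // map class
    job.setReducerClass(MaxTemperatureReducer.class); // reduce class

    job.setOutputKeyClass(Text.class);          // output key type
    job.setOutputValueClass(IntWritable.class); // output value type

    // waitForCompletion() returns true on success and false on failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}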

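A Job object forms the specification of the job and gives you control over how the job is run.

A test run

A small dataset is enough for a test run on a single machine.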

% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output
# Run these commands from the directory that contains the example code.
attempt_local26392882_0001_m_000000_0
# ID of the map task attempt
attempt_local26392882_0001_r_000000_0
# ID of the reduce task attempt
# The last section of the output, titled "Counters," shows the statistics that Hadoop
# generates for each job it runs. These are very useful for checking whether the amount
# of data processed is what you expected. For example, we can follow the number of
# records that went through the system: five map input records produced five map output
# records (since the mapper emitted one output record for each valid input record), then
# five reduce input records in two groups (one for each unique key) produced two reduce
# output records.
% cat output/part-r-00000
1949 111
1950 22
# The final output
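The record counts described in the "Counters" note can be reproduced with a small plain-Java tally. This is a sketch under assumptions, not Hadoop's counter implementation: `CounterSketch`, `Reading`, and `tally` are hypothetical names, the five sample readings are illustrative values chosen to be consistent with the final output shown above, a null temperature stands for a record the mapper would discard, and the counter labels are used here only as map keys.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CounterSketch {

    // One input record; a null temperature marks a reading the mapper would discard.
    static final class Reading {
        final String year;
        final Integer temperature;

        Reading(String year, Integer temperature) {
            this.year = year;
            this.temperature = temperature;
        }
    }

    // Counts records at each stage of a map -> group -> reduce run (no combiner).
    static Map<String, Integer> tally(List<Reading> input) {
        int mapOutput = 0;
        Map<String, Integer> maxPerYear = new TreeMap<>();
        for (Reading r : input) {
            if (r.temperature == null) {
                continue; // invalid record: counted as map input, never emitted
            }
            mapOutput++;
            maxPerYear.merge(r.year, r.temperature, Integer::max);
        }
        Map<String, Integer> counters = new LinkedHashMap<>();
        counters.put("Map input records", input.size());
        counters.put("Map output records", mapOutput);
        counters.put("Reduce input records", mapOutput);          // every map output reaches a reducer
        counters.put("Reduce input groups", maxPerYear.size());   // one group per unique key
        counters.put("Reduce output records", maxPerYear.size()); // one output record per group
        return counters;
    }

    public static void main(String[] args) {
        List<Reading> sample = List.of(
            new Reading("1950", 0),
            new Reading("1950", 22),
            new Reading("1950", -11),
            new Reading("1949", 111),
            new Reading("1949", 78));
        // With five valid readings across two years, the tallies are 5, 5, 5, 2, 2.
        tally(sample).forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

If the map input count is higher than the map output count in a real run, the difference is the number of records the mapper filtered out as missing or suspect.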

Hadoop: The Definitive Guide, 3rd edition (Chinese edition)
Extraction code: 7fzt