1. Dataset and Program Requirements
The dataset ramen-ratings.txt records the brand, country, packaging style, rating, and other attributes of 2,580 instant noodle products from around the world. Use MapReduce to aggregate the ratings and output:
- the average rating per country (i.e., which country makes the best-tasting instant noodles)
Dataset download: https://github.com/ordinaryload/Hadoop-tools
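The file is tab-separated. Judging from the parsing code in section 2.4, each record has at least six fields, with the country in the fifth field and the star rating (which may be the literal string "Unrated") in the sixth. A purely illustrative, made-up line, with <TAB> standing for the tab character, might look like:
1<TAB>SomeBrand<TAB>Spicy Chicken Flavor<TAB>Cup<TAB>Japan<TAB>3.75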
2. Writing the Source Code
2.1 Open IntelliJ IDEA and create a Maven project.
2.2 The pom.xml file is as follows:
<dependencies>
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.mrunit/mrunit -->
    <dependency>
        <groupId>org.apache.mrunit</groupId>
        <artifactId>mrunit</artifactId>
        <version>1.1.0</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>2.5</version>
        <type>pom</type>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>3.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>3.1.0</version>
    </dependency>
</dependencies>
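If the job is later submitted to the cluster with hadoop jar (see section 2.6), the project also needs to be packaged. A minimal sketch of a <build> section using the standard maven-jar-plugin; the plugin version and the main class entry are assumptions, adjust them to your project:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>3.2.0</version>
            <configuration>
                <archive>
                    <manifest>
                        <!-- assumed driver class; prepend your package name if you use one -->
                        <mainClass>NoodlesAVGRun</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>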
2.4 Implementing the Mapper
- Logic:
Input: one line of the dataset
Processing: split the line into fields on the tab character, take the country and the star rating, and emit them as the key and value.
Output: <country, rating>
- Code:
import java.io.IOException;
import java.util.logging.Logger;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NoodlesMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    // JDK logging is used here; swap in your preferred logging framework if desired
    private static Logger logger = Logger.getLogger(NoodlesMapper.class.getName());

    /**
     * @param key     input key (byte offset of the line in the file); must match generic type 1
     * @param value   input value (the text of the line); must match generic type 2
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the tab-separated line into fields
        String[] strs = value.toString().split("\t");
        // Country (5th field)
        String nation = strs[4];
        logger.info("nation:" + nation);
        double v = 0.0;
        // Drop records without a numeric rating
        if (strs[5].equals("Unrated")) {
            return;
        } else {
            v = Double.parseDouble(strs[5].trim());
        }
        logger.info("stars ============>>> " + v);
        context.write(new Text(nation), new DoubleWritable(v));
    }
}
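Because the pom already pulls in MRUnit and JUnit, the mapper can be unit-tested without a cluster. A rough sketch under src/test/java (the test class, its name, and the sample line are my own; also note that MRUnit 1.1.0 is published with hadoop1/hadoop2 classifiers, so the dependency above may additionally need <classifier>hadoop2</classifier> to resolve):
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class NoodlesMapperTest {

    @Test
    public void emitsCountryAndStars() throws IOException {
        // Made-up, tab-separated input line: review#, brand, variety, style, country, stars
        String line = "1\tSomeBrand\tSpicy Chicken Flavor\tCup\tJapan\t3.75";
        MapDriver.newMapDriver(new NoodlesMapper())
                .withInput(new LongWritable(0), new Text(line))
                .withOutput(new Text("Japan"), new DoubleWritable(3.75))
                .runTest();
    }
}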
2.5 Implementing the Reducer
- Logic:
Input: <country, [all ratings for that country]>
Processing: compute the average rating
Output: <country, average rating>
- Code:
import java.io.IOException;
import java.util.Iterator;
import java.util.logging.Logger;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NoodlesReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private static Logger logger = Logger.getLogger(NoodlesReducer.class.getName());

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        Iterator<DoubleWritable> val = values.iterator();
        while (val.hasNext()) {
            sum += val.next().get(); // accumulate the total rating
            count++;                 // count the number of ratings for this country
        }
        double avg = sum / count;
        logger.info("avg = " + avg);
        context.write(key, new DoubleWritable(avg));
    }
}
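The reducer can be checked the same way with MRUnit's ReduceDriver; a rough sketch (the test class and the sample values are my own):
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class NoodlesReducerTest {

    @Test
    public void averagesAllRatingsForOneCountry() throws IOException {
        ReduceDriver.newReduceDriver(new NoodlesReducer())
                .withInput(new Text("Japan"),
                        Arrays.asList(new DoubleWritable(4.0), new DoubleWritable(3.5)))
                .withOutput(new Text("Japan"), new DoubleWritable(3.75))
                .runTest();
    }
}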
2.6 Implementing the Driver (Run)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class NoodlesAVGRun extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new NoodlesAVGRun(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        //System.setProperty("HADOOP_USER_NAME", "root"); // user name on the Hadoop machine
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.137.150:9000");
        // Create the job
        Job job = Job.getInstance(conf, "NoodlesAVGRun");
        // Set the job's main class
        job.setJarByClass(NoodlesAVGRun.class);
        // Set the Mapper and Reducer classes
        job.setMapperClass(NoodlesMapper.class);
        job.setReducerClass(NoodlesReducer.class);
        // Input format: plain text files
        job.setInputFormatClass(TextInputFormat.class);
        //TextInputFormat.addInputPath(job, new Path(args[0]));
        TextInputFormat.addInputPath(job, new Path("/ramen-ratings.txt"));
        // Output format: plain text files; key is Text, value is DoubleWritable
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        //TextOutputFormat.setOutputPath(job, new Path(args[1]));
        TextOutputFormat.setOutputPath(job, new Path("/test/output"));
        // Run the MapReduce job and wait for it to finish
        boolean res = job.waitForCompletion(true);
        if (res) {
            System.out.println("Job completed successfully");
            return 0;
        } else {
            return -1;
        }
    }
}
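To run the job end to end (the paths and NameNode address follow the hard-coded values above; the jar name is an assumption): upload the dataset with hdfs dfs -put ramen-ratings.txt /, make sure /test/output does not already exist (the output format refuses to overwrite an existing output directory), then either run NoodlesAVGRun directly from the IDE or submit the packaged jar with hadoop jar <your-jar> NoodlesAVGRun, and finally view the per-country averages with hdfs dfs -cat /test/output/part-r-00000.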
Execution results: