Big Data from Beginner to Practice: Merging and Deduplicating File Contents

Latest recommended article published on 2023-12-18 15:47:20

是草莓熊吖

Tags: java, hadoop, big data

Article link: https://blog.csdn.net/qq_61604164/article/details/126018797

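The Mapper class

First, let's look at the Mapper:

When writing a MapReduce program, you write a class that extends Mapper. Mapper is a generic type with four type parameters, which specify the types of the map() function's input key, input value, output key, and output value. In the example from the first stage, the input key is a long integer, the input value is a line of text, the output key is a word, and the output value is the number of times the word appears.

Rather than using Java's built-in types directly, Hadoop provides its own set of basic types optimized for network serialization. These types live in the org.apache.hadoop.io package. Here we use LongWritable (the counterpart of Java's Long), Text (the counterpart of String), and IntWritable (the counterpart of Integer).

The input to map() is one key and one value. We usually start by converting the Text value, which holds one line of input, to a Java String, and then process it with string-handling classes or other methods.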
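To make the shape of a map() call concrete, here is a minimal sketch in plain Java, with no Hadoop dependency. It assumes the word-count setting described above; the class name MapSketch and the use of Map.Entry in place of Hadoop's output collector are illustrative choices, not Hadoop's API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

class MapSketch {
    /**
     * Hypothetical stand-in for a word-count map() call: takes the input key
     * (a long) and the input value (one line of text), and returns one
     * (word, 1) pair per whitespace-separated token.
     */
    static List<Map.Entry<String, Integer>> map(long key, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> pair : map(0L, "hello hadoop hello")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

In a real Mapper the same loop would call context.write() with Text and IntWritable objects instead of collecting pairs in a list.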

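The Reducer class

Likewise, Reducer has four type parameters that specify its input and output types. The input types of reduce() must match the output types of map(), namely Text and IntWritable. In this example the output types of reduce() are also Text and IntWritable, holding the word and its count respectively.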
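The matching reduce step can be sketched the same way in plain Java. This assumes the word-count example, where all counts emitted for one word arrive together; ReduceSketch and its signature are hypothetical and only mirror the key and value types described above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

class ReduceSketch {
    /**
     * Hypothetical stand-in for a word-count reduce() call: the key is a word,
     * the values are all counts emitted for that word, and the result is a
     * single (word, total) pair.
     */
    static Map.Entry<String, Integer> reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return new SimpleEntry<String, Integer>(word, sum);
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> result = reduce("hello", Arrays.asList(1, 1));
        System.out.println(result.getKey() + "\t" + result.getValue());
    }
}
```

Note how the input types here (a String key with Integer values) are exactly the output types of the map step, which is the matching rule the paragraph above describes.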

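The Job class

We normally run a MapReduce job through a Job object, which specifies how the job is to be executed and lets us control the whole run. To run the job on a Hadoop cluster, the code must be packaged into a JAR file, which Hadoop distributes across the cluster. There is no need to name the JAR file explicitly: pass a class to the Job object's setJarByClass() method, and Hadoop uses that class to locate the JAR file that contains it. FileInputFormat.addInputPath() and FileOutputFormat.setOutputPath() specify the job's input and output paths. Note that the output path must not exist before the program runs; otherwise Hadoop refuses to run the job.

Finally, we call waitForCompletion() to submit the job and wait for it to finish. Its only parameter is a boolean: when it is true, the job prints its progress to the console. The method also returns a boolean indicating whether the job succeeded.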
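The two rules above (the output path must not already exist, and the run reports success as a boolean) can be imitated on the local file system. The sketch below is only an analogy in plain Java: JobContractSketch and run() are hypothetical names, it does not use Hadoop at all, and returning false for an existing directory is a simplification chosen for this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class JobContractSketch {
    /**
     * Hypothetical stand-in for a job run: refuses to start if the output
     * directory already exists, otherwise creates it and reports success.
     */
    static boolean run(Path output) {
        if (Files.exists(output)) {
            System.err.println("Output directory " + output + " already exists");
            return false;
        }
        try {
            Files.createDirectories(output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        // A fresh path under the system temp directory, so the first run can succeed.
        Path output = Paths.get(System.getProperty("java.io.tmpdir"),
                "job-sketch-" + System.nanoTime(), "output");
        boolean first = run(output);   // the directory did not exist yet
        boolean second = run(output);  // rejected: the directory is there now
        System.out.println(first + " " + second);
        // Same convention as the exercise code: exit code 0 on success, 1 on failure.
        System.exit(first ? 0 : 1);
    }
}
```

In practice this means a second run of the same job needs either a new output path or the old output directory removed first.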

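Programming requirements

Next, let's do an exercise to consolidate what we have learned about MapReduce.

Given two input files, file1 and file2, write a MapReduce program that merges the two files and removes duplicate content, producing a new output file, file3. To complete the merge-and-deduplicate task, your program must combine different files that contain duplicate content into one consolidated file with no duplicates, according to the following rules:

The first column is sorted by student ID;
For the same student ID, entries are sorted in the order x, y, z;
The input file path is /user/tmp/input/;
The output path is /user/tmp/output/.

Note: the input files have already been created for you in the background; you do not need to create them again.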
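Before writing the MapReduce version, it can help to see the expected result on a tiny input. The following plain-Java sketch applies the rules above to two in-memory lists; the sample student IDs and values are made up for illustration, MergeSketch is a hypothetical name, and the tab between the two columns is an assumption based on Hadoop's default text output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

class MergeSketch {
    /** Merges lines of "id value" pairs, drops duplicates, sorts by id and then by value. */
    static List<String> mergeAndDedup(List<String> file1, List<String> file2) {
        // The TreeMap keeps IDs sorted; each TreeSet keeps one ID's values sorted and unique.
        Map<String, TreeSet<String>> grouped = new TreeMap<>();
        List<String> all = new ArrayList<>(file1);
        all.addAll(file2);
        for (String line : all) {
            StringTokenizer itr = new StringTokenizer(line);
            if (itr.countTokens() < 2) {
                continue; // skip blank or malformed lines
            }
            String id = itr.nextToken();
            String value = itr.nextToken();
            grouped.computeIfAbsent(id, k -> new TreeSet<String>()).add(value);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, TreeSet<String>> entry : grouped.entrySet()) {
            for (String value : entry.getValue()) {
                out.add(entry.getKey() + "\t" + value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not the real contents of file1 and file2.
        List<String> file1 = Arrays.asList("20150101 x", "20150102 y", "20150103 x");
        List<String> file2 = Arrays.asList("20150101 y", "20150102 y", "20150103 x");
        for (String line : mergeAndDedup(file1, file2)) {
            System.out.println(line);
        }
    }
}
```

The sorting here is plain string comparison, which matches numeric order only as long as all student IDs have the same number of digits.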

Code file

import java.io.IOException;

import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Merge {

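	/**
	 * Merges two files, A and B, and removes duplicate content,
	 * producing a new output file C.
	 */
	// Override map() here: split each input line into whitespace-separated tokens and
	// emit the first token as the output key and the second as the output value.
	// Note that map() must declare: throws IOException, InterruptedException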
	public static class Map  extends Mapper<Object, Text, Text, Text>{
	
    /********** Begin **********/

        @Override
        public void map(Object key, Text value, Context context) 
            throws IOException, InterruptedException {  
            Text text1 = new Text();
            Text text2 = new Text();
            StringTokenizer itr = new StringTokenizer(value.toString());
            // Read tokens in pairs; stop at an unpaired trailing token so that
            // nextToken() never throws NoSuchElementException on a malformed line.
            while (itr.countTokens() >= 2) {
                text1.set(itr.nextToken());
                text2.set(itr.nextToken());
                context.write(text1, text2);
            }
        }  
	/********** End **********/
	} 
		
	// Override reduce() here: copy the input key to the output key and emit each distinct value once.
	// Note that reduce() must declare: throws IOException, InterruptedException
	public static class  Reduce extends Reducer<Text, Text, Text, Text> {
    /********** Begin **********/
        
        public void reduce(Text key, Iterable<Text> values, Context context) 
            throws IOException, InterruptedException {
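            // A TreeSet drops duplicate values and keeps the rest in sorted order.
            Set<String> set = new TreeSet<String>();
            for(Text tex : values){
                set.add(tex.toString());
            }
            // Emit one (key, value) pair per distinct value.
            for(String tex : set){
                context.write(key, new Text(tex));
            }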
        }  
    
	/********** End **********/

	}
	
	public static void main(String[] args) throws Exception{

		Configuration conf = new Configuration();
		// "fs.default.name" is the deprecated name of "fs.defaultFS"; it selects the default file system.
		conf.set("fs.default.name","hdfs://localhost:9000");
		
		Job job = Job.getInstance(conf,"Merge and duplicate removal");
		job.setJarByClass(Merge.class);
		job.setMapperClass(Map.class);
		job.setCombinerClass(Reduce.class);
		job.setReducerClass(Reduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		String inputPath = "/user/tmp/input/";  // set the input path here
		String outputPath = "/user/tmp/output/";  // set the output path here

		FileInputFormat.addInputPath(job, new Path(inputPath));
		FileOutputFormat.setOutputPath(job, new Path(outputPath));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

Start Hadoop, then click Test.