MapReduce Deduplication: Packaging a Jar and Running It on the Server (Class Notes, 202111)

Jasonwx123

Category: Hadoop. Tags: hadoop

Article link: https://blog.csdn.net/JsonWuxin/article/details/112061191

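1. Removing duplicate data with MapReduce

Example: create a file named 1.txt with the following contents:
A 1
a 2
b 23
a 2
c 34
Expected output (the repeated line "a 2" appears only once):
A 1
a 2
b 23
c 34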
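
The idea behind the job can be tried locally before touching the cluster. The sketch below is plain Java, not Hadoop code: the class name DedupSketch and the method dedup are made up for illustration, and a TreeMap stands in for the shuffle phase, which groups identical keys and sorts them. It uses the sample lines from 1.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local simulation of map -> shuffle -> reduce; no Hadoop involved.
public class DedupSketch {

    public static List<String> dedup(List<String> lines) {
        // "Map" + "shuffle": each whole line becomes a key with an empty value.
        // The sorted map groups identical keys together, as the shuffle would.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        // "Reduce": every distinct key is visited once; emit it, ignore the values.
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> input = List.of("A 1", "a 2", "b 23", "a 2", "c 34");
        dedup(input).forEach(System.out::println);
    }
}
```

Because the whole line is the key, "A 1" and "a 2" count as different records, and only the exact repeat of "a 2" is dropped.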

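2. Preparation

Prepare the file 1.txt.
Upload 1.txt (a local file) to the HDFS file system [hdfs dfs -put].
Package the IDEA project as a jar.
Run the hadoop jar command [hadoop jar jar_name class_name hdfs_input_file hdfs_output_path].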

3. Details

Maven local repository
Jar package

4. Code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class DuplicationMapReduce {

    // Mapper: emits each input line as the key with an empty value,
    // so identical lines end up grouped under the same key.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text EMPTY = new Text("");

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            context.write(value, EMPTY);
        }
    }

    // Reducer: called once per distinct key, so writing the key once removes duplicates.
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        private static final Text EMPTY = new Text("");

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            context.write(key, EMPTY);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(DuplicationMapReduce.class); // locate the jar by this class
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // args[0]: HDFS input file; args[1]: HDFS output path (must not exist yet)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // An uncaught exception now ends the JVM with a non-zero exit code
        // instead of being printed and ignored.
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
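
One detail about the result file: the job does not set an output format, so the default text output applies. As far as I understand it, that format writes the key, then a tab and the value whenever the value is not NullWritable. Since the reducer emits an empty Text value, each output line should end with a tab. The sketch below only models that rule in plain Java. OutputLineSketch and formatLine are hypothetical names, and null stands in for NullWritable.

```java
// Hypothetical model of how a text output line is assembled; not Hadoop code.
public class OutputLineSketch {

    // value == null plays the role of NullWritable: only the key is written.
    // Any other value, even an empty string, is preceded by a tab separator.
    public static String formatLine(String key, String value) {
        return value == null ? key : key + "\t" + value;
    }

    public static void main(String[] args) {
        // Replace the tab with a visible marker so the difference shows up on screen.
        System.out.println(formatLine("a 2", "").replace("\t", "<TAB>"));
        System.out.println(formatLine("a 2", null).replace("\t", "<TAB>"));
    }
}
```

If the trailing tab is unwanted, a common variant is to declare the output value type as NullWritable and write NullWritable.get() instead of an empty Text. That changes the job's type parameters, so treat it as an option rather than part of these notes.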

5. Upload the jar

Upload the packaged jar to the server.

6. Testing

Upload 1.txt to the HDFS root directory:
hdfs dfs -put 1.txt /

Run the job with /1.txt as input and /2021 as the output directory:
hadoop jar Projectwx-1.0-SNAPSHOT.jar DuplicationMapReduce /1.txt /2021

View the result:
hdfs dfs -cat /2021/part-r-00000
