MapReduce Computation


四风


Tags: Big Data, Java, Hadoop

Article link: https://blog.csdn.net/weixin_42776949/article/details/86484089

1. Concepts:
The MapReduce framework carries out Hadoop's data processing. The data passes through five stages.
Data flow: input ---> split ---> map ---> shuffle ---> reduce (reduce writes the final output)

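1.1 input: the data to be processed is loaded and cut into blocks of 64 MB each, which makes the later computation easier. (64 MB was the default block size in Hadoop 1.x; Hadoop 2.x and later default to 128 MB.)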


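1.2 split: the blocks from input are sliced line by line into key-value pairs.
The starting offset of each line is the output key, and the content of the line is the output value.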
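
The key-value pairs produced by split can be sketched in plain Java. This is only an illustration and uses no Hadoop classes: the class name LineOffsets and the sample text are made up, and it assumes the offset is counted in UTF-8 bytes and lines end with a single "\n".

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; Hadoop's own record reader is not used here.
final class LineOffsets {

    // Returns one (byte offset of the line start, line text) pair per line.
    static List<Map.Entry<Long, String>> split(String block) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : block.split("\n")) {
            records.add(new SimpleEntry<>(offset, line));
            // Advance by the line's UTF-8 byte length plus 1 for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : split("hello world\nhello hadoop\n")) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
    }
}
```

For the two sample lines the keys are 0 and 12, because "hello world" takes 11 bytes plus one byte for the newline.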

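1.3 map: each slice (line) from split is processed into key-value pairs. For word count, the line is broken into words; each word becomes an output key, and the count 1 becomes its output value.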


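1.4 shuffle: pairs with the same key are brought together. A word that occurred once gets [1], a word that occurred twice gets [1, 1]; the value is a list in which every element is the constant 1.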
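
To make the map and shuffle stages concrete, here is a small plain-Java simulation. It is a sketch of the idea only, not how Hadoop implements the shuffle: the class ShuffleSketch and its method names are invented for this example, and a TreeMap stands in for the grouping and sorting by key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory simulation; no Hadoop classes are involved.
final class ShuffleSketch {

    // Map step: emit (word, 1) for every whitespace-separated word in a line.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: collect the values of identical keys, ordered by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(mapLine("hello world"));
        pairs.addAll(mapLine("hello hadoop"));
        System.out.println(shuffle(pairs)); // {hadoop=[1], hello=[1, 1], world=[1]}
    }
}
```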


1.5 reduce: the result set from shuffle is processed.
For word count, the values of each key (the list of 1s) are summed, which gives the number of occurrences of each word.
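
The summing done by reduce can also be tried without a cluster. The following plain-Java sketch assumes the shuffled data is already available as a map from word to list of 1s; the class ReduceSketch and the sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the reduce stage for word count.
final class ReduceSketch {

    // Reduce step for one key: add up all the 1s collected for that word.
    static int reduce(List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    // Apply reduce to every key of the shuffled input.
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> shuffled) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffled.entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        shuffled.put("hello", Arrays.asList(1, 1));
        shuffled.put("world", Arrays.asList(1));
        shuffled.put("hadoop", Arrays.asList(1));
        System.out.println(reduceAll(shuffled)); // {hadoop=1, hello=2, world=1}
    }
}
```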


1.6 output


2. MapReduce development
Preparation:

Create a new MapReduce project: wordcountdemo

Add the configuration files: core-site.xml, log4j.xml

Create a class, wordcountJob (write the mapper, write the reducer, create the job and run it)

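2.1 Map development
Requirements:
1. The class is static (a static nested class of wordcountJob);

2. It extends Hadoop's Mapper parent class;

3. It overrides map().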

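// Imports for the complete wordcountJob class (they go at the top of the file):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Nested inside the wordcountJob class:
public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Every word is emitted with the count 1; the objects are reused across calls.
    private final IntWritable valueOut = new IntWritable(1);
    private final Text keyOut = new Text();

    @Override
    public void map(Object keyIn, Text valueIn, Context ctx)
            throws IOException, InterruptedException {
        // Break the line into whitespace-separated words.
        StringTokenizer token = new StringTokenizer(valueIn.toString());
        while (token.hasMoreTokens()) {
            keyOut.set(token.nextToken());
            // Emit (word, 1); errors propagate to the framework instead of being swallowed.
            ctx.write(keyOut, valueOut);
        }
    }
}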

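2.2 Reduce development
Requirements:
1. The class is static (a static nested class of wordcountJob);

2. It extends Hadoop's Reducer parent class;

3. It overrides reduce().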

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text keyIn, Iterable<IntWritable> valuesIn, Context ctx)
            throws IOException, InterruptedException {
        Text keyOut = keyIn;
        // Output value
        IntWritable valueOut = new IntWritable();
        int sum = 0;
        // Iterate over the shuffled list of numbers, e.g. [1, 1, 1, 1]
        for (IntWritable val : valuesIn) {
            sum += val.get(); // accumulate
        }
        valueOut.set(sum); // wrap the total in an IntWritable

        ctx.write(keyOut, valueOut); // emit (word, total) as the job output
    }
}

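2.3 Creating and starting the job

Steps:
1. Load the HDFS configuration files (they configure the HDFS access entry point)
2. Create a job and set the job's main class
3. Set the job's custom static mapper class
4. Set the job's custom static reducer class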

public static void main(String[] args)
        throws IOException, ClassNotFoundException, InterruptedException {
    // Create the job and run it

    // 1. Load the HDFS configuration files (they configure the HDFS access entry point)
    Configuration conf = new Configuration();

    // 2. Create a job and set the job's main class
    Job job = Job.getInstance(conf);
    job.setJarByClass(wordcountJob.class);

    // 3. Set the job's custom static mapper class
    job.setMapperClass(WordCountMapper.class);
    // 4. Set the job's custom static reducer class
    job.setReducerClass(WordCountReducer.class);

    // Configure the types of the final (reduce) output
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Location of the data the MapReduce job reads (overall input location)
    Path inputPath = new Path("hdfs://node1:9000/input/*.txt");
    FileInputFormat.addInputPath(job, inputPath);

    // Location where the MapReduce job saves its result (overall output location);
    // this directory must not exist before the job runs
    Path outputPath = new Path("hdfs://node1:9000/output/wc10");
    FileOutputFormat.setOutputPath(job, outputPath);

    // Start the job and exit with 0 on success, 1 on failure
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

3. Hadoop data types
String: Text, the counterpart of Java's String. Text stores its content as bytes (UTF-8 encoded).
Text -----> String
Text t: convert to String with t.toString()
String -----> Text

Text t = new Text(string)
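
Because Text holds UTF-8 bytes, its getLength() reports a byte count, which can differ from String.length(). The difference can be seen with the JDK alone; the sketch below uses no Hadoop classes, and the class name Utf8Length and the sample strings are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing character count with UTF-8 byte count.
final class Utf8Length {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "hadoop";
        String chinese = "\u4e2d\u6587"; // two Chinese characters
        System.out.println(ascii.length() + " chars, " + utf8Bytes(ascii) + " bytes");     // 6 chars, 6 bytes
        System.out.println(chinese.length() + " chars, " + utf8Bytes(chinese) + " bytes"); // 2 chars, 6 bytes
    }
}
```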

Integer: IntWritable, the counterpart of Java's Integer
IntWritable to int
e.g. IntWritable a:
int b = a.get(); // conversion

int to IntWritable
e.g. IntWritable a = new IntWritable(b);
or
IntWritable a = new IntWritable(); a.set(b);

Long integer: LongWritable, the counterpart of Java's long
