Java MapReduce in Practice: Implementing Basic Computations

涛濤

Category: Hadoop. Tags: Hadoop, mapreduce, hdfs

Article link: https://blog.csdn.net/admin1973/article/details/62037603

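Important notes:

1. When running the code, make sure your local Hadoop version matches the version on the server; otherwise you will run into many unexpected problems.

2. The input data must not contain blank lines between records.

3. If you parse strings with StringTokenizer, the fields must be separated by whitespace (its default delimiters); otherwise parsing fails.

4. The input data can stay on the local file system or be uploaded to HDFS; just point the job at the corresponding directory Path.

Source code download: http://download.csdn.net/detail/admin1973/9780247

hadoop.dll for running locally (download): http://download.csdn.net/detail/admin1973/9780240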
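Note 3 can be checked in isolation. The following is a minimal sketch in plain Java (no Hadoop needed); the class name TokenizerDemo and the helper tokens are hypothetical names chosen for illustration. It shows that StringTokenizer splits on whitespace by default and treats a comma-separated line as a single token unless a delimiter is passed explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Splits a line using StringTokenizer's default delimiters
    // (space, tab, newline, carriage return, form feed).
    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Splits a line using an explicit set of delimiter characters.
    public static List<String> tokens(String line, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("2012-3-1 a"));       // two tokens
        System.out.println(tokens("2012-3-1,a"));       // one token: ',' is not a default delimiter
        System.out.println(tokens("2012-3-1,a", ","));  // two tokens again
    }
}
```

If the input uses another separator, passing it as the second constructor argument is usually simpler than reformatting the data.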

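1. Data Deduplication

Data deduplication is mainly an exercise in using parallelism to filter data in a meaningful way. Seemingly complex tasks, such as counting the number of distinct kinds of data in a large data set or extracting visitor locations from website logs, all involve deduplication. The MapReduce program for this example is designed below.

1.1 Example Description

Remove the duplicates from the data files. Each line of a data file is one record.

Sample input:

1) file1: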

2012-3-1 a

2012-3-2 b

2012-3-3 c

2012-3-4 d

2012-3-5 a

2012-3-6 b

2012-3-7 c

2012-3-3 c

2) file2:

2012-3-1 b

2012-3-2 a

2012-3-3 b

2012-3-4 d

2012-3-5 a

2012-3-6 c

2012-3-7 d

2012-3-3 c

Sample output:

2012-3-1 a

2012-3-1 b

2012-3-2 a

2012-3-2 b

2012-3-3 b

2012-3-3 c

2012-3-4 d

2012-3-5 a

2012-3-6 b

2012-3-6 c

2012-3-7 c

2012-3-7 d

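1.2 Design Approach

The goal of deduplication is that any record appearing more than once in the raw data appears exactly once in the output. The natural idea is to send all occurrences of the same record to the same reduce task: however many times the record occurs, it is written to the final result once. Concretely, the reduce input should use the record itself as the key, with no requirement on the value-list. When reduce receives a <key, value-list> pair, it copies the key to the output key and sets the output value to empty.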

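In the MapReduce flow, the <key, value> pairs emitted by map are grouped into <key, value-list> by the shuffle phase and then handed to reduce. Working backwards from the reduce input designed above, the map output key must be the record, and the value can be anything. In this example each record is one line of an input file, so with Hadoop's default input format (where the value passed to map is the line content) the map phase only has to emit its input value as the output key, with an arbitrary output value. After the shuffle, reduce ignores how many values each key has: it copies the input key to the output key and writes it with an empty value.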
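The data flow described above can be traced without a cluster. The following is a minimal sketch in plain Java that imitates map, shuffle, and reduce in memory; it is not the Hadoop job itself, and the class name DedupSimulation and the method dedup are hypothetical names. A TreeMap stands in for the shuffle because it groups equal keys and keeps them sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DedupSimulation {
    // Imitates map -> shuffle -> reduce for deduplication on in-memory lines.
    public static List<String> dedup(List<List<String>> files) {
        // Shuffle: group the map outputs by key, with keys kept in sorted order.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (List<String> file : files) {
            for (String line : file) {
                // Map: emit <line, ""> (the value carries no information).
                grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
            }
        }
        // Reduce: write each key once, regardless of the size of its value-list.
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList(
                "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c");
        List<String> file2 = Arrays.asList(
                "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c");
        // Prints the 12 distinct lines of the sample output, in sorted order.
        dedup(Arrays.asList(file1, file2)).forEach(System.out::println);
    }
}
```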

1.3 Program Code

The program code is as follows:

package com.sct.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Created by leitao on 2017/3/13.
 * Data deduplication
 */
public class Dedup {
    // map copies the input value (one line of text) to the output key and emits it
    public static class Map extends Mapper<Object,Text,Text,Text>{
        // Reusable empty value; only the key matters
        private static final Text EMPTY = new Text("");

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            context.write(value, EMPTY);
        }
    }
    // reduce copies the input key to the output key and emits it once
    public  static  class  Reduce extends Reducer<Text,Text,Text,Text>{
        private static final Text EMPTY = new Text("");

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // The number of values is irrelevant: each distinct key is written exactly once
            context.write(key, EMPTY);
        }
    }

    public static void main(String[] args) throws  Exception{
//        System.setProperty("hadoop.home.dir", "F:\\hadoop-2.7.3");
//        Configuration configuration = new Configuration();
//        // This line is essential
//        configuration.set("mapred.job.tracker","192.168.113.130:9001");
//        String[] ioArgs = new String[]{"dedup_in","dedup_out"};
//        String[] otherArgs = new GenericOptionsParser(configuration,ioArgs).getRemainingArgs();
//        if(otherArgs.length!=2){
//            System.err.println("Usage: Data Deduplication <in> <out>");
//            System.exit(2);
//        }
        // Input path
        String dst = "hdfs://192.168.113.130:9000/dedup_in";
        // Output path: it must not exist yet; even an empty directory is rejected.
        String dstOut = "hdfs://192.168.113.130:9000/dedup_out";
        Configuration configuration = new Configuration();
//        System.setProperty("hadoop.home.dir", "F:\\hadoop-2.7.3");
        System.setProperty("hadoop.home.dir", "C:\\Users\\Administrator\\Desktop\\hadoop-2.7.3\\hadoop-2.7.3");
        configuration.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
        );
        configuration.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName()
        );

        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(configuration,"Data Deduplication");
        job.setJarByClass(Dedup.class);
        // Set the Mapper, Combiner and Reducer classes
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // Set the output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Set the input and output paths
        FileInputFormat.addInputPath(job,new Path(dst));
        FileOutputFormat.setOutputPath(job, new Path(dstOut));

        System.exit(job.waitForCompletion(true) ? 0 : 1);


    }

}
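The driver above registers Reduce as the Combiner as well. The following is a rough sketch in plain Java of what that buys, assuming each sample file is handled by its own map task; the class name CombinerEffect and its methods are hypothetical, and the code only imitates the data flow rather than calling Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CombinerEffect {
    // Combiner: deduplicate the output of a single map task before the shuffle.
    public static List<String> combine(List<String> mapOutput) {
        return new ArrayList<>(new TreeSet<>(mapOutput));
    }

    // Reducer: deduplicate across everything that arrives from all map tasks.
    public static List<String> reduce(List<List<String>> shuffled) {
        Set<String> keys = new TreeSet<>();
        for (List<String> part : shuffled) {
            keys.addAll(part);
        }
        return new ArrayList<>(keys);
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList(
                "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c");
        List<String> file2 = Arrays.asList(
                "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
                "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c");
        List<String> c1 = combine(file1);
        List<String> c2 = combine(file2);
        System.out.println("records shuffled without combiner: " + (file1.size() + file2.size())); // 16
        System.out.println("records shuffled with combiner: " + (c1.size() + c2.size()));          // 15
        System.out.println("distinct records after reduce: " + reduce(Arrays.asList(c1, c2)).size()); // 12
    }
}
```

Reusing the reducer as the combiner is safe here because removing duplicates twice gives the same result as removing them once. Hadoop may run a combiner zero or more times, so a job has to produce the correct result without it.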

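2. Data Sorting

Data sorting is the first step of many real tasks, such as ranking student grades or building an index over data. Like deduplication, this example performs an initial pass over the raw data to prepare it for further processing. The example follows.

2.1 Example Description

Sort the data in the input files. Each line of an input file contains one number, which is one record. Each output line must contain two numbers separated by whitespace: the first is the rank of the record within the whole data set, and the second is the record itself.

Sample input:

1) file1: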

2

32

654

32

15

756

65223

2) file2:

5956

22

650

92

3) file3:

26

54

6

Sample output:

1    2

2    6

3    15

4    22

5    26

6    32

7    32

8    54

9    92

10    650

11    654

12    756

13    5956

14    65223

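2.2 Design Approach

This example only asks for the input to be sorted. Readers familiar with MapReduce will quickly recall that the framework already sorts keys as part of the MapReduce process. Can that built-in sort be used instead of implementing one? Yes.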

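Before relying on it, though, you need to know its default ordering. MapReduce sorts by key: if the key is an IntWritable (which wraps an int), keys are ordered numerically; if the key is a Text (which wraps a string), keys are ordered lexicographically.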
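The difference between the two orderings is easy to see on a few of the sample values. The following is a small sketch in plain Java that uses String and Integer as stand-ins for Text and IntWritable; the class name KeyOrdering and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class KeyOrdering {
    // Orders the keys as strings (dictionary order, as with Text keys).
    public static List<String> asText(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // Orders the keys by numeric value (as with IntWritable keys).
    public static List<Integer> asInt(List<String> keys) {
        return keys.stream()
                .map(Integer::parseInt)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("2", "32", "654", "15", "756");
        System.out.println(asText(keys)); // [15, 2, 32, 654, 756]
        System.out.println(asInt(keys));  // [2, 15, 32, 654, 756]
    }
}
```

In dictionary order "15" comes before "2" because the comparison goes character by character, which is why a numeric key type is needed for this task.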

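Given this, the key should be an IntWritable. The map function converts each input line to an IntWritable and emits it as the key (the value is arbitrary). When reduce receives <key, value-list>, it writes the input key as the output value, once for every element in the value-list. The output key (linenum in the code) is a counter shared across all reduce calls that tracks the rank of the current key; it yields a global rank only when the job runs with a single reduce task, which is Hadoop's default. Note that this program configures no Combiner: map and reduce alone are enough to complete the task.

2.3 Program Code

The program code is as follows: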
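The ranking logic can be tried on the sample data without a cluster. The following is a minimal sketch in plain Java that imitates map, shuffle, and a single reduce task in memory; the class name SortSimulation and the method rank are hypothetical, and a TreeMap stands in for the framework's numeric sort of the keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortSimulation {
    // Imitates map -> shuffle -> reduce for ranking numbers with one reduce task.
    public static List<String> rank(List<List<String>> files) {
        // Map + shuffle: parse each line to an int key and count its occurrences;
        // the TreeMap keeps the keys in ascending numeric order.
        Map<Integer, Integer> grouped = new TreeMap<>();
        for (List<String> file : files) {
            for (String line : file) {
                grouped.merge(Integer.parseInt(line.trim()), 1, Integer::sum);
            }
        }
        // Reduce: write the key once per value, prefixed by a running rank.
        List<String> out = new ArrayList<>();
        int linenum = 1;
        for (Map.Entry<Integer, Integer> e : grouped.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                out.add(linenum + "\t" + e.getKey());
                linenum++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("2", "32", "654", "32", "15", "756", "65223");
        List<String> file2 = Arrays.asList("5956", "22", "650", "92");
        List<String> file3 = Arrays.asList("26", "54", "6");
        // Prints 14 lines, from "1 2" up to "14 65223", with 32 appearing at ranks 6 and 7.
        rank(Arrays.asList(file1, file2, file3)).forEach(System.out::println);
    }
}
```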

package com.sct.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

/**
 * Created by leitao on 2017/3/13.
 * Sorting
 */
public class Sort {

    // map converts the input value to an IntWritable and emits it as the output key
    public static class Map extends Mapper<Object,Text,IntWritable,IntWritable>{
        private static final IntWritable ONE = new IntWritable(1);
        private final IntWritable data = new IntWritable();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // trim() tolerates leading or trailing whitespace around the number
            String line = value.toString().trim();
            data.set(Integer.parseInt(line));
            context.write(data, ONE);
        }
    }
    // reduce copies the input key to the output value and writes it
    // once per element of the input value-list;
    // linenum holds the rank of the key (valid with a single reduce task)
    public static class Reduce extends Reducer<IntWritable,IntWritable,IntWritable,IntWritable>{
        private final IntWritable linenum = new IntWritable(1);

        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            for (IntWritable val : values){
                context.write(linenum,key);
                linenum.set(linenum.get()+1);
            }
        }
    }
    public  static  void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        // This line is essential (host:port, matching the address used in Dedup)
        conf.set("mapred.job.tracker","192.168.113.130:9001");
        // Two input directories followed by the output directory
        String[] ioArgs = new String[]{"sort_in","sort_in_1","sort_out"};
        String[] otherArgs = new GenericOptionsParser(conf,ioArgs).getRemainingArgs();
        if (otherArgs.length!=3){
            System.err.println("Usage: Data Sort <in1> <in2> <out>");
            System.exit(3);
        }
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(conf,"Data Sort");
        job.setJarByClass(Sort.class);
        // Set the Mapper and Reducer classes
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Set the output key and value types
        job.setOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        // Set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0])); // several input directories can be added
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));

        FileOutputFormat.setOutputPath(job,new Path(otherArgs[2]));
        boolean flag = job.waitForCompletion(true);
        if (flag){
            System.out.println("Sort succeeded!");
        }else {
            System.out.println("Sort failed!");
        }
        // Exit with 0 on success and 1 on failure
        System.exit(flag ? 0 : 1);
    }

}
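The rank counter in Reduce relies on every key passing through the same reduce task. The sketch below, in plain Java without the Hadoop libraries, suggests why adding reducers would break this. hashPartition mirrors the formula of Hadoop's default HashPartitioner (the hash code of an IntWritable is its int value); rangePartition and its boundary of 100 are hypothetical and exist only for comparison.

```java
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {
    // Same formula as Hadoop's default HashPartitioner, applied to an int key.
    public static int hashPartition(int key, int numReduceTasks) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical range split for two reducers:
    // keys below the boundary go to reducer 0, all others to reducer 1.
    public static int rangePartition(int key, int boundary) {
        return key < boundary ? 0 : 1;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(2, 6, 15, 22, 26, 32, 54, 92, 650, 654, 756, 5956, 65223);
        for (int key : keys) {
            System.out.println(key + " -> hash partition " + hashPartition(key, 2)
                    + ", range partition " + rangePartition(key, 100));
        }
    }
}
```

With two reducers, hash partitioning splits the sample keys by parity, so neither output file is a contiguous slice of the sorted order. A range split keeps the files in order relative to each other, but each reduce task would still start counting from 1, so the rank column would need separate handling.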