Building a Hadoop Cluster from Scratch, Part 3

crazyskady

Categories: Big Data Basics, hadoop. Tags: hadoop, hadoop cluster, java

Article link: https://blog.csdn.net/crazyskady/article/details/75174956

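Picking up from the previous post: after reading a few articles I tried writing two small Hadoop programs that process files. It is a little embarrassing, since other people were playing with this four or five years ago and I am only now getting around to learning it, and stumbling along at that. >_<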

Computing an average

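We have two files holding some people's Chinese (语文) and math (数学) scores. Each line contains a name, a subject, and a score, separated by spaces:
testAvg.txt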

张三 语文 88
李四 语文 77
王五 语文 66
张三 数学 90
李四 数学 79
王五 数学 68

testAvg2.txt

赵六 语文 88
赵六 数学 90

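First create a directory named scoreAvg in HDFS and upload the two data files into it with the -put command (for example, hadoop fs -mkdir /scoreAvg followed by hadoop fs -put testAvg.txt testAvg2.txt /scoreAvg).
Then add a new class file to the testProject created at the end of the previous post.
The content of the file is simple: it just implements Hadoop's map and reduce functions.
The map function, as I understand it, processes the input one line at a time, turns each line into a Key/Value pair, and hands the pairs back to Hadoop, which groups them by Key.
The reduce function receives all the values that share one Key and processes them together; inside it there is no need to think about any other Key.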
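
To make the map, group, reduce flow concrete, here is a minimal sketch in plain Java with no Hadoop involved. The class and method names (MapReduceSketch, mapLine, groupByKey, reduceAverage) are made up for illustration, the names and subjects in the sample lines are romanized, and the TreeMap only stands in for the grouping that Hadoop does between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSketch {
    // "Map" step: turn one input line into a (name, score) pair.
    static Map.Entry<String, Integer> mapLine(String line) {
        String[] fields = line.trim().split("\\s+");
        return Map.entry(fields[0], Integer.parseInt(fields[2]));
    }

    // Stand-in for Hadoop's shuffle: collect the values of each Key into one group.
    static Map<String, List<Integer>> groupByKey(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> pair = mapLine(line);
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // "Reduce" step: sees only the values that belong to a single Key.
    static int reduceAverage(List<Integer> scores) {
        int sum = 0;
        for (int s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "ZhangSan Chinese 88", "LiSi Chinese 77",
                "ZhangSan Math 90", "LiSi Math 79");
        // Print each Key with its grouped values and the reduced result.
        groupByKey(lines).forEach((name, scores) ->
                System.out.println(name + "\t" + scores + " -> " + reduceAverage(scores)));
    }
}
```

The point of the sketch is the shape of the data at each stage: lines become pairs, pairs become one list per Key, and the reduce step only ever looks at one of those lists.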

The Map class:

    // Uses java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.* and org.apache.hadoop.mapred.*
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // With TextInputFormat each call receives a single line, so no extra split on "\n" is needed
            StringTokenizer fields = new StringTokenizer(value.toString());
            if (fields.countTokens() < 3) {
                // Skip blank or malformed lines instead of failing with NoSuchElementException
                return;
            }
            // Student name
            String studentName = fields.nextToken();
            // Subject name; not needed in this example, read only to move past it
            fields.nextToken();
            // Score
            int score = Integer.parseInt(fields.nextToken());

            // Emit the score with studentName as the Key, so that all scores of
            // the same person end up in the same reduce call
            output.collect(new Text(studentName), new IntWritable(score));
        }
    }

Now for Reduce:

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        // The input to reduce is one Key plus an iterator, i.e. the group of values for that Key
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int scoreSum = 0;
            int subjectCounter = 0;

            while (values.hasNext()) {
                // Add up all of this student's scores and count the subjects
                scoreSum += values.next().get();
                subjectCounter++;
            }
            // Average; this is integer division, so any fractional part is dropped
            int scoreAvg = scoreSum / subjectCounter;
            // Emit the result
            output.collect(key, new IntWritable(scoreAvg));
        }
    }

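Finally, write a main function that sets up the job configuration. Most of the setters are self-explanatory. Do not leave out the call to BasicConfigurator.configure(): it gives log4j a default console appender, and without it (unless a log4j configuration file is on the classpath) most of Hadoop's error messages are never printed: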

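    public static void main(String[] args) throws Exception {
        // configure() is important! Important! Important!
        // Worth saying three times: without this call many of Hadoop's error messages
        // are not shown, and tracking down a failure becomes guesswork
        BasicConfigurator.configure();
        JobConf job = new JobConf(AvgScore.class);

        job.setJobName("AvgScore");
        // Address of HDFS (newer Hadoop releases call this key fs.defaultFS)
        job.set("fs.default.name", "hdfs://192.168.245.128:9000");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        // Reduce is deliberately not registered as the combiner: it emits an average,
        // and the average of partial averages is in general not the overall average
        //job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        // Input and output paths. The job fails if the output directory already exists,
        // so delete it before rerunning
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.245.128:9000/scoreAvg"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.245.128:9000/scoreAvgOutput"));
        // runJob is static and builds its own JobClient, so no separate client object is needed
        JobClient.runJob(job);
    }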
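
One thing worth checking for this job is whether the averaging Reduce class is safe to use as a combiner. A combiner pre-reduces the output of each map task, so the reducer would end up averaging partial averages. The sketch below uses plain Java and invented scores (the class CombinerAverageCheck and the two "tasks" are hypothetical) to compare the two paths. With the sample files above the problem stays hidden, assuming one map task per file, because each student's two scores sit in the same file.

```java
import java.util.List;

class CombinerAverageCheck {
    // Same arithmetic as the Reduce class: integer average of the given scores.
    static int average(List<Integer> scores) {
        int sum = 0;
        for (int s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    public static void main(String[] args) {
        // Hypothetical scores for one student, split across two map tasks.
        List<Integer> taskA = List.of(60, 60, 60); // three scores in the first split
        List<Integer> taskB = List.of(100);        // one score in the second split

        // No combiner: the reducer sees all four scores.
        int direct = average(List.of(60, 60, 60, 100));

        // Averaging reducer reused as combiner: each task pre-averages its own
        // scores, and the reducer then averages the two partial averages.
        int viaCombiner = average(List.of(average(taskA), average(taskB)));

        System.out.println("direct=" + direct + ", viaCombiner=" + viaCombiner);
    }
}
```

This prints direct=70, viaCombiner=80. If a combiner is wanted for this kind of job, one common approach is to have it pass along (sum, count) pairs and divide only in the final reducer.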

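OK, with the code in place, choose Run on Hadoop:
[Screenshot: run]

If nothing goes wrong, the result can be found in the output directory:
[Screenshot: result1]

The result matches what we expect ^_^
[Screenshot: result2]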

Joining two tables

In this example the two files have different formats. The first file maps each person to an area code:

张三 1
李四 1
周五 2
赵六 3

The second file maps each area code to its name (火星 = Mars, 水星 = Mercury, 地球 = Earth):

1 火星
2 水星
3 地球

Our job is to process both tables and output each user together with the name of their area, as in this screenshot:
[Screenshot: result3]

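The code is essentially the same as in the previous example: only the map and reduce functions change.
First the map function. It looks at the first character of each line to decide whether the line comes from the location file or the user file, and handles the two cases differently: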

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            StringTokenizer fields = new StringTokenizer(value.toString());
            if (fields.countTokens() < 2) {
                // Skip blank or malformed lines explicitly rather than swallowing the exception
                return;
            }
            String value0 = fields.nextToken();
            String value1 = fields.nextToken();
            // Whether the first character of the line is a digit 0-9 tells us which file it came from
            if (value0.charAt(0) >= '0' && value0.charAt(0) <= '9') {
                // Location file: the area code is the Key; the name is prefixed with "Location" as a tag
                output.collect(new Text(value0), new Text("Location" + " " + value1));
            } else {
                // User file: the area code is again the Key; the user name is prefixed with "User" as a tag
                output.collect(new Text(value1), new Text("User" + " " + value0));
            }
        }
    }
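
The tagging rule in this mapper can be tried outside Hadoop. The following is a small plain-Java sketch; the class TagLineSketch and the method tagLine are hypothetical names, and the sample values are romanized versions of the data above.

```java
class TagLineSketch {
    // Returns {key, taggedValue} for one input line, mirroring the mapper's rule:
    // a leading digit means "location line", anything else means "user line".
    static String[] tagLine(String line) {
        String[] f = line.trim().split("\\s+");
        char first = f[0].charAt(0);
        if (first >= '0' && first <= '9') {
            // "<code> <locationName>": the code is already the first field
            return new String[] {f[0], "Location " + f[1]};
        }
        // "<userName> <code>": swap so that the code becomes the Key
        return new String[] {f[1], "User " + f[0]};
    }

    public static void main(String[] args) {
        for (String line : new String[] {"1 Mars", "ZhangSan 1"}) {
            String[] kv = tagLine(line);
            System.out.println(kv[0] + " => " + kv[1]);
        }
    }
}
```

Both lines end up under Key 1, one tagged Location and one tagged User, which is what lets the reduce step pair them. The rule is only a heuristic: a user name that happens to start with a digit would be treated as a location line.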

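Now the reduce function. Because the map function uses the area code as the Key, all records with the same code naturally land in the same reduce call, and those records are a mix of location and user entries. In reduce we sort them into two separate collections and then emit every user paired with the location: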

    // Uses java.util.ArrayList and java.util.List
    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Growable lists instead of fixed-length arrays, so an area code shared by
            // more than five users does not overflow
            List<String> users = new ArrayList<String>();
            List<String> locations = new ArrayList<String>();

            while (values.hasNext()) {
                StringTokenizer tokenizer = new StringTokenizer(values.next().toString());
                String userOrLocation = tokenizer.nextToken();
                String currValue = tokenizer.nextToken();

                if (userOrLocation.equals("Location")) {
                    // Location record: keep it in the locations list
                    locations.add(currValue);
                } else {
                    // User record: keep it in the users list
                    users.add(currValue);
                }
            }
            // All records with the same area code arrive in the same reduce call.
            // Output is produced only when this code has both a location name and at
            // least one user; if either list is empty the loops below emit nothing.
            // (There is in fact always exactly one location name per code >_<)
            for (String user : users) {
                for (String location : locations) {
                    output.collect(new Text(user), new Text(location));
                }
            }
        }
    }
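
Putting the two halves together, the whole join can be simulated in plain Java. This is only a sketch under stated assumptions: ReduceSideJoinSketch and join are hypothetical names, the data is romanized, and the TreeMap stands in for Hadoop's grouping by Key, so the order of the output lines is not necessarily what a real job produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {
    static List<String> join(List<String> lines) {
        // "Map" plus shuffle: tag every line and group the tagged values by area code.
        Map<String, List<String>> byCode = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            char first = f[0].charAt(0);
            boolean isLocation = first >= '0' && first <= '9';
            String code = isLocation ? f[0] : f[1];
            String tagged = isLocation ? "Location " + f[1] : "User " + f[0];
            byCode.computeIfAbsent(code, k -> new ArrayList<>()).add(tagged);
        }
        // "Reduce": within one area code, pair every user with every location.
        List<String> out = new ArrayList<>();
        for (List<String> group : byCode.values()) {
            List<String> users = new ArrayList<>();
            List<String> locations = new ArrayList<>();
            for (String v : group) {
                String[] parts = v.split(" ", 2);
                if (parts[0].equals("Location")) {
                    locations.add(parts[1]);
                } else {
                    users.add(parts[1]);
                }
            }
            for (String u : users) {
                for (String l : locations) {
                    out.add(u + "\t" + l);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "ZhangSan 1", "LiSi 1", "ZhouWu 2", "ZhaoLiu 3",
                "1 Mars", "2 Mercury", "3 Earth");
        join(lines).forEach(System.out::println);
    }
}
```

A user whose area code has no entry in the location file simply produces no output line, which matches the behaviour of the reduce function above.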

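Note that in this main function Reduce must not be registered as the combiner class. A combiner runs on the map output before it reaches the reducer, so its output has to look exactly like map output. This Reduce instead emits user names as Keys and drops the Location/User tags, and on a map task that has read only one of the two files it emits nothing at all. I will go into the details once I have worked through Hadoop's basic operations properly >_<.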

    public static void main(String[] args) throws Exception {
        BasicConfigurator.configure();
        JobConf job = new JobConf(FindLocation.class);

        job.setJobName("FindLocation");
        job.set("fs.default.name", "hdfs://192.168.245.128:9000");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        // Using Reduce as the combiner class here as well would give wrong results
        //job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.245.128:9000/findLocation"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.245.128:9000/findLocationOutput"));
        // runJob is static and builds its own JobClient, so no separate client object is needed
        JobClient.runJob(job);
    }
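
Why the combiner has to stay disabled can be seen with a small plain-Java experiment. JoinCombinerCheck and joinReduce are hypothetical names; joinReduce repeats the pairing logic of the join's reduce step, and the input lists imitate what a single map task would hand to a combiner, assuming each map task reads only one of the two files.

```java
import java.util.ArrayList;
import java.util.List;

class JoinCombinerCheck {
    // Same pairing logic as the join's reduce step, applied to the tagged values of one Key.
    static List<String> joinReduce(List<String> taggedValues) {
        List<String> users = new ArrayList<>();
        List<String> locations = new ArrayList<>();
        for (String v : taggedValues) {
            String[] parts = v.split(" ", 2);
            if (parts[0].equals("Location")) {
                locations.add(parts[1]);
            } else {
                users.add(parts[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String u : users) {
            for (String l : locations) {
                out.add(u + "\t" + l);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Values for area code 1 from a map task that read only the user file.
        List<String> userSideOnly = List.of("User ZhangSan", "User LiSi");
        // Run as a combiner, the logic finds no location and emits nothing,
        // so both user records are lost before the real reducer ever runs.
        System.out.println("combiner output: " + joinReduce(userSideOnly));

        // Without a combiner the reducer sees both sides and can pair them.
        List<String> bothSides = List.of("User ZhangSan", "User LiSi", "Location Mars");
        System.out.println("reducer output:  " + joinReduce(bothSides));
    }
}
```

The first call returns an empty list and the second returns the two expected pairs. Averaging and joining fail as combiners for different reasons, but the rule of thumb is the same: a reducer can double as a combiner only if feeding its output back in as input still leads to the right final answer.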

For the complete code, see my GitHub. -_-!!!
