Hadoop Study Notes -- Day 8 -- MapReduce Basic Programming


wudi_1982


Article link: https://blog.csdn.net/wudiit/article/details/17264695


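When I first met MapReduce, it immediately reminded me of the DataStage Orchestrate parallel engine (partition, collection). It felt familiar, and the core idea looks much the same. The difference is that Orchestrate ships a variety of concrete partition and collection methods. Overall, though, they fall into two groups: methods that aim for an even distribution, and key-based methods (which guarantee that records with the same key land in the same partition).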
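The key-based idea (records with the same key always go to the same partition) can be sketched in plain Java. This is only an illustration under stated assumptions, not code from Hadoop or Orchestrate: it hashes a `String` with `String.hashCode()`, and the names `PartitionSketch`, `partitionFor`, and `assign` are made up for the example. The hash-then-modulo arithmetic is the same general scheme a hash partitioner uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {

    // Hash-then-modulo: masking with Integer.MAX_VALUE clears the sign bit,
    // so the result always falls in [0, numPartitions).
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Groups keys by the partition number they would be sent to.
    public static Map<Integer, List<String>> assign(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition.computeIfAbsent(partitionFor(key, numPartitions),
                    p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Hypothetical keys; repeated keys must land under the same partition.
        List<String> keys = Arrays.asList("1001", "1002", "1001", "1003", "1002");
        System.out.println(assign(keys, 2));
    }
}
```

Because the partition depends only on the key, every occurrence of a key is routed to the same place, which is what lets a single reducer see all values for that key.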
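I took two pieces of code from Hadoop in Action (MyJob and CitationHistogram) and worked through them as follows:
1. I made up a few rows of test data. The first column is a patent number and the second column is a patent that it cites. One row, 1001, cites no patent, so its second column is empty. The data is deliberately left unsorted: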

1006,1001
1001
1003,1002
1002,1001
1004,1003
1006,1002
1005,1001
2. Run MyJob with mapred.reduce.tasks=0 so that only the output of the map phase is visible:
hadoop jar /home/hdpuser/IdeaProjects/MyHadooTraining/out/production/MyJob.jar MyJob -D mapred.reduce.tasks=0 /user/hdpuser/testdt/in.txt /user/hdpuser/out/myjob
Note: the map phase swaps key and value: output.collect(value, key);
3. The corresponding output:
[hdpuser@hdpNameNode testdt]$ hadoop fs -cat /user/hdpuser/out/myjob/*
1003 1004
1002 1006
1001 1005
1001 1006
1001
1002 1003
1001 1002
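The map-only output above can be reproduced with a small plain-Java sketch. It assumes the behaviour of KeyValueTextInputFormat as used in this job: each line is split at the first separator, and a line without a separator becomes the key with an empty value. The class and method names (`MapPhaseSketch`, `splitLine`, `swap`) are hypothetical and nothing here touches the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class MapPhaseSketch {

    // Splits a line at the first separator. With no separator present,
    // the whole line becomes the key and the value is empty.
    public static Map.Entry<String, String> splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    // Mirrors output.collect(value, key): the emitted pair is (value, key).
    public static Map.Entry<String, String> swap(Map.Entry<String, String> record) {
        return new SimpleEntry<>(record.getValue(), record.getKey());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1006,1001", "1001", "1003,1002");
        for (String line : lines) {
            Map.Entry<String, String> out = swap(splitLine(line, ','));
            // Key and value are written separated by a tab.
            System.out.println(out.getKey() + "\t" + out.getValue());
        }
    }
}
```

This also explains the lone 1001 in the listing: the input row 1001 has an empty value, so after the swap the key is empty and 1001 is the value.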

4. Look at the corresponding BLK files on the local file system:
[hdpuser@hdpNameNode finalized]$ cat blk_1073742183
1001 1006
1001
1002 1003
1001 1002
[hdpuser@hdpNameNode finalized]$ cat blk_1073742184
1003 1004
1002 1006
1001 1005
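5. Comparing the listings above shows that the data was processed as two files (my Hadoop cluster has two nodes): the first four records of the original file ended up in one block and the last three in the other. With no reduce phase, each map task writes its own output file, so two files correspond to two map tasks.
6. Remove the mapred.reduce.tasks=0 setting and run again (delete the myjob directory first):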
hadoop jar /home/hdpuser/IdeaProjects/MyHadooTraining/out/production/MyJob.jar MyJob /user/hdpuser/testdt/in.txt /user/hdpuser/out/myjob
7. Check the final result:
[hdpuser@hdpNameNode testdt]$ hadoop fs -cat /user/hdpuser/out/myjob/*
1001
1001 1002,1006,1005
1002 1006,1003
1003 1004

8. The final result is clearly sorted by key, even though the test code never sorts anything. The sorting evidently happens in the shuffle between map and reduce.
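To make the sort-and-group behaviour concrete, here is a minimal plain-Java sketch that feeds the map output from step 3 through a sorted map and then joins the values per key, as the Reduce class does. It is a stand-in for the shuffle, not how Hadoop implements it: the `TreeMap` is an assumption chosen to model sorting by key, and `ShuffleSketch`, `shuffle`, and `reduce` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {

    // Groups (key, value) pairs by key. TreeMap keeps keys in sorted order,
    // standing in for the sort that happens between map and reduce.
    public static Map<String, List<String>> shuffle(String[][] mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Same job as the Reduce class: join all values of one key with commas.
    public static String reduce(List<String> values) {
        return String.join(",", values);
    }

    public static void main(String[] args) {
        // The map output from step 3; the empty key comes from the row "1001".
        String[][] mapOutput = {
            {"1003", "1004"}, {"1002", "1006"}, {"1001", "1005"},
            {"1001", "1006"}, {"", "1001"}, {"1002", "1003"}, {"1001", "1002"}
        };
        for (Map.Entry<String, List<String>> e : shuffle(mapOutput).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

The keys come out in the same order as the job's final result. The order of values inside one key may differ from the real run, since this sketch simply keeps arrival order and the job makes no promise about value order.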
9. The flow is summarized in a diagram (figure not included here).

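I then studied the CitationHistogram code. It kept failing at run time. The cause turned out to be the type conversion: my test data contains an empty string "", and converting it to int throws an error. After I modified part of the code, it ran successfully and produced correct data.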
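The failure and the fix can be sketched without Hadoop. The example below assumes the same workaround as the modified mapper, where any value that is not a valid integer is counted under the placeholder key 9999. The input values and the names `HistogramSketch`, `parseOrFallback`, and `histogram` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HistogramSketch {

    // Placeholder bucket for values that are not valid integers.
    static final int FALLBACK = 9999;

    public static int parseOrFallback(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return FALLBACK; // "" and other non-numeric input end up here
        }
    }

    // Map step emits (value, 1); reduce step sums the ones per key.
    public static Map<Integer, Integer> histogram(List<String> values) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String raw : values) {
            counts.merge(parseOrFallback(raw), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input values, including one empty string.
        List<String> values = Arrays.asList("3", "1", "", "3", "2");
        System.out.println(histogram(values));
    }
}
```

Routing bad values to a sentinel keeps the job running, at the cost of a 9999 row in the output that has to be read as "unparseable" rather than as a real count. Skipping such records would be an equally reasonable choice.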
Reference code

MyJob

Its main job is to collect, for each cited patent, the patents that cite it, for example:
1001 (1002,1006,1005)

import java.io.IOException;
import java.util.Iterator;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class MyJob extends Configured implements Tool {


    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {


        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {


            // Swap key and value: emit (cited patent, citing patent).
            output.collect(value, key);
        }
    }


    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {


        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {


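            // Join all citing patents for this key into one comma-separated string.
            StringBuilder csv = new StringBuilder();
            while (values.hasNext()) {
                if (csv.length() > 0) csv.append(",");
                csv.append(values.next().toString());
            }
            output.collect(key, new Text(csv.toString()));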
        }
    }


    public int run(String[] args) throws Exception {
        Configuration conf = getConf();


        JobConf job = new JobConf(conf, MyJob.class);


        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);


        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);


        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Split each input line into key and value at the first comma.
        job.set("key.value.separator.in.input.line", ",");


        JobClient.runJob(job);


        return 0;
    }


    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);


        System.exit(res);
    }
}

CitationHistogram

import java.io.IOException;
import java.util.Iterator;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class CitationHistogram extends Configured implements Tool {


    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, IntWritable, IntWritable> {


        private final static IntWritable uno = new IntWritable(1);
        private IntWritable citationCount = new IntWritable();


        public void map(Text key, Text value,
                        OutputCollector<IntWritable, IntWritable> output,
                        Reporter reporter) throws IOException {
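            // The test data contains empty values, for which parseInt throws.
            // Such records are counted under the placeholder key 9999.
            try {
                citationCount.set(Integer.parseInt(value.toString()));
            } catch (NumberFormatException e) {
                citationCount.set(9999);
            }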
            output.collect(citationCount, uno);
        }
    }


    public static class Reduce extends MapReduceBase
            implements Reducer<IntWritable,IntWritable,IntWritable,IntWritable>
    {


        public void reduce(IntWritable key, Iterator<IntWritable> values,
                           OutputCollector<IntWritable, IntWritable>output,
                           Reporter reporter) throws IOException {


            int count = 0;
            while (values.hasNext()) {
                count += values.next().get();
            }
            output.collect(key, new IntWritable(count));
        }
    }


    public int run(String[] args) throws Exception {
        Configuration conf = getConf();


        JobConf job = new JobConf(conf, CitationHistogram.class);


        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);


        job.setJobName("CitationHistogram");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);


        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.set("key.value.separator.in.input.line", ",");


        JobClient.runJob(job);


        return 0;
    }


    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(),
                new CitationHistogram(),
                args);


        System.exit(res);
    }
}
