Notes on MapReduce Custom-Component Examples
Required dependencies:
(two Maven plugins are also included so the project can be packaged as a jar and submitted to the Hadoop cluster)
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0-mr1-cdh5.14.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0-cdh5.14.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.0-cdh5.14.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0-cdh5.14.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/junit/junit -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>6.14.3</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
<!-- <verbose>true</verbose> -->
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<minimizeJar>true</minimizeJar>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
1. Custom Partitioner Example
Requirement: split the records of a file into two separate output files according to whether the value in the sixth column is above 15 or not.
Approach: both (k2, v2) and (k3, v3) are Text and NullWritable. A custom Partitioner extracts the sixth column from k2 and, using 15 as the dividing line, sends records to one of two partitions.
The code is as follows:
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyPartitionerMain extends Configured implements Tool {
@Override
public int run(String[] strings) throws Exception {
Job job = Job.getInstance(super.getConf(), "MyPartitioner");
job.setJarByClass(MyPartitionerMain.class);
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path("hdfs://node01:8020/test/input"));
job.setMapperClass(MyPartitionerMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setPartitionerClass(MyPatitioner.class);
job.setNumReduceTasks(2);
job.setReducerClass(MyPartionerReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat.setOutputPath(job, new Path("hdfs://node01:8020/test/output01"));
boolean b = job.waitForCompletion(true);
return b ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int run = ToolRunner.run(new Configuration(), new MyPartitionerMain(), args);
System.exit(run);
}
}
Note: a custom partitioner only takes effect when the job runs on the cluster, so the project must be packaged as a jar; the number of reduce tasks must also be set explicitly (here 2, one per partition).
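Packaging and submission might look like the following; the jar file name is a placeholder and the driver class name comes from this example, so adjust both to your build:

```shell
# Package the project; maven-shade-plugin (bound to the package phase) produces a runnable fat jar
mvn clean package

# Submit to the cluster (the HDFS input and output paths are hard-coded in the driver)
hadoop jar target/<your-artifact>-1.0.jar MyPartitionerMain
```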
Custom mapper:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyPartitionerMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
context.write(value,NullWritable.get());
}
}
The mapper and reducer here need no extra logic; each simply passes its records through unchanged.
Custom reducer:
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyPartionerReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
context.write(key,NullWritable.get());
}
}
Custom partitioner:
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPatitioner extends Partitioner<Text, NullWritable> {
@Override
public int getPartition(Text text, NullWritable nullWritable, int i) {
String s = text.toString().split("\t")[5];
if(Long.parseLong(s)>15){
return 1;
}else{
return 0;
}
}
}
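The threshold logic above can be checked without a cluster. A minimal standalone sketch (the class name PartitionSketch is illustrative, not part of the job code) that reproduces the same decision:

```java
// Reproduces MyPatitioner's decision: take column 6 (index 5) of a
// tab-separated line; values greater than 15 go to partition 1, the rest to 0.
public class PartitionSketch {
    public static int getPartition(String line) {
        String s = line.split("\t")[5];
        return Long.parseLong(s) > 15 ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("a\tb\tc\td\te\t20")); // prints 1
        System.out.println(getPartition("a\tb\tc\td\te\t15")); // prints 0
    }
}
```

Note that with a strict `> 15`, a value of exactly 15 lands in partition 0; if "15 and above" is meant to include 15 itself, the comparison should be `>= 15`.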
2. Custom Sort Example
Given a data file, sort by the first column in dictionary order; when the first column is equal, sort by the second column in ascending order.
Sample data:
a 1
a 7
b 8
b 10
a 5
Approach: define a custom type that implements the WritableComparable interface, overriding the comparator and the serialization/deserialization methods, and use this type as k2 so the shuffle sorts records by it.
The custom type:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import lombok.Data;
import org.apache.hadoop.io.WritableComparable;

// Lombok's @Data generates the getters and setters used by the mapper
@Data
public class MySortWritable implements WritableComparable<MySortWritable> {
private String first;
private Long second;
@Override
public int compareTo(MySortWritable o) {
if(this.first.compareTo(o.first)==0){
return this.second.compareTo(o.second);
}else{
return this.first.compareTo(o.first);
}
}
// serialize the two fields, in a fixed order
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeUTF(this.first);
dataOutput.writeLong(this.second);
}
// deserialize in the same order as write()
@Override
public void readFields(DataInput dataInput) throws IOException {
this.first = dataInput.readUTF();
this.second = dataInput.readLong();
}
}
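The write/readFields pair can be exercised with plain JDK streams. This sketch (the class name WritableRoundTrip is illustrative; no Hadoop or Lombok needed) round-trips the same writeUTF/writeLong layout:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Write the two fields in the same order as MySortWritable.write,
// then read them back as in readFields.
public class WritableRoundTrip {
    public static Object[] roundTrip(String first, long second) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(first);
        out.writeLong(second);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        return new Object[]{ in.readUTF(), in.readLong() };
    }

    public static void main(String[] args) throws IOException {
        Object[] r = roundTrip("a", 7L);
        System.out.println(r[0] + " " + r[1]); // prints "a 7"
    }
}
```

The read order in readFields must match the write order exactly; swapping the two reads would corrupt both fields.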
Custom mapper:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MySortMapper extends Mapper<LongWritable, Text, MySortWritable, Text> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
MySortWritable mySortWritable = new MySortWritable();
mySortWritable.setFirst(value.toString().split("\t")[0]);
mySortWritable.setSecond(Long.parseLong(value.toString().split("\t")[1]));
context.write(mySortWritable, value);
}
}
Custom reducer:
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MySortReducer extends Reducer<MySortWritable, Text, Text, NullWritable> {
@Override
protected void reduce(MySortWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text value : values) {
context.write(value, NullWritable.get());
}
}
}
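To see the ordering contract on the sample data above, here is a self-contained sketch (a comparator standing in for MySortWritable.compareTo; the class name SortSketch is illustrative) that sorts the five sample rows:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sorts tab-separated rows with the same rule as MySortWritable.compareTo:
// first column by dictionary order, ties broken by the numeric second column.
public class SortSketch {
    public static List<String> sortRows(List<String> rows) {
        List<String> out = new ArrayList<>(rows);
        out.sort((a, b) -> {
            String[] pa = a.split("\t");
            String[] pb = b.split("\t");
            int c = pa[0].compareTo(pb[0]);
            return c != 0 ? c : Long.compare(Long.parseLong(pa[1]), Long.parseLong(pb[1]));
        });
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a\t1", "a\t7", "b\t8", "b\t10", "a\t5");
        // Expected order: a 1, a 5, a 7, b 8, b 10
        System.out.println(sortRows(rows));
    }
}
```

The numeric tie-break matters: a plain string comparison on the second column would put "b 10" before "b 8".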