MapReduce Algorithm - Reduce-side Join

Reduce-side join is also known as repartition join. The idea is quite simple: we map over both datasets and emit the join key as the intermediate key and the tuple itself as the intermediate value. Since MapReduce guarantees that all values with the same key are brought together, all tuples will be grouped by the join key, which is exactly what we need to perform the join. This approach is known as a parallel sort-merge join in the database community. In more detail, there are three different cases to consider.

  1. One-to-one join - This is the simplest case: emit the join key from the map as the intermediate key (it can be removed from the value to save a bit of space), and emit the whole record as the intermediate value to be joined in the reduce phase.
  2. One-to-many join
  3. Many-to-many join - We will demonstrate the last two cases with a general join framework.

The basic idea behind the reduce-side join is to repartition the two datasets by the join key. The approach isn't particularly efficient since it requires shuffling both datasets across the network, which leads us to the map-side join.
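To make the basic repartition join concrete before moving on to the optimized framework, here is a minimal sketch using the plain MapReduce API. It is only an illustration under assumed conditions: the package and class names are made up, both inputs are tab-separated with the join key in the first field, and the smaller dataset's file name is assumed to start with "users".

package example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class NaiveRepartitionJoin {

    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text joinKey = new Text();
        private final Text taggedValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit the join key (first tab-separated field) as the intermediate key,
            // and the record tagged with its source file name as the value.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            joinKey.set(line.toString().split("\t", 2)[0]);
            taggedValue.set(fileName + "\t" + line.toString());
            context.write(joinKey, taggedValue);
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // All tuples sharing a join key arrive together; split them by source
            // and emit the cross product.
            List<String> left = new ArrayList<String>();
            List<String> right = new ArrayList<String>();
            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                if (parts[0].startsWith("users")) { // hypothetical file-name tag
                    left.add(parts[1]);
                } else {
                    right.add(parts[1]);
                }
            }
            for (String l : left) {
                for (String r : right) {
                    context.write(key, new Text(l + "\t" + r));
                }
            }
        }
    }
}

Note that this naive version buffers both sides of each key group in memory; the optimized framework below only buffers the smaller side.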

 

Optimized repartition join framework (reduce-side join)

The Hadoop contrib join (hadoop-datajoin) package requires that all the values for a key be loaded into memory. In this optimization you cache the smaller of the two datasets and perform the join as you iterate over data from the larger dataset. This involves a secondary sort over the map output so that the reducer receives the data from the smaller dataset ahead of the larger dataset. We will craft the join code in a similar fashion to the Hadoop contrib join package, but using the new MapReduce API. The figure below gives an overview of how we will implement it; it is quoted from Hadoop in Practice, a really awesome book by the way - you should read it if you want to learn Hadoop well.

[Figure: overview of the optimized repartition join - the smaller dataset is cached in the reducer and joined against the larger dataset as it streams through.]

The code follows.

First, the CompositeKey as well as its comparators; we need these because a secondary sort is required to place the values from the smaller dataset before those from the larger one.

 

package com.manning.hip.ch4.joins.improved.impl;

import org.apache.commons.lang.builder.ToStringBuilder;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class CompositeKey implements WritableComparable<CompositeKey> {

    private String key = "";
    private long order = 0;

    public CompositeKey() {
    }

    public CompositeKey(String key, long order) {
        this.key = key;
        this.order = order;
    }

    public String getKey() {
        return this.key;
    }

    public long getOrder() {
        return this.order;
    }

    public void setKey(String key) {
        this.key = key;
    }

    public void setOrder(long order) {
        this.order = order;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.key = in.readUTF();
        this.order = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeLong(this.order);
    }

    @Override
    public int compareTo(CompositeKey other) {
        if (this.key.compareTo(other.key) != 0) {
            return this.key.compareTo(other.key);
        } else if (this.order != other.order) {
            return order < other.order ? -1 : 1;
        } else {
            return 0;
        }
    }

    @Override
    public String toString() {
        return ToStringBuilder.reflectionToString(this);
    }


    static { // register this comparator
        WritableComparator.define(CompositeKey.class,
                new CompositeKeyComparator());
    }
}
 
package com.manning.hip.ch4.joins.improved.impl;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CompositeKeyComparator extends WritableComparator {

    protected CompositeKeyComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {

        CompositeKey p1 = (CompositeKey) w1;
        CompositeKey p2 = (CompositeKey) w2;

        int cmp = p1.getKey().compareTo(p2.getKey());
        if (cmp != 0) {
            return cmp;
        }

        return p1.getOrder() == p2.getOrder() ? 0 : (p1
                .getOrder() < p2.getOrder() ? -1 : 1);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare the serialized keys byte by byte (UTF-encoded key followed by the order long).
        return compareBytes(b1, s1, l1, b2, s2, l2);
    }
}
 
The CompositeKeyOnlyComparator class is used to group values that share the same natural join key (ignoring the order field) into the same reduce group.

 

package com.manning.hip.ch4.joins.improved.impl;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CompositeKeyOnlyComparator extends WritableComparator {

    protected CompositeKeyOnlyComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable o1, WritableComparable o2) {

        CompositeKey p1 = (CompositeKey) o1;
        CompositeKey p2 = (CompositeKey) o2;

        return p1.getKey().compareTo(p2.getKey());

    }
}
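
The driver shown at the end of this post also sets a CompositeKeyPartitioner, which isn't listed here. A minimal sketch of what it needs to do - partition on the natural join key only, ignoring the order field, so that records from both datasets with the same key reach the same reducer - might look like this (the book's actual implementation may differ):

package com.manning.hip.ch4.joins.improved.impl;

import org.apache.hadoop.mapreduce.Partitioner;

public class CompositeKeyPartitioner extends Partitioner<CompositeKey, OutputValue> {

    @Override
    public int getPartition(CompositeKey key, OutputValue value, int numPartitions) {
        // Hash only the natural join key so both datasets' records for a given
        // key land on the same reducer; the order field only affects sort order.
        return (key.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}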
 

 

Next is OutputValue, which will be used in the reducer to hold the map output value. It's an abstract class; you have to define your own way to expose its data (a sample concrete subclass is sketched after the class).
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.manning.hip.ch4.joins.improved.impl;

import org.apache.commons.lang.builder.ToStringBuilder;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * This abstract class serves as the base class for the values that
 * flow from the mappers to the reducers in a data join job.
 * Typically, in such a job, the mappers will compute the source
 * tag of an input record based on its attributes or based on the
 * file name of the input file. This tag will be used by the reducers
 * to re-group the values of a given key according to their source tags.
 */
public abstract class OutputValue implements Writable {
    protected BooleanWritable smaller;

    public OutputValue() {
        this.smaller = new BooleanWritable(false);
    }

    public BooleanWritable isSmaller() {
        return smaller;
    }

    public void setSmaller(BooleanWritable smaller) {
        this.smaller = smaller;
    }

    public abstract Writable getData();

    public OutputValue clone(Reducer.Context context) {
        return WritableUtils.clone(this, context.getConfiguration());
    }

    @Override
    public String toString() {
        return ToStringBuilder.reflectionToString(this);
    }
}
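
To give an idea of what a concrete subclass looks like, here is a minimal sketch that simply wraps a Text payload; the class name TextTaggedOutputValue is my own, not from the book:

package com.manning.hip.ch4.joins.improved.impl;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class TextTaggedOutputValue extends OutputValue {

    private Text data = new Text();

    public TextTaggedOutputValue() {
    }

    public TextTaggedOutputValue(Text data) {
        this.data = data;
    }

    @Override
    public Writable getData() {
        return data;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the "smaller" flag first, then the payload itself.
        this.smaller.write(out);
        this.data.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.smaller.readFields(in);
        this.data.readFields(in);
    }
}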
 
The following is the core of the general join framework: the Mapper and Reducer base classes as well as the driver.
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.manning.hip.ch4.joins.improved.impl;

import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * This abstract class serves as the base class for the mapper class of a data
 * join job.
 */
public abstract class OptimizedDataJoinMapperBase extends Mapper {

    private static enum MapSideCounter {
        TOTAL_COUNT, DISCARDED_COUNT, NULL_GROUP_KEY_COUNT, COLLECTED_COUNT
    }

    protected String inputFile = null;

    protected Text inputTag = null;

    protected Context context;

    protected CompositeKey outputKey = new CompositeKey();

    protected BooleanWritable smaller;

    private SortedMap<Object, Long> longCounters = new TreeMap<Object, Long>();

    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        this.context = context;
        this.inputFile = ((FileSplit) context.getInputSplit()).getPath().toString();
        this.inputTag = generateInputTag(this.inputFile);
        if (isInputSmaller(this.inputFile)) {
            smaller = new BooleanWritable(true);
            // order 0 makes the smaller dataset's values sort ahead of the larger one's
            outputKey.setOrder(0);
        } else {
            smaller = new BooleanWritable(false);
            outputKey.setOrder(1);
        }
    }

    /**
     * Determine the source tag based on the input file name.
     *
     * @param inputFile input file.
     * @return the source tag computed from the given file name.
     */
    protected abstract Text generateInputTag(String inputFile);

    /**
     * Generate an output value. The user code can also perform
     * projection/filtering. If it decides to discard the input record when
     * certain conditions are met,it can simply return a null.
     *
     * @param o the Map input value
     * @return an object of OutputValue computed from the given value.
     */
    protected abstract OutputValue genMapOutputValue(Object o);

    /**
     * Generate a map output key. The user code can compute the key
     * programmatically, not just selecting the values of some fields. In this
     * sense, it is more general than the joining capabilities of SQL.
     *
     * @param aRecord record
     * @return the group key for the given record
     */
    protected abstract String genGroupKey(Object key,
                                          OutputValue aRecord);

    /**
     * @param inputFile input file.
     * @return true if the data from the supplied input file is smaller
     *         than data from the other input file.
     */
    protected abstract boolean isInputSmaller(String inputFile);

    public void map(Object key, Object value,
                    Context context) throws IOException, InterruptedException {
        context.getCounter(MapSideCounter.TOTAL_COUNT).increment(1);
        OutputValue aRecord = genMapOutputValue(value);
        if (aRecord == null) {
            context.getCounter(MapSideCounter.DISCARDED_COUNT).increment(1);
            return;
        }
        aRecord.setSmaller(smaller);
        String groupKey = genGroupKey(key, aRecord);
        if (groupKey == null) {
            context.getCounter(MapSideCounter.NULL_GROUP_KEY_COUNT).increment(1);
            return;
        }
        outputKey.setKey(groupKey);
        context.write(outputKey, aRecord);
        context.getCounter(MapSideCounter.COLLECTED_COUNT).increment(1);
    }
}
 
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.manning.hip.ch4.joins.improved.impl;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * This abstract class serves as the base class for the reducer class of a data
 * join job. The reduce function will first group the values according to their
 * source (smaller or larger dataset), and then compute the cross product over
 * the groups. For each tuple in the cross product, it calls the following
 * method, which is expected to be implemented in a subclass.
 * <p/>
 * protected abstract OutputValue combine(String key, OutputValue smallValue, OutputValue largeValue);
 * <p/>
 * The above method is expected to produce one output value from a pair of
 * records from the two sources. The user code can also perform filtering here.
 * It can return null if it decides the records do not meet certain
 * conditions.
 */
public abstract class OptimizedDataJoinReducerBase extends Reducer {

    private static enum ReduceSideCounter {
        GROUP_COUNT, COLLECTED_COUNT, ACTUAL_COLLECTED_COUNT
    }

    private long maxNumOfValuesPerGroup = 100;

    protected long largestNumOfValues = 0;

    protected long numOfValues = 0;

    protected long collected = 0;

    protected Context context;

    public static final Log LOG = LogFactory.getLog(OptimizedDataJoinReducerBase.class);

    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        this.context = context;
        this.maxNumOfValuesPerGroup =
                context.getConfiguration().getLong("datajoin.maxNumOfValuesPerGroup", 100);
    }

    public void reduce(Object key, Iterable values, Context context)
            throws IOException, InterruptedException {

        CompositeKey k = (CompositeKey) key;

        System.out.println("K[" + k + "]");

        // Thanks to the secondary sort, all values from the smaller dataset arrive
        // before those from the larger one, so they can be buffered and then joined
        // against each larger-dataset value as it streams in.
        List<OutputValue> smaller = new ArrayList<OutputValue>();
        this.numOfValues = 0;

        for (Object value : values) {
            numOfValues++;

            System.out.println("  V[" + value + "]");

            if (this.numOfValues % 100 == 0) {
                context.setStatus("key: " + key.toString() + " numOfValues: " + this.numOfValues);
            }

            if (this.numOfValues > this.maxNumOfValuesPerGroup) {
                continue;
            }

            OutputValue cloned = ((OutputValue) value).clone(context);

            if (cloned.isSmaller().get()) {
                System.out.println("Adding to smaller coll");
                smaller.add(cloned);
            } else {
                System.out.println("Join/collect");
                joinAndCollect(k, smaller, cloned);
            }
        }

        if (this.numOfValues > this.largestNumOfValues) {
            this.largestNumOfValues = numOfValues;
            LOG.info("key: " + key.toString() + " this.largestNumOfValues: "
                    + this.largestNumOfValues);
        }

        context.getCounter(ReduceSideCounter.GROUP_COUNT).increment(1);
    }

    /**
     * Join the list of the value lists, and collect the results.
     */
    private void joinAndCollect(CompositeKey key,
                                List<OutputValue> smaller,
                                OutputValue value)
            throws IOException, InterruptedException {
        if (smaller.size() < 1) {
            OutputValue combined = combine(key.getKey(), null, value);
            collect(key, combined);
        } else {
            for (OutputValue small : smaller) {
                OutputValue combined = combine(key.getKey(), small, value);
                collect(key, combined);
            }
        }
    }

    private static Text outputKey = new Text();

    private void collect(CompositeKey key,
                         OutputValue combined) throws IOException, InterruptedException {
        this.collected += 1;
        context.getCounter(ReduceSideCounter.COLLECTED_COUNT).increment(1);
        if (combined != null) {
            outputKey.set(key.getKey());
            context.write(outputKey, combined.getData());
            context.setStatus(
                    "key: " + key.toString() + " collected: " + collected);

            context.getCounter(ReduceSideCounter.ACTUAL_COLLECTED_COUNT).increment(1);
        }

    }

    /**
     * @return combined value derived from values of the sources
     */
    protected abstract OutputValue combine(String key,
                                           OutputValue smallValue,
                                           OutputValue largeValue);

}
 
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.manning.hip.ch4.joins.improved.impl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

/**
 * This class implements the main function for creating a map/reduce
 * job to join data of different sources. To create such a job, the
 * user must implement a mapper class that extends OptimizedDataJoinMapperBase,
 * and a reducer class that extends OptimizedDataJoinReducerBase.
 */
public class OptimizedDataJoinJob extends Configured implements Tool {

    public static Class getClassByName(String className) {
        Class retv = null;
        try {
            ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
            retv = Class.forName(className, true, classLoader);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return retv;
    }

    public static Job createDataJoinJob(String args[]) throws IOException {

        String inputDir = args[0];
        String outputDir = args[1];
        Class inputFormat = SequenceFileInputFormat.class;
        if (args[2].compareToIgnoreCase("text") != 0) {
            System.out.println("Using SequenceFileInputFormat: " + args[2]);
        } else {
            System.out.println("Using TextInputFormat: " + args[2]);
            inputFormat = TextInputFormat.class;
        }
        int numOfReducers = Integer.parseInt(args[3]);
        Class mapper = getClassByName(args[4]);
        Class reducer = getClassByName(args[5]);
        Class mapOutputValueClass = getClassByName(args[6]);
        Class outputFormat = TextOutputFormat.class;
        Class outputValueClass = Text.class;
        if (args[7].compareToIgnoreCase("text") != 0) {
            System.out.println("Using SequenceFileOutputFormat: " + args[7]);
            outputFormat = SequenceFileOutputFormat.class;
            outputValueClass = getClassByName(args[7]);
        } else {
            System.out.println("Using TextOutputFormat: " + args[7]);
        }
        long maxNumOfValuesPerGroup = 100;
        String jobName = "";
        if (args.length > 8) {
            maxNumOfValuesPerGroup = Long.parseLong(args[8]);
        }
        if (args.length > 9) {
            jobName = args[9];
        }
        Configuration defaults = new Configuration();
        Job job = new Job(defaults, "Optimized DataJoin Job");
        job.setJobName("DataJoinJob: " + jobName);

        FileSystem fs = FileSystem.get(defaults);
        fs.delete(new Path(outputDir), true);
        FileInputFormat.setInputPaths(job, inputDir);

        job.setInputFormatClass(inputFormat);

        job.setMapperClass(mapper);
        FileOutputFormat.setOutputPath(job, new Path(outputDir));
        job.setOutputFormatClass(outputFormat);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);
        job.setMapOutputKeyClass(CompositeKey.class);
        job.setMapOutputValueClass(mapOutputValueClass);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(outputValueClass);
        job.setReducerClass(reducer);

        job.setPartitionerClass(CompositeKeyPartitioner.class);
        job.setSortComparatorClass(CompositeKeyComparator.class);
        job.setGroupingComparatorClass(CompositeKeyOnlyComparator.class);

        job.setNumReduceTasks(numOfReducers);
        job.getConfiguration().setLong("datajoin.maxNumOfValuesPerGroup", maxNumOfValuesPerGroup);
        return job;
    }

    public int run(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = OptimizedDataJoinJob.createDataJoinJob(args);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        if (args.length < 8 || args.length > 10) {
            System.out.println("usage: DataJoinJob " + "inputdirs outputdir map_input_file_format "
                    + "numofParts " + "mapper_class " + "reducer_class "
                    + "map_output_value_class "
                    + "output_value_class [maxNumOfValuesPerGroup [descriptionOfJob]]]");
            System.exit(-1);
        }

        try {

            int exitCode = ToolRunner.run(new OptimizedDataJoinJob(), args);
            System.exit(exitCode);
        } catch (Exception ioe) {
            ioe.printStackTrace();
        }
    }
}
 
To run a join with this general join framework, you have to implement your own Mapper, Reducer, and output value class, then pass the fully qualified class names as command-line parameters to the driver class; note the order of these parameters. A rough example follows.
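
A hypothetical example - the class names, the assumption that the join key is the first tab-separated field, and the paths are mine, not the book's sample code - might look like this, reusing the TextTaggedOutputValue sketched earlier:

package com.manning.hip.ch4.joins.improved;

import org.apache.hadoop.io.Text;

import com.manning.hip.ch4.joins.improved.impl.OptimizedDataJoinMapperBase;
import com.manning.hip.ch4.joins.improved.impl.OptimizedDataJoinReducerBase;
import com.manning.hip.ch4.joins.improved.impl.OutputValue;
import com.manning.hip.ch4.joins.improved.impl.TextTaggedOutputValue;

public class SampleJoin {

    /** Mapper: the join key is the first tab-separated field of each record. */
    public static class SampleMap extends OptimizedDataJoinMapperBase {

        @Override
        protected Text generateInputTag(String inputFile) {
            // Tag records by the file they came from.
            return new Text(inputFile);
        }

        @Override
        protected OutputValue genMapOutputValue(Object o) {
            return new TextTaggedOutputValue(new Text(o.toString()));
        }

        @Override
        protected String genGroupKey(Object key, OutputValue aRecord) {
            return aRecord.getData().toString().split("\t", 2)[0];
        }

        @Override
        protected boolean isInputSmaller(String inputFile) {
            // Treat the users file as the smaller dataset (assumption).
            return inputFile.contains("users");
        }
    }

    /** Reducer: concatenate the matching records from the two datasets. */
    public static class SampleReduce extends OptimizedDataJoinReducerBase {

        @Override
        protected OutputValue combine(String key, OutputValue smallValue,
                                      OutputValue largeValue) {
            if (smallValue == null || largeValue == null) {
                return null; // inner join: drop unmatched records
            }
            return new TextTaggedOutputValue(new Text(
                    smallValue.getData() + "\t" + largeValue.getData()));
        }
    }
}

The driver arguments then follow the order checked in main(): input dirs, output dir, input format, number of reducers, mapper class, reducer class, map output value class, and output format/value class. For example (jar name and paths assumed):

hadoop jar hip-joins.jar com.manning.hip.ch4.joins.improved.impl.OptimizedDataJoinJob \
    /input/users,/input/logs /output/joined text 1 \
    'com.manning.hip.ch4.joins.improved.SampleJoin$SampleMap' \
    'com.manning.hip.ch4.joins.improved.SampleJoin$SampleReduce' \
    com.manning.hip.ch4.joins.improved.impl.TextTaggedOutputValue \
    text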