The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Running on top of Hadoop MapReduce, the Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For more information, you can visit Apache Crunch Homepage.
In this blog post, I'll show you how to write a word counting programming that you might familiar with if you know Hadoop, it's Hello World in Hadoop world. Firstly, let's look at the basic concepts in Crunch, including its type system and pipelined architecture.
- Data Pipelines:
As you can see, the pipeline class contains methods to read and write collections. These collection classes have methods to perform operations on the contents of collections to produce a new result collection. Therefore, a pipeline consists of the definition of one or more input collections, a number of operations on these intermediary collections, and the writing of the collections to data sinks. The execution of all the actual pipeline operations is delayed until the run or done methods are called, at which point Crunch translates the pipeline into one or more MapReduce jobs and starts their execution.
The Pipeline interface defines a readTextFile method that takes in a String and returns a PCollection of Strings. In addition to text files, the library supports reading data from SequenceFiles and Avro container files, via the SequenceFileSource and AvroFileSource classes defined in the org.apache.crunch.io package. Note that each PCollection is a reference to a source of data, no data is actually loaded into a PCollection on the client machine.
- Collections:
Collection classes contains a number of methods, which operate on the contents of the collections, these operations are executed in either the map or reduce phase. Among them, the PGroupedTable is a special collection that's a result of calling groupByKey method on the PTable, this results in a reduce phase being executed to perform the grouping.
- Data Functions:
As you can see, all DoFn instances are required to be java.io.Serializable. This is a key aspect of the library's design: once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce job, all of the state of that DoFn is serialized so that it may be distributed to all of the nodes in the Hadoop cluster that will be running that task. There are two important implications of this for developers:
- All member values of a DoFn must be either serializable or marked as transient.
- All anonymous DoFn instances must be defined in a static method or in a class that is itself serializable.
- Types and Serialization
The parallelDo method on the PCollection interface all take either a PType or PTableType argument, depending on whether the result was a PCollection or PTable. These interfaces are used by Crunch to map between the data types used in the Crunch pipeline, and the serialization format used when reading or writing data in HDFS. The class hierarchy of types in Crunch shown as below:
As you can see, Crunch has serializatio support for both native Hadoop Writable classes as well as Avro types.
Now let's look at the WordCount in Crunch:
package crunch;
import org.apache.crunch.*;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;
public class WordCount {
public static void main(String[] args) throws Exception {
Pipeline pipeline = new MRPipeline(WordCount.class);
PCollection<String> lines = pipeline.readTextFile(args[0]);
PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
public void process(String line, Emitter<String> emitter) {
for (String word : line.split("\\s+")) {
emitter.emit(word);
}
}
}, Writables.strings());
PTable<String, Long> counts = Aggregate.count(words);
pipeline.writeTextFile(counts, args[1]);
pipeline.run();
}
}
Simple enough, right? Let's go through this example line by line.
- The first argument to parallelDo is a string that is used to identify this step in the pipeline. When a pipeline is composed into a series of MapReduce jobs, it is often the case that multiple stages will run within the same Mapper or Reducer. Having a string that identifies each processing step is useful for debugging errors that occur in a running pipeline.
- The second argument to parallelDo is an anonymous subclass of DoFn. Each DoFn subclass must override the process method, which takes in a record from the input PCollection and an Emitter object that may have any number of output values written to it. In this case, our DoFn splits each lines up into words, using a blank space as a separator, and emits the words from the split to the output PCollection.
- The last argument to parallelDo is an instance of the PType interface, which specifies how the data in the output PCollection is serialized. While the API takes advantage of Java Generics to provide compile-time type safety, the generic type information is not available at runtime. The job planner needs to know how to map the records stored in each PCollection into a Hadoop-supported serialization format in order to read and write data to disk. Two serialization implementations are supported in Crunch via the PTypeFamily interface: a Writable-based system that is defined in the org.apache.crunch.types.writable package, and an Avro-based system that is defined in the org.apache.crunch.types.avro package. Each implementation provides convenience methods for working with the common PTypes (Strings, longs, bytes, etc.) as well as utility methods for creating PTypes from existing Writable classes or Avro schemas.
Step 3: counting the words. This is acomplished by the Aggregate.count(...) method, let's look at it's implementation.
/**
* Returns a {@code PTable} that contains the unique elements of this collection mapped to a count
* of their occurrences.
*/
public static <S> PTable<S, Long> count(PCollection<S> collect) {
// get the PTypeFamily that is associated with the PType for the collection.
PTypeFamily tf = collect.getTypeFamily();
return collect.parallelDo("Aggregate.count", new MapFn<S, Pair<S, Long>>() {
public Pair<S, Long> map(S input) {
return Pair.of(input, 1L);
}
}, tf.tableOf(collect.getPType(), tf.longs())).groupByKey()
.combineValues(Aggregators.SUM_LONGS());
}
The call to parallelDo converts each record in this PCollection into a Pair of the input record and the number 1 by extending the MapFn convenience subclass of DoFn, and uses the tableOf method of the PTypeFamily to specify that the returned PCollection should be a PTable instance, with the key being the PType of the PCollection and the value being the Long implementation for this PTypeFamily.
// In Aggregators.java
/**
* Sum up all {@code long} values.
* @return The newly constructed instance
*/
public static Aggregator<Long> SUM_LONGS() {
return new SumLongs();
}
private static class SumLongs extends SimpleAggregator<Long> {
private long sum = 0;
@Override
public void reset() {
sum = 0;
}
@Override
public void update(Long next) {
sum += next;
}
@Override
public Iterable<Long> results() {
return ImmutableList.of(sum);
}
}
/**
* Constructs and executes a series of MapReduce jobs in order to write data
* to the output targets.
*/
PipelineResult run();
/**
* Constructs and starts a series of MapReduce jobs in order ot write data to
* the output targets, but returns a {@code ListenableFuture} to allow clients to control
* job execution.
* @return
*/
PipelineExecution runAsync();
The class hierarchy of PipelineExecution shown as below:With the returned instance of PipelineExecution, you can control a Crunch pipeline as it runs, this interface is implemented to be thread safe. For example, you can query the job status by calling getStatus(), wait for a specified time interval by calling waitFor(long, TimeUnit), kill the job by calling kill() and so on, see the class diagram for details.