Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:
- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order. All the primitives are parallel with respect to the RDDs.
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.
The key concept here seems to be the RDD. Any one RDD:
- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn’t have to be entirely in memory at once.
Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:
- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.
Just like MapReduce, Spark wants to be fault-tolerant enough to work on clusters of dubiously-reliable hardware. Unlike MapReduce, Spark doesn’t persist intermediate result sets (unless they’re too large to fit into RAM). Rather, Spark’s main fault-tolerance strategy is:
- RDDs are written by single operations (typically executed in a distributed fashion).
- If there’s a failure, the operation is replayed over the portion of the data that was on the affected node.
Further, Reynold Xin emailed:
Spark [supports] speculative execution for dealing with stragglers. Speculation is particularly important for low-latency jobs, which are common in Spark.
Shark borrows a lot of Hive code to do what Hive does, only over Spark. Notes on Shark’s query planning include:
- Shark borrows the Hive optimizer for up-front join reordering and so on.
- Shark can dynamically re-plan work in progress to:
- Change how work is partitioned among nodes.
- Select a join algorithm appropriate for the cardinalities of intermediate result sets.
Further Shark smarts are to be added down the road.
And finally, Shark gives a columnar storage format to its RDDs, which hasalready been discussed on this blog.