Spark, Shark, and RDDs

wh62592855

Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:

- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark offers a larger set of primitive operations, including map, reduce, sample, join, and group-by. You can apply these in more or less any order. All of the primitives operate in parallel over the RDDs.
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There's an approach to launching tasks quickly (in about 5 milliseconds) that I unfortunately didn't grasp.
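
To make the contrast with MapReduce concrete, here is a minimal pure-Python sketch of chaining several primitives over an in-memory, partitioned dataset. It is a toy model and not Spark's API: the `ToyDataset` class and its method names are invented for illustration, and everything runs in a single process.

```python
from functools import reduce as fold


class ToyDataset:
    """Hypothetical stand-in for an RDD: a list of in-memory partitions."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    def map(self, fn):
        # Each partition is transformed independently, so this could run in parallel.
        return ToyDataset([[fn(x) for x in part] for part in self.partitions])

    def filter(self, pred):
        return ToyDataset([[x for x in part if pred(x)] for part in self.partitions])

    def reduce(self, fn):
        # Reduce inside each non-empty partition, then combine the partial results.
        partials = [fold(fn, part) for part in self.partitions if part]
        return fold(fn, partials)


data = ToyDataset([[1, 2, 3], [4, 5], [6]])
# Steps are chained freely; intermediate results never leave memory.
total = data.map(lambda x: x * x).filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)
print(total)
```

The point of the sketch is only that intermediate datasets are ordinary in-memory objects handed from one step to the next, rather than files written out between stages.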

The key concept here seems to be the RDD. Any one RDD:

- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn't have to be entirely in memory at once.
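
As a rough illustration of what "partitioned" and "shuffled" mean here, the following sketch distributes key/value records by key hash and then groups them locally. The function names and the two-partition setup are assumptions made for the example; none of it is taken from Spark itself.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Distribute (key, value) records across partitions by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def group_by_key(partitions):
    """After the shuffle, all values for a key share a partition, so grouping is local."""
    grouped = []
    for part in partitions:
        groups = defaultdict(list)
        for key, value in part:
            groups[key].append(value)
        grouped.append(dict(groups))
    return grouped


records = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]
shuffled = hash_partition(records, num_partitions=2)
result = group_by_key(shuffled)
print(result)
```

Integer keys are used so that the partition assignment is the same on every run; Python randomizes string hashes per process.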

Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:

- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.

Just like MapReduce, Spark wants to be fault-tolerant enough to work on clusters of dubiously reliable hardware. Unlike MapReduce, Spark doesn't persist intermediate result sets (unless they're too large to fit into RAM). Rather, Spark's main fault-tolerance strategy is:

- RDDs are written by single operations (typically executed in a distributed fashion).
- If there's a failure, the operation is replayed over the portion of the data that was on the affected node.
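
The replay strategy above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `LineageDataset` class is hypothetical, the source partitions are assumed to survive the failure, and a "node failure" is simulated by deleting one cached partition.

```python
class LineageDataset:
    """Hypothetical dataset that remembers the operation that produced it."""

    def __init__(self, source_partitions, fn):
        self.source_partitions = source_partitions  # assumed to survive failures
        self.fn = fn                                # the single operation that built this dataset
        self.recomputed = []                        # indices rebuilt after a loss
        self.cache = {
            i: [fn(x) for x in part] for i, part in enumerate(source_partitions)
        }

    def lose_partition(self, index):
        # Simulate a node failure wiping one in-memory partition.
        del self.cache[index]

    def get_partition(self, index):
        if index not in self.cache:
            # Replay the operation over only the lost portion of the data.
            self.cache[index] = [self.fn(x) for x in self.source_partitions[index]]
            self.recomputed.append(index)
        return self.cache[index]


source = [[1, 2], [3, 4], [5, 6]]
doubled = LineageDataset(source, lambda x: x * 2)
doubled.lose_partition(1)
recovered = doubled.get_partition(1)
print(recovered, doubled.recomputed)
```

Only partition 1 is recomputed; the partitions on healthy nodes are served from memory as before.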

Further, Reynold Xin emailed:

Spark [supports] speculative execution for dealing with stragglers. Speculation is particularly important for low-latency jobs, which are common in Spark.
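
For readers unfamiliar with speculative execution, here is a toy sketch of the general idea: run a duplicate copy of a task and take whichever copy finishes first. It says nothing about how Spark's scheduler actually decides when to speculate. The `run_with_speculation` helper and the simulated delays are invented for this example, and for simplicity both copies start at once.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_speculation(task, arg, delays):
    """Run one copy of `task` per entry in `delays`; return the first result.

    `delays` simulates how slow each copy's node is, in seconds. A real
    scheduler would launch a backup copy only once a task looked slow.
    """

    def attempt(delay):
        time.sleep(delay)  # pretend this node is slow
        return task(arg)

    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(attempt, d) for d in delays]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the straggler
    return next(iter(done)).result()


# The first copy is a straggler; the second finishes almost immediately.
result = run_with_speculation(lambda x: x * x, 7, delays=[0.3, 0.01])
print(result)
```

Because every copy computes the same deterministic result, it does not matter which one wins; the duplicate only buys lower latency.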

Shark borrows a lot of Hive code to do what Hive does, only over Spark. Notes on Shark’s query planning include:

- Shark borrows the Hive optimizer for up-front join reordering and so on.
- Shark can dynamically re-plan work in progress to:
  - Change how work is partitioned among nodes.
  - Select a join algorithm appropriate for the cardinalities of intermediate result sets.
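
Choosing a join algorithm from observed cardinalities might look roughly like the sketch below. It is an illustration of the general technique, not Shark's planner: the `BROADCAST_THRESHOLD` cutoff, the function names, and the two strategies offered are all assumptions.

```python
BROADCAST_THRESHOLD = 3  # hypothetical cutoff, in rows


def hash_join(build, probe):
    """Join (key, value) pairs; rows come out as (key, build_value, probe_value)."""
    table = {}
    for key, value in build:
        table.setdefault(key, []).append(value)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]


def partitioned_join(left, right, num_partitions=2):
    """Co-partition both sides by key, then hash-join each pair of partitions."""
    rows = []
    for i in range(num_partitions):
        l_part = [(k, v) for k, v in left if hash(k) % num_partitions == i]
        r_part = [(k, v) for k, v in right if hash(k) % num_partitions == i]
        rows.extend(hash_join(l_part, r_part))
    return rows


def adaptive_join(left, right):
    """Pick a strategy only once the actual input sizes are known."""
    if len(left) <= BROADCAST_THRESHOLD:
        return "broadcast", hash_join(left, right)
    if len(right) <= BROADCAST_THRESHOLD:
        # Build on the small right side, then restore (key, left, right) order.
        return "broadcast", [(k, l, r) for k, r, l in hash_join(right, left)]
    return "partitioned", partitioned_join(left, right)


small = [(1, "x"), (2, "y")]
big = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
strategy, rows = adaptive_join(small, big)
print(strategy, rows)
```

When one intermediate result turns out to be small, it is cheaper to ship it whole to every node than to repartition both sides, which is why the decision is worth deferring until sizes are known.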

Further Shark smarts are to be added down the road.

And finally, Shark gives a columnar storage format to its RDDs, which has already been discussed on this blog.
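
As a generic illustration of what a columnar layout buys (not a description of Shark's actual format, whose details are not covered here), the sketch below pivots row tuples into one typed array per column. The `rows_to_columns` helper and the sample schema are hypothetical.

```python
from array import array


def rows_to_columns(rows, schema):
    """Pivot row tuples into one typed array per column.

    `schema` is a list of (column_name, array typecode) pairs, e.g. ("id", "q").
    """
    return {
        name: array(typecode, (row[i] for row in rows))
        for i, (name, typecode) in enumerate(schema)
    }


rows = [(1, 9.5), (2, 7.0), (3, 8.5)]
columns = rows_to_columns(rows, [("id", "q"), ("score", "d")])
# An aggregate over one column reads only that column's contiguous values.
average_score = sum(columns["score"]) / len(columns["score"])
print(average_score)
```

Storing each column as a packed array of primitives avoids one object per field, which is the usual motivation for columnar in-memory formats.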