Similarities and Differences Between Spark and Storm

http://xinhstechblog.blogspot.com/2014/06/storm-vs-spark-streaming-side-by-side.html

Storm vs. Spark Streaming: Side-by-side comparison

Overview

Both Storm and Spark Streaming are open-source frameworks for distributed stream processing. However, there are important differences, as the following side-by-side comparison shows.

Processing Model, Latency
Although both frameworks provide scalability and fault tolerance, they differ fundamentally in their processing model. Whereas Storm processes incoming events one at a time, Spark Streaming batches up events that arrive within a short time window before processing them. Thus, Storm can achieve sub-second latency when processing an event, while Spark Streaming has a latency of several seconds.
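The difference between the two processing models can be sketched in plain Python. This is only an illustration and uses neither framework's API; the function names, the `(arrival_time, payload)` event format, and the 1-second batch interval are assumptions made for the example.

```python
from collections import defaultdict


def process_one_at_a_time(events, handle):
    """Storm-style sketch: invoke the handler once per event, as it arrives."""
    return [handle([payload]) for _arrival_time, payload in events]


def process_in_micro_batches(events, handle, batch_interval):
    """Spark Streaming-style sketch: group events by arrival window,
    then invoke the handler once per window."""
    batches = defaultdict(list)
    for arrival_time, payload in events:
        # Events arriving within the same interval land in the same batch.
        batches[int(arrival_time // batch_interval)].append(payload)
    return [handle(batches[key]) for key in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]

per_event = process_one_at_a_time(events, len)
per_batch = process_in_micro_batches(events, len, batch_interval=1.0)
```

Here the handler simply counts the events it receives, so `per_event` has one entry per event while `per_batch` has one entry per time window. An event in the micro-batch version cannot be handled before its window closes, which is where the extra latency comes from.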

Fault Tolerance, Data Guarantees
However, the tradeoff lies in the fault tolerance and data guarantees. Spark Streaming provides better support for stateful computation that is fault tolerant. In Storm, each individual record has to be tracked as it moves through the system, so Storm only guarantees that each record will be processed at least once, and duplicates may appear during recovery from a fault. That means mutable state may be incorrectly updated twice.

Spark Streaming, on the other hand, need only track processing at the batch level, so it can efficiently guarantee that each mini-batch will be processed exactly once, even if a fault such as a node failure occurs. [Actually, Storm's Trident library also provides exactly-once processing. However, it relies on transactions to update state, which is slower and often has to be implemented by the user.]
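A small Python sketch can illustrate why at-least-once delivery can corrupt mutable state, and why tracking at the batch level makes duplicates easy to reject. It does not reproduce how Storm, Trident, or Spark Streaming are implemented; the function names, record IDs, and batch IDs are hypothetical. A real system would also need to persist the state and the set of committed batch IDs atomically.

```python
def apply_at_least_once(deliveries):
    """Add every delivered record to a counter; a replayed record counts twice."""
    total = 0
    for _record_id, amount in deliveries:
        total += amount
    return total


def apply_exactly_once(batches):
    """Apply each batch at most once by remembering committed batch IDs."""
    total = 0
    committed = set()
    for batch_id, amounts in batches:
        if batch_id in committed:
            continue  # replayed batch after a fault: skip it
        total += sum(amounts)
        committed.add(batch_id)
    return total


# Record "r2" is delivered twice because of a simulated fault.
deliveries = [("r1", 10), ("r2", 5), ("r2", 5), ("r3", 1)]
# The same data grouped into batches; batch 1 is replayed after a fault.
batches = [(0, [10, 5]), (1, [1]), (1, [1])]

at_least_once_total = apply_at_least_once(deliveries)
exactly_once_total = apply_exactly_once(batches)
```

The correct total is 16. The at-least-once version overcounts because nothing tells it that "r2" was already applied, while the batch-level version needs to remember only one ID per batch rather than one per record.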

Storm vs. Spark Streaming comparison.

Summary
In short, Storm is a good choice if you need sub-second latency and no data loss. Spark Streaming is better if you need stateful computation, with the guarantee that each event is processed exactly once. Spark Streaming programming logic may also be easier because it is similar to batch programming, in that you are working with batches (albeit very small ones).

Implementation, Programming API

Implementation
Storm is primarily implemented in Clojure, while Spark Streaming is implemented in Scala. This is something to keep in mind if you want to look into the code to see how each system works or to make your own customizations. Storm was developed at BackType and Twitter; Spark Streaming was developed at UC Berkeley.

Programming API
Storm comes with a Java API, as well as support for other languages. Spark Streaming can be programmed in Scala as well as Java.

Batch Framework Integration
One nice feature of Spark Streaming is that it runs on Spark. Thus, you can use the same (or very similar) code that you write for batch processing and/or interactive queries in Spark, on Spark Streaming. This reduces the need to write separate code to process streaming data and historical data.
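The idea of reusing one piece of logic for both historical and streaming data can be sketched without Spark. In the plain-Python illustration below, `count_words` and the sample data are hypothetical. The only point is that the same function runs over a full dataset and over each micro-batch.

```python
from collections import Counter


def count_words(lines):
    """Business logic written once, unaware of whether its input is a
    full historical dataset or a single micro-batch."""
    return Counter(word for line in lines for word in line.split())


# Batch use: apply the logic to the whole historical dataset at once.
historical = ["storm spark", "spark streaming"]
batch_result = count_words(historical)

# Streaming use: apply the same logic to each micro-batch and merge.
micro_batches = [["storm spark"], ["spark streaming"]]
running = Counter()
for micro_batch in micro_batches:
    running.update(count_words(micro_batch))
```

Both paths produce the same counts, so there is no second code base to keep in sync with the first.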

Storm vs. Spark Streaming: implementation and programming API.

Summary
Two advantages of Spark Streaming are that (1) it is not implemented in Clojure :) and (2) it is well integrated with the Spark batch computation framework.

Production, Support

Production Use
Storm has been around for several years and has run in production at Twitter since 2011, as well as at many other companies. Meanwhile, Spark Streaming is a newer project; its only production deployment (that I am aware of) has been at Sharethrough since 2013.

Hadoop Distribution, Support
Storm is the streaming solution in the Hortonworks Hadoop data platform, whereas Spark Streaming is in both MapR's distribution and Cloudera's Enterprise data platform. In addition, Databricks is a company that provides support for the Spark stack, including Spark Streaming.

Cluster Manager Integration
Although both systems can run on their own clusters, Storm also runs on Mesos, while Spark Streaming runs on both YARN and Mesos.

Storm vs. Spark Streaming: production and support.

Summary
Storm has run in production much longer than Spark Streaming. However, Spark Streaming has the advantages that (1) it has a company dedicated to supporting it (Databricks), and (2) it is compatible with YARN.


Further Reading

For an overview of Storm, see these slides.

For a good overview of Spark Streaming, see the slides to a Strata Conference talk. A more detailed description can be found in this research paper.

http://stackoverflow.com/questions/24119897/apache-spark-vs-apache-storm

Apache Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive queries, and graph processing. One of Spark's primary distinctions is its use of RDDs, or Resilient Distributed Datasets. RDDs are great for pipelining parallel operators for computation and are, by definition, immutable, which allows Spark a unique form of fault tolerance based on lineage information. If you are interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered).
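Lineage-based fault tolerance can be illustrated with a toy class. This is a minimal sketch, not Spark's RDD implementation. `LineageDataset` and its methods are hypothetical names. The class records the parent dataset and the function used to derive each dataset, so lost results can be recomputed instead of being restored from a replica.

```python
class LineageDataset:
    """Hypothetical immutable dataset that remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # raw data, set only for the root dataset
        self._parent = parent  # dataset this one was derived from
        self._fn = fn          # transformation applied to the parent
        self._cache = None     # materialized result, may be lost

    def map(self, fn):
        # A transformation never mutates; it returns a new dataset
        # that records its lineage (parent plus function).
        return LineageDataset(parent=self, fn=fn)

    def compute(self):
        if self._cache is None:
            if self._parent is None:
                self._cache = tuple(self._source)
            else:
                self._cache = tuple(self._fn(x) for x in self._parent.compute())
        return self._cache

    def lose_cached_result(self):
        """Simulate a node failure that drops the materialized data."""
        self._cache = None


base = LineageDataset(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
shifted = squared.map(lambda x: x + 1)

first = shifted.compute()
shifted.lose_cached_result()
squared.lose_cached_result()
recovered = shifted.compute()  # rebuilt by replaying the lineage
```

Because every dataset is immutable and every transformation is deterministic, replaying the lineage yields the same result as the original computation.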

Apache Storm is focused on stream processing or what some call complex event processing. Storm implements a fault tolerant method for performing a computation or pipelining multiple computations on an event as it flows into a system. One might use Storm to transform unstructured data as it flows into a system into a desired format.

Storm and Spark are focused on fairly different use cases. The more "apples-to-apples" comparison would be between Storm and Spark Streaming. Since Spark's RDDs are inherently immutable, Spark Streaming implements a method for "batching" incoming updates in user-defined time intervals that get transformed into their own RDDs. Spark's parallel operators can then perform computations on these RDDs. This is different from Storm which deals with each event individually.

One key difference between these two technologies is that Spark performs data-parallel computations while Storm performs task-parallel computations. Either design makes tradeoffs that are worth knowing. I would suggest checking out these links.
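The data-parallel versus task-parallel distinction can be sketched with Python's standard library. This is an illustration under assumptions, not a model of either framework's scheduler. The `parse` and `double` steps and both runner functions are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def parse(raw):
    return int(raw)


def double(value):
    return value * 2


def run_data_parallel(records, workers=2):
    """Data-parallel sketch: every worker runs the same whole pipeline
    on a different partition of the data."""
    partitions = [records[i::workers] for i in range(workers)]

    def run_pipeline(partition):
        return [double(parse(r)) for r in partition]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_pipeline, partitions))
    return sorted(value for part in results for value in part)


def run_task_parallel(records):
    """Task-parallel sketch: each worker runs a different step, and
    events flow from one step to the next through a queue."""
    handoff = queue.Queue()
    output = []
    done = object()  # sentinel marking the end of the stream

    def parse_task():
        for r in records:
            handoff.put(parse(r))
        handoff.put(done)

    def double_task():
        while True:
            item = handoff.get()
            if item is done:
                break
            output.append(double(item))

    threads = [threading.Thread(target=parse_task),
               threading.Thread(target=double_task)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output


records = ["1", "2", "3", "4"]
data_parallel_result = run_data_parallel(records)
task_parallel_result = run_task_parallel(records)
```

Both runners produce the same values. They differ in what is split across workers: the data in the first case, the steps of the computation in the second.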

