Spark Streaming vs. Storm: A Comparison

Latest recommended article published 2020-03-20 09:38:40

layne_liang

Feature	Storm (Trident)	Spark Streaming	Notes
Parallelism framework	DAG-based task-parallel continuous computational engine	Spark-based data-parallel general-purpose batch processing engine
Processing model	One at a time: processes one event (message) at a time. Trident: micro-batch, processes multiple events at a time	Micro-batch: processes multiple events at a time
Latency	Sub-second. Trident: a few seconds	A few seconds	See the comment thread quoted after the table
Fault tolerance	At least once. Trident: exactly once	Exactly once
Origin	BackType and Twitter	UC Berkeley
Implementation language	Clojure	Scala
API support	Java, Python, Ruby, and others	Scala, Java, Python
Platform integration	N/A (built on ZooKeeper)	Spark (so real-time processing and historical data processing can be unified or shared)
Production use and support	Storm has been around for several years and has run in production at Twitter since 2011, as well as at many other companies	Meanwhile, Spark Streaming is a newer project; its only production deployment (that I am aware of) has been at Sharethrough since 2013.
Hadoop distribution support	Storm is the streaming solution in the Hortonworks Hadoop data platform	Spark Streaming is in both MapR's distribution and Cloudera's Enterprise data platform. Databricks
Cluster integration and deployment	Depends on ZooKeeper; standalone, Mesos	Standalone, YARN, Mesos
Google Trends
Bug burndown chart	https://issues.apache.org/jira/browse/STORM/	https://issues.apache.org/jira/browse/SPARK/	Spark issues appear to be resolved much more promptly than Storm issues

Notes on latency, quoted from a comment thread:

Josh (December 8, 2014, 4:23 PM): Thanks for the article! Could you please explain this point in a bit more detail? "But, it relies on transactions to update state, which is slower and often has to be implemented by the user." If I want to write my output to a persistent store e.g. redis, then why would it be slower in Storm than in Spark Streaming?

Xinh Huynh (December 13, 2014, 5:24 PM): Hi Josh, please check out the slide about Storm/Trident here: http://spark-summit.org/wp-conte ... Spark-Streaming.pdf If you want exactly-once semantics with Trident, you have to store a per-state transaction ID for each state. I.e., in word-count, for each word, you would store both the count as well as a transaction ID; each key-value pair would look like: (Key:word, Value: count, txid). Before updating the count, you would read in the old transaction ID to make sure it's up to date, and this read causes extra latency. If you are using redis in memory, that might be okay, but if it has to go to disk then that would add noticeable latency to the update. Whereas in Spark, you don't have to store a per-state transaction ID. For the details of Trident transactional processing, see http://storm.apache.org/documentation/Trident-state

Josh (December 15, 2014, 9:18 AM): Hi Xinh, thanks for the explanation. I see, isn't that similar to Spark checkpointing - where it saves states to HDFS every ~10 seconds? or is your point that with Storm it would (by default) persist the state much more frequently than Spark?

Xinh Huynh (December 15, 2014, 11:43 AM): Hi Josh, yes, the fault tolerance in Spark involves periodic (~10 second) checkpointing of RDDs. Yes, my point is that with Storm Trident the persistence occurs when each batch is processed, and by default that occurs a lot more than once every 10 seconds. And, in tuning any of these parameters, there's a tradeoff in the frequency of persistence vs. recovery time in the case of failure.
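
The per-key transaction ID scheme described in the latency notes (store the count together with a txid, and read the old txid before each update) can be illustrated with a small sketch. This is not Trident's API: the `State` dictionary is a hypothetical in-memory stand-in for an external store such as Redis, and `apply_batch` is a name invented for illustration. It also assumes that a replayed batch carries the same txid and the same contents, and that batches are applied in txid order.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical in-memory stand-in for an external store such as Redis.
# Each key maps to (count, txid of the last batch applied to that key).
State = Dict[str, Tuple[int, int]]


def apply_batch(state: State, txid: int, words: Iterable[str]) -> None:
    """Apply one micro-batch of words to the word-count state at most once per txid."""
    for word, delta in Counter(words).items():
        # This read of the stored txid is the extra round trip the notes mention.
        count, last_txid = state.get(word, (0, -1))
        if last_txid == txid:
            # The batch was already applied to this key, so a replay is skipped.
            continue
        state[word] = (count + delta, txid)


state: State = {}
apply_batch(state, 1, ["storm", "spark", "storm"])
apply_batch(state, 1, ["storm", "spark", "storm"])  # replay after a simulated failure
apply_batch(state, 2, ["spark"])
print(state)
```

The replayed batch leaves the counts unchanged, which is what gives exactly-once results on top of at-least-once delivery. The cost is one read per key per batch, and that read is slow when the store has to go to disk.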

The debate between Spark Streaming and Storm has a long history.
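
Much of the disagreement comes down to the processing model listed in the table: one event at a time versus micro-batches. The sketch below contrasts the two on a toy event list. It uses neither framework's API; the `Event` shape, the function names, and the 1-second interval are assumptions chosen for illustration, and events are assumed to arrive in time order.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (arrival time in seconds, payload).
Event = Tuple[float, str]


def one_at_a_time(events: Iterable[Event]) -> Iterator[List[str]]:
    """Hand each event to the processing step as soon as it arrives."""
    for _, payload in events:
        yield [payload]


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[str]]:
    """Group time-ordered events into consecutive windows of `interval` seconds."""
    for _, group in groupby(events, key=lambda e: int(e[0] // interval)):
        yield [payload for _, payload in group]


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
per_event = list(one_at_a_time(events))
per_batch = list(micro_batches(events, interval=1.0))
print(per_event)
print(per_batch)
```

With micro-batching, an event that arrives early in a window waits until the window closes before it is processed, which is one reason the latency row lists seconds for the micro-batch engines and sub-second for one-at-a-time Storm.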
References:
http://xinhstechblog.blogspot.com/2014/06/storm-vs-spark-streaming-side-by-side.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

http://www.zdatainc.com/2014/09/apache-storm-apache-spark/
