What is Spark ## [Draft - to be organized]
Spark is a distributed data processing engine that runs on the Java Virtual Machine (JVM) and scales out across a cluster. It is also fast compared to other data processing frameworks.
Spark Core is the general execution engine of the Spark platform, and all other functionality is built on top of it. Its key properties:
Mesos paper
- In-memory computing capabilities deliver speed
- A general execution model supports a wide variety of use cases
- Ease of development: native APIs in Java, Scala, Python (+ SQL, Clojure, R)
Situation | Trade-off | Tech |
---|---|---|
Cheap Memory | Recompute | RDD |
Cheap Storage | Replicate | DFS |
Cheap Network | Reference | URI |
RDD: Resilient Distributed Dataset
DFS: Distributed File System
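
The "Cheap Memory | Recompute | RDD" row can be illustrated with a toy model. The sketch below is plain Python, not the Spark API: `ToyRDD` and all of its methods are hypothetical names made up for illustration. It assumes a dataset remembers its parent and the function that derived it (its lineage), so a partition lost from memory is rebuilt by recomputation instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical toy model of a lineage-tracked dataset (not Spark's RDD class)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # list of partitions; set only on the root dataset
        self.parent = parent      # lineage: the dataset this one was derived from
        self.fn = fn              # lineage: the per-element function that derived it
        self.cache = {}           # partition index -> computed data held in memory
        self.computations = 0     # how many partitions this dataset has computed

    def map(self, fn):
        # Only record the transformation; nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, index):
        # Simplification: every computed partition is kept in memory.
        if index in self.cache:
            return self.cache[index]
        if self.parent is None:
            data = list(self.source[index])
        else:
            data = [self.fn(x) for x in self.parent.compute(index)]
        self.computations += 1
        self.cache[index] = data
        return data

    def collect(self):
        result = []
        for index in range(self.num_partitions()):
            result.extend(self.compute(index))
        return result


base = ToyRDD(source=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
first = squared.collect()     # computes both partitions
del squared.cache[1]          # simulate losing partition 1 from memory
second = squared.collect()    # only partition 1 is recomputed from lineage
```

After the simulated loss, the second `collect()` returns the same data as the first, and only the missing partition is recomputed. That is the trade the table describes: spend compute rather than storage on fault tolerance.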
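Spark advantages
- Up to 100x faster than Hadoop MapReduce (the commonly cited figure, for in-memory workloads)
- Uses the main memory of the worker nodes for processing
- Avoids unnecessary disk I/O
- Can chain tasks together at the application level
- Minimizes the number of writes to disk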
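
The effect of chaining tasks in memory can be sketched in plain Python. This is a toy comparison, not Spark or MapReduce code: `run_with_disk` and `run_in_memory` are hypothetical helpers, and the "one file per step" behavior is an assumed stand-in for a framework that materializes every intermediate result.

```python
import json
import os
import tempfile


def run_with_disk(data, steps, workdir):
    """Write each intermediate result to a file, then read it back for the next step."""
    writes = 0
    current = list(data)
    for n, step in enumerate(steps):
        path = os.path.join(workdir, f"step_{n}.json")
        with open(path, "w") as f:
            json.dump([step(x) for x in current], f)
        writes += 1
        with open(path) as f:
            current = json.load(f)
    return current, writes


def run_in_memory(data, steps):
    """Chain all steps lazily; intermediate results never touch disk."""
    stream = iter(data)
    for step in steps:
        stream = map(step, stream)
    return list(stream), 0


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_writes = run_with_disk(range(5), steps, workdir)

mem_result, mem_writes = run_in_memory(range(5), steps)
```

Both variants produce the same result, but the first performs one disk write per step while the chained version performs none.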
Spark efficiency
- Uses a DAG (Directed Acyclic Graph) to represent a job's execution plan
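
To make the DAG idea concrete, here is a minimal sketch of ordering work by its dependencies, using Python's standard `graphlib` module (Python 3.9+). The stage names and the `schedule` helper are hypothetical, and this is not how Spark's own scheduler is implemented. It only shows why an acyclic dependency graph lets independent stages run side by side.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each key is a stage, each value is the set of stages it depends on.
dag = {
    "read_a": set(),
    "read_b": set(),
    "filter_a": {"read_a"},
    "join": {"filter_a", "read_b"},
    "aggregate": {"join"},
}


def schedule(graph):
    """Group stages into batches; stages in one batch have all dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = schedule(dag)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

In this example the two read stages land in the same batch because neither depends on the other, while `join` has to wait for both `filter_a` and `read_b`.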
Spark installation
Scala, Python, R -> REPL
------------------------------------------------------------
Spark integrations
- Databricks Cloud
- Spark + Hadoop + HBase + etc.
- Spark + PyData
- Kafka + Spark + Cassandra
- Spark + ElasticSearch
- Spark + Play
- Spark + Mesos