Spark - Introduction

Introduction

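Apache Spark is a unified distributed computing engine that works across different workloads and platforms. Through its various paradigms (such as Spark Streaming, Spark ML, Spark SQL, and Spark GraphX), it can connect to different platforms and process different kinds of data workloads.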

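Spark is a fast in-memory data processing engine.
It consists of a core and a set of libraries.
The core is the distributed computing engine and provides Java, Scala, and Python APIs.
The libraries on top of it provide real-time streaming, queries, machine learning, and graph processing.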

  • Uses in-memory processing as much as possible
  • General-purpose engine for both batch and real-time workloads
  • Compatible with YARN and Mesos
  • Integrates well with HBase, Cassandra, MongoDB, HDFS, Amazon S3, and other file systems and data sources

Features:

  • Transparently processes data on multiple nodes via a simple API
  • Resiliently handles failures
  • Works primarily in memory, spilling to disk when necessary
  • The same Spark code can run standalone, in Hadoop YARN, Mesos, and the cloud

Apache Spark does not provide a storage layer of its own; it relies on HDFS, Amazon S3, and similar systems.
Hadoop provides distributed storage and the MapReduce distributed computing framework. Spark, on the other hand, is a data processing framework that operates on distributed data storage provided by other technologies.
If you need to do analytics on streaming data, or your processing requires multistage logic, you will probably want to go with Spark.

Spark has three layers:

  • cluster manager: can be standalone, YARN, or Mesos. In local mode, you do not need a cluster manager to process data
  • core: provides all the underlying APIs for task scheduling and for interacting with storage
  • libraries: such as Spark SQL for interactive queries, Spark Streaming for real-time analytics, Spark ML for machine learning, and Spark GraphX for graph processing


Spark core

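Spark core is the underlying general-purpose execution engine. It contains the functionality needed to run jobs, as well as the functionality that the other components depend on.
It provides in-memory computation and references datasets in external storage through the Resilient Distributed Dataset (RDD) abstraction.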

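It provides the logic for accessing various file systems and data stores, such as HDFS, Amazon S3, HBase, Cassandra, and relational databases.
It also provides the basics:

  • supporting functions for networking, security, and scheduling
  • data shuffling (redistributing data across nodes), which is needed to build a highly scalable, fault-tolerant platform for distributed computing

DataFrames and Datasets are built on top of RDDs.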
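To make the RDD idea more concrete, here is a minimal pure-Python sketch. It is not Spark's API: the `ToyRDD` class and its methods are hypothetical names invented for illustration, and everything runs in a single process. It only tries to show three ideas from the notes above: data is split into partitions, transformations are recorded lazily as a lineage, and any single partition can be recomputed from the source plus that lineage (which is roughly how a lost partition could be rebuilt after a failure).

```python
from functools import reduce


class ToyRDD:
    """Hypothetical stand-in for an RDD: a data source plus a recorded lineage."""

    def __init__(self, source, lineage=()):
        # source: zero-argument function returning the list of partitions
        self._source = source
        # lineage: ordered per-partition transformations, applied lazily
        self._lineage = tuple(lineage)

    @classmethod
    def parallelize(cls, data, num_partitions):
        items = list(data)
        # Round-robin split of the input into partitions
        parts = [items[i::num_partitions] for i in range(num_partitions)]
        return cls(lambda: parts)

    def map(self, fn):
        # Nothing is computed here; the step is only appended to the lineage
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._lineage + (step,))

    def compute_partition(self, index):
        # Rebuild one partition from the source by replaying the lineage
        part = self._source()[index]
        for step in self._lineage:
            part = step(part)
        return part

    def collect(self):
        # Action: materialize every partition and concatenate the results
        count = len(self._source())
        return [x for i in range(count) for x in self.compute_partition(i)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


rdd = ToyRDD.parallelize(range(1, 11), num_partitions=3)
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = squares_of_evens.reduce(lambda a, b: a + b)
print(total)
```

In this sketch, `filter` and `map` return immediately, and work only happens when `collect` or `reduce` is called, which mirrors the split between transformations and actions in real RDDs.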

Spark SQL

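Spark SQL is a component on top of Spark core. It originally introduced a data abstraction called SchemaRDD, which was renamed DataFrame in Spark 1.3.
It supports structured and semi-structured data.
Using the subset of SQL supported by Spark and HiveQL, it can operate on large amounts of distributed data.
Through DataFrames and Datasets, it simplifies the processing of structured data.
It can read from and write to various data sources (such as files, Hive, HDFS, S3, and relational databases).
It provides a query optimization framework, Catalyst, which typically makes queries faster than equivalent RDD code.
It includes a Thrift server, so external systems can query data over JDBC.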

Spark streaming

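It can perform streaming analytics on data from various sources (HDFS, Kafka, Flume, Twitter, ZeroMQ, Kinesis).
It processes the stream in chunks, using micro-batches of data.
It runs on top of RDDs.
It can recover automatically from various kinds of failures.
It can be combined with the other components in a single program.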
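The micro-batch idea can be sketched in plain Python. This is only an illustration and does not use Spark's API: the helper names `micro_batches` and `word_count` are made up here, and the batches are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size records from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def word_count(batch):
    """Ordinary batch logic, applied to one micro-batch at a time."""
    counts = Counter()
    for line in batch:
        counts.update(line.split())
    return counts


# A small in-memory list stands in for an unbounded source such as Kafka
lines = ["spark streaming", "spark sql", "graphx", "spark ml", "streaming"]
per_batch = [word_count(batch) for batch in micro_batches(lines, batch_size=2)]

# Combine the per-batch results into a running total
running_total = sum(per_batch, Counter())
print(running_total.most_common(2))
```

The design point is that the per-batch function is the same kind of code one would write for a batch job, which is why streaming logic can share a program with the other components.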

Spark GraphX

GraphX provides functions for building graphs, which are represented as graph RDDs (EdgeRDD and VertexRDD).
The Pregel abstraction API can be used to model user-defined graphs and run iterative computations over them.
GraphX also contains implementations of the most important algorithms of graph theory, such as PageRank, connected components, shortest paths, SVD++, and others.
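As a rough illustration of the Pregel style of computation, the following pure-Python sketch finds connected components using vertex-centric supersteps. It does not use GraphX; the function name `connected_components` and the sample graph are assumptions made for this example. In each superstep, active vertices send their current label to their neighbors, each vertex keeps the smallest label it has seen, and the loop stops once no vertex changes.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        # Treat every edge as undirected
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Initially each vertex is its own component and is active
    labels = {v: v for v in vertices}
    active = set(vertices)

    while active:
        # Message phase: active vertices send their label to each neighbor;
        # messages to the same vertex are merged by taking the minimum
        inbox = {}
        for v in active:
            for n in neighbors[v]:
                inbox[n] = min(inbox.get(n, labels[v]), labels[v])

        # Compute phase: a vertex adopts a smaller label and stays active
        # only if its label actually changed
        active = set()
        for v, message in inbox.items():
            if message < labels[v]:
                labels[v] = message
                active.add(v)

    return labels


labels = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
print(labels)
```

Vertices 1, 2, and 3 end up sharing one label, 4 and 5 share another, and the isolated vertex 6 keeps its own id.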

Spark ML

MLlib is Spark's distributed machine learning framework.
It provides various algorithms, such as logistic regression, Naive Bayes classification, Support Vector Machines (SVMs), decision trees, random forests, linear regression, Alternating Least Squares (ALS), and k-means clustering.
