Spark - Introduction

此心光明-超然

Article link: https://blog.csdn.net/weixin_43364172/article/details/93379012

Introduction

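Spark is a unified distributed computing engine that works across different workloads and platforms. Through its various paradigms (such as Spark streaming, Spark ML, Spark SQL, and Spark GraphX), it can connect to different platforms and process different data workloads.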

It is a fast in-memory data processing engine.
It consists of a core and a set of libraries.
The core is the distributed computing engine and provides Java, Scala, and Python APIs.
Spark provides real-time streaming, queries, machine learning, and graph processing.

Uses in-memory processing as much as possible
General-purpose engine for both batch and real-time workloads
Compatible with YARN and also Mesos
Integrates well with HBase, Cassandra, MongoDB, HDFS, Amazon S3, and other file systems and data sources

Features:

Transparently processes data on multiple nodes via a simple API
Resiliently handles failures
Primarily uses memory, spilling to disk when necessary
The same Spark code can run standalone, in Hadoop YARN, Mesos, and the cloud

Apache Spark does not provide a storage layer; it relies on HDFS, Amazon S3, and similar systems.
Hadoop provides distributed storage and a MapReduce distributed computing framework. Spark, on the other hand, is a data processing framework that operates on distributed data storage provided by other technologies.
If you need to do analytics on streaming data, or your processing requires multistage logic, you will probably want to go with Spark.

Three layers:

cluster manager: can be standalone, YARN, or Mesos. In local mode, you don't need a cluster manager to process data
core: provides all the underlying APIs to perform task scheduling and interact with storage
libraries: such as Spark SQL to provide interactive queries, Spark streaming for real-time analytics, Spark ML for machine learning, and Spark GraphX for graph processing

Spark core

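The underlying general-purpose execution engine. It contains the functionality needed to run jobs, as well as the functionality required by the other components.
It provides in-memory computing and references datasets in external storage through the Resilient Distributed Dataset (RDD).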
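
The idea of a dataset that is defined by a recipe over external storage, kept in memory when possible, and rebuilt when the in-memory copy is lost can be sketched in plain Python. This is a toy, single-process model and not Spark's API: `ToyRDD`, `persist_in_memory`, `evict`, and `read_source` are hypothetical names chosen for illustration, and unlike a real RDD there is no partitioning or distribution.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # recipe (lineage step) that produces the data
        self._cache: Optional[List] = None   # in-memory copy, if one exists

    def _iterate(self) -> Iterator:
        # Serve from memory when possible, otherwise re-run the recipe.
        if self._cache is not None:
            return iter(self._cache)
        return iter(self._compute())

    def map(self, fn: Callable) -> "ToyRDD":
        # Lazy: nothing is read until a result is requested.
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def persist_in_memory(self) -> "ToyRDD":
        # Eager in this toy model; computes the data once and keeps it.
        self._cache = list(self._iterate())
        return self

    def evict(self) -> None:
        # Simulates losing the in-memory copy (for example, a failed node).
        self._cache = None

    def collect(self) -> List:
        return list(self._iterate())


reads = []  # records every access to the "external storage"


def read_source() -> List[int]:
    # Stand-in for reading a dataset from HDFS, S3, and so on.
    reads.append("read")
    return [1, 2, 3, 4, 5]


numbers = ToyRDD(read_source)
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
reads_before_action = len(reads)   # still 0: transformations are lazy

even_squares.persist_in_memory()
first = even_squares.collect()     # served from memory
even_squares.evict()
second = even_squares.collect()    # rebuilt from the recipe, source is read again
```

The second `collect()` returns the same values as the first even though the cached copy was dropped, because the chain of transformations is enough to rebuild the data from the source.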

It provides the logic for accessing various file systems and data sources, such as HDFS, Amazon S3, HBase, Cassandra, and relational databases.
It also provides the basics:

networking, security, and scheduling support functions
data shuffling, to build a highly scalable, fault-tolerant platform for distributed computing
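
To make the shuffling step concrete, here is a minimal pure-Python sketch of a word count split into a map phase, a shuffle, and a reduce phase. It is an illustration of the general idea, not Spark's implementation: the function names and the byte-sum partitioner are assumptions made for this example, and everything runs in one process.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(partition: List[str]) -> List[Tuple[str, int]]:
    # Runs independently on each input partition.
    return [(word, 1) for line in partition for word in line.split()]


def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic toy partitioner (assumption): the same key always
    # lands in the same output partition.
    return sum(key.encode("utf-8")) % num_partitions


def shuffle(mapped: List[List[Tuple[str, int]]],
            num_partitions: int) -> List[Dict[str, List[int]]]:
    # Regroup records so that all values for a key end up together.
    buckets: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(num_partitions)]
    for partition in mapped:
        for key, value in partition:
            buckets[partition_for(key, num_partitions)][key].append(value)
    return buckets


def reduce_phase(bucket: Dict[str, List[int]]) -> Dict[str, int]:
    # Runs independently on each shuffled partition.
    return {key: sum(values) for key, values in bucket.items()}


input_partitions = [
    ["spark is fast", "spark is general"],
    ["hadoop stores data", "spark processes data"],
]
mapped = [map_phase(p) for p in input_partitions]
shuffled = shuffle(mapped, num_partitions=2)

counts: Dict[str, int] = {}
for bucket in shuffled:
    counts.update(reduce_phase(bucket))
```

Because each key is routed to exactly one output partition, the reduce phase can process the partitions independently and the per-partition results can simply be merged.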

DataFrames and datasets are built on top of RDDs.

Spark SQL

Spark SQL is a component on top of Spark core that introduced a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3).
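Supports structured and semi-structured data.
Can operate on large amounts of distributed data using the subset of SQL supported by Spark and HiveQL.
Simplifies the handling of structured data through DataFrames and datasets.
Supports reading from and writing to various data sources (such as files, Hive, HDFS, S3, and relational databases).
Provides a query optimization framework, Catalyst, to improve speed (faster than RDDs).
Includes a Thrift server, so external systems can query data over JDBC.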

Spark streaming

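Can perform streaming analytics from various sources (HDFS, Kafka, Flume, Twitter, ZeroMQ, Kinesis).
Processes data in chunks using micro-batches.
Runs on top of RDDs.
Recovers automatically from various failures.
Can be combined with other components in a single program.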
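
The micro-batch idea can be sketched in a few lines of plain Python: incoming events are grouped into consecutive fixed-length time intervals, and each group is then processed as a small batch. This is a toy model under stated assumptions (non-negative arrival times, a finite list of events), and `micro_batches` is a hypothetical helper, not part of Spark.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event], batch_interval: float) -> List[List[str]]:
    """Group events into consecutive time intervals of length batch_interval."""
    by_interval: Dict[int, List[str]] = {}
    for arrival, payload in events:
        index = int(arrival // batch_interval)  # which interval the event falls in
        by_interval.setdefault(index, []).append(payload)
    if not by_interval:
        return []
    # Emit one batch per interval, including empty ones, in time order.
    return [by_interval.get(i, []) for i in range(max(by_interval) + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = micro_batches(events, batch_interval=1.0)

# Each batch is then handled like a small batch job, here a simple count.
batch_counts = [len(batch) for batch in batches]
```

With a one-second interval, the third interval contains no events, so it yields an empty batch rather than being skipped.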

Spark GraphX

GraphX provides functions for building graphs, represented as Graph RDDs.
User-defined graphs can be modeled using the Pregel abstraction API.
GraphX also contains implementations of the most important algorithms of graph theory, such as page rank, connected components, shortest paths, SVD++, and others.
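
As an illustration of the Pregel style of computation, the sketch below finds connected components by passing messages between vertices in supersteps. It is a single-process toy written in plain Python, not the GraphX API: the function name and data layout are assumptions made for this example. Each vertex starts with its own ID as its label, adopts the smallest label it receives, and notifies its neighbors only when its label changes.

```python
from typing import Dict, List, Set, Tuple


def connected_components(vertices: List[int],
                         edges: List[Tuple[int, int]]) -> Dict[int, int]:
    """Label every vertex with the smallest vertex ID in its component."""
    neighbors: Dict[int, Set[int]] = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    label = {v: v for v in vertices}
    # Initial superstep: every vertex sends its own label to its neighbors.
    inbox = {v: [label[n] for n in neighbors[v]] for v in vertices}

    # Keep running supersteps until no messages are in flight.
    while any(inbox.values()):
        outbox: Dict[int, List[int]] = {v: [] for v in vertices}
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                # Only vertices whose label changed send messages.
                for n in neighbors[v]:
                    outbox[n].append(label[v])
        inbox = outbox
    return label


components = connected_components(
    vertices=[1, 2, 3, 4, 5],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

The loop stops when a superstep produces no messages, which happens once every vertex already holds the smallest ID reachable from it.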

Spark ML

MLlib is a distributed machine learning framework.
It provides various algorithms such as logistic regression, Naive Bayes classification, Support Vector Machines (SVMs), decision trees, random forests, linear regression, Alternating Least Squares (ALS), and k-means clustering.
