Apache Spark
Apache Spark is an open-source, distributed data processing system for big data applications that uses in-memory caching to respond quickly to data of almost any size. From its official site:
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
Four advantages of Apache Spark, according to its developers:
1. Speed
Runs roughly 100x faster than its competitor, Hadoop MapReduce, and achieves high performance for both batch and streaming data.
2. Ease of Use
Offers over 80 high-level operators for building parallel apps, usable from industry-standard languages like Scala, Python, R, SQL and more.
3. Generality
Spark comes with a stack of libraries for SQL, machine learning, streaming, and graph processing. These can be combined seamlessly within the same application.
4. Runs Everywhere
No special infrastructure is required. Spark runs on already available environments like Hadoop, Mesos, Kubernetes, and the cloud, or it can run standalone.
Why is Spark winning Hearts?
The main reason for using Apache Spark is its unified engine. Before Spark arrived, many separate tools were used to do specific jobs on top of HDFS and MapReduce.
Before Apache Spark, there was a king that started the big data processing era: HDFS plus MapReduce, used to store and process large volumes of data. But as time went on, data grew along with its ever-present friend, complexity, and it became difficult to handle all the different types of requirements. Lots of tools were developed to take care of these different needs. For example:
Impala: An MPP (Massively Parallel Processing) SQL query engine for processing huge volumes of data stored in a Hadoop cluster.
Storm: Tool for real-time data processing
Mahout: Used for creating scalable machine learning algorithms
Drill: Low latency distributed query engine for large-scale datasets
and many more such tools.
So it became very difficult to manage all these data processing and manipulation requirements across separate tools. This is where Apache Spark comes in: its unified engine can do the job of all of them.
Spark Core is the underlying general execution engine for the Spark platform, on which all the other functionality such as Spark SQL, Spark Streaming, MLlib and GraphX is built.
And yes, obviously speed is one more important reason to pick Spark.
Spark Architecture:
Once again, thanks to greatlearning.in for the great picture. The one below explains the core architecture of Spark neatly.
So let's get into the image. The red-yellow block in the center is our beloved Spark. The two items attached to it refer to the storage (disk) and memory (RAM) of the environment Spark runs in. Those matter because Spark uses them very effectively, which is what makes its responses so fast.
The first layer is about the languages supported by Apache Spark.
- Scala (Spark itself is written in Scala!)
- Python
- Java
- R
- SQL
The second layer is about the libraries available in Spark. At its center sits Spark Core, which is not a library but the Spark engine itself, connecting all the libraries. In total, four libraries ship with the Spark package. (Most of the definitions below are taken from the Databricks website.)
Spark SQL: Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
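To make that concrete, here is a minimal Spark SQL sketch as it could be typed into spark-shell (the spark session is provided by the shell; the JSON path and the name/age columns are made-up examples):

    // Read a JSON file into a DataFrame; the path and columns are hypothetical.
    val people = spark.read.json("data/people.json")
    people.createOrReplaceTempView("people")

    // The same query, once via the DataFrame API and once via plain SQL.
    people.filter("age > 30").select("name").show()
    spark.sql("SELECT name FROM people WHERE age > 30").show()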
MLlib: Apache Spark MLlib is the Apache Spark machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.
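As a rough illustration (a minimal sketch with toy data, run in spark-shell so that spark.implicits._ is already imported for toDF), clustering with k-means through the DataFrame-based API looks like this:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    // Toy two-dimensional points, purely for illustration.
    val points = Seq((1.0, 1.1), (1.2, 0.9), (9.0, 9.1), (8.8, 9.2)).toDF("x", "y")

    // MLlib estimators expect a single vector column of features.
    val assembled = new VectorAssembler()
      .setInputCols(Array("x", "y"))
      .setOutputCol("features")
      .transform(points)

    val model = new KMeans().setK(2).setSeed(1L).fit(assembled)
    model.clusterCenters.foreach(println)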
GraphX: GraphX is the Spark component for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction.
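A tiny sketch of that abstraction (again assuming spark-shell, where sc is the SparkContext; the users and relationships are made up):

    import org.apache.spark.graphx.{Edge, Graph}

    // A small made-up social graph: vertices carry a name, edges a relationship label.
    val users = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    val graph = Graph(users, follows)
    println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")

    // In-degree per vertex, i.e. how many followers each user has.
    graph.inDegrees.collect().foreach(println)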
Spark Streaming: Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming provides a high-level abstraction called a DStream, which represents a continuous stream of data.
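For example, a minimal DStream word count could look like the sketch below (assuming spark-shell provides sc and that something, such as a netcat listener, is writing lines to localhost port 9999):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Wrap the existing SparkContext with a 5-second batch interval.
    val ssc = new StreamingContext(sc, Seconds(5))

    // DStream of text lines read from a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()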
The final layer is about the different modes in which we can run Spark; a small spark-submit sketch follows the list of modes.
Local Mode: We can run Spark on the local machine itself, which is useful for development and testing purposes.
Standalone Mode: Used when we want to install Spark on the servers but don't want to use other tools such as YARN or Mesos. This is like local mode on a cluster, without any further tools attached to Spark.
YARN: When Spark needs to run on a cluster and a cluster management tool is needed, YARN comes in. Simply put, YARN is a generic resource-management framework for distributed workloads.
Mesos: Apache Mesos is a centralized, fault-tolerant cluster manager designed for distributed computing environments. It provides resource management and isolation, and scheduling of CPU and memory across the cluster.
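To make the difference concrete, the deployment mode is largely a matter of the --master URL passed to spark-submit; the host names, application class and jar below are placeholders:

    spark-submit --master local[*] --class com.example.MyApp my-app.jar                    (local mode)
    spark-submit --master spark://master-host:7077 --class com.example.MyApp my-app.jar    (standalone cluster)
    spark-submit --master yarn --class com.example.MyApp my-app.jar                        (YARN)
    spark-submit --master mesos://master-host:5050 --class com.example.MyApp my-app.jar    (Mesos)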
ZooKeeper:
ZooKeeper acts as something like a communication bridge between the instances of the cluster.
ZooKeeper is the tool that ensures high availability of the Spark cluster. When the current master fails, ZooKeeper initializes a standby instance as the new master, recovers the old master's state, and resumes scheduling.
YARN is the resource-management tool that handles resource allocation, coordination between resources, and scheduling, whereas ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization.
Some Sparky Details:
Spark is not a storage system; it is just an execution engine. Spark can get its data from other storage systems such as HDFS, the local machine, and so on.
- We can use YARN, Mesos, or Kubernetes as the resource manager for Apache Spark. In standalone mode, Spark itself acts as the resource manager.
- Spark can read data from many storage systems such as HDFS, AWS S3, local storage, and more, and in the same way it can write data back to many of them.
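A small sketch of that read/write flexibility (the paths are illustrative, and writing to s3a additionally requires the Hadoop AWS connector and credentials to be configured):

    // Read a CSV with a header line from HDFS, write the same data back out as Parquet to S3.
    val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/")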
In Spark Streaming, Spark can get data from a Flume agent or a Kafka queue. Spark itself can receive streaming data directly from the sources, but to avoid losing data when an instance goes down, it is better to pull the stream through such intermediaries. Then, when a receiver instance in the Spark cluster fails, the data is not lost and can be read from Flume/Kafka by another Spark instance.
Here, Tachyon is something that sits between frameworks like Spark and the storage systems. Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster jobs.
Apache Spark Windows 10 Installation
Spark is written in Scala. Scala source code is compiled to Java bytecode, so the resulting executable code runs on a Java virtual machine. Spark requires Java 8. Open the command prompt and type java -version to check whether Java is available on the system. If Java is not available, go to this Java link and download it.
Spark works with languages like Scala, Python, Java, and R. Since Spark comes pre-built with Scala, we can continue with the Spark download now. Using Spark with the other languages is covered later in this post.
Download the required Spark distribution from Spark_Download and extract it to the desired path.
Spark comes with built-in Hadoop libraries, but to make this work on Windows we need the winutils.exe file. Head to this GitHub link, look inside the <hadoop_version>/bin path, search for winutils.exe, and download it.
Once the file is downloaded, create a folder named hadoop and, inside it, a folder named bin, then paste the downloaded winutils.exe file there. The winutils file should now be at <selected_path>/hadoop/bin.
Now we have to update the environment variables to make Spark and Hadoop work. Press the Windows key, type "Edit Environment Variables for your Account", and click it. As shown in the picture below,
use the New button to add the SPARK_HOME and HADOOP_HOME variables, then click OK to save the changes. Now we can use the Spark utility scripts.
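For example, assuming Spark was extracted to C:\spark\spark-3.x.x-bin-hadoop3 and winutils.exe sits under C:\hadoop\bin (adjust both paths to your own layout), the same variables could also be set from the command prompt:

    setx SPARK_HOME "C:\spark\spark-3.x.x-bin-hadoop3"
    setx HADOOP_HOME "C:\hadoop"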
Now go to <spark_extracted_folder>/bin, open the command line, type "spark-shell" and hit Enter. The prompt should produce output something like the below.
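As a quick sanity check once the prompt appears (spark-shell already defines spark and sc), something like the following should work:

    spark.version                            // prints the installed Spark version
    sc.parallelize(1 to 100).reduce(_ + _)   // runs a small distributed job; returns 5050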
If Python is installed on the system, use "pyspark" to open Spark in Python mode. The output will look mostly the same as in Scala mode.
In both the Scala and Python versions, the command throws two warnings: "unable to load native library" and an exception when trying to compute the page size. I will update this post if I solve these warnings in the future.
That's all for now, folks, on the introduction to Apache Spark and the installation of its environment. I am planning to write one more post on hands-on Spark with Scala. Hope we meet again soon on another day. Ta ta!
Translated from: https://medium.com/swlh/apache-spark-a-processing-friend-86a59eafa291