[Translation] Spark: Big Data Cluster Computing in Production, Chapter 1

When you first submit a Spark application to a cluster, one of the most common problems you will encounter is that the job simply won't run. Spark clusters can scale enormously, but not everything works out of the box. Although ease of use was a primary design principle of the Spark framework, once you have learned basic Spark application development, you will need further practice to understand Spark's internals and configuration more deeply. This chapter introduces the fundamentals required to run Spark. We begin with the hardware and system-level design needed to run applications in production, discuss the various ways a production Spark cluster can be installed and deployed, and offer efficient deployment approaches and common settings based on real workloads. To better understand Spark's design decisions and how to match them to concrete use cases, we also review Spark's history. Next we explore resource management: how memory, CPU, and disk are used while a Spark application runs. We then turn to storage in Spark and its subsystems, and finally to monitoring Spark.

Components Required for Installation

Before you can run the Spark programs you develop, you need a Spark cluster to test against. There are many ways to download, build, and deploy Spark (some easier than others), and this chapter covers the main ones. We start by configuring a local Spark installation; after that we look at Spark on the major Hadoop distributions (Cloudera, Hortonworks, and so on), and we also give a brief introduction to deploying Spark on AWS. Before going deeper into deploying a Spark cluster, there is one important question: what hardware should my Spark cluster run on? There are many possible answers, so rather than a one-size-fits-all prescription, we offer some general principles.

It is vital to remember that Spark runs in memory. For maximum efficiency across the whole system, you should estimate the maximum memory required based on your peak workload and largest datasets. This does not mean the cluster cannot be expanded later, but plan ahead, especially at companies where procuring hardware takes weeks or even months. When sizing memory, note that memory consumption is not 1:1 with data size. That is, if you have 1TB of data, the machines will need more than 1TB of memory, because a dataset represented as Java objects is usually considerably larger than the raw data itself. Budgeting several times the dataset size gives a more accurate picture of the memory the system needs to perform a given task.
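To get a feel for this object overhead, Spark ships a utility, org.apache.spark.util.SizeEstimator, that reports the in-memory size of a JVM object. A minimal sketch follows; the sample record is made up for illustration:

```scala
import org.apache.spark.util.SizeEstimator

object ObjectOverhead {
  def main(args: Array[String]): Unit = {
    // A 24-character record: roughly 24 bytes as raw text on disk.
    val raw = "name,age,city,2015-07-01"

    // The same record as JVM objects is considerably larger than the raw data.
    println(s"String object: ${SizeEstimator.estimate(raw)} bytes")
    println(s"Split fields:  ${SizeEstimator.estimate(raw.split(","))} bytes")
  }
}
```

Running estimates like this on a representative record and multiplying by record count is a rough but useful way to size cluster memory.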

To help address this memory overhead, at the time of this writing Spark had started a project called Project Tungsten, which can greatly reduce the overhead of Spark's use of heap memory. You do not need its details to follow this book; it is expected to appear in subsequent Spark releases as part of Spark's memory management system.

The second important component to consider when choosing hardware is the number of CPU cores per machine. This question has even less of a definitive answer: once data has been properly loaded into memory, an application is typically CPU- or network-bound. That said, the simplest way to answer it is to run your application against a small dataset and measure the ceiling of its resource usage, as sketched below.
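A minimal sketch of that approach, assuming a hypothetical input path: sample a small fraction of the data, run the real pipeline against it, and watch CPU and network usage in the Spark UI to extrapolate full-scale needs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ResourceProbe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("resource-probe"))

    // Take a 1% sample of the full dataset (the path is hypothetical).
    val sample = sc.textFile("hdfs:///data/events")
      .sample(withReplacement = false, fraction = 0.01, seed = 42L)
      .cache()

    // Force the pipeline to run; observe CPU/network usage in the Spark UI.
    println(s"Sampled records: ${sample.count()}")
    sc.stop()
  }
}
```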

Local Installation Using a Spark Standalone Cluster

The simplest Spark deployment is a standalone cluster. In this mode, you place the compiled Spark binaries on every machine in the cluster, edit a few configuration files, and then start the master and worker processes. In Chapter 2 we discuss this process in detail and present a simple plan covering basic Spark installation, deployment, and the execution of a Spark job. Because Spark is not tied to Hadoop, a standalone deployment has no dependencies other than the JDK; the currently recommended JDK version for Spark is 1.7. If you want to run Spark on top of an existing Hadoop cluster, you can deploy Spark onto the same machines as Hadoop and point Spark's configuration files at the relevant Hadoop configuration.
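Once the master and workers are up, an application connects by pointing its master URL at the standalone master. A minimal smoke test, assuming a hypothetical host name (7077 is the standalone master's default port):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("standalone-smoke-test")
      .setMaster("spark://master-host:7077") // hypothetical master host

    val sc = new SparkContext(conf)
    // A trivial distributed computation to confirm the cluster accepts work.
    println(sc.parallelize(1 to 1000).sum())
    sc.stop()
  }
}
```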

NOTE: For more information on installing Spark with Cloudera, visit: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_spark_installation.html

For more information on installing Spark on Hortonworks, visit: http://hortonworks.com/hadoop/spark/#section_6

For more information on Spark on AWS, visit: http://aws.amazon.com/articles/4926593393724923

The History of Distributed Computing Leading to Spark


We have introduced Spark as a distributed compute framework; however, we haven't really discussed what this means. Until recently, most computer systems available to both individuals and enterprises were based around single machines. These single machines came in many shapes and sizes and differed dramatically in terms of their performance, as they do today.
We're all familiar with the modern ecosystem of personal machines. At the low end, we have tablets and mobile phones. We can think of these as relatively weak, un-networked computers. At the next level we have laptops and desktop computers. These are more powerful machines, with more storage and computational ability, and potentially with one or more graphics cards (GPUs) that support certain types of massively parallel computations. Next are those machines that some people have networked in their home, although generally these machines were not networked to share their computational ability, but rather to provide shared storage, for example, to share movies or music across a home network.
Within most enterprises, the picture today is still much the same. Although the machines used may be more powerful, most of the software they run, and most of the work they do, is still executed on a single machine. This fact limits the scale and the potential impact of the work they can do. Given this limitation, a few select organizations have driven the evolution of modern parallel computing to allow networked systems of computers to do more than just share data, and to collaboratively utilize their resources to tackle enormous problems.
In the public domain, you may have heard of the SETI@home program from Berkeley or the Folding@home program from Stanford. Both of these programs were early initiatives that let individuals dedicate their machines to solving parts of a massive distributed task. In the former case, SETI has been looking for unusual signals coming from outer space collected via radio telescope. In the latter, the Stanford program runs a piece of a program computing permutations of proteins, essentially building molecules, for medical research.
Because of the size of the data being processed, no single machine, not even the massive supercomputers available in certain universities or government agencies, has had the capacity to solve these problems within the scope of a project or even a lifetime. By distributing the workload to multiple machines, the problem became potentially tractable, that is, solvable in the allotted time.
As these systems became more mature, and the computer science behind them was further developed, many organizations created clusters of machines: coordinated systems that could distribute the workload of a particular problem across many machines to extend the resources available. These systems first grew in research institutions and government agencies, but quickly moved into the public domain.

Enter the Cloud
The most well-known offering in this space is of course the proverbial "cloud." Amazon introduced AWS (Amazon Web Services), which was later followed by comparable offerings from Google, Microsoft, and others. The purpose of a cloud is to provide users and organizations with scalable clusters of machines that can be started and expanded on demand.
At about the same time, universities and certain companies were also building their own clusters in-house and continuing to develop frameworks that focused on the challenging problem of parallelizing arbitrary types of tasks and computations. Google was born out of its PageRank algorithm, and it went on to develop the MapReduce framework, which allowed a general class of problems to be solved in parallel on clusters built with commodity hardware.
This notion of building algorithms that, while not the most efficient, could be massively parallelized and scaled to thousands of machines drove the next stage of growth in this area. The idea that you could solve massive problems by building clusters, not of supercomputers, but of relatively weak and inexpensive machines, democratized distributed computing.
Yahoo, in a bid to compete with Google, developed, and later open-sourced under the Apache Foundation, the Hadoop platform: an ecosystem for distributed computing that includes a file system (HDFS), a computation framework (MapReduce), and a resource manager (YARN). Hadoop made it dramatically easier for any organization not only to create a cluster, but also to create software and execute parallelizable programs on these clusters that can process huge amounts of distributed data on multiple machines.
Spark has subsequently evolved as a replacement for MapReduce by building on the idea of creating a framework to simplify the difficult task of writing parallelizable programs that efficiently solve problems at scale. Spark's primary contribution to this space is that it provides a powerful and simple API for performing complex, distributed operations on distributed data. Users can write Spark programs as if they were writing code for a single machine, but under the hood this work is distributed across a cluster. Secondly, Spark leverages the memory of a cluster to reduce MapReduce's dependency on the underlying distributed file system, leading to dramatic performance gains. By virtue of these improvements, Spark has achieved a substantial amount of success and popularity, and has brought you here to learn more about how it accomplishes this.
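To make that single-machine feel concrete, here is a minimal sketch of a classic word count in Scala; the input path is hypothetical. Each operation reads like local collection code, yet every step runs across the cluster's partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))

    // Reads like local code, but each step is distributed across the cluster.
    val counts = sc.textFile("hdfs:///data/corpus.txt") // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```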
Spark is not the right tool for every job. Because Spark is fundamentally designed around the MapReduce paradigm, its focus is on excelling at Extract, Transform, and Load (ETL) operations. This mode of processing is typically referred to as batch processing: processing large volumes of data efficiently in a distributed manner. The downside of batch processing is that it typically introduces larger latencies for any single piece of data. Although Spark developers have been dedicating a substantial amount of effort to improving the Spark Streaming mode, it remains fundamentally limited to computations on the order of seconds. Thus, for truly low-latency, high-throughput applications, Spark is not necessarily the right tool for the job. For a large set of use cases, Spark nonetheless excels at handling typical ETL workloads and provides substantial performance gains (as much as 100 times improvement) over traditional MapReduce.
Understanding Resource Management
In the chapter on cluster management you will learn more about how the operating system handles the allocation and distribution of resources amongst the processes on a single machine. However, in a distributed environment, the cluster manager handles this challenge. In general, we primarily focus on three types of resources within the Spark ecosystem: disk storage, CPU cores, and memory. Other resources exist, of course, such as more advanced abstractions like virtual memory, GPUs, and potentially different tiers of storage, but in general we don't need to focus on those within the context of building Spark applications.
Disk Storage
The first type of resource, disk, is vital to any Spark application since it stores persistent data, the results of intermediate computations, and system state. When we refer to disk storage, we are referring to data stored on a hard drive of some kind, either the traditional rotating spindle, or newer SSDs and flash memory. Like any other resource, disk is finite. Disk storage is relatively cheap and most systems tend to have an abundance of physical storage, but in the world of big data, it's actually quite common to use up even this cheap and abundant storage! We tend to enable replication of data for the sake of durability and to support more efficient parallel computation. Also, you'll usually want to persist frequently used intermediate datasets to disk to speed up long-running jobs. Thus, it generally pays to be cognizant of disk usage, and to treat it as any other finite resource.
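As one example of persisting a reused intermediate dataset, here is a minimal sketch; the input path and the `cleaned` pipeline are hypothetical. `MEMORY_AND_DISK` spills to local disk whatever does not fit in memory:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistIntermediate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-intermediate"))

    // A hypothetical cleaning step whose output is reused by later stages.
    val cleaned = sc.textFile("hdfs:///data/raw").filter(_.nonEmpty)

    // Keep partitions in memory, spilling to local disk when memory runs out.
    cleaned.persist(StorageLevel.MEMORY_AND_DISK)

    println(cleaned.count())            // first action materializes the data
    println(cleaned.distinct().count()) // reuses the persisted copy
    sc.stop()
  }
}
```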
Interaction with physical disk storage on a single machine is abstracted away by the file system: a program that provides an API to read and write files. In a distributed environment, where data may be spread across multiple machines but still needs to be accessed as a single logical entity, a distributed file system fulfills the same role. Managing the operation of the distributed file system and monitoring its state is typically the role of the cluster administrator, who tracks usage and quotas and re-assigns resources as necessary. Cluster managers such as YARN or Mesos may also regulate access to the underlying file system to better distribute resources between simultaneously executing applications.
CPU Cores
The central processing unit (CPU) on a machine is the processor that actually executes all computations. Modern machines tend to have multiple CPU cores, meaning that they can execute multiple processes in parallel. In a cluster, we have multiple machines, each with multiple cores. On a single machine, the operating system handles communication and resource sharing between processes. In a distributed environment, the cluster manager handles the assignment of CPU resources (cores) to individual tasks and applications. In the chapter on cluster management, you'll learn specifically how YARN and Mesos ensure that multiple applications running in parallel can have access to this pool of available CPUs and share it fairly.
When building Spark applications, it's helpful to relate the number of CPU cores to the parallelism of your program, or how many tasks it can execute simultaneously. Spark is based around the resilient distributed dataset (RDD), an abstraction that treats a distributed dataset as a single entity consisting of multiple partitions. In Spark, a single task processes a single partition of an RDD on a single CPU core.
Thus, the degree to which your data is partitioned, together with the number of available cores, essentially dictates the parallelism of your program. Consider a hypothetical Spark job consisting of five stages, each needing to run 500 tasks: with only five CPU cores available, it may take a long time to complete! In contrast, if we have 100 CPU cores available, and the data is sufficiently partitioned, for example into 200 partitions, Spark will be able to parallelize much more effectively, running 100 tasks simultaneously and completing the job much more quickly. By default, Spark only uses two cores with a single executor, so when launching a Spark job for the first time, it may unexpectedly take a very long time. We discuss executor and core configuration in the next chapter.
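A minimal sketch of checking and steering this relationship, assuming a hypothetical input path and a standalone cluster where we cap the total cores at 100:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("parallelism-check")
      .set("spark.cores.max", "100") // cap total cores on a standalone cluster

    val sc = new SparkContext(conf)

    // Ask for at least 200 partitions so 100 cores stay busy (path is hypothetical).
    val data = sc.textFile("hdfs:///data/events", minPartitions = 200)

    println(s"partitions: ${data.getNumPartitions}")
    println(s"default parallelism: ${sc.defaultParallelism}")
    sc.stop()
  }
}
```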
Memory
Lastly, memory is absolutely critical to almost all Spark applications. Memory is used for internal Spark mechanisms such as the shuffle, and the JVM heap is used to persist RDDs in memory, minimizing disk I/O and providing dramatic performance gains. Spark acquires memory per executor, a worker abstraction that you'll learn more about in the next chapter. The amount of memory that Spark requests per executor is a configurable parameter, and it is the job of the cluster manager to ensure that the requested resources are provided to the requesting application.
Generally, cluster managers assign memory the same way they assign CPU cores: as discrete resources. The total available memory in a cluster is broken up into blocks or containers, and these containers are assigned (or offered, in the case of Mesos) to specific applications. In this way, the cluster manager can act both to assign memory fairly and to schedule resource usage to avoid starvation.
Each assigned block of memory in Spark is further subdivided based on Spark and cluster manager configurations. Spark makes tradeoffs between the memory allocated dynamically during the shuffle, the memory used to store cached RDDs, and the amount of memory available for off-heap storage. Most applications will require some degree of tuning to determine the appropriate balance of memory based on the RDD transformations executed within the Spark program. A Spark application with improperly configured memory settings may run inefficiently, for example, if RDDs cannot be fully persisted in memory and instead are swapped back and forth from disk. Insufficient memory allocated for the shuffle operation can also lead to slowdown, since internal tables may be swapped to disk if they cannot fit entirely into memory.
In the next chapter on cluster management, we will discuss in detail the memory structure of a block of memory allocated to Spark. Later, when we cover performance tuning, we'll show how to set the parameters associated with memory to ensure that Spark applications run efficiently and without failures.
In newer versions of Spark, starting with Spark 1.6, Spark introduces dynamic, automatic memory tuning. As of 1.6, Spark will automatically adjust the fraction of memory allocated for shuffle and caching, as well as the total amount of allocated memory. This allows you to fit larger datasets into a smaller amount of memory, as well as to more easily create programs that execute successfully out of the box, without extensive tuning of a multitude of memory parameters.
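For reference, here is a hedged sketch of the main memory-related settings in the 1.6+ unified memory model; the values below are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.SparkConf

object MemorySettingsSketch {
  // Illustrative values only; the right settings depend on workload and cluster.
  val conf = new SparkConf()
    .setAppName("memory-settings-sketch")
    .set("spark.executor.memory", "8g")         // heap requested per executor
    .set("spark.memory.fraction", "0.75")       // heap share for execution + storage
    .set("spark.memory.storageFraction", "0.5") // portion of that protected from eviction
}
```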
