DOT -- A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems
1. Abstract
Traditional parallel processing models, such as BSP, are “scale up” based, aiming to achieve
high performance by increasing computing power, interconnection network bandwidth, and memory/storage capacity within dedicated systems, while
big data analytics tasks aiming for high throughput demand that large distributed systems “scale out” by continuously adding computing and storage resources through networks. The “scale up” and “scale out” models each have a different set of performance requirements and system bottlenecks.
In this paper, we develop a general model that abstracts critical computation and communication
behavior and computation-communication interactions for big data analytics in a scalable and fault-tolerant manner. Our model is called DOT, represented by three matrices
for data sets (D), concurrent data processing operations (O), and data transformations (T), respectively.
With the DOT model, any big data analytics job execution in various software frameworks
can be represented by a specific or non-specific number of elementary/composite DOT blocks, each of which performs operations on the data sets, stores intermediate
results, makes necessary data transfers, and performs data transformations in the end. The DOT model achieves the goals of scalability and fault-tolerance by enforcing
a data-dependency-free relationship among concurrent tasks. Under the DOT model, we provide a set of optimization guidelines, which are framework and implementation
independent, and applicable to a wide variety of big data analytics jobs. Finally, we demonstrate the effectiveness of the DOT model through several case studies.
2. Two common traditional goals (shared by Google MapReduce, Hadoop,
Dryad and Pregel)
(1) for distributed applications, to provide a scalable and fault-tolerant system infrastructure
and supporting environment; and
(2) for software developers and application practitioners, to provide an easy-to-use
programming model that hides the technical details of parallelization and fault-tolerance.
3. The following three issues demand more basic and fundamental
research efforts
-Behavior Abstraction: The “scale out” model of big data analytics mainly concerns two issues:
(1) how to maintain scalability, namely ensuring a proportional increase of data-processing throughput as
the size of the data and the number of computing nodes increase; and
(2) how to provide a strong fault-tolerance mechanism in underlying distributed systems, namely being able
to quickly recover processing activities when some service nodes crash.
However, the basis and principles by which jobs can be executed with
scalability and fault tolerance are not well studied.
-Application Optimization:
Current practice on application optimization for big data analytics jobs
is underlying software framework dependent, so that optimization opportunities are only applicable to a specific software framework or a specific
system implementation. A bridging model between applications and underlying software frameworks would enable us to gain opportunities of software framework
and implementation independent optimization, which can enhance performance and productivity without impairing scalability and fault tolerance. With this bridging
model, system designers and application practitioners can focus on a set of general optimization rules regardless of the structures of software frameworks and underlying infrastructures.
-System Comparison, Simulation and Migration:
The diverse requirements of various big data analytics applications create the need for system comparison and application
migration among existing and/or newly designed software frameworks. However, without a general abstract model for the processing paradigm of various software frameworks for
big data analytics, it is hard to fairly compare different frameworks in several critical aspects, including scalability, fault-tolerance and framework functionality. Additionally,
a general model can provide guidance for building software framework simulators, which are highly desirable when designing new frameworks or customizing existing frameworks for
certain big data analytics applications. Moreover, since a bridging model between applications and various underlying software frameworks is not available, application migration from
one software framework to another depends strongly on programmers’ specialized knowledge of both frameworks and is hard to do efficiently. Thus, it is desirable to have guidance
for designing automatic tools used for application migration from one software framework to another.
All of the above three issues demand a general model that bridges applications and various underlying software frameworks
for big data analytics.
4. We propose a candidate for the general model, called DOT, which characterizes
the basic behavior of big data analytics and identifies its critical issues. The DOT model also serves as a powerful tool for analyzing, optimizing and deploying software for big data
analytics. The three symbols “D”, “O” and “T” are matrix representations for distributed data sets, concurrent
data processing operations, and data transformations, respectively. Specifically, in the DOT model, the dataflow of
a big data analytics job is represented by a DOT expression containing multiple root building blocks, called elementary DOT blocks, or their extensions, called composite DOT blocks.
For every elementary DOT block, a matrix representation is used to abstract the basic behavior of computing and communication for a big data analytics
job. The DOT model eliminates the data dependency among concurrent tasks executed by concurrent data processing units (called “workers”
in the rest of the paper), which is a critical requirement for the purpose of achieving scalability and fault-tolerance of a large distributed system.
5. THE DOT MODEL: The DOT model consists
of three major components to describe a big data analytics job:
(1) a root building block, called an elementary DOT block: A
big data (multi-)set; a set of workers; and mechanisms that regulate the processing paradigm of workers to interact with the big data (multi-)set in two steps.
(2) an extended building block, called a composite DOT block, organized as a
group of independent elementary DOT blocks; and
(3) a method that is used for building the dataflow of a big data analytics
job with elementary/composite DOT blocks.
An elementary DOT block is illustrated by Figure 1 with a three-layer structure. The bottom layer (D-layer) represents the
big data (multi-)set. A big data (multi-)set is divided into n parts (from D1 to Dn)
in a distributed system, where each part is a sub-dataset (called a chunk in the rest of the paper). In the middle layer (O-layer), n workers
directly process the data (multi-)set, and o_i is the data-processing operator
associated with the i-th worker. Each worker processes only one chunk (as shown by the arrow from D_i to o_i) and
stores intermediate results. At the top layer (T-layer), a single worker with operator t collects all intermediate results
(as shown by the arrows from o_i to t, i =
1, ..., n), then performs the last-stage data transformations based on the intermediate results, and finally
outputs the final result.
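The three-layer structure above can be sketched in a few lines of Python (a minimal, hypothetical sketch; the names `elementary_dot_block`, `chunks`, `o` and `t` are illustrative, not taken from the paper):

```python
# Minimal sketch of an elementary DOT block (illustrative names only).
# D-layer: the big data (multi-)set divided into n chunks.
# O-layer: n independent workers, each applying operator o to its own chunk.
# T-layer: a single worker applying operator t to all intermediate results.

def elementary_dot_block(chunks, o, t):
    intermediates = [o(chunk) for chunk in chunks]  # worker i only touches D_i
    return t(intermediates)                         # last-stage transformation

# Example: a distributed sum over three chunks.
chunks = [[1, 2, 3], [4, 5], [6]]
total = elementary_dot_block(chunks, o=sum, t=sum)  # per-chunk sums, then total
```

Note that the O-layer workers never exchange data, which is exactly the data-dependency-free property the model relies on.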
Based on the definition of the composite DOT block, there are three restrictions on communications among workers:
(1) workers in the O-layer cannot communicate with each other;
(2) workers in the T-layer cannot communicate with each other; and
(3) intermediate data transfers from workers in the O-layer to their corresponding workers in the T-layer are
the only communications occurring in a composite DOT block.
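These restrictions can be mirrored in a word-count-style Python sketch (all names are my own, illustrative choices): each O-layer worker partitions its intermediate result among the T-layer workers, and that O-to-T transfer is the only communication.

```python
from collections import Counter

# Sketch of a composite DOT block (illustrative, word-count style).
# O-layer workers never talk to each other; neither do T-layer workers.
# The only communication is the O-to-T transfer of partitioned intermediates.

def count_words(chunk):                  # O-layer operator
    return Counter(chunk.split())

def partition(counts, m):                # split intermediates for m T-workers
    parts = [Counter() for _ in range(m)]
    for word, n in counts.items():
        parts[len(word) % m][word] += n  # deterministic partitioner (illustrative)
    return parts

def merge(counters):                     # T-layer operator
    total = Counter()
    for c in counters:
        total.update(c)
    return total

def composite_dot_block(chunks, m):
    partitioned = [partition(count_words(c), m) for c in chunks]  # O-layer
    return [merge(p[j] for p in partitioned) for j in range(m)]   # T-layer
```

This is essentially the shape of a MapReduce job's map/shuffle/reduce phases expressed under the three restrictions.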
6. Big Data Analytics Jobs: a job is described by
its dataflow, global information and halting conditions.
(1) Dataflow of a Job: is represented by a specific or non-specific number
of elementary/composite DOT blocks.
(2) Global Information: a job may need to access
some lightweight global information, e.g. system configurations.
(3) Halting Conditions: determine when
or under what conditions a job will stop.
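The three parts above can be sketched together in Python (a sketch under my own assumptions; the `step`/`halt`/`global_info` structure is illustrative, not the paper's API):

```python
# Sketch: a job = dataflow (a sequence of DOT blocks), global information
# (e.g. configuration such as an iteration cap), and a halting condition.

def run_job(data, step, halt, global_info):
    state = data
    for _ in range(global_info["max_iters"]):  # lightweight global information
        new_state = step(state)                # one elementary/composite DOT block
        if halt(state, new_state):             # halting condition
            return new_state
        state = new_state
    return state

# Example: repeatedly halve a number until it stops changing.
result = run_job(100,
                 step=lambda x: x // 2,
                 halt=lambda old, new: old == new,
                 global_info={"max_iters": 50})
```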
7. Formal Definitions
7.1 The Elementary DOT Block :
In the above matrix representation, matrix multiplication follows the row-column pair rule of the conventional
matrix product. The multiplication of corresponding elements of the two matrices is defined as:
(1) a multiplication between a data chunk Di and
an operator f ( f can either be the operator in matrix O or
the one in matrix T) means to apply the operator on the chunk, represented by f(Di);
(2) multiplication between two operators (e.g. f1 ×
f2) means to form a composition of operators (e.g., f = f2(f1)).
In contrast to the original matrix summation, in the DOT model the summation operator Σ is replaced by a group operator F. The operation F_{i=1}^{n} (f_i(D_i)) = (f_1(D_1), ..., f_n(D_n)) means to form a collection of the data sets f_1(D_1) to f_n(D_n).
It is not required that all elements of the collection be located in a single place.
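The two multiplication rules and the group operator can be mirrored directly in Python (a sketch under the definitions above; the function names are mine, not the paper's):

```python
# Rule (1): chunk x operator means applying the operator, f(D_i).
def apply_op(chunk, f):
    return f(chunk)

# Rule (2): operator x operator means composition, f = f2(f1).
def compose(f1, f2):
    return lambda x: f2(f1(x))

# Group operator F: collects f_1(D_1), ..., f_n(D_n) into a collection;
# unlike a matrix summation, nothing is added up or required to co-locate.
def group(results):
    return tuple(results)

# Example: apply the composition "double, then add one" to each chunk.
f = compose(lambda x: 2 * x, lambda x: x + 1)
collected = group(apply_op(d, f) for d in (1, 2, 3))  # (3, 5, 7)
```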
7.2 The Composite DOT Block:
Given m elementary DOT blocks ~DO_1T_1 to ~DO_mT_m,
a composite DOT block ~DOT is formulated as:
7.3 An Algebra for Representing the Dataflow of Big
Data Analytics Jobs
a big data analytics job can be represented by an expression, called a DOT expression:
For example, a job can be composed of three composite DOT blocks, ~D_1O_1T_1, ~D_2O_2T_2 and ~D_3O_3T_3,
where the results of ~D_1O_1T_1 and ~D_2O_2T_2 are
the input of ~D_3O_3T_3.
With the algebra defined in this section, the DOT expression of this job is:
A context-free grammar to derive a DOT expression is shown in Figure 5:
With this algebra, a big data analytics job can be described by a DOT expression, together with its global information and halting conditions.
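The three-block example above can be mirrored as plain function composition in Python (illustrative only; the functions stand in for ~D_1O_1T_1, ~D_2O_2T_2 and ~D_3O_3T_3):

```python
# Sketch: the outputs of two composite DOT blocks feed a third one,
# mirroring a DOT expression that combines three blocks.
def dot_expression(block1, block2, block3, d1, d2):
    return block3([block1(d1), block2(d2)])

# Example with trivial blocks: block1 and block2 sum their inputs,
# and block3 sums the two intermediate results.
answer = dot_expression(sum, sum, sum, [1, 2], [3, 4])  # (1+2) + (3+4) = 10
```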
8. Scalability and fault-tolerance