DOT -- A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems
Traditional parallel processing models, such as BSP, are "scale up" based: they aim to achieve high performance by increasing computing power, interconnection network bandwidth, and memory/storage capacity within dedicated systems. In contrast, big data analytics tasks aiming for high throughput demand that large distributed systems "scale out" by continuously adding computing and storage resources through networks. The "scale up" and "scale out" models have different sets of performance requirements and system bottlenecks.
In this paper, we develop a general model that abstracts critical computation and communication behavior and computation-communication interactions for big data analytics in a scalable and fault-tolerant manner. Our model is called DOT, represented by three matrices for data sets (D), concurrent data processing operations (O), and data transformations (T), respectively.
With the DOT model, any big data analytics job execution in various software frameworks can be represented by a specific or non-specific number of elementary/composite DOT blocks, each of which performs operations on the data sets, stores intermediate results, makes necessary data transfers, and performs data transformations in the end. The DOT model achieves the goals of scalability and fault-tolerance by enforcing a data-dependency-free relationship among concurrent tasks. Under the DOT model, we provide a set of optimization guidelines, which are framework and implementation independent, and applicable to a wide variety of big data analytics jobs. Finally, we demonstrate the effectiveness of the DOT model through several case studies.
2. Two common traditional goals of existing software frameworks (including Google MapReduce, Hadoop, Dryad and Pregel):
(1) for distributed applications, to provide a scalable and fault-tolerant system infrastructure and supporting environment; and
(2) for software developers and application practitioners, to provide an easy-to-use programming model that hides the technical details of parallelization and fault-tolerance.
3. Three issues that demand more basic and fundamental research efforts:
-Behavior Abstraction: The “scale out” model of big data analytics mainly concerns two issues:
(1) how to maintain the scalability, namely to ensure a proportional increase of data processing throughput as the size of the data and the number of computing nodes increase; and
(2) how to provide a strong fault-tolerance mechanism in underlying distributed systems, namely to be able to quickly recover processing activities as some service nodes crash.
However, the basis and principles by which jobs can be executed with scalability and fault tolerance are not well studied.
-Application Optimization: Current practice of application optimization for big data analytics jobs depends on the underlying software framework, so that optimization opportunities are only applicable to a specific software framework or a specific system implementation. A bridging model between applications and underlying software frameworks would open opportunities for software-framework- and implementation-independent optimization, which can enhance performance and productivity without impairing scalability and fault tolerance. With this bridging model, system designers and application practitioners can focus on a set of general optimization rules regardless of the structures of software frameworks and underlying infrastructures.
-System Comparison, Simulation and Migration:
The diverse requirements of various big data analytics applications create the need for system comparison and application migration among existing and/or newly designed software frameworks. However, without a general abstract model of the processing paradigm of various software frameworks for big data analytics, it is hard to fairly compare different frameworks in several critical aspects, including scalability, fault tolerance and framework functionality. Additionally, a general model can guide the building of software framework simulators, which are greatly desirable when designing new frameworks or customizing existing frameworks for certain big data analytics applications. Moreover, since a bridging model between applications and various underlying software frameworks is not available, application migration from one software framework to another depends strongly on programmers' special knowledge of both frameworks and is hard to do efficiently. Thus, it is desirable to have guidance for designing automatic tools for application migration from one software framework to another.
All of the above three issues demand a general model that bridges applications and various underlying software frameworks for big data analytics.
4. We propose a candidate for the general model, called DOT, which characterizes the basic behavior of big data analytics and identifies its critical issues. The DOT model also serves as a powerful tool for analyzing, optimizing and deploying software for big data analytics. The three symbols "D", "O" and "T" are matrix representations of distributed data sets, concurrent data processing operations, and data transformations, respectively. Specifically, in the DOT model, the dataflow of a big data analytics job is represented by a DOT expression containing multiple root building blocks, called elementary DOT blocks, or their extensions, called composite DOT blocks. For every elementary DOT block, a matrix representation is used to abstract the basic computing and communication behavior of a big data analytics job. The DOT model eliminates data dependency among concurrent tasks executed by concurrent data processing units (called "workers" in the rest of the paper), which is a critical requirement for achieving scalability and fault tolerance in a large distributed system.
5. THE DOT MODEL: The DOT model consists of three major components to describe a big data analytics job:
(1) a root building block, called an elementary DOT block, consisting of: a big data (multi-)set; a set of workers; and mechanisms that regulate the processing paradigm by which workers interact with the big data (multi-)set in two steps;
(2) an extended building block, called a composite DOT block, that is organized as a group of independent elementary DOT blocks; and
(3) a method that is used for building the dataflow of a big data analytics job with elementary/composite DOT blocks.
An elementary DOT block is illustrated by Figure 1 with a three-layer structure. The bottom layer (D-layer) represents the big data (multi-)set. A big data (multi-)set is divided into n parts (from D1 to Dn) in a distributed system, where each part is a sub-dataset (called a chunk in the rest of the paper). In the middle layer (O-layer), n workers directly process the data (multi-)set and oi is the data-processing operator associated with the ith worker. Each worker only processes a chunk (as shown by the arrow from Di to oi) and stores intermediate results. At the top layer (T-layer), a single worker with operator t collects all intermediate results (as shown by the arrows from oi to t, i = 1, . . . , n), then performs the last-stage data transformations based on intermediate results, and finally outputs the final result.
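The three-layer structure can be sketched in code. This is a minimal illustration, not an implementation from the paper: the word-count operators, chunk layout and function names are all assumptions chosen for the example.

```python
# Sketch of an elementary DOT block: D-layer chunks, O-layer workers, one T-layer worker.
def elementary_dot_block(chunks, o, t):
    """Each O-layer worker applies o to exactly one chunk (no peer communication);
    a single T-layer worker t collects all intermediate results and transforms them."""
    intermediates = [o(chunk) for chunk in chunks]  # O-layer: one worker per chunk
    return t(intermediates)                         # T-layer: last-stage transformation

# Illustrative job: counting words across n chunks of a data set.
def o_count(chunk):                        # per-chunk operator o_i
    counts = {}
    for word in chunk:
        counts[word] = counts.get(word, 0) + 1
    return counts

def t_merge(partials):                     # T-layer operator t
    total = {}
    for counts in partials:
        for word, c in counts.items():
            total[word] = total.get(word, 0) + c
    return total

chunks = [["a", "b", "a"], ["b", "c"], ["a"]]   # D1, D2, D3
result = elementary_dot_block(chunks, o_count, t_merge)
# result == {"a": 3, "b": 2, "c": 1}
```

Note that each O-layer worker touches only its own chunk, which is exactly the data-dependency-free property the model relies on.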
Based on the definition of the composite DOT block, there are three restrictions on communications among workers:
(1) workers in the O-layer cannot communicate with each other;
(2) workers in the T-layer cannot communicate with each other; and
(3) intermediate data transfers from workers in the O-layer to their corresponding workers in the T-layer are the only communications occurring in a composite DOT block.
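The three restrictions above can be made concrete with a small sketch, assuming a composite block of m elementary blocks over the same n chunks; the operators (`sum`, `len`) are illustrative choices, not from the paper.

```python
# Sketch of a composite DOT block under the three communication restrictions.
def composite_dot_block(chunks, O, T):
    """chunks: n data chunks; O: n x m matrix of operators (O[i][j] is o_{i,j});
    T: list of m T-layer operators. Column j forms one independent elementary block."""
    n, m = len(chunks), len(T)
    results = []
    for j in range(m):                                          # T-layer workers never interact
        intermediates = [O[i][j](chunks[i]) for i in range(n)]  # O-layer workers never interact
        results.append(T[j](intermediates))                     # only O -> T transfers occur
    return results

chunks = [[1, 2], [3, 4], [5]]
O = [[sum, len]] * 3          # column 1 sums each chunk, column 2 counts its elements
T = [sum, sum]                # t_1 totals the partial sums, t_2 totals the counts
res = composite_dot_block(chunks, O, T)
# res == [15, 5]
```

Each column is evaluated in isolation, so the m elementary blocks could run on disjoint sets of machines with no coordination beyond the O-to-T transfers.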
6. Big Data Analytics Jobs: a job is described by its dataflow, global information and halting conditions.
(1) Dataflow of a Job: the dataflow is represented by a specific or non-specific number of elementary/composite DOT blocks.
(2) Global Information: a job may need to access some lightweight global information, e.g. system configurations.
(3) Halting Conditions: conditions that determine when or under what circumstances a job will stop.
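How the three parts fit together can be sketched with an iterative job whose body is one elementary DOT block. Everything here (the doubling operators, the threshold, the `run_job` driver) is an illustrative assumption, not part of the DOT model itself.

```python
# Sketch: a job = dataflow (repeated DOT block) + global information + halting condition.
def run_job(chunks, o, t, global_info, halted):
    """Repeatedly execute an elementary DOT block until the halting condition holds."""
    result = None
    while not halted(result, global_info):
        intermediates = [o(c) for c in chunks]   # O-layer
        result = t(intermediates)                # T-layer
        chunks = [result]                        # output feeds the next block's D-layer
        global_info["iteration"] += 1
    return result

double = lambda chunk: [2 * x for x in chunk]                 # o_i: double every value
flatten = lambda parts: [x for part in parts for x in part]   # t: concatenate partial results

info = {"iteration": 0, "threshold": 100}        # lightweight global information
halted = lambda r, g: r is not None and sum(r) >= g["threshold"]
out = run_job([[1, 2], [3, 4]], double, flatten, info, halted)
# out == [16, 32, 48, 64] after 4 iterations (sum 160 >= threshold 100)
```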
7. Formal Definitions
7.1 The Elementary DOT Block:
An elementary DOT block is formulated as a matrix product of a 1×n data matrix D = (D_1, ..., D_n), an n×1 operator matrix O whose entries are o_1 to o_n, and a 1×1 matrix T = (t), which evaluates to t(∇_{i=1}^{n} o_i(D_i)). In this matrix representation, matrix multiplication follows the row-column pair rule of the conventional matrix product. The multiplication of corresponding elements of the two matrices is defined as follows:
(1) a multiplication between a data chunk Di and an operator f ( f can either be the operator in matrix O or the one in matrix T) means to apply the operator on the chunk, represented by f(Di);
(2) a multiplication between two operators (e.g. f1 × f2) means to form a composition of operators (e.g., f = f2(f1)). In contrast to the conventional matrix product, in the DOT model the summation operator Σ is replaced by a group operator ∇. The operation ∇_{i=1}^{n} f_i(D_i) = (f_1(D_1), ..., f_n(D_n)) means to compose a collection of data sets f_1(D_1) to f_n(D_n). It is not required that all elements of the collection be located in a single place.
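The two multiplication rules and the group operator can be sketched directly, treating operators as ordinary Python functions and chunks as plain values; the helper names (`apply_op`, `compose`, `group`) are illustrative.

```python
# Sketch of the DOT multiplication rules and the group operator.
def apply_op(f, chunk):
    """Rule (1): a chunk times an operator means applying the operator, f(D_i)."""
    return f(chunk)

def compose(f1, f2):
    """Rule (2): f1 x f2 forms the composition f = f2(f1(.))."""
    return lambda x: f2(f1(x))

def group(parts):
    """The group operator: collects f_1(D_1), ..., f_n(D_n) into one collection.
    The elements need not be co-located; a tuple just models the grouping."""
    return tuple(parts)

chunks = [1, 2, 3]
inc = lambda x: x + 1
square = lambda x: x * x
g = group(apply_op(compose(inc, square), c) for c in chunks)
# g == (4, 9, 16), i.e. ((1+1)^2, (2+1)^2, (3+1)^2)
```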
7.2 The Composite DOT Block:
Given m elementary DOT blocks ~DO1T1 to ~DOmTm defined on the same data (multi-)set, a composite DOT block ~DOT is formulated by arranging the m n×1 operator matrices side by side into an n×m matrix O, where entry o_{i,j} is the operator of the ith worker of the jth elementary block, and placing the m T-layer operators on the diagonal of an m×m diagonal matrix T = diag(t_1, ..., t_m). The product evaluates to (t_1(∇_{i=1}^{n} o_{i,1}(D_i)), ..., t_m(∇_{i=1}^{n} o_{i,m}(D_i))).
7.3 An Algebra for Representing the Dataflow of Big Data Analytics Jobs
a big data analytics job can be represented by an expression, called a DOT expression:
For example, a job can be composed of three composite DOT blocks, ~D1O1T1, ~D2O2T2 and ~D3O3T3, where the results of ~D1O1T1 and ~D2O2T2 are the input of ~D3O3T3. With the algebra defined in this section, the DOT expression of this job combines the outputs of the first two blocks into the data matrix ~D3 of the third block.
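The example dataflow can be sketched as follows, with each composite block reduced to a single-column block for brevity; the operators (`sum`, `max`) and the way results are re-chunked into ~D3 are illustrative assumptions.

```python
# Sketch of the example DOT expression: outputs of two blocks feed a third.
def dot_block(chunks, o, t):
    """One (single-column) DOT block: apply o per chunk, then t over the collection."""
    return t([o(c) for c in chunks])

D1, D2 = [[1, 2], [3]], [[4], [5, 6]]
r1 = dot_block(D1, sum, sum)   # ~D1O1T1: per-chunk sums 3, 3 -> 6
r2 = dot_block(D2, sum, sum)   # ~D2O2T2: per-chunk sums 4, 11 -> 15
D3 = [[r1], [r2]]              # the two results form the input data set ~D3
r3 = dot_block(D3, sum, max)   # ~D3O3T3: picks the larger of the two results
```

The composition is purely dataflow-level: block 3 never sees how its input chunks were produced, which is what makes the expression framework-independent.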
A context-free grammar to derive a DOT expression is shown in Figure 5.
With this algebra, the dataflow of a big data analytics job becomes a DOT expression, and the job itself can be described by a DOT expression, global information and halting conditions.
8. Scalability and Fault-Tolerance