DOT--A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distri

Original post, 2013-12-02 15:54:50
1. Abstract
      Traditional parallel processing models, such as BSP, are “scale up” based, aiming to achieve high performance by increasing computing power, interconnection network bandwidth, and memory/storage capacity within dedicated systems, while big data analytics tasks aiming for high throughput demand that large distributed systems “scale out” by continuously adding computing and storage resources through networks. Each one of the “scale up” model and “scale out” model has a different set of performance requirements and system bottlenecks. 
      In this paper, we develop a general model that abstracts critical computation and communication behavior and computation-communication interactions for big data analytics in a scalable and fault-tolerant manner. Our model is called DOT, represented by three matrices for data sets (D), concurrent data processing operations (O), and data transformations (T), respectively. 
      With the DOT model, any big data analytics job execution in various software frameworks can be represented by a specific or non-specific number of elementary/composite DOT blocks, each of which performs operations on the data sets, stores intermediate results, makes necessary data transfers, and performs data transformations in the end. The DOT model achieves the goals of scalability and fault-tolerance by enforcing a data-dependency-free relationship among concurrent tasks. Under the DOT model, we provide a set of optimization guidelines, which are framework and implementation independent, and applicable to a wide variety of big data analytics jobs. Finally, we demonstrate the effectiveness of the DOT model through several case studies.
2. Two common goals of existing software frameworks (including Google MapReduce, Hadoop, Dryad and Pregel)
      (1) for distributed applications, to provide a scalable and fault-tolerant system infrastructure and supporting environment; and
      (2) for software developers and application practitioners, to provide an easy-to-use programming model that hides the technical details of parallelization and fault-tolerance.
3. Three issues that demand more basic and fundamental research
      -Behavior Abstraction: The “scale out” model of big data analytics mainly concerns two issues:
            (1) how to maintain the scalability, namely to ensure a proportional increase of data processing throughput as the size of the data and the number of computing nodes increase; and
            (2) how to provide a strong fault-tolerance mechanism in underlying distributed systems, namely to be able to quickly recover processing activities as some service nodes crash.
            However, the basis and principles on which jobs can be executed with scalability and fault-tolerance are not well studied.
       -Application Optimization:
       Current practice in application optimization for big data analytics jobs depends on the underlying software framework, so optimization opportunities apply only to a specific software framework or a specific system implementation. A bridging model between applications and underlying software frameworks would open up optimization opportunities that are independent of software framework and implementation, and that can enhance performance and productivity without impairing scalability and fault tolerance. With this bridging model, system designers and application practitioners can focus on a set of general optimization rules regardless of the structures of software frameworks and underlying infrastructures.
      -System Comparison, Simulation and Migration:
       The diverse requirements of big data analytics applications create a need for system comparison and application migration among existing and/or newly designed software frameworks. However, without a general abstract model for the processing paradigm of these frameworks, it is hard to compare them fairly in several critical aspects, including scalability, fault-tolerance and framework functionality. Additionally, a general model can guide the building of software framework simulators, which are highly desirable when designing new frameworks or customizing existing ones for certain big data analytics applications. Moreover, since a bridging model between applications and the underlying software frameworks is not available, migrating an application from one framework to another depends strongly on the programmer's specialized knowledge of both frameworks and is hard to do efficiently. Thus, it is desirable to have guidance for designing automatic tools for application migration from one software framework to another.
All three issues above demand a general model that bridges applications and the various underlying software frameworks for big data analytics.
4. We propose a candidate for the general model, called DOT, which characterizes the basic behavior of big data analytics and identifies its critical issues. The DOT model also serves as a powerful tool for analyzing, optimizing and deploying software for big data analytics. The three symbols “D”, “O”, and “T” are matrix representations for distributed data sets, concurrent data processing operations, and data transformations, respectively. Specifically, in the DOT model, the dataflow of a big data analytics job is represented by a DOT expression containing multiple root building blocks, called elementary DOT blocks, or their extensions, called composite DOT blocks. For every elementary DOT block, a matrix representation is used to abstract the basic computing and communication behavior of a big data analytics job. The DOT model eliminates data dependency among concurrent tasks executed by concurrent data processing units (called “workers” in the rest of the paper), which is a critical requirement for achieving scalability and fault-tolerance in a large distributed system.
5. THE DOT MODEL:  The DOT model consists of three major components to describe a big data analytics job:
     (1) a root building block, called an elementary DOT block, which consists of a big data (multi-)set, a set of workers, and mechanisms that regulate how the workers interact with the big data (multi-)set in two steps;
     (2) an extended building block, called a composite DOT block, that is organized as a group of independent elementary DOT blocks; and
     (3) a method that is used for building the dataflow of a big data analytics job with elementary/composite DOT blocks.
      An elementary DOT block is illustrated by Figure 1 with a three-layer structure. The bottom layer (D-layer) represents the big data (multi-)set. A big data (multi-)set is divided into n parts (from D1 to Dn) in a distributed system, where each part is a sub-dataset (called a chunk in the rest of the paper). In the middle layer (O-layer), n workers directly process the data (multi-)set, and oi is the data-processing operator associated with the ith worker. Each worker processes only one chunk (as shown by the arrow from Di to oi) and stores its intermediate results. At the top layer (T-layer), a single worker with operator t collects all intermediate results (as shown by the arrows from oi to t, i = 1, . . . , n), then performs the last-stage data transformations based on the intermediate results, and finally outputs the end result.
      Based on the definitions of the composite DOT block, there are three restrictions on communications among workers:
      (1) workers in the O-layer cannot communicate with each other;
      (2) workers in the T-layer cannot communicate with each other; and
      (3) intermediate data transfers from workers in the O-layer to their corresponding workers in the T-layer are the only communications occurring in a composite DOT block.
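
The layered structure and the communication restrictions above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function names (`run_elementary_block`, `run_composite_block`), the word-count operators, and the use of a thread pool to stand in for distributed workers are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_elementary_block(chunks, operators, t):
    """Worker i applies operators[i] to chunks[i]; one T-layer worker applies t."""
    if len(chunks) != len(operators):
        raise ValueError("need exactly one operator per chunk")
    # O-layer: each worker sees only its own chunk, so the tasks share no data
    # and can run (or be re-run after a failure) independently of each other.
    with ThreadPoolExecutor() as pool:
        intermediate = list(
            pool.map(lambda pair: pair[1](pair[0]), zip(chunks, operators))
        )
    # T-layer: the only data transfer is from the O-layer to this worker.
    return t(intermediate)


def run_composite_block(chunks, operator_columns, transforms):
    """A composite block: independent elementary blocks over the same chunks."""
    return [
        run_elementary_block(chunks, ops, t)
        for ops, t in zip(operator_columns, transforms)
    ]


def count_words(chunk):
    # Hypothetical O-layer operator: per-chunk word count.
    return Counter(chunk.split())


def merge_counts(partials):
    # Hypothetical T-layer operator: merge the per-chunk counts.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


chunks = ["a b a", "b c", "a c c"]
result = run_elementary_block(chunks, [count_words] * len(chunks), merge_counts)
composite = run_composite_block(
    chunks,
    [[count_words] * 3, [len] * 3],  # two independent columns of operators
    [merge_counts, sum],             # one T-layer worker per column
)
print(dict(result), composite[1])
```

Because no O-layer task reads another task's chunk or output, a failed task can simply be re-executed on its chunk, which is the property the model relies on for fault-tolerance.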
6. Big Data Analytics Jobs: a job is described by its dataflow, global information and halting conditions.
    (1) Dataflow of a Job: represented by a specific or non-specific number of elementary/composite DOT blocks.
    (2) Global Information: workers may need to access some lightweight global information, e.g. system configurations.
    (3) Halting Conditions: determine when, or under what conditions, a job will stop.
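
As a rough illustration of how these three parts fit together, the sketch below runs a dataflow repeatedly until a halting condition holds. The `Job` class, `run_job`, and the toy "halve every value" dataflow are hypothetical names and choices made for this example; the paper does not prescribe this structure.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Job:
    # One pass over the data, standing in for a sequence of DOT blocks.
    dataflow: Callable[[List[Any], Dict[str, Any]], List[Any]]
    # Decides whether to stop, given the latest chunks, step count and global info.
    should_halt: Callable[[List[Any], int, Dict[str, Any]], bool]
    # Lightweight read-only information shared by all workers.
    global_info: Dict[str, Any] = field(default_factory=dict)


def run_job(job, chunks):
    """Apply the dataflow until the halting condition is met."""
    step = 0
    while True:
        chunks = job.dataflow(chunks, job.global_info)
        step += 1
        if job.should_halt(chunks, step, job.global_info):
            return chunks, step


def halve_pass(chunks, info):
    # Toy dataflow: each chunk is processed on its own (no cross-chunk access).
    return [[x / 2 for x in chunk] for chunk in chunks]


def small_enough(chunks, step, info):
    # Halt when every value is at or below the threshold, or after max_steps.
    done = all(x <= info["threshold"] for chunk in chunks for x in chunk)
    return done or step >= info["max_steps"]


job = Job(halve_pass, small_enough, {"threshold": 1.0, "max_steps": 10})
final, steps = run_job(job, [[8.0, 4.0], [2.0]])
print(final, steps)
```

A job with a fixed number of blocks corresponds to a halting condition that depends only on the step count; an iterative job with a "non-specific" number of blocks corresponds to one that inspects the data, as here.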
7. Formal Definitions
     7.1 The Elementary DOT Block :
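
     The matrix representation itself is not reproduced in this post. Based on the three-layer description in section 5 (a row of chunks D1 to Dn, one operator oi per chunk, and a single operator t), it can be reconstructed roughly as follows; the exact layout is an assumption, and the symbol $\bigsqcup$ stands for the group operator that replaces summation:

$$
\vec{D}OT =
\begin{bmatrix} D_1 & \cdots & D_n \end{bmatrix}
\begin{bmatrix} o_1 \\ \vdots \\ o_n \end{bmatrix}
\begin{bmatrix} t \end{bmatrix}
=
\begin{bmatrix} \bigsqcup_{i=1}^{n} o_i(D_i) \end{bmatrix}
\begin{bmatrix} t \end{bmatrix}
=
\begin{bmatrix} t\big(o_1(D_1), \cdots, o_n(D_n)\big) \end{bmatrix}
$$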
     In the above matrix representation, matrix multiplication follows the row-column pair rule of the conventional matrix product. The multiplication of corresponding elements of the two matrices is defined as: 
    (1) a multiplication between a data chunk Di and an operator f (f can be either the operator in matrix O or the one in matrix T) means to apply the operator to the chunk, represented by f(Di);
    (2) a multiplication between two operators (e.g. f1 × f2) means to form a composition of operators (e.g., f = f2(f1)). In contrast to the original matrix summation, in the DOT model the summation operator $\sum$ is replaced by a group operator $\bigsqcup$. The operation $\bigsqcup_{i=1}^{n} f_i(D_i) = (f_1(D_1), \cdots, f_n(D_n))$ means to compose a collection of data sets f1(D1) to fn(Dn). It is not required that all elements of the collection be located in a single place.
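
These redefined rules can be tried out directly as a generalized row-by-matrix product. The sketch below is an assumption-laden illustration, not the paper's code: `dot_mul`, `group` and `dot_product` are hypothetical names, tuples stand in for "a collection of data sets", and a data chunk is assumed never to be a callable.

```python
def dot_mul(left, right):
    """DOT 'multiplication' of two matrix elements."""
    if callable(left) and callable(right):
        # operator x operator: compose, f1 x f2 = f2(f1(.))
        return lambda chunk: right(left(chunk))
    # data x operator: apply the operator to the data
    return right(left)


def group(items):
    """Group operator: collect results side by side without reducing them."""
    return tuple(items)


def dot_product(row, matrix):
    """Row vector times matrix, with dot_mul as product and group as sum."""
    if len(row) != len(matrix):
        raise ValueError("row length must equal the number of matrix rows")
    n_cols = len(matrix[0])
    return [
        group(dot_mul(row[i], matrix[i][j]) for i in range(len(row)))
        for j in range(n_cols)
    ]


D = [[1, 2], [3, 4], [5, 6]]   # three chunks
O = [[sum], [sum], [sum]]      # n x 1 matrix: one operator per chunk
T = [[max]]                    # 1 x 1 matrix: a single transformation

intermediate = dot_product(D, O)     # one grouped collection of partial sums
final = dot_product(intermediate, T)  # t applied to that collection
doubled_sum = dot_mul(sum, lambda s: s * 2)  # operator composition
print(intermediate, final, doubled_sum([1, 2]))
```

Note that `group` only collects; any actual reduction happens inside an operator such as `max`, which mirrors the statement that the elements of the collection need not sit in one place until a T-layer worker gathers them.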
   7.2 The Composite DOT Block: 
     Given m elementary DOT blocks $\vec{D}O_1T_1$ to $\vec{D}O_mT_m$, a composite DOT block $\vec{D}OT$ is formulated as:
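
     The formula did not survive in this post. A reconstruction consistent with the description in section 5 (m independent elementary blocks over the same data vector, with each column of O-layer workers feeding only its own T-layer worker) might look like the following. The index names $o_{i,j}$ and $t_j$, and the reading of the empty off-diagonal entries of T as "no transfer", are assumptions:

$$
\vec{D}OT =
\begin{bmatrix} D_1 & \cdots & D_n \end{bmatrix}
\begin{bmatrix}
o_{1,1} & \cdots & o_{1,m} \\
\vdots & \ddots & \vdots \\
o_{n,1} & \cdots & o_{n,m}
\end{bmatrix}
\begin{bmatrix}
t_1 & & \\
 & \ddots & \\
 & & t_m
\end{bmatrix}
=
\begin{bmatrix}
t_1\big(o_{1,1}(D_1), \cdots, o_{n,1}(D_n)\big) & \cdots & t_m\big(o_{1,m}(D_1), \cdots, o_{n,m}(D_n)\big)
\end{bmatrix}
$$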
     7.3  An Algebra for Representing the Dataflow of Big Data Analytics Jobs
     A big data analytics job can be represented by an expression, called a DOT expression:
     For example, a job can be composed of three composite DOT blocks, $\vec{D}_1O_1T_1$, $\vec{D}_2O_2T_2$ and $\vec{D}_3O_3T_3$, where the results of $\vec{D}_1O_1T_1$ and $\vec{D}_2O_2T_2$ are the input of $\vec{D}_3O_3T_3$. With the algebra defined in this section, the DOT expression of this job is:
         A context-free grammar to derive a DOT expression is shown in Figure 5:
     With this algebra, the dataflow of a big data analytics job is written as a DOT expression, so the job as a whole can be described by a DOT expression, global information and halting conditions.
8. Scalability and fault-tolerance

