Provenance Tracking in Data Lakes (Crossing Analytics Systems: A Case for Integrated Provenance in Data Lakes)

Note: this article is excerpted from https://legacy.cs.indiana.edu/ftp/techreports/TR728.pdf

Contents

Abstract

INTRODUCTION

DATA LAKE ARCHITECTURE

Role of Provenance

Challenges in Provenance Capture

Provenance Integration Across Systems

Reference Architecture

PROTOTYPE IMPLEMENTATION

Komadu

Data Lake Use Case

Provenance Queries and Visualization

Performance Evaluation

RELATED WORK

CONCLUSION AND FUTURE WORK


Abstract

The volumes of data in Big Data, their variety and unstructured nature, have had researchers looking beyond the data warehouse. The data warehouse, among other features, requires mapping data to a schema upon ingest, an approach seen as inflexible for the massive variety of Big Data. The Data Lake is emerging as an alternate solution for storing data of widely divergent types and scales. Designed for high flexibility, the Data Lake follows a schema-on-read philosophy and data transformations are assumed to be performed within the Data Lake. During its lifecycle in a Data Lake, a data product may undergo numerous transformations performed by any number of Big Data processing engines, leading to questions of traceability. In this paper we argue that provenance contributes to easier data management and traceability within a Data Lake infrastructure. We discuss the challenges in provenance integration in a Data Lake and propose a reference architecture to overcome the challenges. We evaluate our architecture through a prototype implementation built using our distributed provenance collection tools.

Purpose of the paper: the data warehouse requires mapping data into a structured schema and therefore lacks flexibility. The Data Lake uses a schema-on-read model and offers good query performance, but tracing the origins of data is difficult because data is processed by a variety of compute engines. This paper proposes and evaluates an architecture for tracing data provenance.

INTRODUCTION

Industry, academia, and research alike are grappling with the opportunities that Big Data brings in mining data from numerous sources for insight, decision making, and predictive forecasts. These sources (e.g., clickstream, sensor data, IoT devices, social media, server logs) are frequently both external and internal to an organization. Data from sources such as social media and sensors is generated continuously. Depending on the source, data can be structured, semi-structured, or unstructured. The traditional solution, the data warehouse, is proving inflexible and limited [1] as a data management framework in support of multiple analytics platforms and data of numerous sources and types. The response to the limits of the data warehouse is the Data Lake [2], [1]. A feature of the Data Lake is schema-on-read (as opposed to schema-on-write, which happens at ingest time), where commitments to a particular schema are deferred to the time of use. Schema-on-read suggests that data are ingested in a raw form, then converted to a particular schema as needed to carry out analysis. The Data Lake paradigm acknowledges that the high-throughput analytics platforms in use today are varied, so it has to support multiple Big Data processing frameworks like Apache Hadoop1, Apache Spark2 and Apache Storm3.

1http://hadoop.apache.org/

2http://spark.apache.org/

3http://storm.apache.org/

The vision of the Data Lake is one of data from numerous sources being dropped into the lake quickly and easily, with tools around the lake as designated fishers of the lake, intent on catching insight from the rich ecosystem of data within the Data Lake. This greater flexibility of the Data Lake leads to rich collections of data from various sources. It, however, leaves greater manageability burdens in the hands of the lake administrators. The Data Lake can easily become a “dump everything” place due to the absence of any enforced schema, sometimes referred to in the literature as a “data swamp” [3]. The Data Lake could easily ignore the fact that data items in a Data Lake can exist in different stages of their life cycle. One data item may be in a raw stage after recent generation, while another may be the refined result of analysis by one or more of the analysis tools. The complications of the data life cycle increase the need for proper traceability mechanisms. In this paper the critical focus of our attention is on metadata and lineage information through a data life cycle, which are key to good data accessibility and traceability [4]. Data provenance is information about the activities, entities, and people involved in producing a data product. Data provenance is often represented in one of two standard provenance representations (i.e., OPM [5] or PROV [6]). Data provenance can help with management of a Data Lake by making it clear where an object is in its lifecycle. This information can ease the burden of transformation needed by analysis tools to operate on a dataset. For instance, how does a researcher know which datasets are available in the lake for Apache Spark analysis, and which can be made available through a small amount of transformation? This information can be derived by hand by the lake administrator and stored in a registry, but that approach runs counter to the ease with which new data can be added to the Data Lake. Another issue with the Data Lake is trust. Suppose a Data Lake is set up to organize data for the watershed basin of the Lower Mekong River in Southeast Asia. The contributors are going to be from numerous countries through which the Mekong River passes. How does the Data Lake ensure that the uses of the data in the lake are proper and adhere to the terms of contribution? How does a researcher, who uses the Data Lake for their research, prove that their research is done in a way that is consistent with the terms of contribution? Provenance contributes to these questions of use and trust.

If the Data Lake framework can ensure that every data product’s lineage and attribution are in place starting from the origin, critical traceability can be had. However, that is a challenging task because a data product in a Data Lake may go through different analytics systems such as Hadoop, Spark and Storm, which do not produce provenance information by default. Even if there are provenance collection techniques for those systems, they may use their own ways of storing provenance or use different standards. Therefore, generating integrated provenance traces across systems is difficult.

In this paper we propose a reference architecture for provenance in a Data Lake based on a central provenance subsystem that stores and processes provenance events pumped into it from all connected systems. The reference architecture, which appeared in early work as a poster [7], is deepened here. A prototype implementation of the architecture using our distributed provenance collection tools shows that the proposed technique can be introduced into a Data Lake to capture integrated provenance without introducing much overhead. The paper’s three main contributions are: first, identification of the data management and traceability problems in a Data Lake that are solvable using provenance, highlighting the challenges in capturing integrated provenance; second, a reference architecture to overcome those challenges; and third, an evaluation of the viability of the proposed architecture using a prototype implementation, with techniques that can be used to reduce the overhead of provenance capture.

Here the authors describe the current data landscape: a wide variety of data sources (media, devices, logs, etc.) producing structured, semi-structured, and unstructured data. The data warehouse lacks the flexibility to handle all of this, so the Data Lake is clearly an attractive choice. However, the Data Lake lacks management of the data life cycle and of data traceability, which leads to trust problems, and unconstrained ingestion may eventually turn it into a data swamp. Once data has been processed by multiple compute engines, tracing it becomes difficult, and there is no centralized tool for provenance tracking. To address these problems, the paper makes three main contributions:

(1) Identifying the data management and traceability problems and challenges in current Data Lakes

(2) Proposing a reference architecture to address those challenges

(3) Evaluating the proposed architecture.

DATA LAKE ARCHITECTURE

The general architecture of a Data Lake, shown in Figure 1, contains three main activities: (1) Data ingest, (2) Data processing or transformation and (3) Data analysis. A Data Lake may open up a number of ingest APIs to bring data from different sources into the lake. In most cases, raw data is ingested into the Data Lake for later use by researchers for multiple purposes at different points of time. Activity in the Data Lake can be viewed as data transformation, where data in the Data Lake is input to some task and the output is stored back to the Data Lake. Modern large-scale distributed Big Data processing frameworks like Hadoop, Spark and Storm are the source of such transformations, especially for Data Lakes implemented on HDFS. Mechanisms like scientific workflow systems such as Kepler [8] and legacy scripts may apply as well. As shown in Figure 1, a data product created as the output of one transformation can be an input to another transformation, which itself may produce yet another data product as a result. Finally, when all processing steps are done, the resulting data products are used for different kinds of analysis reports and predictions.

The general Data Lake architecture mainly comprises data ingest, processing, and output, and all of these operations can be carried out on the Data Lake by today's compute engines.


Role of Provenance

The Data Lake achieves increased flexibility at the cost of reduced manageability. In the research data environment, when differently structured data is ingested by different organizations through one of multiple APIs, tracking becomes an issue. Chained transformations that continuously derive new data from existing data in the lake further complicate the management. How can minimal management be added to the lake without invalidating the attractive benefit of ease of ingest? We posit that this minimal management takes the form of mechanisms to track the origins of data products, the rights of use and suitability of transformations applied to them, and the quality of data generated by the transformations. Carefully captured provenance can satisfy these needs, allowing, for instance, answers to the following two questions: 1.) Suppose sensitive data are deposited into a Data Lake; social science survey data for instance. Can data provenance prevent improper leakage into derived data? 2.) Repeating a Big Data transformation in a Data Lake is expensive due to high resource and time consumption. Can live streaming provenance from experiments identify problems early in their execution?

The Data Lake increases flexibility but does not account for the cost of managing and maintaining the data. The questions raised here: would a provenance architecture hinder data ingestion into the lake, and can provenance identify problems early enough to avoid repeating expensive computations?

Challenges in Provenance Capture

Data in a Data Lake may go through a number of transformations performed using different frameworks selected according to the type of data and application. For example, in an HDFS-based Data Lake, it is common to use Storm or Spark Streaming for streaming data and Hadoop MapReduce or Spark for batch data. Other legacy systems and scripts may be included as well. To achieve traceability across transformations, provenance captured from these systems must be integrated, a challenge since many do not support provenance by default. Techniques exist to collect provenance from Big Data processing frameworks like Hadoop and Spark [9], [10], [11], [12], but most are coupled to a particular framework. If the provenance collection within a Data Lake depends on such system-specific methods, provenance from all subsystems must be stitched together to create a deeper provenance trace. There are stitching techniques [13], [14] which bring all provenance traces into a common model and then integrate them together. However, the process of converting provenance traces from different standards into a common model may lose provenance information, depending on the data model followed by each standard. As a Data Lake deals with Big Data, most transformations generate large provenance graphs. Converting such large provenance graphs into a common model and stitching them together can introduce considerable compute overheads as well.

Challenges: although individual compute engines do have provenance-capture methods, these are tightly coupled to the specific framework. Building graphs from the large volumes of provenance information generated is also computationally expensive.

Provenance Integration Across Systems

To address provenance integration, we propose a central provenance collection system to which all components within the Data Lake stream provenance events. Well-accepted provenance standards like W3C PROV [6] and OPM [5] represent provenance as a directed acyclic graph G = (V, E). A node v ∈ V can be an activity, entity or agent, while an edge e = ⟨vi, vj⟩, where e ∈ E and vi, vj ∈ V, represents a relationship between two nodes. In our provenance collection model, a provenance event always represents an edge in the provenance graph. For example, if process p generates the data product d, the provenance event adds a new edge e = ⟨p, d⟩, where p, d ∈ V, into the provenance graph to represent the ‘generation’ relationship between activity p and entity d. In addition to capturing usage and generation, additional details like configuration parameters and environment information (e.g., CPU speed, memory capacity, network bandwidth) can be stored as attributes connected to the transformation. Inside each transformation, there can be a number of intermediate tasks which may themselves generate intermediate data products. A MapReduce job, for instance, has multiple map and reduce tasks. Capturing provenance from such internal tasks at a high enough level to be useful helps in debugging and reproducing transformations. When the output data from one analysis tool is used as the input to another, integration of provenance collected from both transformations can be guaranteed only by a consistent lake-unique persistent ID policy [6]. This may require a global policy enforced for all organizations contributing to a Data Lake. This unique ID notion could be based on file URLs and randomly generated data identifiers which are appended to data records when producing outputs, so that the following transformations can use the same identifiers. It could also be achieved using globally persistent IDs such as the Handle system or DOIs.

As a simple example, consider Figure 2a. The data product d1 is subject to transformation T1, which generates d2 and d3 as results. T2 uses d3 together with a new data product d4 and generates d5, d6 and d7. Finally, T3 uses d6 and d7 and generates d8 as the final output. When all three transformations T1, T2 and T3 have sent their provenance events, a complete provenance graph exists in the central provenance collection system. Figure 2b shows the high-level data lineage graph representing the data flow starting from d8.

To address these challenges, the authors propose a central system to which all components send their provenance events. These events can carry attributes such as CPU, memory, and network information, and each event contributes a node-to-node edge, so that together they build up a complete provenance graph (as shown in the figure above).
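To make the edge-per-event model concrete, here is a minimal, self-contained Java sketch (not part of the paper's implementation) that replays the Figure 2a example as a list of provenance events and then walks the backward lineage of d8, mirroring Figure 2b. The ProvEvent type and the node names are illustrative; edge directions follow the W3C PROV convention (used: activity → entity, wasGeneratedBy: entity → activity), so following edges from d8 moves backward in time.

```java
import java.util.*;

// Minimal sketch of the edge-per-event provenance model; not Komadu's schema.
public class ProvenanceGraphSketch {

    // Each provenance event contributes one directed edge (source -> target).
    record ProvEvent(String source, String relation, String target) {}

    public static void main(String[] args) {
        List<ProvEvent> events = List.of(
            // T1 uses d1 and generates d2, d3
            new ProvEvent("T1", "used", "d1"),
            new ProvEvent("d2", "wasGeneratedBy", "T1"),
            new ProvEvent("d3", "wasGeneratedBy", "T1"),
            // T2 uses d3 and d4 and generates d5, d6, d7
            new ProvEvent("T2", "used", "d3"),
            new ProvEvent("T2", "used", "d4"),
            new ProvEvent("d5", "wasGeneratedBy", "T2"),
            new ProvEvent("d6", "wasGeneratedBy", "T2"),
            new ProvEvent("d7", "wasGeneratedBy", "T2"),
            // T3 uses d6 and d7 and generates d8
            new ProvEvent("T3", "used", "d6"),
            new ProvEvent("T3", "used", "d7"),
            new ProvEvent("d8", "wasGeneratedBy", "T3"));

        // Adjacency list of the resulting directed acyclic graph.
        Map<String, List<String>> edges = new HashMap<>();
        for (ProvEvent e : events) {
            edges.computeIfAbsent(e.source(), k -> new ArrayList<>()).add(e.target());
        }

        // Backward lineage of d8: both 'used' and 'wasGeneratedBy' edges point
        // backward in time, so a plain reachability walk from d8 recovers the
        // data flow of Figure 2b (d8 <- T3 <- d6, d7 <- T2 <- d3, d4 <- T1 <- d1).
        Deque<String> stack = new ArrayDeque<>(List.of("d8"));
        Set<String> lineage = new LinkedHashSet<>();
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (lineage.add(node)) {
                edges.getOrDefault(node, List.of()).forEach(stack::push);
            }
        }
        System.out.println("Backward lineage of d8: " + lineage);
    }
}
```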

Reference Architecture

The reference architecture, shown in Figure 3, uses a central provenance collection subsystem. Provenance events captured from components in the Data Lake are streamed into the provenance subsystem where they are processed, stored and analysed. The Provenance Stream Processing and Storage component at the heart of this architecture accepts the stream of provenance notifications (Ingest API) and supports queries (Query API). A live stream processing subsystem supports live queries, while the storage subsystem persists provenance for long-term usage. When long-running experiments in the Data Lake produce large volumes of provenance data, stream processing techniques become extremely useful, as storing full provenance is not feasible. The Messaging System guarantees reliable provenance event delivery into the central provenance processing layer. The Usage subsystem shows how provenance collected around the Data Lake can be used for different purposes. Both live and post-execution queries over collected provenance, together with Monitoring and Visualization, help in scenarios like the two use cases discussed above. There are other advantages as well, such as Debugging and Reproducing experiments in the Data Lake. In order to capture information about the origins of data, provenance must be captured at ingest. Some data products may carry their previous provenance information, which should be integrated as well. Researchers may export data products from the Data Lake in some situations; such data products should be coupled with their provenance for better usage.
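As a rough illustration of the two surfaces named here, the sketch below writes the Ingest API and Query API down as Java interfaces. These signatures are assumptions made for this article, not APIs defined by the paper; the point is only that the provenance subsystem accepts a stream of edge-level events from the messaging layer and answers both stored and live queries.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical shape of the Ingest API: the messaging system delivers one
// call per provenance event, i.e., per edge of the provenance graph.
interface ProvenanceIngestApi {
    void ingest(String sourceNodeId, String relation, String targetNodeId);
}

// Hypothetical shape of the Query API: post-execution queries served from the
// storage subsystem, plus live subscriptions served by the stream processor
// (e.g., to spot problems early in a long-running experiment).
interface ProvenanceQueryApi {
    List<String> backwardLineage(String entityId);
    List<String> forwardLineage(String entityId);
    void subscribe(String entityIdPattern, Consumer<String> onLiveEvent);
}
```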

PROTOTYPE IMPLEMENTATION

We set up a prototype Data Lake and implemented a use case on top of it to evaluate the feasibility of our reference architecture. We used our provenance collection tools to capture, store, query and visualize provenance in our Data Lake. The reference architecture introduces both stored provenance processing and real-time provenance processing for Data Lakes. In this prototype, we implement stored provenance processing; real-time provenance processing is future work. The central provenance subsystem uses our Komadu [15] provenance collection framework.

Such a system has been implemented with Komadu (shown in the figure below).

Komadu

Komadu is a W3C PROV based provenance collection framework which accepts provenance from distributed components through RabbitMQ4 messaging and web services channels. It does not depend on any global knowledge about the system in a distributed setting. This makes it a good match for a Data Lake environment where different systems are used to perform different data transformations. The Komadu API can be used to capture provenance events from individual components of the Data Lake. Each ingest operation adds a new relationship (R) between two nodes (a node can be an activity (A), entity (E) or agent (G)) of the provenance graph being generated. For example, when an activity A generates an entity E, the addActivityEntityRelationship(A, E, R) operation can be used to add a wasGeneratedBy relationship between A and E. Using the query operations, full provenance graphs including all connected edges can be generated for Entities, Activities and Agents by passing the relevant identifier. Backward-only and forward-only provenance graphs can be generated for Entities. In addition, the Komadu API includes operations to access the attributes of all types of nodes. The Komadu Cytoscape5 plugin can be used to visualize and navigate through provenance graphs.

Komadu receives provenance events from Data Lake components over RabbitMQ and web service channels and builds graphs of entities, activities, and agents.
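A hedged sketch of how a Data Lake component might push one provenance notification toward Komadu through RabbitMQ, corresponding conceptually to the addActivityEntityRelationship(A, E, R) operation mentioned above. The RabbitMQ Java client calls are standard; the host, queue name and JSON payload are placeholders invented for illustration, since Komadu defines its own notification schema that is not reproduced here.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.nio.charset.StandardCharsets;

public class KomaduIngestSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("provenance-host");              // hypothetical host

        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            // Assumed queue name; the RabbitMQ instance sits in front of Komadu.
            channel.queueDeclare("komadu-ingest", true, false, false, null);

            // Placeholder payload, conceptually addActivityEntityRelationship(A, E, R):
            // activity "hashtag-count-job" generated entity "hdfs://lake/out/part-00000".
            String notification =
                "{\"operation\":\"addActivityEntityRelationship\"," +
                "\"activity\":\"hashtag-count-job\"," +
                "\"entity\":\"hdfs://lake/out/part-00000\"," +
                "\"relationship\":\"wasGeneratedBy\"}";

            channel.basicPublish("", "komadu-ingest", null,
                    notification.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```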

Data Lake Use Case

The Data Lake prototype was implemented using an HDFS cluster and the transformations were performed using Hadoop and Spark. Analysing data from social media to identify trends is commonly seen in Data Lakes. As shown in Figure 4, we implemented a chain of transformations based on Twitter data to first count hash tags and then to get aggregated counts based on categories. Apache Flume6 was used to collect Twitter data through the Twitter public streaming API and store it in HDFS. For each tweet, Flume captures the Twitter handle of the author, the time, the language and the full message, and writes a record into an HDFS file. After collecting Twitter data over a period of five days, a Hadoop job was used to count hash tags in the full Twitter dataset. A new HDFS file with hash tag counts is generated as the result of the first Hadoop job, which is then used by a separate Spark job to get aggregated counts according to categories (sports, movies, politics, etc.). We used a fixed set of categories for this prototype implementation to keep it simple. In real Data Lakes, these transformations can be performed by different scientists at different times, who may use frameworks based on their preference and expertise. That is why we used two different frameworks for the transformations in our prototype: to show how provenance can be integrated across systems.

Komadu and its toolkit were used to build the provenance subsystem (shown in Figure 3) in our prototype. Komadu supports the RabbitMQ messaging system and includes tools to fetch provenance notifications from RabbitMQ queues. A RabbitMQ instance was deployed in front of our Komadu instance so that all provenance notifications generated by Flume, Hadoop and Spark go through a message queue in RabbitMQ. Ingested provenance events are asynchronously processed by Komadu and stored in relational tables. Stored provenance remains a collection of edges until a graph generation request comes in. This delayed graph generation leads to efficient provenance ingest with minimal back pressure, which helps in a Data Lake environment where high volumes of provenance are generated. To assign consistent identifiers for data items in our Data Lake, we followed the practice of appending identifiers to data records when output data is written to the Data Lake. Subsequent transformations use the same identifiers for provenance collection.

Provenance events were captured in our prototype by instrumenting the application code that we implemented for each transformation. The tweet capturing code in Flume was instrumented to capture provenance at data ingest into the Data Lake. The Map and Reduce functions in the Hadoop job and the MapToPair and ReduceByKey functions in the Spark job were instrumented to capture provenance from transformations. We implemented a client library with a simple API (like the Log4J API for logging) which can be used to easily instrument Java applications for provenance capture. It minimizes the provenance capture overhead by using a dedicated thread pool to asynchronously send provenance events to the provenance subsystem. In addition, the client library uses an event batching mechanism to minimize network overhead by reducing the number of messages sent to the provenance subsystem over the network.

Use case on the Data Lake: as shown in the figure, data is collected from Twitter by a component such as Flume, processed by the Hadoop and Spark compute engines, and the resulting provenance is sent to Komadu for processing. Komadu exposes APIs that the instrumented code in these engines can call.
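The kind of application-level instrumentation described above might look roughly like the following Hadoop Mapper, which appends a data identifier to each output record (the "custom value" discussed later in the evaluation) and reports used/generated events through an asynchronous client. The Hadoop Mapper API is real; ProvClient, the record layout and the method names are hypothetical stand-ins for the paper's Log4J-like client library, whose actual API is not given in this excerpt.

```java
import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HashtagMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final ProvClient prov = ProvClient.get("hashtag-count-job");

    @Override
    protected void map(LongWritable key, Text tweetRecord, Context context)
            throws IOException, InterruptedException {
        // Assumption: the ingest step (Flume) appended an identifier after '|'.
        String inputId = extractIdentifier(tweetRecord.toString());
        for (String token : tweetRecord.toString().split("\\s+")) {
            if (token.startsWith("#")) {
                String outputId = UUID.randomUUID().toString();
                // Carry the identifier in the value so the downstream Spark job
                // can reuse it when reporting its own provenance events.
                context.write(new Text(token.toLowerCase()), new Text("1|" + outputId));
                prov.used("map", inputId);          // hypothetical call
                prov.generated("map", outputId);    // hypothetical call
            }
        }
    }

    private static String extractIdentifier(String record) {
        int sep = record.lastIndexOf('|');
        return sep >= 0 ? record.substring(sep + 1) : "unknown";
    }

    /** Placeholder for the client library: the real one batches events and
     *  ships them asynchronously to the provenance subsystem via RabbitMQ. */
    static final class ProvClient {
        static ProvClient get(String activity) { return new ProvClient(); }
        void used(String task, String dataId)      { /* enqueue 'used' edge */ }
        void generated(String task, String dataId) { /* enqueue 'generated' edge */ }
    }
}
```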

Provenance Queries and Visualization

After executing the provenance-enabled Hadoop and Spark jobs on the collected Twitter data, the Komadu query API was used to generate provenance graphs. Komadu generates PROV XML provenance graphs and comes with a Cytoscape plugin which can be used to visualize and explore them. Fine-grained provenance includes the input and output datasets for each transformation, intermediate function executions and all intermediate data products generated during execution. Provenance from Flume, Hadoop and Spark has been integrated through the usage of unique data identifiers. Figure 5 shows forward and backward provenance graphs generated for a very small subset of tweets. Forward provenance is useful to derive details about the usages of a particular data item. Figure 5a shows a forward provenance graph for a single tweet: it shows the hash tags generated from that particular tweet in the Hadoop outputs and the categories to which those hash tags contributed in the Spark outputs. A backward provenance graph starting from a category under the Spark outputs is shown in Figure 5b. This graph can be used to find all tweets which contributed to that category. For example, if a scientist wanted to get an age distribution of the authors who tweeted about sports, this could be done by finding the set of Twitter handles of the authors through backward provenance.

After the data has been processed by Spark and Hadoop, data lineage graphs can be generated and queried through the query API.

Performance Evaluation

To build our prototype, we used five small VM instances, each with 2 CPU cores at 2.5 GHz, 4 GB of RAM and 50 GB of local storage. Four instances were used for the HDFS cluster, including one master node and three slave nodes. A total of 3.23 GB of Twitter data was collected over a period of five days by running Flume on the master node. Hadoop and Spark clusters were set up on top of our four-node HDFS cluster. One separate instance was allocated to the provenance subsystem, running the RabbitMQ and Komadu tools. In order to minimize the provenance capture overhead, we used a dedicated thread pool and a provenance event batching mechanism in our client library. When the batch size is set to a relatively large number (>500), execution time becomes almost independent of the thread pool size, as the number of messages sent over the network decreases. Therefore, we set the client thread pool size to 5 in each of our experiments. Figure 6a shows how the provenance-enabled Hadoop execution time for a particular job varies when the batch size is increased from 100 to 30000 (provenance events). Based on this result, we set the batch size to 5000 in each of our experiments. We used the JSON format to encode provenance events and the average event size is around 120 bytes. The average size of a batched message sent over the network is around 600 KB (5000 x 120 bytes).
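The batching behaviour described here can be sketched as follows: events are buffered and flushed in batches by a small dedicated thread pool, so the transformation thread never blocks on the network. The class is illustrative rather than the paper's actual client library; the batch size of 5000 and the pool of 5 threads mirror the settings reported in the experiments, and send() is a stub standing in for the RabbitMQ publish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchingProvSender implements AutoCloseable {
    private static final int BATCH_SIZE = 5000;   // ~5000 x 120 B, i.e. ~600 KB per message
    private final ExecutorService pool = Executors.newFixedThreadPool(5);
    private final List<String> buffer = new ArrayList<>(BATCH_SIZE);

    /** Called from the instrumented transformation code; never blocks on I/O. */
    public synchronized void submit(String jsonEvent) {
        buffer.add(jsonEvent);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private void flush() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        pool.execute(() -> send(batch));          // asynchronous flush
    }

    private void send(List<String> batch) {
        // Placeholder: the prototype publishes the batch to RabbitMQ here.
        System.out.println("flushed batch of " + batch.size() + " provenance events");
    }

    @Override
    public synchronized void close() {
        if (!buffer.isEmpty()) {
            flush();                              // ship the final partial batch
        }
        pool.shutdown();
    }
}
```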

Figure 6b shows the execution times of the Hadoop job for different scenarios. The ‘original’ column represents the Hadoop execution time without capturing any provenance. In order to relate Map and Reduce provenance, we had to use a customized value field (in the key-value pair) which contains data identifiers, as in Ramp [10]. As shown by the ‘custom val’ column in the chart, usage of the customized value introduces an overhead of 19.28%, and that is included in all other cases. Execution overhead depends on the granularity of provenance as well. The ‘data prov komadu’ and ‘full prov komadu’ columns show the execution times of Hadoop when our technique is used to capture provenance. The data provenance case (data relationships only) adds a 36.47% overhead, while the full provenance case (data and process relationships) adds a 56.93% overhead. Table I shows a breakdown of the provenance sizes generated for each case in Hadoop for the input size of 3.23 GB. The size of the provenance doubles in the full provenance case compared to data provenance, which leads to greater capture overheads. As it is common practice [10] to write provenance into HDFS in Hadoop jobs, we modified the same Hadoop job to store provenance events in HDFS as well and compared the overhead with our method. As shown by the ‘data prov HDFS’ and ‘full prov HDFS’ columns in Figure 6b, this adds larger overheads compared to our techniques. Better performance has been achieved by modifying or extending Hadoop [11], but our techniques operate entirely at the application level without modifying existing frameworks.

Figure 6c shows the execution times for the Spark job for different scenarios. As in Hadoop, we used a customized output value to include data identifiers in Spark as well. That adds an overhead of 7.5% compared to the original execution time, as shown by the ‘custom val’ column. The data provenance and full provenance cases using Komadu add overheads of 76.1% and 108.35%, respectively. The overhead percentages added by provenance capture in Spark are larger compared to Hadoop because Spark runs faster than Hadoop while our techniques introduce the same level of overhead in both cases.

Here the authors present experiments and measurements of the system's performance; readers not interested in performance details can skip ahead.

RELATED WORK

Apache Falcon7 manages the data lifecycle in the Hadoop Big Data stack. Falcon supports creating lineage-enabled data processing pipelines by connecting Hadoop-based processing systems. Apache Nifi8 is another data flow tool which captures lineage while moving data among systems. Neither tool captures detailed provenance within transformation steps (as in Figure 5). A few recent studies target provenance in individual Big Data processing frameworks like Hadoop and Spark. Wang J. et al. [9], [16] present a way of capturing provenance in MapReduce workflows by integrating Hadoop into Kepler. Ramp [10] and HadoopProv [11] are two attempts to capture provenance by extending Hadoop. Provenance in Apache Spark [12] and provenance in streaming data [17] have also been studied. Capturing provenance in traditional scientific workflows [18], [19] is another area which has been studied in depth. None of these studies focus on integrating provenance from different frameworks in a shared environment. While any of these Big Data processing frameworks can be connected to a Data Lake, as we argued above, a Data Lake cannot depend on such framework-specific provenance collection mechanisms due to provenance integration challenges; therefore, provenance stitching [13] techniques are hard to apply. Wang, J. et al. [20] identify the challenges in Big Data provenance, which are mostly applicable in a Data Lake environment. Distributed Big Data provenance integration is identified as a challenge in their work; we present a solution in the context of a Data Lake.

Apache Falcon7 manages the data lifecycle and lineage in the Hadoop Big Data stack, and Apache Nifi8 captures lineage as data moves between systems using frameworks such as Spark and Hadoop. However, neither integrates provenance across systems the way the proposed approach does; as analysed above, such tools are so tightly coupled to specific frameworks that they are hard to adopt and apply broadly.

CONCLUSION AND FUTURE WORK

The reference architecture for integrated provenance demonstrates the early value of data provenance as a lightweight approach to traceability. Future work addresses the viability of the approach in obtaining the necessary information without excessive instrumentation or manual intervention. The scalability of the technique is to be further assessed within a real Data Lake environment. Persistent ID solutions have tradeoffs; the suitability of one over another in the Data Lake setting is an open question. We have implemented only the stored provenance processing techniques in the presented prototype; real-time provenance stream processing remains future work.

Future directions: reducing manual intervention and the cost of provenance capture, and improving scalability.
