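In this report, Gartner cites an article by its senior analyst Nick Heudecker titled "Best Practices for Designing Your Data Lake." The report points out that a successful data lake architecture requires data managers to clearly separate the systems for data acquisition, discovery, optimization, governance, and consumption. It then analyzes, one by one, how to design each of these modules better so as to get the most out of the data.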








A Big Data Platform Based on Data Lake Architecture



In the era of big data, data collision (cross-correlating data from different sources) inspires greater value than traditional data analysis. Its premise is the convergence, sharing, and opening of data among different business lines within enterprises and across organizations and industries, which poses a challenge for applying big data technologies. Bingo Data Lake, by means of a deep integration of cloud computing and big data technology, provides a distributed big data application platform for multiple organizational users and helps them build a sustainable data ecosystem. Users can exchange and collide data on the platform, which further unlocks the value of data, promotes innovation in data applications, and improves the organizations' data application capability.



With the big data technology curve now reaching an inflection point and entering the stage of practical application, the maturity and stability of cloud computing technologies help keep big data technologies on track. Big data technologies span a wide range, from the underlying software and hardware infrastructure to specific application layers, which makes for a richer technical ecosystem and may bring brand-new or even disruptive insights. There is therefore huge room for exploring the value of big data technologies and their applications, where uncertainty coexists with greater possibility, bringing both exciting challenges and opportunities.

On the technical side, the big data technology ecosystem is booming. Hadoop, Spark, MPP, NoSQL, Kafka, machine learning, and deep learning keep evolving, and each solves a different problem. An enterprise big data platform is therefore bound to be a hybrid architecture, and integrating heterogeneous technologies effectively has become a pressing issue for enterprises building such platforms.


On the data side, the demand for data fusion across departments, enterprises, and industries has become increasingly evident, and the association and collision of data are the basis on which data innovation is ignited. Another critical issue facing enterprises is therefore how to break down data silos, resolve questions of data sovereignty, and achieve unified data convergence and sharing.


To address these technical and data integration problems, fast-moving companies such as Amazon and Microsoft followed the market trend and introduced public-cloud data lake solutions in 2016. At the same time, driven by practical needs for internal data integration and external data exchange, many enterprises and organizations have begun to study public cloud data lake solutions and learn from them when planning their own private data lake platforms.


Bingo has long focused on the enterprise market and has seen first-hand the challenges big data technologies face when they are put into practice in enterprises. After two years of development, Bingo introduced its private-cloud data lake solution in early 2017, aiming to help enterprises and organizations build their own private big data platforms and make big data applications and value innovation possible at the organizational level.

Bingo Data Lake Solutions


Relying on BingoCloudOS, Bingo Data Lake helps enterprises build their data lakes on S3 object storage, providing a common data foundation for exchanging data resources and innovating on data applications across departments, branches, organizations, and industries. Specifically, Bingo Data Lake offers one-stop services covering the storage, integration, processing, management, and consumption of data, serving the whole data life cycle.


The Bingo Data Lake solution comprises five parts: data lake storage, data integration, data processing, data management, and data consumption. Meanwhile, four areas need attention to keep data lake efforts on track: data acquisition, insight discovery and development, data governance, and analytics consumption. In this respect, Bingo's data lake solution follows the same thinking as Gartner.

Data Lake Storage


Data lake storage is built on BingoCloudOS object storage technology and can store all types of data (structured data, text, images, audio and video files, and so on). Its characteristics include:


  • High availability: 99.999999999% availability; supports large-scale node deployment, with up to 1,024 servers per cluster and 16,000 servers per cloud, to sustain massive data storage, convergence, and sharing

  • Good compatibility: compatible with the AWS S3 protocol; integrates seamlessly with mainstream big data computing technologies such as Hadoop, Spark, and Greenplum to quickly support data development and processing, with high security

  • Security: isolates and shares data among multiple tenants; each tenant's data is isolated in its own buckets and shared through access policy authorization; supports server-side encryption, with sensitive data encrypted automatically

  • High performance: supports large-file slicing and multi-node concurrent transfer to improve data transfer efficiency

  • Cross-data-center support: automatic replication and synchronization across data centers, without being confined to a single data center; global namespace management across data centers, making it possible to build a federated data lake
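The bucket-based isolation and policy-based sharing described above can be sketched with a standard AWS S3 bucket policy document. This is a minimal illustration only: it assumes the storage accepts AWS-style policy JSON, and the bucket name, prefix, and principal ARN are hypothetical placeholders rather than anything taken from BingoCloudOS.

```python
import json


def build_share_policy(bucket, reader_principals, prefix=""):
    """Build an AWS-style bucket policy that grants read-only access.

    The owning tenant keeps the bucket private by default; this policy
    opens objects under `prefix` to the listed principals only.
    """
    if not reader_principals:
        raise ValueError("at least one reader principal is required")

    principals = {"AWS": list(reader_principals)}
    read_objects = {
        "Sid": "TenantReadObjects",
        "Effect": "Allow",
        "Principal": principals,
        "Action": ["s3:GetObject"],
        "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
    }
    list_bucket = {
        "Sid": "TenantListBucket",
        "Effect": "Allow",
        "Principal": principals,
        "Action": ["s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}"],
    }
    if prefix:
        # Restrict listing to the shared prefix only.
        list_bucket["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}*"]}}

    return {"Version": "2012-10-17", "Statement": [read_objects, list_bucket]}


# Hypothetical example: the finance tenant shares "shared/" with one analyst.
policy = build_share_policy(
    "finance-dept",
    ["arn:aws:iam::123456789012:user/analyst"],
    prefix="shared/",
)
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON would be attached to the owning tenant's bucket; everything outside the shared prefix stays isolated.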


Data Integration


Data integration is the process of extracting, transforming, and loading data: data is automatically extracted from source systems, converted into consistent formats, and loaded into the data lake. Bingo Data Lake provides data integration tools that let data from heterogeneous sources flow into the data lake quickly and stay fresh.


  • Ease of use: no coding required; data can be published to the data lake through visual configuration



  • Heterogeneous data source support: connects seamlessly to relational databases of all kinds and to mainstream big data technologies such as Hadoop, NoSQL databases, and MPP, automatically pulling data into the data lake



  • Task scheduling: distributed scheduling of integration tasks, with scheduling cycles of minutes, hours, days, weeks, months, and more, improving the data lake's integration efficiency



  • Multiple control policies: job control policies such as job retry, job dependencies, and manual re-run are supported to safeguard the SLA of data integration jobs
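The retry and dependency policies listed above can be illustrated with a small in-process sketch. It is not Bingo's scheduler: the function name `run_jobs`, the status strings, and the sample jobs are all hypothetical, and a real distributed scheduler would add timing cycles, persistence, and manual re-run. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_jobs(jobs, dependencies, max_retries=2):
    """Run jobs in dependency order, retrying failures.

    jobs:         dict of job name -> zero-argument callable (raises on failure)
    dependencies: dict of job name -> set of job names that must succeed first
    Returns a dict of job name -> "ok", "failed", or "skipped".
    """
    graph = {name: set(dependencies.get(name, ())) for name in jobs}
    unknown = set().union(*graph.values()) - set(jobs)
    if unknown:
        raise ValueError(f"dependencies on unknown jobs: {sorted(unknown)}")

    status = {}
    # static_order() yields every job after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        if any(status[dep] != "ok" for dep in graph[name]):
            status[name] = "skipped"  # an upstream job did not succeed
            continue
        status[name] = "failed"
        for _attempt in range(1 + max_retries):
            try:
                jobs[name]()
            except Exception:
                continue  # retry until the attempts are used up
            status[name] = "ok"
            break
    return status


# Hypothetical jobs: extraction fails once, then succeeds on retry.
attempts = {"extract": 0}


def flaky_extract():
    attempts["extract"] += 1
    if attempts["extract"] < 2:
        raise RuntimeError("source temporarily unavailable")


def always_fails():
    raise RuntimeError("bad field mapping")


result = run_jobs(
    {
        "extract": flaky_extract,
        "transform": lambda: None,
        "load": lambda: None,
        "broken": always_fails,
        "report": lambda: None,
    },
    {"transform": {"extract"}, "load": {"transform"}, "report": {"broken"}},
)
```

In this run the extract job recovers on its second attempt, so the downstream jobs proceed, while the report job is skipped because the job it depends on never succeeds.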


Data Discovery and Development


Once data has been gathered into the data lake through data integration, Bingo offers a built-in Hadoop suite that helps users rapidly explore, analyze, and process it.


  • The built-in Hadoop suite runs on BingoCloudOS LXC (Linux containers), with performance close to that of physical machines, and provides cloud hosting of Hadoop clusters. On the one hand, operation and maintenance of the big data processing clusters can be handed over to the cloud platform; on the other hand, big data technologies can be deeply integrated with cloud computing technologies



  • Multiple tenants can share a unified Hadoop cluster; multiple departments and applications share computing resources through resource allocation and isolation, effectively raising resource utilization



  • Hadoop external tables can connect directly to data in the data lake, so that it can be associated and collided with local data in a single computation, and the results can be stored back into the data lake



  • Multiple compute engines are supported: besides Bingo's built-in Hadoop, other Hadoop distributions, CDH, and Greenplum can also connect to and use the data in the data lake
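As a rough illustration of how an external table can point at S3-compatible storage, the sketch below generates Hadoop S3A connector settings and a Hive `CREATE EXTERNAL TABLE` statement. The `fs.s3a.*` property names and the `s3a://` scheme belong to the open-source Hadoop S3A connector; the endpoint, credentials, bucket, table, and columns are hypothetical, and whether Bingo's built-in Hadoop suite is configured this way is an assumption.

```python
def s3a_properties(endpoint, access_key, secret_key):
    """Hadoop S3A settings for a private, S3-compatible endpoint."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Private S3-compatible stores commonly need path-style URLs.
        "fs.s3a.path.style.access": "true",
    }


def external_table_ddl(table, columns, bucket, prefix, file_format="PARQUET"):
    """Build a Hive DDL statement for an external table stored in the lake.

    columns: list of (column_name, hive_type) pairs
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    location = f"s3a://{bucket}/{prefix.strip('/')}/"
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical values for illustration only.
props = s3a_properties("http://datalake.example.internal", "ACCESS_KEY", "SECRET_KEY")
ddl = external_table_ddl(
    "lake_orders",
    [("order_id", "BIGINT"), ("customer_id", "STRING"), ("amount", "DOUBLE")],
    bucket="sales-lake",
    prefix="/orders/2017/",
)
```

Once such a table exists, joining it with a local table is ordinary SQL, and a second external table over another lake location can receive the results.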


Data Management


Without effective governance and optimization, a data lake is bound to turn into a data swamp. Data management is therefore a critical part of building a data lake. Through metadata management, a data catalog, data statistics and monitoring, and data quality management, Bingo keeps the data in its data lake readable, retrievable, manageable, and available.


  • Metadata for data in the data lake can be described and registered, including the data resource's name, business description, field information, and associated data resources, which keeps the data readable. Relevant metadata can also be captured automatically from the data's source, reducing metadata maintenance work



  • Data resources in the data lake can be catalogued by dimensions such as subject, organization, and special topic, which keeps the data retrievable



  • Data quality in the data lake can be monitored and analyzed along dimensions such as timeliness, completeness, consistency, and accuracy, with closed-loop management of quality monitoring, analysis, inspection, and reporting. In addition, data consumers can rate and comment on the quality of data resources, continuously improving the data quality of the data lake



  • Performance metrics can be monitored and analyzed across the whole process of data integration, storage, processing, and consumption. Each stage is monitored and analyzed in real time, which helps administrators grasp the overall running state of the data lake promptly and provides guidance for the operation and sustainable development of the data lake
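Two of the quality dimensions named above, completeness and timeliness, can be expressed as simple ratios. The sketch below is one possible definition, not Bingo's actual metric; the record layout, field names, and the 24-hour freshness window are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def completeness(records, required_fields):
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(rec.get(field) not in (None, "") for field in required_fields)
        for rec in records
    )
    return complete / len(records)


def timeliness(records, timestamp_field, now, max_age):
    """Fraction of records updated within `max_age` of `now`."""
    if not records:
        return 0.0
    fresh = sum(
        1
        for rec in records
        if rec.get(timestamp_field) is not None
        and now - rec[timestamp_field] <= max_age
    )
    return fresh / len(records)


# Hypothetical sample: one empty name, one stale record, one missing timestamp.
now = datetime(2017, 3, 1, 12, 0)
records = [
    {"id": 1, "name": "a", "updated_at": datetime(2017, 3, 1, 11, 0)},
    {"id": 2, "name": "", "updated_at": datetime(2017, 3, 1, 10, 0)},
    {"id": 3, "name": "c", "updated_at": datetime(2017, 2, 27, 12, 0)},
    {"id": 4, "name": "d", "updated_at": None},
]
completeness_score = completeness(records, ["id", "name"])
timeliness_score = timeliness(records, "updated_at", now, timedelta(hours=24))
```

Scores like these could feed the monitoring and reporting loop, with thresholds per data resource deciding when an inspection is triggered.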


Data Analysis and Consumption


Massive amounts of data are collected into the data lake and then developed and processed. The processed, usable data is stored back into the data lake, where it supports all kinds of big data analysis applications.


The Bingo Data Lake solution provides a big data analytics platform that lets users consume data and explore its potential and value through self-service analysis and data visualization. Built-in analysis components include dashboards, data source management, data reports, and location-based data processing and presentation. Third-party data analysis tools and tools developed by users are also supported.


  • Built-in self-service query tools let users set up data analysis directly through a graphical interface. By configuring query conditions such as the data model, filter conditions, and result fields, users obtain the corresponding analysis result reports



  • Diverse charts are provided for presenting data analysis, such as map tools, data tables, data mind maps, and data reports. Results are presented in a reasonable way that follows sound data visualization methods, greatly improving the readability of analytical conclusions

  • Collaboration and sharing are supported throughout the analysis process. On the way from source data to analysis results, different users can divide the work and collaborate, for example data administrators, analysts, and front-line business staff, so that all kinds of users can take part in data analysis and share analysis reports in a social manner


Application Scenarios


Based on the characteristics and innovations of the Bingo Data Lake solution, three scenarios well suited to data lake solutions are listed below.

Scenario 1: Data Sharing Across Organizational Boundaries


As big data develops further, enterprises and governments have successively built their own big data platforms, which helps improve enterprises' production efficiency and sales models and governments' governance. Data applications are no longer confined to an organization's own data: converged analysis after multiple parties share their data enables greater data innovation and improves the governance of enterprises and government organizations.


Problems with the Traditional Solutions


  • Difficulty of integrating heterogeneous technologies



The complicated and diverse data generated by organizations makes data convergence very difficult. Hadoop can handle data storage and processing for a single department, but it does not address data integration and sharing permissions across organizations. Different organizations also follow different big data technology routes, which makes technology integration very difficult.


  • Shortcomings of data sharing modes


Common modes of data sharing and opening across organizational boundaries include data query interfaces, FTP file exchange, and big data exchanges.


  • FTP file exchange suffers from weak security and poor exchange performance; data sovereignty is hard to define, and the data has to be copied

  • Big data exchanges lack a foundation for data convergence and can hardly support the association and collision of large volumes of data

  • Lack of support for an operations system



Big data platforms often pay more attention to technology than to operations and quality, which makes them hard to sustain. It is essential to build a comprehensive data operations system around the assessment, quality, and openness index of data, so as to safeguard the sustainable development of data sharing.


Solution


To address the problems and demands listed above, the Bingo Data Lake solution builds on data storage, integrates cloud computing with big data technology, and draws on its innovative capabilities in data integration, development, management, and consumption. It enables data sharing and opening across departments, organizations, and industries, helps organizations create a healthy and sustainable data ecosystem, and further unlocks the value of data through data association so as to promote data innovation.

Scenario 2: Promoting Production-Study-Research Cooperation Based on Data


Contradiction between Production Data and Research


Government agencies and large-scale enterprises possess massive production data but have weak technical reserves and algorithm models, while universities and research institutions are in the opposite situation.


Building a Bridge between Production and Research with a Data Lake


To resolve this contradiction, production data can be desensitized through the data lake, stored in it, and opened to research institutions and universities for research purposes. Research results can in turn be applied by enterprises, which effectively promotes Production-Study-Research Cooperation based on data.
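Desensitization can take many forms; the sketch below shows two common ones, keyed pseudonymization and partial masking, using only the Python standard library. The techniques, the record fields (`customer_id`, `phone`, `name`), and the key are illustrative assumptions, not a description of how Bingo Data Lake desensitizes data.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same input always maps to the same token, so researchers can still
    join records, but cannot recover the original value without the key.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_phone(phone, visible=4):
    """Hide all but the last `visible` characters of a phone number."""
    kept = phone[-visible:] if visible > 0 else ""
    return "*" * (len(phone) - len(kept)) + kept


def desensitize(record, secret_key):
    """Return a copy of a record that is safer to open for research use."""
    safe = dict(record)
    safe["customer_id"] = pseudonymize(record["customer_id"], secret_key)
    safe["phone"] = mask_phone(record["phone"])
    safe.pop("name", None)  # drop direct identifiers entirely
    return safe


# Hypothetical production record and key.
raw = {"customer_id": "C-1001", "name": "Zhang San", "phone": "13800001234", "amount": 259.0}
safe = desensitize(raw, secret_key=b"demo-key-kept-by-the-data-owner")
```

Because the data owner keeps the key, the tokens stay stable across datasets released to researchers while the original identifiers remain on the production side.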

Scenario 3: Federated Data Lake


Security and Trust Issues in Cross-organizational Data Collection


When building data lakes, we frequently encounter projects that span organizations, such as multiple enterprises or different government departments. If all data is managed in a single unified data lake, data collection becomes difficult, and issues of mutual trust, data sovereignty, and data security arise.


Data Ecology Based on Federated Data Lakes


To address this situation, the Bingo Data Lake solution offers decentralized federated data lakes. A platform based on federated data lakes enables data sharing across departments and organizations. Relevant catalogs, tools, services, and models can be opened up so that all organizations and software developers can collaborate, helping enterprises and governments establish a healthy and sustainable data ecosystem.






