Data lakes, hubs and warehouses — when to use what

Often in any “technical” field (and I use that term very loosely), it can be quite hard to separate the facts from the fiction — the latter normally created either by over-zealous product marketing or an ever-increasing circle of folklore.

One particular area that attracts a lot of attention is the whole big data arena. I won’t even go into the “how big is big data” question, as that’s another very subjective discussion.

What I want to discuss and briefly develop in this post is an objective view of the relative positioning of data lakes, enterprise data hubs (EDH) and data warehouses, including their associated terminology and technology for all those budding data scientists and data architects out there.

Data science lens

Before we start, though, it’s always a good idea to establish a clear point of reference against which to test any assertions. Within the big data world, the framework I have chosen is to look through the lens of data science — data science being the end-to-end methods and techniques for gaining as much knowledge or insight from the data as possible. In other words, if we are going to assess these three types of data storage, then their usage is paramount.

The framework I have used is the one proposed by David Donoho. In this model, there are six key categories of data science: data exploration and preparation; data representation and transformation; computing with data; data modeling; data visualization and presentation; and, finally, the science of data science.

Starting right at the beginning and fundamental to data science is how the data is going to be stored.

Let’s focus on the first two categories. From a data exploration and preparation perspective, it’s reported that at least 80% of the effort devoted to data science is spent understanding the basics of the data and making it ready for further exploration and use. From a data representation and transformation perspective, the challenge is managing a complex set of different formats and physical database types, while applying the relevant transformations to get the data into a more revealing form.
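To make that 80% figure concrete, here is a minimal preparation sketch in Python with pandas. The data set, column names and cleaning rules are invented purely for illustration; real preparation work is dictated by the source systems involved.

```python
import pandas as pd

# Hypothetical "raw" extract as it might arrive from a source system:
# inconsistent casing, exact duplicates, a missing key and stringly-typed fields.
raw = pd.DataFrame({
    "customer":   ["Acme", "acme ", "Bolt Ltd", None, "Bolt Ltd"],
    "order_date": ["2017-01-03", "2017-01-03", "2017-01-05", "2017-01-04", "2017-01-05"],
    "amount":     ["100.0", "100.0", "250", "75.5", "250"],
})

prepared = (
    raw.dropna(subset=["customer"])  # rows without a usable key are useless downstream
       .assign(
           customer=lambda d: d["customer"].str.strip().str.title(),  # harmonise names
           order_date=lambda d: pd.to_datetime(d["order_date"]),      # real dates, not strings
           amount=lambda d: d["amount"].astype(float),                # enforce a numeric type
       )
       .drop_duplicates()            # "Acme" and "acme " now collapse into one record
)

print(prepared)
```

Even in this toy case, most of the code is spent getting the data into shape rather than extracting any insight from it, which is precisely the 80% point.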

From the above, we come to our three “options.”

Data lake

The first option is to use a “data lake.” Definitions are consistent here: it’s a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured and unstructured data. The data structure and requirements are not defined until the data is needed. The Hadoop community has done much to popularise the concept, with the focus on moving from disparate silos to a single Hadoop/HDFS store. Furthermore, the data need not be harmonised, indexed, searchable or even easily usable, but at least you don’t have to connect to a live production system every time you want to access a record. Its other key feature is that it can be built on relatively inexpensive hardware.
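As a toy illustration of “store everything now, define the structure later,” the sketch below lands records in their native form and only imposes an interpretation at read time. The directory layout and field names are invented for the example; a real lake would sit on HDFS or object storage rather than a local folder.

```python
import json
from pathlib import Path

lake = Path("lake/raw/events")   # stand-in for an HDFS or object-store path
lake.mkdir(parents=True, exist_ok=True)

# Ingest: write each record exactly as it arrives; no schema is enforced.
arrivals = [
    {"user": "alice", "action": "login", "ts": "2017-05-01T09:00:00"},
    {"user": "bob", "clicked": "checkout"},   # a different shape is still accepted
    {"sensor": 7, "reading": 21.4},           # as is an entirely different source
]
for i, record in enumerate(arrivals):
    (lake / f"event-{i}.json").write_text(json.dumps(record))

# Read: structure is decided only now, by the consumer ("schema on read").
# This particular consumer cares only about user activity.
for path in sorted(lake.glob("*.json")):
    record = json.loads(path.read_text())
    if "user" in record:
        print(record["user"], record.get("action") or record.get("clicked"))
```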

Pentaho CTO James Dixon has generally been credited with coining the term “data lake.” He describes a data mart (a subset of a data warehouse) as akin to a bottle of water, “cleansed, packaged and structured for easy consumption,” while a data lake is more like a body of water in its natural state.

“In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format,” said Nick Heudecker, research director at Gartner. “The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”

However, while the marketing hype suggests that audiences throughout an enterprise will leverage data lakes, this positioning assumes that all those audiences are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata.

Hence, step forward some alternative, complementary solutions.

Data warehouse

Previously, the most common solution would be the data warehouse or enterprise data warehouse. This is a system used for reporting and data analysis, and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources.

A data warehouse differs from a data lake along the following key dimensions.

  • The data: A data warehouse holds a structured and processed data set. A data lake includes every source type, including unstructured and raw data.
  • The processing: A data warehouse uses schema on write; a data lake uses schema on read (a contrast sketched in code after this list).
  • The storage: Tends to be expensive for a data warehouse, whereas a data lake is designed for low-cost storage.
  • The agility: A data warehouse, by its very nature, is a fixed configuration and less agile. A data lake is highly agile and can be configured and reconfigured as required.
  • The security: A data warehouse has a mature security model; a data lake’s is still maturing.
  • The user perspective: A data warehouse is primarily designed for business professionals via the tools provided, whereas a data lake tends to be the focus for data scientists.
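To illustrate the schema-on-write side of the processing dimension above, here is a minimal sqlite3 sketch (the table and column names are invented): the structure is declared before any data is loaded, and incoming records are shaped to fit it at load time; this is exactly the upfront transformation cost that the schema-on-read approach in the earlier lake sketch defers until query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema on write: the structure is fixed before a single row is loaded.
conn.execute("""
    CREATE TABLE sales (
        customer   TEXT NOT NULL,
        order_date TEXT NOT NULL,   -- ISO-8601 date
        amount     REAL NOT NULL
    )
""")

# Incoming records must already have been transformed to match the declared
# shape (the "T" in ETL) before they can be inserted.
incoming = [("Acme", "2017-01-03", 100.0), ("Bolt Ltd", "2017-01-05", 250.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", incoming)

# Consumers then query a known, consistent structure.
for row in conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer"):
    print(row)
```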

So, given that a data warehouse and a data lake have opposing characteristics, step forward the data hub, or even the enterprise data hub (EDH).

Data hub

A data hub is a hub-and-spoke approach to data integration, in which data is physically moved and re-indexed into a new system. A data lake runs the same ingestion process but always keeps the source format. Data is ingested in as close to raw form as possible, without enforcing any restrictive schema. What makes a system a data hub (versus a data lake) is that it supports discovery, indexing and analytics. Data lakes do not index and cannot harmonise, because of the incompatible formats they hold. The prime objective of an EDH is to provide a centralised and unified data source for diverse business needs.
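As a toy sketch of what “discovery, indexing and analytics” adds over a plain lake, the snippet below keeps ingested records in near-raw form but also maintains a small searchable catalogue over them. The record shapes and index design are invented for illustration; a real EDH would use a proper search and metadata layer.

```python
import json
from collections import defaultdict
from pathlib import Path

hub = Path("hub/raw")
hub.mkdir(parents=True, exist_ok=True)

# Discovery index: maps each field name seen to the files that contain it.
index = defaultdict(set)

def ingest(name, record):
    """Store the record near-raw, but index its fields for later discovery."""
    path = hub / f"{name}.json"
    path.write_text(json.dumps(record))
    for field in record:
        index[field].add(path.name)

ingest("crm-001", {"customer": "Acme", "segment": "enterprise"})
ingest("web-042", {"customer": "Acme", "page": "/pricing"})
ingest("iot-007", {"sensor": 7, "reading": 21.4})

# Discovery: unlike a plain lake, the hub can answer "what do you hold about X?"
print(sorted(index["customer"]))   # -> ['crm-001.json', 'web-042.json']
```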

Not surprisingly, the major vendors have latched on to this concept. Cloudera, for example, has published material positioning its platform in exactly these terms. A simplistic summary of its offering: through Cloudera’s relationship with EMC, a large existing deployment of, for example, Isilon data lakes can be turned into a data hub architecture via Cloudera.

In conclusion, there is no ubiquitous solution here (sorry). Data needs to be stored from a multitude of sources and used by a very wide range of users who vary in technical competence, from business people who need report-driven analytics to data scientists using the latest deep learning algorithms. How the data is stored becomes a consequence of the use case: the simpler the use case, the more processing and structure the storage needs up front; conversely, the more science that will be applied, the closer the data should stay to its raw state. An enterprise is likely to see all of these use cases, and therefore it is more about the complementary usage of these techniques than about seeing them as divergent.

This article was originally posted on Neil’s blog.
