Often in any “technical” field (and I use that term very loosely), it can be quite hard to differentiate between the facts and the fiction, the latter normally created either by over-zealous product marketing or an ever-increasing circle of folklore.
One particular area that gets a lot of attention is the whole big data arena. I won’t even go into the “how big is big data” question, as that’s another very subjective discussion stream.
What I want to discuss and briefly develop in this post is an objective view of the relative positioning of data lakes, enterprise data hubs (EDH) and data warehouses, including their associated terminology and technology for all those budding data scientists and data architects out there.
Data science lens
Before we start, though, it’s always a good idea to establish a clear point of reference on which to base the assertions. Within the big data world, the framework I have chosen is to look through the lens of data science, meaning the end-to-end methods and techniques for gaining as much knowledge or insight from the data as possible. In other words, if we are going to assess these three types of data storage, then their usage is paramount.
The framework I have used is the one written by Donoho. In this model, there are six key categories of data science: data exploration and preparation, data representation and transformation, computing with data, data modeling, data visualization and presentation and, finally, the science of data science.
Starting right at the beginning and fundamental to data science is how the data is going to be stored.
Let’s focus on the first two categories. From a data exploration and preparation perspective, it’s reported that at least 80% of the effort devoted to data science is spent understanding the basics of the data and making it ready for further exploration and use. From a data representation and transformation perspective, the challenge is managing a complex set of different formats and physical database types while managing the relevant transformations to turn the data into a more revealing form.
From the above, we come to our three “options.”
Data lake
The first option is to use a “data lake.” Definitions are consistent here in that it’s a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured and unstructured data. The data structure and requirements are not defined until the data is needed. The Hadoop community has done a lot to popularise it, with the focus on moving from disparate silos to a single Hadoop/HDFS repository. Furthermore, the data need not be harmonized, indexed, searchable or even easily usable, but at least you don’t have to connect to a live production system every time you want to access a record. Its other key feature is that it can be built on relatively inexpensive hardware.
Pentaho CTO James Dixon has generally been credited with coining the term “data lake.” He describes a data mart (a subset of a data warehouse) as akin to a bottle of water, “cleansed, packaged and structured for easy consumption,” while a data lake is more like a body of water in its natural state.
“In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format,” said Nick Heudecker, research director at Gartner. “The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”
However, while the marketing hype suggests that audiences throughout an enterprise will leverage data lakes, this positioning assumes that all those audiences are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata.
Hence the need for some alternative, complementary solutions.
Data warehouse
Historically, the most common solution has been the data warehouse, or enterprise data warehouse. This is a system used for reporting and data analysis, and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources.
The characteristics of a data warehouse differ from those of a data lake in the following key dimensions.
- The data: A data warehouse will have a structured and processed data set. A data lake will include every source type, including unstructured and raw.
- The processing: A data warehouse will use schema on write and a data lake will use schema on read.
- The storage: Tends to be expensive for a data warehouse, whereas a data lake is designed for low-cost storage.
- Agility: A data warehouse by its very nature will be a fixed configuration and less agile. A data lake is highly agile and will be configured and reconfigured as required.
- Security: A data warehouse has a mature model and a data lake is “maturing.”
- User perspective: A data warehouse is primarily designed for business professionals via the tools provided, whereas a data lake tends to be the focus for data scientists.
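To make the schema-on-write versus schema-on-read distinction from the list above a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular warehouse or lake product works: the sample events, the two-field schema and the function names (validate, load_warehouse, load_lake, read_lake) are all hypothetical.

```python
import json

# Hypothetical raw events as they might arrive from source systems.
RAW_EVENTS = [
    '{"customer": "a1", "amount": "19.99", "channel": "web"}',
    '{"customer": "b2", "amount": "5.00"}',
    '{"customer": "c3", "amount": "not-a-number", "note": "free text"}',
]


def validate(record):
    """Coerce a record to the fixed schema (customer: str, amount: float) or raise."""
    return {"customer": str(record["customer"]), "amount": float(record["amount"])}


def load_warehouse(raw_events):
    """Schema on write: validate at load time, so rejects never reach the table."""
    table, rejected = [], []
    for line in raw_events:
        try:
            table.append(validate(json.loads(line)))
        except (KeyError, ValueError):
            rejected.append(line)
    return table, rejected


def load_lake(raw_events):
    """Schema on read, load step: keep every event exactly as it arrived."""
    return list(raw_events)


def read_lake(lake, fields):
    """Schema on read, query step: the reader decides which fields matter."""
    for line in lake:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


warehouse, rejected = load_warehouse(RAW_EVENTS)
lake = load_lake(RAW_EVENTS)
notes = list(read_lake(lake, ["customer", "note"]))
```

In this toy version the warehouse ends up with clean, typed rows but has dropped the malformed event and any fields outside its schema, while the lake keeps everything and leaves the interpretation work to whoever reads it.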
So if a data warehouse and a data lake have opposing characteristics, step forward the data hub, or even the enterprise data hub (EDH).
Data hub
A data hub is a hub-and-spoke approach to data integration, where data is physically moved and re-indexed into a new system. A data lake will run the same process but will always keep the source format. Data is ingested in as close to the raw form as possible without enforcing any restrictive schema. To be a data hub (vs. a data lake), the system must also support discovery, indexing and analytics. Data lakes do not index and cannot harmonise because of the incompatible formats they hold. The prime objective of an EDH is to provide a centralised and unified data source for diverse business needs.
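The difference between simply holding raw records and also supporting discovery and indexing can be sketched in a few lines of Python. This is a toy, in-memory illustration only; the DataLake and DataHub classes, their methods and the sample records are hypothetical and do not reflect any vendor's implementation.

```python
import json
from collections import defaultdict

# Hypothetical raw records from two source systems, in different shapes.
RAW_RECORDS = [
    ("crm", '{"id": 1, "name": "Ada Lovelace", "city": "London"}'),
    ("billing", '{"cust_id": 1, "total": 42.5, "city": "London"}'),
    ("crm", '{"id": 2, "name": "Alan Turing", "city": "Manchester"}'),
]


class DataLake:
    """Keeps records in their source format; finding anything means a full scan."""

    def __init__(self):
        self.records = []

    def ingest(self, source, payload):
        self.records.append((source, payload))

    def scan(self, field, value):
        # Parse every stored record on every query.
        return [
            position
            for position, (_, payload) in enumerate(self.records)
            if json.loads(payload).get(field) == value
        ]


class DataHub(DataLake):
    """Same raw storage, plus an index built at ingest time to support discovery."""

    def __init__(self):
        super().__init__()
        self.index = defaultdict(set)

    def ingest(self, source, payload):
        super().ingest(source, payload)
        position = len(self.records) - 1
        for field, value in json.loads(payload).items():
            self.index[(field, value)].add(position)

    def lookup(self, field, value):
        # Answer from the index without re-reading the raw payloads.
        return sorted(self.index.get((field, value), set()))


lake, hub = DataLake(), DataHub()
for source, payload in RAW_RECORDS:
    lake.ingest(source, payload)
    hub.ingest(source, payload)

london_by_scan = lake.scan("city", "London")
london_by_index = hub.lookup("city", "London")
```

Both calls find the same two records, one from each source system. The hub pays the indexing cost once at ingest, whereas the lake pays a parsing cost on every query.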
Not surprisingly, the major vendors have latched on to this concept. Cloudera, for example, has published information on it. A simplistic summary of this offering is that Cloudera’s relationship with EMC recognises the large deployed base of, for example, Isilon data lakes, which via Cloudera can be turned into a data hub architecture.
In conclusion, there is no ubiquitous solution here (sorry). Data needs to be stored from its multitude of sources and used by a very wide range of users who vary in terms of their technical competence, from business people who need report-driven analytics to data scientists using the latest deep learning algorithms. How the data is stored becomes a consequence of the use case: the simpler the use case, the more processed and structured the data storage needs to be, and conversely, the more science that will be applied, the closer to its raw state the data can stay. An enterprise is likely to see all these use cases, and therefore it is more about using these techniques in a complementary way than about seeing them as divergent.
This article was originally posted in Neil’s blog.