2020 Planning Guide for Data Management

Published 7 October 2019 - ID G00401283 - 91 min read


Data management architectures and technologies continue to evolve rapidly. They promise higher efficiencies but demand deeper understanding of risks and implications. Data and analytics technical professionals must prepare to adopt the innovations to meet current business needs and future demand.
More on This Topic
This is part of an in-depth collection of research. See the collection:
  • 2020 Planning Guide Overview: Building Skills for Digital Transformation

Overview

Key Findings

  • Cloud deployment is fueling enormous growth in the data management space, which is leading to changes in architecture, processes, workflows and roles. Cloud introduces serverless and augmented capabilities, which will further morph data management processes and roles.
  • Data management approaches are unifying as operational and analytical options converge and more databases offer multimodel and real-time streaming capabilities. This enables more complex use cases, with the focus shifting from merely collecting data to connecting and exploiting it.
  • Data management complexity rises as organizations handle multistructured data spread across on-premises, edge and multiple cloud vendors, leading to greater data governance challenges.
  • As artificial intelligence (AI) and machine learning (ML) become “table stakes” in data management products, they are introducing new options and challenges for data and analytics architects, who must evolve how they deliver services to meet business imperatives.

Recommendations

To deliver effective data management solutions, data and analytics technical professionals should:
  • Embrace public cloud while supporting on-premises data architectures. New greenfield projects should design cloud-native architecture to handle data gravity, governance and financial management.
  • Deploy pragmatic use cases involving advanced analytics on real-time event streaming to deliver faster results on current data instead of batch-processed data from the previous day. Technology choices should expand business capabilities and not be driven by hype.
  • Embed end-to-end data governance to meet evolving global compliance regulations into all data management modernization initiatives.
  • Assimilate new skills and roles to support hybrid multicloud solutions and include edge technologies and AI/ML by investing in new technologies or hiring externally.

Data Management Trends

Data management has always been a linchpin in the success of any mission-critical IT program. This space is ripe with innovations and new advancements that can upend your current approaches to managing data. Conventional wisdom that the data management discipline has plateaued is being challenged at every point in the pipeline.
Data management technology innovations represent an existential threat to organizations that don’t embrace them, but also an unparalleled opportunity for those that do. These opportunities are often disguised as challenges.
New capabilities promise to supercharge your data management processes, but they come with their own peculiarities that must be fully understood to gain the most advantage. Innovations are happening everywhere — edge, cloud and on-premises — predominantly to support frictionless movement of data. As data and analytics technical professionals tackle the scourge of data silos by deploying new tools and technologies, there is a risk that organizations will end up with multiple knowledge silos across different business units.
According to “Market Share: All Software Markets, Worldwide, 2018,” the global DBMS market grew 18.4% in 2018 to reach $46 billion. This growth amounted to $7.2 billion, with Amazon Web Services (AWS) and Microsoft Azure accounting for more than 70% of it. The growth of DBMSs in the cloud is unstoppable: in 2018, 23% of the DBMS market, exceeding $10 billion, was in the cloud.
Although a lot of attention is paid to new DBMS technologies, traditional DBMS vendors still account for the most market share, as shown below:
  • Relational DBMS vendors (the top five being Oracle, Microsoft, AWS, IBM and SAP): 86.7%
  • Nonrelational (including Apache Hadoop): 13.3%
Table 1 shows the top data management trends emerging in 2020, based on Gartner client inquiries.

Table 1: Top 10 Emerging Data Management Trends

No. | Area | Trend
1 | Cloud | The inexorable march to the cloud, evident in the market share numbers above, is reflected in the rapid growth of cloud-deployed data stores and database platform as a service (dbPaaS). However, for the foreseeable future, the focus will be on hybrid multicloud.
2 | Automation | Increasing deployment of AI/ML in data management technologies makes them more autonomous and fully managed, allowing more time to be spent on strategic business initiatives rather than on administering products.
3 | Governance | Enabling “data democratization” through a semantic-layer-based data catalog and establishing data lineage through metadata-based data governance are becoming imperative to meeting compliance regulations.
4 | Use Cases | Expansion beyond structured data and batch processing into multistructured data and stream processing affects every component of the architecture: ingestion, persistence, analytics and operationalization.
5 | Data Pipeline | Trends here include automating data integration to support expanding data sources and sinks, and leveraging ML and AI to augment pipeline design.
6 | Data Access | Maturation of logical data warehouses (LDWs) as data virtualization technologies provides low-latency data access through semantic catalogs.
7 | AI/ML | Convergence of business intelligence (BI) and advanced analytics into the DBMS is increasing because it is more efficient to train models where the data resides, through functions callable from familiar SQL.
8 | Licensing | Open-source databases are becoming ubiquitous, and many organizations mandate their use. They are available community-supported or through public cloud providers, which has led the software’s developers to make their license models more protective.
9 | Infrastructure | Advancements in hardware, such as nonvolatile memory, allow large scale-up databases on single nodes at very low latency. The cost of these solutions will continue to fall in 2020 as adoption increases.
10 | DataOps | More efficient deployment and management of data architectures through serverless, container and orchestration advancements will support the hybrid multicloud trend. This will also ease management of data infrastructure through “a single pane of glass.”
Source: Gartner (October 2019)
In addition, trends from previous years that continue to grow include:
  • Continued focus on event streaming architectures, especially for time-series-based use cases such as the Internet of Things (IoT)
  • Reemergence of data lakes in a more secure and governed manner
  • Improvements in data management technologies based on graphics processing units (GPUs), ledger databases and graph databases
  • Focus on delivering advanced analytics and end-to-end solutions for predetermined use cases instead of collecting as much data as possible in a data lake without a prescribed use case. Examples of high-level use cases include customer journey analytics (Customer 360), security data lakes, and anomaly and fraud detection analytics.
This Planning Guide assesses the major technology planning trends in 2020 for data management and outlines important planning considerations for developing successful data architectures. Figure 1 summarizes the key 2020 data management planning considerations examined in this research.
Figure 1. 2020 Key Planning Considerations for Data Management
These planning considerations are framed by the following 2020 data management technical planning trends:
  • Frictionless data movement will demand a modular, flexible, end-to-end architecture that spans edge, cloud and on-premises.
  • Data stores will accommodate more complex use cases.
  • Revolutionary changes in data management will drive IT to adopt new operating models and roles.
  • New regulations and compliance will demand comprehensive distributed and coordinated data governance.
  • AI/ML advancement will improve data management workload delivery, but with an added burden.
These planning trends are discussed in greater detail in subsequent sections.

Frictionless Data Movement Will Demand a Modular, Flexible, End-to-End Architecture That Spans Edge, Cloud and On-Premises

The volume and velocity of data being produced through both internal and external sources is increasing faster than ever before. The notion that the cost of storing the data is low does not tell the full story. There is a price to be paid for this data volume. If the data is not properly secured, tracked and governed, it can quickly go from being an asset to a liability. In addition, the boundaries between cloud, ground and the edge are disappearing. Hence, data architectures must adapt accordingly. Choosing vendors and technologies is a challenge as functional and market boundaries are blurring and as data management becomes increasingly diverse.
One of the biggest themes of 2020 for data and analytics technical professionals will be handling hybrid multicloud environments. As a result, a common refrain of Gartner inquiries is the strong desire to architect future solutions to be “future-proof.” For example, in 2019, the Apache Hadoop vendor choices shrank significantly in the U.S. as Cloudera and Hortonworks merged and MapR Technologies was acquired by Hewlett Packard Enterprise (HPE). This has led many organizations to reevaluate their investments in the Hadoop ecosystem.
Many organizations are now in more than one public cloud. Some have strategically chosen this direction to alleviate the fear of cloud vendor lock-in, but others have found themselves in this situation due to different business units pursuing their best interests. A common example is where the IT department may have created a cloud data warehouse, but various business units have procured SaaS applications or BI tools that reside in different public clouds and yet need to be integrated.
In the past decade of the big data movement, the focus was on data accumulation. In the next decade, the focus will shift to data exploitation. Your architecture needs to have the flexibility to handle current and future use cases and yet be cost-effective to deliver high business value.
The role of data and analytics technical professionals will transform from technology architect to technology innovator, focused on helping businesses become more competitive and efficient.
There is now much more clarity on the goals of modernizing your data and analytics architectures. Well-understood concepts such as the LDW, multimodel databases and the role of data governance are helping drive modernization trends. However, a new trend that is adding complexity is the frantic migration of database management systems to the cloud. This requires data and analytics technical professionals to build solutions that address some use cases that will remain on-premises for the foreseeable future and some that will be in the cloud.
How technical professionals build their 2020 data management strategies will determine how successful they are in meeting current requirements and future needs. Traditionally, organizations have focused on structured data, such as spreadsheets or relational data, with distinct fields identifying and categorizing the content. However, organizations in industries such as healthcare insurance are now starting to exploit data from wearable devices, such as pacemakers and fitness trackers, to develop models that can help identify patterns and lower healthcare claims and costs. This data, such as sensor, email, log or streaming data, may be semistructured or unstructured, with no discernible schema.
Modernizing your data architecture will continue to be a top priority in 2020. Technical professionals must take steps to evaluate opportunities across the entire pipeline (see Figure 2).
Figure 2. The Data and Analytics Continuum
Important priorities include:
  • Design your architecture knowing that your deployment options may change. The obvious case is on-premises-to-cloud migration, but the reverse is also possible. How you design the data architecture will affect the analytics architecture, because it is more efficient to keep the two close. Gartner refers to this concept as data gravity.
  • Plan for automation so that changes can be easily incorporated. There will be much more emphasis on creating orchestration pipelines that span various deployment options and provide reliability.
  • Select the technology vendors that meet your architecture goals. There has been an explosion of new companies offering solutions in almost every category of the pipeline. In addition, an increasing number of products are adding capabilities, sowing more confusion among users. To handle this, document your requirements in sufficient detail and perform proofs of value (PoVs) on the selected products. PoVs will test the viability of the products, frameworks or vendors.
  • Comply with data privacy and regulatory rules. Data governance should be part of the PoVs and should not be an afterthought.
  • Control cost carefully as modernization introduces new financial models. Many organizations that are used to annual capital expenditure (capex) are now finding that many of the costs are becoming operating expenditure (opex). Some categories of the pipeline (such as DBMSs) offer open-source alternatives, while others (such as data governance) don’t.

Planning Considerations

Given the trends described above, data and analytics technical professionals should focus their 2020 planning efforts on the following activities:
  • Extend the data architecture to acquire streaming and cloud-borne external data.
  • Enable use cases by modernizing integration strategy.
  • Offer versatile data delivery styles, including data virtualization and logical data architecture.
  • Incorporate a data life cycle management process to support continuous intelligence.
Extend the Data Architecture to Acquire Streaming and Cloud-Borne External Data
Applications that require real-time processing of high-volume event streams are pushing the limits of traditional data processing architectures and infrastructures. Organizations are finding it increasingly important to act on data in motion in order to remain competitive. Digital transformation is driving the need for analysis of what is happening now, not what happened yesterday or last month.
In 2020, the ability to act in real time will become a key element of SLAs across multiple domains, from IoT, to finance, to cybersecurity, to retail. Enterprises are looking to embrace streaming architectures coupled with cutting-edge data processing engines and frameworks to create streaming data applications. This is referred to by many names: real-time analytics, streaming analytics, complex-event processing (CEP), real-time streaming analytics and event processing.
Stream processing technology grew from the demands of enterprises that experienced a huge surge of data volume, velocity and variety and a pressing need to ingest and evaluate this data quickly to make strategic business decisions. Streaming enables continuous processing of the inflowing data.
Compared with the batch-mode, “data at rest” practices of traditional systems, the capability of streaming data applications to process and analyze data in motion has become a key differentiator for organizations. The importance of streaming has grown enormously in recent years because it provides a competitive advantage that reduces the time gap between data’s arrival and its analysis. Some insights are more valuable immediately after an event occurs, but the value diminishes very rapidly with time.
In an end-to-end data architecture and data flow pipeline, streaming can be leveraged across the different layers of the system, from stream-based data ingestion and integration to stream processing, streaming analytics and ML. Figure 3 shows the typical composition of an end-to-end streaming application: stream ingestion for data capture, stream processing, and streaming consumption for insight delivery.
Figure 3. Stream-Processing Architectural Components
Technical professionals responsible for adopting stream processing architectures into the data pipelines must:
  • Evaluate stream processing frameworks across several dimensions — including checkpointing, back pressure handling, windowing capabilities and watermark capabilities — and align them with your latency, throughput and domain requirements.
  • Establish an abstraction layer, both for the stream processing code from the underlying streaming platform and for the stream processing layer from other components in the data pipeline, to avoid vendor lock-in.
  • Ensure that streaming pipelines are easily configurable and tunable with myriad configurations for optimal performance and to avoid data loss. Devote time, resources and tools to ensure that these pipelines can be easily orchestrated, operationalized, troubleshot, monitored and debugged.
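The windowing and watermark concepts above can be illustrated with a minimal sketch. This is plain Python with invented parameters, not a production approach; a real deployment would rely on a stream processing framework with checkpointing and back pressure handling. The sketch aggregates out-of-order events into tumbling windows and uses a watermark, trailing the maximum observed event time, to finalize windows and drop events that arrive too late:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size=60, allowed_lateness=30):
    """Count (event_time, key) events per tumbling window.

    Events may arrive out of order. The watermark trails the maximum
    observed event time by `allowed_lateness` seconds; windows whose
    end falls behind the watermark are finalized, and later events
    destined for them are dropped.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # window_start -> key -> count
    emitted = {}
    max_event_time = 0

    for event_time, key in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        window_start = (event_time // window_size) * window_size

        if window_start + window_size <= watermark:
            continue  # too late: this window was finalized by the watermark
        open_windows[window_start][key] += 1

        # Finalize every open window whose end has passed the watermark.
        for ws in sorted(w for w in open_windows if w + window_size <= watermark):
            emitted[ws] = dict(open_windows.pop(ws))

    # Flush whatever remains at end of stream.
    for ws, counts in open_windows.items():
        emitted[ws] = dict(counts)
    return emitted
```

Note how the event at time 8 in the usage below is dropped once the watermark has moved past its window, which is exactly the trade-off that the windowing and lateness settings of a real framework control.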
Enable Use Cases by Modernizing Integration Strategy
Considering the growing need to ensure seamless connectivity among a diverse set of applications with varying latency and heterogeneous data formats, the data integration space has seen a set of integration patterns emerge that cater to a panoply of use cases. In 2020, data and analytics technical professionals must focus on modernizing their data integration strategy to prepare for new data sources, uses, formats, systems and technologies. Industry trends bring new needs to accommodate, transform and integrate a variety of data types and formats.
Existing integration patterns include:
  • Enterprise integration platforms as a service (EiPaaS)
  • Hybrid integration platforms (HIPs)
  • Enterprise application integration (EAI)
  • B2B integration
  • APIs
  • Managed integration services
Key areas for data and analytics technical professionals to focus on in their planning efforts include:
  • Develop a modern data integration strategy that can handle disparate dataset formats and varying latencies. This effort involves developing an integration framework to read this disparate data, curate it and process/disseminate it for consumption.
  • Incorporate natural language processing (NLP) to decompose overwhelming content in unstructured data. As organizations seek to gain value from increasing volumes of unstructured data, they need to explore new approaches or technologies — such as NLP — to transform that data into more meaningful forms.
Develop a Modern Data Integration Strategy
Your data integration strategy should handle disparate types of datasets that are becoming more prevalent in modern data architectures. Different approaches are needed to build an integration pipeline that can handle structured, semistructured and unstructured data. To illustrate the topic of diverse data sources and protocols needed to integrate the data, Table 2 shows the example of delivering a comprehensive Customer 360 use case.

Table 2: Diverse Data Sources and Protocols

Source | Location | Data Type | Integration Option | Frequency | Purpose Examples
ERP/Operational | Internal database | Relational database management system (RDBMS) | Change data capture (CDC) | Daily | Transactions
Sales | External (cloud) | RDBMS | API | Daily | Customer data such as loyalty ID, name or address
Digital Commerce | External/internal | RDBMS | iPaaS, API | Ad hoc or batch | Customer profile data, transactional data
Data Warehouse | Internal | RDBMS | SQL | Batch | Customer history
Website (Clickstream) | Digital analytics platform | JavaScript Object Notation (JSON) | API | Near real time | Browsing history (anonymous or logged in)
Logs | Security information and event management (SIEM), etc. | Key value, text | FTP, Logstash, fluentd, etc. | Daily | Detect anomalies, usage, metrics
Social Media | Twitter, etc. | JSON | API | Ad hoc | Sentiment, loyalty
Contracts | Internal (file store) | File (PDF, MS Word), optical character recognition (OCR) | Apache Tika | Batch | Legal terms
Help Desk, Call Center | Internal/external | RDBMS | iPaaS, enterprise service bus (ESB) | Batch | Tickets, customer satisfaction
Surveys | Internal/external | NoSQL | API | Ad hoc | Custom data (e.g., reputation)
Customer Digital Analytics | Google, Adobe | JSON | API | Near real time | Web or mobile app activity
Master Data Management (MDM) and Reference Data Repositories | Internal | RDBMS or graph DBs | Open Database Connectivity (ODBC)/Java Database Connectivity (JDBC) | Batch | Customer system of record
Images/Speech | Cloud or internal storage area network (SAN) | Unstructured data in files | NLP, ML, speech-to-text | Daily | Image recognition/customer intent/segmentation
IoT Sensor (e.g., Smart Meter) | External | XML | Apache Sqoop | 15 minutes | Usage profiles
Customer Satisfaction Survey Data | Internal/external | Text | File | Half-yearly | Customer satisfaction or sentiment
Source: Gartner (October 2019)
A planning priority for 2020 is to ensure that your data ingestion approach can handle a variety of data source formats by considering the following factors:
  • Build your source ingestion pipelines in an agile manner that can allow new sources and targets to be added dynamically. Consider using an ingestion pipeline and orchestration platform such as StreamSets or Apache NiFi with built-in support for alerts, monitoring and metadata management.
  • Harmonize data. Because source systems may have disparate and volatile schemas, ensure that data is unified into common formats. Many organizations convert source data into comma-separated values (CSV) files and further optimize it into columnar formats such as Apache ORC and Apache Parquet.
  • Allow for bulk loading into the target data stores by incorporating a “system aware” data routing capability. The need for multiple target systems to maintain a consistent view of the data requires a “data router/transaction manager” capability that ensures reliable data load coordination.
  • Profile data to handle data quality and order sequencing. The ingestion pipeline needs to account for late-arriving data and missing values. Data profiling is also needed to identify sensitive data and apply corporate policies.
  • Reduce latency. Leverage temporary storage (cache) and respond to an unexpected spike in data volume by buffering the data in a local area.
  • Maintain atomicity of operations by ensuring that state is recoverable in the event of a failure. If ingestion and load are performed by distinct pipelines, then the data flow between the two needs to be recovered. Arriving data (queryable for real-time analytics) needs to be managed in isolation of data cached from the downstream curated data store.
  • Scale up to cater to an increase in the number of sources and sinks. Accordingly, scale down the resources when the load wanes. Investigate products that allow autoscaling to maintain an optimal throughput to preserve the desired latency.
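The harmonization step above can be sketched in a few lines. The source formats, field names and the common target schema below are hypothetical; the point is that each source gets its own adapter and every record leaves the ingestion layer with the same shape, with missing values kept explicit for downstream profiling:

```python
import csv
import io
import json

# Hypothetical common target schema for a unified customer feed.
COMMON_FIELDS = ("customer_id", "source", "event", "amount")

def from_erp_csv(raw):
    """Adapter for an ERP export that arrives as CSV with its own column names."""
    for row in csv.DictReader(io.StringIO(raw)):
        yield {
            "customer_id": row["cust_no"],
            "source": "erp",
            "event": row["txn_type"],
            "amount": float(row["txn_amt"]),
        }

def from_clickstream_json(raw):
    """Adapter for clickstream events arriving as JSON lines; no amount field."""
    for line in raw.splitlines():
        evt = json.loads(line)
        yield {
            "customer_id": evt["user"],
            "source": "web",
            "event": evt["action"],
            "amount": None,  # missing value kept explicit, not silently dropped
        }

def harmonize(*streams):
    """Merge heterogeneous source streams into one list with a common schema."""
    unified = []
    for stream in streams:
        for record in stream:
            # Guarantee every record exposes exactly the common fields.
            unified.append({f: record.get(f) for f in COMMON_FIELDS})
    return unified
```

In practice, the unified records would then be written to a columnar format such as Apache Parquet or ORC via a library like pyarrow, as described above.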
Use NLP to Decompose Diverse and Complex Content
As increasing amounts of unstructured data are generated that contain potentially valuable insights, an approach or technology is needed to decompose data into something meaningful that can be better captured, imported and analyzed. This enables organizations to better connect, relate and categorize diverse information to gain new insights from it. Gartner’s inquiry volume shows a steady increase in deployment of use cases involving unstructured data. Some of the increase is organic as organizations further mature their modern data architectures. Others are being forced to handle unstructured data due to external reasons such as compliance and regulations. For example, healthcare organizations are leveraging digital business to operate more efficiently and improve research, discovery and outcomes. They convert inbound calls into text and then send them to third parties for analysis, but first they need to mask protected health information (PHI) due to Health Insurance Portability and Accountability Act (HIPAA) regulations in the U.S.
The need to parse unstructured data has led to an increased adoption of NLP. NLP helps better analyze, contextualize and understand unstructured data, reduce complexity, and start the transition into a format that can be reported on and searched.
NLP has the power to automate the extraction of meaningful contextual information from unstructured text. NLP can be used to examine text and decompose it into a semantic layer that is persistent in a database (relational or graph). In the example shown in Figure 4, descriptive text that includes analysts’ names and coverage areas has been converted to a database structure that can be used for easy search and reporting.
Figure 4. Converting Text to a Database Structure
Another area that will soon leverage NLP is the design of a business rule engine (BRE) from unstructured text. Data quality rules can be gleaned and populated into a centralized repository for data validation. Using the appropriate NLP technologies, once nouns, adjectives and verbs have been identified, entities, attributes and relationships can start to be forged.
NLP may also be incorporated into the processing layer of a streaming ingestion framework. This is done to identify structures that may not conform to a preagreed schema. The processor can then make a web service call to retrieve a classifier for the nonobvious data, which can then be persisted.
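To make the text-to-database idea of Figure 4 concrete, here is a deliberately simplified sketch. A regex-based extractor stands in for a real NLP pipeline (which would use proper entity and relationship extraction), and the sentences, names and table schema are invented for illustration:

```python
import re
import sqlite3

# Toy pattern standing in for real NLP entity/relationship extraction.
PATTERN = re.compile(r"(?P<analyst>[A-Z][a-z]+ [A-Z][a-z]+) covers (?P<area>[^.]+)\.")

def text_to_table(text, conn):
    """Extract (analyst, coverage area) pairs from text and persist them."""
    conn.execute("CREATE TABLE IF NOT EXISTS coverage (analyst TEXT, area TEXT)")
    for m in PATTERN.finditer(text):
        # Split a compound coverage phrase into individual areas.
        for area in re.split(r",| and ", m.group("area")):
            conn.execute(
                "INSERT INTO coverage VALUES (?, ?)",
                (m.group("analyst"), area.strip()),
            )
    conn.commit()

conn = sqlite3.connect(":memory:")
text_to_table(
    "Jane Doe covers data management and cloud databases. "
    "John Smith covers streaming analytics.",
    conn,
)
```

Once the descriptive text is persisted as rows, it can be searched and reported on with ordinary SQL, which is the payoff the figure illustrates.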
In 2020, data and analytics technical professionals should explore the viability of using NLP to build conceptual data models. Key factors to consider in starting an NLP journey include:
  • Analyze the business case: The most common NLP business cases involve performing sentiment analysis on unstructured data, automating business processes (such as onboarding new hires using chatbots) and understanding customer churn.
  • Identify skills: NLP architectures are very specific to business cases and usually require some customization to NLP packages. Depending on your priorities, these skills can be developed in-house, or external system integrators can be leveraged to develop your solutions.
  • Determine accuracy: Gartner clients often ask how accurate the NLP algorithms are for their use cases. This leads to the previous point about the level of customization. Regulatory use cases demand very high accuracy. Interestingly, NLP models should improve with higher usage and with human curation.
  • Select vendors: Your requirements (e.g., handling sensitive information) determine deployment options, such as on-premises or via an API-accessed cloud vendor. Other critical requirements cover security and integration with CRM, BI solutions and AI/ML frameworks. Finally, your volume and scale will determine the cost-effectiveness of your NLP initiative.
Offer Versatile Data Delivery Styles Including Virtualized and Logical Data Architecture
IT has been on a long journey to provide business with access to data in the most optimal manner. This journey has taken many shapes and forms, such as:
  • Corporate or enterprise data warehouses
  • Data lakes
  • LDWs
LDWs combine various data marts and/or existing enterprise data warehouses (EDWs) with the newest data lake deployments, both on-premises and in the cloud.
The location of data in joined repositories is more fluid than in the past. Organizations use the LDW approach to combine traditional data warehouse infrastructure that has moved to a data warehouse in the cloud, or even data warehouse as a service, with other forms of analytic data processing.
The LDW is an architectural design. It is not a purchased commodity.
The LDW allows for faster exploration of new data assets while assimilating them with existing ones. The LDW is both an evolution and an augmentation of existing data warehouse architecture practices. It is also an approach for starting a data lake initiative and building “backward” to combine data lakes with traditional data warehouse solutions as needed. It reflects the fact that not all analytical, querying and reporting needs can be supported by a traditional, centralized, repository-style data warehouse, nor, conversely, by data lake implementations. It implies a much broader and more inclusive data management solution for analytics.
Gartner estimates (see “Hype Cycle for Data Management, 2019”) that, by the end of 2019, 18% of all data warehouse deployments will be LDWs. LDW adoption is increasing, so data and analytics professionals should prepare to introduce this option by focusing on the following aspects:
  • Take an incremental approach to building this ecosystem to avoid the failures of past “big bang” approaches.
  • Select a data virtualization product that provides the following critical capabilities:
    • Federate queries between various components of the LDW. This requires the product to have connectors to your list of databases.
    • Accelerate queries through a cost-based optimizer that selects the most cost-effective option with the lowest latency. Vendors deploy options such as massively parallel processing (MPP) pushdown, in-memory caching and other optimization techniques.
    • Deploy a metadata catalog to query the semantic layer and locate data elements that are spread across different systems yet share common business terms and glossaries. Some advanced metadata catalogs can identify when the schema changes in the source systems and perform impact analysis on the downstream reports, dashboards or systems.
    • Maintain security policies to ensure that users can see only the data they have been authorized to see.
Providers that offer stand-alone data virtualization middleware tools include AtScale, Denodo, Dremio, IBM, Informatica, Information Builders, Oracle and TIBCO Software. The core of the “organize” stage of the end-to-end architecture is the LDW (see “Solution Path for Planning and Implementing the Logical Data Warehouse”).
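The federated-query idea at the heart of the LDW can be illustrated with a toy example. The sketch below joins data held in two physically separate stores inside a thin "virtualization" layer; the stores, schemas and join logic are illustrative stand-ins for what a commercial data virtualization product does with its connectors and cost-based optimizer:

```python
import sqlite3

# Two separate stores standing in for a warehouse and a data lake.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
warehouse.execute("INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex')")

lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE clicks (customer_id INTEGER, page TEXT)")
lake.execute("INSERT INTO clicks VALUES (1, '/pricing'), (1, '/docs'), (2, '/home')")

def federated_click_counts():
    """Join click counts from the lake with names from the warehouse."""
    # Pull only the columns needed from each store; a real engine would
    # push predicates down and pick a join strategy via its optimizer.
    names = dict(warehouse.execute("SELECT id, name FROM customers"))
    counts = {}
    for (cid,) in lake.execute("SELECT customer_id FROM clicks"):
        counts[cid] = counts.get(cid, 0) + 1
    return {names[cid]: n for cid, n in counts.items()}
```

The consumer sees one logical answer without knowing, or caring, that the underlying rows live in different systems.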
Incorporate a Data Life Cycle Management Process to Support DataOps
Today’s data is big, fast, complex and changing: it is created faster and changes more quickly than ever. The questions asked of the data change quickly, and decisions must be made in minutes and seconds, not days and hours. The data to support decision making must be delivered in the right format, to the right teams and within the right context. Data is an asset but can become a liability if not used correctly, in the right context and within the right time frame.
One of the most active areas of innovation in 2019 was orchestration of data pipelines that connect data from multiple and disparate data sources straddling edge, on-premises and cloud to different consumers and applications. These data orchestration engines require continuous monitoring and reporting.
The DataOps concept has emerged as a solution to the challenges of operationalizing and deploying data and analytics projects in the enterprise. DataOps is becoming an important cog in driving continuous integration (CI) and continuous deployment (CD). It applies traditional DevOps concepts of agility, CI/CD and end-user feedback to data and analytics efforts.
DataOps is the application of DevOps practices to data management, data integration and data processing to reduce the cycle time of end-to-end data insights. It focuses on automation, collaboration, monitoring and repeatability of processes and data flows with a configuration-driven approach.
In 2020, data and analytics technical professionals should explore increased use of DataOps for the operationalization of end-to-end data platforms to build a data-driven organization and consider the following:
  • Select a DataOps platform to integrate functions that support rapid deployment and governance.
  • Provide version control of datasets to manage changes in artifacts essential for governance and iterative development.
  • Manage lineage, history and metadata of system and activity logs.
  • Implement authorization and permission controls to access the environments.
  • Capture metrics and reports at the right granularity for reports and analysis to provide a big-picture assessment of the state of the analytics and data team.
  • Automate deployment of the code, frameworks, libraries and tools, each with their own configuration from one environment (e.g., staging) to another (e.g., production).
  • Create and maintain the environment by enabling infrastructure-as-a-code to streamline development, quality assurance (QA) and DevOps teams.
  • Orchestrate, test, monitor, schedule and alert for failure or anomalous conditions.
  • Detect data drift automatically and respond to changes in schema and data semantics to provide reactive change management.
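As an illustration of the drift-detection step above, here is a minimal sketch in Python. The function and schema representation are hypothetical, not taken from any DataOps product: it compares an expected schema against the schema observed in an incoming batch and reports what changed.

```python
def detect_schema_drift(expected, observed):
    """Compare two {column_name: type_name} mappings and report drift."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "added": sorted(observed_cols - expected_cols),    # new columns appeared
        "removed": sorted(expected_cols - observed_cols),  # columns disappeared
        "retyped": sorted(c for c in expected_cols & observed_cols
                          if expected[c] != observed[c]),  # type changed
    }

# A pipeline could alert, or quarantine the batch, when any list is nonempty.
drift = detect_schema_drift(
    {"id": "int", "name": "string", "ts": "timestamp"},
    {"id": "int", "name": "string", "ts": "string", "country": "string"},
)
print(drift)  # {'added': ['country'], 'removed': [], 'retyped': ['ts']}
```

In practice, a reactive change-management step would feed this report into the orchestration engine to decide whether to halt, adapt or continue the pipeline.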
Figure 5 shows an overall architecture for orchestrating, managing and automating data architectures and data pipelines across different environments with DataOps.
Figure 5. Using DataOps to Orchestrate, Manage and Automate Pipelines
Using DataOps to Orchestrate, Manage and Automate Pipelines
DataOps can span the entire gamut from data ingestion all the way to data delivery. This is a complex chain of components, and because the process is nascent, no end-to-end tools or products provide an all-encompassing DataOps solution. Some of the tools and vendors in this space include Composable Analytics, DataKitchen, Infoworks, Nexla and StreamSets.

Data Stores Will Accommodate More Complex Use Cases

Data stores today are being pushed to store ever larger amounts of data that dwarf some of the largest databases of the last decade. In addition, data management systems must process vast amounts of that data in ever shorter time periods to enable reporting and advanced analytics on current data rather than data that was batch processed in the past. The trifecta of greater volume, velocity and variety of data is opening doors to new use cases that, until recently, were not conceivable. One area that has benefited from these advancements is precision medicine. The ability to store and analyze human genome sequences and subsequent copies of edited sequences has already contributed to treatments for many diseases.
Because data stores are experiencing unprecedented transformations in their capabilities, Gartner has refined the classification of the six primary use cases that drive data architecture in Table 3.

Table 3: Six DBMS Use Cases

  • Traditional Transactions: Manages centralized transactions with fixed, stable schema; high-speed, high-volume data insert/update; and atomicity, consistency, isolation and durability (ACID) properties. Hybrid/cloud deployment is increasingly important.
  • Traditional Data Warehouse: Manages structured historical data coming from multiple sources. Data is mainly loaded through bulk and batch loading.
  • Logical Data Warehouse (LDW): Manages data variety and volume for both structured data and other content data types, where the LDW acts as a logical tier to a variety of data sources.
  • Augmented Transaction Processing (formerly known as hybrid transactional/analytical processing [HTAP]): Analytical and transaction processing techniques are woven together to accomplish the business task. Typically, augmented transactions involve in-memory DBMS products and use various forms of advanced analytics, including AI and ML, to automatically guide the execution of transactional processes.
  • Stream/Event Processing (IoT): Manages events/observations captured “at the edge.” This may include processing at the edge and transmission of results to other stages of a business process.
  • Data Science Exploration/Deep Learning: Exploration of new data values, data form variants and relationships. This supports search, graph and other capabilities to uncover new information models.

Source: Gartner (October 2019)
Traditionally, Gartner addressed operational and analytical use cases separately because specialized product architectures had evolved to address the different demands on infrastructure that these use cases required. However, almost 75% of the vendors included in either the operational DBMS (OPDBMS) or the data management solutions for analytics (DMSA) Magic Quadrants support both analytic and operational use cases. Hence, data and analytics technical professionals must consider the ecosystem holistically in order to get value most efficiently from the data and analytics landscape.
The number of data stores available to meet modern use cases is constantly rising, despite the fact that it takes many years (and even decades) to mature such a foundational component of data architecture. The demand to meet these needs efficiently has led to the creation of many niche databases, belying the common logic that this market is ripe for consolidation. This requires careful evaluation of current features and future promises (euphemistically known as roadmaps). The scourge of shiny object syndrome can easily stymie the long-term success of architectures. In fact, one of the most common requests during Gartner inquiries on selecting products involves what will “future-proof” investments in technologies.
In 2020, data and analytics technical professionals will need to invest efforts in several areas, including:
  • Deploying multimodel databases: Many databases have added new data formats or APIs to become capable of handling more use cases without needing specialized databases. For example, it is quite common now to see document databases handle graph database use cases. In addition, databases such as AWS’ Amazon DocumentDB and Microsoft Azure’s Cosmos DB have MongoDB API compatibility. Relational databases that were optimized for row store online transaction processing (OLTP) use cases now frequently provide column store online analytical processing (OLAP) optimized options.
  • Ensuring conceptual modeling and semantic layers: The need to handle large amounts of diverse data introduces challenges regarding how to categorize and find that data. Business users should not have to know the location and arcane schema names. They should be able to find the data using familiar business terms in a self-service manner. This requires IT professionals to ensure that modern data stores are modeled efficiently. It also requires that the business have a semantic layer that crosses disparate sources of common data.
  • Increasing use of object stores: Predominantly in the cloud, but now available on-premises, object stores provide a lower-cost, highly reliable option to store data compared to DBMSs. Object stores store data in a variety of formats, such as CSV, TSV, JSON, Apache Parquet, Apache Avro and Apache ORC. Newer technologies can efficiently query the data directly or federate the data with other DBMS or data store options.

Planning Considerations

To prepare for this trend in 2020, data and analytics technical professionals should focus on the following planning priorities:
  • Reduce polyglot persistence complexity by deploying multimodel data stores
  • Enhance conceptual modeling and develop a semantic layer for nonrelational data
  • Deploy data lakes using object stores in the cloud and on-premises
  • Explore new metamodel choices for multistructured data
  • Optimize databases via in-memory and nonvolatile persistent memory DBs
Reduce Polyglot Persistence Complexity by Deploying Multimodel Data Stores
When the era of NoSQL databases dawned, it was alluring to imagine that a retail website could deploy a document database to store the product catalog, a key value database to store the user session information, a search database to speed up customer interactions and, finally, a relational database to handle financial transactions. The analytical database would be a columnar data warehouse. This idea came to be known as polyglot persistence. It is similar to polyglot applications where developers may be using various computer languages to develop applications.
While best-of-breed data stores provide excellent capabilities, they also introduce many risks, including the complexity of integrating various technologies, the need to maintain disparate skills, and the license and maintenance costs of multiple products. As a result, many databases now offer the ability to store multiple types of data structures.
The journey from mainframes, to RDBMSs, to nonrelational databases, to multimodel data stores has been a long one. One of the most common opportunities to reduce the glut of databases comes from consolidating specialized data stores into a single database that supports multiple data models. This approach is known as the multimodel database (see Figure 6).
Figure 6. Multimodel Database
Multimodel Database
A multimodel database is a unified database for different types of data. It is designed to support multiple data models against a single, integrated back end. A multimodel database can support document, graph, relational and key-value models. The idea of multimodel databases is not new. The object-relational database is a framework that allows users to plug in their domain- or application-specific data models as user-defined functions, types or indexes. The need for multimodel databases has arisen from the fact that organizations don’t have unlimited budget to absorb multiple database technologies, invest in integrating them and acquire hard-to-get specialized skills.
Having multiple data models is useful when implementing different use cases, especially with modern datasets that are heterogeneous and change quickly. By using a multimodel database, you retain the advantage of being able to store and serve different data models to different users without the additional complexity that comes with polyglot persistence. Using a multimodel database to create an agile data system helps you build a system that is more future-proof and can continuously adapt to your needs.
A multimodel database supports:
  • Multiple data models within a single back end that use data and query standards appropriate to each model
  • Seamless querying across all the supported data models
  • Indexing, parsing and processing standards appropriate to the data models
Today, it is common to see databases that started as key-value data stores but eventually added support for other data types such as document databases. Some examples of multimodel databases include ArangoDB, Amazon DynamoDB with Neptune graph capabilities, Azure Cosmos DB and MarkLogic.
Gartner expects that most DBMSs will morph into multimodel databases. As a result, it behooves data and analytics technical professionals to explore multimodel databases by considering the following factors:
  • Start with a clear definition of your use cases before selecting any database.
  • Understand your data access patterns — random read/random write or sequential read/write workloads.
  • Examine the technologies. For example, a relational database provider may claim to support XML, but if the XML is stored as a binary large object (BLOB), its use is limited. If XML queries are important to your use case, consider a database that has native support for XML data.
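To make the multimodel idea concrete, here is a toy sketch (not any vendor's API; all names are hypothetical) of a single integrated back end that serves both key-value access and document-style field queries over the same stored records:

```python
import json

class TinyMultiModelStore:
    """One integrated back end; two access models over the same data."""
    def __init__(self):
        self._data = {}  # the single storage layer shared by both models

    # Key-value model: opaque get/put by key.
    def put(self, key, doc):
        self._data[key] = json.loads(json.dumps(doc))  # defensive copy

    def get(self, key):
        return self._data.get(key)

    # Document model: query by field predicates across all documents.
    def find(self, **predicates):
        return [doc for doc in self._data.values()
                if all(doc.get(f) == v for f, v in predicates.items())]

store = TinyMultiModelStore()
store.put("u1", {"name": "Ada", "city": "London"})
store.put("u2", {"name": "Lin", "city": "Paris"})
print(store.get("u1")["name"])   # Ada  (key-value access)
print(store.find(city="Paris"))  # [{'name': 'Lin', 'city': 'Paris'}]  (document query)
```

Real multimodel databases add indexing, cross-model querying and transactional guarantees on top of this shared-back-end idea.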
Enhance Conceptual Modeling and Develop a Semantic Layer for Nonrelational Data
For decades, data models have been a linchpin of data management. Through an accident of history, data modeling techniques and notations are optimized for relational contexts. It is now known that data models apply to many more pieces of a data management architecture, as Figure 7 illustrates. In the figure, a filled-in circle indicates an artifact that is a data model and a blank circle indicates an artifact that is closely related to a data model.
Figure 7. Data Models and Related Artifacts
Data Models and Related Artifacts
Gartner inquiry volume shows an increased focus on modeling techniques that address many aspects of the modern data architecture, including designing sharding/bucketing strategies, modeling for NoSQL solutions, and modeling to support data governance. This application of data modeling to such an expansive set of problems causes a new problem of its own.
The new problem is that the historical influence of the relational metamodel has yielded best practices in modeling that misalign with current realities and current needs of data and analytics technical professionals. Consequently, technical professionals planning for 2020 should attend to the following:
  • If the data already exists, plan to discover some data models automatically. There are several ways to do this:
    • Data warehouse automation tools can shoulder much of the burden of designing data models for data warehouses.
    • Discovery features of metadata catalogs can infer data dictionaries by scrutinizing data assets. The inferred entries of data dictionaries are then exposed to a crowdsourcing step during which business subject matter experts confirm or reject the speculated interpretations of data.
    • Semantic layers can be automatically built atop data lakes and other data assets. These data layers are typically expressed in a graph metamodel, so organizations planning to use this will need to develop some expertise in non-SQL querying techniques such as GraphQL.
    • Reverse-engineering features of conventional data modeling tools (e.g., erwin or IDERA ER/Studio Data Architect) can generate data models from implemented storage structures. For years, these features have worked with pure SQL structures. More recently, they can reverse-engineer from many other storage metamodels, including XML schemas, Apache Cassandra, Neo4j or Apache Hive.
  • If the data does not already exist, build data models manually. Not all data models can be discovered automatically. For example, data models supporting greenfield development projects for transactional systems will need traditional data modeling efforts.
  • Establish a conceptual modeling practice and separate it from logical modeling. Historically, conceptual modeling has been thought of as nothing more than the production of a vague sketch. By this way of thinking, conceptual models consisted of only a handful of entities and few or no attributes. Investigation of details was postponed until the logical modeling phase. Although not ideal, this approach was adequate only because logical modeling occurred in a relational context. When logical modeling occurs in NoSQL contexts, the approach is invalid. That’s because NoSQL models are a poor medium for collecting data requirements from users. If you plan to design an aggregate-oriented NoSQL structure — say, for a Cassandra database — you must understand the business perception of the data before beginning the logical modeling. That understanding must be established through conceptual data modeling that includes attributes, dependent entities and other details.
For more information, see “Data Modeling to Support End-to-End Data Architectures.”
Deploy Data Lakes Using Object Stores in the Cloud and On-Premises
Organizations moving to the cloud are increasingly adopting cloud-based object stores as the primary data storage for data lakes. Object storage provides excellent economy at scale, ensures high availability (HA) and high durability, and is optimized for data-intensive applications. Object storage also has rich metadata tag features, unlike traditional storage options like network-attached storage or storage area networks. The popularity of object stores has resulted in a proportional decline in significance of using Hadoop Distributed File System (HDFS)-based data lakes.
Object storage is where every file is an object as opposed to a file in a filesystem. Object storage doesn’t need a filesystem and file formatting. Each object has three components: a unique key, the actual data payload and metadata. The metadata is either system-generated or user-defined. The metadata is often stored in an embedded relational database and can be queried using REST APIs. Please note that object stores tend to be “eventually consistent.”
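The three components of an object described above can be sketched as a minimal in-memory object store. This is illustrative only (the class and method names are invented, not a real SDK):

```python
from dataclasses import dataclass, field

@dataclass
class StoredObject:
    key: str                                       # unique key
    payload: bytes                                 # the actual data
    metadata: dict = field(default_factory=dict)   # system- or user-defined tags

class TinyObjectStore:
    def __init__(self):
        self._bucket = {}

    def put(self, key, payload, **metadata):
        self._bucket[key] = StoredObject(key, payload, metadata)

    def head(self, key):
        """Return only the metadata, as an object-store HEAD request would."""
        return self._bucket[key].metadata

store = TinyObjectStore()
store.put("logs/2019-10-07.json", b'{"events": []}',
          content_type="application/json", retention="90d")
print(store.head("logs/2019-10-07.json")["retention"])  # 90d
```

The metadata-only `head` operation is what makes rich tagging practical: consumers can discover and filter objects without fetching payloads.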
For organizations that have decided to stick with HDFS, it is best to move to HDFS 3.0, which uses “erasure coding” for better data storage savings and removes the need to replicate data three times. Erasure coding chunks data across multiple storage locations in a manner that has lower overhead to reconstruct data compared to traditional redundant array of inexpensive disks (RAID). Although erasure coding provides more efficient storage than HDFS’s default three-way replication, it is more CPU-intensive and, therefore, has higher latency.
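The storage savings can be checked with simple arithmetic. Three-way replication stores every block three times (3.0x raw overhead), while a typical Reed-Solomon scheme such as RS(6,3) stores six data blocks plus three parity blocks (1.5x) and still tolerates the loss of any three blocks:

```python
def raw_overhead(data_blocks, redundancy_blocks):
    """Raw bytes stored per byte of user data."""
    return (data_blocks + redundancy_blocks) / data_blocks

replication = raw_overhead(1, 2)  # 1 original copy + 2 replicas = 3.0x
rs_6_3 = raw_overhead(6, 3)       # 6 data + 3 parity blocks   = 1.5x
print(replication, rs_6_3)                                   # 3.0 1.5
print(f"storage saved: {(1 - rs_6_3 / replication):.0%}")    # 50%
```

Halving the raw footprint is why erasure coding is attractive at petabyte scale despite the extra CPU cost of reconstructing data.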
Cloud-based object stores are growing as companies are trying to take advantage of low-cost and flexible storage options. A new breed of file system is emerging to tackle big data challenges. Although cloud object stores are gaining traction for storage workloads in the cloud, organizations also want the data to be closer. That’s helping to drive adoption of distributed file systems that can make cloud-resident data appear as though it’s stored on-premises.
Object stores lack capabilities related to speed and consistency. This functionality gap is being addressed by distributed file stores that make an Amazon Simple Storage Service (S3)-compatible object store — in the cloud or on-premises — look and function like a fast local file system, no matter where users are located around the globe.
These act as software-based acceleration gateways: the user points a local namespace file system at a remote cloud object store, which starts streaming the requested file. The gateway uses techniques such as prefetching and local caching to keep the hottest data local, and it provides compression and encryption without fully synchronizing the local data store with the cloud object store. Only the metadata is synchronized, and the relevant parts of each file are streamed on demand.
In the next couple of years, Gartner foresees a mass transformation of centralized data lakes into distributed data lakes across both on-premises and multicloud. The idea is that, no matter where those data stores are located, they can be made available in a single global namespace and managed like a single repository without changing application functionality.
Technical professionals should also evaluate Apache Hadoop Ozone for uniform access to data stores, irrespective of whether the underlying storage is HDFS or object store. Apache Hadoop Ozone is an S3-like distributed and scalable object store that exposes REST API and Hadoop remote procedure call (RPC) interfaces. Apache Ozone is a strongly consistent object store, like HDFS, and uses the Raft consensus protocol. Ozone is still in development and not yet generally available. Because it is a new development, it was designed to run natively in containers, with storage interfaces for Kubernetes and Apache Hadoop YARN. In 2020, explore the use of Ozone to natively run Spark, Hive and MapReduce jobs.
Figure 8 shows the next generation of data storage architectures with object stores.
Figure 8. Next-Generation Data Storage Architecture
Next-Generation Data Storage Architecture
Data architects should understand the nuances of working with a file system versus an object store. Things to be careful of when working with object stores include:
  • Object stores have higher input/output (I/O) variance than file systems like HDFS: Applications that need consistent I/O throughput should not use object stores.
  • Object stores are immutable and do not support appends and truncates: Implementing transactional and operational workloads on object stores can be a challenge.
  • Object stores are not Portable Operating System Interface (POSIX)-compliant: Many government and other security regulations require POSIX-compatible file system access and are thus limited with object stores.
  • Object stores have greater latency than HDFS: Applications that need consistent latency should not use object stores.
As technical professionals look to migrate more workloads to the cloud, they should carefully examine the pros and cons of object stores:
  • Lower costs: Object stores use a pay-for-what-is-used model instead of requiring the upfront purchase of storage hardware.
  • Decoupled compute and storage: With the network becoming faster than the disk and the move toward decoupled compute and storage clusters, object stores can be accessed from any cluster — whether for batch processing, stream processing, machine learning or analytics.
  • Slow directory listing: Object stores enumerate keys rather than walk a directory tree, so listing large “directories” is slow. Organizations must not treat object stores as traditional filesystems.
  • Eventual consistency: Object stores are not immediately consistent, and this has implications when designing strict transactional semantics on object stores.
  • Nonatomic bucket renaming: Bucket renaming is not atomic because buckets are metadata pointers, not true directories. Hence, under the covers, a copy with a different name followed by a delete happens, and these two operations are not atomic.
The most common cloud object stores are:
  • Amazon S3: This is the most widely used object store and has a well-defined API.
  • Azure Data Lake Storage Gen 2: This has capabilities to behave both as an object store and a filesystem. This is a very new feature and needs to be scale tested before organizations try to use it across the enterprise.
  • Google Cloud Storage (GCS): This is offered on Google Cloud Platform (GCP).
Extension of the object store API to include “serverless” SQL query capabilities, with tools like S3 Select and SQL engines like Presto, is becoming very popular. This allows SQL queries to be executed directly on the objects, returning only the result sets instead of downloading the whole object to the compute layer and then running the queries. This significantly increases efficiency and throughput. The cost model is attractive because users pay only for what they use. Amazon provides a managed, Presto-based offering called Amazon Athena, and Starburst provides a third-party Presto-based alternative for AWS and Azure.
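A hedged sketch of how S3 Select is invoked via boto3's `select_object_content` call. The bucket, key and query below are placeholders; only the request construction is executed here, because the actual call requires AWS credentials:

```python
# Build the request for S3 Select; the SQL runs inside the storage service
# and only the matching rows are returned to the caller.
def build_select_request(bucket, key, sql):
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"JSON": {}},
    }

params = build_select_request(
    "example-bucket", "sales/2019.csv",
    "SELECT s.region, s.amount FROM s3object s WHERE s.amount > 100",
)
print(params["ExpressionType"])  # SQL

# With credentials configured, the call would look like:
# import boto3
# response = boto3.client("s3").select_object_content(**params)
# for event in response["Payload"]:
#     if "Records" in event:
#         print(event["Records"]["Payload"].decode())
```

Pushing the filter into the object store avoids transferring the full CSV, which is the efficiency gain described above.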
Advancements in Private Cloud/On-Premises Solutions
For the foreseeable future, many workloads will continue to run on-premises. One of the major reasons for this is restrictions posed by compliance regulations. On-premises workloads may also provide better costs for use cases that involve continuous and intensive use of compute and storage services. The biggest advantage of private cloud is that it offers more opportunity to tune the hardware to specific use cases.
Enterprise data centers are increasingly looking like private cloud, with object storage as the de facto storage for on-premises. A variety of companies offer S3 compatible object storage that can be installed anywhere. Some examples include Ceph, Cloudian, LeoFS, MinIO, OpenIO, Riak Cloud Storage (CS) and SwiftStack. MinIO has pioneered S3-compatible object storage for on-premises data lakes.
Organizations looking into going with hybrid or multicloud architectures should research the next generation of distributed file systems, including Elastifile Cloud File System (ECFS), Hedvig Distributed Storage Platform, Lustre, Qumulo and WekaIO Matrix. These emerging technologies expose an object-based API and share features of object stores, but they look like traditional file systems under the covers.
Explore New Metamodel Choices for Multistructured Data
Organizations should actively pursue a multimodel data store strategy, but many use cases benefit from specialized data stores. In 2020, we see the following data stores gaining higher traction:
  • Graph databases
  • Time series and geospatial databases
  • Ledger databases
Graph Databases
Graph data stores are one of the fastest-growing categories in data management. Gartner is seeing a robust increase in clients interested in deploying graph databases because they model relationships flexibly and find and traverse those relationships much more efficiently than traditional relational databases. Applying graphs to complex domains can often simplify modeling and make it intuitive and scalable. Graph data stores are rapidly maturing for enterprise use cases to support OLTP (ACID), OLAP and operational requirements.
A graph database is an online database management system with create, read, update and delete (CRUD) operations working with connected data. Relationships are first-class citizens in graph data stores. This means that applications do not have to infer data connections using any implementation-specific technique, like primary or foreign keys. By assembling the abstractions of nodes and relationships into connected structures, graph data stores enable building simple but sophisticated models that map closely to the problem domain.
Graph data stores are intuitive because they mirror the way the human brain thinks and maps associations. A graph data store efficiently stores connected data and performs graph queries efficiently using graph-traversal techniques. These are designed to excel at problems where there is no a priori knowledge of path length by using graph storage and infrastructure to find neighbors efficiently.
Graph data stores boast deep traversal capabilities and can quickly process data and relationships at scale, which starts to break down for relational data stores. Graph data stores can change schemas online while continuing to serve queries. Relational databases can’t support the frequent schema changes that are now so commonplace in the modern data management era.
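The "no a priori knowledge of path length" point can be illustrated with a breadth-first traversal over a simple adjacency map. This is a toy stand-in for a graph database's traversal engine, using hypothetical node names from the clinical-testing example:

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search: finds the shortest path without knowing its length."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path exists

# Relationships are first-class: no foreign-key joins needed to traverse them.
graph = {
    "patient": ["trial", "drug"],
    "trial": ["clinic"],
    "drug": ["treatment"],
    "clinic": ["treatment"],
}
print(shortest_path(graph, "patient", "treatment"))  # ['patient', 'drug', 'treatment']
```

A relational database would express each hop as a join, so query cost grows with path length; a graph store's index-free adjacency keeps each hop cheap regardless of depth.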
In 2020, technical professionals should explore increased use of graph databases but should consider the following factors:
  • Consider graph databases to manage a very high number of relationships (e.g., social networking) or for modeling complex domains such as clinical testing, where there are strong relationships between elements such as patients, drugs and treatments. Other use cases may include hardware asset management and data lineage.
  • Explore use of graph databases to better understand your customers via MDM or customer journey analytics.
  • Investigate OLTP and real-time use cases such as recommendation engines, route optimization or fraud analytics.
  • Use graph databases for building knowledge graphs to share and publish open data for data catalogs, metadata management, and semantic or text processing.
  • Do not use graph databases for aggregations, BI and queries involving joins.
  • Do not use graph data stores for graph processing of large-scale datasets.
Some of the tools and vendors in this space include AllegroGraph, ArangoDB, DataStax, MarkLogic, Neo4j and SAP-OrientDB.
Time Series and Geospatial Databases
Along with graph databases, time series databases (TSDBs) represent one of the faster-growing data store categories. In addition, organizations want to incorporate real-time analytics on time series with geospatial capabilities. Some of the fastest-growing companies on the planet need multidimensional analytics on real-time data to run their business on databases that can provide answers with subsecond response times.
Time Series Databases
Time series data is everywhere. Generated from mobile devices and sensors, it has become a common use case across many domains and a database category. Metrics, events and other time-based data are generated at an exponential rate, and there is a growing requirement for analyzing time series data. Time series databases are providing a platform, tools and services to collect and calculate metrics and events data and to analyze and act on the data using visualizations, alerts and notifications.
Time series databases are allowing organizations to develop next-generation monitoring, analytics, and IoT applications faster, easier and at scale to quickly deliver real business value.
Time series data involves measurements or events that are tracked, monitored and aggregated over time. A key difference with time series data from regular data is that one is always asking questions about it over time. Time series databases allow users to collect inputs from multiple and diverse sources and to store, query, process and visualize raw high-precision data.
Time series databases are being built with components to provide an end-to-end solution to solve problems with time series data out-of-the-box without application code. These components include data collection, storage, visualization and rule engines for processing, monitoring and alerting.
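The core storage-and-query pattern behind these components — bucketing measurements into fixed time windows and aggregating them — can be sketched as follows (illustrative only, not any TSDB's API):

```python
from collections import defaultdict

def downsample(points, window_seconds):
    """Average (timestamp, value) points into fixed time windows."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_seconds].append(value)  # window start time
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# CPU readings every 20 seconds, downsampled to 60-second averages.
readings = [(0, 10.0), (20, 20.0), (40, 30.0), (60, 40.0), (80, 60.0)]
print(downsample(readings, 60))  # {0: 20.0, 60: 50.0}
```

Production TSDBs apply the same idea with compressed columnar storage, retention policies and continuous queries that downsample data as it arrives.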
Some of the tools and vendors in this space include: Amazon Timestream, Apache Druid, Apache IoTDB, Graphite, InfluxData (InfluxDB), Prometheus, Timescale and Uber M3.
In 2020, technical professionals looking to incorporate time series databases should consider the following factors:
  • IoT: Many Gartner clients in the industrial space already have proprietary “historian” databases, which were built for operational and batch needs. Use the new TSDBs, which can analyze data in real time and provide the ability to use ML models.
  • Log data: Aggregate log data from multiple sources, including operating systems, applications, databases, servers, networking and security equipment. Also, collect metrics information and visualize it through Grafana, Kibana, Prometheus and the like.
Figure 9 shows a high-level architecture for most time series databases.
Figure 9. Architecture for Time Series Database
Architecture for Time Series Database
Geospatial Databases
The rapidly increasing amount of location data available in many applications has made it necessary to build specialized tools, frameworks and databases to process large-scale spatial data and spatial queries. GPS devices and smartphones have generated a huge amount of location data. Organizations need tools that support ad hoc exploration of data, arbitrary drill-down including geospatial filtering, and geospatial dimensioning of data.
Geospatial data is pervasive. It is found in mobile devices, sensors, logs and wearables. This data’s spatial context is an important variable in many predictive analytics applications. To benefit from spatial context in a predictive analytics application, organizations need to be able to parse geospatial datasets at scale, join them with target datasets that contain point-in-space information, and answer geometrical queries efficiently.
Relational DBMSs without spatial extensions are poorly suited to storing and manipulating geospatial data because of the complexity of the data structures needed to represent geometric information and its topological relationships. Restricted to standard alphanumeric data types, they force geospatial data to be decomposed into simpler components spanning several rows, which complicates the formulation and execution of queries on geospatial objects.
The best example of everyday usage of geospatial data is Uber’s use case, where there is a need to provide granular geospatial slicing and dicing of data to understand the marketplace properties of some geo-temporal segment of the data.
Geospatial analytics and queries need systems that can support indexed spatial joins based on point-in-polygon tests and point-to-polyline distance computations. These routine computations in geospatial domains often result in intense trigonometric calculations, which frequently scale and perform better on GPUs.
Geospatial databases provide flexible data structures for optimized performance and analysis. They support special data types such as point, line and polygon. Support for multidimensional spatial indexing enables efficient processing of spatial operations. Out-of-the-box spatial functions can be accessed in SQL for querying spatial properties and relationships.
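The point-in-polygon test mentioned above can be sketched with the classic ray-casting algorithm. Spatial databases accelerate such tests with multidimensional indexes (e.g., R-trees) rather than scanning every edge, but the geometric predicate itself looks like this:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from (x, y) crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Edge straddles the ray's y, and the crossing lies to the right of x.
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon(0.5, 0.5, unit_square))  # True
print(point_in_polygon(2.0, 0.5, unit_square))  # False
```

An odd number of crossings means the point is inside; an even number means outside. Spatial joins apply this predicate between a table of points and a table of polygons.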
Some examples of databases that support geospatial capabilities include Esri ArcSDE, GeoMesa, Oracle Spatial and Graph, and SQL Server Spatial. In addition, PostGIS adds spatial functions such as distance, area, union and intersection, as well as a specialty geometry data type, to PostgreSQL. These are a few options for managing and analyzing geospatial data.
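The point-in-polygon test at the heart of these spatial joins can be sketched in pure Python using the classic ray-casting algorithm. This is only an illustration of the routine computation; spatial databases implement far more optimized, index-backed versions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal
    ray from (x, y) crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y-coordinate?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A 4x4 square anchored at the origin
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))   # True  (inside)
print(point_in_polygon(5, 2, square))   # False (outside)
```

A spatial index (such as an R-tree) would first prune candidate geometries so this exact test runs only on a small shortlist.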
Geospatial support in databases enables mashups across geospatially aware datasets, which is very useful for:
  • Social networking applications
  • Cross-organizational data platforms
  • Logistics
  • Command and control centers
  • Traffic control
Ledger Databases
A ledger database provides a transparent, immutable and cryptographically verifiable transaction log owned by a central trusted authority. Such databases track data changes and maintain a complete and verifiable history of changes over time to ensure data integrity. A ledger database is a nonrelational database and can store semistructured data using a document-oriented data model. Ledger databases also implement ACID properties, thus keeping transactions valid and secure.
Gartner estimates (see “IT Market Clock for Database Management Systems, 2019”) that ledger databases will emerge and capture at least 20% of the permissioned blockchain market over the next couple of years.
Ledger databases provide the right path for enterprises to take advantage of ledger-like technology without the complexities of blockchain management. Enterprises that need to use blockchain but do not want to deal with management and governance can easily deploy and manage a ledger database. Ledger databases support SQL to integrate with other applications in their domain.
Ledger databases support a simpler and cost-effective way to implement blockchain use cases and achieve a cryptographically and independently verifiable audit trail of immutable data. This is useful for establishing a system of record and for satisfying various types of compliance requirements such as regulatory compliance.
Use cases for ledger databases include the following:
  • When there is a trusted authority recognized by all participants and centralization is not an issue. With relational databases, one has to engineer an auditing mechanism because the database is not inherently immutable. Auditing mechanisms built with relational databases can be hard to scale. They put the onus on the application developer to ensure that all the right data is being recorded.
  • Ledger databases make it easy to track how data has changed over time, eliminating the need to build complicated audit functionality within the application. They have a journal, which is an immutable log, where transactions are appended as blocks of data. After a transaction gets written as a block into the journal, it cannot be changed or deleted — it is a permanent record.
  • Organizations can implement them instead of using permissioned blockchains governed by consortia.
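The immutable journal described above can be illustrated with a minimal hash chain in Python, where each block embeds the hash of its predecessor so that any tampering with history is detectable. Real ledger databases layer Merkle trees, digests and query interfaces on top of this basic idea.

```python
import hashlib
import json

class Journal:
    """Minimal append-only journal: each block stores the hash of the
    previous block, so altering any historical entry breaks the chain."""

    def __init__(self):
        self.blocks = []

    def append(self, record):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        payload = json.dumps(record, sort_keys=True)
        block_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self.blocks.append({"record": record, "prev": prev_hash, "hash": block_hash})

    def verify(self):
        """Recompute every hash from the start; any mismatch means tampering."""
        prev_hash = "0" * 64
        for block in self.blocks:
            payload = json.dumps(block["record"], sort_keys=True)
            expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
            if block["prev"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = block["hash"]
        return True

journal = Journal()
journal.append({"txn": 1, "amount": 100})
journal.append({"txn": 2, "amount": -40})
print(journal.verify())                      # True: chain is intact

journal.blocks[0]["record"]["amount"] = 999  # tamper with history
print(journal.verify())                      # False: tampering is detectable
```

This is why an application built on a ledger database does not need its own audit-trail engineering: the verifiable history is a property of the storage layer itself.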
Amazon Quantum Ledger Database (QLDB) is a fully managed ledger database that became generally available in September 2019. It has been in use internally at AWS to store configuration data for some of its key systems.
Optimize Databases via In-Memory and Nonvolatile Persistent Memory DBs
Persistent memory represents a new opportunity for transforming data architectures. This proprietary technology from Intel shipped in 2019 but is expected to become mainstream in 2020 as prices drop. Persistent memory is a storage technology that augments dynamic random-access memory (DRAM) by providing very low-latency, high-capacity storage at a price that will eventually be comparable to solid-state drives (SSDs). However, unlike DRAM, persistent memory is nonvolatile. It is also known as 3D XPoint, Intel Optane DC persistent memory, nonvolatile random-access memory (NVRAM) and persistent memory (PMEM).
Historically, DRAM has been the costly but reliable byte-addressable memory solution, but it lacks the economics of the much less expensive and denser, yet slower, nonvolatile NAND flash memory used as block-addressable storage. Whereas flash memory provides microsecond-scale data access, persistent memory provides sub-microsecond, near-DRAM access.
Figure 10 shows various memory and storage options.
Figure 10. Memory and Storage Hierarchy
Intel Optane technology has been available for many years in SSD form but is new in the memory format. Database vendors have started supporting the persistent memory option because it allows the creation of large databases, such as data warehouses, on single servers with very high performance. Initially, the cost will be higher until mass production ramps up.
However, persistent memory is available in two modes — and it is very important to understand this distinction because support among database vendors varies. The two modes are:
  • Memory mode: Many DBMSs support this because it does not require any change to the DBMS code, and it helps in increasing the memory pool. Examples of databases providing this include MariaDB and MarkLogic. In this mode, there is no persistence.
  • App Direct: This is the optimized mode where the DBMS system calls have to be recoded to take advantage of the memory. This option supports persistence. As of September 2019, a few of the databases using the App Direct option included Aerospike, Redis and SAP HANA. Although the software is available today, it is only usable on test systems from selected hardware vendors and on GCP.
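App Direct persistence is typically surfaced to applications as memory-mapped files on a DAX-enabled filesystem, programmed with loads, stores and explicit flushes (for example, via Intel's PMDK libraries). The pattern can be approximated with an ordinary memory-mapped file in Python; note this uses a plain file as a stand-in, not real persistent memory.

```python
import mmap
import os

PATH = "journal.dat"   # on real hardware, this would live on a DAX filesystem
SIZE = 4096

# Create and size the backing file
with open(PATH, "wb") as f:
    f.write(b"\x00" * SIZE)

# Map the file into the address space and update it with byte-level stores,
# mirroring the load/store programming model that App Direct mode exposes
with open(PATH, "r+b") as f:
    with mmap.mmap(f.fileno(), SIZE) as mm:
        mm[0:5] = b"hello"
        mm.flush()     # analogous to flushing CPU caches to the persistence domain

# The data survives the unmap; with true persistent memory it would
# survive a power loss with no serialization or I/O path in between
with open(PATH, "rb") as f:
    print(f.read(5))   # b'hello'

os.remove(PATH)
```

The point of App Direct is that the DBMS skips the block-I/O stack entirely, which is why vendors must recode their storage engines to use it.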
In 2020, data and analytics professionals should explore introducing this option into their portfolios. Key factors to consider include:
  • As the cost drops, and as servers with persistent memory become widely available, the use of in-memory DBMS will grow.
  • This is one area in this document where little or no change is needed to the skills required to use this new technology. However, it can have a positive impact on the organization in areas of consolidation, more-efficient HA for in-memory systems, higher performance and the enabling of architectures such as HTAP.
  • Other use cases that may benefit include LDW because persistent memory will allow larger, more-efficient virtualization environments. Analytics will also benefit due to the ability to cache more data in memory.

Revolutionary Changes in Data Management Will Drive IT to Adopt New Operating Models and Roles

The history of computing includes many inventions meant to allow technologists to focus more on business realities and less on technological idiosyncrasies. These inventions run the gamut from assembly language to model-driven development to data warehouse automation. Although these inventions delivered improved experiences for technical professionals, they also introduced the need for new operating models and roles. For example, the evolution from machine language to assembly language to higher-level languages introduced the need for advanced debugging techniques and tools. Likewise, the move to model-driven development necessitated the need for advanced modeling notations, automatic forward engineering tools and the specialists to make them work.
In 2020, this pattern will continue. For example, the move toward the cloud and “as a service” solutions will continue to push much of the operational workload out of enterprise IT and onto cloud-based providers. This accelerating move to the cloud will change IT operating models in two ways:
  • It will present new opportunities
  • It will impose new responsibilities
The new opportunities are obvious: Technical professionals will be able to spend more time focusing on the business problem, allowing cloud providers to handle onerous operational burdens like hardware provisioning and performance tuning — and allowing enterprise IT to avoid manually creating artifacts such as data integration flows.
The new responsibilities are less obvious — and consequently more important to include in your planning. These new responsibilities include securely managing multicloud environments and carefully monitoring and controlling how automatic provisioning of cloud resources can influence expenses.
Keeping in mind both the new opportunities and new responsibilities, technical professionals should plan to perform these activities in 2020:
  • Opportunity: Increase support for “data democratization” initiatives. For example, as iPaaS features improve, technical professionals should help their organizations capitalize on the improved user interface by supporting nontechnical users as they design their own data integration pipelines. This is made possible not just by improved user interfaces, but also by AI/ML-powered features that recommend common pipeline-design patterns and optimal times to run deployed pipelines.
  • Responsibility: Enhance governance to control artifacts created by self-service professionals. For example, as self-service data wranglers create data integration pipelines, use data governance to control their proliferation, to promote the pipelines that prove to be widely valuable, to impose naming and quality standards on them, to merge similar pipelines and to decommission abandoned pipelines.
  • Opportunity: Redirect IT resources to business-focused activities. For example, as database designers spend less time on the design of performance-tuning constructs, indexes and sharding strategies, retrain those experts to focus more on logical and conceptual modeling. Likewise, deploy technical integration experts to support self-service data wranglers as necessary.
  • Responsibility: Monitor and control expenses caused by elastic cloud provisioning. Consider what happens when a developer runs a poorly designed query against a cloud database, such as a query that materializes the cartesian product of two large relations. The cloud provider might automatically provision the resources to store that result, yielding an astronomical invoice. (In previous environments, the rapacious query would merely exhaust the finite storage resources of the on-premises architecture and fail.) IT professionals will need to monitor and control automatic provisioning capabilities. This will be especially important during the development phase of the software development life cycle (SDLC), when experimentation is common.
  • Responsibility: Include billing testing in POCs of cloud and “as a service” solutions. While performing proof of concept (POC) tests to evaluate cloud service providers, IT professionals will need to collect data about billing. The test beds for POCs might even expand to include queries that you have no intention of running in production environments, such as correlated subqueries or n-way cartesian products. These are the queries that can blow your budget.
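The budget risk from a runaway cartesian product is easy to quantify with back-of-the-envelope arithmetic. The table sizes and the storage price below are illustrative assumptions, not any vendor's actual rates.

```python
# Two moderately sized tables joined without a join predicate
rows_a = 10_000_000          # 10M rows
rows_b = 5_000_000           # 5M rows
avg_row_bytes = 200          # combined width of one result row

result_rows = rows_a * rows_b                 # cartesian product: 50 trillion rows
result_bytes = result_rows * avg_row_bytes
result_tb = result_bytes / 10**12             # 10,000 TB (10 PB)

price_per_tb_month = 20.0    # hypothetical dbPaaS storage rate, USD

print(f"Result size: {result_tb:,.0f} TB")
print(f"Monthly bill just to store it: ${result_tb * price_per_tb_month:,.0f}")
```

On-premises, this query would simply fail when disks filled; an elastic cloud platform may instead provision the petabytes and invoice you for them, which is exactly why billing tests belong in the POC.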
Another key trend that will help IT specialists focus on the business is the increased use of LDWs, which can reduce the time to stand up or prototype analytics-ready datasets. LDW technology is used to integrate relevant data across heterogeneous platforms rather than collecting all data in a monolithic warehouse. Key business benefits can be achieved by applying advanced analytics to these sources of data — and by providing business users with more self-service data access and analysis capabilities. IT operations will be affected by the ability to support accelerated analysis capabilities, by providing analysis-ready datasets that include both structured and unstructured data, and by role changes with regard to interactions with business users.

Planning Considerations

Given the trends described above, technical professionals should focus on the following planning considerations in their 2020 planning for data management:
  • Evaluate “as a service” options such as dbPaaS to reduce IT overhead
  • Enhance data management tools with autonomous capabilities
  • Future-proof data management investments by using open standards on multicloud
Evaluate “as a Service” Options Such as dbPaaS to Reduce IT Overhead
DBMS vendors are innovating in a cloud-only or on a cloud-first basis and are moving DBMS deployment options and support models to a cloud-based future. The emphasis that DBMS vendors are placing on dbPaaS offerings has not gone unnoticed by modern enterprises. Indeed, we are seeing pervasive adoption of dbPaaS in 2019. This trend toward dbPaaS will continue, and by 2023, Gartner estimates that 75% of all databases will be on a cloud platform (see “Predicts 2019: Data Management Solutions”).
dbPaaS provides an alluring option where the database vendor is responsible for handling most of the administration tasks, leaving organizations’ on-premises database administrators to focus on more value-added business initiatives.
The potential benefits of “as a service” cloud options can be placed into two main categories. They are:
  • Cost and efficiency (primarily financial issues): This includes considerations of capex versus opex, total cost of ownership (TCO), reduced complexity and leverage.
  • Agility and innovation (primarily opportunity issues): This often includes focus on speed, time to market, and business and IT agility.
These benefits apply especially to cloud-based data management infrastructure. In fact, the major public cloud service providers are becoming the de facto platforms for enterprise data management.
The benefits of dbPaaS can provide more database power with a more flexible cost structure than on-premises databases. The dbPaaS benefits include:
  • Reduced upfront investment
  • Pay-for-use pricing
  • Continuous and automatic software updates
  • Dynamic scale-up and scale-out of compute and storage
Most enterprises are using multiple public cloud vendors, and data and analytics technical professionals are expected to design architectures that minimize the problems and harness the power of multicloud deployments. Multicloud architectures offer some potential benefits for databases and data-centric solutions but involve greater complexity, cost and effort.
In 2020, data and analytics technical professionals should:
  • Make the cloud the default deployment option for enterprise database systems. Justify any databases that remain on-premises. Perform new development in the cloud, and migrate legacy on-premises databases to dbPaaS where feasible.
  • Embrace dbPaaS. Transition database administrators (DBAs) from low-level database maintenance tasks, and focus them on solution architecture, performance optimization, database DevOps, security and compliance, and research and development with new data-related technologies.
  • Standardize on a public cloud platform and on the particular database types you need. Control your cloud adoption through a policy of standardization. Consider Azure if yours is primarily a Microsoft shop. Explore AWS for moving diverse technology to the cloud. Investigate OCI for Oracle Database workloads and for Oracle Enterprise Manager applications.
  • Adopt a multivendor but single-cloud architecture approach to mitigate concerns of cloud vendor lock-in. Make a single-cloud architecture your first choice for databases and data-centric solutions, and use multicloud solution architectures only where the benefits of multicloud are compelling or the need for multicloud is undeniable. Set realistic expectations regarding multicloud cherry picking, system robustness, and cloud neutrality and portability. Cherry picking the best parts of each cloud platform is often not worth the effort. System resiliency, agility and economy will suffer in poorly implemented multicloud architectures. And complete cloud neutrality and portability for enterprise databases and data-centric solutions is generally not feasible for enterprise use cases.
  • Select the simplest multicloud architecture that achieves the business goals. To minimize the complexity and costs of multicloud solutions, start by splitting portions of a data-to-analytics pipeline between cloud platforms. Adopt advanced multicloud architectures only after thorough prototyping and exhaustive testing.
Enhance Data Management Tools With Autonomous Capabilities
Just as ML and AI capabilities are transforming analytics, BI and data science, vendors across data management categories are adding ML capabilities and AI engines to make self-configuring and self-tuning processes pervasive. These processes are automating many manual tasks and allowing users with fewer technical skills to be more autonomous when using data. As a result, highly skilled technical resources can focus on higher-value tasks. This trend is impacting all enterprise data management categories, including data quality, metadata management, MDM, data integration and databases.
With technical skills in short supply, the need to automate data management tasks that comply with governance policies, rules and processes is increasing rapidly. Even mundane (and easily accomplished) tasks are simply too numerous in a distributed environment to meet the critical demands of scaling at the speed of digital business.
New ML- and AI-based data management tools can analyze usage statistics and infer emergent metadata from data utilization to build models that automate tasks. For example, if your organization runs end-of-quarter reports that require a high number of compute servers, cloud vendors can predict the pattern and automatically provision extra resources in advance to handle your needs and meet the SLAs.
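As a simplified illustration of how such a pattern could be inferred, the sketch below flags a recurring every-third-month spike in compute demand from raw usage history. Production systems use far more sophisticated time-series and ML models; the function and data here are hypothetical.

```python
def recurring_spike_period(usage, threshold=1.5):
    """Return the interval (in periods) at which usage spikes recur,
    or None if spikes are absent or not evenly spaced."""
    mean = sum(usage) / len(usage)
    spikes = [i for i, u in enumerate(usage) if u > threshold * mean]
    if len(spikes) < 2:
        return None
    gaps = {b - a for a, b in zip(spikes, spikes[1:])}
    return gaps.pop() if len(gaps) == 1 else None

# Monthly compute-hours with an end-of-quarter reporting spike
monthly_usage = [100, 110, 400, 105, 95, 420, 100, 115, 390]
period = recurring_spike_period(monthly_usage)
print(period)   # 3 -> pre-provision extra capacity every third month
```

Once the period is known, the platform can schedule extra capacity just before the next expected spike rather than reacting after SLAs are already at risk.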
In 2020, data and analytics technical professionals can benefit from the increased automation and autonomous capabilities as follows:
  • Augment the tasks of data engineers, especially manual ones that have the highest propensity of introducing errors and biases. In addition, this can lead to higher performance and optimization of the data orchestration and pipeline.
  • Create automated system responses to errors, issues and alternative interpretation of data.
  • Increase the capability to use publicly available data, partner data, open data and other assets whose suitability for use is currently difficult to determine.
  • Automate data discovery, such as sensitive and personally identifiable information (PII) data and autotagging. This capability should also be leveraged for classifying the data and supporting creation of the semantic layer. Many data governance and compliance use cases such as remediation of data quality, MDM, data tiering and archiving benefit from increased automation.
  • Continuously monitor capacity and utilization of data management environments to redistribute resources and refine capacity planning, even across multicloud and on-premises implementations.
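Automated PII discovery and autotagging of the kind described above often start with simple pattern matching over sampled column values. The detectors below are deliberately minimal illustrations; real tools combine patterns, checksums, column-name heuristics and ML classifiers.

```python
import re

# Illustrative detectors only, not production-grade PII rules
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}$"),
}

def autotag_column(sample_values, match_ratio=0.8):
    """Tag a column with the PII type that most of its sampled values match."""
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in sample_values if pattern.match(str(v)))
        if hits / len(sample_values) >= match_ratio:
            return tag
    return None

print(autotag_column(["ann@example.com", "bob@example.org", "c@d.io"]))  # email
print(autotag_column(["123-45-6789", "987-65-4321"]))                    # us_ssn
print(autotag_column(["widget", "gadget", "sprocket"]))                  # None
```

Tags produced this way feed directly into the governance use cases listed above: masking rules, access policies, data tiering and archiving can all key off the autotagged classification.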
Future-Proof Data Management Investments by Using Open Standards on Multicloud
Standards for data management are a combination of open standards, de facto standards and commonly used technologies. Despite the differing degrees of standardization that exist in the data management marketplace, whatever the standards happen to be, enterprises can derive significant benefits by using them. These benefits include a larger pool of skilled developers and practitioners in the labor market and some reduction of vendor lock-in.
Standards for Databases and Data-Centric Solutions
For data management, the most significant standards are the database-related ones. And the most prominent of these standards is Structured Query Language (SQL). SQL has proven to have surprising staying power within modern data architectures. Its ability to provide set-based processing of data has made it uniquely powerful for data management, compared to all other query and programming languages.
Other standards for data management include ODBC/JDBC, JSON, HTML, XML and REST. These standards continue to demonstrate their usefulness in data-centric enterprise solutions and in web-based applications. IoT systems and microservices architectures (MSAs) use these standards, plus additional message formats such as AMQP, MQTT, Apache Thrift, Google Protocol Buffers or Apache Avro.
In 2020, Gartner expects to see greater support from various database vendors for GraphQL, the open-source data query and manipulation language for APIs. GraphQL provides an efficient way to query data because it retrieves all the needed data and its references in a single call, as opposed to a REST API, which requires multiple calls to different URLs. The results are returned as JSON strings.
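The round-trip difference can be illustrated with a toy in-memory resolver. The dataset and functions are hypothetical stand-ins, not a real GraphQL library: the point is simply that a resource-per-URL REST style needs 1 + N calls to assemble a nested result, while a GraphQL-style query describes the whole shape in one request.

```python
# Toy dataset: customers and their orders
CUSTOMERS = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
ORDERS = {101: {"customer": 1, "total": 30},
          102: {"customer": 1, "total": 70},
          103: {"customer": 2, "total": 15}}

calls = {"rest": 0, "graphql": 0}

# REST style: one endpoint per resource -> multiple round trips
def rest_customer(cid):
    calls["rest"] += 1                      # GET /customers/{cid}
    return CUSTOMERS[cid]

def rest_orders(cid):
    calls["rest"] += 1                      # GET /customers/{cid}/orders
    return [o for o in ORDERS.values() if o["customer"] == cid]

def rest_customer_with_orders(cid):
    data = dict(rest_customer(cid))
    data["orders"] = rest_orders(cid)
    return data

# GraphQL style: one request resolves the whole nested shape
def graphql_query(cid):
    calls["graphql"] += 1                   # POST /graphql with one query document
    return {"name": CUSTOMERS[cid]["name"],
            "orders": [o["total"] for o in ORDERS.values() if o["customer"] == cid]}

rest_customer_with_orders(1)
graphql_query(1)
print(calls)   # {'rest': 2, 'graphql': 1}
```

For deeper nesting (customers, orders, line items, products), the REST call count grows with each level, while the GraphQL count stays at one.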
Virtualization
Virtualization is the foundation of cloud computing. Although it cannot quite be considered a standard, virtualization is certainly a commonly used technology that lowers skill demands and can reduce vendor lock-in. Gartner’s “Decision Point for Selecting Virtualized Compute: VMs, Containers or Serverless” explains:
  • A series of virtualization methods have been introduced with increasing levels of abstraction. These methods make better use of resources by provisioning them at ever-finer granularity, while also enabling new operational models. Each of these methods has refined the degree to which resources can be assigned to virtualized applications:
    • Virtual Machines
    • Containers
    • Serverless
  • These three options are not mutually exclusive, but do complement each other. Container orchestration platforms are routinely deployed on virtual infrastructure, and some serverless computing approaches are based on containers.
Multicloud
In most large enterprises, the use of multiple public cloud platforms is the reality. Multicloud solutions must be well-designed to ensure that the benefits outweigh the costs. Technical professionals will be called on to design and implement effective multicloud architectures for their enterprise databases and data-centric solutions. For guidance on multicloud architectures for data and analytics, see “Multicloud Architectures for Enterprise Databases and Data-Centric Solutions.”
Cloud Neutrality and Portability
It is important to have the right expectations for multicloud neutrality and portability. Gartner provides guidance on cloud application portability in “A Guidance Framework for Architecting Portable Cloud and Multicloud Applications.” The first key finding of that document states, “cloud application portability is a worthwhile goal, but it must be tempered with pragmatism.” It states further, “where you determine portability to be an important priority, you should design cloud applications to be contextually independent. As an end state, contextual independence is difficult — and sometimes impossible — to achieve.”
It is difficult to achieve cloud application portability, and cloud database portability is even harder. One option for cloud database portability is to try to use container technology for the databases. Containers have the potential to work well with analytic MPP databases where there may be large databases run by one or more master nodes and multiple worker nodes. Some NoSQL database vendors provide support for their databases to run in containers, and most of the RDBMSs can also run in containers for dev and test deployments. However, production workloads for enterprise-class, mission-critical operational databases running in containers are not mainstream for enterprise use cases as of today. Container technology is still evolving, and the ability to run operational database workloads reliably in containers is not yet fully mature. For guidance on using databases in containers, see Gartner’s “Decision Point for Selecting Stateful Container Storage.”
Data and analytics technical professionals should:
  • Standardize on a public cloud platform and on a small set of database types. To mitigate concerns of vendor lock-in, enterprises can adopt a multivendor approach, in which they use a primary cloud vendor for a majority of workloads and a second cloud vendor as an alternative.
  • Make a single-cloud architecture your first choice for databases and data-centric solutions, and use multicloud solution architectures only where the benefits of multicloud are compelling or the need for multicloud is undeniable.
  • Adopt open standards, de facto standards, and commonly used technologies for all enterprise-class, data-centric solutions and architectures.

New Regulations and Compliance Will Demand Comprehensive Distributed and Coordinated Data Governance

Increasingly, organizations will strive to comply with stringent regulations, especially regulations that protect consumer privacy. Compliance and regulations are obvious drivers for improving governance of corporate data assets. Inability to meet compliance directives can lead to hefty penalties. The EU General Data Protection Regulation (GDPR) came into effect on 25 May 2018, and since then, many governments around the world have started enacting strict data privacy laws. For example, Brazil will launch its version of GDPR, called LGPD, in August 2020. In the U.S., as of September 2019, at least 12 states had passed laws to regulate data privacy. The first one to come into effect will be the California Consumer Privacy Act (CCPA) on 1 January 2020.
In addition, data governance has become more challenging as data straddles edge, on-premises and multiple cloud environments. As a result, most public cloud vendors, such as AWS, Azure, GCP, IBM and Oracle, and other vendors that support very large data infrastructure (e.g., Hadoop vendors like Cloudera), have enhanced their data governance offerings in 2019.
The regulations will compel data architects to support business aspirations such as gaining:
  • The ability to automatically classify existing data assets (e.g., as personal health data, financial data or PII)
  • The ability to respond to consumer right-to-be-forgotten requests: The EU GDPR stipulates that customers have a right to be forgotten (also called “right to erasure”). But how can an organization do that if the customer demanding it isn’t properly identified in the organization? The customer may be known by different names or spellings and may exist in multiple systems.
  • The ability to secure data (for example, by limiting access to authorized personnel, or by using masking, anonymization or tokenization)
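Masking and tokenization of the kind mentioned above can be sketched as follows. This is a minimal illustration: the keyed hash stands in for a real vault-backed tokenization service, and the key handling shown is not production practice.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"   # illustrative; store real keys in a secrets manager

def mask_email(email):
    """Partially mask an email so it stays recognizable but not usable."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

def tokenize(value):
    """Deterministic keyed token: the same input always yields the same
    token, but the original value cannot be recovered without a lookup."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

print(mask_email("ada.lovelace@example.com"))   # a***@example.com
# Determinism preserves joinability: the token can still act as a key
print(tokenize("ada.lovelace@example.com") == tokenize("ada.lovelace@example.com"))  # True
```

Deterministic tokens let analysts join datasets on the tokenized column without ever seeing the underlying PII, which is often the governance goal.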
See “Building a Comprehensive Data Governance Program” for detailed coverage on data governance. Gartner uses the framework in Figure 11 to guide data and analytics technical professionals involved in data governance programs.
Figure 11. Data Governance Framework
Data Governance Framework

Planning Considerations

Given the trends described above, data and analytics technical professionals should focus their 2020 planning efforts on the following activities:
  • Deploy metadata-based data governance across the distributed data pipeline
  • Democratize assets through data-as-a-service to compete in the data economy
Deploy Metadata-Based Data Governance Across the Distributed Data Pipeline
Managing data effectively depends on being able to answer the following questions:
  • What data do we have?
  • Where is it?
  • What does it mean?
Without answers to these questions, many data management initiatives — including self-service analytics, data security, data privacy, information life cycle management and many aspects of regulatory compliance — are doomed to fail. The key to all of these initiatives is data governance powered by metadata management.
Metadata management comes in many forms:
  • At design time, metadata management is primarily focused on modeling of data artifacts. Of course, this includes traditional modeling of persistent storage artifacts such as relational databases, NoSQL databases, physical data warehouses, and physical data marts. It also includes:
    • Modeling of runtime data artifacts such as message models used in service-to-service communication of service-oriented architecture (SOA) and MSAs
    • Modeling of data-in-motion artifacts such as extraction, transformation and loading (ETL) programs and other components of a data-integration pipeline, as well as explicitly modeling virtualized artifacts such as view definitions in a LDW
  • At runtime, metadata management:
    • Includes collecting data about system operations such as monitoring specific runs (or run failures) of data integration pipelines, automated business workflows, attempts to access SOA and MSA services, and monitoring and analysis of database access logs and transaction logs.
    • Can also involve retroactively imposing metadata on poorly understood data assets, including using the automated discovery features of metadata catalogs or using the automatically generated semantic layers atop data lakes and other data assets
Because data governance depends so heavily on metadata management, it is useful to understand that metadata tools exist along a spectrum. At one end of this spectrum are tools that exist to help you create metadata for specific purposes. Tools in this category include data modeling tools, XML design tools, and Unified Modeling Language (UML) modeling tools.
At the other end of this spectrum are tools that exist to consolidate metadata into all-encompassing repositories that support data lineage and other use cases that require expansive, architecture-wide views of metadata. Of course, none of these tools is truly “all-encompassing,” but vendors seek to provide connectors to as many sources of metadata as possible.
Between these two extremes are tools whose primary function is not metadata management, but whose function depends on collecting or maintaining metadata. There are many such tools, including integration platforms, data ingestion platforms, data integration and virtualization platforms, analytics and data science platforms, and cloud infrastructure platforms (see Figure 12).
Figure 12. Metadata Tool Categories and Representative Vendors
Data governance programs depend so heavily on metadata that it should come as no surprise that data governance activities can occur during the same phases as metadata management. Here are a few examples of how data governance processes capitalize on carefully maintained metadata throughout the design-build-deploy-monitor life cycle.
  • At design time, data governance can mean:
    • Using metadata to perform impact analysis, which can reveal how a change to one data artifact can affect downstream artifacts: In many cases, this metadata comes from data integration pipelines. In other cases, the metadata might come from well-maintained data in enterprise architecture tools.
    • Using metadata — such as a data dictionary — to find existing data sources that can support a new business use case: This can prevent the particular form of “reinventing the wheel” known as data silos. Note that this style of leveraging existing data assets is useful at both extremities of the information life cycle. That is, it can help you avoid both transactional silos and analytical silos.
    • Using metadata — such as data models of individual systems — to predict data-integration problems
    • Using metadata to understand the requirements for encryption, masking, tokenization, and other data obfuscation and protection techniques: This could include automatic classifications of data assets into categories such as personally identifiable information (PII), personal health information (PHI) and corporate intellectual property (IP).
  • At runtime, data governance can mean:
    • Using metadata to diagnose erroneous, untrusted, or otherwise suspicious data on an analytical dashboard or report: Using data lineage in this way is very similar to the impact analysis described above. Whereas impact analysis follows a data artifact forward in time to learn about downstream ramifications of a choice, diagnosing an analytical error follows the same set of connections backward in time to find the original source of an error.
    • Using metadata — such as access logs — to understand security and privacy violations or their obverse, excessive access restrictions that prevent authorized personnel from accessing data to which they are entitled
    • Using the metadata in a business rules engine (BRE) to uncover data-quality violations
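Both directions can be illustrated with a minimal sketch: impact analysis walks the lineage graph forward from an artifact, while error diagnosis walks the same edges backward. The lineage edges and artifact names below are invented for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage metadata: each edge means "source feeds target".
# Artifact names are illustrative, not drawn from any particular tool.
LINEAGE_EDGES = [
    ("crm.accounts", "staging.accounts"),
    ("staging.accounts", "warehouse.dim_customer"),
    ("warehouse.dim_customer", "dashboard.revenue_by_customer"),
    ("erp.orders", "warehouse.fact_orders"),
    ("warehouse.fact_orders", "dashboard.revenue_by_customer"),
]

def _traverse(start, edges):
    """Breadth-first walk over a directed edge list, starting at `start`."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def impact_analysis(artifact, edges=LINEAGE_EDGES):
    """Design time: everything downstream that a change could affect."""
    return _traverse(artifact, edges)

def root_cause_candidates(artifact, edges=LINEAGE_EDGES):
    """Runtime: walk the same edges backward to find upstream sources of an error."""
    reversed_edges = [(dst, src) for src, dst in edges]
    return _traverse(artifact, reversed_edges)
```

The point of the sketch is that both governance activities are traversals of one and the same metadata graph, which is why well-maintained lineage pays off twice.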
With all this in mind, technical professionals planning to build or expand data governance in 2020 should take these actions:
  • Choose an incremental approach to deploying data governance. Focus on particular data domains that support specific business use cases.
  • Deliver metadata management incrementally, as dictated by the prioritized approach to delivering data governance. You cannot deliver a full, enterprisewide, enterprise-scale metadata management solution in a single deployment. In other words, even though metadata management is foundational to data governance, you should not build a fully populated metadata catalog, or a fully populated business data glossary, or a fully populated data dictionary. Rather, you should selectively populate these foundational technologies by focusing on the metadata that will support your specific use case. As you add more use cases, extend the reach of your governance and metadata management program.
  • Prioritize the forms of metadata that will support your initial business use cases. For example, if your initial business use case is to support self-service analytics to rationalize promotional spending, start with the data from marketing and sales. Likewise, prioritize the business glossary with definitions of terms from the sales and marketing domains, since the business glossary will constitute a starting point for self-service analysts to navigate to analysis-enabled data assets. By contrast, if your initial use case is data quality, the entries in the business data glossary might be deprioritized.
  • Choose the metadata management technology that best suits your architecture. For example, if you are committed to logical data warehousing, you might safely rely on the repository and metadata management features provided by your data virtualization vendor (e.g., Denodo or Dremio). Likewise, if your data-management strategy relies on a homogeneous approach from a single cloud provider, you might use the metadata solution offered by your cloud provider (e.g., AWS Glue or Azure Data Catalog). By contrast, if your architecture has many disparate components, you might choose a more general-purpose metadata management solution with connectors to many data management tools (e.g., Alation or Collibra).
  • Differentiate the two meanings of “data governance tool.” In some contexts, the term “data governance tool” is used to refer to tools that produce or manage metadata that can undergird a data governance program. Many of these tools, however, have user-hostile interfaces that are disliked by data stewards and other nontechnical functionaries of the data governance program. Consequently, the term “data governance tool” is also used to refer to a tool that would be used by a data steward, data trustee or other business-side participant in a data-governance program. These tools, such as Informatica Axon Data Governance, provide a user interface to manage and execute data governance processes.
Democratize Assets Through Data as a Service to Compete in the Data Economy
Data trapped in silos is not very useful. Various attempts have been made over the years to unlock the full potential of data. The task has become more complicated as data sources and formats proliferate while business users expect to analyze the latest data in a self-service manner. This has led to growing demand for access to data with the same ease that the cloud provides for compute and storage. The challenge, however, is how to represent that data in a common language that abstracts its location. A new breed of vendors is addressing this challenge by providing “data as a service”: a semantic layer on top of a metadata catalog. The catalog ensures that only people authorized to see the data are granted access. The most popular way to access the data is through REST APIs.
Although data as a service remains an ambiguous term, there is much more clarity around the term “open data.” To ensure consistent access to data from a variety of domains, an open data framework has been promulgated over time. The availability of this data via API endpoints allows microservices with dedicated datasets to communicate with each other.
This requires data and analytics technical professionals to focus on two areas:
  • APIs and MSA
  • Open data
APIs and Microservices Architecture
The term “microservice” emerged a few years ago and describes an approach for delivering cloud-ready or cloud-native applications. The early case studies and examples pointed to the work of Netflix, Amazon, eBay and others that had taken a radically different approach to building and supporting their cloud-based applications.
At its core, MSA is a software architecture pattern in which applications are composed of small, independently deployable processes communicating with each other using language-agnostic APIs and protocols. When applied to data, these APIs can make data available dynamically and in a variety of useful ways.
For instance, data virtualization platforms can access data via APIs in a MSA. For guidance on data virtualization, see Gartner’s “Leveraging Data Virtualization in Modern Data Architectures.”
RESTful APIs are common in MSA, and RESTful access to data is supported by the prominent DBMSs on the market. RESTful APIs are also supported by data integration tools and are the basis for various “open data” initiatives that make datasets available to the public.
In addition to RESTful APIs, streaming APIs have become popular, and streaming will grow in popularity into the foreseeable future. Streaming data movement delivers data through brokers, such as message queues, or streaming platforms, such as Apache Kafka. Streaming is often used with MSA, and the data is often delivered in near-real time.
APIs can be written to integrate data in traditional batch-oriented paradigms, or to implement newer messaging or data streaming techniques. These approaches can be used in concert to build data pipelines that deliver data end to end to analytics architectures supporting real-time analytics and dashboards along with traditional BI reporting.
Data and analytics technical professionals should:
  • Build modern data integration pipelines to support messaging and near-real-time integration patterns, including streaming use cases such as in-stream analytics and real-time ingestion. Architectures must still include support for batch, replication and virtualization.
  • Identify and align critical use cases with batch, messaging and streaming data integration patterns. Perform a trade-off analysis to balance functional requirements and to identify the best integration patterns and toolsets to support business requirements.
Open Data
There is a general need to facilitate sharing and utilization of large amounts of data generated and residing among entities in the government sector. This is driven by a desire to encourage data literacy, ease data accessibility and democratize data assets, all with the central goal of raising awareness of our surroundings and improving our quality of life.
This can be achieved by promoting a technology-based culture of data management that fosters transparency with state-of-the-art data warehouse, data lake, data hub and data archive technology, with accompanying visualization capabilities.
This culture has bred the concept of “open data.” A common definition is the following:
Open data is data that can be freely used, reused and redistributed by anyone — subject only, at most, to the requirement to attribute and share alike.
The last few years have seen an explosion of open data initiatives, especially in the government sector. The goal is to unleash a wave of unconstrained access to data that promotes efficient, end-user-driven analytics.
Some key requirements of this initiative are as follows:
  • Availability and access
  • Reuse and redistribution
  • Unconstrained participation
Some common domains seen in open data portals are scientific research, climate, health, public safety, city services, communication, finance and transportation.
Many government sites, at the national, state and city levels, advocate open data.
Many “open data” vendors have developed frameworks that make it seamless for any entity to post data that is consumable and actionable at a near-real-time cadence.
The suggested formats of the data include HTML, XML, CSV, JSON, Resource Description Framework (RDF) and Extensible Hypertext Markup Language (XHTML). Information is expected to be human-readable as well as machine-readable.
Data integrations have typically been file- or DB-based. Open data is largely exposed as flat files or as API endpoints.
Having the data exposed as an API facilitates consumption by downstream systems. An example dataset in a JSON format is included below:
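A hypothetical record, with invented field names, shows the shape such a dataset commonly takes:

```json
[
  {
    "incident_id": "2019-004512",
    "category": "street_maintenance",
    "reported_at": "2019-08-14T09:32:00Z",
    "status": "closed",
    "location": {
      "latitude": 38.9072,
      "longitude": -77.0369
    }
  }
]
```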
With an iPaaS solution, data can be automatically pushed to a data portal from multiple sources. The most common underlying data layer is a data lake. A REST API is used to expose the data for each of the domains, and each dataset has a dedicated API endpoint.
For example, for the Socrata (owned by Tyler Technologies) data portal, you need to integrate directly with the Socrata Publisher API. The Publisher API allows you to programmatically:
  • Add, update and delete rows within a Socrata dataset
  • Maintain dataset metadata and privacy settings
  • Create and import Socrata datasets
All these operations are provided via Socrata’s RESTful APIs. For geospatial processing in Socrata, streaming geospatial microservices can be invoked with data furnished through the API.
However, there is a growing need to access multiple resources with one call akin to a data virtualization layer. Data scientists have expressed interest in consuming data from multiple datasets with one call.
GraphQL, a query language for APIs, offers a convenient way to aggregate data from multiple sources with a single request. All GraphQL requests are POST calls to a single endpoint, with a query body that describes which resources and fields are being requested.
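As a sketch, the single-endpoint pattern can be exercised from any HTTP client. The endpoint URL and the schema fields (cityIncidents, transitStops) below are hypothetical:

```python
import json

# A single GraphQL query that pulls fields from two datasets in one request.
# The schema (cityIncidents, transitStops) is invented for this sketch.
GRAPHQL_QUERY = """
{
  cityIncidents(limit: 10) { category status }
  transitStops(route: "42") { stopId name }
}
"""

def build_graphql_request(query, variables=None):
    """Every GraphQL call is a POST of a JSON body to one endpoint."""
    return {
        "url": "https://example.org/api/graphql",  # single endpoint, hypothetical
        "method": "POST",
        "body": json.dumps({"query": query, "variables": variables or {}}),
    }

request = build_graphql_request(GRAPHQL_QUERY)
```

Contrast this with REST, where the same result would require one call per dataset endpoint followed by client-side stitching.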
Gartner expects that GraphQL will increasingly play the role that views play in databases. That is, it will serve to merge datasets for consumption by BI tools, thereby questioning the need to move the data to a data warehouse for further curation and blending.
Gartner expects to see an increase in the momentum of open data deployments. Data and analytics technical professionals should explore incorporating open data into their use cases, especially for advanced analytics.

AI/ML Advancement Will Improve Data Management Workload Delivery, but With an Added Burden

AI and ML technologies are so hot that they are often hyped beyond their capabilities. There is a notion that, if data management products are AI-enabled, they will suddenly acquire magical new capabilities. Separating hype from reality, however, there is no doubt that data management systems are being disrupted by these advanced analytics capabilities, which are becoming steadily more sophisticated.
Consider, for example, the optimization of data management pipelines. AI and ML technologies are being deployed to achieve higher efficiencies in areas, such as improving data quality, that have previously proven difficult. With the number of use cases for modern data architectures increasing rapidly, the stakes of poor data quality have risen tremendously. New ML-based options are able to infer anomalies and provide remediation recommendations.
Today, data scientists make a copy of the data from the data management systems into their ML environment, which is best suited for advanced analytics. However, expect to see more integration of ML into database management systems, which makes sense because that is where the bulk of data is. In other words, new options allow data scientists to train models right inside the databases.
Databases such as SQL Server and Azure Cosmos DB have embedded Apache Spark across many parts — from using Spark for ingestion to training ML models. These models can then be packaged as stored procedures that can be called as user-defined functions or put into a Docker container with a RESTful API endpoint.
Many Gartner clients have an open-source-first policy, and open-source databases are also now adding ML-enabled features. For example, the open-source database PostgreSQL, and its variant Pivotal Greenplum 6, include GPU-accelerated processing for deep learning model training through Apache MADlib.
Database vendors are continuing to mature the in-database ML capabilities and are making this feature more robust. For example, in 2019, BigQuery added support for the K-Means algorithm, which now allows users to cluster the data inside the database to support customer segmentation use cases. Sophisticated data scientists who train models in TensorFlow can now export the model and run it inside BigQuery.
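For instance, a k-means segmentation in BigQuery ML follows the standard CREATE MODEL pattern; the dataset, table and column names below are illustrative:

```sql
-- Train a k-means model on customer behavior columns (illustrative names).
CREATE OR REPLACE MODEL mydataset.customer_segments
OPTIONS(model_type = 'kmeans', num_clusters = 4) AS
SELECT order_count, avg_basket_value, days_since_last_order
FROM mydataset.customer_stats;

-- Assign each customer to a segment.
SELECT * FROM ML.PREDICT(
  MODEL mydataset.customer_segments,
  (SELECT customer_id, order_count, avg_basket_value, days_since_last_order
   FROM mydataset.customer_stats));
```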
In 2020, further advancements in embedded ML will see databases such as Oracle Database 20c offer AutoML, which automatically builds and compares machine learning models at scale and facilitates the use of machine learning by nonexperts. This option will allow the optimizer to choose which ML algorithm should be used to train on the data to get the best predictive analytics.

Planning Considerations

To prepare for these trends in 2020, technical professionals should focus on the following planning priorities:
  • Use AI/ML-enabled tools in the data management pipeline
  • Optimize data architecture by providing flexibility to data science use cases to operationalize ML models more quickly
Use AI/ML-Enabled Tools in the Data Management Pipeline
ML and AI are becoming pervasive, not only for data science workloads, but also within databases and across the different components of data pipelines, data architecture and data management. In 2020, ML-enabled autotuning of databases, workloads and hardware configurations will become table stakes. ML provides new capabilities to tune data processing clusters based on throughput and latency requirements.
Traditionally, organizations’ workloads were Newtonian — fixed, measurable and predictable. However, newer workloads are much more dynamic. Organizations must continually innovate, improve and become more quantum: fluid and probabilistic. This requires investing in ML to predict and automate the known parameters for enhancing the efficiency of the organization’s underlying data architectures. As data volumes increase exponentially, it is no longer possible to handle every piece of data manually. Organizations have used rule engines in the past, but in a dynamic world, with evolving or even unknown rules, ML-enabled data management will become a necessity. For example, improving data quality has proven to be difficult, but built-in ML and NLP algorithms inside data governance products can now alert users when anomalies are detected or even inferred.
AI/ML continues to influence database engines and LDW architectures and design in different ways:
  • During data ingestion: performing automated data profiling, auto tagging and classification; understanding data distributions; and detecting data drift. This increases trust in data, increases process efficiencies and speeds up operationalization of analytical models.
  • Automating workload management and self-tuning the software configuration, cluster management and hardware platform to perform optimally — such as by using AI/ML to boost concurrency for short queries. ML algorithms can inspect the incoming queries, the tables or files they touch, and the capacity of the machine and its current state of “busyness.” They can also record many output variables, such as how long the queries took to run, how many were processed in each time period, the size of results and so on.
  • Using orchestration to schedule algorithms and allocate resources across multiple jobs, dependencies and varying workloads to maintain SLAs and guarantee throughputs and latency — a typically challenging proposition. Sophisticated algorithms leverage different tuning parameters to optimize and schedule workloads to maximize resource utilization and minimize waits, delays and idle time. ML is increasingly being used to schedule such workloads across different CPU cores and optimize resources like memory, network and I/O utilization.
The LDW usually runs complex workloads. This includes a mix of simple, medium and complex queries and short- or long-running queries with varying concurrency. These can range from scanning high volumes of data to executing large, multitable joins and scanning billions of rows, to data science exploration, data virtualization and large bulk loading jobs.
The different components of the LDW each have their own workload management features and prioritization schemes. These vary in effectiveness depending on the maturity and design of the individual component.
AI/ML is being used to manage the workloads efficiently across the LDW. Some examples include:
  • Oracle Autonomous Data Warehouse (ADW) has a self-tuning database that adjusts the database configuration parameters based on workloads, underlying hardware and throughput demands. Policy-driven automation using ML in the background is the core underlying philosophy of these modern database engines. The most significant aspects for most customers will be cost reduction, increased uptime and a reduced need for sophisticated tuning skill sets. Such systems deal automatically with data growth, schema changes and keeping statistics up to date based on usage patterns.
  • Amazon Redshift makes use of AI/ML with features such as short query acceleration (SQA) and Amazon Redshift Advisor. Short query acceleration uses ML to recognize when an incoming query is likely to be a short query that will not occupy many resources. This judgment can be made based on prior executions of the same query or similar queries. This is important for supporting higher concurrency because it is desirable to let short queries into the system rather than artificially limiting concurrency.
Databases also routinely leverage ML to analyze a cluster’s performance and usage metrics and provide customized recommendations aimed at optimizing database performance and decreasing operating costs. System performance specialists can use AI/ML over longer time periods to give advice on the configuration and tuning of an analytical system. A combination of rule-based systems, statistics and ML examine the input workloads. Every n days, they can generate actionable advice on how to alter the system to improve performance, with a particular emphasis on improving throughput. Throughput is one of the main measures of performance. The more queries you can push through a system for a given time and price, the better the return on investment.
The advice can span partitioning of tables, sort sequences, compression schemes and so on. The ML routines monitor the system and implement recommendations every few days. After a few cycles, the performance of the system improves. The results are comparable with those of a competent tuning specialist.
Figure 13 shows how ML can be used within an LDW.
Figure 13. Using ML in an LDW
ML-Driven Data Catalogs
The next generation of data catalog products is breathing new life into data lakes. Data catalogs are becoming the primary glue tying together the different components in a data lake, and they have innovated to leverage state-of-the-art ML- and AI-based algorithms to enable data discovery.
Vendors are reenvisioning metadata management and governance as an ML-driven data catalog and building metadata management capability on ML, graph, metadata and services. Data catalogs based on ML take advantage of self-data linking and curation, workflows, and rich profiling. One of the most exciting parts of the market today is the availability of automated data and metadata profiling tools.
The overall idea behind these data catalogs is to build a single convergent tool for MDM, metadata management and reference data management, rather than splitting these functions across different tools. ML-driven data catalogs allow collaboration, communication, profiling, automated classification, tagging and deep data introspection through behavioral I/O analysis of data use and queries. They are based on knowledge graphs that capture semantic metadata and provide business glossary functionality to better manage data models, schemas and certification.
ML-driven data catalogs automatically inventory and tag raw data with business terms. Some data catalogs provide deep profiling of the data; incorporate tribal knowledge that connects systems; and build logical and semantic insights about data, its lineage, models and schemas from source systems at the file, table, and column levels.
Organizations suffer from pain points, including:
  • Keeping up with data sources
  • Tracking changes in data across the organization
  • Monitoring data quality across the different data sources and data lakes
This can lead to data not being leveraged in a timely fashion for insights. Data changes are sometimes discovered downstream, thus impacting confidence in the insights derived from it.
Ideally, data sources and their changes should be automatically discovered, held to data compliance policies and made available to analysts. Sources can be discovered automatically by continuously scanning the data sources already listed within a data catalog and tracking them for schema changes.
Data catalogs are leveraging AI and ML to identify and automatically tag and annotate to enable efficient search and data findability. These data catalogs use advanced analytics to infer the meaning of fields from field names, interpret field values, and tag them based on field names, value and context.
Data catalog vendors innovating in this space include Alation, Collibra, Informatica and Waterline Data.
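A heavily simplified, rule-based sketch of this field-level inference is shown below. Commercial catalogs use trained models rather than hand-written patterns; every pattern and tag name here is invented for illustration.

```python
import re

# Illustrative name and value patterns; commercial catalogs learn these with ML.
NAME_HINTS = {
    "pii.email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "pii.phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "finance.amount": re.compile(r"amount|price|cost", re.IGNORECASE),
}
VALUE_HINTS = {
    "pii.email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def tag_field(field_name, sample_values=()):
    """Return candidate tags inferred from a field's name and sample values."""
    tags = {tag for tag, pat in NAME_HINTS.items() if pat.search(field_name)}
    for value in sample_values:
        for tag, pat in VALUE_HINTS.items():
            if pat.match(str(value)):
                tags.add(tag)
    return tags
```

Note that value-based hints catch fields whose names give nothing away (for example, a column called "contact" holding email addresses), which is why catalogs profile data as well as metadata.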
Optimize Data Architecture by Providing Flexibility to Data Science Use Cases to Operationalize ML Models More Quickly
With the advent of the big data era and introduction of nonrelational (NoSQL) data stores, many technology prophets had declared SQL dead. Yet many vendors are now bending over backward to provide SQL interfaces over structured and semistructured data in nonrelational data stores and object storage.
Interestingly, SQL is now being used as an abstraction over ML. In 2019, Google made BigQuery ML generally available. This feature provides extensions to SQL that allow creation of ML models on data directly in BigQuery. This is how most cloud data warehouses will evolve in 2020 and beyond. The ability to train models and make predictions on data where it is stored obviates the need to make another copy of the data, thereby reducing security risks.
SQL statements can also be used to both create and invoke the built-in models. The training dataset is specified, as are the names of the values to be modeled. Also, a name is given to the particular instance of a linear regression model being created. An example is shown below.
  • CREATE MODEL model_name
    OPTIONS(model_option_list) -- model type and associated parameters as key values
    AS query_statement -- the SELECT query that provides training data to the model
Invoking the model involves:
  • SELECT *
    FROM ML.EVALUATE(MODEL model_name,
      {TABLE table_name | (query_statement)}) -- the data to be evaluated by the model
Embedding special algorithms within programs via SQL has been a feature of many databases for many years. For example, statistical functions such as correlation, built on top of other functions such as standard deviation and covariance, are present in many different databases. In addition to BigQuery, data warehouses such as Teradata Vantage provide a built-in ML engine that supports Python and R in addition to SQL. In 2019, ML engines were also embedded in nonrelational databases, such as Microsoft Azure Cosmos DB.
In 2020, technical professionals should prepare to harness ML capabilities in their databases as more vendors provide such capabilities. As long as the algorithms can scale appropriately, there are many good reasons to use your database engine or data warehouse for ML, such as:
  • Simplicity: There is no need to manage another compute platform, integrate between systems and extract/analyze/load the data.
  • Security: The data stays where it is well-secured. There is no need to configure credentials in external systems or worry about where copies of data might end up.
  • Performance: A database engine maintains statistics and metadata to optimize queries. This data could be used to train an ML algorithm to perform predictions on when the queries would finish or the time it would take to return the result set. This enables end users or applications expecting results from the query to plan accordingly.
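As a toy illustration of that last point, a least-squares fit over past executions can predict a query's run time from a simple feature such as data scanned. The history and all numbers below are invented.

```python
# Toy runtime predictor: fit seconds ~ a + b * gigabytes_scanned from history.
# Real systems use many features (joins, concurrency, machine state); this
# single-feature least-squares fit is only a sketch with invented numbers.
HISTORY = [  # (gigabytes_scanned, seconds_taken)
    (1.0, 2.1), (2.0, 4.0), (4.0, 8.2), (8.0, 15.9),
]

def fit(history):
    """Ordinary least squares for a single feature."""
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in history) / \
        sum((x - mean_x) ** 2 for x, _ in history)
    a = mean_y - b * mean_x
    return a, b

def predict_seconds(gb_scanned, history=HISTORY):
    a, b = fit(history)
    return a + b * gb_scanned
```

With such an estimate exposed alongside query submission, a dashboard or application can decide whether to wait, poll or defer before the query even runs.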