Teradata and Data Warehouse Architectures_strategic queries teradata-CSDN博客

本文链接：https://blog.csdn.net/seeusmile/article/details/3256709

1 Teradata and Data Warehouse Architectures

1.1 Evolution to Active Data Warehousing

1.1.1 Data Warehouse Usage Evolution

There is an information evolution happening in the data warehouse environ me nt today. Changing business require me nts have placed demands on data warehousing technology to do more things faster. Data warehouses have moved from back room strategic decision support systems to operational, business-critical components of the enterprise. As your company evolves in its use of the data warehouse, what you need from the data warehouse evolves, too.

Stage 1 Reporting: The initial stage typical ly focuses on reporting from a single source of truth to drive decision-making across functional and/or product boundaries. Questions are usual ly known in advance, such as a week ly sales report.

Stage 2 Ana ly zing: Focus on why so me thing happened, such as why sales went down or discovering patterns in custo me r buying habits. Users perform ad-hoc ana ly sis, slicing and dicing the data at a detail level, and questions are not known in advance.

Stage 3 Predicting: Sophisticated ana ly sts heavi ly utilize the system to leverage information to predict what will happen next in the business to proactive ly manage the organization's strategy. This stage requires data mining tools and building predictive models using historical detail. As an example, users can model custo me r demographics for target marketing.

Stage 4 Operationalizing: Providing access to information for im me diate decision-making in the field enters the realm of active data warehousing. Stages 1 to 3 focus on strategic decision-making within an organization. Stage 4 focuses on tactical decision support.. Tactical decision support is not focused on developing corporate strategy, but rather on supporting the people in the field who execute it. Examples: 1) Inventory manage me nt with just-in-ti me replenish me nt, 2) Scheduling and routing for package delivery. 3) Altering a campaign based on current results.

Stage 5 Active Warehousing: The larger the role an ADW plays in the operational aspects of decision support, the more incentive the business has to automate the decision processes. You can automate decision-making when a custo me r interacts with a web site. Interactive custo me r relationship manage me nt (CRM) on a web site or at an ATM is about making decisions to optimize the custo me r relationship through individualized product offers, pricing, content delivery and so on. As technology evolves, more and more decisions beco me executed with event-driven triggers to initiate ful ly automated decision processes. Example: determine the best offer for a specific custo me r based on a real-ti me event, such as a significant ATM deposit.

1.1.2 Active Data Warehouse

Data warehouses are beginning to take on mission-critical roles supporting CRM, one-to-one marketing, and minute-to-minute decision-making. Data warehousing require me nts have evolved to demand a decision capability that is not just oriented toward corporate staff and upper manage me nt, but actionable on a day-to-day basis. Decisions such as when to replenish Barbie dolls at a particular retail outlet may not be strategic at the level of custo me r seg me ntation or long-term pricing strategies, but when executed proper ly , they make a big difference to the bottom line. We refer to this capability as "tactical" decision support.

Tactical decisions are the drivers for day-to-day manage me nt of the business. Businesses today want more than just strategic insight from their data warehouse imple me ntations-they want better execution in running the business through more effective use of information for the decisions that get made thousands of ti me s per day.

The origin of the active data warehouse is the ti me ly , integrated store of detail data available for ana ly tic business decision-making. It is on ly from that source that the additional traits needed by the active data warehouse can evolve. These new "active" traits are supple me ntal to data warehouse functionality. For example, the work mix in the database still includes complex decision support queries, but expands to take on short, tactical queries, background data feeds, and possib ly event-driven updates all at the sa me ti me . Data volu me s and user concurrency levels may explode upward beyond expectation. Restraints may need to be placed on the longer, ana ly tical queries in order to guarantee tactical work throughput. While accessing the detail data direct ly remains an important opportunity for ana ly tical work, tactical work may thrive on shortcuts and summaries, such as operational data store (ODS) level information. And for both strategic and tactical decisions to be useful to the business, today's data, this hour's data, even this minute's data has to be at hand.

Teradata is positioned exceptional ly well for stepping up to the challenges related to high availability, large multi-user workloads, and handling complex queries that are required for an active data warehouse imple me ntation. Teradata technology supports the evolving business require me nts by providing high performance and scalability for:

Mixed workloads (both tactical and strategic queries) for mission critical applications
Large amounts of detail data
Concurrent users

Teradata provides 7x24 availability and reliability, as well as continuous updating of information so data is always fresh and accurate.

1.1.3 Evolution of Data Processing

Traditional ly , data processing has been divided into two categories: on-line transaction processing (OLTP) and decision support systems (DSS). For either, requests are handled as transactions. A transaction is a logical unit of work, such as a request to update an account.

An RDBMS is used in the following main processing environ me nts:

DSS
OLTP
OLAP

Decision Support Systems (DSS)
In a decision support environ me nt, users submit requests to anaylze historical detail data stored in the tables. The results are used to establish strategies, reveal trends, and make projections. A database used as a decision support system (DSS) usual ly receives fewer, very complex, ad-hoc queries and may involve nu me rous tables. Decision support systems include batch reports, which roll-up numbers to give business the big picture, and over ti me , have evolved:

Instead of pre-written scripts, users now require the ability to do ad-hoc queries, which are unpredictable in their processing and which allow "what if " types of questions. These types of questions are essential for long range planning.
DSS systems are able to read through huge volu me s of data. DSS is an area where Teradata has traditional ly excelled with its parallel architecture.

On-line Transaction Processing (OLTP)
Unlike the DSS environ me nt, an on-line transaction processing (OLTP) environ me nt typical ly has users accessing current data to update, insert, and delete rows in the data tables. OLTP is typified by a small number of rows (or records) or a few of many possible tables being accessed in a matter of seconds or less. Very little I/O processing is required to complete the transaction. This type of transaction takes place when we take out money at an ATM. Once our card is validated, a debit transaction takes place a gai nst our current balance to reflect the amount of cash withdrawn. This type of transaction also takes place when we deposit money into a checking account and the balance gets updated. We expect these transactions to be perfor me d quick ly . They must occur in real ti me .

On-line Ana ly tical Processing (OLAP)
OLAP is the kind of processing that takes place in many data warehouses or data marts. Here, the user may be looking for historical trends, sales rankings or seasonal inventory fluctuations for the entire corporation. Usual ly , this involves a lot of detail data to be retrieved, processed and ana ly zed. Therefore, response ti me can be in seconds or minutes. In the most sophisticated OLAP systems, the systems will make automated purchasing or inventory decisions without any human intervention.

1.1.4 Advantages of Using Summary Data

Until recent ly , most business decisions were based on summary data. The problem is that summarized data is not as useful as detail data and cannot answer so me questions with accuracy. With summarized data, peaks and valleys are leveled when the peaks fall at the end of reporting period and are cut in half.

Here's another example. Think of your month ly bank state me nt that records checking account activity. If it on ly told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.

Decision support-answering business questions-is the real purpose of databases. To answer business questions, decision-makers must have four things:

The right data
Enough detail data
Proper data structure
Enough computer power to access and produce reports on the data

Consider your own business and how it uses data. Is that data detailed or summarized? If it's summarized, are there questions it cannot answer?

1.2 The Data Warehouse

A data warehouse is a central, enterprise-wide database that contains information extracted from the operational systems. Data warehouses have beco me more common in corporations where enterprise-wide detail data may be used in on-line ana ly tical processing to make strategic and tactical business decisions. Warehouses often carry many years worth of detail data so that historical trends may be ana ly zed using the full power of the data.

Many data warehouses get their data direct ly from operational systems so that the data is ti me ly and accurate. While data warehouses may begin so me what small in scope and purpose, they often grow quite large as their utility beco me s more ful ly exploited by the enterprise.

Data Warehousing is a process, not a product. It is a technique to proper ly assemble and manage data from various sources to answer business questions not previous ly possible or known.

1.3 Data Marts

A data mart is a special purpose subset of enterprise data used by a particular depart me nt, function or application. Data marts may have both summary and detail data for a particular use rather than for general use. Usual ly the data has been pre-aggregated or transfor me d in so me way to better handle the particular type of requests of a specific user community.

Independent Data Marts
Independent data marts are created direct ly from operational systems, just as is a data warehouse. In the data mart, the data is usual ly transfor me d as part of the load process. Data might be aggregated, di me nsionalized or summarized historical ly , as the require me nts of the data mart dictate.

Logical Data Marts
Logical data marts are not separate physical structures or a data load from a data warehouse, but rather are an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology, it removes the need for massive data loading and transforming, making a single data store available for all user needs.

Dependent Data Marts
Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the move me nt and transformation of data but may provide a better vehicle for performance-critical user queries.

1.4 Data Mart Pros and Cons

Independent Data Marts
Independent data marts are usual ly the easiest and fastest to imple me nt and their payback value can be almost im me diate. So me corporations start with several data marts before deciding to build a true data warehouse. This approach has several inherent problems:

While data marts have obvious value, they are not a true enterprise-wide solution and can beco me very cost ly over ti me as more and more are added.
A major problem with proliferating data marts is that, depending on where you look for answers, there is often more than one version of the truth.
They do not provide the historical depth of a true data warehouse.
Because data marts are designed to handle specific types of queries from a specific type of user, they are often not good at "what if" queries like a data warehouse would be.

Logical Data Marts
Logical data marts overco me most of the limitations of independent data marts. They provide a single version of the truth. There is no historical limit to the data and "what if" querying is entire ly feasible. The major drawback to logical data marts is the lack of physical control over the data. Because data in the warehouse in not pre-aggregated or di me nsionalized, performance a gai nst the logical mart will not usual ly be as good as a gai nst an independent mart. However, use of parallelism in the logical mart can overco me so me of the limitations of the non-transfor me d data.

Dependent Data Marts
Dependent data marts provide all advantages of a logical mart and also allow for physical control of the data as it is extracted from the data warehouse. Because dependent marts use the warehouse as their foundation, they are general ly considered a better solution than independent marts, but they take longer and are more expensive to imple me nt.

1.5 A Teradata System

A Teradata system contains one or more nodes. A node is a term for a processing unit under the control of a single operating system. The node is where the processing occurs for the Teradata Database. There are two types of Teradata systems:

Sym me tric multiprocessing (SMP) - An SMP Teradata system has a single node that contains multiple CPUs sharing a me mory pool.
Massive ly parallel processing (MPP) - Multiple SMP nodes working together comprise a larger, MPP imple me ntation of Teradata. The nodes are connected using the BYNET, which allows multiple virtual processors on multiple nodes to communicate with each other.

To manage a Teradata system, you use:

SMP system: System Console (keyboard and monitor) attached direct ly to the SMP node
MPP system: Administration Workstation (AWS)

To access a Teradata system, a user typical ly logs on through one of multiple client platforms (channel-attached mainfra me s or network-attached workstations). Client access is discussed in the next module.

1.6 Node Components

A node is a basic building block of a Teradata system, and contains a large number of hardware and software components. A conceptual diagram of a node and its major components is shown below. Hardware components are shown on the left side of the node and software components are shown on the right side.

1.7 Shared Nothing Architecture

The Teradata vprocs (which are the PEs and AMPs) share the components of the nodes ( me mory and cpu). The main component of the "shared-nothing" architecture is that each AMP manages its own dedicated portion of the system's disk space (called the vdisk) and this space is not shared with other AMPs. Each AMP uses system resources independent ly of the other AMPs so they can all work in parallel for high system performance overall.

1.8 Using the BYNET

The BYNET (pronounced, "bye-net") is a high-speed interconnect (network) that enables multiple nodes in the system to communicate. It has several unique features:

Scalable: As you add more nodes to the system, the overall network bandwidth scales linear ly . This linear scalability me ans you can increase system size without performance penalty -- and so me ti me s even increase performance.

High performance: An MPP system typical ly has two BYNET networks (BYNET 0 and BYNET 1). Because both networks in a system are active, the system benefits from having full use of the aggregate bandwidth of both the networks.

Fault tolerant: Each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatical ly reconfigure that network so all me ssages avoid the unusable path. Additional ly , in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and me ssages are re-routed to BYNET 1.

Load balanced: Traffic is automatical ly and dynamical ly distributed between both BYNETs.

1.9 BYNET Hardware and Software

The BYNET hardware and software handle the communication between the vprocs and the nodes.

Hardware: The nodes of an MPP system are connected with the BYNET hardware, consisting of BYNET boards and cables.

Software: The BYNET software is installed on every node. This BYNET driver is an interface between the PDE software and the BYNET hardware.

SMP systems do not contain BYNET hardware. The PDE and BYNET software emulate BYNET activity in a single-node environ me nt. The SMP imple me ntation is so me ti me s called "boardless BYNET."

1.10 Communication Between Nodes

The BYNET hardware can carry the following types of me ssages between nodes:

Broadcast me ssage to all nodes
Point-to-point me ssage from one node to another node

1.11 Communication Between Vprocs

On an MPP system, BYNET hardware is used to first send the communication across nodes (using either the point-to-point or broadcast me ssaging described previous ly ).

On an SMP system, this first step is unnecessary since there is on ly one node.

Once a node receives a communication, vproc communication within the node is done by the PDE and BYNET software using the following types of me ssaging:.

Point-to-point
Multicast
Broadcast

Point-to-Point Me ssages

With point-to-point me ssaging between vprocs, a vproc can send a me ssage to another vproc on:

The sa me node (using PDE and BYNET software)
A different node using two steps:
1. Send a point-to-point me ssage from the sending node to the node containing the recipient vproc. This is a communication between nodes using the BYNET hardware.
2. Within the recipient node, the me ssage is sent to the recipient vproc. This is a point-to-point communication between vprocs using the PDE and BYNET software.

Point-to-Point Me ssage on the Sa me Node

Point-to-Point Me ssage on a Different Node

Multicast Me ssages

A vproc can send a me ssage to multiple vprocs using two steps:

Send a broadcast me ssage from the sending node to all nodes. This is a communication between nodes using the BYNET hardware.
Within the recipient nodes, the PDE and BYNET software determine which, if any, of its vprocs should receive the me ssage and delivers the me ssage according ly . This is a multicast communication between vprocs within the node, using the PDE and BYNET software.

Broadcast Me ssages

A vproc can send a me ssage to all the vprocs in the system using two steps:

Send a broadcast me ssage from the sending node to all nodes. This is a communication between nodes using the BYNET hardware.
Within each recipient node, the me ssage is sent to all vprocs. This is a broadcast communication between vprocs using the PDE and BYNET software.

1.12 Cliques

A clique (pronounced, "kleek") is a group of nodes that share access to the sa me disk arrays. Each multi-node system has at least one clique. The cabling determines which nodes are in which cliques -- the nodes of a clique are connected to the disk array controllers of the sa me disk arrays.

1.13 Cliques Provide Resiliency

In the rare event of a node failure, cliques provide for data access through vproc migration. When a node resets, the following happens to the AMPs:

When the node fails, the Teradata Database restarts across all remaining nodes in the system.
The vprocs from the failed node migrate to the operational nodes in its clique.
Processing continues while the failed node is being repaired.

1.14 Cliques in a System

Vprocs are distributed across all nodes in the system. Multiple cliques in the system should have the sa me number of nodes.

The diagram below shows three cliques. The nodes in each clique are cabled to the sa me disk arrays. The overall system is connected by the BYNET. If one node goes down in a clique the vprocs will migrate to the other nodes in the clique, so data remains available. However, system performance decreases due to the loss of a node. System performance degradation is proportional to clique size.

1.15 Software Components

A Teradata node requires three distinct pieces of software:

For each node in the system, you need both of the following:

Operating system license (UNIX or Microsoft Windows)
Teradata software license

1.16 Operating System

The Teradata Database can run on the following operating systems:

UNIX MP-RAS
Microsoft Windows 2000

1.17 Parallel Database Extensions (PDE)

The Parallel Database Extensions (PDE) software layer was added to the operating system by NCR to support the parallel software environ me nt.

1.18 Trusted Parallel Application (TPA)

A Trusted Parallel Application (TPA) uses PDE to imple me nt virtual processors (vprocs). The Teradata Database is classified as a TPA. The four components of the Teradata TPA are:

AMP (Top Right)
PE (Bottom Right)
Channel Driver (Top Left)
Teradata Gateway (Bottom Left)

1.19 Teradata Software: PE

A Parsing Engine (PE) is a vproc that manages the dialogue between a client application and the Teradata Database, once a valid session has been established. Each PE can support a maximum of 120 sessions. The PE handles an incoming request in the following manner:

The Session Control component verifies the request for session authorization (user na me s and passwords), and either allows or disallows the request.
The Parser does the following:
- Interprets the SQL state me nt received from the application.
- Verifies SQL requests for the proper syntax and evaluates them semantical ly .
- Consults the Data Dictionary to ensure that all objects exist and that the user has authority to access them.
The Optimizer develops the least expensive plan (in terms of ti me ) to return the requested response set. Processing alternatives are evaluated and the fastest alternative is chosen. This alternative is converted into executable steps, to be perfor me d by the AMPs, which are then passed to the Dispatcher.

The Optimizer is "parallel aware," me aning that it has knowledge of the system components (how many nodes, vprocs, etc.), which enables it to determine the fastest way to process the query. In order to maximize throughput and minimize resource contention, the Optimzer must know about system configuration, available units of parallelism (AMPs and PEs), and data demographics. The Teradata Optimizer is robust and intelligent, and enables Teradata to handle multiple complex, ad-hoc queries efficient ly .
The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the BYNET for execution by the AMPs.
After the AMPs process the steps, the PE receives their responses over the BYNET.
The Dispatcher builds a response me ssage and sends the me ssage back to the user.

1.20 Teradata Software: AMP

The AMP is a vproc that controls its portion of the data on the system. The AMPs work in parallel, each AMP managing the data rows stored on its vdisk. AMPs are involved in data distribution and data access in different ways.

Data Distribution

When data is loaded, inserted, and updated, the AMP:

Receives incoming data from the PE.
Formats rows and distributes them on its vdisk.

Data Access

When data is accessed, the AMP retrieves the rows requested by the PE in the following manner:

The database manage me nt subsystem receives the steps from the Dispatcher over the BYNET.
The database manage me nt subsystem processes the steps. The subsystem on the AMP can:
- Lock databases and tables
- Create, modify, or delete definitions of tables
- Join tables
- Insert, delete, or modify rows within tables
- Sort, aggregate, or format data
- Retrieve information from definitions and rows from tables
The database manage me nt subsystem returns responses over the BYNET to the Dispatcher.

1.21 Teradata Software: Channel Driver

Channel Driver software is the me ans of communication between an application and the PEs assigned to channel-attached clients. There is one Channel Driver per node.

In the diagram below, the blue dots show the communication from the channel-attached client, to the host channel adapter in the node, to the Channel Driver software, to the PE, and back to the client.

1.22 Teradata Software: Teradata Gateway

Teradata Gateway software is the me ans of communication between an application and the PEs assigned to network-attached clients. There is one Teradata Gateway per node.

In the diagram below, the blue dots show the communication from the network-attached client, to the Ethernet card in the node, to the Teradata Gateway software, to the PE, and back to the client.