xml 数据操作_将XML操作数据集成到数据仓库中

最新推荐文章于 2024-06-23 23:25:47 发布

cuyi7076

最新推荐文章于 2024-06-23 23:25:47 发布

阅读量881

点赞数

文章标签：数据库数据仓库大数据编程语言 python

原文链接：https://www.ibm.com/developerworks/data/library/techarticle/dm-0908datastagepurexml1/index.html

版权

xml 数据操作

这不是秘密：XML无处不在。基于Web的应用程序，面向服务的体系结构（SOA）和特定于行业的数据交换格式都严重依赖XML来表示关键业务数据。 XML的日益普及促使许多公司探索如何将XML运营数据集成到其数据仓库和商业智能环境中。

在本系列文章中，将学习如何将IBM DB2 pureXML和InfoSphere DataStage一起使用来管理可操作的XML数据，根据需要转换XML数据，以及填充纯关系型或混合关系型/ XML数据仓库。（为简单起见，本系列文章的后续部分将参考“ pureXML”和“ DataStage”，而不是完整的产品名称。）分步说明指导您完成两种常见的集成方案。

在本系列文章的第一部分中，探索如何使用pureXML数据作为输入来构建包含关系数据和XML数据的数据仓库。另外，学习有关使用DataStage DB2 Connector的重要提示。在本系列文章的第二部分中，您将学习如何将平面文件数据转换为XML，以及如何使用pureXML将数据存储在数据仓库中。

引入关键的DataStage和pureXML技术

如果您不熟悉DataStage或pureXML，则在以下各节中将对它们进行简要介绍。但是，在阅读本系列文章时，有关数据仓库，ETL（提取-转换-加载）技术和XML的先验知识将有所帮助。

DataStage概述

IBM InfoSphere DataStage使公司能够从各种来源提取，转换和加载（ETL）数据到数据仓库中。对多处理器硬件的内置支持使DataStage能够提供高级别的可伸缩性，并有效地处理大量数据。各种“连接器”支持广泛的源和目标数据格式，包括流行的IBM和OEM数据库管理系统，ODBC数据源，第三方应用程序，由消息队列软件和Web服务生成的实时消息以及流行的文件格式。。

DataStage通过许多软件组件提供了这些功能以及其他功能。本系列文章中的方案使用其中的一部分。具体来说，本文中的场景使用DataStage Designer构造ETL作业。每个工作包含多个“阶段”，每个阶段执行一个给定的任务。此类任务的示例包括从数据源读取信息，使用内置函数转换输入数据，将数据从一种类型转换为另一种类型等等。本文的示例定义了涉及DB2连接器，XML操作和各种处理操作的阶段。两种DataStage技术对于场景至关重要：DB2连接器和XML Pack 2.0。本文简要介绍了这些内容。

使用DataStage Designer创建和编译您的作业后，您将使用DataStage Director执行您的工作。

DataStage DB2连接器

本文依赖于DB2连接器来进行所有DB2数据库访问。这是InfoSphere Information Server中的主要DB2连接选项，它基于Common Connector Framework。该连接器包含在旧式DataStage连接组件中发现的功能的超集，并提供改进的功能和性能。

DB2连接器可作为单独的连接器补丁安装用于Information Server 8.1。它支持到DB2 9.1和9.5服务器的本地和远程连接。（计划在将来的Information Server版本中提供DB2 9.7支持。）连接器需要DB2 Client 9.1修订包6（或更高版本）或DB2 9.5修订包3（或更高版本）。

DB2连接器的主要功能包括：

支持任何大小的XML和LOB（大对象）数据类型。有两种传递机制：按值传递（内联）和按引用传递。内联传递LOB或XML数据时，最大大小由DataStage引擎确定。当通过引用传递数据时，下游（或接收）阶段通过直接调用源连接器使用引用来获取数据。此方法的缺点是只能将LOB数据从一个连接器移动到另一个连接器，并且不能在作业中进行转换（因为只有引用是通过链接传递的）。连接器要求在读取或写入XML列时将ArraySize属性设置为1。
能够执行诸如SELECT ， INSERT ， UPDATE和DELETE类的标准SQL语句，这些语句的组合，用户定义SQL以及批量加载。
支持并行（多进程）执行。
支持DB2 DPF（数据库分区功能）。该连接器可以并行或顺序模式与分区数据库一起使用。并行运行时，将为每个DB2分区分配一个单独的DataStage进程。连接器可以并行读取和写入，并且可以并行加载分区数据库。
与分布式事务处理阶段（DTS）一起使用时，支持通过XA（两阶段提交）事务进行有保证的数据传递。 DTS提供了一种机制，可以将多个DB2数据库作为一个XA事务的一部分进行更新。请注意，DTS与DataStage 8.1的单独修补程序一起提供。
支持通过DataStage Designer中的“公共连接器导入向导”导入元数据。

DB2连接器用法上下文

该连接器可以在源，目标或查找上下文中使用。在本系列文章中，您将在源和目标上下文中使用DB2连接器。

源上下文是指连接器级中有输出链接时的情况。在这种情况下，连接器用于检索数据。

目标上下文是指输入链接进入连接器阶段的情况，这意味着数据正在传递到连接器。在这种情况下，连接器用于执行INSERT ， UPDATE ， DELETE ， UPSERT （ INSERT/UPDATE/DELETE组合），用户定义SQL或批量装入语句。

查找上下文是指连接器与查找阶段结合使用时的情况。在这种情况下，数据既传递到连接器，又从连接器检索。在这种情况下，连接器执行带有WHERE子句的SELECT语句。

DB2连接器XML支持

本系列文章中的场景集中在DB2连接器对XML数据的支持上。因此，值得深入了解DB2 Connector中的XML支持。

XML列的DataStage表示形式

DataStage 8.1使用NVarChar或LongNVarChar数据类型表示XML数据。在元数据导入期间，用户可以选择要用于XML列的类型。区别在于，以NVarChar列表示的XML列用于内联传输数据，并受DataStage引擎的大小要求，而long数据类型用于将XML列视为大对象（LOB）并允许将它们通过引用传递。

连接器支持：

X查询
带有嵌入式SQL的XQueries
SQL / XML查询
使用XML列更新和删除操作

源上下文：读取XML列

连接器可以读取整个XML文档或片段。片段可以是序列或原子值。使用XQuery时，由于在构造LOB定位器时遇到困难，连接器不支持通过引用传递XML结果。

以下示例（清单1-6）包含连接器支持的几个常见SQL查询和XQueries。这些查询基于流行的DB2 pureXML入门文章中包含的样本。（请参阅相关主题的细节部分。）注意，对于SQL，SQL / XML DB2连接器的支持，和XQuery不限于这些例子。

清单1.使用简单SQL检索完整的XML文档

select xml_col from table

如果XML列在DataStage作业中用LongVarChar数据类型表示，则连接器将为用户提供一个选择，以选择将该列作为参考还是内联传递到下一个阶段。

清单2.使用XQuery检索完整的XML文档

xquery db2-fn:xmlcolumn('TABLE.COL_NAME')

注意：清单2的结果不能通过引用传递。

清单3.使用XQuery提取XML文档片段

xquery db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client[Address/zip="10011" 
or Address/city="San Jose"]/Address

注意：清单3的结果不能通过引用传递。

清单4.使用XQuery `FLWOR`表达式提取XML文档片段

xquery
for $y in db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client
where $y/Address/zip="10011" or $y/Address/city="San Jose"
return $y/email[1]/text()

注意：清单4的结果不能通过引用传递。

清单5.使用XQuery和嵌入式SQL提取完整的XML文档

xquery db2-fn:sqlquery('select xml_col from table')

注意：清单5的结果不能通过引用传递。

清单6.使用SQL / XML查询提取XML文档片段

select xmlquery('$c/Client/Address' 
passing contactinfo as "c") 
from clients 
where status = 'Gold'
and xmlexists('$c/Client/email' passing contactinfo as "c")

在清单6中，连接器可以内联或通过引用传递结果。

表达式列的前缀

连接器执行查询时，它将尝试使查询返回的列名与在连接器阶段的输出链接上指定的列名匹配。但是，XQuery列名称用数字表示。连接器在它们前面加上一个字符串，可以通过设置连接器源上下文属性Prefix for Expression Columns来配置它们。使用的默认前缀是EXPR 。 XQuery列名称将变为EXPR1 ， EXPR2等。注意这一点很重要，因为连接器会在链接列列表中寻找那些列。

目标上下文：插入/更新/删除XML列

如前所述，DB2连接器可以充当输入数据的目标，并且可以将该数据写入适当的DB2数据库。以下示例（清单7-11）说明了DB2连接器如何在XML列中插入，更新或删除信息。

其中一些示例包含一个保留的DataStage关键字ORCHESTRATE ，该关键字用于将参数传递给该语句。在ORCHESTRATE关键字之后指定的列名称需要与与适当的DataStage链接关联的结果集中的列匹配。

清单7.插入一个XML列

insert into table values(ORCHESTRATE.col1)

列名col1表示在输入链接上指定的列。该列可以是NVarChar或LongNVarChar，可以内联或通过引用传递。

清单8.使用简单SQL更新语句更新完整的XML文档

update clients set 
contactinfo=( xmlparse(document ‘<email>newemail@someplace.com </email> ' ) )
where id = 3227

清单9.使用参数更新XML文档值的一部分

update xmlcustomer
set info = xmlquery('copy $new := $INFO
modify do replace value of $new/customerinfo/phone with $z
return $new' passing cast(ORCHESTRATE.col2 as varchar(15)) as "z")
where cid = ORCHESTRATE.col1

清单9使用了两个参数。 col2的值替换XML文档中现有的/ customerinfo / phone数据； col1的值限制受更新影响的行。

清单10.基于内部XML数据过滤器删除XML记录

delete from clients
where xmlexists ('$c/Client/Address[zip="95116"]'
passing clients.contactinfo as "c")

清单11.使用参数删除XML记录

delete from clients
where xmlexists ('$c/Client/Address[zip=$z]'
passing clients.contactinfo as "c", cast(ORCHESTRATE.col2 as varchar(15)) as z)

清单11使用col2的值作为参数，以基于XML元素（客户机地址的zip的值限制DELETE操作。

DataStage XML包

本系列文章依赖于DataStage XML Pack 2.0为这两种情况提供关键功能。 XML Pack包含三个阶段：XML输入，XML输出和XML转换程序。本系列使用XML Input和XML Output阶段。本文的场景中未使用基于XSLT的XML Transformer阶段。

XML输入阶段

XML输入将XML数据转换为“平面”关系表。使用XML输入，您可以：

使用标准XPath表达式提取数据
使用XML模式并有选择地验证XML数据
使用自定义样式表支持复杂的转换
使用不同的数据集支持多个输出链接
处理错误

图1说明了XML Input阶段完成的简单转换：

图1. XML Input阶段将分层XML数据转换为“平面”表

XML输出阶段

XML输出将表格数据（例如关系表）转换为XML层次结构。使用XML输出，您可以：

使用标准XPath表达式的子集生成XML
从同一输入生成多个文档。
支持XML名称空间
以多种方式配置XML文档生成，例如每输入行一个文档
在磁盘上或通过输出链接生成输出文档
处理错误

图2说明了XML Output完成的简单转换：

图2. XML Output阶段将“平面”表格结构转换为XML层次结构

DB2 pureXML概述

DB2为公司提供了通用的应用程序编程接口和数据库管理平台，用于在表以及XML层次结构中建模的数据。图3说明了DB2如何使公司与传统的关系数据一起管理XML消息和文档。使用pureXML，无需将业务关键的XML数据“切碎”或映射为关系格式。而是可以将XML数据完整地以其本机分层格式存储。应用程序可以轻松访问所需的XML数据，并且不费吹灰之力即可集成XML和关系数据。

图3. DB2的体系结构提供关系和XML数据管理支持

本系列文章中的场景依赖于DB2的pureXML功能，这些功能包括以其本机分层格式对XML数据进行优化存储，并支持以SQL或XQuery语言查询XML数据。

示例开发和运行时环境

为了创建本系列文章中描述的示例方案，我们使用了以下软件：

Information Server 8.1，其中包括DataStage Designer和DataStage Director组件。此外，我们还安装了DataStage 8.1 DB2连接器，该连接器包含在DataStage 8.1的补丁中，称为“连接器汇总补丁1”。因为其中一种方案使用了Transformer Stage，所以我们还安装了必备的C ++编译器。（我们使用了Microsoft Visual Studio .NET2003。）
DB2 9.5企业服务器版。

DataStage和DB2都安装在同一Windows系统上。

样本运营数据包括有关金融投资组合的信息。我们从开放源代码XML交易处理（TPOX）基准的子集中获得了这些数据。具体来说，我们使用的数据代表有关客户帐户及其持股（或投资）的信息。在每种情况下，您将了解有关运营数据和数据仓库设计的更多信息。（有关TPOX的更多信息，请参见“ 相关主题”部分。）

使用pureXML作为仓库的输入

本文描述的场景使用存储在DB2 pureXML中的可操作XML数据作为ETL作业的输入。如图4所示，您希望DataStage提取存储在一个DB2 XML列中的数据，并将该数据映射到DB2数据仓库中的两个表。这些表之一仅包含传统的关系列，而第二个表包含关系列和一个XML列。

图4.存储在DB2 pureXML中的操作数据用作数据仓库的输入

此数据模型代表一种常见的情况，其中XML数据的某些部分经常被“切碎”为关系结构。这些部分代表业务用户可能经常分析和查询的数据。许多商业智能工具都经过优化以支持关系结构，因此将经常查询的XML数据分解为关系列可能非常有效。但是，业务需求会随时间变化，因此管理员可能很难确定应创建哪些关系列。通过在数据仓库中维护完整的XML数据，用户可以立即访问以前未分解为关系格式的重要业务数据。

为了简化本文中的示例场景，我们使用一个DB2数据库（名为“ TPOX”）来存储操作数据和仓库数据。当然，在生产环境中，运营数据和仓库数据将在单独的数据库中管理，通常在单独的服务器上。

由于我们使用DB2来管理运营数据和仓库数据，因此我们可以选择使用内置的DB2技术来执行许多ETL工作。但是，将这项工作转移到DataStage可以最大程度地减少对DB2操作的影响，这是生产环境中的常见目标。此外，DataStage还提供了DB2中不提供的许多转换和清除功能。最后，许多公司需要使用来自异构数据源的数据填充其数据仓库，而DataStage提供了关键服务来帮助他们做到这一点。此类功能在其他地方都有详尽的文档说明，不在本系列文章的讨论范围之内。

设计概述

本文中的示例场景将可操作的XML数据存储在TPOXADMIN.ACCOUNT表中，该表用作此场景的DataStage的源表。 ACCOUNT表包含一个关系列（ID）和一个XML列（INFO）。清单12显示了创建此表的难易程度：

清单12.创建源ACCOUNT表SQL

create table tpoxadmin.account(id int, info xml)

INFO列包含有关该帐户的详细信息，包括其标题，开设日期，工作余额，投资组合持有量以及其他信息。清单13显示了存储在ACCOUNT表中的一个XML文档的一部分。本文随附的DB2脚本包含完整的XML帐户记录集（请参阅下载）。

清单13.存储在ACCOUNT表中的一个XML文档的一部分

<Account id="804130877" xmlns="http://tpox-benchmark.com/custacc">
  <Category>6</Category>
  <AccountTitle>Mrs Shailey Lapidot EUR</AccountTitle>
  <ShortTitle>Lapidot EUR</ShortTitle>
  <Mnemonic>LapidotEUR</Mnemonic>
  <Currency>EUR</Currency>
  <CurrencyMarket>3</CurrencyMarket>
  <OpeningDate>1999-02-20</OpeningDate>
  <AccountOfficer>Soraya Lagarias</AccountOfficer>
  <LastUpdate>2004-02-10T22:33:58</LastUpdate>
  <Balance>
    <OnlineActualBal>896882</OnlineActualBal>
    <OnlineClearedBal>337676</OnlineClearedBal>
    <WorkingBalance>430147</WorkingBalance>
  </Balance>
  . . . 
  <Holdings>
    <Position>
       <Symbol>ZION</Symbol>
       <Name>Zions Bancorporation</Name>
       <Type>Stock</Type>
      <Quantity>1927.719</Quantity>
    </Position>
    . . .    
  </Holdings>
. . . 
</Account>

为了简化测试，目标数据仓库数据库也配置为TPOX。 TPOXADMIN.ACCOUNT的INFO列中的源信息将映射到两个表-DWADMIN.ACCT表包含有关整个帐户的信息，DWADMIN.HOLDINGS表包含有关给定帐户的各种投资（投资组合的信息）。清单14显示了如何定义这些表：

清单14.目标数据仓库表的定义

-- DWADMIN.ACCT contains general information about accounts. 
create table dwadmin.acct (
     id               int        primary key not null, 
     title            varchar(100), 
     currency         char(3), 
     workingbalance   int,  
     totalholdings    int,
     holdingtypes     int,
     officer          varchar(50), 
     datechanged      date,
     timechanged      time, 
     fullrecord       xml 
)

-- DWADMIN.HOLDINGS tracks specific investments (holdings) for a given account 
create table dwadmin.holdings (
   id                 int       references dwadmin.acct on delete cascade,  
   symbol             varchar(10), 
   type               varchar(25), 
   quantity           decimal(12,2) 
)

要了解如何将XML源数据（在TPOXADMIN.ACCOUNT的INFO列中）映射到数据仓库表的各个列中，请参阅表1 。（两个数据仓库表的ID列都是从操作中的ID列的值填充的。）

表1.数据仓库表中每一列的XML源数据

达文		德华控股
栏名	数据源（XPATH或...）	栏名	数据源（XPATH表达式）
标题	/帐户/帐户标题	符号	/帐户/控股/职位/符号
货币	/账户币种	类型	/帐户/控股/职位/类型
工作平衡	/帐户/工作平衡	数量	/帐户/馆藏/位置/数量
官	/帐户/帐户办公室
数据已更改	从/ Account / LastUpdated中提取
时变	从/ Account / LastUpdated中提取
全记录	最初存储在DB2源表中的整个XML文档

如您所料，有几种方法可以为这种情况构建DataStage作业。本文采用了增量开发方法。特别是，初始步骤将指导您创建整个DataStage作业的一部分，以提取，转换数据并将数据加载到DWADMIN.ACCT表中。完成并测试了这部分之后，本文将指导您完成增强的工作，以提取，转换XML数据并将其加载到DWADMIN.HOLDINGS表中。不过，首先，您需要创建适当的DB2表来支持这种情况。

准备DB2数据库环境

本文下载包含一个脚本，该脚本创建必要的操作（源）表和数据仓库（目标）表。它还使用示例XML帐户信息填充操作表。下载并解压缩DSsetup.zip文件，然后打开一个DB2命令窗口并发出以下命令：

清单15.调用本系列文章附带的DB2脚本

db2 –td@ -vf  DSsetup.db2

注：此脚本旨在支持Windows上运行的DB2 9.5服务器。

创建一个并行作业并概述其内容

首先，为需要包括的各个阶段（工作任务）定义一个新的DataStage并行作业，该作业具有占位符。稍后，您将编辑每个阶段以指定适当的属性以成功执行作业。

启动DataStage Designer。
要创建新的并行作业，请在“ 存储库”窗格中右键单击Jobs文件夹，然后选择“ 新建” >“ 并行作业” 。
将两个DB2 Connector阶段添加到并行作业。一个连接器将代表TPOXADMIN.ACCOUNT源表，另一个将代表DWADMIN.ACCT目标表。
1. 在“ 面板”窗格中选择“ 数据库”选项卡。
2. 将DB2 Connector阶段拖动到并行作业。
3. 重复上一步，创建第二个DB2连接器阶段。
4. 将这两个连接器放在并行作业画布的相对两侧。
将XML Input阶段添加到作业。（此阶段会将有关帐户的分层XML数据转换为表格结果结构。）
1. 在“ 面板”窗格中选择“ 实时”选项卡。
2. 找到XML Input阶段并将其拖动到并行作业。
3. 将图标放在第一个DB2 Connector阶段附近。
在工作中添加一个Transformer阶段。（此阶段会将单个XML元素值分成两个值，这两个值将填充目标表中的两个不同列。）
1. 在“ 面板”窗格中选择“ 处理”选项卡。
2. 找到“变形金刚”阶段，然后将此图标拖动到并行作业窗格中。
3. 将图标放在XML Input阶段和最终的DB2连接器之间。
将各个阶段链接在一起。
1. 要将第一个DB2连接器链接到XML输入，请按住鼠标右键，单击DB2连接器，然后将鼠标拖到XML Input阶段。两个阶段之间将出现一个箭头。
2. 将XML输入链接到Transformer阶段。
3. 将Transformer阶段链接到最终的DB2连接器。
如果需要，请使用每个阶段的描述性名称自定义您的工作，并使用标准DataStage设施进行链接。（如果需要，请参阅“ 相关主题”部分，以获得指向DataStage教程和文档的链接。）
验证并行作业设计是否类似于图5 ，该图显示了链接在一起的各个阶段，如步骤6所述：
图5.集成方案第一部分的DataStage作业框架

（查看图5的放大图。）
要保存您的工作，请从主工具栏中单击文件 > 保存，然后为该工作指定您选择的名称。

现在，您已经为DataStage作业的第一部分提供了并行作业框架。

导入表定义

DB2 TPOX数据库已经包含源表和目标表，因此您需要将有关这些表的元数据导入DataStage。

在主工具栏中，选择导入 > 表定义 > 启动连接器导入向导 。
将出现一个新窗口，用于“连接器元数据导入–连接器选择”。选择DB2 Connector（Variant 9.1） ，然后单击Next 。
输入适当的连接详细信息，包括实例类型（DB2），数据库名称（TPOX）以及有效的用户ID和密码。
单击此窗口右上角的“ 测试连接”选项，以验证您可以连接到DB2 TPOX数据库。
成功连接后，单击“ 下一步” ，然后单击“ 确定” 。
接受数据源位置的默认值。其中包括主机名DB2和数据库名TPOX（DB2）。
点击下一步 。
从“ 过滤器”下拉列表中选择TPOXADMIN模式，验证是否选中了“ 包含表”选项，然后单击“ 下一步” 。
从此架构的可用表列表中选择ACCOUNT表。
取消选中所有选项，包括“ XML列作为LOB”选项。您的DataStage作业需要处理和转换XML，因此它将把它视为字符串数据（而不是非结构化的大对象或LOB）。
验证TPOXADMIN.ACCOUNT表已计划导入，然后单击Import 。
出现一个弹出窗口，提示您选择用于元数据导入的文件夹。选择表定义 ，然后单击确定。
重复上述步骤，导入两个数据仓库目标表。
- 在步骤8中，选择DWADMIN模式，而不是TPOXADMIN模式。
- 在步骤9中，选择ACCT和HOLDINGS表。（最初只使用DWADMIN.ACCT表定义，但是如果现在也导入DWADMIN.HOLDINGS表定义，它将节省时间。）
若要确认已成功导入所有必需表的元数据，请在左上角的“ 存储库”窗格中展开“ 表定义”文件夹，并确认存在TPOXADMIN.ACCOUNT，DWADMIN.ACCT和DWADMIN.HOLDINGS的条目。
保存您的工作。

现在您可以编辑作业的每个阶段了。

编辑DB2源表阶段

TPOXADMIN.ACCOUNT表包含仓库所需的操作信息。需要编辑源DB2 Connector阶段的属性，以反映有关此输入表的适当信息。

双击代表输入源的DB2连接器。
出现一个窗口。在“ 属性”选项卡中，指定适当的连接信息。这包括实例类型（DB2），数据库名称（TPOX）以及有效的用户ID和密码，如图6所示：
图6.设置DB2源表的属性
单击窗格右上角的“ 测试”以确认您可以成功连接到数据库。
向下滚动到Properties选项卡的Usage部分，并指定以下设置（也如图6所示）：
- 生成SQL ：是
- 表名称 ：TPOXADMIN.ACCOUNT
- 数组大小 ：1
单击“ 用法”行右侧的“ 查看数据”以确认您可以成功查询该表，然后单击“ 确定”以保存您的设置。
单击“ 列”选项卡，然后选择窗格底部的“ 加载 ”。
将出现一个带有表定义的窗口。选择TPOXADMIN.ACCOUNT表，然后单击OK 。
将出现一个带有表列的窗口。接受选择表中所有列的默认设置。（请注意，INFO列在DB2中作为XML列创建，在此显示为带有NVarCharSQL类型，表示Unicode字符串。这很好。）
单击确定。
为INFO列指定适当的长度。对于此样本数据，5000字节的长度就足够了。
单击确定。
保存您的工作。

源表的DB2连接器阶段已准备就绪。在下一步中，您将定制代表数据仓库目标表的DB2连接器。

编辑DB2目标表阶段

DWADMIN.ACCT表是要从输入源填充的两个数据仓库表之一。如前所述，该表包含多个关系列，用于提供有关财务帐户的特定信息，以及一个pureXML列，其中包含XML格式的完整帐户数据。需要编辑目标DB2 Connector阶段的属性以反映有关此表的适当信息。

要打开代表目标表的DB2连接器，请双击其图标。
在“ 属性”选项卡中，指定适当的连接信息。这包括实例类型（DB2），数据库名称（TPOX）以及有效的用户ID和密码。
测试您是否可以成功连接到数据库。
在“ 属性”选项卡的“ 用法”部分中，指定以下设置，如图7所示：
- 写入方式 ：插入
- 生成SQL ：是
- 表格名称 ：DWADMIN.ACCT
- 数组大小 ：1
- 表操作 ：截断
图7.为一个DB2目标表设置属性
单击查看数据以验证您可以成功查询该表。（第一次运行作业时，此表将为空。）
单击确定保存这些设置。
单击列选项卡，然后选择加载。
选择DWADMIN.ACCT表，然后单击OK 。
接受选择表中所有列的默认设置。（请注意，FULLRECORD列是在DB2中创建为XML列的，它SQL类型为NVarChar。）
单击确定。
为FULLRECORD列指定适当的长度。对于样本数据，5000字节的长度就足够了。
单击确定。
保存您的工作。

定义了DB2源和目标阶段之后，就该开始处理数据了。

编辑XML Input阶段

您需要自定义XML Input阶段，以从TPOXADMIN.ACCOUNT表的INFO列中提取XML数据并生成表格结果集，该结果集最终将用于填充DWADMIN.ACCT表。

要打开XML Input阶段，请双击其图标。
单击输入选项卡。
出现第二组选项卡。单击XML Source子选项卡（如果需要）。
指定XML源列。使用下拉选项卡选择“ 信息”列。
指定XML文档 （而不是URL /文件路径）作为“列”内容。
单击列子选项卡。
验证是否列出了ID和INFO列。
将INFO列的数据类型设置为VarBinary ，并将列长度设置为5000字节。（使用VarBinary可使XML Input阶段以其本机UTF-8格式处理数据，并避免不必要的字符集转换。）
单击主输出选项卡。
出现一组新的选项卡。单击转换设置子选项卡。
验证是否已选中“ 必需的重复元素”设置。这需要表明您打算从包含XML的单个记录中提取多个记录。 XML Input阶段将循环遍历repeat元素，并为该元素的每次出现生成一个输出记录。在接下来的步骤中，您将看到如何指示重复元素的路径。
选择包括名称空间声明复选框。
XML源数据包含一个名称空间，因此您需要在此处声明。输入以下内容：
```
xmlns:ns= "http://tpox-benchmark.com/custacc"
```
单击列子选项卡，然后选择加载。
出现表定义列表。选择DWADMIN.ACCT表。
接受默认设置以加载所有列。
确认已选中确保所有字符列均使用Unicode复选框，然后单击确定。
如下修改此作业阶段的输出结构（以匹配图8的内容）：
1. 为FULLRECORD列指定5000字节的长度。
2. 将TIMECHANGED列的数据类型更改为NVarChar ，其长度为20 。将数据视为字符串将使您能够利用某些内置的DataStage转换功能，这将在不久后看到。
3. 指定DataStage从输入XML数据提取信息时应使用的适当XPATH表达式。对除ID列（从源TPOXADMIN.ACCOUNT表中的关系列中提取）以外的所有列执行此操作。适当的描述设置包括：
  - 标题：/ ns：Account / ns：AccountTitle / text（）
  - 货币：/ ns：帐户/ ns：货币/文本（）
  - 工作平衡 ：/ ns：帐户/ ns：余额/ ns：WorkingBalance / text（）
  - OFFICER ：/ ns：Account / ns：AccountOfficer / text（）
  - DATECHANGED ：/ ns：帐户/ ns：LastUpdate / text（）
  - TIMECHANGED ：/ ns：帐户/ ns：LastUpdate / text（）
  - FULLRECORD ：/ ns：帐户
4. 将TITLE列标识为键。为此，请将TITLE的键值更改为是。这指示DataStage使用此XML元素值作为重复元素标识符。对于每次出现的ns：AccountTitle，该阶段都会产生一个输出记录。换句话说，它将为每个帐户生成一条记录，因为每个帐户都包含一个AccountTitle元素。 In this scenario, other columns could also serve this purpose, including the CURRENCY, WORKINGBALANCE, and OFFICER columns, since all of them are mandatory elements of the Account element. We selected TITLE as the key column for convenience.
Figure 8. Column definitions for output results from the XML Input stage

（查看图8的大图。）

Note: VarChar types with a Unicode extension are equivalent to NVarChar types to the DB2 Connector.
Click on OK and save your work.

You've now programmed the XML Input stage to extract data from a DB2 pureXML column and prepare it for mapping into a DB2 target data warehouse table that is largely relational. Next, you'll customize the Transformer stage so that the data will be modified as needed before it's loaded into the target table.

Edit the Transformer stage

The Transformer stage needs to extract time information from an input string containing timestamp data in order to correctly populate the TIMECHANGED column in the target DWADMIN.ACCT table. DataStage provides a number of built-in functions for manipulating date/time data, performing data type conversions, and so on. You need to edit the Transformer stage to invoke appropriate functions for your data conversion.

To open the Transformer stage, double-click on its icon.
Click on the tool bar's Column Auto-Match icon to instruct DataStage to automatically map columns by name between the input and output links. (You defined the result set for this stage's input link in " Edit the XML Input stage " section. You defined the result set for this step's output link in the " Edit the DB2 target table stage " section.)
单击确定。
Modify the Derivation setting for the TIMECHANGED column of the output link to transform the data as needed. Recall that the input string contains a full timestamp with date and time information, and you want to populate the TIMECHANGED column in the DB2 target table with only a time value.
1. Highlight the appropriate derivation setting, right-click, and select Edit Derivation .
2. A blank pane appears. Use the built-in wizards to select appropriate transformation function calls, or enter the following code:
```
TimestampToTime( StringToTimestamp(AccountOverview.TIMECHANGED, 
"%yyyy-%mm-%ddT%hh:%nn:%ss"))
```
  In case you're curious, the inner function call converts the input string into a timestamp that complies with a specific format. The outer function takes this timestamp and converts it to a time value. For details on these functions or the Transformer stage, see the Related topics section.
Verify that your transformation appears similar to Figure 9 , which illustrates the mapping between the input and output links (generated in Step 2) as well as the derivation that you edited in Step 4:
Figure 9. Modified properties for the Transformer stage

(View a larger version of Figure 9.)
Click on OK .

You've now completed the modifications necessary for the initial portion of your DataStage job. It's time to compile and test your work.

Compile and test your work

Follow the standard DataStage procedures to compile, validate, and run your job. Consult the Related topics section for links to DataStage documentation, if needed. Here's a quick summary:

Press the F7 key, or select the Compile icon from the toolbar. Ensure that the job compiled without error.
Press Ctrl+F5 , or select the Run icon from the toolbar.
Set Warnings to no limit , and click on Run .
DataStage will attempt to execute your job. Launch the DataStage and QualityStage Director to inspect the results of your job.
If needed, switch to the Status page and highlight your job. To do so select View > Status , or click on the Status icon in the toolbar.
Switch to the Log page to review details about the job's execution. To do so, select View > Log , or click on the Log icon on the toolbar. The job should execute successfully, although warning messages may appear.

At this point, your can inspect the data in the target table using standard DB2 tools. Or, if you prefer, you can view the data from the DataStage Designer by doing the following:

Double-click on the DB2 Connector that represents the target table.
Click on the View Data link to display the data in the target table.

Add more stages to populate the final warehouse table

To complete the scenario, you need to extract the appropriate information from the XML Input source and load this data into the DWADMIN.HOLDINGS table. Doing so requires adding two new stages to the existing job.

Select a Copy stage from the Processing section of the palette, and place the Copy stage just below the existing XML Input stage.
Place a new DB2 Connector stage on the palette to the right of the Copy stage.
Link the XML input stage to the new Copy stage, and link the new Copy stage to the new DB2 Connector stage.
If desired, rename your new links and stages in a more descriptive manner.
Verify that your job design appears similar to Figure 10 , which shows the various stages linked together, as described in Step 3:
Figure 10. Revised DataStage job for this scenario

(View a larger version of Figure 10.)

Edit the new DB2 Connector stage:

Specify the appropriate connection information, including the instance type (DB2), database name (TPOX), and a valid user ID and password.
Test the connection.
Specify the appropriate usage information:
- Write Mode : Insert
- Generate SQL : Yes
- Table name : DWADMIN.HOLDINGS
- Table action : Truncate
- Array Size : 1
Select View Data to verify that the specifications are valid.
Click on the Column tab.
Load information from the table definition for DWADMIN.HOLDINGS. Import all columns.
Set the Key value of the ID column to Yes .
Specify the XPATH expressions that DataStage should use to extract information from the input XML data (shown in Figure 11 ). Do this for every column except the ID column. The appropriate description settings include:
- SYMBOL : /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()
- TYPE : /ns:Account/ns:Holdings/ns:Position/ns:Type/text()
- QUANTITY : /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()
  Figure 11. Edited column values of the DB2 Connector for the DWADMIN.HOLDINGS table
  
  （查看图11的大图。）

Edit the existing XML Input stage:

Click on the Output tab and use the drop-down menu beneath the Output Name box to identify the appropriate DataStage link to modify. (In Figure 10 , HoldingInfo is the appropriate link, as it controls the data flow from the XML Input stage to the Copy stage that you just added to the job.)
Click on the Transformation Settings tab.
Verify that the Repetition element required setting is checked.
Select the include namespace declaration check box.
Enter the following XML namespace definition in the white box:
```
xmlns:ns="http://tpox-benchmark.com/custacc"
```
Click on the Columns sub-tab.
Load information for the DWADMIN.HOLDINGS table. Select all columns.
Specify the XPath expressions that DataStage should use to extract information from the input XML data (shown in Figure 12 ). Do this for every column except the ID column. The appropriate description settings include:
- SYMBOL : /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()
- TYPE : /ns:Account/ns:Holdings/ns:Position/ns:Type/text()
- QUANTITY : /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()
Figure 12. Column definitions for output from XML Input stage to new Copy stage

（查看图12的大图。）
Set the Key value for the SYMBOL column to Yes . This causes the stage to produce an output record for each occurrence of the Symbol element. Since Symbol occurs in every Position element, the stage produces a record for each Position element.
Click on OK .

Edit the new Copy stage. (This stage passes information between the XML Input and DB2 Connector stages, each of which associates different meanings with Key values. The XML Input stage uses a Key column to identify the repetition element, which determines how the XML data is generated. When you edited the XML Input stage in the previous section, the work you performed in Steps 3 and 9 identified the Key value and repetition element. The DB2 Connector, of course, uses Key to identify a column that is a database key. The Copy stage allows data to be passed between these stages without introducing any operational conflicts due to the way each stage treats key values. To do so, its input and output links identify different columns as keys.)

From the Output tab, select Mappings.
Click on Auto-Match , and select OK .
If desired, inspect the Key definitions for the columns on the input and output links.
Click on the Input tab, and then click on the Columns sub-tab.
Verify that the SYMBOL column is identified as the Key.
Click on the Output tab, and then click on the Columns sub-tab.
Verify that the ID column is identified as the Key.
Click on OK to finish editing this stage.

Test your revised job:

If you haven't already done so, save your job.
Compile the job.
Verify that there are no errors.
Run the job.
Use DataStage Director to inspect the log and verify that the job finished successfully.
Optionally, inspect the data in the target table by selecting View Data in the stage editor for the target DB2 stage.

摘要

Increased use of XML as a preferred format for data exchange is prompting data architects and administrators to evaluate options for integrating business-critical XML data into their data warehouses. In this first installment of this two-part series , you learned how IBM InfoSphere DataStage can extract and transform XML data managed by DB2 pureXML. In addition, you explored how DataStage can load this data into two tables: one with traditional SQL data types, and one that features both relational and XML columns.

The second part of this article series explores another important scenario: using DataStage to read information from a flat file, convert the data into an XML format, and load this XML data into a data warehouse that contains a table with a DB2 pureXML column.

致谢

Thanks to Stewart Hanna, Susan Malaika, and Ernie Ostic for their review comments on this article.

翻译自: https://www.ibm.com/developerworks/data/library/techarticle/dm-0908datastagepurexml1/index.html