如何使用数据质量服务和SQL Server集成服务清除数据

介绍 (Introduction)

A year or so ago, I worked for an online web grocery software house located in the northern United States. At that time I had my ‘baptismal’ exposure to ‘genuinely dirty data’. Granted most of the data entry was done manually and many times from offshore. The point being that I could not fathom just how many ways there were to spell the brand name of a major cereal manufacturer. Why is this such an issue? The answer is fairly straight forward. Imagine the scenario that you are trying to ascertain the dollar value of breakfast cereals sold in the country from the local supermarket standpoint all the way up to national sales. Imagine this utilizing a SQL Server Multi-dimensional cube. The ‘eagle – eyed’ reader will recognize that the results will not aggregate correctly should our aggregation attributes have a plethora of different ways of being spelt.

大约一年前,我在美国北部的一家在线网络杂货店工作。 那时,我对“真正肮脏的数据”进行了“洗礼”。 当然,大多数数据输入是手动完成的,并且离岸很多次。 关键是我无法理解主要谷物制造商的品牌名称拼写有多少种方法。 为什么会这样呢? 答案很简单。 想象一下您要确定从本地超级市场的​​角度一直到全国销售的早餐谷物的美元价值的情况。 想象一下,利用SQL Server多维多维数据集。 “老鹰眼”的读者会认识到,如果我们的汇总属性具有多种不同的拼写方式,则结果将无法正确汇总。

Yes, we (as other firms) manually fixed the data anomalies, however these anomalies tendered to rear their ugly heads with each new data load.

是的,我们(和其他公司一样)手动修复了数据异常,但是随着每一次新的数据加载,这些异常都会抬起他们的丑陋头。

Enter, Data Quality Services and SQL Server Integration Services and THIS is what we are going to discuss.

输入,数据质量服务和SQL Server集成服务,这就是我们将要讨论的内容。

A final note, in the preparation of this article I felt it necessary to give the reader unfamiliar with the Data Quality Services product, a high level understanding of the processes involved in creating a workable ‘model’. Should you be familiar with Data Quality Services, feel free to skip ahead to the SQL Server Integration Services section below.

最后,在准备本文时,我觉得有必要使读者不熟悉数据质量服务产品,对创建可行的“模型”所涉及的过程有较高的了解。 如果您熟悉数据质量服务,请随时跳到下面的“ SQL Server集成服务”部分。

入门 (Getting Started)

For the sake of simplicity, in our little example, we shall be working only with the manufacturers of certain products. Naturally data errors crop up in a plethora of data fields, however I believe in the adage of keeping things as simple as possible, in a paper of this sort.

为了简单起见,在我们的小示例中,我们将仅与某些产品的制造商合作。 自然地,数据错误会出现在大量数据字段中,但是,我相信在此类论文中将事情保持尽可能简单的说法。

Our first task will be to get the current data cleaned up AND THEN make it possible for Data Quality Services to ‘use its magic’ to clean up new data (on its own) going forward. This requires the construction of a ‘Knowledge Base’ and a ‘Data Quality Services Project’.

我们的首要任务是清理当前数据,然后使数据质量服务“自行使用”清理未来的新数据。 这需要构建“知识库”和“数据质量服务项目”。

We shall then create a SQL Server Integration Services package which will be run daily to place correct data into our database and send bad data to our Business Analysts and Data Stewards to be fixed/corrected for the next day’s run.

然后,我们将创建一个SQL Server Integration Services软件包,该软件包将每天运行,以将正确的数据放入我们的数据库中,并将不良数据发送给我们的业务分析师和数据管理员,以便在第二天的运行中进行修复/更正。

安装数据质量服务 (Installing Data Quality Services)

Data Quality Services is available to the Business Intelligence and Enterprise versions of SQL Server 2012 and SQL Server 2014. Should you wish to experiment with the product, it is also available via the Developer Edition. The important point being to let the SQL Server installation process know that you wish to install Data Quality Services (DQS) when you install your SQL Server instance.

数据质量服务可用于SQL Server 2012和SQL Server 2014的商业版和企业版。如果您想试用该产品,也可以通过开发人员版使用。 重要的一点是让SQL Server安装过程知道您希望在安装SQL Server实例时安装数据质量服务(DQS)。

You thought that you were finished, right? Think again! We must now physically install the server portion on our instance. Simply select Programs, SQL Server 2012, Data Quality Services, and Data Quality Server Installer. The process executes in a command window and once complete you are ready to go. (See the screen dump below).

你以为你完蛋了吧? 再想一想! 现在,我们必须在实例上实际安装服务器部分。 只需选择程序,SQL Server 2012,数据质量服务和数据质量服务器安装程序。 该过程在命令窗口中执行,完成后就可以开始了。 (请参见下面的屏幕转储)。

Locating data quality server installer

We now call upon the Data Quality Service client and begin our journey.

现在,我们呼吁数据质量服务客户开始我们的旅程。

Starting data quality service client

After having selected the ‘Data Quality Client’, the work screen that we shall be using for this portion of the paper, will appear (See below).

选择“数据质量客户端”后,将出现我们将在本文的此部分使用的工作屏幕(请参见下文)。

Data quality service client work screen

构建我们的第一个知识库(创建知识库的三个步骤中的第一个) (Constructing our first Knowledge Base (the first of three steps to create our knowledge base))

In order for Data Quality Services to understand a bit about our data AND to use that knowledge about our data on future loads, we must build/construct a ‘Knowledge Base’. Please note that once complete, a Knowledge Base is similar to a .NET object and therefore can be ‘inherited’ in any subsequent new knowledge base.

为了使Data Quality Services能够对我们的数据有所了解并在将来的负载中使用有关我们的数据的知识,我们必须构建/构建“知识库”。 请注意,知识库一旦完成,便类似于.NET对象,因此可以在任何后续的新知识库中“继承”。

N.B. The results of the activities that we shall be performing below will be ‘stored’ in a special SQL Server Database called DQS_MAIN, which is created by the Data Quality Services server installation.

注意:我们将在下面执行的活动的结果将被“存储”在一个称为DQS_MAIN的特殊SQL Server数据库中,该数据库是由Data Quality Services服务器安装创建的。

Let us get going!

让我们开始吧!

I first click the ‘New Knowledge Base’ Option from the left hand menu. (See above)

我首先单击左侧菜单中的“新知识库”选项。 (往上看)

The following screen appears:

出现以下屏幕:

Data quality service naming and description

I have taken the liberty of naming the Knowledge Base and simply click on ‘Next’.

我可以随意命名知识库,只需单击“下一步”即可。

We first wish to create a Domain. This domain will contain all of our Manufacturer related data.

我们首先希望创建一个域。 该域将包含我们所有与制造商相关的数据。

Creating domain in domain management

I now click OK to accept and the following screen is brought into view.

我现在单击“确定”接受,并显示以下屏幕。

Final screen after domain is created

“培训”我们的知识库,或者让乐趣开始!! (‘Training’ our knowledge base OR let the Fun Begin!!)

The astute reader will note that there are 5 main tabs in the screen dump above. The ‘Domain Properties’ tab is shown.

精明的读者会注意到上面的屏幕转储中有5个主要选项卡。 显示“域属性”选项卡。

We shall not be discussing the ‘Reference Data’ tab, however it is used to link to the cloud to obtain reference data such as telephone numbers, street addresses etc. from third party vendors.

我们不会在讨论“参考数据”选项卡,但是它用于链接到云以从第三方供应商那里获取参考数据,例如电话号码,街道地址等。

As with any process we must ensure that we have a clean set manufacturer data as our ‘Master’ manufacturer list. Often this takes a few days to construct.

与任何过程一样,我们必须确保拥有完整的制造商数据作为“主”制造商列表。 通常,这需要几天的时间来构建。

Although we shall be looking at manufacturers, in reality one would really want to include ‘products’, financial data etc.

尽管我们将关注制造商,但实际上,人们确实希望包括“产品”,财务数据等。

载入我们的主要制造商数据 (Loading our master Manufacturer data)

To load our master list we select the ‘Domain Values’ tab and select ‘Import Values’ (See below).

要加载主列表,我们选择“域值”标签,然后选择“导入值”(见下文)。

Loading master list from domain values

The plot now thickens!!! The name of the product is ‘SQL Server’ however, guess what!!! The data for our master list must be in a spreadsheet as YOU CANNOT load our master data from a SQL Server table. Go figure!! As a BTW, this point has been raised with Microsoft.

现在情节变厚了!!! 该产品的名称是“ SQL Server”,但是,请猜!!! 我们的主列表中的数据必须是在电子表格中,你不能从SQL Server表加载我们的主数据。 去搞清楚!! 作为BTW,Microsoft已提出了这一点。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值