ssis组件_用于SSIS的Melissa Data Quality免费组件

ssis组件

In this article, we will talk briefly about data quality in SQL Server. Then, we will give a brief overview of the Melissa Data Quality for SQL Server Integration Services (SSIS), and we will demonstrate the components available in the community edition.

在本文中,我们将简要讨论SQL Server中的数据质量。 然后,我们将简要概述SQL Server集成服务(SSIS)的Melissa数据质量,并演示社区版中可用的组件。

介绍 (Introduction)

In general, data quality is the level of how much data fits its serving context. Enhancing the data quality is very critical since poor quality leads to inaccurate reporting and results in wrong decisions and inevitably economic damages. For this reason, each data management system provided a bunch of tools used to improve the data quality level.

通常,数据质量是多少数据适合其服务上下文的级别。 数据质量的提高非常关键,因为质量差会导致报告不准确,并导致错误的决策和不可避免的经济损失。 因此,每个数据管理系统都提供了一堆用于提高数据质量级别的工具。

For SQL Server, many technologies can be used to enhance data quality:

对于SQL Server,可以使用许多技术来提高数据质量:

SQL Server数据质量服务(DQS) (SQL Server Data Quality Services (DQS))

Data Quality Services is a knowledge-driven data quality feature developed by Microsoft and released in SQL Server 2012. It can be installed from the SQL Server installation, and it provides different services, such as building a knowledge base, data de-duplication, standardization.

数据质量服务是由Microsoft开发并在SQL Server 2012中发布的知识驱动的数据质量功能。可以从SQL Server安装中安装它,并且它提供不同的服务,例如构建知识库,重复数据删除,标准化。

To learn more about this feature, you can refer to the following articles:

要了解有关此功能的更多信息,可以参考以下文章:

使用众所周知的Microsoft SSIS组件 (Using the well-known Microsoft SSIS Components)

SQL Server Integration Services provides a bunch of components that can be used to assess and enhance data quality. These operations can be performed at the control flow level, such as data profiling and validation, or the data flow level using fuzzy lookups, conditional splits, derived columns, script component, and others.

SQL Server Integration Services提供了一堆可用于评估和增强数据质量的组件。 这些操作可以在控制流级别上执行,例如数据概要分析和验证,或者使用模糊查找条件拆分派生列脚本组件等在数据流级别上执行。

编写SQL语句 (Writing SQL Statements)

One of the most popular data cleaning approaches is implementing your own logic using SQL statements, which is known as data wrangling. SQL Server provided a bunch of system functions that can be used to improve data quality.

最受欢迎的数据清理方法之一是使用SQL语句实现自己的逻辑,这称为数据整理。 SQL Server 提供了许多系统功能,可用于提高数据质量

使用第三方组件 (Using third-party components)

One of the most beautiful things in the Visual Studio IDE is that it is allowed to develop third-party components and integrate them within Microsoft products such as SSIS. Many companies developed third-party SSIS components such as CDATA, Kingwaysoft, and COZYROC.

Visual Studio IDE中最漂亮的功能之一是允许开发第三方组件并将其集成到Microsoft产品(例如SSIS)中。 许多公司开发了第三方SSIS组件,例如CDATA,Kingwaysoft和COZYROC。

Regarding data quality, one of the most popular products in the market is Melissa data quality components for SQL Server.

关于数据质量,市场上最受欢迎的产品之一是用于SQL Server的Melissa数据质量组件。

Melissa SQL Server的数据质量 (Melissa Data Quality for SQL Server)

Melissa data quality tools are a set of SSIS components that are used to clean and enrich data during the data transfer or integration process. Two editions are available:

Melissa数据质量工具是一组SSIS组件,用于在数据传输或集成过程中清理和丰富数据。 有两个版本

  1. Enterprise edition: Commercial, contains a wide variety of data quality components and online services
  2. 企业版 :商业 ,包含各种数据质量组件和在线服务
  3. Community edition: Free, but only a few components are available (check the link above) 社区版 :免费,但只有少数组件可用(请查看上面的链接)

In this article, we will be talking about the community edition, and we will briefly illustrate its components.

在本文中,我们将讨论社区版本,并简要说明其组成部分。

下载Melissa数据质量社区版 (Download Melissa data quality community edition)

To download the Melissa data quality community edition, you should navigate to the SQL Server editions page. Then, request a demo by filling the form located on the left side of the page. And make sure to select the community edition.

要下载Melissa数据质量社区版本,您应该导航到SQL Server版本页面 。 然后,通过填写页面左侧的表格来请求演示。 并确保选择社区版本。

Requesting Melissa data quality community edition

After requesting the demo, you will receive an email that contains a link for the web installer with a community license key.

请求演示后,您将收到一封电子邮件,其中包含带有社区许可证密钥的Web安装程序的链接。

Received email from Melissa

Now, you should download the web installer from the link you received. When finished, you should enter the Melissa license key during installation.

现在,您应该从收到的链接下载Web安装程序。 完成后,您应该在安装过程中输入Melissa许可证密钥。

Melissa data quality license information form within installation

When the installation is done, the Melissa data quality components should appear within the SSIS toolbox (data flow level).

安装完成后,Melissa数据质量组件应出现在SSIS工具箱(数据流级别)中。

Melissa data quality components within SSIS toolbox

If you add any of those components to the data flow, you will see the following notification every time you will try to open its editor.

如果将这些组件中的任何一个添加到数据流,则每次尝试打开其编辑器时都会看到以下通知。

Community edition notification

As mentioned in the Melissa SQL Server editions page, few features are available in the community edition:

如Melissa SQL Server版本页面中所述,社区版本提供的功能很少:

  1. Contact Verify component: Only address parsing, name parsing, email correction, and phone formatting operations can be performed 联系人验证组件 :只能执行地址解析,名称解析,电子邮件更正和电话格式化操作
  2. Profiler: Max 50000 records limit 探查器 :最多50000条记录
  3. MatchUp: Max 50000 records limit 对决:最多50000条记录的限制

Note: to run examples, we exported a flat-file from AdventureWorks2017 database, using the following SQL Statement:

注意:要运行示例,我们使用以下SQL语句从AdventureWorks2017数据库导出了平面文件:

SELECT [BusinessEntityID]
      ,[Title]
      ,[FirstName]
      ,[MiddleName]
      ,[LastName]
      ,[Suffix]
    ,REPLACE(LTRIM(RTRIM(ISNULL([FirstName],'') + ' ' + ISNULL([MiddleName],'') + ' ' + ISNULL([LastName],''))),'  ',' ') as [Name]
      ,[JobTitle]
      ,[PhoneNumber]
      ,[PhoneNumberType]
      ,[EmailAddress] + '.' as [EmailAddress]
      ,[EmailPromotion]
      ,[AddressLine1]
      ,[AddressLine2]
      ,[City]
      ,[StateProvinceName]
      ,[PostalCode]
      ,[CountryRegionName]
  FROM [AdventureWorks2017].[HumanResources].[vEmployee]
  • Note: We concatenated first, middle and last name to test the name parser. Besides, we added a “.” at the end of the email address to test the email correction feature.

    注意: 我们将名字,中间名和姓氏连接起来以测试名称解析器。 此外,我们添加了“。” 电子邮件地址的末尾以测试电子邮件更正功能。

联系验证组件 (Contact Verify Component)

As mentioned before, there are only four features of the contact verify component available in the community edition:

如前所述,社区版中的联系人验证组件只有四个功能:

  1. Name parsing: This feature is used to split a full name filed into first, middle, and last name fields. Also, it extracts additional information as title, prefix, and suffix. In the community edition, we are able only to extract the last name 名称解析 :此功能用于将全名分为名字,中间名和姓氏字段。 此外,它还提取其他信息作为标题,前缀和后缀。 在社区版中,我们只能提取姓氏
  2. Address parsing: This feature is used to extract additional information from the address field, such as the street name, suffix, mailbox name, and others 地址解析 :此功能用于从地址字段中提取其他信息,例如街道名称,后缀,邮箱名称等
  3. Phone formatting: This feature is used to change the phone number formatting 电话格式 :此功能用于更改电话号码格式
  4. Email correction: This feature is used to remove meaningless characters from an email address 电子邮件更正:此功能用于从电子邮件地址中删除无意义的字符

To test this component, we create a new SSIS project and add the following components:

为了测试此组件,我们创建一个新的SSIS项目并添加以下组件:

  • Flat File Connection Manager: used to establish a connection with the flat file we generated from the AdventureWorks2017 database 平面文件连接管理器 :用于与我们从AdventureWorks2017数据库生成的平面文件建立连接
  • OLE DB Connection Manager: used to establish a connection with Tempdb (we will use it as destination) OLE DB连接管理器 :用于与Tempdb建立连接(我们将其用作目标)
  • Data flow task: where we will add the following components: 数据流任务:我们将在其中添加以下组件:
    • Flat File Source: read from the flat file connection manager 平面文件源:从平面文件连接管理器读取
    • MD Contact Verify: Melissa contact verify component MD联系人验证: Melissa联系人验证组件
    • OLE DB Destination: where data will be loaded OLE DB目标:数据将加载到的位置

To configure the Contact Verify component, first, we have to specify the Melissa data directory. In the Contact Verify editor, go to “File > Advanced Configuration”.

要配置联系验证组件,首先,我们必须指定Melissa数据目录。 在联系人验证编辑器中,转到“文件>高级配置”。

Opening advanced configuration

Make sure that the data file path is set to “C:\Program Files\Melissa DATA\SQT\Data”, which is the default data file path.

确保数据文件路径设置为“ C:\ Program Files \ Melissa DATA \ SQT \ Data”,这是默认的数据文件路径。

Advanced configuration form

Now, we will first configure the name parsing feature. In the contact verify editor, we select open the “Name” tab page. Then, we should specify the input Name column and the output Last name column, as shown in the image below.

现在,我们将首先配置名称解析功能。 在联系人验证编辑器中,我们选择打开“名称”标签页。 然后,我们应该指定输入“名称”列和输出“姓氏”列,如下图所示。

Name parsing tab page

Note that even first name and middle name output column are specified by default, they will not generate any data in the community edition. Besides, “Name 2” columns generate data if two names exist in the name field.

请注意,即使默认情况下也指定了名字和中间名输出列,它们在社区版中都不会生成任何数据。 此外,如果名称字段中存在两个名称,则“名称2”列将生成数据。

Next, we should select the “Address” tab page to configure address parsing. Then, we should specify all available input columns, as shown in the image below.

接下来,我们应该选择“地址”标签页来配置地址解析。 然后,我们应该指定所有可用的输入列,如下图所示。

Address parsing input columns

Now, we should press on the “Additional Output Columns” button to specify the output columns generated.

现在,我们应该按“其他输出列”按钮以指定生成的输出列。

Parsed address columns

You will note that all properties related to the enterprise edition are disabled.

您将注意到与企业版相关的所有属性均被禁用。

Next, we must select the “Phone/Email” tab page to configure the phone formatting and the email correction feature. As shown below, we should specify the input phone and email columns, the output columns, and the desired phone format.

接下来,我们必须选择“电话/电子邮件”标签页来配置电话格式和电子邮件更正功能。 如下所示,我们应该指定输入电话和电子邮件列,输出列以及所需的电话格式。

Phone/Email configuration

Next, we must select the “Pass-Through Columns” tab page, to specify what are the columns in the input buffer we need to add to the output buffer.

接下来,我们必须选择“通过列”选项卡页面,以指定需要添加到输出缓冲区的输入缓冲区中的列。

Pass-through columns

The Contact Verify component allows adding conditional filters to the generated output, which is not supported in the community edition. You can check that in the “Output filter” tab page, where you can only change the output name.

联系人验证组件允许将条件过滤器添加到生成的输出中,社区版本不支持此条件。 您可以在“输出过滤器”选项卡页面中进行检查,在该页面中只能更改输出名称。

Output filter

After configuring the MD Contact Verify component, we create a new destination table from the OLE DB destination component using by clicking on the “New” button.

配置MD联系人验证组件之后,我们通过单击“新建”按钮从OLE DB目标组件创建新的目标表。

Creating a new destination table

In the end, the data flow task should look like the following:

最后,数据流任务应如下所示:

Data flow task screenshot

After executing the package, we can see the component impact from the result table, as shown in the following screenshots:

执行完程序包后,我们可以从结果表中看到组件的影响,如以下屏幕快照所示:

Name parsing result

Address parsing result

Phone formatting and email correction result

探查器组件 (Profiler component)

The second Melissa data quality component is MD Profiler. It is a data profiling component similar to the SSIS data profiling task. This component is simple; you should select the input, pass-through, and result columns. And each profile data is generated within a separate output, as shown in the screenshots below. Besides, you can perform some analysis on how the data processing is complete and save the profile to an external file.

第二个Melissa数据质量组件是MD Profiler。 它是一个数据分析组件,类似于SSIS数据分析任务。 这个组件很简单; 您应该选择输入,传递和结果列。 每个配置文件数据都在单独的输出中生成,如下面的屏幕快照所示。 此外,您可以对数据处理的完成方式进行一些分析,并将配置文件保存到外部文件中。

Select input columns

Select needed analysis

Configure profile output

The generated data profile outputs

Linking profile output to a destination

Even if this component is mentioned within the available features of the community edition. It still not working correctly since it may not accept a community License key.

即使在社区版的可用功能中提到了此组件,也是如此。 由于它可能不接受社区许可证密钥,因此仍然无法正常工作。

比赛组件 (Matchup component)

The third Melissa data quality free component is the Matchup component. This component is similar to the SSIS lookup transformation but with a de-duplication feature. De-duplication is performed based on match codes ruleset. In the community edition, only nine-match codes are available.

第三个Melissa数据质量免费组件是Matchup组件。 该组件类似于SSIS查找转换,但具有重复数据删除功能。 重复数据删除是根据匹配代码规则集执行的。 在社区版中,只有九个匹配的代码可用。

Available Matchcode

To perform lookups, you should add a data source and link it to the Lookup component input as shown below.

要执行查找,您应该添加一个数据源并将其链接到Lookup组件输入,如下所示。

Selecting Matchup component input type

结论 (Conclusion)

In this article, we talked briefly about data quality and how to improve it in SQL Server Integration Services (SSIS). We illustrated the Community edition of Melissa data quality and demonstrated the available components; Contact Verify was fully explained while we didn’t provide much information on Profiler and Matchup since they need a separate article. Based on the demonstration, community edition is only used for demonstration while we should buy the enterprise edition since it contains much more powerful tools that we may need at the enterprise level.

在本文中,我们简要讨论了数据质量以及如何在SQL Server Integration Services(SSIS)中提高数据质量。 我们举例说明了社区版的Melissa数据质量,并演示了可用的组件; 尽管我们没有提供有关Profiler和Matchup的大量信息,但已充分解释了Contact Verify,因为它们需要单独的文章。 基于演示,社区版仅用于演示,而我们应该购买企业版,因为它包含我们在企业级可能需要的功能更强大的工具。

翻译自: https://www.sqlshack.com/melissa-data-quality-free-components-for-ssis/

ssis组件

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值