ssis 列转换_SSIS包中的行采样转换和百分比采样转换

ssis 列转换

This article explores Row Sampling Transformations in SSIS and Percentage Sampling Transformations in SSIS packages

本文探讨了SSIS中的行采样转换和SSIS包中的百分比采样转换

介绍 (Introduction)

Sometimes we have a large number of rows in database tables. We want to perform some calculations on data, but we do not require the complete data from the table. You can understand it like a blood test in the medical laboratory. It requires you to give a sample of your blood so that they can perform lab test on it and give you results.

有时我们数据库表中有很多行。 我们想对数据进行一些计算,但是我们不需要表中的完整数据。 您可以像医学实验室的验血一样理解它。 它要求您提供您的血液样本,以便他们可以对其进行实验室测试并获得结果。

Suppose we have customer’s sales record and it contains millions of records. You want to perform aggregations or calculations on those data. If you take complete data, it will consume many system resources and package execution will also take time. We want a mechanism that should fetch 20% records from the table, and it can be used for further work.

假设我们有客户的销售记录,并且其中包含数百万条记录。 您要对这些数据执行汇总或计算。 如果您获取完整的数据,它将消耗许多系统资源,并且包执行也将花费时间。 我们想要一种应该从表中获取20%记录的机制,并且可以将其用于进一步的工作。

Another example you can consider is the lottery system. Suppose you want to select 20 customers from the entire data set randomly.

您可以考虑的另一个示例是彩票系统。 假设您要从整个数据集中随机选择20个客户。

We can use data sampling transformation in SSIS to fulfil our purpose.

我们可以在SSIS中使用数据采样转换来实现我们的目的。

SSIS package provides several useful transformations. In the previous articles such as Term Extraction Transformation in SSIS, An Overview of the LOOKUP TRANSFORMATION in SSIS, we learned such transformations.

SSIS包提供了一些有用的转换。 在以前的文章(例如SSIS中的术语提取转换,SSIS 中的LOOKUP转换概述)中 ,我们学习了这种转换。

SSIS package provides data sampling transformations in SSIS that can solve the problem described earlier. We can see two kinds of data sampling transformations:

SSIS包提供了SSIS中的数据采样转换,可以解决前面所述的问题。 我们可以看到两种数据采样转换:

  • Row Sampling Transformation

    行采样转换
  • Percentage Sampling Transformation in SSIS package

    SSIS程序包中的百分比采样转换

先决条件 (Pre-requisites )

  • SQL Server data tools

    SQL Server数据工具
  • AdventureWorks sample database

    AdventureWorks示例数据库
  • SQL Server integration service

    SQL Server集成服务
  • SSIS category. SSIS类别中的文章进行操作。

Let’s create an SSIS package for data sampling in the SSIS package.

让我们为SSIS包中的数据采样创建一个SSIS包。

The following table in AdventureWorks database contains 19820 rows.

AdventureWorks数据库中的下表包含19820行。

SELECT  [CustomerID]
      ,[PersonID]
      ,[StoreID]
      ,[TerritoryID]
      ,[AccountNumber]
      ,[rowguid]
      ,[ModifiedDate]
  FROM [adventureworks2014].[Sales].[Customer]

In the SSIS Control flow window, add a data flow task and rename it to Data Sampling Transformation in SSIS.

在“ SSIS控制流”窗口中,添加一个数据流任务,并将其重命名为SSIS中的“数据采样转换”。

Data Sampling Transformation in SSIS

Right-click on this data flow task and Edit. It takes you to data flow page. In this page, you can see that we are in this particular data flow task.

右键单击此数据流任务,然后单击“编辑”。 它带您到数据流页面。 在此页面中,您可以看到我们正在执行此特定的数据流任务。

data flow task

Add an OLE DB source and rename the task as appropriate. This task should point to SQL instance and Sales.Customers table in AdventureWorks2014 database.

添加一个OLE DB源,并根据需要重命名该任务。 此任务应指向AdventureWorks2014数据库中SQL实例和Sales.Customers表。

rename data flow task

Right-click on a blank area in the data flow task and click on Add Annotation. Annotation is similar to a text box that does not execute, and we use to print messages that help to understand the SSIS package.

右键单击数据流任务中的空白区域,然后单击“ 添加注释”。 注释类似于不执行的文本框,我们用于打印有助于理解SSIS包的消息。

I specified 19820 rows in this annotation box.

我在此注释框中指定了19820行。

Add annonation

We have prepared the base of the SSIS package in this step. Let’s move forward with data sampling transformations in SSIS.

在此步骤中,我们已经准备了SSIS软件包的基础。 让我们继续进行SSIS中的数据采样转换。

SSIS中的行采样转换 (Row Sampling Transformation in SSIS)

We use Row sampling transformation to retrieve a specified random number of data rows from the source data table. It gives random data every time we execute the SSIS package. You get two outputs from this transformation.

我们使用行采样转换从源数据表中检索指定数量的随机数据行。 每次执行SSIS程序包时,它都会提供随机数据。 您将从此转换中获得两个输出。

  1. Random data based on a specified number of rows

    基于指定行数的随机数据
  2. Rest of the data that is not available in step 1

    步骤1中不可用的其余数据

Add a Row Sampling transformation from the SSIS toolbox and drag the blue arrow from source to transformation, as shown below.

从SSIS工具箱中添加一个行采样转换,并将蓝色箭头从源拖动到转换,如下所示。

Row Sampling transformation in SSIS

Double click on Row Sampling, and it opens the row sampling transformation editor.

双击“行采样”,它将打开行采样转换编辑器。

Row Sampling transformation configuration

We have the following options in this editor window.

在此编辑器窗口中,我们有以下选项。

  • The number of rows: We can specify the number of random rows we want from the transformation. The default value is 10. Let’s modify it to 1000 rows 行数:我们可以指定转换中需要的随机行数。 默认值为10。让我们将其修改为1000行
  • Sample Output name: It shows the name of the output that we get from the transformation as specified by the number of rows parameter. The default name is Sampling Selected Output. Let’s modify it to 示例输出名称:它显示了我们从转换中获得的输出的名称,该数目由rows参数指定。 默认名称是“采样所选输出”。 让我们将其修改为“ Row Sampling Match output 行采样匹配”输出
  • Unselected output name: It is the name of the output that contains data excluded from row transformation. We get the total number of rows in the table – the number of rows specified in this output. Let’s modify the name to 未选择的输出名称:这是包含行转换中排除的数据的输出名称。 我们获得表中的总行数–在此输出中指定的行数。 让我们将名称修改为Excluded data 排除的数据

specify number of rows and output name

Let’s skip the option ‘User the following random seed’ as of now. We will cover it in the latter part of the article.

现在,让我们跳过选项“ 用户以下随机种子” 。 我们将在文章的后半部分介绍它。

Click on Columns, and it shows all available source columns.

单击列,它显示所有可用的源列。

Available input columns

Now, add two SSIS Multicast transformations and rename them as follows.

现在,添加两个SSIS组播转换 ,并按以下方式重命名它们。

  • Multicast – Matched

    组播-匹配
  • Multicast – Unmatched

    组播–无与伦比

Add Multicast transformation

Join the output from Row Sampling to Multicast – Matched, and it opens the input-output selection window. In the output column, select the output – Row Sampling Match output.

将“行采样”的输出连接到“多播-匹配”,它会打开输入-输出选择窗口。 在输出列中,选择输出–行采样匹配输出。

Row Sampling Match output

Similarly, take the second output from Row Sampling and join to Multicast – unmatched transformation. It will automatically take another available output, as shown below.

同样,从行采样中获取第二个输出,然后加入多播-无与伦比的转换。 它将自动获取另一个可用的输出,如下所示。

We added SSIS Multicast operator here to display the data. If you want to insert data into SQL table, configure OLE DB destination as well.

我们在此处添加了SSIS多播运算符以显示数据。 如果要将数据插入SQL表中,请同时配置OLE DB目标。

SSIS package configuration

Right-click on the arrow between Row Sampling and Multicast- Matched and enable data viewer.

右键单击“行采样”和“多播匹配”之间的箭头,然后启用数据查看器。

Enable data viewer

It shows the following symbol on the arrow.

它在箭头上显示以下符号。

data viewer symbol

Press F5 to execute the SSIS package. It opens the data viewer, and you can see the 1000 rows in the output.

按F5执行SSIS包。 它打开数据查看器,您可以在输出中看到1000行。

Execute package and view sample data

Close this data viewer and package execution complete. In the output, we can see that

关闭此数据查看器并完成包执行。 在输出中,我们可以看到

  • Multicast – Matched gets 1000 rows

    组播-匹配获得1000行
  • Multicast – Unmatched gets 18,820 rows

    组播-无与伦比获得18,820行

Successful package execution

Let’s do the following configurations in the data sampling

在数据采样中进行以下配置

  • Number of rows: 10

    行数:10
  • Use single-column AccountNumber

    使用单列帐号

Execute the SSIS package twice and note the output.

执行两次SSIS程序包并记下输出。

First execution:

第一次执行:

First SSIS execution

Second execution:

第二次执行:

Second SSIS execution

You can compare the output in both the executions. In both the executions, it gives random account numbers and it different in both execution. It might also pick certain account number again in second execution in random pick up.

您可以在两个执行中比较输出。 在这两次执行中,它都提供随机帐号,并且两次执行时都不同。 它还可能会在第二次执行中再次随机选择某些帐号。

Suppose we want to get similar records on each execution. It should give us the output as per specified record count, but records should not be random.

假设我们希望在每次执行时获得类似的记录。 它应根据指定的记录计数为我们提供输出,但记录不应是随机的。

In the configured SSIS package, open the properties again for row sampling transformation and set the random seed value to 1. It is recommended only for testing purpose.

在配置的SSIS包中,再次打开属性以进行行采样转换,并将随机种子值设置为1。建议仅将其用于测试目的。

Random data seed configuration

Execute the package again twice and observe the output.

再次执行该程序包两次,并观察输出。

First execution

第一次执行

First SSIS execution

Second execution

第二次执行

Second execution

You get same data in both the executions. It picks the random data at once and does not change data in the next execution.

您在两个执行中都得到相同的数据。 它会立即选择随机数据,并且在下次执行时不更改数据。

SSIS中的百分比采样转换 (Percentage sampling transformation in SSIS)

In the previous section, we discussed the Row Sampling Transformation in SSIS. Percentage sampling configuration is similar to row sampling.

在上一节中,我们讨论了SSIS中的行采样转换。 百分比采样配置类似于行采样。

In row sampling, we specify the number of rows that we want in the output, such as 500 rows or 1000 rows. I Percentage sampling, we specify the percentage of rows. For example, if total incoming rows are 1000 and we specify 10% percentage sample, we get approximately 100 rows in the matched output. Remaining rows get to unmatched row output.

在行采样中,我们指定输出中想要的行数,例如500行或1000行。 在百分比抽样中,我们指定行的百分比。 例如,如果输入的行总数为1000,并且我们指定了10%的采样百分比,则在匹配的输出中将获得大约100行。 剩余的行将到达不匹配的行输出。

Similar to row sampling transformation, it picks random sampling data, and you might get a completely different result set on each execution. We can specify random seed value to get similar data on each execution.

与行采样转换类似,它会选择随机采样数据,并且每次执行时您可能会获得完全不同的结果集。 我们可以指定随机种子值以在每次执行时获取相似的数据。

Let’s do the configuration for Percentage sampling transformation in SSIS package.

让我们在SSIS包中进行百分比采样转换的配置。

  • Remove the row sampling and underlying multicast operators

    删除行采样和基础多播运算符
  • Drag a percentage sampling from the SSIS toolbox and join arrow between source data and percentage sampling

    从SSIS工具箱中拖动百分比采样,并在源数据和百分比采样之间加入箭头
  • In percentage sampling, specify the percentage of rows, output column names

    在百分比采样中,指定行的百分比,输出列名称

Percentage sampling transformation in SSIS
  • SSIS Multicast transformations SSIS组播转换
  • First Multicast transformation gets the desired percentage sample of rows

    第一个多播转换获取所需的行样本百分比
  • Other Multicast transformation gets unmatched rows

    其他多播转换获取不匹配的行

Percentage sampling configuration in SSIS

Execute the SSIS package. In the following screenshot, we can see that percentage sampling transformation in SSIS does the following tasks

执行SSIS包。 在下面的屏幕快照中,我们可以看到SSIS中的百分比采样转换完成了以下任务

  • Total number of rows in source table – 19820

    源表中的总行数– 19820
  • Specify percentage sampling – 5%

    指定百分比抽样-5%
  • First, Multicast gets 1015 rows that are approximately 5% of the sample data

    首先,多播获得1015行,约占样本数据的5%

Output of percentage transformation

Let’s specify the random seed value 1 in the percentage sampling transformation and execute the SSIS package.

让我们在百分比采样转换中指定随机种子值1并执行SSIS包。

Percentage sampling transformation editor

In both executions, it gets similar account number in the output.

在这两次执行中,它在输出中都获得相似的帐号。

First execution

第一次执行

First execution of SSIS package

Second execution

第二次执行

Second execution of SSIS package

结论 (Conclusion)

In this article, we explored data sampling technique – Row Sampling transformation and Percentage Sampling transformation in SSIS package. You can use these transformations to test package against a different set of data and analyze results.

在本文中,我们探讨了数据采样技术– SSIS包中的行采样转换和百分比采样转换。 您可以使用这些转换来针对不同的数据集测试程序包并分析结果。

翻译自: https://www.sqlshack.com/row-sampling-transformations-in-ssis-and-percentage-sampling-transformations-in-ssis-packages/

ssis 列转换

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值