持续集成最佳实践_集成服务性能最佳实践–数据流优化

持续集成最佳实践

In this article, we’ll present a couple of common best practices regarding the performance of Integration Services (SSIS) packages. These are general guidelines for package design and development which guarantee a good performance in most use cases. However, there are always exceptional cases and outliers. The mantra of “measure twice, cut once” also applies here. Thoroughly test any changes to your packages to conclude that a change made a positive effect. This means not only running the in the designer, but also on the server. Also, this article is not an exhaustive list of all possible performance improvements for SSIS packages. It merely represents a set of best practices that will guide you through the most common development patterns.

在本文中,我们将介绍一些有关Integration Services(SSIS)包性能的常见最佳实践。 这些是包装设计和开发的一般准则,可保证在大多数使用情况下均具有良好的性能。 但是,总是有例外情况和异常情况。 “两次测量,一次切割”的口号在这里也适用。 彻底测试您的包装的任何更改,以得出更改产生了积极影响的结论。 这意味着不仅要在设计器中运行,还要在服务器上运行。 此外,本文也不详尽列出SSIS包所有可能的性能改进。 它仅代表了一组最佳实践,可以指导您完成最常见的开发模式。

Let’s get started with data flow performance.

让我们开始了解数据流性能。

#1在源组件中使用SQL语句 (#1 Use a SQL statement in the source component)

Well, this only applies of course if your source supports SQL. For a flat file, you just need to select the columns you need. The idea behind this statement is that when you write a SQL statement that will already optimize the data for the SSIS data flow:

好吧,这当然仅在您的源支持SQL的情况下适用。 对于平面文件,您只需要选择所需的列即可。 该语句背后的想法是,当您编写一条已经为SSIS数据流优化数据SQL语句时:

  • You only select the columns that you need

    您只需选择所需的列
  • You already perform the necessary data type conversions using CAST or CONVERT

    您已经使用CAST或CONVERT执行了必要的数据类型转换
  • You can already perform complex logic such as joining different tables together

    您已经可以执行复杂的逻辑,例如将不同的表连接在一起
  • You can filter the data

    您可以过滤数据
  • You can sort the data if needed

    您可以根据需要对数据进行排序

Whatever you do, don’t use the dropdown box to select a table. There’s no logic you can add, it will select all columns and all rows of the table and it’s actually slower than using a SQL statement (as it uses OPENQUERY behind the scenes).

无论您做什么, 都不要使用下拉框选择表格。 您可以添加逻辑,它将选择表的所有列和所有行,并且实际上比使用SQL语句要慢(因为它在后台使用OPENQUERY)。

Change the Data Access Mode to SQL Command or SQL Command from variable. The second option is useful if you want to dynamically build your query using expressions.

数据访问模式 从变量更改为SQL命令SQL命令 。 如果要使用表达式动态构建查询,则第二个选项很有用。

#2将尽可能多的行放入缓冲区 (#2 Get as many rows as you can into a buffer)

The SSIS data flow uses memory buffers to manage the data flowing through the data flow. It’s very important that you can get as many rows into one single buffer. Imagine a line of people passing buckets to put out a fire. The more water you can get into a bucket, the quicker you can put out a fire. To achieve this goal, you can take two actions:

SSIS数据流使用内存缓冲区来管理流经数据流的数据。 可以将多个行放入一个缓冲区非常重要。 想象有一群人经过水桶灭火。 桶中注入的水越多,灭火的速度就越快。 要实现此目标,您可以采取以下两项措施:

  • Only select the columns you need (that’s why #1 is important). If you bring unnecessary columns into the data flow, you waste precious buffer space.

    只选择您需要的列(这就是为什么#1很重要)。 如果将不必要的列带入数据流,则会浪费宝贵的缓冲区空间。
  • Keep the columns as small as possible. If the total row length is smaller, you can fit more rows into a buffer. Try converting the columns to a smaller size in the source if possible (see also #1). The data flow takes the maximum size of each column to calculate the total row length. For example, if a column has the size VARCHAR(4000), the data flow will assume a length of 4000 bytes, even if the actual rows contain much less data.

    使列尽可能小。 如果总行长度较小,则可以在缓冲区中容纳更多行。 如果可能,请尝试将列中的列转换为较小的尺寸(另请参阅#1)。 数据流采用每列的最大大小来计算总行长。 例如,如果一列的大小为VARCHAR(4000),则即使实际行包含的数据少得多,数据流也将假定为4000字节的长度。

#3不要使用默认的缓冲区设置 (#3 Don’t use the default buffer settings)

While we’re on the topic of buffers, let’s look at some data flow properties:

当我们讨论缓冲区的主题时,让我们看一下一些数据流属性:

  • DefaultBufferSize: the default size of a buffer which is set to 10MB.

    DefaultBufferSize:缓冲区的默认大小,设置为10MB。
  • DefaultBufferMaxRows: the default number of rows the data flow will try to put in a buffer. This is set to 10,000 rows.

    DefaultBufferMaxRows:数据流将尝试放入缓冲区的默认行数。 设置为10,000行。

When calculating the buffer size, the data flow uses these two properties. If one property is reached, the size of the buffer is not further enlarged. For example, if the row size is 1MB, the data flow wil put 10 rows into the buffer, since the maximum buffer size will then be met. If the row size is only 1KB, the data flow will use the maximum of 10,000 rows.

在计算缓冲区大小时,数据流使用这两个属性。 如果达到一种属性,则缓冲区的大小不会进一步扩大。 例如,如果行大小为1MB,则数据流将在缓冲区中放入10行,因为此时将满足最大缓冲区大小。 如果行大小只有1KB,则数据流将最多使用10,000行。

These two default settings haven’t been changed since the release of SSIS in 2005. Apparently, a time where memory was still crazy expensive. The first thing you’ll want to do is set these properties to a much higher value. There’s no “golden” number that will guarantee optimal performance all the time. Sometimes a smaller buffer is better if the source is really slow, sometimes a bigger buffer is better if you don’t have many columns and a very fast source. Keep in mind that buffers can be too big as well. Think about the line of people passing buckets. If the buckets are too big, it takes too long to fill them with water and the fire still rages on. In reality, it takes some testing to find the optimal settings for your data flow. Typically I run my package first with the default settings and I monitor the execution time. Then I enlarge the buffers – to 30MB and 50,000 rows for example – and see what effect it has. And then I try to determine if the package is better off with even larger buffers, or if they need to be a bit smaller.

自2005年SSIS发行以来,这两个默认设置没有更改。显然,那时候内存仍然非常昂贵。 您要做的第一件事就是将这些属性设置为更高的值。 没有“黄金”数字可以确保始终保持最佳性能。 如果源真的很慢,则有时较小的缓冲区会更好,而如果您没有很多列和非常快的源,则有时较大的缓冲区会更好。 请记住,缓冲区也可能太大。 考虑一下人们通过水桶的路线。 如果水桶太大,则将水填满的时间太长,大火仍在蔓延。 实际上,需要进行一些测试才能找到适合您的数据流的最佳设置。 通常,我首先使用默认设置运行程序包,然后监视执行时间。 然后,我将缓冲区扩大到30MB和50,000行,例如,看看它有什么作用。 然后,我尝试确定使用更大的缓冲区是否会使包装更好,或者是否需要更小一些。

Luckily, all this testing can now largely be avoided by one new property introduced in SQL Server 2016: AutoAdjustBufferSize. When this property is set to True, it will automatically adjust the size of the buffer so that the DefaultBufferMaxRows setting is met. For example, if you have set it to 40,000 rows, the data flow will automatically scale the buffer size so that the buffers will contain 40,000 rows. Unfortunately, this property is set to False by default.

幸运的是,现在可以通过SQL Server 2016中引入的一个新属性在很大程度上避免所有这些测试: AutoAdjustBufferSize 。 当此属性设置为True时,它将自动调整缓冲区的大小,以便符合DefaultBufferMaxRows设置。 例如,如果将其设置为40,000行,则数据流将自动缩放缓冲区大小,以便缓冲区将包含40,000行。 不幸的是,默认情况下此属性设置为False。

In conclusion, make those buffers bigger. This can really have a drastic impact on the performance.

总之,使这些缓冲区更大。 这确实会对性能产生巨大影响。

#4避免阻止转换 (#4 Avoid blocking transformations)

Even if you followed the previous tips to the letter, performance can still be terrible if you use a blocking transformation. These are the types of transformations in a data flow:

即使您遵循了前面的提示,但如果使用阻塞转换,性能仍然会很糟糕。 这些是数据流中转换的类型:

  • Non-blocking. The buffers are unchanged and “flow” through the transformation. The data flow performs the transformation on a memory buffer and moves on to the next. Examples are: derived column, data type conversion, conditional split … 不阻塞 。 缓冲区保持不变,并且在整个转换过程中“流动”。 数据流在存储缓冲区上执行转换,然后继续进行下一个转换。 例如:派生列,数据类型转换,条件拆分…
  • Semi-blocking. The buffer can change in size; either in the number of columns or in the number of rows. However, there is no blocking behavior. Examples are the Union All, Merge and Merge Join. 半封堵 。 缓冲区的大小可以改变; 无论是列数还是行数。 但是,没有阻塞行为。 例如“全部联合”,“合并”和“合并联接”。
  • Blocking. The bane of SSIS performance. These transformations need to read 阻塞 。 SSIS性能的祸根。 这些转换需要读取all the buffers before they can output even one single buffer. This can lead to memory pressure which causes the SSIS package to spill to disk. In other words, your package runs now for hours instead of minutes or seconds. Examples are Sort and Aggregate. 所有缓冲区,才能输出一个缓冲区。 这可能会导致内存压力,从而导致SSIS程序包溢出到磁盘。 换句话说,您的程序包现在可以运行几个小时,而不是几分钟或几秒钟。 示例是排序和汇总。

All transformations can also be divided into two categories: synchronous and asynchronous. Synchronous means the buffer doesn’t change in size at all. All non-blocking transformations are synchronous. Sometimes this isn’t obvious: the Multicast component creates multiple output paths, which seems like multiple buffers are being created. However, behind the scenes it’s still the same memory buffer. Asynchronous components do change the size of the buffer. Columns can be added or removed. Rows can be added or removed as well. All blocking and semi-blocking transformations are asynchronous.

所有转换也可以分为两类:同步和异步。 同步意味着缓冲区的大小完全不变。 所有非阻塞转换都是同步的。 有时这并不明显:“多播”组件创建了多个输出路径,似乎正在创建多个缓冲区。 但是,在后台它仍然是相同的内存缓冲区。 异步组件确实会更改缓冲区的大小。 可以添加或删除列。 行也可以添加或删除。 所有阻塞和半阻塞转换都是异步的。

The rule is simple: avoid asynchronous transformations. In most cases, these are blocking anyway. The Union All has the benefit of the doubt: it’s the least bad asynchronous transformation. However, if you can design your package so you can avoid it, it’s recommended that you do. The Merge and Merge Join are to be avoided as well, because they require sorted input. You can sort in the data flow, but the Sort component is a blocking transformation, so it needs to be avoided. An alternative is to sort in the source component, for example by adding an ORDER BY clause to the SQL statement. The data flow needs to know the data is sorted though. You can do this by setting the IsSorted property to true on the output in the Advanced Editor (to open the advanced editor, right-click on the source and select it from the context menu).

规则很简单:避免异步转换。 在大多数情况下,这些都会受到阻止。 Union All具有令人怀疑的好处:这是最糟糕的异步转换。 但是,如果可以设计包装以便避免使用,建议您这样做。 也应避免合并和合并联接,因为它们需要排序的输入。 您可以对数据流进行排序,但是Sort组件是一个阻塞转换,因此需要避免。 另一种方法是对源组件进行排序,例如,通过在SQL语句中添加ORDER BY子句。 数据流需要知道数据已排序。 您可以通过在高级编辑器中的输出上将IsSorted属性设置为true来完成此操作(要打开高级编辑器,请右键单击源,然后从上下文菜单中选择它)。

You also need to indicate which columns are sorted. Keep in mind that this doesn’t sort the data for you! This is just metadata information.

您还需要指出要对哪些列进行排序。 请记住,这不会为您排序数据! 这只是元数据信息。

#5不要使用OLE DB命令转换 (#5 Don’t use the OLE DB command transformation)

Don’t use it. Ever. Unless you have really small data sets and you’re 100% sure they won’t grow. The OLE DB command executes the SQL statement for every row in the buffer. This means if 1 million rows pass through the transformation, the SSIS package will sent 1 million SQL statements to the server. Most likely the transaction log will blow up and it might even fill your disks completely.

不要使用它。 曾经 除非您有非常小的数据集,而且您100%确信它们不会增长。 OLE DB命令为缓冲区中的每一行执行SQL语句。 这意味着,如果一百万行通过转换,则SSIS包将向服务器发送一百万条SQL语句。 事务日志很可能会崩溃,甚至可能会完全填满您的磁盘。

A work around is to write the records to a staging table and then use a Execute SQL Task to do a set-based SQL statement. In other words, use the Execute SQL Task to run one single UPDATE statement instead several thousand UPDATE statements.

解决方法是将记录写入临时表,然后使用“执行SQL任务”执行基于集合SQL语句。 换句话说,使用Execute SQL Task运行一个单独的UPDATE语句,而不是几千个UPDATE语句。

Stay tuned for more performance tricks in next article Integration Services (SSIS) Performance Best Practices – Writing to the Destination.

请继续关注下一篇文章集成服务(SSIS)性能最佳实践–写目标中的更多性能技巧。

参考链接 (Reference links)

翻译自: https://www.sqlshack.com/integration-services-performance-best-practices-data-flow-optimization/

持续集成最佳实践

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值