SSIS Data Flows - ADO.NET vs. OLE DB vs. ODBC

转载 2015年07月09日 17:16:34


​There has been much debate in the developer community on which provider to choose when connecting to SQL Server database.

Performance is often cited as a discriminator, with ADO.NET being less perfor​​mant than ODBC and OLE DB, as it is a managed​ facade. But performance is glossed over as most applications deal only with CRUD operations, and therefore the differences are negligible.

However, as BI developers we typically work with large datasets. We therefore cannot gloss over the performance differences, particularly when implementing SSIS data flow tasks.

A quick search online reveals very little in the way of empirical data on which is faster. Maybe I’ve been searching on the wrong terms, but the only comprehensive test data I’ve come up with is detailed by  Gilbert Quevauvilliers in the following article –

https://gqbi.wordpress.com/2013/06/19/ssis-fastest-data-flow-task-item-for-transferring-data-over-the-network/​

He compares various combinations of ADO.NET and OLE DB source and destination components for a relatively large data set – 10,000,000 rows. He found a combination of OLE DB source and destination components to be over 8x faster than ADO.NET for his testing protocol –

​​Given the lack of evidence, I decided to carry out some tests myself.

Create Source and Destination Tables

I used AdventureWorks2014 as my source database.

I created a {Sales].[SalesOrderDetailBIG] table with the same schema as [Sales].[SalesOrderDetail]. And then populated it with 10,000,000 records, totalling just under 700MB. This is the maximum number of records that I will be testing with. It should be more than enough records to cover most incremental loading scenarios.​

USE AdvntureWorks2014
GO
CREATE TABLE [Sales].[SalesOrderDetailBIG](
        [SalesOrderID] [int] NOT NULL,
   ​     [SalesOrderDetailID] [int] IDENTITY(1,1) NOT NU​LL,
        [CarrierTrackingNumber] [nvarchar](25) NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [LineTotal] numeric(38,6) NOT NULL,
        [rowguid] [uniqueidentifier] NOT NULL,
        [ModifiedDate] [datetime] NOT NULL
) ON [PRIMARY]​​​

My destination is a SQL Server 2014 database on the same server – my underpowered 8GB i5 quad-core development laptop – so there will be no network latency between the source and destination databases. This is not representative of most scenarios, but I’ve only got access to one server at the moment so needs must.

The destination table is as follows –

USE RWDW_Staging 
GO
CREATE TABLE [dbo].[SalesOrderDetail_Staging](
        [SalesOrderID] [int] NOT NULL,
        [SalesOrderDetailID] [int] NOT NULL,
        [CarrierTrackingNumber] [nvarchar](25) NOT NULL,
        [OrderQty] [smallint] NOT NULL,
        [ProductID] [int] NOT NULL,
        [SpecialOfferID] [int] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [UnitPriceDiscount] [money] NOT NULL,
        [ModifiedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_SalesOrderDetail_Staging] PRIMARY KEY CLUSTERED (
        [SalesOrderID] ASC,
        [SalesOrderDetailID] ASC 
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_Staging] ON [dbo].[SalesOrderDetail_Staging]
(
        [ModifiedDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO​

Please note that the destination table does not include the [LineTotal] and [rowguid] columns. And it has a composite PK on [SalesOrderID] and [SalesOrderDetailID] and a non-clustered index on [ModifiedDate]. More about these features later.

Create Connection Managers

I created 6 connection managers, one each of ADO.NET, OLE DB, and ODBC for the source and destination databases –​

For ADO.NET and OLE DB I used the “.Net Providers\SQLClient Data Provider” and “Native OLE DB\SQL Server Native Client 11.0″ data providers.

For ODBC I configured System DSNs for the source and destination, using the “ODBC Driver 11 for SQL Server” driver. According to the SQL Server 2014 developer guide, this is the successor to the “SQL Server Native Client 11.0″ driver, and should be used in it’s place –

http://msdn.microsoft.com/en-us/library/cc280510.aspx

I kept the default settings for all the connection managers.

​​​Comparison of ADO.NET, OLE DB, and ODBC Source Components

I wanted to initially compare the source data flow components in isolation from the destination components, so I added 3 data flow tasks to my package, and configured them as follows –

ADO.NET Source data flow task​

By adding a RowCount transformation, I do not need to add a destination. Hence I have isolated the source from the destination. The RowCount transformation adds very little overhead and should not skew the results.

I have selected the LocalHost AdventureWorks2014 ADO NET connection manager, and used a SQL command to select only the columns that I need. It is obviously important to select only the columns that are required rather than all columns. It can make a significant improvement in performance. But is something that can be easily overlooked.

OLE DB Source Data Flow Task

The OLE DB Source test data flow follows the same pattern –

The OLE DB Source component allows you to choose the subset of columns to utilise – in this case I have deselected [LineTotal] and [rowguid] –

So you would think that rather than having to create a SQL Command for the source you can combine the “Table or View” data access mode with the selection of only the columns that you need.

But all is not as it seems. A SQL Server Profiler trace reveals the following –

SSIS is still emitting a “select *” statement and so is retrieving all columns.

The answer, as per the ADO.NET source, is to use a SQL Command to select only the columns that are needed. I have done this for the test.

ODBC Source Data Flow Task

The ODBC Source component has a BatchSize property. The Profiler trace reveals the following –

An sp_cursorpreexec system stored procedures is executed to prepare a curssor and populate it with the full data set. The sp_cursorfetch stored procedure is then repeatedly excuted to fetch the data 1000 rows at a time. The final parameter is the batch size specified in the BatchSize property.

I tried a range of BatchSize values, and found negligible difference in the performance. I did however hit a problem when attempting to retrieve the data in a single batch by setting the BatchSize to 10,000,000. The performance was significantly degredated as the SSIS internal buffer size was smaller than the batchsize, and as such there was a big overhead incurred in allocating more memory to the buffer –

​Information: 0x4004800C at ODBC Source, SSIS.Pipeline: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 0 buffers were considered and 0 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.

Information: 0x4004800F at ODBC Source: Buffer manager allocated 2 megabyte(s) in 1 physical buffer(s).

Information: 0x40048010 at ODBC Source: Component "ODBC Source" (2) owns 2 megabyte(s) physical buffer.​

​Source Data Flow Component Results

I wanted to see if the number of records effected the results. I therefore modified the SQL commands using a “top n” statement, and ran the tests for 10,000,000, 1,000,000, 10,000, and 1,000 records respectively for each data source. I also ran each test 3 times over in order to account for chatter on the server. The only other processes running during the testing were background OS and SQL Server processes, but sometimes these can skew the results.

Results are as follows –

Source Records​ Data (MB) Elapsed Time (s) MB/s
ADO.NET 10,000,000 639.727 37.737 16.952
ADO.NET 1,000,000 63.973 4.025 15.894
ADO.NET 100,000 6.397 0.483  
ADO.NET 10,000 0.640 0.109  
OLE DB 10,000,000 639.727 15.693 40.765
OLE DB 1,000,000 63.973 1.716 37.280
OLE DB 100,000 6.397 0.297  
OLE DB 10,000 0.640 0.078  
​ODBC 10,000,000​ 639.727​ 1:13:461​ 8.708​
​ODBC 1,000,000​ 63.973​ 6.614​ 9.672​
​ODBC 100,000​ 6.397​ 0.687​
​ODBC ​10,000 ​0.640 ​0.234

Gilbert found some benefit from optimizing the source connection manager by changing the packet size property to the maximum possible value 32,767. And also by adding the following SQL hints “WITH (NOLOCK) OPTION (FAST 10000)”. Neither of these were applicable for my testing protocol as both my databases are on the same server and so the data is not being sent over a network. Also I am not competing with other processes for access to the data, and do not have to wait for other processes to release locks on the source table.

But as with all SQL hints, you are overriding the optimizer, and so there may or may not be an improvement. You should definitely consider all of the above suggestions when attempting to optimize your processes. Gilbert, to be fair, does say it “sometimes” improves performance.

There’s a good article by Rob Farley on why you should consider using the FAST 10000 hint –

http://sqlblog.com/blogs/rob_farley/archive/2011/02/17/the-ssis-tuning-tip-that-everyone-misses.aspx

Under my testing protocol, the OLE DB source in isolation is approximately 2.5 x quicker than the ADO.NET source, and a whopping 3.8x quicker than the ODBC source, for relatively large data sets. For small data sets the difference is much less, but I suspect only because the overhead of initializing the classes and establishing the connection is obscuring the real throughput performance. This is why, in the table, I have left the MB/s field blank for the 100,000 and 10,000 row tests.

Comparison of ADO.NET, OLE DB, and ODBC Destination Components

I will now test the ADO.NET, OLE DB, and ODBC destination components. But to keep thing simple will only be using the OLE DB source to do this.

ADO.NET Destination Data Flow Task

On the ADO.NET Destination component, I have checked the  “Use Bulk Insert when possible” option. We obviously prefer bulk inserts rather than single row inserts. This option was introduced with 2008 R2. If you are using 2008 or earlier then the ADO.NET destination will use single row inserts and so will be dreadfully slow. Apparently the “Use Bulk Insert when possible” option was added to enable the transfer of data to SQL Azure, given that connecting to SQL Azure using OLE DB is not supported by Microsoft (although it does work).

I ran the data flow task 3 timed for 10,000,000. The results were – 2:55.220; 2:54.690; and 2:54.784.

Looking at the Profiler trace reveals that the component is performing a bulk insert, but is inserting in batches of 9557 rows –

SSIS buffers the data during a data flow. The data flow task has a couple of properties for controlling the buffer size – DefaultBufferMaxRow and DefaultBufferSize. DefaultBufferMax rows applies an upper limit to the maximum number of rows in the buffer. I set this to 10,000,000 as this is the number of rows that I am transferring. The DefaultBufferSize is the default size in bytes of the buffer. I set this to the maximum possible, which is 100 MB.

In addition, I increased the command timeout property of the ADO.NET destination component from 30 to 300 seconds to avoid time outs. And I kept the BatchSize property equal to 0, which instructs SSIS to use the same size as the internal buffer.

The net effect is that the bulk inserts were now being performed in batches of 728,177 rows; and this is constrained by the 100 MB DefaultBufferSize –

But contrary to expectation, increasing the buffer and batch size actually had a detrimental effect on performance, with the task now taking over 4 minutes to complete. I tried a range of DefaultMaxBufferRow values, and found 20,000 to produce the best results as follows – 2:47.483; 2:51.055; and 2:53.297.

However, this is only marginally faster than the starting point of 10,000. I would expect the optimal value to very much depend on context, so worth trying out different values. But do not expect it to greatly improve the performance with the ADO.NET destination.

As mentioned above, the staging table has a PK constraint, and hence is a B-tree rather than a heap. And it has a non-clustered index which will of course utilize resources to maintain.

Deleting the non-clustered index results in the following – 2:11.950; 2:18.155; and 2:19.824. So an average improvement of 33 seconds. But this must be offset slightly by the 12 seconds that it takes to recreate the index once all records have been inserted.

Additionally dropping the PK constraint results in the following – 1:51.344; 1:51.853; and 1:59.183. But it took over 2 minutes to recreate the constraint, so on balance dropping the PK was detrimental.

OLE DB Destination Data Flow Component

I configured the OLE DB Source component to restrict the columns to the required subset using a SQL command –

And I configured the OLE DB Destination component to use the “Table or view – fast load” data access mode, and hence enable bulk rather than row-by-row inserts –

The staging table had both the PK constraint and the non-clustered index enabled.

My initial results with 10,000,000 rows were as follows – 2:09.981; 2:05.238; and 2:00.589. This is not far off the best results achieved with the ADO.NET Destination. But we’ve only just started.

The source database is configured to use the Bulk-logged recovery model, which aims to minimally log bulk operations. This should be significantly more performant than the Full recovery model, assuming the bulk insert meets the criteria required to be minimally logged. The criteria for the target table are as follows –

  • The target table is not being replicated.
  • Table locking is specified (using TABLOCK).
  • If the table has no indexes, data pages are minimally logged.
  • If the table does not have a clustered index but has one or more non-clustered indexes, data pages are always minimally logged. How index pages are logged, however, depends on whether the table is empty –
    • If the table is empty, index pages are minimally logged.
    • If table is non-empty, index pages are fully logged.
  • If the table has a clustered index and is empty, both data and index pages are minimally logged. In contrast, if a table has a clustered index and is non-empty, data pages and index pages are both fully logged regardless of the recovery model.

My destination table is not being replicated, and the Profiler trace reveals that the TABLOCK hint is being applied with the “insert bulk” statement -insert bulk [dbo].

[SalesOrderDetail_Staging]([SalesOrderID] int,[SalesOrderDetailID] int,[CarrierTrackingNumber] nvarchar(25) collate Latin1_General_CI_AS,[OrderQty] smallint,[ProductID] int,[SpecialOfferID] int,[UnitPrice] money,[UnitPriceDiscount] money,[ModifiedDate] datetime)with(TABLOCK,CHECK_CONSTRAINTS)

My destination table has a clustered index (PK) and also a non-clustered index. However, I am truncating the table before executing the data flow task. But, there is a caveat – the data is being inserted in batches. The table is only empty for the 1st batch. As such only the 1st batch is being minimally logged.

As with the ADO.NET Destination test, I dropped the non-clustered index, but I have kept the PK constraint. I also increased the DefaultBufferMaxRow and DefaultBufferSize settings –

And then set the OLE DB Destination component’s “Maximum insert commit size” property to 0, which tells SSIS to commit the rows as a single batch at the end of the data-flow –

If I was to run the data flow now, the performance would still be relatively slow as the rows will be written to the TempDB, then sorted by the PK fields, before being written to the destination table. To ensure that the rows are written directly to the destination table, I added a ORDER BY statement to the OLE DB Source SQL Command –

And finally, I added an ORDER hint to the OLE DB Destination component’s FastLoadOptions property, specifying the PK fields –

My final results for the OLE DB Destination are as follows – 0:43.197; 0:43.493; and 0:43.493

ODBC Destination Component

I configured the Data Flow task and OLE DB Source component as per the ADO.NET and OLE DB tests. I then configured the ODBC Destination component to use the “Table Name – Batch” data access mode as follows –

In the SQL Server 2014 documentation Microsoft describes the batch load option as follows –

Batch: In this mode the ODBC destination attempts to use the most efficient insertion method based on the perceived ODBC provider capabilities. For most modern ODBC providers, this would mean preparing an INSERT statement with parameters and then using a row-wise array parameter binding (where the array size is controlled by the BatchSize property). If you select Batch and the provider does not support this method, the ODBC destination automatically switches to the Row-by-row mode.​

http://msdn.microsoft.com/en-us/library/hh758691.aspx​

I executed the task and looked at the Profiler trace, and pretty soon realised that the data was being inserted row-by-row –

I stopped the task as it was taking too long to complete.

I cannot see any obvious reason for the data being inserted row-by-row rather than in a batch. I need to investigate this further at a later date and will return and completing the tests when I have a resolution. Unfortunately, for now I am not including the ODBC destination component in the results.

SQL Server Destination Component​

​​Another option for connection to a destination SQL Server database is to use the SQL Server Destination component. I have excluded this from the test, as it is limited to local databases only, and so you are restricting yourself to only being able to run the package on the destination server. This may fit with your current deployment plan, but potentially limits extensibility and scallability. There would have to be good performance benefits for me to consider accepting this restriction, but most commentators say that the performance is very similar to the latest incarnations of the OLE DB Destination. To back this up, the description of the component in the BIDS toolbox includes the following note –

Note: to optimize performance, we recommend using the OLE DB destination instead.

Destination Data Flow Component Results

As per the Source data flow component tests, I modified the SQL commands using a “TOP n” statement, and ran the tests for 10,000,000, 1,000,000, 10,000, and 1,000 records respectively for each destination.  And I ran each test 3 times over in order to account for chatter on the server.

Source Destination
Records
Data (MB) Elapsed Time (s)​​ MB/s
OLE DB ADO.NET 10,000,000 639.727 2:16.543 4.685
OLE DB ADO.NET 1,000,000 63.973 13.004 4.919
OLE DB ADO.NET 100,000 6.397​ 1.347  
OLE DB ADO.NET 10,000 0.640 0.265  
OLE DB OLE DB 10,000,000 639.727 43.394 14.742
OLE DB OLE DB 1,000,000 63.973 4.649 13.760
OLE DB OLE DB 100,000 6.397 0.577  
OLE DB OLE DB 10,000 0.640 0.156  

Why is the OLE DB Destination so much faster than the ADO.NET Destination?

Although not statistically significant, my results infer that the OLE DB Destination is approximately 3 times faster than the ADO.NET Destination when transferring reasonably large data sets.

I can think of 2 reasons that may explain this –

  • The OLE DB Destination component gives us more control over the bulk load. In particular, it allows us to apply hints using the FastLoadOptions to ensure that the data is inserted directly into the destination table, rather than being inserted into the TempDB table prior to sorting in order to accommodate the PK constraint. It also allows us to commit the whole data set as a single batch.
  • The OLE DB destination inserts rows using the following statement –
insert bulk [dbo].[SalesOrderDetail_Staging]([SalesOrderID] int,[SalesOrderDetailID] int,[CarrierTrackingNumber] nvarchar(25) collate Latin1_General_CI_AS,[OrderQty] smallint,[ProductID] int,[SpecialOfferID] int,[UnitPrice] money,[UnitPriceDiscount] money,[ModifiedDate] datetime)with(TABLOCK,CHECK_CONSTRAINTS)

C​​ompare this to the ADO.NET destination statement –

insert bulk "dbo"."SalesOrderDetail_Staging" ([SalesOrderID] Int, [SalesOrderDetailID] Int, [CarrierTrackingNumber] NVarChar(25) COLLATE Latin1_General_CI_AS, [OrderQty] SmallInt, [ProductID] Int, [SpecialOfferID] Int, [UnitPrice] Money, [UnitPriceDiscount] Money, [ModifiedDate] DateTime)

There are no table hints. And in particular, there is no TABLOCK hint. This is necessary for minimal logging, so even if we meet all the other criteria for minimal logging, the ADO.NET destination does not request it.

In Conclusion

  • ​​I found the OLE DB Source to be approximately 2.5x faster than the ADO.NET source, and approximately 3.8x faster than the ODBC Source, when reading a large number of rows from a SQL Server Database hosted on the same server as SSIS.
  • When combined with an OLE DB Source, I found the OLE DB Destination to be ​3x faster than the ADO.NET Destination when transfering a lerge number of rows between SQL Server databases hosted on the same server.
  • Although I selected the “Table Load – Batch” data access mode, the ODBC Destination performed row-by-row inserts, which is not acceptible for a large data set. I need to investigate this further to determine a resolution.
  • My source and destination databases where hosted on the same server as SSIS. As such my testing protocol does not account for network latencies or constraints. Arguably my testing protocol is flawed because of this. I will endevour to repeat the tests on a more realistic infrastructure when one becomes available to me.
  • My results aren’t statistically significant – 3 observations per population – and my testing protocol is flawed. But I think my results add to the anecdotal evidence which suggests that OLE DB should be the first port of call when building SSIS data flows.​

Caveat Emptor

​OLE DB would appear to be the most performant option for SSIS data flow SQL Server source and destination components.

However, in August 2011 Microsoft announced that they will be deprecating the SQL Server OLE DB provider, with SQL Server 2012 being the last release to support it. True to their word, the provider has not been updated for SQL Server 2014, which is why I have used the 2012 provider “Native OLE DB\SQL Server Native Client 11.0″ for my OLE DB connections.
As discussed in the article below, they are aligning their strategy with ODBC, and they recommend using ODBC for new projects, whilst planning for the upgrade of existing projects. I will not be doing either until it is clear that the performance of ODBC has been improved to match that of OLE DB. I’m sure ​Microsoft are working on it.​


SQL那些事儿(十一)--ODBC,OLE-DB,ADO.NET区别

一、ODBC ODBC的由来 1992年Microsoft和Sybase、Digital共同制定了ODBC标准接口,以单一的ODBC API来存取各种不同的数据库。随后ODBC便获得了许多数据库厂...

ODBC、JDBC、OLE DB、 ADO、ODAC、ODP.NET介绍

面对各式各样、越来越多的数据来源和访问需求。软件开发框架中一般都提供了统一的访问接口和方法,来屏蔽数据库底层差异。 各式各样的Provider提供者。 ODBC(Open Database Con...

ADO.NET 快速入门(十三):使用 OLE DB 检索数据

OleDbDataReader 类提供了一种从数据源读取数据记录只进流的方法。如果想使用 SQL Server 7.0 或者更高版本,请参考文章:使用 SQL Server 检索数据。   ...

VS C# ADO.NET代码连接Access数据库

在Visual Studio2013 C#中以编程方式操作Access数据库,一般使用OleDb进行连接: 1.首先添加两个引用(使用using): using System.Data.OleDb; ...

【数据库】ADO.NET----搞清VS中调用数据库语句

前言        学习到现在了,还没有对ADO进行总结,第一次是自己感觉学习完了,把视频看完了,知道代码怎么写就可以了,下次用到来就去找上一次的代码。后来重构的时候发现用到的依旧是那几个,基本上...

VS2013 ADO.NET 连接 SQLEXPRESS

开发环境:VS2013 + SQLEXPRESS 出现问题:网站 Web.config 已经配置了connectionString,但是不知道怎么用代码连接 解决方案:ADO.NE...

[VS2010] ADO.NET Entity Framework: 由 Entity Object 執行 SQL 指令

http://www.dotblogs.com.tw/regionbbs/archive/2010/02/16/netfx.4.0.new.feature.ado.net.entity.framewo...

解决VS2013使用MySQL无法创建ADO.NET实体数据类型的问题

VS2013使用MySQL无法创建ADO.NET实体数据类型的问题 安装connector和vs插件之后,在添加ado.net实体时,提示项目使用了最新版本实体框架;但是,找不到进行数据连接所需的与此...

VS2008在做简单的ado.net连接数据库时出现的问题

用的是VS2008和SQL Server2005。 想要做一个简单的 可以查询、修改数据等功能的winform。 新建工程,添加一个datagridview控件来显示结果,几个textbox和bu...
内容举报
返回顶部
收藏助手
不良信息举报
您举报文章:SSIS Data Flows - ADO.NET vs. OLE DB vs. ODBC
举报原因:
原因补充:

(最多只允许输入30个字)