如何在SQL Server中导入具有不同列数的平面文件

Ever been as frustrated as I have when importing flat files to a SQL Server and the format suddenly changes in production?

将平面文件导入到SQL Server时,是否曾经像我一样沮丧过,而格式突然在生产中发生了变化?

Commonly used integration tools (like SSIS) are very dependent on the correct, consistent and same metadata when working with flat files.

在使用平面文件时,常用的集成工具(如SSIS)非常依赖正确,一致且相同的元数据。

So I’ve come up with an alternative solution that I would like to share with you.

因此,我想出了一个替代解决方案,希望与您分享。

When implemented, the process of importing flat files with changing metadata is handled in a structured, and most important, resiliant way. Even if the columns change order or existing columns are missing.

实施后,将以结构化(也是最重要的)弹性方式处理具有更改的元数据的导入平面文件的过程。 即使列更改顺序或缺少现有列。

背景 (Background)

When importing flat files to SQL server almost every standard integration tool (including TSQL bulkload) requires fixed metadata from the files in order to work with them.

将平面文件导入SQL Server时,几乎所有标准集成工具(包括TSQL bulkload)都需要文件中的固定元数据才能使用它们。

This is quite understandable, as the process of data transportation from the source to the destination needs to know where to map every column from the source to the defined destination.

这是可以理解的,因为从源到目标的数据传输过程需要知道从源到定义的目标的每列映射位置。

Let me make an example:

让我举一个例子:

A source flat file table like below needs to be imported to a SQL server database.

如下所示的源平面文件表需要导入到SQL Server数据库中。

This file could be imported to a SQL Server database (in this example named FlatFileImport) with below script:

可以使用以下脚本将此文件导入到SQL Server数据库(在此示例中为FlatFileImport):

 
create table dbo.personlist (
	[name] varchar(20),
	[gender] varchar(10),
	[age] int,
	[city] varchar(20),
	[country] varchar(20)
);
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlist.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist;
 

The result:

结果:

If the column ‘Country’ would be removed from the file after the import has been setup, the process of importing the file would either break or be wrong (depending on the tool used to import the file) The metadata of the file has changed.

如果在设置导入后将从文件中删除“国家/地区”列,则导入文件的过程可能会中断或出错(取决于用于导入文件的工具)。文件的元数据已更改。

 
-- import data from file with missing column (Country)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistmissingcolumn.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist;
 

With this example, the import seems to go well, but upon browsing the data, you’ll see that only one row is imported and the data is wrong.

在此示例中,导入似乎进展顺利,但是在浏览数据时,您将看到仅导入一行并且数据错误。

The same would happen if the columns ‘Gender’ and ‘Age’ where to switch places. Maybe the import would not break, but the mapping of the columns to the destination would be wrong, as the ‘Age’ column would go to the ‘Gender’ column in the destination and vice versa. This due to the order and datatype of the columns. If the columns had the same datatype and data could fit in the columns, the import would go fine – but the data would still be wrong.

如果“性别”和“年龄”列在何处切换位置,也会发生同样的情况。 也许导入不会中断,但是将列映射到目标位置将是错误的,因为“年龄”列将转到目标位置的“性别”列,反之亦然。 这是由于列的顺序和数据类型。 如果这些列具有相同的数据类型,并且数据可以容纳在这些列中,则可以正常导入-但数据仍然是错误的。

 
-- import data from file with switched columns (Age and Gender)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistswitchedcolumns.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 

When importing the same file, but this time with an extra column (Married) – the result would also be wrong:

导入相同文件时,但是这次添加了额外的列(已婚)–结果也将是错误的:

 
-- import data from file with new extra column (Married)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistextracolumn.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist; 
 

The result:

结果:

The above examples are made with pure TSQL code. If it was to be made with an integration tool like SQL Server Integration Services, the errors would be different and the SSIS package would throw more errors and not be able to execute the data transfer.

上面的示例是使用纯TSQL代码编写的。 如果使用SQL Server Integration Services之类的集成工具来制作,则错误将有所不同,并且SSIS包将引发更多错误,并且无法执行数据传输。

治愈 (The cure)

When using the above BULK INSERT functionality from TSQL the import process often goes well, but the data is wrong with the source file is changed.

使用TSQL的上述BULK INSERT功能时,导入过程通常会顺利进行,但是源文件中的数据有误。

There is another way to import flat files. This is using the OPENROWSET functionality from TSQL.

还有另一种导入平面文件的方法。 这使用了TSQL的OPENROWSET功能。

In section E of the example scripts from MSDN, it is described how to use a format file. A format file is a simple XML file that contains information of the source files structure – including columns, datatypes, row terminator and collation.

在来自MSDN的示例脚本的E节中,描述了如何使用格式文件。 格式文件是一个简单的XML文件,其中包含源文件结构的信息-包括列,数据类型,行终止符和排序规则。

Generation of the initial format file for a curtain source is rather easy when setting up the import.

设置导入时,为幕帘源生成初始格式文件非常容易。

But what if the generation of the format file could be done automatically and the import process would be more streamlined and manageable – even if the structure of the source file changes?

但是,如果可以自动完成格式文件的生成,并且导入过程将更加简化和可管理-即使源文件的结构发生了变化,该怎么办?

From my GitHub project you can download a home brewed .NET console application that solves just that.

从我的GitHub项目中,您可以下载一个自制的.NET控制台应用程序,它可以解决这个问题。

If you are unsure of the .EXE files content and origin, you can download the code and build your own version of the GenerateFormatFile.exe application.

如果不确定.EXE文件的内容和来源,则可以下载代码并生成自己的GenerateFormatFile.exe应用程序版本。

Another note is that I’m not hard core .Net developer, so someone might have another way of doing this. You are very welcome to contribute to the GitHub project in that case.

另一个需要注意的地方是,我不是.Net开发人员的核心,所以有人可能会采用另一种方法。 在这种情况下,非常欢迎您为GitHub项目做出贡献。

The application demands inputs as below:

该应用程序要求输入如下:

Example usage:

用法示例:

generateformatfile.exe -p c:\source\ -f personlist.csv -o personlistformatfile.xml -d ;

generateformatfile.exe -pc:\ source \ -f personlist.csv -o personlistformatfile.xml -d;

The above script generates a format file in the directory c:\source\ and names it personlistFormatFile.xml.

上面的脚本在目录c:\ source \中生成一个格式文件,并将其命名为personlistFormatFile.xml。

The content of the format file is as follows:

格式文件的内容如下:

The console application can also be called from TSQL like this:

也可以从TSQL调用控制台应用程序,如下所示:

 
-- generate format file
declare @cmdshell varchar(8000);
set @cmdshell = 'c:\source\generateformatfile.exe -p c:\source\ -f personlist.csv -o personlistformatfile.xml -d ;'
exec xp_cmdshell @cmdshell;
 

If by any chance the xp_cmdshell feature is not enabled on your local machine – then please refer to this post from Microsoft: Enable xp_cmdshell

如果您的本地计算机上未启用xp_cmdshell功能,请参考Microsoft的这篇文章: 启用xp_cmdshell

Using the format file

使用格式文件

After generation of the format file, it can be used in TSQL script with OPENROWSET.

生成格式文件后,可以在带有OPENROWSET的TSQL脚本中使用它。

Example script for importing the ‘personlist.csv’

导入“ personlist.csv”的示例脚本

 
-- import file using format file
select *  
into dbo.personlist_bulk
from  openrowset(
	bulk 'c:\source\personlist.csv',  
	formatfile='c:\source\personlistformatfile.xml',
	firstrow=2
	) as t;
 
select * from dbo.personlist_bulk;
 

This loads the data from the source file to a new table called ‘personlist_bulk’.

这会将数据从源文件加载到名为“ personlist_bulk”的新表中。

From here the load from ‘personlist_bulk’ to ‘personlist’ is straight forward:

从这里开始,从'personlist_bulk'到'personlist'的负载很简单:

 
-- load data from personlist_bulk to personlist
truncate table dbo.personlist;
 
insert into dbo.personlist (name, gender, age, city, country)
select * from dbo.personlist_bulk;
 
select * from dbo.personlist;
 
drop table dbo.personlist_bulk;
 

即使源更改也加载数据 (Load data even if source changes)

The above approach works if the source is the same every time it loads. But with a dynamic approach to the load from the bulk table to the destination table it can be assured that it works even if the source table is changed in both width (number of columns) and column order.

如果每次加载时源都相同,则上述方法有效。 但是,通过动态方式处理从批量表到目标表的负载,即使源表的宽度(列数)和列顺序都发生了变化,也可以确保它起作用。

For some the script might seem cryptic – but it is only a matter of generating a list of column names from the source table that corresponds with the column names in the destination table.

对于某些脚本而言,该脚本看起来似乎很晦涩,但这只是从源表生成与目标表中的列名相对应的列名列表的问题。

 
-- import file with different structure
-- generate format file
if exists(select OBJECT_ID('personlist_bulk')) drop table dbo.personlist_bulk
 
declare @cmdshell varchar(8000);
set @cmdshell = 'c:\source\generateformatfile.exe -p c:\source\ -f personlistmissingcolumn.csv -o personlistmissingcolumnformatfile.xml -d ;'
exec xp_cmdshell @cmdshell;
 
 
-- import file using format file
select *  
into dbo.personlist_bulk
from  openrowset(
	bulk 'c:\source\personlistmissingcolumn.csv',  
	formatfile='c:\source\personlistmissingcolumnformatfile.xml',
	firstrow=2
	) as t;
 
-- dynamic load data from bulk to destination
declare @fieldlist varchar(8000);
declare @sql nvarchar(4000);
 
select @fieldlist = 
				stuff((select 
					',' + QUOTENAME(r.column_name)
						from (
							select column_name from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'personlist'
							) r
							join (
								select column_name from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'personlist_bulk'
								) b
								on b.COLUMN_NAME = r.COLUMN_NAME
						for xml path('')),1,1,'');
 
print (@fieldlist);
set @sql = 'truncate table dbo.personlist;' + CHAR(10);
set @sql = @sql + 'insert into dbo.personlist (' + @fieldlist + ')' + CHAR(10);
set @sql = @sql + 'select ' + @fieldlist + ' from dbo.personlist_bulk;';
print (@sql)
exec sp_executesql @sql
 

The result is a TSQL statement what looks like this:

结果是一条TSQL语句,如下所示:

 
truncate table dbo.personlist;
insert into dbo.personlist ([age],[city],[gender],[name])
select [age],[city],[gender],[name] from dbo.personlist_bulk;
 

The exact same thing would be able to be used with the other source files in this demo. The result is that the destination table is correct and loaded with the right data every time – and only with the data that corresponds with the source. No errors will be thrown.

本演示中的其他源文件可以使用完全相同的东西。 结果是目标表是正确的,并且每次都装入正确的数据–并且仅装入与源相对应的数据。 没有错误将被抛出。

From here there are some remarks to be taken into account:

从这里开始,有一些注意事项:

  1. As no errors are thrown, the source files could be empty and the data updated could be blank in the destination table. This is to be handled by processed outside this demo.

    由于没有引发错误,因此目标表中的源文件可能为空,更新的数据可能为空白。 这将由此演示外部的处理程序处理。

进一步的工作 (Further work)

As this demo and post shows it is possible to handle dynamic changing flat source files. Changing columns, column order and other changes, can be handled in an easy way with a few lines of code.

如本演示和文章所示,可以处理动态变化的平面源文件。 只需几行代码,即可轻松实现更改列,列顺序和其他更改。

Going from here, a suggestion could be to set up processes that compared the two tables (bulk and destination) and throws an error if X amount of the columns are not present in the bulk table or X amount of columns are new.

从这里开始,建议建立一个比较两个表(批量表和目标表)的过程,如果批量表中不存在X列或X列是新列,则抛出错误。

It is also possible to auto generate missing columns in the destination table based on columns from the bulk table.

也可以根据批量表中的列自动生成目标表中的缺失列。

The only boundaries are set by limits to your imagination

唯一的界限是由您的想象力限制的

摘要 (Summary)

With this blogpost I hope to have given you inspiration to build your own import structure of flat files in those cases where the structure might change.

希望通过本博文,您可以启发自己在平面文件可能发生更改的情况下构建自己的平面文件导入结构。

As seen above the approach needs some .NET programming skills – but when it is done and the console application has been built, it is simply a matter of reusing the same application around the different integration solutions in your environment.

如上所示,该方法需要一些.NET编程技能-但是完成此操作并构建控制台应用程序后,只需在环境中不同的集成解决方案周围重用同一应用程序即可。

Happy coding 🙂

快乐编码🙂

看更多 (See more)

Consider these free tools for SQL Server that improve database developer productivity.

考虑使用这些免费SQL Server工具来提高数据库开发人员的生产力。

外部链接: (External links:)

翻译自: https://www.sqlshack.com/how-to-import-flat-files-with-a-varying-number-of-columns-in-sql-server/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值