Using Python SQL scripts for importing data from compressed files

Latest recommended article published on 2023-04-03 20:50:40

culuo4781

Article tags: python, big data, database, java, linux

Original link: https://www.sqlshack.com/using-python-sql-scripts-for-importing-data-from-compressed-files/

Combining Python with SQL scripts is a powerful way for developers and database administrators to perform data analytics. Python provides many useful modules for efficient data computation and processing. SQL Server has been able to run Python scripts since SQL Server 2017, so we can build ETL solutions that extract data from various sources and insert it into SQL Server.

Suppose we receive data as a flat file in a compressed format. We can use an ETL process to import it as well, but the file normally has to be extracted first with a tool such as 7z.

I have seen developers use CozyRoc, an external SSIS tool. It includes a Zip task that compresses and decompresses files in formats such as Zip, GZip, and BZip2. In my previous article, we imported a compressed CSV file using the 7z compression utility. Python can also play an important role in importing data into SQL Server from compressed files.

In this article, I am using SQL Server 2019 CTP 2.0.

You need to install Machine Learning Services (Python) to run Python SQL scripts. If you have not installed it before, start the SQL Server installer and add the feature to the existing SQL Server installation. On the Feature Selection page, check Machine Learning Services and Python, as shown in the following image:

Setup installs the SQL Server Launchpad service, which runs Machine Learning Services. On the next page, you can set the service account for it.

Specify service account for the SQL Server Launchpad service

On the next page, we need to consent to installing Python. Once you click Accept, setup downloads the required software and installs it.

You can review the selected SQL Server 2019 features on the summary page before installation. It shows that Python will be installed with Machine Learning Services.

In SQL Server Configuration Manager, both the SQL Server service and the SQL Server Launchpad service must be running before we can use Python scripts in SQL Server.

Verify SQL Server service and Launchpad service for Python

Connect to SQL Server. To run Python SQL scripts, we need to turn on the 'external scripts enabled' option using the sp_configure command:

EXEC sp_configure 'external scripts enabled', 1
RECONFIGURE WITH OVERRIDE

In the screenshot below, we can see that 'external scripts enabled' was configured successfully.

Verify 'external scripts enabled' Python SQL results

Once you have enabled external scripts, you need to restart both services. You might get the following error if the SQL Server service and the Launchpad service are not restarted after running the sp_configure command above.

Error while using Python SQL script in SQL Server

We can run the following script to test whether Python SQL scripts execute correctly in SQL Server.

execute sp_execute_external_script 
@language = N'Python', 
@script = N'
a = 9
b = 3
c = a/b
d = a*b
print(c, d)
'
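For reference, the same arithmetic can be checked in a plain Python 3 interpreter outside SQL Server. This is only a minimal sketch; the function name `run_check` is hypothetical. In Python 3 the `/` operator performs true division, so `c` comes back as a float.

```python
# Same arithmetic as the T-SQL test script, run as plain Python 3.
def run_check():
    a = 9
    b = 3
    c = a / b   # true division in Python 3, so the result is a float
    d = a * b
    return c, d

c, d = run_check()
print(c, d)  # prints: 3.0 27
```

If the T-SQL test prints the same two values in the SSMS Messages tab, the Launchpad service is executing Python as expected.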

For this example, let us prepare the data using the query below against the WideWorldImporters database.

SELECT [PersonID]
      ,[FullName]
      ,[PreferredName]
      ,[SearchName]
      ,[IsPermittedToLogon]
      ,[ValidFrom]
  FROM [WideWorldImporters].[Application].[People]

Extract the records to import in SQL Server using Python

Save the output (1111 rows) in CSV format in a designated directory.

Now right-click the CSV file, go to 7-Zip, and click Add to archive.

We need to create a bzip2 compressed archive file.

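Click OK, and it creates the compressed file Employee.csv.bzip2. Note that the scripts below read the archive as person.csv.bz2, so the file name and path used in the script must match the file you actually created.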
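As an alternative to the 7-Zip dialog, a bzip2-compressed CSV can also be produced from Python with the standard `bz2` and `csv` modules. The sketch below is illustrative only: the helper name `write_bz2_csv`, the two-column sample rows, and the temporary output path are all assumptions, not part of the original walkthrough.

```python
import bz2
import csv
import os
import tempfile

def write_bz2_csv(path, header, rows):
    """Write header and rows as CSV text, compressed with bzip2."""
    # "wt" opens the archive in text mode; newline="" is what the csv module expects
    with bz2.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Hypothetical sample data, not taken from WideWorldImporters
header = ["PersonID", "FullName"]
rows = [[1, "Sample Person A"], [2, "Sample Person B"]]

archive_path = os.path.join(tempfile.mkdtemp(), "person.csv.bz2")
write_bz2_csv(archive_path, header, rows)

# bzip2 streams start with the magic bytes "BZh"
with open(archive_path, "rb") as f:
    magic = f.read(3)
print(magic)
```

A file written this way uses the same bzip2 format that 7-Zip produces, so `compression = "bz2"` should apply to it in the same way.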

Connect to the SQL Server instance and run the following Python SQL script:

EXEC sp_execute_external_script
@language = N'Python',
@script =
N'
import pandas as pd
import datetime as datetime

# Raw string (r"...") keeps the backslashes in the Windows path as literal characters
# OutputDataSet is the data frame that SQL Server returns as the result set
OutputDataSet = pd.read_csv(r"C:\sqlshack\Draft articles\Data\person.csv.bz2", names = ["PersonID", "FullName", "PreferredName", "SearchName", "IsPermittedToLogon",  "ValidFrom"],
header = 0, compression = "bz2")
'
,@input_data_1 = N''
,@input_data_1_name = N''
WITH RESULT SETS
(
    (
        PersonID INT,
        FullName VARCHAR(512),
        PreferredName VARCHAR(512),
        SearchName VARCHAR(512),
        IsPermittedToLogon bit,
        ValidFrom nvarchar(500)
    )
)

The data from the compressed CSV file is returned as a result set in SSMS. SQL Server runs the Python code, which reads the compressed file and extracts the data using Python modules.
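To make the "read without extracting" idea concrete, here is a minimal standard-library sketch that decompresses a bzip2 CSV on the fly. It deliberately uses `bz2.open` and `csv.reader` rather than pandas, so it illustrates the same idea rather than reproducing the script above; the helper name `read_bz2_csv` and the sample file are hypothetical.

```python
import bz2
import csv
import os
import tempfile

def read_bz2_csv(path):
    """Return all CSV rows from a bzip2 file, decompressing on the fly."""
    with bz2.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

# Build a tiny hypothetical archive so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "sample.csv.bz2")
with bz2.open(path, "wt", newline="", encoding="utf-8") as f:
    f.write("PersonID,FullName\r\n1,Sample Person A\r\n2,Sample Person B\r\n")

rows = read_bz2_csv(path)
print(rows[0])                    # header row
print(len(rows) - 1, "data rows")
```

No uncompressed copy of the CSV is written to disk at any point; the rows are decoded directly from the compressed stream.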

Verify the output using Python SQL script

Let us walk through the Python part of this query.

Part 1: Import Python modules: We use the pandas module to read data from the compressed file. Date and time types are not built into the core Python language; they come from the standard datetime module, so the script imports it as well.

EXEC sp_execute_external_script
@language = N'Python',
@script =
N'
import pandas as pd
 
import datetime as datetime

Part 2: Define the CSV columns and compression format: In this part, we need to define the following parameters.

OutputDataSet = pd.read_csv(r"C:\sqlshack\Draft articles\Data\person.csv.bz2", names = ["PersonID", "FullName", "PreferredName", "SearchName", "IsPermittedToLogon",  "ValidFrom"],header = 0, compression = "bz2")

"C:\sqlshack\Draft articles\Data\person.csv.bz2": the path of the compressed CSV file to read
names = ["PersonID", "FullName", "PreferredName", "SearchName", "IsPermittedToLogon", "ValidFrom"]: the column names to assign to the data
header = 0: the first row of the CSV file is the header
compression = "bz2": the compression format of the file
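In pandas, passing `names` together with `header = 0` replaces the header row found in the file with the supplied names. The sketch below mimics that behaviour with the standard `csv` module only, to show what happens to the first row; the helper `apply_names` and the sample text are hypothetical.

```python
import csv
import io

def apply_names(csv_text, names):
    """Drop the first (header) row and label each data row with `names`."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # header = 0: row 0 is the header, so it is not treated as data
    return [dict(zip(names, row)) for row in reader]

# The file's own header ("id,name") is discarded in favour of the new names
sample = "id,name\n1,Sample Person A\n2,Sample Person B\n"
records = apply_names(sample, ["PersonID", "FullName"])
print(records[0])
```

Without `header = 0`, the original header line would be read as a data row when `names` is supplied, which is why the two parameters are used together here.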

Part 3: Define the output table columns with data types: In this section, we define the columns of the result set and their data types.

,@input_data_1 = N''
,@input_data_1_name = N''
WITH RESULT SETS
(
    (
        PersonID INT,
        FullName VARCHAR(512),
        PreferredName VARCHAR(512),
        SearchName VARCHAR(512),
        IsPermittedToLogon bit,
        ValidFrom nvarchar(500)
    )
)

Now we need to import this data into SQL Server. We can put an INSERT statement directly in front of the Python query to insert the rows into a SQL Server table. The table needs columns that match the CSV file columns. Let us create one using the following query.

Create table PythonZipInsert
    (
        PersonID INT,
        FullName VARCHAR(512),
        PreferredName VARCHAR(512),
        SearchName VARCHAR(512),
        IsPermittedToLogon bit,
        ValidFrom varchar(512)
    )

Once the SQL Server table is in place, we insert the data using the Python SQL script.

insert into PythonZipInsert
EXEC sp_execute_external_script
@language = N'Python',
@script =
N'
import pandas as pd
import datetime as datetime

# Raw string (r"...") keeps the backslashes in the Windows path as literal characters
OutputDataSet = pd.read_csv(r"C:\sqlshack\Draft articles\Data\person.csv.bz2", names = ["PersonID", "FullName", "PreferredName", "SearchName", "IsPermittedToLogon",  "ValidFrom"],
header = 0, compression = "bz2")
'

We get the following output that shows the number of inserted records.
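The same load pattern (read the compressed CSV, insert every data row, report the row count) can be sketched entirely in Python. The example below swaps SQL Server for an in-memory SQLite database purely as a stand-in so that it runs anywhere; the two-column table, the sample archive, and the helper `load_bz2_csv` are assumptions for illustration.

```python
import bz2
import csv
import os
import sqlite3
import tempfile

def load_bz2_csv(conn, path):
    """Insert every data row of a bzip2-compressed CSV into PythonZipInsert."""
    with bz2.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        cur = conn.executemany(
            "INSERT INTO PythonZipInsert (PersonID, FullName) VALUES (?, ?)",
            reader,
        )
    conn.commit()
    return cur.rowcount  # total number of rows inserted

# Hypothetical two-column archive and an in-memory SQLite table (stand-in only)
path = os.path.join(tempfile.mkdtemp(), "sample.csv.bz2")
with bz2.open(path, "wt", newline="", encoding="utf-8") as f:
    f.write("PersonID,FullName\n1,Sample Person A\n2,Sample Person B\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PythonZipInsert (PersonID INTEGER, FullName TEXT)")
inserted = load_bz2_csv(conn, path)
print(inserted, "rows inserted")
```

In the article's setup the INSERT ... EXEC statement plays the role of `executemany`, and the "rows affected" message in SSMS corresponds to the returned row count.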

Let us walk through this query, which inserts records directly into a SQL Server table.

Part 1: Insert statement: In this step, we specify an INSERT statement that targets our existing SQL Server table.

insert into PythonZipInsert

Part 2: Define the CSV columns and compression format: This step is the same as the one we used when viewing the data in SSMS.

When importing data directly into a SQL Server table, we do not need Part 3 (Define the output table columns with data types), because the destination table already defines the columns and their data types.

SSIS Package to Import compressed data into SQL Server using Python SQL script

We can also use the Python query in an SSIS package to import data into SQL Server tables.

Open Visual Studio 2017 and create an Integration Services project. Specify a valid directory and solution name for this SSIS package.

It creates a solution for the PythonImport SSIS package. Drag a Data Flow Task into the Control Flow area.

Double-click the Data Flow Task and drag an OLE DB Source into the data flow area.

In this OLE DB Source, specify the connection to the SQL Server instance. We run the Python command in this OLE DB Source to get the required data. Select SQL command as the data access mode and paste the Python code into the SQL command text box. It is the same Python SQL code that we used to display the data in SSMS.

Configure the OLE DB Source with connection and script

Click OK and add an OLE DB Destination. In this OLE DB Destination, specify the SQL Server instance and the table into which we want to insert the CSV data returned by the Python SQL query.

Select the name of the table to insert data

Click Mappings and verify the mapping between the CSV columns and the SQL Server table columns. We can change the mapping if the column names differ between the CSV file and the table.

Verify the mapping between CSV columns and table columns

Click OK, and you can see the SSIS package in the following screenshot. I renamed the source and destination as follows.

Source: Python SQL Query
Destination: SQL Server Table

If there is a configuration error, a red cross appears on the source or destination task.

We truncate the destination SQL Server table using the following query. This ensures that the table is completely empty before the SSIS package runs.

truncate table PythonZipInsert

Now execute the SSIS package. In the following screenshot, you can see that 1111 rows were inserted from Python SQL Query into SQL Server Table. The SSIS package execution succeeded as well.

Execute and monitor status of SQL Server Python SSIS package

Verify the records in the SQL Server destination table. We have 1111 rows, which matches the number of rows in the CSV file.

select count(*) as ImportusingPython from PythonZipInsert
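To compare the table count against the source, the number of data rows in the compressed file can be counted without extracting it. This is a minimal standard-library sketch; the helper name `count_data_rows` and the three-row sample archive are hypothetical, and it assumes the file has a single header row.

```python
import bz2
import csv
import os
import tempfile

def count_data_rows(path):
    """Count CSV records in a bzip2 file, excluding the header row."""
    with bz2.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header if the file is not empty
        return sum(1 for _ in reader)

# Hypothetical archive with one header row and three data rows
path = os.path.join(tempfile.mkdtemp(), "sample.csv.bz2")
with bz2.open(path, "wt", newline="", encoding="utf-8") as f:
    f.write("PersonID,FullName\n1,A\n2,B\n3,C\n")

expected_rows = count_data_rows(path)
print(expected_rows)
```

Counting with `csv.reader` rather than counting lines keeps the result correct even if a quoted field contains a line break.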

We can also perform data transformations in the SSIS package as required. Python makes it easier to extract data from a compressed file, because we do not need to decompress the file first.