在SQL Server 2017上充分利用Python

最新推荐文章于 2024-08-03 21:31:25 发布

culuo4781

最新推荐文章于 2024-08-03 21:31:25 发布

阅读量341

点赞数

文章标签： python java 数据库 linux 大数据

原文链接：https://www.sqlshack.com/get-python-sql-server-2017/

版权

介绍 (Introduction)

One of the new features announced with SQL Server 2017 is support for the Python language. This is big! In SQL Server 2016, Microsoft announced support for the R language – an open source language ideally suited for statistical analysis and machine learning (ML). Recognizing that many data scientists use Python with ML libraries, the easy-to-learn-hard-to-forget language has now been added to the SQL Server ML suite.

SQL Server 2017宣布的新功能之一是对Python语言的支持。好大！在SQL Server 2016中，Microsoft宣布支持R语言，R语言是一种非常适合统计分析和机器学习（ML）的开源语言。认识到许多数据科学家都将Python与ML库一起使用，因此，易于学习，难以忘记的语言现已添加到SQL Server ML套件中。

There’s a big difference between R and Python though: R is a domain-specific language while Python is general purpose. That means that the full power of Python is available within SQL Server. This article leaves ML aside for the moment and explores a few of the other possibilities.

R和Python之间有一个很大的区别：R是一种领域特定的语言，而Python是通用的。这意味着在SQL Server中可以使用Python的全部功能。本文暂时将ML放在一边，并探讨了其他一些可能性。

入门 (Getting started)

A full discussion of installing Python on SQL Server is out of the scope of this article but you can follow the link in the References if you need to do that. Note that at the moment, ML services for Python are only available for installations of SQL Server on Windows. Neither Linux nor Azure SQL Database is supported as of this writing.

在SQL Server上安装Python的完整讨论不在本文的讨论范围之内，但是如果需要，可以按照参考中的链接进行操作。请注意，目前，Python的ML服务仅可用于Windows上SQL Server安装。在撰写本文时，既不支持Linux，也不支持Azure SQL数据库。

There is a single API for invoking Python programs from SQL Server:

有一个用于从SQL Server调用Python程序的API：

sp_execute_external_script

This is a system stored procedure that first appeared with R services in SQL Server 2016. Assuming you have everything set up already, this query will run a Python “Hello world!” program:

这是一个系统存储过程，最初与R服务一起出现在SQL Server 2016中。假设您已经设置了所有内容，则此查询将运行Python“ Hello world！”。程序：

EXEC sp_execute_external_script  @language =N'Python',
@script=N'OutputDataSet = InputDataSet',
@input_data_1 = N'SELECT ''Hello'', ''world!'''
WITH RESULT SETS ((Hello VARCHAR(20), World VARCHAR(20)));

This produces:

这将产生：

(Note that you need to use WITH RESULT SETS to get column names added to the output.)

（请注意，您需要使用WITH RESULT SETS来将列名添加到输出中。）

用例1：正则表达式 (Use case 1: Regular expressions)

If you are an experienced user of SQL Server, you may have been frustrated on occasion by the lack of support for full regular expressions. Sure, a WHERE LIKE clause can do a lot, but there’s a lot that it can’t do. For example, consider email addresses. RFC 5352 describes them in detail and takes 57 pages to do it! Constructing a full LIKE clause to identify email addresses is cumbersome at best. However, this Python regular expression does it neatly:

如果您是SQL Server的经验丰富的用户，则有时可能会因为缺少对完整正则表达式的支持而感到沮丧。当然，WHERE LIKE子句可以做很多事情，但是有很多它不能做。例如，考虑电子邮件地址。 RFC 5352详细描述了它们，并且需要57页才能完成！构造一个完整的LIKE子句来识别电子邮件地址充其量是繁琐的。但是，此Python正则表达式会巧妙地做到这一点：

(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)

(This is difficult to do with a LIKE match because it lacks the “+” symbol meaning “one or more”.)

（这对于LIKE匹配很难做到，因为它缺少表示“一个或多个”的“ +”符号。）

Testing that in a little Python program, shows that it works:

在一个小的Python程序中进行测试，表明它可以工作：

email = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
m = email.match("me@you.com")
if m:
    print('email found: ', m.group())
else:
    print('No email found')

(Go ahead and try this on your own to prove it to yourself. If you haven’t installed Python on your workstation yet, this is a good time to do that!)

（继续，自己尝试一下，向自己证明。如果您尚未在工作站上安装Python，那么这是个不错的时机！）

Let’s apply this in SQL Server 2017:

让我们在SQL Server 2017中应用它：

DECLARE @Python NVARCHAR(4000) = '
import re
re_email = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
for index, row in InputDataSet.iterrows():
	name, email = row
	m = re_email.match(email)
	if m:
		print(name, "email found: ", m.group())
	else:
		print(name, "No email found")
';
 
DROP TABLE IF EXISTS #Input;
CREATE TABLE #Input (name varchar(50), email varchar(254));
INSERT INTO #Input (name, email) VALUES
('Arthur Dent', 'Arthur@earth.com'),
('Zaphod Beeblebrox', 'Zaphod@HeartOfGold.com'),
('Trillian', 'trillian@a_party')
;
 
EXEC sp_execute_external_script  @language =N'Python',
@script = @Python,
@input_data_1 = N'SELECT name, email from #Input'

This produces:

这将产生：

(3 rows affected)
STDOUT message(s) from external script:
Arthur Dent email found:  Arthur@earth.com
Zaphod Beeblebrox email found:  Zaphod@HeartOfGold.com
Trillian No email found

This would probably be more useful if it returned an output result set with an added “IsValidEmail” column. Before we get there, though, let’s look at a different problem.

如果它返回带有添加的“ IsValidEmail”列的输出结果集，这可能会更有用。不过，在到达那里之前，让我们看一个不同的问题。

处理CSV格式的数据 (Handling CSV formatted data)

One frequently-asked question found on many SQL Server forums concerns producing or consuming CSV files in SQL. When asked, my usual go-to answer is “Integration Services.” With Python in SQL Server 2017, there’s a new option. Python has long had CSV handling in its standard library. Let’s use it!

在许多SQL Server论坛上经常遇到的一个问题是有关在SQL中生成或使用CSV文件的问题。当被问到时，我通常的答案是“集成服务”。使用SQL Server 2017中的Python，有一个新选项。长期以来，Python在其标准库中都有CSV处理功能。让我们使用它！

For this exercise, I’ll take the lazy approach and grab a ready-made CSV file using the link in the references section. It is a name and addresses file that looks like this:

对于本练习，我将采用惰性方法，并使用参考部分中的链接获取现成的CSV文件。这是一个名称和地址文件，如下所示：

John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123

These six lines show some of the complexities of CSV data, including missing fields, quoted fields, embedded quotation marks and more. Handling this kind of data in SQL is awkward at best. Can we do it with Python? You bet! This little program will read the above data, parse it and print it:

这六行显示了CSV数据的一些复杂性，包括缺少的字段，带引号的字段，嵌入的引号等。在SQL中处理此类数据充其量是尴尬的。我们可以用Python做到吗？你打赌！这个小程序将读取以上数据，对其进行解析并打印：

import csv
with open('addresses.csv') as csvfile:
    rdr = csv.reader(csvfile)
    for row in rdr:
        print(', '.join(row))

We could just take this script, wrap it up in a call to sp_execute_external_script, and run it in SQL Server 2017. Let’s do a bit more, though. Let’s pass the input to the script as a query and return the results as a SQL Server result set.

我们可以只使用此脚本，将其包装在对sp_execute_external_script的调用中，然后在SQL Server 2017中运行它。不过，让我们做更多的事情。让我们将输入作为查询传递给脚本，然后将结果作为SQL Server结果集返回。

First, create a table to hold the sample data and populate it:

首先，创建一个表来保存示例数据并填充它：

IF OBJECT_ID(N'tempdb..#csv', N'U') IS NOT NULL
	DROP TABLE #csv;
 
CREATE TABLE #csv(line VARCHAR(256));
INSERT INTO #csv(line) VALUES
 
('John,Doe,120 jefferson st.,Riverside, NJ, 08075'),
('Jack,McGinnis,220 hobo Av.,Phila, PA,09119'),
('"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075'),
('Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234'),
(',Blankman,,SomeTown, SD, 00298'),
('"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123')

Next, compose a short Python script to use the Python csv module (part of the standard library):

接下来，编写一个简短的Python脚本以使用Python csv模块（标准库的一部分）：

DECLARE @Python NVARCHAR(MAX) = N'
import csv
csvinput = (row["line"] for index, row in InputDataSet.iterrows())
rdr = csv.reader(csvinput)
OutputDataSet = pandas.DataFrame().from_records(rdr)
';

InputDataSet is a DataFrame object from the pandas module:

InputDataSet是来自pandas模块的DataFrame对象：

The pandas module is included with SQL Server when you install Python support. This little script iterates over the rows in the DataFrame, then constructs OutputDataSet, also a pandas DataFrame object, using the reader method from the csv module, which does the actual parsing. Note that InputDataSet and OutpuDataSet are the default names used by the sp_execute_external_script stored procedure. These are overrideable. See the references section for more information

安装Python支持时，SQL Server附带了pandas模块。这个小脚本遍历DataFrame中的行，然后使用csv模块中的reader方法构造OutputDataSet，这也是一个熊猫DataFrame对象，该模块执行实际的解析。请注意，InputDataSet和OutpuDataSet是sp_execute_external_script存储过程使用的默认名称。这些是可覆盖的。有关更多信息，请参见参考资料部分。

Next, create a table to hold the results:

接下来，创建一个表来保存结果：

IF OBJECT_ID(N'tempdb..#csvparsed', N'U') IS NOT NULL
	DROP TABLE #csvparsed;
CREATE TABLE #csvparsed (
    FirstName VARCHAR(50), 
    LastName VARCHAR(50), 
    Street VARCHAR(50), 
    City Varchar(50), 
    StateProv CHAR(2), 
    PostCode VARCHAR(6));

Finally, run the script to populate the new table:

最后，运行脚本以填充新表：

INSERT INTO #csvparsed(FirstName, LastName, Street, City, StateProv, PostCode)
EXEC sp_execute_external_script 
	@language=N'Python',
	@script = @Python,
	@input_data_1 = N'SELECT line from #csv';
 
SELECT * FROM #csvparsed;

This produces the result set:

这将产生结果集：

Each row of the CSV file has been correctly parsed into rows and columns!

CSV文件的每一行都已正确解析为行和列！

Python also has great support for other types of data, such as JSON, XML, and HTML. While it is possible to shred (parse) and emit these with T-SQL, you may find it easier to do this with the Python standard libraries, depending on the application. It should go without saying that, for performance-critical applications, you need to be sure to perform extensive testing to find the best overall solution.

Python还对其他类型的数据（例如JSON，XML和HTML）提供了强大的支持。尽管可以使用T-SQL粉碎（解析）并发出这些信息，但您可能会发现，使用Python标准库可以更轻松地做到这一点，具体取决于应用程序。不用说，对于性能至关重要的应用程序，您需要确保进行广泛的测试以找到最佳的整体解决方案。

其他应用 (Other applications)

The Python standard library contains many other useful modules that I won’t go into here. You have the basic tools you need. The next time you have a problem that is tough to solve in T-SQL, consider using Python as demonstrated here instead.

Python标准库包含许多其他有用的模块，我将不在这里介绍。您拥有所需的基本工具。下次遇到难以在T-SQL中解决的问题时，请考虑使用Python，如此处所示。

Some modules will require additional permissions. For example, any module that interacts with the file system or remote servers will not work under the Launchpad service, as configured out of the box. You would need to grant the appropriate access to the service account. Also, there are often other ways to perform such operations – Integration Services, for example.

某些模块将需要其他权限。例如，任何与文件系统或远程服务器交互的模块都无法在启动板服务下工作，因为该配置是开箱即用的。您需要授予对服务帐户的适当访问权限。此外，通常还有其他方法可以执行此类操作-例如，Integration Services。

摘要 (Summary)

The addition of Python support to SQL Server opens up a world of new capabilities. Machine learning is certainly the primary reason for adding Python (and R) to SQL Server, but as this article has shown, there are other exciting use cases as well.

SQL Server中增加了对Python的支持，从而开辟了许多新功能。机器学习当然是向SQL Server添加Python（和R）的主要原因，但是正如本文所示，还有其他令人兴奋的用例。

Oh, about adding an IsValidEmail column as mentioned above, why not give that a go on your own? It’s not hard and will give you a chance to get your hands dirty with Python in SQL Server.

哦，关于添加如上所述的IsValidEmail列，为什么不自己去做呢？这并不难，并且将使您有机会尝试在SQL Server中使用Python。