SQL Server中的数据科学：数据理解和转换–序数变量和虚拟变量

最新推荐文章于 2023-01-25 17:14:43 发布

culuo4781

最新推荐文章于 2023-01-25 17:14:43 发布

阅读量957

点赞数

文章标签：数据库 python java 机器学习大数据

原文链接：https://www.sqlshack.com/data-science-sql-server-understanding-transformation-variables/

版权

In my previous article, Introduction to data science, data understanding and preparation, I showed how to make an overview of the distribution of a discrete SQL variable. I analyzed the NumberCarsOwned variable from the dbo.vTargetMail view that you can find in the AdventureWorksDW2016 demo database. The data science graphs I created in R and Python and the histogram created with T-SQL were all very nice. Now let me try to create a histogram for another SQL variable from that view, for the Education variable. I am starting with R, as you can see from the following code.

在上一篇文章《数据科学导论，数据理解和准备》中，我展示了如何概述离散SQL变量的分布。我从dbo.vTargetMail视图中分析了NumberCarsOwned变量，您可以在AdventureWorksDW2016演示数据库中找到该视图。我用R和Python创建的数据科学图和用T-SQL创建的直方图都非常好。现在，让我尝试从该视图为教育变量创建另一个SQL变量的直方图。我从R开始，如下面的代码所示。

# Load RODBC library (install only if needed)
# install.packages("RODBC")
library(RODBC)
 
# Connecting and reading the data
con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")
TM <- as.data.frame(sqlQuery(con,
  "SELECT EnglishEducation AS Education
   FROM dbo.vTargetMail;"),
  stringsAsFactors = TRUE)
close(con)
 
# Plot the Education
plot(TM$Education, main = 'Education',
     xlab = 'Education', ylab = 'Number of Cases',
     col = "light blue")

You can see the result in the following figure.

您可以在下图中看到结果。

This histogram does not look good. The distribution is somehow weird; you would expect that the distribution would resemble the Gaussian curve. What is the problem? Note the labels – the bars are organized alphabetically. However, Education is not a pure nominal SQL variable; it is ordinal, just like NumberCarsOwned one. It has an intrinsic order. The intrinsic order of the NumberCarsOwned SQL variable is the same as the order of the values of that SQL variable. With Education, you need to define the order in some way, so the values get ordered properly.

该直方图看起来不太好。这种分布有些奇怪。您会期望该分布类似于高斯曲线。问题是什么？注意标签–条形按字母顺序组织。但是，Education不是纯粹的名义上SQL变量；它是序号，就像NumberCarsOwned一样。它具有内在顺序。 NumberCarsOwned SQL变量的固有顺序与该SQL变量的值的顺序相同。使用Education，您需要以某种方式定义顺序，以便正确地对值进行排序。

I will demonstrate a couple of techniques for defining the order of ordinal variables in T-SQL, R, and Python. In addition, I will show in this article another useful data preparation technique for working with discrete variables – creating dummy variables, or dummies, from a discrete one.

我将演示几种用于定义T-SQL，R和Python中序数变量顺序的技术。另外，我将在本文中展示另一种有用的数据准备技术，用于处理离散变量-从离散变量创建虚拟变量或虚拟变量。

订购普通 (Ordering ordinals)

If you want to change the order of the values of a SQL variable, you need to change the values themselves. You can do it with the CASE clause in the SELECT statement you use to read the data. Of course, you can also permanently update the values in a table, and even add a new column with the updated values. The following query shows how you can use the CASE clause to modify the values in a query and get a proper histogram.

如果要更改SQL变量的值的顺序，则需要更改值本身。您可以使用用于读取数据的SELECT语句中的CASE子句来完成此操作。当然，您也可以永久更新表中的值，甚至可以使用更新后的值添加新列。以下查询显示了如何使用CASE子句修改查询中的值并获得正确的直方图。

USE AdventureWorksDW2016;
GO
-- Ordering Education
WITH freqCTE AS
(
SELECT CASE v.EnglishEducation 
         WHEN 'Partial High School' THEN '1 - Partial High School'
		 WHEN 'High School' THEN '2 - High School'
         WHEN 'Partial College' THEN '3 - Partial College'
         WHEN 'Bachelors' THEN '4 - Bachelors'
		 WHEN 'Graduate Degree' THEN '5 - Graduate Degree'
       END AS Education,
 COUNT(v.EnglishEducation) AS AbsFreq,
 CAST(ROUND(100. * (COUNT(v.EnglishEducation)) /
       (SELECT COUNT(*) FROM vTargetMail), 0) AS INT) AS AbsPerc
FROM dbo.vTargetMail AS v
GROUP BY v.EnglishEducation
)
SELECT Education,
 AbsFreq,
 SUM(AbsFreq) 
  OVER(ORDER BY Education 
       ROWS BETWEEN UNBOUNDED PRECEDING
	    AND CURRENT ROW) AS CumFreq,
 AbsPerc,
 SUM(AbsPerc)
  OVER(ORDER BY Education
       ROWS BETWEEN UNBOUNDED PRECEDING
	    AND CURRENT ROW) AS CumPerc,
 CAST(REPLICATE('*',AbsPerc) AS VARCHAR(50)) AS Histogram
FROM freqCTE
ORDER BY Education;
GO

You can see the results in the next figure. You see that the order is now correct and that the histogram looks more like the bell curve.

您可以在下图中看到结果。您会看到顺序现在是正确的，并且直方图看起来更像钟形曲线。

In R, you don’t need to change the values of an ordinal to get the correct order. R defines discrete variables as so-called factors. Factors have levels, which represent the distinct values of the variable. You define the factors with the factor() function. When you define the levels, you also define whether the levels are ordered or not. If the levels are ordered, it is important to write them in the correct order. The following code shows how to use the factor() function to define the levels and their order for the Education variable, and then plots the histogram for this variable again.

在R中，您无需更改序数的值即可获得正确的顺序。 R将离散变量定义为所谓的因数。因子具有层次，代表变量的不同值。您可以使用factor（）函数定义因子。定义级别时，还定义了级别是否有序。如果级别是有序的，则以正确的顺序编写它们很重要。以下代码显示了如何使用factor（）函数为Education变量定义级别及其顺序，然后再次绘制该变量的直方图。

# Education is ordered
TM$Education = factor(TM$Education, order = TRUE,
    levels = c("Partial High School",
               "High School", "Partial College",
               "Bachelors", "Graduate Degree"))
# Plot it again
plot(TM$Education, main = 'Education',
     xlab = 'Education', ylab = 'Number of Cases',
     col = "dark green")

You can see the results in the following figure. Now the histogram looks like it should look from the beginning.

您可以在下图中看到结果。现在，直方图看起来应该从一开始就看起来像。

Of course, the next is Python. In Python, you can map values to properly ordered values by using a dictionary object. You use the old values for the keys of the dictionary and the properly ordered values for the values of the key-value pairs of the dictionary. The following code reads the data and creates the mapping dictionary object.

当然，下一个是Python。在Python中，您可以使用字典对象将值映射到正确排序的值。您将旧值用作字典的键，并将正确排序的值用作字典的键-值对的值。以下代码读取数据并创建映射字典对象。

# Imports needed
import numpy as np
import pandas as pd
import pyodbc
import matplotlib.pyplot as plt
 
# Connecting and reading the data
con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')
query = """SELECT EnglishEducation AS Education
           FROM dbo.vTargetMail;"""
TM = pd.read_sql(query, con)
 
# Mapping values to integers for the correct order
map1 = {
    "Partial High School" : 1,
    "High School" : 2, 
    "Partial College" : 3,
    "Bachelors" : 4, 
    "Graduate Degree": 5
    }

Then you replace the original values with the mapped values. You create a graph with the mapped values. However, you use the dictionary keys, the old values, for the labels. You can see the process in the following code.

然后，将原始值替换为映射值。您使用映射的值创建一个图形。但是，您将字典键（旧值）用于标签。您可以在以下代码中看到该过程。

# Replacing Education values with mapping values
TM.Education.replace(map1)
 
# Creating a proper graph
ax = pd.crosstab(TM.Education.replace(map1), 
                 columns = 'Count').plot(kind = 'bar', 
                                         legend = False,
                                         title = 'Education',
                                         fontsize = 12)
ax.set_xticklabels(map1.keys(), rotation=0)
plt.show()

And here is the result.

这就是结果。

制作假人 (Making dummies)

Sometimes you want to convert a categorical variable to a set of indicators that just show the presence or the absence of the values of the variable. You create a new indicator for each distinct value or level of the original variable. You assign the value 1 to the indicator for a specific value when the original variable takes that specific value and the value 0 for each other value of the original variable. Such indicators are also called dummy variables, or dummies. There are some algorithms, particularly when you perform regression analysis, that needs only numerical input.

有时您希望将分类变量转换为一组指标，这些指标仅显示变量值的存在与否。您为原始变量的每个不同值或级别创建一个新指标。当原始变量采用特定值时，可以将指标1的值分配给该特定值，将原始变量彼此的其他值的值分别分配为0。这样的指标也称为虚拟变量或虚拟变量。有一些算法，尤其是在执行回归分析时，仅需要数字输入。

You can easily understand the process of creating the indicators through the following T-SQL code that uses the IIF function.

通过以下使用IIF函数的T-SQL代码，您可以轻松理解创建指标的过程。

-- Getting dummies
SELECT TOP 10 v.EnglishEducation AS Education,
	   IIF(v.EnglishEducation = 'Partial High School', 1, 0)
	    AS [TM _Partial High School],
	   IIF(v.EnglishEducation = 'High School', 1, 0)
	    AS [TM _High School],
	   IIF(v.EnglishEducation = 'Partial College', 1, 0)
	    AS [TM _Partial College],
	   IIF(v.EnglishEducation = 'Bachelors', 1, 0)
	    AS [TM _Bachelors],
	   IIF(v.EnglishEducation = 'Graduate Degree', 1, 0)
	    AS [TM _Graduate Degree]
FROM dbo.vTargetMail AS v
ORDER BY NEWID();

I ordered the results by the NEWID() function values just to quickly mix the rows, so I could get all possible distinct values of the Education SQL variable in first ten rows. Here are the results.

我通过NEWID（）函数值对结果进行排序，只是为了快速混合行，因此我可以在前十行中获得Education SQL变量的所有可能的不同值。这是结果。

In R, there is a very convenient way to create dummies with the help of the dummies package, the dummy function. The following code installs the package and creates the dummies for the Education variable. For the sake of brevity, I am not showing the results of the code.

在R中，有一种非常方便的方法来借助假人包（虚拟功能）来创建假人。以下代码将安装软件包并为Education变量创建虚拟变量。为了简洁起见，我没有显示代码的结果。

# Get dummies in R
install.packages("dummies")
library(dummies)
 
# Create the dummies
TM1 <- cbind(TM, dummy(TM$Education, sep = "_"))
# Last 10 rows
tail(TM1, 10)

In Python, again you do not need to install any additional package. You already have the get_dummies() method in the pandas package. The following code shows how to create the dummies in Python. I don’t show the results of this code as well, as they are the same as in T-SQL and R.

同样，在Python中，您不需要安装任何其他软件包。您已经在pandas包中有了get_dummies（）方法。以下代码显示了如何在Python中创建虚拟变量。我也没有显示此代码的结果，因为它们与T-SQL和R中的结果相同。

# Create the dummies
TM1 = TM[['Education']].join(pd.get_dummies(TM.Education, prefix = 'TM '))
TM1.tail(10)

结论 (Conclusion)

You can already see that, in data science, proper data preparation is not that simple as it might look at the first glimpse. And I am not done even with the discrete variables yet. In the next article on data science, I will show some possibilities for discretization, for transforming a continuous SQL variable to a discrete one.

您已经看到，在数据科学中，正确的数据准备并不像乍看起来那样简单。而且我甚至还没有使用离散变量。在下一篇关于数据科学的文章中，我将展示一些离散化的可能性，可以将连续SQL变量转换为离散的变量。