SQL Server中的数据科学:数据分析和转换–合并连续变量

I started to explain the data preparation part of a data science project with discrete variables. As you should know by now, discrete variables can be categorical or ordinal. For ordinal, you have to define the order either through the values of the variable or inform about the order the R or the Python execution engine. Let me start this article with Python code that shows another way how to define the order of the Education variable from the dbo.vTargetMail view from the AdventureWorksDW2016 demo database.

我开始解释具有离散变量的数据科学项目的数据准备部分。 如您现在所知,离散变量可以是分类的或有序的。 对于序数,您必须通过变量的值定义顺序,或者告知R或Python执行引擎的顺序。 让我从Python代码开始本文,该代码演示了如何从AdventureWorksDW2016演示数据库的dbo.vTargetMail视图中定义Education变量的顺序。

# Imports needed
import numpy as np
import pandas as pd
import pyodbc
import matplotlib.pyplot as plt
import seaborn as sns
 
# Connecting and reading the Education
con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')
query = """SELECT CustomerKey, EnglishEducation AS Education
           FROM dbo.vTargetMail;"""
TM = pd.read_sql(query, con)
 
# Define Education as categorical
TM['Education'] = TM['Education'].astype('category')
# Reordering Education
TM['Education'].cat.reorder_categories(
    ["Partial High School", 
     "High School","Partial College", 
     "Bachelors", "Graduate Degree"], inplace=True)
TM['Education']
TM['Education'].value_counts(sort = False)
 
# Education barchart
sns.countplot(x="Education", data=TM);
plt.show()

This code imports the necessary libraries, reads the data from SQL Server, defines Education as categorical, and then reorder the values using the pandas built-in function cat.reorder_categories(). Then the code shows the distribution of the values and the bar plot for this variable. For the sake of brevity, I am not showing the bar chart, because it is the same as I have shown in my previous article, Data science, data understanding and preparation – ordinal variables and dummies.

该代码导入必要的库,从SQL Server读取数据,将Education定义为分类,然后使用pandas内置函数cat.reorder_categories()对值进行重新排序。 然后代码显示该变量的值分布和条形图。 为了简洁起见,我没有显示条形图,因为它与我上一篇文章《 数据科学,数据理解和准备–序数变量和虚拟变量》中显示的相同

Some data science algorithms need only discrete variables for the input. If you have a continuous variable, you have to group or bin the values. Sometimes you want to bin a continuous variable for other reasons as well, for example, to show the distribution in a bar chart, or to use it for grouping and pivoting. Therefore, let me explain a couple of options you have for binning a continuous variable.

一些数据科学算法仅需要离散变量作为输入。 如果有连续变量,则必须将值分组或合并。 有时,出于其他原因,您也想对连续变量进行装箱,例如,以条形图显示分布,或将其用于分组和旋转。 因此,让我解释一下对连续变量进行装箱的两个选项。

等宽合并 (Equal width binning)

There are many ways to do the binning. I will introduce here the three most popular ones, the equal width, equal height, and custom binning. Let me start with T-SQL code that prepares a new table with the Age variable and the key, Age lowered for 10 years, to make the data more plausible.

有很多方法可以进行装箱。 我将在这里介绍三种最受欢迎​​的宽度,宽度,高度和自定义装箱。 让我从T-SQL代码开始,该代码准备一个带有Age变量和关键字Age降低10年的新表,以使数据更合理。

USE AdventureWorksDW2016;
GO
 
-- Preparing demo table for the continuous variable
DROP TABLE IF EXISTS dbo.TMAge;
GO
SELECT CustomerKey, Age-10 AS Age
INTO dbo.TMAge
FROM dbo.vTargetMail;
GO
 
-- Data overview
SELECT MIN(Age) AS minA,
 MAX(Age) AS maxA,
 MAX(Age) - MIN(Age) AS rngA,
 AVG(Age) AS avgA,
 1.0 * (MAX(Age) - MIN(Age)) / 5 AS binwidth
FROM dbo.TMAge;

Here are the results.

这是结果。

You can see the minimal, maximal and average value of the age, the range, and the width of a bin for equal width binning. Equal width binning means that the width of each bin is equal, no matter of the number of cases in each bin. This kind of binning preserves well the distribution of the continuous variable, especially if the number of bins is high. The following Python code reads the Age from the table just created, and then bins it into 20 equal width bins, and then creates the bar chart.

您可以看到年龄,范围和宽度相等的装箱的最小,最大和平均值。 等宽合并意味着每个合并箱的宽度相等,而与每个合并箱中的箱子数量无关。 这种分箱可以很好地保留连续变量的分布,尤其是在分箱数较多的情况下。 以下Python代码从刚创建的表中读取Age,然后将其装箱到20个等宽的箱中,然后创建条形图。

# Connecting and reading the Age
con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')
query = """SELECT CustomerKey, Age
           FROM dbo.TMAge;"""
TMAge = pd.read_sql(query, con)
 
# Show the Age continuous variable in 20 equal width bins
TMAge['AgeEWB'] = pd.cut(TMAge['Age'], 20)
TMAge['AgeEWB'].value_counts()
pd.crosstab(TMAge.AgeEWB, 
            columns = 'Count').plot(kind = 'bar', 
                                    legend = False,
                                    title = 'AgeEWB')
plt.show()

So here is the bar chart. You can see the distribution of Age quite well.

这是条形图。 您可以很好地看到Age的分布。

If you use less bins, then the bar chart follows the shape of the distribution worse. The next Python code bins the values of the Age in 5 bins.

如果使用较少的箱,则条形图的分布形状会更糟。 下一个Python代码将Age的值分成5个bin。

# Equal width binning - 5 bins
TMAge['AgeEWB'] = pd.cut(TMAge['Age'], 5)
TMAge['AgeEWB'].value_counts(sort = False)

These are the counts, or the frequency distribution of the Age, binned in 5 equal width bins.

这些是年龄(Age)的计数或频率分布,分为5个等宽的条带。

In R, you can use the cut() function from the basic installation, without any additional package, to bin the data. The following code loads the RODBC library, reads the SQL Server table, and then bins the continuous variable, Age, in 5 equal width bins. The counts, the result of the table() function, is the same as the result from the Python code.

在R中,可以使用基本安装中的cut()函数,而无需任何其他程序包来对数据进行装箱。 以下代码加载RODBC库,读取SQL Server表,然后将连续变量Age放入5个等宽的箱中。 计数是table()函数的结果,与Python代码的结果相同。

# Load RODBC library (install only if needed)
# install.packages("RODBC")
library(RODBC)
 
# Connecting and reading the data
con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")
TMAge <- as.data.frame(sqlQuery(con,
  "SELECT CustomerKey, Age
   FROM dbo.TMAge;"),
  stringsAsFactors = TRUE)
close(con)
 
# Equal width binning of a continuous variable
TMAge["AgeEWB"] = cut(TMAge$Age, 5)
table(TMAge$AgeEWB)

Finally, let’s do the equal width binning in T-SQL. The code first reads the minimal and the maximal value of Age and calculated the width of the bins, considering binning into 5 bins. Then the next query uses the CASE clause to define the bins and the labels of the bins.

最后,让我们在T-SQL中进行等宽合并。 该代码首先读取Age的最小值和最大值,然后考虑合并为5个bin,计算出bin的宽度。 然后,下一个查询使用CASE子句来定义垃圾箱和垃圾箱的标签。

-- Equal width binning of a continuous variable
DECLARE @binwidth AS NUMERIC(5,2), 
 @minA AS INT, @maxA AS INT;
SELECT @minA = MIN(AGE),
 @maxa = MAX(Age),
 @binwidth = 1.0 * (MAX(Age) - MIN(Age)) / 5
FROM dbo.TMAge;
SELECT CustomerKey, Age,
 CASE 
  WHEN Age >= @minA + 0 * @binwidth AND Age < @minA + 1 * @binwidth
   THEN CAST((@minA + 0 * @binwidth) AS VARCHAR(5)) + ' - ' +
        CAST((@minA + 1 * @binwidth) AS VARCHAR(5))
  WHEN Age >= @minA + 1 * @binwidth AND Age < @minA + 2 * @binwidth
   THEN CAST((@minA + 1 * @binwidth) AS VARCHAR(5)) + ' - ' +
        CAST((@minA + 2 * @binwidth) AS VARCHAR(5))
  WHEN Age >= @minA + 2 * @binwidth AND Age < @minA + 3 * @binwidth
   THEN CAST((@minA + 2 * @binwidth) AS VARCHAR(5)) + ' - ' +
        CAST((@minA + 3 * @binwidth) AS VARCHAR(5))
  WHEN Age >= @minA + 3 * @binwidth AND Age < @minA + 4 * @binwidth
   THEN CAST((@minA + 3 * @binwidth) AS VARCHAR(5)) + ' - ' +
        CAST((@minA + 4 * @binwidth) AS VARCHAR(5))
  ELSE CAST((@minA + 4 * @binwidth) AS VARCHAR(5)) + ' + '
 END AS AgeEWB
FROM dbo.TMAge;
GO

Again, the distribution of the binned continuous variable is the same as in Python or R. Here are the partial results of the query, showing a couple of rows with the original and the binned values of the Age continuous variable.

同样,合并的连续变量的分布与Python或R中的相同。这是查询的部分结果,显示了几行,分别包含Age连续变量的原始值和合并值。

等高合并 (Equal height binning)

Equal height binning of a continuous variable means that after the binning, there is am approximately equal number of cases in each bin, and the width of the bins varies. You can do this very simply with the T_SQL NTILE() function, like the following code shows. Note that I used the ORDER BY clause only to shuffle the results, to get all possible tiles of the continuous variable in the first few rows.

连续变量的等高分箱意味着在分箱之后,每个分箱中的箱子数量大致相等,并且分箱的宽度会变化。 您可以使用T_SQL NTILE()函数非常简单地执行此操作,如以下代码所示。 请注意,我仅使用ORDER BY子句对结果进行混洗,以获取前几行中连续变量的所有可能切片。

-- Equal height binning
SELECT CustomerKey, Age,
 CAST(NTILE(5) OVER(ORDER BY Age)
  AS CHAR(1)) AS AgeEHB
FROM dbo.TMAge
ORDER BY NEWID();
GO

Here are partial results.

这是部分结果。

In R, I will create a function for the equal height binning. Let me develop the function step by step. First, I use the integer division to calculate the minimal number of cases in each bin for 5 bins.

在R中,我将为等高合并创建一个函数。 让我逐步开发功能。 首先,我使用整数除法来计算每个bin中5个bin的最小案例数。

# Lower limit for the number of cases in a bin
length(TMAge$Age) %/% 5

The number given is 3,696. If the number of cases would be divisible by 5, then each bin would have this number of cases. Let’s create a vector with five times this number.

给出的数字是3,696。 如果案例数可以被5整除,则每个bin都将具有此案例数。 让我们用这个数字的五倍创建一个向量。

# Create the vector of the bins with number of cases
rep(length(TMAge$Age) %/% 5, 5)

However, the number of cases in this data frame we are using is not divisible by 5. With the modulo operator, you can see that there is a remainder of 4 rows.

但是,我们正在使用的该数据帧中的案例数不能被5整除。通过取模运算符,您可以看到剩下4行。

# How many bins need a case more
length(TMAge$Age) %% 5

We will add these four cases to the first four bins, like does the NTILE() function in T-SQL. The following code creates a vector of five values, namely (1, 1, 1, 1, 0), which will be added to the number of cases in each bin.

我们将把这四种情况添加到前四个bin中,就像T-SQL中的NTILE()函数一样。 下面的代码创建一个包含五个值的向量,即(1、1、1、1、0),这些值将添加到每个bin中的个案数中。

# Array to add cases to the first 4 bins
ifelse(1:5 <= length(TMAge$Age) %% 5, 1, 0)

Finally, here is the code that creates the binning function and does the binning.

最后,这是创建合并功能并执行合并的代码。

# Equal height binning - a function
EHBinning <- function(data, nofbins) {
    bincases <- rep(length(data) %/% nofbins, nofbins)
    bincases <- bincases + ifelse(1:nofbins <= length(data) %% nofbins, 1, 0)
    bin <- rep(1:nofbins, bincases)
    bin <- bin[rank(data, ties.method = "last")]
    return(factor(bin, levels = 1:nofbins, ordered = TRUE))
}
TMAge["AgeEHB"] = EHBinning(TMAge$Age, 5)
table(TMAge$AgeEHB)

The table() function shows the number of cases in each bin, 3697, 3697, 3697, 3697, and 3696.

table()函数显示每个分箱3697、3697、3697、3697和3696中的案例数。

In Python, you can use the pandas qcut() function for the equal height, or quantile binning, like the following code shows.

在Python中,可以将pandas qcut()函数用于相等的高度或分位数合并,如以下代码所示。

# Equal height binning
TMAge['AgeEHB'] = pd.qcut(TMAge['Age'], 5)
TMAge['AgeEHB'].value_counts(sort = False)

Note the results – the distribution of the values slightly differs from the results of the T-SQL and R code.

注意结果-值的分布与T-SQL和R代码的结果略有不同。

These small differences are due to the fact that the qcut() function classifies all cases with the same value in a single tile. In Python, all cases with Age of 25 are in the first tile; in T-SQL and R, some of these cases were assigned to tile 2, in order to have a really equal number of cases in each tile. Nevertheless, these differences are small and not really important.

这些小的差异是由于qcut()函数在单个图块中将所有情况分类为具有相同值的事实所致。 在Python中,年龄为25岁的所有案例都在第一个图块中; 在T-SQL和R中,其中一些情况被分配给了图块2,以便每个图块中的情况数实际上相等。 但是,这些差异很小,并不真正重要。

自定义装箱 (Custom binning)

If you want to follow the real life or business logic, you need to do a custom binning. The Age continuous variable is a very nice example. Age is many times binned with a custom logic. You can imagine that one-year difference in age means much more when you are a teenager than when you are in sixties or seventies. Therefore, you create narrower bins for the smaller values and wider for the bigger values of the Age continuous variable. Here is the T-SQL code that does the custom binning.

如果要遵循现实生活或业务逻辑,则需要执行自定义合并。 Age连续变量是一个很好的例子。 年龄是由自定义逻辑多次组合而成的。 您可以想象,青少年时期的一年差异比六十或七十年代的意义要大得多。 因此,您可以为Age连续变量的较小值创建较窄的容器,为较大的值创建较宽的容器。 这是执行自定义合并的T-SQL代码。

-- Custom binning
SELECT CustomerKey, Age,
 CASE 
  WHEN Age >= 18 AND Age < 24
   THEN '18 - 23'
  WHEN Age >= 24 AND Age < 31
   THEN '24 - 30'
  WHEN Age >= 31 AND Age < 41
   THEN '31 - 40'
  WHEN Age >= 41 AND Age < 56
   THEN '41 - 55'
  ELSE '56 +'
 END AS AgeCUB 
FROM dbo.TMAge;
GO

You can do it easily with the pandas cut() function in Python as well. You only need to provide the function a vector with the cutting values.

您也可以使用Python中的pandas cut()函数轻松完成此操作。 您只需要向函数提供带有切割值的向量即可。

# Custom binning
custombins = [17, 23, 30, 40, 55, 90]
pd.cut(TMAge['Age'], custombins)
TMAge['AgeCUB'] = pd.cut(TMAge['Age'], custombins)
TMAge['AgeCUB'].value_counts(sort = False)
pd.crosstab(TMAge.AgeCUB, 
            columns = 'Count').plot(kind = 'bar', 
                                    legend = False,
                                    title = 'AgeCUB')
plt.show()

Here is the graph showing the distribution of the Age binned in custom bins.

这是显示在自定义分类箱中分类的年龄分布的图形。

Finally, let’s do the custom binning in R. I am using the cut() function again and feed it with a vector of the cutting points, like in Python.

最后,让我们在R中进行自定义合并。我再次使用cut()函数,并向其提供切点向量,就像在Python中一样。

# Custom binning
TMAge["AgeCUB"] = cut(TMAge$Age, c(17, 23, 30, 40, 55, 90))
table(TMAge$AgeCUB)

And here are the counts of the binned continuous variable.

这是合并的连续变量的计数。

结论 (Conclusion)

In Data science working with variables is commonplace. Equal width and custom binning are both quite intuitive techniques for managing continuous variables. You can ask yourself why you would use equal height binning. Of course, there is a reason for this kind of binning as well. You preserve more information with equal height binning than with other two options when working with a continuous variable. I will explain this in detail in my next article in this data science series.

在数据科学中,使用变量很普遍。 等宽和自定义合并都是管理连续变量的非常直观的技术。 您可以问自己为什么要使用等高合并。 当然,这种装箱也是有原因的。 当使用连续变量时,与其他两个选项相比,使用等高合并将保留更多信息。 我将在本数据科学系列的下一篇文章中对此进行详细说明。

目录 (Table of contents)

Introduction to data science, data understanding and preparation
Data science in SQL Server: Data understanding and transformation – ordinal variables and dummies
Data science in SQL Server: Data analysis and transformation – binning a continuous variable
Data science in SQL Server: Data analysis and transformation – Information entropy of a discrete variable
Data understanding and preparation – basic work with datasets
Data science in SQL Server: Data analysis and transformation – grouping and aggregating data I
Data science in SQL Server: Data analysis and transformation – grouping and aggregating data II
Interview questions and answers about data science, data understanding and preparation
数据科学导论,数据理解和准备
SQL Server中的数据科学:数据理解和转换–序数变量和虚拟变量
SQL Server中的数据科学:数据分析和转换–合并连续变量
SQL Server中的数据科学:数据分析和转换–离散变量的信息熵
数据理解和准备–数据集的基础工作
SQL Server中的数据科学:数据分析和转换–分组和聚合数据I
SQL Server中的数据科学:数据分析和转换–分组和聚合数据II
面试有关数据科学,数据理解和准备的问答

参考资料 (References)

翻译自: https://www.sqlshack.com/data-science-sql-server-data-analysis-transformation-binning-a-continuous-variable/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值