Data science in SQL Server: Data analysis and transformation – Information entropy of a discrete variable


In this article, in the data science: data analysis and transformation series, we’ll be talking about information entropy.

In the conclusion of my last article, Data science, data understanding and preparation – binning a continuous variable, I wrote about preserving the information when you bin a continuous variable into bins with an equal number of cases. I explain that sentence in the article you are currently reading. I will show you how to calculate the information stored in a discrete variable by explaining the measure for that information, namely the information entropy.

Information entropy was defined by Claude E. Shannon in his information theory. Shannon's goal was to quantify the amount of information in a variable. Nowadays, information theory, as a special branch of applied mathematics, is used in many fields, such as computer science, electrical engineering, and more.

Introducing information entropy

Information is actually a surprise. If you are surprised when you hear or read something, it means that you didn't know it, that you actually learned something new, and this something new is a piece of information. So how does this connect to a variable in your dataset?

Intuitively, you can imagine that the information of a variable is connected with its variability. If a variable is a constant, where all cases occupy the same single state, you cannot be surprised by the value of any single case. Imagine that you have a group of attendees in a classroom. You know a bit about their education. Let's start with an example where you split the attendees into two groups by education: low and high. In one class, 95% of attendees have high and 5% low education. You ask a random attendee about her or his actual education level. You would expect that it would be high. You would be surprised only in 5% of cases, learning that for that particular attendee it is actually low. Now imagine that you have a 50% – 50% case. The knowledge about this distribution does not help you much – whatever state you expect, you would be surprised half of the time.

Now imagine that you have previous knowledge about the attendees' education classified into three distinct classes: low, medium, and high. With a 33% – 33% – 33% distribution, you would be surprised about a person's education in two-thirds of the cases. With any other distribution, for example 25% – 50% – 25%, you would be surprised fewer times. For this specific case, you would expect medium level, and half of the time you would learn that the level of the person you are talking with is not medium, being either low or high.
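To make the two classroom examples concrete, here is a minimal Python sketch (not part of the original article) that computes how often you would be surprised, i.e. the probability that a randomly picked attendee does not belong to the most likely class, for each of the distributions mentioned above.

# A quick sketch (not from the original article): probability of being surprised
# when you always guess the most likely class, for the example distributions in the text
distributions = {
    '95% - 5%':        [0.95, 0.05],
    '50% - 50%':       [0.50, 0.50],
    '33% - 33% - 33%': [1/3, 1/3, 1/3],
    '25% - 50% - 25%': [0.25, 0.50, 0.25]
}
for name, probs in distributions.items():
    surprise = 1 - max(probs)  # a case is surprising whenever it is not in the modal class
    print('{0}: surprised in {1:.0%} of cases'.format(name, surprise))

The printed percentages (5%, 50%, 67%, and 50%) match the reasoning above.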

From the last two paragraphs, you can see that more possible states mean a higher maximal possible surprise, or maximal possible information. By binning, you are lowering the number of possible states. If you bin into classes of equal height, the loss of information is minimal, preserving it as much as possible, as I wrote in my previous article.

Of course, the question is how to measure this information. You can try to measure the spread of the variable. For example, for ordinal variables stored as integers, you could pretend that the variables are continuous, and you could use the standard deviation as a measure. In order to compare the spread of two different variables, you could use the relative variability, or the coefficient of variation, defined as the standard deviation divided by the mean:
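The formula image from the original article is not reproduced here; written out from the description above, the coefficient of variation is simply:

CV = s / x̄

where s is the standard deviation of the variable and x̄ is its mean.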

Let me start by preparing the demo data for this article with T-SQL. Note that I created a calculated variable GenMar by simply concatenating the Gender and the MaritalStatus variables into one that will have four distinct states (FM, FS, MM, MS).

USE AdventureWorksDW2016;
GO
-- Preparing demo table
DROP TABLE IF EXISTS dbo.TM;
GO
SELECT CustomerKey,
 NumberCarsOwned, BikeBuyer,
 Gender + MaritalStatus AS GenMar
INTO dbo.TM
FROM dbo.vTargetMail;
GO

Now I will switch to R. Here is the code that reads the data from the table I just created.

# Load RODBC library (install only if needed)
# install.packages("RODBC")
library(RODBC)
# Connecting and reading the data
con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")
TM <- as.data.frame(sqlQuery(con,
  "SELECT CustomerKey, NumberCarsOwned, BikeBuyer, GenMar
   FROM dbo.TM;"),
  stringsAsFactors = TRUE)
close(con)

In this article, I will use the RevoScaleR package for drawing the histogram and the DescTools package for calculating the information entropy, so you need to load them, and potentially also install the DescTools package (you should already have the RevoScaleR package if you use the Microsoft R engine).

# Histogram from RevoScaleR
# Information entropy (install only if needed) from DescTools
# install.packages("DescTools");
library("RevoScaleR")
library("DescTools")

The following R code calculates the frequencies, creates a histogram, and calculates the mean, standard deviation, and coefficient of variation for the NumberCarsOwned variable. Note that I use the paste() function to concatenate the strings with the calculated values.

# Frequencies and histogram
table(TM$NumberCarsOwned)
rxHistogram(formula = ~NumberCarsOwned,
            data = TM)
# Mean, StDev, CV
paste('mean:', mean(TM$NumberCarsOwned))
paste('sd  :', sd(TM$NumberCarsOwned))
paste('CV  :', sd(TM$NumberCarsOwned) / mean(TM$NumberCarsOwned))

Here is the histogram for this variable.

And here are the numerical results.

   0    1    2    3    4
4238 4883 6457 1645 1261
[1] "mean: 1.50270504219866"
[1] "sd  : 1.13839374115481"
[1] "CV  : 0.757563000846255"

Let me do the same calculations for the BikeBuyer variable.

table(TM$BikeBuyer)
paste('mean:', mean(TM$BikeBuyer))
paste('sd  :', sd(TM$BikeBuyer))
paste('CV  :', sd(TM$BikeBuyer) / mean(TM$BikeBuyer))

Here are the results.

   0    1
9352 9132
[1] "mean: 0.494048907162952"
[1] "sd  : 0.499978108041433"
[1] "CV  : 1.01200124277681"

You can see that although the standard deviation for the BikeBuyer is lower than for the NumberCarsOwned, the relative variability is higher.

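Plugging the numbers from the two outputs above into the coefficient of variation makes the comparison explicit:

CV(NumberCarsOwned) = 1.1384 / 1.5027 ≈ 0.758
CV(BikeBuyer)       = 0.5000 / 0.4940 ≈ 1.012

BikeBuyer has the smaller standard deviation but the larger relative variability.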

Now, what about categorical, or nominal, variables? Of course, you cannot use the calculations that are intended for continuous variables. It is time to calculate the information entropy.

Defining information entropy

Shannon defined the information of a particular state as the probability of the state multiplied by the logarithm, base two, of that probability:
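The formula image is missing from this copy of the article; reconstructed from the description above and from the sign discussion in the next paragraph, the information of a single state x_i with probability P(x_i) is:

I(x_i) = -P(x_i) * log2(P(x_i))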

The probability can take any value in the interval between 0 and 1. The logarithm function returns negative values for arguments between zero and one. This is why the negative sign, or the multiplication by -1, is added.
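A quick numeric check of the sign, using this definition: for a state with probability 0.25, log2(0.25) = -2, so its information is -0.25 * (-2) = 0.5 bits, a positive quantity.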

The information entropy of a variable, or the actual amount of information stored in this variable, is simply the sum of the information of all particular states:
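Again reconstructing the missing formula from the text, the information entropy of a variable X with n possible states is:

H(X) = -Σ (i = 1 to n) P(x_i) * log2(P(x_i))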

What is the maximal possible entropy for a specific number of states, let’s say three? Let’s do the calculation.

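The formula that followed is not reproduced here; for three states, the entropy is maximal when all states are equally likely, with probability 1/3 each, so the calculation starts as:

H = -(1/3 * log2(1/3) + 1/3 * log2(1/3) + 1/3 * log2(1/3)) = -3 * (1/3) * log2(1/3)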

From the logarithm formulas, we know that we can express this logarithm differently:
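The formula is again missing; presumably it is the reciprocal rule applied to log2(1/3):

log2(1/3) = log2(1) - log2(3) = -log2(3)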

Therefore, we can develop the equation for the maximal possible information entropy:

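Reconstructed from the steps above and the conclusion in the next paragraph, the derivation ends with:

H_max = -3 * (1/3) * (-log2(3)) = log2(3)

and, for a variable with n equally likely states in general, H_max = log2(n).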

You can see that the maximal possible information entropy of a variable with n states is the logarithm, base two, of this number n.

Calculating the entropy in T-SQL

Let me start with a calculation of the maximal possible information entropy for a different number of states.

SELECT LOG(2,2) AS TwoStatesMax,
 LOG(3,2) AS ThreeStatesMax,
 LOG(4,2) AS FourStatesMax,
 LOG(5,2) AS FiveStatesMax;

Here are the results.

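The screenshot with the query output is not included here; since these are simply base-two logarithms, the returned values are:

log2(2) = 1, log2(3) ≈ 1.585, log2(4) = 2, log2(5) ≈ 2.322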

The following code calculates the frequencies of the GenMar variable, the information entropy for each state, and then the total entropy, compared to the maximal possible information entropy of a four-state variable.

-- Entropy of the GenMar
WITH ProbabilityCTE AS
(
SELECT GenMar,
 COUNT(GenMar) AS StateFreq
FROM dbo.TM
GROUP BY GenMar
),
StateEntropyCTE AS
(
SELECT GenMar,
 1.0*StateFreq / SUM(StateFreq) OVER () AS StateProbability
FROM ProbabilityCTE
)
SELECT 'GenMar' AS Variable,
 (-1)*SUM(StateProbability * LOG(StateProbability,2)) AS TotalEntropy,
 LOG(COUNT(*),2) AS MaxPossibleEntropy,
 100 * ((-1)*SUM(StateProbability * LOG(StateProbability,2))) / 
 (LOG(COUNT(*),2)) AS PctOfMaxPossibleEntropy
FROM StateEntropyCTE;
GO
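The original output screenshot is not included here; based on the entropy values computed for GenMar with Python and R later in this article, the query should return approximately the following:

Variable  TotalEntropy  MaxPossibleEntropy  PctOfMaxPossibleEntropy
GenMar    1.9935        2                   99.68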

From the results, you can see that this variable has a high relative entropy.

Calculating the information entropy in Python

For a start, let’s import all of the libraries needed and read the data.

# Imports needed
import numpy as np
import pandas as pd
import pyodbc
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sc
# Connecting and reading the data
con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')
query = """SELECT CustomerKey, NumberCarsOwned, BikeBuyer, GenMar
           FROM dbo.TM;"""
TM = pd.read_sql(query, con)

For the calculation of the information entropy, you can use the scipy.stats.entropy() function. This function needs the probabilities as the input. Therefore, I defined my own function that calculates the entropy with a single argument – the name of the variable, and then in the body I calculate the state probabilities and the entropy of the variable with the scipy.stats.entropy() function. Then I calculate the information entropy of the discrete variables in my dataset.

# Function that calculates the entropy
def f_entropy(indata):
    indataprob = indata.value_counts() / len(indata)
    entropy = sc.stats.entropy(indataprob, base=2)
    return entropy
# Use the function on variables
f_entropy(TM.NumberCarsOwned), np.log2(5), f_entropy(TM.NumberCarsOwned) / np.log2(5)
f_entropy(TM.BikeBuyer), np.log2(2), f_entropy(TM.BikeBuyer) / np.log2(2)
f_entropy(TM.GenMar), np.log2(4), f_entropy(TM.GenMar) / np.log2(4)

Here are the results.

(2.0994297487400737, 2.3219280948873622, 0.9041751781042634)
(0.99989781003755662, 1.0, 0.99989781003755662)
(1.9935184517263986, 2.0, 0.99675922586319932)

Calculating the information entropy in R

Finally, let’s do the calculation of the information entropy also in R. But before that, let me show you the distribution and the histogram for the GenMar calculated variable.

# GenMar
table(TM$GenMar)
rxHistogram(formula = ~GenMar,
            data = TM)

Here are the numerical and graphical results.

  FM   FS   MM   MS
4745 4388 5266 4085

In R, I will use the DescTools Entropy() function. This function expects the absolute frequencies as the input. Here you can see the code.

# Entropy
NCOT = table(TM$NumberCarsOwned)
print(c(Entropy(NCOT), log2(5), Entropy(NCOT) / log2(5)))
BBT = table(TM$BikeBuyer)
print(c(Entropy(BBT), log2(2), Entropy(BBT) / log2(2)))
GenMarT = table(TM$GenMar)
print(c(Entropy(GenMarT), log2(4), Entropy(GenMarT) / log2(4)))

And here are the last results in this article.

[1] 2.0994297 2.3219281 0.9041752
[1] 0.9998978 1.0000000 0.9998978
[1] 1.9935185 2.0000000 0.9967592

Conclusion

This concludes my article on information entropy. Having exhausted the articles on working with discrete variables, it looks like it is time to switch to continuous ones. But before that, I want to explain something else. In some of my previous articles, I tacitly added calculated variables to a dataset. Therefore, I want to introduce some of the operations on whole datasets.

Table of contents

Introduction to data science, data understanding and preparation
Data science in SQL Server: Data understanding and transformation – ordinal variables and dummies
Data science in SQL Server: Data analysis and transformation – binning a continuous variable
Data science in SQL Server: Data analysis and transformation – Information entropy of a discrete variable
Data understanding and preparation – basic work with datasets
Data science in SQL Server: Data analysis and transformation – grouping and aggregating data I
Data science in SQL Server: Data analysis and transformation – grouping and aggregating data II
Interview questions and answers about data science, data understanding and preparation

References

Translated from: https://www.sqlshack.com/data-science-sql-server-data-analysis-transformation-information-entropy-discrete-variable/
