EDA on Categorical and Ordinal Data

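This article focuses on exploratory data analysis (EDA), mainly on how to handle categorical and ordinal data. Using Python and related tools, it aims at a deeper understanding of this kind of data in large datasets, laying the groundwork for later data analysis and AI applications.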

STATISTICS FOR DATA SCIENCE AND MACHINE LEARNING

Categorical variables are those whose possible values come from a set of options, which can be predefined or open-ended. An example is a person's gender. For ordinal variables, the options can additionally be ordered by some rule, as in the Likert scale:

  • Like
  • Like Somewhat
  • Neutral
  • Dislike Somewhat
  • Dislike
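As a minimal sketch of what "ordered by some rule" means in practice, the Likert options can be encoded as ranked values in Python. The class name `Likert` and the 1 to 5 coding are assumptions made for illustration; only the order matters, not the distance between codes.

```python
from enum import IntEnum
import statistics


class Likert(IntEnum):
    """Hypothetical coding: a larger number means a more favourable answer."""
    DISLIKE = 1
    DISLIKE_SOMEWHAT = 2
    NEUTRAL = 3
    LIKE_SOMEWHAT = 4
    LIKE = 5


# Made-up answers from five respondents.
answers = [
    Likert.NEUTRAL,
    Likert.LIKE,
    Likert.DISLIKE_SOMEWHAT,
    Likert.LIKE,
    Likert.LIKE_SOMEWHAT,
]

# Ordinal data can be sorted and summarised with order statistics
# such as the median, which a purely categorical variable cannot.
ordered = sorted(answers)
median_answer = statistics.median_low(answers)
print(median_answer.name)
```

A purely categorical variable such as gender would support counting and comparing for equality, but not `sorted` or a median.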

To keep the following examples simple, we will use a single dataset: a group of students who have each passed or failed two distinct exams. The results are shown in the following RxC table:

Exam 1 \ Exam 2  | Pass exam 2 | Not pass exam 2
Pass exam 1      | 20          | 10
Not pass exam 1  | 5           | 50
The example used in the whole article, self-generated.

Statisticians have developed specific techniques to analyze this kind of data. The most important are:

Measures of Agreement

Percent Agreement

Calculated as the number of cases where the ratings fall in a certain class divided by the total number of ratings.

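Exam 1 \ Exam 2  | Pass exam 2 | Not pass exam 2 | Total
Pass exam 1      | 20          | 10              | 30
Not pass exam 1  | 5           | 50              | 55
Total            | 25          | 60              | 85
Adding totals to the example, self-generated.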
  • The percent agreement for passing exam 2 is 25/(25+60) = 0.294, so 29.4%
  • The percent agreement for passing exam 1 is 30/85 = 0.353, so 35.3%
  • The percent agreement for passing exam 1 and not passing exam 2 is 10/85 = 0.118, so 11.8%.
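The three percentages can be reproduced with a few lines of Python. This is a minimal sketch; the dictionary layout and the "pass"/"fail" labels are assumptions, and the cell counts (20, 10, 5 and 50) are the ones implied by the figures quoted in this article.

```python
# 2x2 table of counts, keyed by (exam 1 result, exam 2 result).
table = {
    ("pass", "pass"): 20,
    ("pass", "fail"): 10,
    ("fail", "pass"): 5,
    ("fail", "fail"): 50,
}
n = sum(table.values())  # 85 students in total

# Marginal counts: everyone who passed a given exam, whatever the other result.
pass_exam1 = sum(count for (e1, _), count in table.items() if e1 == "pass")
pass_exam2 = sum(count for (_, e2), count in table.items() if e2 == "pass")

print(f"pass exam 2: {pass_exam2 / n:.1%}")                            # 29.4%
print(f"pass exam 1: {pass_exam1 / n:.1%}")                            # 35.3%
print(f"pass exam 1, fail exam 2: {table[('pass', 'fail')] / n:.1%}")  # 11.8%
```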

The problem with percent agreement is that part of the observed agreement can occur purely by chance.

Cohen’s Kappa

[Table: the example used in the whole article, self-generated.]

To overcome the problem with percent agreement, we calculate Kappa as:

K = (P0 - Pe) / (1 - Pe)
Cohen’s Kappa formula, self-generated.

where P0 is the observed agreement and Pe the agreement expected by chance, calculated as:

P0 = (number of cases in agreement) / n
Pe = sum over classes of (row total x column total) / n²
P0 and Pe formulas, self-generated.

In our example:

  • P0 = 70/85 = 0.82

  • Pe = 30 x 25 / 85² + 55 x 60 / 85² = 0.56

  • K = 0.26 / 0.44 = 0.59 (0.60 without rounding the intermediate values)

Kappa ranges from -1 to 1: 0 means the observed agreement equals the agreement expected by chance, 1 means all cases are in agreement, and negative values, down to -1, mean less agreement than expected by chance.
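The Kappa calculation can be sketched in Python as follows. The function name `cohen_kappa` and the nested-list layout (rows for exam 1, columns for exam 2) are assumptions for illustration, not part of any library.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: first rating, columns: second)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: share of cases on the diagonal.
    p0 = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: product of matching marginals, summed over classes.
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p0 - pe) / (1 - pe)


counts = [[20, 10],   # passed exam 1: passed / did not pass exam 2
          [5, 50]]    # did not pass exam 1: passed / did not pass exam 2
kappa = cohen_kappa(counts)
print(round(kappa, 3))
```

Because the code keeps full precision in P0 and Pe, the result is slightly higher than the value obtained from the rounded figures 0.26 and 0.44.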

The Chi-Squared Distribution

To do hypothesis testing with categorical variables, we need specific distributions. The most common is the chi-squared distribution, a continuous theoretical probability distribution.

This distribution has only one parameter, k, the degrees of freedom. As k approaches infinity, the chi-squared distribution approaches the normal distribution.
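One way to see the effect of k is a small simulation. The sketch below relies on a standard fact that the article does not state: a chi-squared variable with k degrees of freedom is the sum of k squared standard normal variables. The function names, sample size and seed are arbitrary choices for illustration.

```python
import random
import statistics


def chi2_samples(k, n_samples=5000, seed=0):
    """Draw chi-squared(k) samples as sums of k squared standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
            for _ in range(n_samples)]


def skewness(xs):
    """Sample skewness; it is 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


summary = {}
for k in (1, 5, 50):
    xs = chi2_samples(k)
    summary[k] = (statistics.fmean(xs), skewness(xs))
    print(k, round(summary[k][0], 2), round(summary[k][1], 2))
```

The sample mean stays close to k, while the skewness shrinks as k grows, which is the sense in which the distribution becomes more symmetric and closer to a normal one.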

Chi-Squared Test

This test is used to check whether two categorical variables are independent. We will use the same example to explain how to calculate it:

First, we define the hypotheses that we want to test. In our case, we want to check whether passing exam 1 and passing exam 2 are independent, so:

  • H0 = Pass exam 1 and pass exam 2 are independent.
  • Ha = Pass exam 1 and pass exam 2 are dependent.

This test relies on the difference between expected and observed values. To calculate the expected values (what we would expect to find if both variables were independent), we use:

E = (row total x column total) / n
Expected values formula, self-generated.

To simplify the calculations, we first compute the marginals: the sums per row and per column that we already calculated in the second table of this post. The expected values are then:

  • Pass exam 1, pass exam 2: 30 x 25 / 85 = 8.82
  • Pass exam 1, not pass exam 2: 30 x 60 / 85 = 21.18
  • Not pass exam 1, pass exam 2: 55 x 25 / 85 = 16.18
  • Not pass exam 1, not pass exam 2: 55 x 60 / 85 = 38.82
Expected values calculation for our example, self-generated.
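The expected values follow directly from the marginals, as in this minimal Python sketch. The variable names and the nested-list layout (rows for exam 1, columns for exam 2) are assumptions for illustration.

```python
observed = [[20, 10],   # passed exam 1: passed / did not pass exam 2
            [5, 50]]    # did not pass exam 1: passed / did not pass exam 2

row_totals = [sum(row) for row in observed]        # [30, 55]
col_totals = [sum(col) for col in zip(*observed)]  # [25, 60]
n = sum(row_totals)                                # 85

# Under independence, each cell is expected to hold
# row total * column total / n cases.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
```

Note that the expected table keeps the same marginals as the observed one, so its cells also add up to 85.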

Now we have all we need to apply the chi-squared formula, where O is an observed value and E the corresponding expected value:

χ² = Σ (O - E)² / E
The chi-Squared formula, self-generated.

The sum symbol means that we have to evaluate the formula for every combination of our variables, 4 in our case, and add up the results:

  • Pass exam 1, pass exam 2: (20 - 8.82)² / 8.82 ≈ 14.16
  • Pass exam 1, not pass exam 2: (10 - 21.18)² / 21.18 ≈ 5.90
  • Not pass exam 1, pass exam 2: (5 - 16.18)² / 16.18 ≈ 7.72
  • Not pass exam 1, not pass exam 2: (50 - 38.82)² / 38.82 ≈ 3.22
Results for each sum of the formula, self-generated.

The final value is the sum of all four terms, approximately 31.0. To compare this result with the statistical tables we need the degrees of freedom, calculated as (num rows-1)*(num columns-1). In our case we have 1 degree of freedom.

According to a chi-squared table (easy to find online, and statistical packages for most languages expose it as a function), the critical value for 𝝰 = 0.05 with 1 degree of freedom is 3.841. Our result is much larger, so we reject the null hypothesis, which means that passing exam 1 and passing exam 2 are dependent.
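The whole test can be sketched in plain Python. The function names are assumptions for illustration. The p-value helper uses the identity P(X > x) = erfc((x/2)^(1/2)), which holds only for 1 degree of freedom; for larger tables a statistics package would be the usual choice.

```python
import math


def chi_squared_statistic(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected under independence
            stat += (obs - exp) ** 2 / exp
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def p_value_one_dof(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


observed = [[20, 10],
            [5, 50]]
stat, dof = chi_squared_statistic(observed)
p_value = p_value_one_dof(stat)
print(round(stat, 2), dof, p_value < 0.05)
```

Comparing the p-value with 𝝰 is equivalent to comparing the statistic with the critical value: `p_value_one_dof(3.841)` is approximately 0.05.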

Correlation statistics for categorical data

Because Pearson correlation requires variables measured on at least an interval level, we need different calculations for binary and ordinal variables. Let’s introduce them:

Binary Variables

Phi is a measure of the degree of association between two binary variables. Based on the table introduced in the Cohen’s Kappa section, with a and d the counts on the diagonal, b and c the off-diagonal counts, and n the total, it is calculated as:

Φ = (a x d - b x c) / ((a + b)(c + d)(a + c)(b + d))^(1/2)
Φ = (χ² / n)^(1/2)
Formulas to calculate the phi statistic, self-generated.

Using the second formula with the chi-squared statistic of our example, Φ = (χ² / 85)^(1/2) ≈ 0.6

Notice that the first formula can produce negative values, while the second can only produce positive ones. Here we don't care about the direction of the result, so we just analyze the absolute value.

If the data is split 50–50 on both variables, so that it is evenly distributed, phi can reach the value of 1; otherwise the potential maximum is lower. In our case, a value around 0.6 indicates a fairly strong relationship.
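Both ways of computing phi can be checked against each other in a few lines of Python. The labels a, b, c, d for the four cells are an assumption (a and d on the diagonal), and the chi-squared value is obtained here with the shortcut n(ad - bc)² / ((a+b)(c+d)(a+c)(b+d)), which is valid for 2x2 tables only.

```python
import math

# Cell counts of the 2x2 example: a and d are the diagonal cells.
a, b, c, d = 20, 10, 5, 50
n = a + b + c + d
marginal_product = (a + b) * (c + d) * (a + c) * (b + d)

# First form: keeps the sign, so it shows the direction of the association.
phi_signed = (a * d - b * c) / math.sqrt(marginal_product)

# Second form: derived from the chi-squared statistic, always non-negative.
chi2 = n * (a * d - b * c) ** 2 / marginal_product
phi_from_chi2 = math.sqrt(chi2 / n)

print(round(phi_signed, 2), round(phi_from_chi2, 2))
```

The two values agree in absolute value; swapping the columns of the table would flip the sign of the first one only.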

The Point-Biserial Correlation

This measure calculates the correlation between a dichotomous variable and a continuous variable. The formula is:

r_pb = ((x̄1 - x̄2) / s_x) x (p (1 - p))^(1/2)
Point biserial correlation formula, self-generated.

Where:

  • x̄1 = mean of the continuous variable for group 1

  • x̄2 = mean of the continuous variable for group 2

  • p = proportion of class 1 in the dichotomous variable

  • s_x = standard deviation of the continuous variable

To continue our example, suppose we obtain the following values when comparing the exam 1 variable with the number of hours studied:

  • x̄ pass = 5.5

  • x̄ not pass = 3.1

  • p = 20/25 = 0.8

  • s_x = 2

With these values, we obtain a result of 2.4 * 0.4 / 2 = 0.48, indicating that there’s some relation between our variables.
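The same arithmetic can be written as a small Python function working from the summary values. The function name `point_biserial` is an assumption for illustration, and the sketch takes s_x to be the standard deviation of the continuous variable over the whole sample.

```python
import math


def point_biserial(mean_1, mean_2, p, s_x):
    """Point-biserial correlation from group means, class proportion and overall SD."""
    return (mean_1 - mean_2) / s_x * math.sqrt(p * (1 - p))


# Supposed values from the text: mean hours studied for students who
# passed and did not pass, proportion of class 1, and standard deviation.
r_pb = point_biserial(mean_1=5.5, mean_2=3.1, p=0.8, s_x=2.0)
print(round(r_pb, 2))  # 0.48
```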

Ordinal Variables

The most widely used correlation coefficient for ordinal variables is Spearman’s rank-order coefficient, usually called Spearman’s r.

r_s = 1 - 6 Σ d_i² / (n (n² - 1))
Spearman’s r correlation coefficient for ordinal variables, self-generated.

where d_i is the difference between the ranks of the two variables for individual i, and n is the sample size.
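A minimal Python sketch of the rank-difference calculation follows. The function names and the sample data are made up for illustration, and this simple form of the formula assumes there are no tied values; ordinal data with many ties needs a tie-aware implementation from a statistics package.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_r(x, y):
    """Spearman's rank correlation via the sum of squared rank differences."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: hours studied and exam score for five students.
hours = [1.0, 2.5, 3.0, 4.5, 6.0]
score = [52, 48, 70, 65, 90]
r_s = spearman_r(hours, score)
print(round(r_s, 2))  # 0.8
```

Here the rank differences are -1, 1, -1, 1 and 0, so the sum of squares is 4 and the coefficient is 1 - 24/120 = 0.8.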

Summary

In data science, we are used to drawing scatter plots of binary, categorical, or ordinal variables, or using them as color groupings in other plots. But when we calculate correlations it’s easy to skip these variables, because the built-in functions in pandas (Python) or dplyr (R) don't handle them.

In this post, we showed how to analyze these variables' distribution and their correlation with all the other variables.


This is the tenth post of my #100daysofML challenge. I will be publishing my progress on GitHub, Twitter, and Medium (Adrià Serra).

https://twitter.com/CrunchyML


https://github.com/CrunchyPistacho/100DaysOfML


Translated from: https://medium.com/ai-in-plain-english/eda-on-categorical-and-ordinal-data-22f8a4407836
