The Chi-Squared Test Statistic Is a Must for Every Data Scientist: A Case Study in Customer Churn


Introduction

The chi-square statistic is a useful tool for understanding the relationship between two categorical variables.

For the sake of example, let's say you work for a tech company that has rolled out a new product and you want to assess the relationship between this product and customer churn. In the age of data, many companies, tech or otherwise, run the risk of taking anecdotal evidence, or perhaps a high-level visualization, as proof of a given relationship. The chi-square statistic gives us a way to quantify and assess the strength of the relationship between a given pair of categorical variables.

Customer Churn

Let's explore chi-square through the lens of customer churn.

You can download the customer churn dataset that we'll be working with from Kaggle. This dataset provides details for a variety of telecom customers and whether or not they "churned", i.e. closed their account.
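Assuming you've downloaded the CSV locally, a minimal setup might look like the following sketch. The file name is an assumption based on what the Telco churn dataset typically ships with on Kaggle; adjust the path to match your download.

library(tidyverse)  # dplyr, ggplot2, and readr
library(infer)      # used later for permutation-based hypothesis testing

# File name assumed from the Kaggle Telco Customer Churn download
churn <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")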

Regardless of the company, team, product, or industry you work with, the following example should generalize well.

Now that we have our dataset, let's quickly use dplyr's select command to pull out the fields we'll be working with. For simplicity's sake, I'll also be dropping the number of levels down to two. You can certainly run a chi-square test on categorical variables with more than two levels, but as we venture to understand it from the ground up, we'll keep it simple.

churn <- churn %>%
  select(customerID, StreamingTV, Churn) %>%
  mutate(StreamingTV = ifelse(StreamingTV == 'Yes', 1, 0))

Churn is going to be classified as a Yes or a No. As you just saw, StreamingTV will be encoded with either a 1 or 0.

Exploratory Data Analysis

I won't go into great depth on exploratory data analysis here, but I will give you two quick tools for assessing the relationship between two categorical variables.

Proportion Tables

Proportion tables are a great way to establish some fundamental understanding of the relationship between two categorical variables.

table(churn$StreamingTV)
table(churn$Churn)

round(prop.table(table(churn$StreamingTV)), 2)
round(prop.table(table(churn$Churn)), 2)

table() gives us a quick idea of the counts at each level; wrapping that in prop.table() lets us see the percentage breakdown.

Let's now pass both variables to our table() function.

table(churn$StreamingTV, churn$Churn)
round(prop.table(table(churn$StreamingTV, churn$Churn), 1), 2)

Once you pass another variable into the proportion table, you're able to establish where you want to assess relative proportion. In this case, the second parameter we pass to the prop.table() function, 1, specifies that we'd like to see the relative proportion of records across each row, i.e. each value of StreamingTV. As you can see in the above table, when a customer did not have streaming TV, they remained active 76% of the time; conversely, if they did have streaming TV, they stuck around less often, at 70%.

Now, before we get ahead of ourselves and declare that having streaming TV is most certainly causing more people to churn, we need to assess whether we really have grounds to make such a claim. Yes, the proportion of retained customers is lower, but the difference could be random noise. More on this shortly.

Time to Visualize

This will give us similar information to what we just saw, but visualization tends to lend itself better to quickly understanding relative values.

Let’s start off with a quick bar plot with StreamingTV across the x-axis, and the fill as Churn.

churn %>%
  ggplot(aes(x = StreamingTV, fill = Churn)) +
  geom_bar()

As you can see, nearly as many TV streamers churned, despite a substantially lower total customer count. Similar to what we saw with proportion tables, a 100% stacked bar chart helps assess the relative distribution of churn across the values of a categorical variable. All we have to do is pass position = 'fill' to geom_bar().

churn %>%
  ggplot(aes(x = StreamingTV, fill = Churn)) +
  geom_bar(position = 'fill')

Diving into the Chi-square Statistic

Now, there appears to be some sort of relationship between the two variables, yet we don't have an assessment of statistical significance. In other words, is there something about the relationship between streaming TV and these customers, i.e. did they hate the service so much that they churned at a higher rate? Does their overall bill look way too high as a result of the streaming plan, such that they churn altogether?

All great questions, and we won't have the answers to them just yet, but what we are doing is taking the first steps toward assessing whether this larger investigative journey is worthwhile.

Chi-square Explanation

Before we dive into the depths of creating a chi-square statistic, it’s very important that you understand the purpose conceptually.

We can see two categorical variables that appear to be related; however, we don't definitively know whether the disparate proportions are a product of randomness or of some underlying effect. This is where chi-square comes in. The chi-square test statistic is effectively a comparison of our observed distribution to the distribution we would expect if the two variables were indeed perfectly independent.
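To make "the distribution we would expect" concrete, here is a quick base R sketch (not part of the original walkthrough): under independence, the expected count in each cell is that cell's row total times its column total, divided by the grand total.

# Observed contingency table
obs_counts <- table(churn$StreamingTV, churn$Churn)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs_counts), colSums(obs_counts)) / sum(obs_counts)
expected

# Base R exposes the same matrix via chisq.test(obs_counts)$expected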

So first things first, we need a dataset to represent said independence.

Generating Our Sample Dataset

We will be making use of the infer package. This package is incredibly useful for creating sample data for hypothesis testing, creating confidence intervals, etc.

I won't break down all of the details of how to use infer, but at a high level, you're creating a new dataset. In this case, we want to create a dataset that looks a lot like the churn dataset we just saw, only this time we want to ensure an independent distribution, i.e. when customers are TV streamers, we shouldn't see a greater occurrence of churn.

An easy way to think about infer is in terms of the steps specify, hypothesize, and generate. We specify the relationship we're modeling, we declare the hypothesized distribution (independence), and finally we specify the number of replicates we want to generate. A replicate in this case will mirror the row count of our original dataset. There are instances in which you would create many replicates of the same dataset and run calculations on top of them, but not for this part of the process.

churn_perm <- churn %>%
  specify(Churn ~ StreamingTV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1, type = "permute")

Let's quickly take a look at this dataset.

head(churn_perm)

As you can see, we have the two variables we specified, as well as a replicate column. All records in this table will have replicate equal to 1, as we only generated a single replicate.

Sample Summaries

Let's quickly plot our independent dataset to see the relative proportions.

churn_perm %>%
  ggplot(aes(x = StreamingTV, fill = Churn)) +
  geom_bar(position = 'fill')

As desired, you can see that the relative proportions line up almost exactly. There is some randomness at play, so the two may not line up perfectly… but that's really the point. We're not doing this quite yet, but remember when I mentioned the idea of creating many replicates?

What might the purpose of that be?

If we create this sample dataset tons of times, do we ever see a gap as wide as the 70% vs. 76% retention we saw in our observed dataset? If so, how often do we see it? Is it so often that we don't have grounds to chalk the difference up to anything more than random noise?

Alright, enough of that rant… on to assessing how much our observed data varies from our sample data.

Let's Get Calculating

Now that we really understand our purpose, let's go ahead and calculate our statistic. Simply enough, our intent is to calculate the distance between each cell of our table of observed counts and the corresponding cell of our table of sample counts.


The formula for said "distance" looks like this (a short R sketch of the calculation follows the steps below):

sum(((obs - sample)^2)/sample)

  1. We take the difference between the observed and sample counts,
  2. square the differences so that they don't cancel each other out,
  3. divide by the sample counts to prevent any single cell from having too great an influence due to its size,
  4. and finally take the sum.
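One way to carry out this calculation in R, assuming the churn and churn_perm data frames created above, is to compare the observed table to the single permuted replicate cell by cell. Since the comparison counts come from a random permutation, the exact value will wobble slightly from run to run; obs_chi_sq is reused later for the plot and the p-value.

# Observed counts vs. counts from the single permuted (independent) replicate
obs_counts  <- table(churn$StreamingTV, churn$Churn)
perm_counts <- table(churn_perm$StreamingTV, churn_perm$Churn)

# Squared, scaled cell-by-cell differences, summed into a single statistic
obs_chi_sq <- sum((obs_counts - perm_counts)^2 / perm_counts)
obs_chi_sq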

The chi-square statistic that we get is: 20.1

So, great. We understand the purpose of the chi-square statistic, we even have it… but what we still don’t know is… is a chi-square stat of 20.1 meaningful?

Hypothesis Testing

Earlier in the post, we spoke about how we can use the infer package to create many, many replicates. A hypothesis test is precisely the time for that type of sampling.

Let's use infer again; this time we'll generate 500 replicates and calculate a chi-square statistic for each replicate.

churn_null <- churn %>%
  specify(Churn ~ StreamingTV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 500, type = "permute") %>%
  calculate(stat = "Chisq")

churn_null

Based on the above output, you can see that each replicate has its own stat.

Let’s use a density plot to see what our distribution of chi-square statistics looks like.

churn_null %>%
  ggplot(aes(x = stat)) +
  # Add density layer
  geom_density()

At first glance, we can see that the distribution of chi-square statistics is heavily right-skewed. We can also see that our observed statistic of 20.1 is not even on the plot.

Let’s add a vertical line to show how our observed chi-square compares to the permuted distribution.

churn_null %>%
  ggplot(aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = obs_chi_sq, color = "red")

When it comes to having sufficient evidence to reject the null hypothesis, this is promising. The null hypothesis here is that there is no relationship between the two variables.

Calculating the P-value

As the final portion of this lesson on how to use chi-square statistics, let's talk about how we should go about calculating the p-value.

Earlier I mentioned the idea that we might want to know if our simulated chi-square stat was ever as large as our observed chi-square stat, and if so how often it might have occurred.

That is the essence of the p-value.

When taking the chi-square stats of two variables that we know are independent of one another (the simulated case), what percentage of these replicates' chi-square stats are greater than or equal to our observed chi-square stat?

# Proportion of permuted statistics at least as large as the observed one
churn_null %>%
  summarise(p_value = mean(stat >= obs_chi_sq))

In the case of our sample, we get a p-value of 0. That is to say, over the course of 500 replicates, we never saw a simulated chi-square stat reach 20.1.

As such, we would reject the null hypothesis that churn and streaming TV are independent.
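As a quick sanity check (not part of the original walkthrough), base R's chisq.test() computes the analogous statistic analytically from the expected counts and reaches the same conclusion:

# Pearson's chi-square test on the observed contingency table
chisq.test(table(churn$StreamingTV, churn$Churn))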

Conclusion

We have done a lot in a short amount of time. It's easy to get lost when dissecting statistics concepts like the chi-square statistic. My hope is that a strong foundational understanding of the need for this statistic, and of the corresponding calculation, gives you the right instinct for recognizing opportunities to put this tool to work.

In just a few minutes, we have covered:

  • A bit of EDA for pairs of categorical variables
  • Proportion tables
  • Bar Charts
  • 100% Stacked Bar
  • Chi-square explanation & purpose
  • How to calculate a chi-square statistic
  • Hypothesis testing with infer
  • Calculating p-value

If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!

Translated from: https://towardsdatascience.com/the-chi-squared-test-statistic-is-a-must-for-every-data-scientist-a-case-study-in-customer-churn-bcdb17bbafb7
