

Business analytics and data science are a convergence of many fields of expertise. Professionals from multiple domains and educational backgrounds are joining the analytics industry in pursuit of becoming data scientists.

I have met two kinds of data scientists in my career. The first kind pays attention to the details of the algorithms and models. They always try to understand the mathematics and statistics behind the scenes, and they want full control over the solution and the theory behind it. The other kind is more interested in the end result than in the theoretical details. They are fascinated by implementing new and advanced models, and they are inclined towards solving the problem at hand rather than studying the theory behind the solution.

Advocates of both approaches have their own logic to support their stance. I respect their choices.

In this post, I shall share some statistical tests that are commonly used in data science. It will be good to know some of these irrespective of the approach you believe in.


In statistics, there are two ways of drawing an inference from any exploration. One is estimation of parameters, where unknown values of population parameters are computed through various methods. The other is hypothesis testing, which helps us test parameter values that were guessed from some prior knowledge.
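
To make the difference between the two concrete, here is a minimal sketch on simulated data. The sample, its true mean of 50, and all variable names are my own assumptions for illustration; the first part estimates the mean with a confidence interval, and the second part tests a guessed value of the mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated sample (assumption): 200 values with true mean 50 and standard deviation 5
sample = rng.normal(loc=50.0, scale=5.0, size=200)

# Estimation: point estimate and 95% confidence interval for the population mean
mean_hat = sample.mean()
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean_hat, scale=sem)

# Hypothesis testing: is the population mean equal to the guessed value 50?
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)

print(f"estimate = {mean_hat:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

Estimation answers "what is the value?", while the test answers "is this guessed value plausible given the data?".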

I shall list out some statistical test procedures which you will frequently encounter in data science.


“The only relevant test of the validity of a hypothesis is comparison of its predictions with experience.” — Milton Friedman


As a data scientist, do I really need to know hypothesis testing?

In most decision-making procedures in data science, we are knowingly or unknowingly using hypothesis testing. Here is some evidence in support of my statement.

As data scientists, the kind of data analysis we do can be divided into four broad areas:

1. Exploratory Data Analysis (EDA)
2. Regression and Classification
3. Forecasting
4. Data Grouping

Each of these areas includes some amount of statistical testing.


Exploratory Data Analysis (EDA)

EDA is an unavoidable part of data science, and every data scientist spends a significant amount of time on it. It establishes the foundation for creating machine learning and statistical models. Some common EDA tasks that involve statistical testing are:

1. Test for normality
2. Test for outliers
3. Test for correlation
4. Test of homogeneity
5. Test for equality of distribution


Each of these tasks involves testing of hypothesis at some point.


1. How to test for normality?


Normality is everywhere in statistics. Many of the methods we use in statistics rest on a normality assumption. Normality means the data should follow a particular kind of probability distribution, the normal distribution. It has a particular shape and is represented by a particular function.

In Analysis of Variance (ANOVA), we assume normality of the data. In regression, we expect the residuals to follow a normal distribution.

To check the normality of data we can use the Shapiro-Wilk test. The null hypothesis for this test is that the distribution of the data sample is normal.

Python implementation:


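import numpy as np
from scipy import stats

# Draw 100 values from a normal distribution with mean 2.5 and standard deviation 2
data = stats.norm.rvs(loc=2.5, scale=2, size=100)
# The result holds the W statistic and the p-value
shapiro_test = stats.shapiro(data)
print(shapiro_test)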
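
To see how the p-value is read, here is a small sketch that runs the same test on a normal sample and on a clearly skewed one. The two simulated samples, the 5% significance level, and the variable names are my assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two simulated samples (assumption): one normal, one strongly right-skewed
normal_data = rng.normal(loc=2.5, scale=2.0, size=200)
skewed_data = rng.exponential(scale=2.0, size=200)

alpha = 0.05  # significance level chosen for this example
for name, sample in [("normal", normal_data), ("exponential", skewed_data)]:
    stat, p = stats.shapiro(sample)
    # A small p-value is evidence against the null hypothesis of normality
    decision = "reject normality" if p < alpha else "cannot reject normality"
    print(f"{name}: W = {stat:.3f}, p = {p:.4f} -> {decision}")
```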

2. How to test whether a data point is an outlier?


Whenever I start a new data science use case where I have to fit a model, one of my routine tasks is detecting outliers in the response variable. Outliers affect regression models greatly, so they call for a careful elimination or substitution strategy.

An outlier is a global outlier if its value deviates significantly from the rest of the data. It is a contextual outlier if it deviates only from the data points that originate from a particular context. A set of data points can also be a collective outlier when, together, they deviate considerably from the rest.

The Tietjen-Moore test is useful for detecting multiple outliers in a data set. The null hypothesis for this test is that there are no outliers in the data.

Python implementation:


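import numpy as np
import scikit_posthocs

x = np.array([-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05, 0.06, 0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01])
# k=2: test for two outliers; the array comes back with them removed if the test finds them
print(scikit_posthocs.outliers_tietjen(x, 2))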

3. How to test the significance of correlation coefficient between two variables?


In data science, we deal with a number of independent variables that explain the behavior of the dependent variable. Significant correlation between the independent variables may affect the estimated coefficients of the variables. It makes the standard errors of the regression coefficients unreliable, which hurts the interpretability of the regression.

When we calculate the correlation between two variables, we should check the significance of the correlation. It can be checked with a t-test. The null hypothesis of this test is that the true correlation between the variables is zero.

Python implementation:


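from scipy import stats
from scipy.stats import pearsonr

data1 = stats.norm.rvs(loc=3, scale=1.5, size=20)
data2 = stats.norm.rvs(loc=-5, scale=0.5, size=20)
# stat is the correlation coefficient; p is the p-value of its significance test
stat, p = pearsonr(data1, data2)
print(stat, p)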
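
The t-test behind that p-value can also be written out by hand. The sketch below uses the standard statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom and compares the result with `pearsonr`. The simulated data and variable names are my assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated pair of variables (assumption): y depends partly on x
x = rng.normal(size=30)
y = 0.6 * x + rng.normal(scale=0.8, size=30)

r, p_scipy = stats.pearsonr(x, y)

# Manual t-test for H0: the true correlation is zero
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value

print(f"r = {r:.3f}, t = {t_stat:.3f}")
print(f"p (manual) = {p_manual:.6f}, p (pearsonr) = {p_scipy:.6f}")
```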

4. How to test the homogeneity of a categorical variable in two data sets?


An example makes the test of homogeneity easier to explain. Suppose we want to check whether the viewing preferences of Netflix subscribers are the same for males and females. We can use the chi-square test of homogeneity for this. It checks whether the frequency distributions for males and females differ significantly from each other.

The null hypothesis of the test is that the two data sets are homogeneous.


Python implementation:


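import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts for illustration: rows are males and females, columns are three genres
observed = np.array([[45, 30, 25],
                     [35, 40, 25]])
# The test works on category counts, not on raw continuous values
stat, p, dof, expected = chi2_contingency(observed)
print(stat, p)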

5. How to check whether a given data set follows a particular distribution?


Sometimes in data analysis we need to check whether the data follows a particular distribution. We may even want to check whether two samples follow the same distribution. In such cases we use the Kolmogorov-Smirnov (KS) test. We often use the KS test to check the goodness of fit of a regression model.

This test compares the empirical cumulative distribution function (ECDF) of the data with the theoretical distribution function. The null hypothesis for this test is that the given data follows the specified distribution.

Python implementation:


import numpy as np
from scipy import stats

x = np.linspace(-25, 17, 6)
# Compare the sample against the standard normal distribution
print(stats.kstest(x, 'norm'))
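
The two-sample case mentioned above can be sketched with `scipy.stats.ks_2samp`. The three simulated samples and their names are my assumptions for illustration: two come from the same distribution and the third has a shifted mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated samples (assumption)
sample_a = rng.normal(loc=0.0, scale=1.0, size=300)
sample_b = rng.normal(loc=0.0, scale=1.0, size=300)  # same distribution as sample_a
sample_c = rng.normal(loc=1.0, scale=1.0, size=300)  # mean shifted by one unit

# H0 in each case: both samples come from the same distribution
same = stats.ks_2samp(sample_a, sample_b)
shifted = stats.ks_2samp(sample_a, sample_c)

print(f"a vs b: D = {same.statistic:.3f}, p = {same.pvalue:.4f}")
print(f"a vs c: D = {shifted.statistic:.3f}, p = {shifted.pvalue:.4g}")
```

The statistic D is the largest vertical gap between the two ECDFs, so a shift in the mean shows up as a larger D and a smaller p-value.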

Regression and Classification

Most of the modeling we do in data science falls under either regression or classification. Whenever we predict some value or some class, we rely on these two methods.

Both regression and classification involve statistical tests at different stages of decision making. The data also needs to satisfy some prerequisite conditions to be eligible for these tasks, and some tests have to be performed to check those conditions.

Some common statistical tests associated with regression and classification are:

1. Test for heteroscedasticity
2. Test for multicollinearity
3. Test of the significance of regression coefficients
4. ANOVA for regression or classification model


1. How to test for heteroscedasticity?


Heteroscedasticity is quite a heavy term. It simply means unequal variance. Let me explain it with an example. Suppose you are collecting income data from different cities. You will see that the variation of income differs significantly across cities.

If the data is heteroscedastic, the estimation of the regression coefficients suffers. The coefficient estimates become less precise, so they can lie farther from the actual values.

White's test can be used to test for heteroscedasticity in the data. Its null hypothesis is that the variance is constant over the data.

Python implementation:


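import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from statsmodels.stats.diagnostic import het_white
# Made-up data for illustration: the noise in y_var grows with x_var
rng = np.random.default_rng(0)
df = pd.DataFrame({'x_var': rng.uniform(1, 10, 200)})
df['y_var'] = 2 * df['x_var'] + rng.normal(scale=df['x_var'])
y, X = dmatrices('y_var ~ x_var', df, return_type='dataframe')
olsr_results = sm.OLS(y, X).fit()
keys = ['LM stat', 'LM test p-value', 'F-stat', 'F-test p-value']
results = het_white(olsr_results.resid, X)
print(list(zip(keys, results)))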

2. How to test for multicollinearity in the variables?


Data science problems often include multiple explanatory variables. Sometimes these variables are correlated because of their origin and nature. Sometimes we also create more than one variable from the same underlying fact. In these cases the variables become highly correlated. This is called multicollinearity.

The presence of multicollinearity increases the standard errors of the coefficients of the regression or classification model. It can make some important variables appear insignificant in the model.

The Farrar-Glauber test can be used to check for the presence of multicollinearity in the data.
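
I am not aware of a ready-made function for this test in the libraries used in this post, so here is a minimal sketch of only its first stage, the chi-square test for overall multicollinearity. It assumes the commonly stated form of the statistic, chi2 = -[n - 1 - (2k + 5) / 6] * ln|R| with k(k - 1) / 2 degrees of freedom, where R is the correlation matrix of the k explanatory variables. The function name `farrar_glauber_chi2` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def farrar_glauber_chi2(X):
    """Chi-square stage of the Farrar-Glauber test (sketch).

    H0: the explanatory variables are orthogonal (no multicollinearity).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    corr = np.corrcoef(X, rowvar=False)
    # Log-determinant of the correlation matrix; it is 0 when variables are uncorrelated
    _, logdet = np.linalg.slogdet(corr)
    chi2 = -(n - 1 - (2 * k + 5) / 6.0) * logdet
    dof = k * (k - 1) // 2
    p_value = stats.chi2.sf(chi2, dof)
    return chi2, dof, p_value

rng = np.random.default_rng(3)
# Simulated predictors (assumption): x2 is nearly a copy of x1, x3 is unrelated
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

chi2, dof, p = farrar_glauber_chi2(np.column_stack([x1, x2, x3]))
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```

A small p-value here says only that multicollinearity is present somewhere among the variables; the later stages of the procedure are needed to locate it.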


3. How to test if the model coefficients are significant?


In classification or regression models we need to identify the important variables that have a strong influence on the target variable. The models perform some tests and tell us how significant each variable is.

A t-test is used in these models to check the significance of the variables. The null hypothesis of the test is that the coefficient is zero. You need to check the p-values of the tests to understand the significance of the coefficients.

Python implementation:


from scipy import stats

rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
# One-sample t-test of the null hypothesis that the population mean is 7
print(stats.ttest_1samp(rvs1, 7))

4. How to test statistical significance of a model?


While developing a regression or classification model, we perform Analysis of Variance (ANOVA) to check the validity of the regression coefficients. ANOVA compares the variation due to the model with the variation due to error. If the variation due to the model is significantly larger than the variation due to error, the effect of the variables is significant.

An F-test is used to make the decision. The null hypothesis in this test is that all the regression coefficients, apart from the intercept, are equal to zero.

Python implementation:


import scipy.stats as stats

data1 = stats.norm.rvs(loc=3, scale=1.5, size=20)
data2 = stats.norm.rvs(loc=-5, scale=0.5, size=20)
# One-way ANOVA F-test comparing the means of the two samples
print(stats.f_oneway(data1, data2))
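
The comparison of model variation with error variation can be written out directly for a regression. This is a sketch on simulated data (my assumption, along with all variable names), computing F = (SS_model / k) / (SS_error / (n - k - 1)) for a least squares fit with k predictors and an intercept.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, k = 150, 2
# Simulated data (assumption): y depends on both predictors
X_raw = rng.normal(size=(n, k))
y = 0.5 + 1.5 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + rng.normal(size=n)

# Least squares fit with an intercept column
X = np.column_stack([np.ones(n), X_raw])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

ss_model = np.sum((fitted - y.mean()) ** 2)  # variation explained by the model
ss_error = np.sum((y - fitted) ** 2)         # variation left in the residuals

# H0: all slope coefficients are zero
f_stat = (ss_model / k) / (ss_error / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```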

Forecasting

In data science we deal with two kinds of data: cross-sectional and time series. The profiles of a set of customers on an e-commerce website are cross-sectional data. The daily sales of an item on that website over a year are time series data.

We often use forecasting models on time series data to estimate future sales or profits. Before forecasting, however, we run some diagnostic checks on the data to understand its pattern and its fitness for forecasting.

As a data scientist I frequently use these tests on time series data:


1. Test for trend
2. Test for stationarity
3. Test for autocorrelation
4. Test for causality
5. Test for temporal relationship


1. How to test for trend in time series data?


Business data generated over time often shows an upward or downward trend. Be it sales, profit, or any other metric that depicts business performance, we always want to estimate its future movements.

To forecast such movements, you need to estimate or eliminate the trend component. To understand whether the trend is significant, you can use a statistical test.

The Mann-Kendall test can be used to test for the existence of a trend. The null hypothesis is that there is no significant trend.

Python implementation:


# Install first with: pip install pymannkendall
import numpy as np
import pymannkendall as mk

data = np.random.rand(250, 1)  # random data, so no trend is expected
test_result = mk.original_test(data)
print(test_result)
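
To show what the test computes, here is a minimal sketch of the basic Mann-Kendall statistic written with NumPy. It assumes a series without tied values (the variance formula below has no tie correction) and uses the usual normal approximation. The function name `mann_kendall` and the simulated series are hypothetical.

```python
import numpy as np
from scipy import stats

def mann_kendall(series):
    """Basic Mann-Kendall trend test without tie correction (sketch)."""
    x = np.asarray(series, dtype=float).ravel()
    n = len(x)
    # diffs[i, j] = x[j] - x[i]; keep only pairs with i < j
    diffs = x[None, :] - x[:, None]
    s = np.sign(diffs[np.triu_indices(n, k=1)]).sum()
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity-corrected z-score
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided
    return s, z, p_value

rng = np.random.default_rng(2)
# Simulated series (assumption): upward trend plus noise
trend_series = 0.05 * np.arange(100) + rng.normal(size=100)
s, z, p = mann_kendall(trend_series)
print(f"S = {s:.0f}, z = {z:.2f}, p = {p:.4g}")
```

S counts the pairs where a later value is larger than an earlier one, minus the pairs where it is smaller, so a strong upward trend gives a large positive S.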

2. How to test whether a time series data is stationary?


Non-stationarity is an inherent characteristic of most time series data. We always need to test for stationarity before any time series modeling. If the data is non-stationary, modeling may produce unreliable and spurious results and lead to a poor understanding of the data.

The Augmented Dickey-Fuller (ADF) test can be used to check for non-stationarity. The null hypothesis of the ADF test is that the series is non-stationary. At the 5% level of significance, we reject the null hypothesis if the p-value is less than 0.05.

Python implementation:


from statsmodels.tsa.stattools import adfuller

X = [15, 20, 21, 20, 21, 30, 33, 45, 56]
result = adfuller(X)
print('ADF statistic:', result[0])
print('p-value:', result[1])

3. How to check autocorrelation among the values of a time series?


For time series data, a relationship between past and present values is a common phenomenon. In financial time series we often see that the current price is influenced by the prices of the last few days. This feature of time series data is measured by autocorrelation.

To know whether the autocorrelation is strong enough, you can test for it. The Durbin-Watson test reveals its extent. The null hypothesis for this test is that there is no first-order autocorrelation between the values.

Python implementation:


from statsmodels.stats.stattools import durbin_watson

X = [15, 20, 21, 20, 21, 30, 33, 45, 56]
# The statistic lies between 0 and 4; values near 2 suggest no autocorrelation
result = durbin_watson(X)
print(result)

4. How can you test whether one variable has a causal effect on another?


Two time series variables can share a causal relationship. If you are familiar with financial derivatives, which are financial instruments defined on underlying stocks, you will know that spot and futures prices have causal relationships. They influence each other according to the situation.

The causality between two variables can be tested with the Granger causality test. This test uses a regression setup: the current value of one variable is regressed on lagged values of the other variable along with its own lagged values. The null hypothesis of no causality is tested with an F-test.

Python implementation:


import statsmodels.api as sm
from statsmodels.tsa.stattools import grangercausalitytests

data = sm.datasets.macrodata.load_pandas()
# Growth rates of real GDP and real consumption
data = data.data[["realgdp", "realcons"]].pct_change().dropna()
# Tests whether the second column (realcons) Granger-causes the first (realgdp), for lags 1 to 4
gc_res = grangercausalitytests(data, 4)

5. How can you check the temporal relationship between two variables?


Two time series sometimes move together over time. In financial time series you will often observe that the spot and futures prices of derivatives move together.

These co-movements can be checked through a characteristic called cointegration. Cointegration can be tested with Johansen's test. The null hypothesis of this test is that there is no cointegration between the variables.

Python implementation:


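import statsmodels.api as sm
from statsmodels.tsa.vector_ar.vecm import coint_johansen

data = sm.datasets.macrodata.load_pandas()
data = data.data[["realgdp", "realcons"]].pct_change().dropna()
# det_order=0 includes a constant term; k_ar_diff=1 uses one lagged difference
jres = coint_johansen(data, det_order=0, k_ar_diff=1)
print(jres.lr1)  # trace statistics
print(jres.cvt)  # critical values (90%, 95%, 99%) for the trace statistics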

Data Grouping

In real-life scenarios we often try to find similarity among the data points. The intention is to group them into buckets and study them closely to understand how the different buckets behave.

The same applies to variables as well. We identify latent variables that are formed by combining a number of observable variables.

A retail store might want to form segments among its customers, such as cost-conscious, brand-conscious, and bulk-purchaser. This requires grouping the customers by characteristics such as transactions, demographics, and psychographics.


In this area we often encounter the following tests:


1. Test of sphericity
2. Test for sampling adequacy
3. Test for clustering tendency


1. How to test for Sphericity of the variables?


If the number of variables in the data is very high, regression models tend to perform badly. Besides, identifying the important variables becomes challenging. In this scenario, we try to reduce the number of variables.

Principal Component Analysis (PCA) is one method of reducing the number of variables and identifying major factors. These factors help you build a regression model with reduced dimension. They also help to identify key features of any object or incident of interest.

Variables can form factors only when they share some amount of correlation. This is tested with Bartlett's test of sphericity. The null hypothesis of this test is that the variables are uncorrelated.

Python implementation:


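import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99]
b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05]
c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98]
df = pd.DataFrame({'x': a, 'y': b, 'z': c})
# Test of sphericity on the correlation matrix (scipy.stats.bartlett tests equal variances instead)
stat, p = calculate_bartlett_sphericity(df)
print(stat, p)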

2. How to test for sampling adequacy of variables?


The PCA method produces a reliable result when the sample size is large enough. This is called sampling adequacy. It should be checked for each variable.

The Kaiser-Meyer-Olkin (KMO) test is used to check sampling adequacy for the overall data set. The statistic measures the proportion of variance among the variables that could be common variance.

Python implementation:


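import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo

a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99]
b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05]
c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98]
df = pd.DataFrame({'x': a, 'y': b, 'z': c})
# Returns the KMO score for each variable and the overall KMO score
kmo_per_variable, kmo_overall = calculate_kmo(df)
print(kmo_per_variable, kmo_overall)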

3. How to test for clustering tendency of a data set?


To group the data into different buckets, we use clustering techniques. But before clustering, you need to check whether the data has any clustering tendency. If the data is uniformly distributed, it is not suitable for clustering.

The Hopkins test checks the spatial randomness of the data. The null hypothesis in this test is that the data is generated from a uniform random distribution, that is, it contains no meaningful clusters.

Python implementation:


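from sklearn import datasets
from pyclustertend import hopkins
from sklearn.preprocessing import scale

X = scale(datasets.load_iris().data)
# The second argument is the sampling size; 150 is the number of rows in the iris data
print(hopkins(X, 150))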

In this article, I mentioned some frequently used tests in data science. There are a lot of others which I could not mention. Let me know if you find some which I haven’t mentioned here.








