Pearson Correlation Coefficient

STATISTICS FOR DATA SCIENCE AND MACHINE LEARNING

In the last post, we analyzed the relationships between pairs of categorical variables and between categorical and continuous variables. In this case, we will analyze the relation between two ratio-level, or continuous, variables.

Pearson’s correlation, sometimes just called correlation, is the most widely used metric for this purpose: it searches the data for a linear relationship between two variables.

Analyzing the correlations is one of the first steps to take in any statistics, data analysis, or machine learning process; it allows data scientists to detect patterns early and anticipate possible outcomes of the machine learning algorithms, so it guides us toward choosing better models.

Correlation is a measure of the relation between variables, but it cannot prove causality between them.

Some examples of random correlations that exist in the world are found on this website.

In the case of the last graph, it’s clearly not true that one of these variables implies the other one, even with a correlation of 99.79%.

Scatterplots

To take a first look at our dataset, a good way to start is to plot pairs of continuous variables, one on each coordinate. Each point on the graph corresponds to a row of the dataset.

Scatterplots give us a sense of the overall relationship between two variables:

• Direction: is the relation positive or negative? When one variable increases, does the second one increase or decrease?

• Strength: how much does one variable increase when the second one increases?

• Shape: is the relation linear, quadratic, exponential…?

Using scatterplots is also a fast technique for detecting outliers: if a value is widely separated from the rest, checking the values for that individual will be useful.

We will go with the most used dataset in machine learning studies, Iris, which contains information about iris plant flowers; the objective is to classify the flowers into three groups: setosa, versicolor, and virginica.

To deliver the best approach to this classification problem, we want to analyze all the variables that we have available and their relations.
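As a minimal sketch (assuming scikit-learn and pandas are installed), the dataset can be loaded into a DataFrame like this:

```python
# Load the Iris dataset into a pandas DataFrame (assumes scikit-learn is installed).
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # 150 rows: four continuous measurements plus the target class

print(df.shape)          # (150, 5)
print(iris.target_names)
```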

In the last plot we have the petal length and width variables, with the distinct classes of iris separated by color. What we can extract from this plot is:

• There’s a positive linear relationship between both variables.

• Petal length increases approximately 3 times faster than petal width.

• Using these two variables, the groups are visually differentiable.
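A plot like the one described can be sketched as follows (assuming matplotlib and scikit-learn are installed; the column names are those of scikit-learn’s Iris frame):

```python
# Scatterplot of petal length vs. petal width, colored by iris class
# (assumes matplotlib and scikit-learn are installed).
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame

fig, ax = plt.subplots()
for cls, group in df.groupby("target"):
    ax.scatter(group["petal length (cm)"], group["petal width (cm)"],
               label=iris.target_names[cls])
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.legend()
fig.savefig("iris_petal_scatter.png")
```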

Scatter Plot Matrix

To plot all relations at the same time and on the same graph, the best approach is to deliver a pair plot: a matrix of all variables containing all the possible scatterplots.

As you can see, the plot of the last section is in the last row and third column of this matrix.

In this matrix, the diagonal can show distinct plots; in this case, we used the distributions of each one of the iris classes.

Being a matrix, we have two plots for each combination of variables: on the other side of the diagonal there is always a plot of the same pair with the axes swapped, at the inverse (column, row) position.

Using this matrix we can obtain all the information about all the continuous variables in the dataset easily.
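One way to produce such a matrix is pandas’ built-in `scatter_matrix` (a sketch assuming pandas, matplotlib, and scikit-learn are installed; seaborn’s `pairplot` is a common alternative):

```python
# Pair plot of the four continuous Iris variables
# (assumes matplotlib, pandas, and scikit-learn are installed).
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.drop(columns="target")
axes = scatter_matrix(df, diagonal="kde", figsize=(8, 8))
print(axes.shape)  # (4, 4): one subplot per (row, column) pair of variables
```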

Pearson Correlation Coefficient

Scatter plots are an important tool for analyzing relations, but we need to check whether the relation between variables is significant. To measure the linear correlation between variables we can use Pearson’s r, or the Pearson correlation coefficient.

The range of the possible results of this coefficient is [-1, 1], where:

• 0 indicates no correlation.

• 1 indicates a perfect positive correlation.

• -1 indicates a perfect negative correlation.

To calculate this statistic we use the following formula, where x̄ and ȳ are the sample means:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
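The definition can be checked numerically; a small sketch (assuming numpy and scipy are installed, with made-up data) comparing the hand computation against `scipy.stats.pearsonr`:

```python
# Pearson's r computed from its definition, verified against scipy.stats.pearsonr
# (assumes numpy and scipy are installed; the data here is illustrative).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))**2) * sum((y - mean(y))**2))
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

r_scipy, p_value = stats.pearsonr(x, y)
assert abs(r_manual - r_scipy) < 1e-9  # both computations agree
```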

Test Significance of the Correlation Coefficient

We need to check whether the correlation is significant for our data. As we already talked about hypothesis testing, in this case:

• H0: the variables are unrelated, r = 0

• Ha: the variables are related, r ≠ 0

This statistic has a Student’s t distribution with (n − 2) degrees of freedom, where n is the number of values.

The formula for the t value is the following, and we need to compare the result with the Student’s t table:

t = r · √(n − 2) / √(1 − r²)

If our result is bigger than the table value, we reject the null hypothesis and say that the variables are related.
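The test above can be sketched as follows (assuming numpy and scipy are installed; r and n here are illustrative values, not taken from the Iris data):

```python
# Significance test for a correlation coefficient: t = r*sqrt(n-2)/sqrt(1-r^2),
# compared with the critical value of the Student's t distribution with
# n - 2 degrees of freedom (assumes numpy and scipy are installed).
import numpy as np
from scipy import stats

r, n, alpha = 0.8, 30, 0.05  # illustrative values
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # two-tailed critical value

if abs(t_stat) > t_crit:
    print("reject H0: the correlation is significant")
```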

Coefficient of Determination

To calculate how much the variation of one variable can affect the variation of the other one, we can use the coefficient of determination, calculated as the square of the correlation coefficient, r². This measure will be very important in regression models.
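A short sketch of this squaring (assuming scipy is installed; the data is made up):

```python
# Coefficient of determination as the square of Pearson's r
# (assumes scipy is installed; the data is illustrative).
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, _ = stats.pearsonr(x, y)
r_squared = r ** 2  # fraction of the variance in y explained by x
print(round(r_squared, 2))  # 0.6
```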

Summary

In the last post, we talked about correlation for categorical data and mentioned that correlation for continuous variables is easier. In this post, we explained how to perform this correlation analysis and how to check whether it is statistically significant.

Adding the statistical significance test to the typical analysis will give a better understanding of how to use each variable.

This is the eleventh post of my particular #100daysofML; I will be publishing the advances of this challenge on GitHub, Twitter, and Medium (Adrià Serra).

https://github.com/CrunchyPistacho/100DaysOfML
