Spearman's rank correlation coefficient

From Wikipedia, the free encyclopedia

In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as r_s, is a non-parametric measure of correlation: it assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables.

Calculation

In principle, ρ is simply a special case of the Pearson product-moment coefficient in which two sets of data X_i and Y_i are converted to rankings x_i and y_i before calculating the coefficient.[1] In practice, however, a simpler procedure is normally used to calculate ρ. The raw scores are converted to ranks, and the differences d_i between the ranks of each observation on the two variables are calculated.

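If there are no tied ranks, i.e. no two values of X are equal and no two values of Y are equal,

$$\neg\exists\, i \neq j : (X_i = X_j \lor Y_i = Y_j),$$

then ρ is given by:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)},$$

where:

d_i = x_i − y_i = the difference between the ranks of corresponding values X_i and Y_i, and
n = the number of values in each data set (same for both sets).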

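If tied ranks exist, the classic Pearson correlation coefficient between the ranks has to be used instead of this formula:[1]

$$\rho = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$$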

Each of the equal values must be assigned the same rank, namely the average of the positions those values occupy in the sorted order:

An example of averaging ranks

In the table below, notice how the rank of values that are the same is the mean of what their ranks would otherwise be.

Variable X_i   Position in the descending order   Rank x_i
0.8            5                                  5
1.2            4                                  (4 + 3) / 2 = 3.5
1.2            3                                  (4 + 3) / 2 = 3.5
2.3            2                                  2
18             1                                  1

In this case we cannot use the shortcut formula (because of the tied ranks in the data) and must use the second, product-moment form.
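As a rough illustration of the averaged-rank rule and the product-moment form, the following Python sketch ranks values with ties and then applies Pearson's formula to the ranks. The function names `average_ranks`, `pearson`, and `spearman_with_ties`, and the `reverse` parameter, are hypothetical names chosen for this example; they do not come from the source or from any library.

```python
from math import sqrt


def average_ranks(values, reverse=False):
    """Return 1-based ranks; tied values share the mean of their positions.

    reverse=False ranks in ascending order, reverse=True in descending order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_position = ((start + 1) + (end + 1)) / 2  # positions are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy))


def spearman_with_ties(xs, ys):
    """Spearman's rho computed as Pearson's correlation of averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# The tied values 1.2 share the mean of positions 3 and 4, as in the table.
print(average_ranks([0.8, 1.2, 1.2, 2.3, 18], reverse=True))  # [5.0, 3.5, 3.5, 2.0, 1.0]
```

Ranking both variables in the same direction (both ascending or both descending) gives the same coefficient, so the choice of direction only affects how the ranks are displayed.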

Example

The raw data used in this example are shown below; we want to calculate the correlation between a person's IQ and the number of hours that person spends in front of the TV per week.

IQ, X_i   Hours of TV per week, Y_i
106       7
86        0
100       27
101       50
99        28
103       29
97        20
113       12
112       6
110       17

The first step is to sort this data by the second column (Y_i). Next, two more columns are created (x_i and y_i). The last of these columns (y_i) is assigned 1, 2, 3, ..., n, and then the data is sorted by the first original column (X_i). The first of the newly created columns (x_i) is assigned 1, 2, 3, ..., n. Then a column d_i is created to hold the differences between the two rank columns (x_i and y_i). Finally, one more column is created to hold d_i squared.

After doing this process with the example data you should end up with something like:

IQ, X_i   Hours of TV per week, Y_i   rank x_i   rank y_i   d_i   d_i^2
86        0                           1          1          0     0
97        20                          2          6          −4    16
99        28                          3          8          −5    25
100       27                          4          7          −3    9
101       50                          5          10         −5    25
103       29                          6          9          −3    9
106       7                           7          3          4     16
110       17                          8          5          3     9
112       6                           9          2          7     49
113       12                          10         4          6     36

The values in the d_i^2 column can now be added to find Σ d_i^2 = 194. The value of n is 10. These values can now be substituted back into the equation,

$$\rho = 1 - \frac{6 \times 194}{10(10^2 - 1)},$$

which evaluates to ρ = −0.175758. This value is close to zero, which shows that the correlation between IQ and hours spent watching TV is very low (barely any correlation). In the case of ties in the original values, this formula should not be used. Instead, the Pearson correlation coefficient should be calculated on the ranks (where ties are given averaged ranks, as described above).
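The worked example can be checked with a short Python sketch. It ranks the two columns directly rather than sorting the table twice, which yields the same ranks when there are no ties. The helper names `ranks_without_ties` and `spearman_shortcut` are hypothetical and exist only for this illustration.

```python
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv_hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks_without_ties(values):
    """Return 1-based ascending ranks; refuses input that contains ties."""
    if len(set(values)) != len(values):
        raise ValueError("tied values: use Pearson's formula on averaged ranks")
    position = {v: r for r, v in enumerate(sorted(values), start=1)}
    return [position[v] for v in values]


def spearman_shortcut(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); no ties allowed."""
    n = len(xs)
    x_ranks = ranks_without_ties(xs)
    y_ranks = ranks_without_ties(ys)
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


rho = spearman_shortcut(iq, tv_hours)
print(round(rho, 6))  # -0.175758
```

Raising an error on tied input is a design choice of this sketch: it prevents the shortcut formula from being applied silently to data for which it is not valid.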

Determining significance

The modern approach to testing whether an observed value of ρ is significantly different from zero (ρ always satisfies −1 ≤ ρ ≤ 1) is to use a permutation test to calculate the probability that ρ would be greater than or equal to the observed value, given the null hypothesis. This approach is almost always superior to traditional methods, unless the data set is so large that computing power is not sufficient to generate the permutations, or unless an algorithm for creating permutations that are consistent with the null hypothesis is difficult to devise for the particular case (but usually these algorithms are straightforward).
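A minimal sketch of such a permutation test in Python is shown here, assuming tie-free ranks and a sample small enough to enumerate every permutation. The names `spearman_from_ranks` and `exact_permutation_p_value` are hypothetical, and the ranks in the usage line are made-up illustrative values rather than data from this article.

```python
from itertools import permutations


def spearman_from_ranks(x_ranks, y_ranks):
    """Spearman's rho from two tie-free rank lists of equal length."""
    n = len(x_ranks)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


def exact_permutation_p_value(x_ranks, y_ranks):
    """One-sided p-value: share of all pairings whose rho is >= the observed rho.

    Under the null hypothesis every pairing of the y ranks with the x ranks
    is equally likely, so all n! orderings of y_ranks are enumerated.
    """
    observed = spearman_from_ranks(x_ranks, y_ranks)
    count = 0
    total = 0
    for shuffled in permutations(y_ranks):
        total += 1
        # Small tolerance guards against floating-point noise in the comparison.
        if spearman_from_ranks(x_ranks, shuffled) >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical ranks for five observations, chosen only for illustration.
print(exact_permutation_p_value([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))
```

Full enumeration grows as n!, so for larger samples one would typically draw a random subset of permutations instead; that variant is not shown here.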

Although the permutation test is often trivial to perform for anyone with computing resources and programming experience, traditional methods for determining significance are still widely used. The most basic approach is to compare the observed ρ with published tables for various levels of significance. This is a simple solution if the significance only needs to be known within a certain range or less than a certain value, as long as tables are available that specify the desired ranges. A reference to such a table is given below. However, generating these tables is computationally intensive and complicated mathematical tricks have been used over the years to generate tables for larger and larger sample sizes, so it is not practical for most people to extend existing tables.

An alternative approach available for sufficiently large sample sizes is an approximation to the Student's t-distribution with n − 2 degrees of freedom. For sample sizes above about 20, the variable

$$t = \frac{\rho}{\sqrt{(1 - \rho^2)/(n - 2)}}$$

has a Student's t-distribution in the null case (zero correlation). In the non-null case (i.e. to test whether an observed ρ is significantly different from a theoretical value, or whether two observed ρs differ significantly) tests are much less powerful, though the t-distribution can again be used.
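For the null case, the usual form of this statistic is t = ρ / sqrt((1 − ρ²) / (n − 2)), which the Python sketch below computes. The function name `spearman_t_statistic` and the input values in the usage line are hypothetical; looking up the resulting t against a t-distribution with n − 2 degrees of freedom is left to a table or a statistics library.

```python
from math import sqrt


def spearman_t_statistic(rho, n):
    """t statistic for testing rho = 0, referred to n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if abs(rho) >= 1:
        raise ValueError("statistic is undefined for |rho| = 1")
    # Algebraically equal to rho / sqrt((1 - rho^2) / (n - 2)).
    return rho * sqrt((n - 2) / (1 - rho * rho))


# Hypothetical observed coefficient and sample size, for illustration only.
print(spearman_t_statistic(0.5, 27))
```

Because the approximation is described as suitable only for sample sizes above about 20, the sketch would not be an appropriate test for the ten-observation example earlier in the article.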

A generalization of the Spearman coefficient is useful in the situation where there are three or more conditions, a number of subjects are all observed in each of them, and we predict that the observations will have a particular order. For example, a number of subjects might each be given three trials at the same task, and we predict that performance will improve from trial to trial. A test of the significance of the trend between conditions in this situation was developed by E. B. Page and is usually referred to as Page's trend test for ordered alternatives.

Correspondence analysis based on Spearman's rho

Classic correspondence analysis is a statistical method that assigns a score to every value of two nominal variables in such a way that Pearson's correlation coefficient between them is maximized.

There exists an equivalent of this method, called grade correspondence analysis, which maximizes Spearman's rho or Kendall's tau[2].

Notes

  1. Myers, Jerome L.; Well, Arnold D. (2003). Research Design and Statistical Analysis, second edition. Lawrence Erlbaum, p. 508. ISBN 0805840370.
  2. Kowalczyk, T.; Pleszczyńska, E.; Ruland, F. (eds.) (2004). Grade Models and Methods for Data Analysis with Applications for the Analysis of Data Populations. Studies in Fuzziness and Soft Computing, vol. 151. Berlin Heidelberg New York: Springer Verlag. ISBN 9783540211204.
