About Rank Data

A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known. Samples of data for which we already know, or can easily identify, the distribution are called parametric data.

When the distribution of the data is unknown or cannot be easily identified, specialized nonparametric statistical methods can be used that discard all information about the distribution.

This tutorial covers:

  • The difference between parametric and nonparametric data.
  • How to rank data in order to discard all information about the data’s distribution.
  • Examples of statistical methods that can be used for ranked data.

1.1 Tutorial Overview

1. Parametric Data

2. Nonparametric Data

3. Ranking Data

4. Working with Ranked Data

1.2 Parametric Data

Parametric data is a sample of data drawn from a known data distribution. Often, parametric is used as shorthand for real-valued data drawn from a Gaussian distribution.

Continuing with the shorthand of parametric meaning Gaussian, if we have parametric data, we can harness the entire suite of statistical methods developed for data assuming a Gaussian distribution, such as:

  • Summary statistics
  • Correlation between variables
  • Significance tests for comparing means
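As a rough illustration of the three kinds of methods listed above, the sketch below applies each to synthetic Gaussian data. The sample sizes, means, and standard deviations are assumptions chosen for the example, and the specific functions (pearsonr() for correlation, ttest_ind() for comparing means) are one reasonable choice rather than the only one.

```python
# sketch: methods that assume Gaussian data (all sample parameters are illustrative)
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
from scipy.stats import ttest_ind
# seed random number generator
seed(1)
# first Gaussian sample
data1 = 5 * randn(1000) + 50
# second sample: the first plus Gaussian noise with a small shift
data2 = data1 + (5 * randn(1000) + 1)
# summary statistics
print('mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
# Pearson's correlation between the two variables
corr, _ = pearsonr(data1, data2)
print('pearson r=%.3f' % corr)
# Student's t-test for comparing the means of the two samples
stat, p = ttest_ind(data1, data2)
print('t=%.3f p=%.3f' % (stat, p))
```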

1.3 Nonparametric Data

Data that does not fit a known or well-understood distribution is referred to as nonparametric data. Data could be nonparametric for many reasons, such as:

  • Data is not real-valued, but instead is ordinal, interval, or some other form.
  • Data is real-valued but does not fit a well-understood shape.
  • Data is almost parametric but contains outliers, multiple peaks, a shift, or some other feature.

There is a suite of methods that we can use for nonparametric data, called nonparametric statistical methods.

For real-valued data, nonparametric statistical methods are required in applied machine learning when you are trying to make claims about data that does not fit the familiar Gaussian distribution.
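To make the "almost parametric but contains outliers" case concrete, here is a minimal sketch that contaminates a Gaussian sample with a few extreme values. The outlier values are invented for the example. It compares the mean, which depends on the actual values, with the median, which depends only on the order of the observations.

```python
# sketch: effect of a few outliers on the mean versus the median (values are illustrative)
from numpy import append
from numpy import mean
from numpy import median
from numpy.random import randn
from numpy.random import seed
# seed random number generator
seed(1)
# Gaussian sample
data = 5 * randn(100) + 50
# same sample with three extreme values added
contaminated = append(data, [500.0, 600.0, 700.0])
# the mean moves a lot, the median barely changes
print('clean:        mean=%.2f median=%.2f' % (mean(data), median(data)))
print('contaminated: mean=%.2f median=%.2f' % (mean(contaminated), median(contaminated)))
```

The median is stable here because it uses only the position of each observation in the sorted sample, which is the same idea that ranking builds on.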

1.4 Ranking Data

Before many nonparametric statistical methods can be applied, the data must be converted into a rank format. As such, statistical methods that expect data in rank format, such as rank correlation and rank statistical hypothesis tests, are sometimes called rank statistics. Ranking data is exactly what its name suggests. The procedure is as follows:

  • Sort all data in the sample in ascending order.
  • Assign an integer rank from 1 to N for each unique value in the data sample.
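The two steps above can be sketched by hand with NumPy. This is a minimal version that assumes the sample contains no tied values; the variable names are only illustrative. The result is cross-checked against SciPy's rankdata() function.

```python
# sketch: ranking a sample by hand, assuming there are no tied values
from numpy import arange
from numpy import argsort
from numpy import array_equal
from numpy import empty
from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata
# seed random number generator
seed(1)
# generate a small dataset
data = rand(10)
# step 1: find the order that sorts the sample ascending
order = argsort(data)
# step 2: assign ranks 1 to N following that order
ranks = empty(len(data), dtype=int)
ranks[order] = arange(1, len(data) + 1)
print(data)
print(ranks)
# cross-check against SciPy
print(array_equal(ranks, rankdata(data)))
```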

We can then apply this procedure to another data sample and start using nonparametric statistical methods. There are variations on this procedure for special circumstances such as handling ties, using a reverse ranking, and using a fractional rank score, but the general properties hold. The SciPy library provides the rankdata() function to rank numerical data, which supports a number of variations on ranking. The example below demonstrates how to rank a numerical dataset.

# example of ranking real-valued observations
from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata
# seed random number generator
seed(1)
# generate dataset
data = rand(1000)
# review first 10 samples
print(data[:10])
# rank data
ranked = rankdata(data)
# review first 10 ranked samples
print(ranked[:10])

Running the example first generates a sample of 1,000 random numbers from a uniform distribution, then ranks the data sample and prints the result.
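The variations mentioned earlier (tie handling, fractional ranks, and reverse ranking) can be sketched with the method argument of rankdata(). The four-value sample is made up so that it contains one tie. Negating the values is one simple way to get a reverse ranking, not a dedicated SciPy feature.

```python
# sketch: tie handling and reverse ranking with rankdata()
from scipy.stats import rankdata
# small sample with a tie (the value 2 appears twice)
data = [0, 2, 3, 2]
# compare the tie-handling strategies
for method in ['average', 'min', 'max', 'dense', 'ordinal']:
    print(method, rankdata(data, method=method))
# reverse ranking: rank the negated values so the largest value gets rank 1
reverse = rankdata([-x for x in data])
print('reverse', reverse)
```

The default method is 'average', which gives tied values the mean of the ranks they would otherwise occupy and so can produce fractional ranks such as 2.5.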
