A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known. Samples of data for which we already know, or can easily identify, the distribution are called parametric data.
When the distribution is unknown or cannot be easily identified, specialized nonparametric statistical methods can be used that discard all information about the distribution.
This tutorial covers:
- The difference between parametric and nonparametric data.
- How to rank data in order to discard all information about the data’s distribution.
- Example of statistical methods that can be used for ranked data.
1.1 Tutorial Overview
1. Parametric Data
2. Nonparametric Data
3. Ranking Data
4. Working with Ranked Data
1.2 Parametric Data
Parametric data is a sample of data drawn from a known data distribution. Often, parametric is used as shorthand for real-valued data drawn from a Gaussian distribution.
Continuing with the shorthand of parametric meaning Gaussian, if we have parametric data, we can harness the entire suite of statistical methods developed for data assuming a Gaussian distribution, such as:
- Summary statistics
- Correlation between variables
- Significance tests for comparing means
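As a rough illustration of the three kinds of methods listed above, the sketch below applies them to synthetic Gaussian samples. The samples x, y, and z and their means and standard deviations are assumptions made up for this example. Pearson's correlation and the Student's t-test are used here as common examples of methods that assume a Gaussian distribution.

```python
# sketch of Gaussian-based methods on synthetic data (all values are illustrative)
from numpy import mean, std
from numpy.random import randn, seed
from scipy.stats import pearsonr, ttest_ind
# seed random number generator
seed(1)
# generate hypothetical Gaussian samples
x = 5 * randn(100) + 50
y = x + 5 * randn(100)
z = 5 * randn(100) + 52
# summary statistics
mean_x, std_x = mean(x), std(x, ddof=1)
print('mean=%.3f stdv=%.3f' % (mean_x, std_x))
# correlation between variables (Pearson's)
corr, corr_p = pearsonr(x, y)
print('Pearson r=%.3f p=%.3f' % (corr, corr_p))
# significance test for comparing means (Student's t-test)
t_stat, t_p = ttest_ind(x, z)
print('t=%.3f p=%.3f' % (t_stat, t_p))
```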
1.3 Nonparametric Data
Data that does not fit a known or well-understood distribution is referred to as nonparametric data. Data could be nonparametric for many reasons, such as:
- Data is not real-valued, but is instead ordinal, interval, or some other form.
- Data is real-valued but does not fit a well-understood shape.
- Data is almost parametric but contains outliers, multiple peaks, a shift, or some other feature.
There is a suite of methods that we can use for nonparametric data called nonparametric statistical methods.
For real-valued data, nonparametric statistical methods are required in applied machine learning when you are trying to make claims about data that does not fit the familiar Gaussian distribution.
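To make this concrete, the sketch below applies a few widely used nonparametric methods to skewed, non-Gaussian data. The exponential samples a and b are assumptions made up for illustration. The median, Spearman's rank correlation, and the Mann-Whitney U test are shown only as common examples, not as a complete list.

```python
# sketch of nonparametric methods on skewed synthetic data (values are illustrative)
from numpy import median
from numpy.random import exponential, seed
from scipy.stats import mannwhitneyu, spearmanr
# seed random number generator
seed(1)
# generate hypothetical non-Gaussian samples
a = exponential(scale=1.0, size=100)
b = exponential(scale=1.5, size=100)
# summary statistic that does not assume a distribution
med_a = median(a)
print('median=%.3f' % med_a)
# rank correlation between a and a monotonic transform of a
rho, rho_p = spearmanr(a, a ** 2)
print('Spearman rho=%.3f' % rho)
# rank-based significance test comparing two samples
u_stat, u_p = mannwhitneyu(a, b, alternative='two-sided')
print('U=%.3f p=%.3f' % (u_stat, u_p))
```

Because squaring a positive value preserves its order, the rank correlation between a and a ** 2 is 1 even though the relationship is not linear.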
1.4 Ranking Data
Before many nonparametric statistical methods can be applied, the data must be converted into a rank format. As such, statistical methods that expect data in rank format are sometimes called rank statistics, such as rank correlation and rank statistical hypothesis tests. Ranking data is exactly what its name suggests. The procedure is as follows:
- Sort all data in the sample in ascending order.
- Assign an integer rank from 1 to N for each unique value in the data sample.
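The two steps above can be sketched by hand with NumPy. The helper name simple_rank and the sample values are hypothetical, and the sketch assumes the sample contains no tied values.

```python
# minimal sketch of manual ranking, assuming no tied values
from numpy import arange, argsort, empty

def simple_rank(values):
    # hypothetical helper: indices that would sort the sample in ascending order
    order = argsort(values)
    # assign ranks 1..N back to the original positions
    ranks = empty(len(values), dtype=int)
    ranks[order] = arange(1, len(values) + 1)
    return ranks

sample = [0.42, 0.07, 0.91, 0.33]
ranks = simple_rank(sample)
print(ranks)  # [3 1 4 2]
```

The smallest observation (0.07) receives rank 1 and the largest (0.91) receives rank 4, while the original order of the observations is preserved.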
We can then apply this procedure to another data sample and start using nonparametric statistical methods. There are variations on this procedure for special circumstances such as handling ties, using a reverse ranking, and using a fractional rank score, but the general properties hold. The SciPy library provides the rankdata() function to rank numerical data, which supports a number of variations on ranking. The example below demonstrates how to rank a numerical dataset.
# example of ranking real-valued observations
from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata
# seed random number generator
seed(1)
# generate dataset
data = rand(1000)
# review first 10 samples
print(data[:10])
# rank data
ranked = rankdata(data)
# review first 10 ranked samples
print(ranked[:10])
Running the example first generates a sample of 1,000 random numbers from a uniform distribution and prints the first 10 observations, then ranks the data sample and prints the first 10 ranks.
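Random real values rarely contain ties, so the variations on ranking are easier to see on a small sample with a repeated value. The sketch below uses the method argument of rankdata() on a made-up sample. Negating the values is shown as one simple way to obtain a reverse ranking, which is an assumption of this example rather than a dedicated rankdata() option.

```python
# sketch of tie handling with rankdata() on a small made-up sample
from numpy import asarray
from scipy.stats import rankdata
# sample with a tied value (2 appears twice)
tied = asarray([0, 2, 3, 2])
# compare the tie-handling methods
results = {}
for method in ('average', 'min', 'max', 'dense', 'ordinal'):
    results[method] = rankdata(tied, method=method).tolist()
    print(method, results[method])
# reverse ranking by negating the values (largest value gets rank 1)
reverse = rankdata(-tied).tolist()
print('reverse', reverse)
```

The default method, average, assigns tied values the mean of the ranks they would otherwise occupy, which is why fractional ranks such as 2.5 appear.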