You will discover how fundamental statistical operations work and how to implement them using NumPy with notation and terminology from linear algebra.
After completing this tutorial, you will know:
- What the expected value, average, and mean are and how to calculate them.
- What the variance and standard deviation are and how to calculate them.
- What the covariance, correlation, and covariance matrix are and how to calculate them.
1.1 Tutorial Overview
This tutorial is divided into 4 parts; they are:
- 1. Expected Value and Mean
- 2. Variance and Standard Deviation
- 3. Covariance and Correlation
- 4. Covariance Matrix
1.2 Expected Value and Mean
In probability, the average value of some random variable X is called the expected value or the expectation. The expected value uses the notation E with square brackets around the name of the variable; for example: E[X]
It is calculated as the probability-weighted sum of the values that can be drawn.
E[X] = x1 × p1 + x2 × p2 + x3 × p3 + · · · + xn × pn
In simple cases, such as flipping a coin or rolling a die, each outcome is equally likely. Therefore, the expected value can be calculated as the sum of all values multiplied by the reciprocal of the number of values.
E[X] = 1/n × (x1 + x2 + x3 + · · · + xn)
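The probability-weighted sum can be sketched in NumPy as a dot product between a vector of outcomes and a vector of probabilities. The fair six-sided die and the names x, p, and expected are assumptions chosen for illustration only.

```python
# expected value of a fair six-sided die (illustrative assumption)
from numpy import array
from numpy import mean
# possible outcomes
x = array([1, 2, 3, 4, 5, 6])
# probability of each outcome, all equally likely
p = array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
# probability-weighted sum of the outcomes
expected = x.dot(p)
print(expected)
# with equal probabilities this agrees with the arithmetic mean
print(mean(x))
```

Replacing p with unequal probabilities that sum to one would give the expected value of a loaded die, which in general differs from the arithmetic mean of the outcomes.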
In statistics, the mean, or more technically the arithmetic mean or sample mean, can be estimated from a sample of examples drawn from the domain. This can be confusing because the terms mean, average, and expected value are often used interchangeably. In the abstract, the mean is denoted by the lowercase Greek letter mu (µ) and is calculated from the sample of observations, rather than all possible values.
# Example of calculating a vector mean
# vector mean
from numpy import array
from numpy import mean
# define vector
v = array([1, 2, 3, 4, 5, 6])
print(v)
# calculate mean
result = mean(v)
print(result)
Running the example first prints the defined vector and the mean of the values in the vector.
The mean function can calculate the column or row means of a matrix by specifying the axis argument with the value 0 or 1, respectively. The example below defines a 2 × 6 matrix and calculates both column and row means.
# matrix means
from numpy import array
from numpy import mean
# define matrix
M = array([
[1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6]
])
print(M)
# column means
col_mean = mean(M, axis=0)
print(col_mean)
# row means
row_mean = mean(M, axis=1)
print(row_mean)
Running the example first prints the defined matrix, then the calculated column and row mean values.
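One way to keep the two axis values straight is to look at the shapes of the results: axis=0 collapses the rows and leaves one mean per column, while axis=1 collapses the columns and leaves one mean per row. The short sketch below reuses the same 2 × 6 matrix; the name overall is introduced here only for illustration.

```python
# shapes of column and row means
from numpy import array
from numpy import mean
# define matrix
M = array([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
])
print(M.shape)         # (2, 6)
# axis=0 collapses the rows: one mean per column
col_mean = mean(M, axis=0)
print(col_mean.shape)  # (6,)
# axis=1 collapses the columns: one mean per row
row_mean = mean(M, axis=1)
print(row_mean.shape)  # (2,)
# no axis argument: a single mean over all elements
overall = mean(M)
print(overall)
```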
1.3 Variance and Standard Deviation
# Example of calculating a vector variance
# vector variance
from numpy import array
from numpy import var
# define vector
v = array([1, 2, 3, 4, 5, 6])
print(v)
# calculate variance
result = var(v, ddof=1)
print(result)
Running the example first prints the defined vector and then the calculated sample variance of the values in the vector.
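To make the role of the ddof argument concrete, the sketch below computes the variance by hand, assuming the usual definitions: the sample variance divides the sum of squared differences from the mean by n - 1, while the population variance divides by n. The names sq_diff, sample_var, and pop_var are introduced here for illustration.

```python
# variance computed by hand and compared with var()
from numpy import array
from numpy import mean
from numpy import var
# define vector
v = array([1, 2, 3, 4, 5, 6])
n = len(v)
# squared differences from the mean
sq_diff = (v - mean(v)) ** 2
# sample variance: divide by n - 1 (matches ddof=1)
sample_var = sq_diff.sum() / (n - 1)
print(sample_var)
print(var(v, ddof=1))
# population variance: divide by n (matches the default ddof=0)
pop_var = sq_diff.sum() / n
print(pop_var)
print(var(v))
```

Note that var() uses ddof=0 unless told otherwise, so omitting the argument gives the population variance rather than the sample variance.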
The var function can calculate the column or row variances of a matrix by specifying the axis argument with the value 0 or 1, respectively, the same as the mean function above. The example below defines a 2 × 6 matrix and calculates both column and row sample variances.
# Example of calculating matrix variances
# matrix variances
from numpy import array
from numpy import var
# define matrix
M = array([
[1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6]
])
print(M)
# column variances
col_var = var(M, ddof=1, axis=0)
print(col_var)
# row variances
row_var = var(M, ddof=1, axis=1)
print(row_var)
Running the example first prints the defined matrix and then the column and row sample variance values.
The standard deviation is calculated as the square root of the variance and is denoted as lowercase s.
# Example of calculating matrix standard deviations
# matrix standard deviation
from numpy import array
from numpy import std
# define matrix
M = array([
[1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6]
])
print(M)
# column standard deviations
col_std = std(M, ddof=1, axis=0)
print(col_std)
# row standard deviations
row_std = std(M, ddof=1, axis=1)
print(row_std)
Running the example first prints the defined matrix and then the column and row sample standard deviation values.
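The relationship between the two statistics can be checked directly: taking the square root of the sample variance should reproduce the sample standard deviation, provided the same ddof value is used for both. The sketch below does this for the row statistics of the same matrix; the use of allclose for the comparison is a choice made here for illustration.

```python
# standard deviation as the square root of the variance
from numpy import array
from numpy import sqrt
from numpy import std
from numpy import var
from numpy import allclose
# define matrix
M = array([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
])
# row sample variances and standard deviations
row_var = var(M, ddof=1, axis=1)
row_std = std(M, ddof=1, axis=1)
print(sqrt(row_var))
print(row_std)
# the two results agree up to floating-point error
print(allclose(sqrt(row_var), row_std))
```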
1.4 Covariance and Correlation
The sign of the covariance indicates whether the two variables tend to change in the same direction (positive) or in opposite directions (negative). The magnitude of the covariance is not easily interpreted because it depends on the units of the variables. A covariance value of zero indicates that the two variables have no linear relationship; it does not by itself mean that they are independent.
NumPy does not have a function to calculate the covariance between two variables directly. Instead, it has a function for calculating a covariance matrix called cov() that we can use to retrieve the covariance. By default, the cov() function will calculate the unbiased or sample covariance between the provided random variables.
The example below defines two vectors of equal length with one increasing and one decreasing. We would expect the covariance between these variables to be negative. We access just the covariance for the two variables as the [0, 1] element of the square covariance matrix returned.
# Example of calculating a vector covariance
# vector covariance
from numpy import array
from numpy import cov
# define first vector
x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(x)
# define second vector
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
print(y)
# calculate covariance
Sigma = cov(x, y)[0, 1]
print(Sigma)
Running the example first prints the two vectors followed by the covariance for the values in the two vectors. The value is negative, as we expected.
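The same value can be reproduced by hand, assuming the standard definition of the sample covariance: the sum of the products of each variable's differences from its mean, divided by n - 1. The names dx, dy, and manual_cov below are introduced for illustration.

```python
# sample covariance computed by hand and compared with cov()
from numpy import array
from numpy import mean
from numpy import cov
# define the two vectors
x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
n = len(x)
# differences from each mean
dx = x - mean(x)
dy = y - mean(y)
# sum of products of differences, divided by n - 1
manual_cov = (dx * dy).sum() / (n - 1)
print(manual_cov)
# the off-diagonal element of the covariance matrix holds the same value
print(cov(x, y)[0, 1])
```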
# Example of calculating a vector correlation
# vector correlation
from numpy import array
from numpy import corrcoef
# define first vector
x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(x)
# define second vector
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
print(y)
# calculate correlation
corr = corrcoef(x, y)[0, 1]
print(corr)
Running the example first prints the two defined vectors followed by the correlation coefficient. We can see that the vectors are maximally negatively correlated as we designed.
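The coefficient returned by corrcoef() is the Pearson correlation, which can be understood as the covariance rescaled by the standard deviations of the two variables. The sketch below rebuilds it from cov() and std(), assuming ddof=1 is used for the standard deviations so that they match the sample covariance; the names sx, sy, and r are introduced for illustration.

```python
# correlation as covariance normalized by the standard deviations
from numpy import array
from numpy import cov
from numpy import std
from numpy import corrcoef
# define the two vectors
x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
# sample standard deviations
sx = std(x, ddof=1)
sy = std(y, ddof=1)
# sample covariance divided by the product of the standard deviations
r = cov(x, y)[0, 1] / (sx * sy)
print(r)
# compare with the built-in function
print(corrcoef(x, y)[0, 1])
```

Because of this normalization the coefficient always lies between -1 and 1, which makes it easier to interpret than the raw covariance.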
1.5 Covariance Matrix
# Example of calculating a covariance matrix
# covariance matrix
from numpy import array
from numpy import cov
# define matrix of observations
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]
])
print(X)
# calculate covariance matrix
Sigma = cov(X.T)
print(Sigma)
Running the example first prints the defined dataset and then the calculated covariance matrix.
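In linear algebra terms, the sample covariance matrix can be written as the product of the transposed centered data matrix with itself, divided by n - 1. The sketch below assumes, as in the example above, that rows of X are observations and columns are features; the names C and Sigma_manual are introduced for illustration.

```python
# covariance matrix from the centered data matrix
from numpy import array
from numpy import mean
from numpy import cov
from numpy import allclose
# define matrix of observations (rows) by features (columns)
X = array([
    [1, 5, 8],
    [3, 5, 11],
    [2, 4, 9],
    [3, 6, 10],
    [1, 5, 10]
])
n = X.shape[0]
# center each column by subtracting its mean
C = X - mean(X, axis=0)
# C^T C / (n - 1) gives the sample covariance matrix
Sigma_manual = C.T.dot(C) / (n - 1)
print(Sigma_manual)
# compare with the built-in function
print(allclose(Sigma_manual, cov(X.T)))
```

The transpose in cov(X.T) is needed because cov() treats each row as a variable by default; passing rowvar=False to cov() is an alternative to transposing.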