Performing summary statistics and plots —— Python Data Science Cookbook

Source: Python Data Science Cookbook

The primary purpose of using summary statistics is to get a good understanding of the location and dispersion of the data. By summary statistics, we refer to mean, median, and standard deviation. These quantities are quite easy to calculate. However, one should be careful when using them. If the underlying data is not unimodal, that is, it has multiple peaks, these quantities may not be of much use.
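The caveat about multimodal data can be seen with a small sketch. The sample below is hypothetical (two tight clusters invented for illustration, not taken from the book), and the code is written for Python 3 with NumPy.

```python
import numpy as np

# Hypothetical bimodal sample: one cluster near 2, another near 10.
low_cluster = np.array([1.8, 1.9, 2.0, 2.1, 2.2])
high_cluster = np.array([9.8, 9.9, 10.0, 10.1, 10.2])
sample = np.concatenate([low_cluster, high_cluster])

mean = np.mean(sample)
median = np.median(sample)
std = np.std(sample)

# Distance from the mean to the closest actual observation.
gap_to_nearest = np.min(np.abs(sample - mean))

print("mean=%0.2f median=%0.2f std=%0.2f" % (mean, median, std))
print("nearest observation is %0.2f away from the mean" % gap_to_nearest)
```

Both the mean and the median land between the two clusters, where there are no observations at all, so neither describes a typical value of this sample.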


  1. If the given data is unimodal, that is, it has only one peak, the mean, which gives the location, and the standard deviation, which gives the dispersion, are valuable metrics.
  2. Compared to the regular mean, a trimmed mean is less sensitive to outliers, because a fixed proportion of the smallest and largest values is discarded before averaging. SciPy provides the trim_mean function for this. We demonstrate the trimmed mean calculation in step 2.
  3. The mean is very sensitive to outliers; the variance is computed from the mean, so it is prone to the same problem. We can use other measures of dispersion to avoid this trap. One such measure is the mean absolute deviation: instead of squaring the differences between the individual values and the mean and dividing by the number of instances, we take the absolute value of each difference and divide their sum by the number of instances. In step 5, we define a function for this:
    def mad(x,axis=None):
        mean = np.mean(x,axis=axis)
        return np.sum(np.abs(x-mean))/(1.0 * len(x))
  4. When the data has many outliers, another set of metrics comes in handy: the median and percentiles. Traditionally, the median is defined as a value from the dataset such that half of the points in the dataset are smaller and the other half are larger than the median value.
    Interpreting the percentiles (the values below are illustrative and do not come from the Iris data):
    25% of the points in the dataset are below 13.00 (25th percentile value).
    50% of the points in the dataset are below 18.50 (50th percentile value).
    75% of the points in the dataset are below 25.25 (75th percentile value).
    A point to note is that the 50th percentile is the median. Percentiles give us a good idea of the range of our values.
    The median is a measure of the location of the data distribution. Using percentiles, we can get a metric for the dispersion of the data, the interquartile range. The interquartile range is the distance between the 75th percentile and the 25th percentile.
  5. Similar to the mean absolute deviation explained previously, we also have the median absolute deviation:
    def mdad(x,axis=None):
        median = np.median(x,axis=axis)
        return np.median(np.abs(x-median))
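The robustness claims in the list above can be checked with a short sketch, written for Python 3 with NumPy. Everything in it is an assumption made for illustration: the ten-point sample, the single corrupted value, and the helper names (trimmed_mean, mean_abs_dev, median_abs_dev, iqr). The trimmed_mean helper is a hand-rolled stand-in that drops a proportion of points from each end of the sorted data, which is the idea behind SciPy's trim_mean; it is not SciPy's implementation.

```python
import numpy as np

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from each end."""
    ordered = np.sort(values)
    k = int(proportion * len(ordered))  # number of points cut per tail
    return np.mean(ordered[k:len(ordered) - k])

def mean_abs_dev(values):
    """Average absolute distance from the mean."""
    return np.mean(np.abs(values - np.mean(values)))

def median_abs_dev(values):
    """Median absolute distance from the median."""
    return np.median(np.abs(values - np.median(values)))

def iqr(values):
    """Distance between the 75th and 25th percentiles."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 - q25

# Hypothetical data: a clean sample and a copy with one corrupted value.
clean = np.array([10.0, 11.0, 12.0, 13.0, 14.0,
                  15.0, 16.0, 17.0, 18.0, 19.0])
dirty = clean.copy()
dirty[-1] = 190.0  # pretend 19 was mistyped as 190

for name, data in [("clean", clean), ("dirty", dirty)]:
    print(name)
    print("  location:   mean=%0.2f trimmed=%0.2f median=%0.2f"
          % (np.mean(data), trimmed_mean(data, 0.1), np.median(data)))
    print("  dispersion: std=%0.2f mean_abs_dev=%0.2f "
          "median_abs_dev=%0.2f iqr=%0.2f"
          % (np.std(data), mean_abs_dev(data),
             median_abs_dev(data), iqr(data)))
```

With the single outlier, the mean, the standard deviation, and the mean absolute deviation all jump, while the trimmed mean, the median, the median absolute deviation, and the interquartile range stay where they were on the clean sample.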
Source code:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# @author: snaildove
# Load Libraries
from sklearn.datasets import load_iris
import numpy as np
from scipy.stats import trim_mean
# Load iris data
data = load_iris()
x = data['data']
y = data['target']
col_names = data['feature_names']
# Let’s now demonstrate how to calculate the mean, trimmed mean, and range values:
# 1. Calculate and print the mean value of each column in the Iris dataset
print "col name,mean value"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,np.mean(x[:,i]))
# 2. Trimmed mean calculation.
p = 0.1 # 10% trimmed mean
print "col name,trimmed mean value"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,trim_mean(x[:,i],p))
# 3. Data dispersion: calculate and display the range values.
print "col_names,max,min,range"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f,%0.2f,%0.2f"%(col_name,max(x[:,i]),min(x[:,i]),max(x[:,i])-min(x[:,i]))
# Finally, we will show the variance, standard deviation, mean absolute deviation, and
# median absolute deviation calculations:
# 4. Data dispersion, variance and standard deviation
print "col_names,variance,std-dev"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f,%0.2f"%(col_name,np.var(x[:,i]),np.std(x[:,i]))
# 5. Mean absolute deviation calculation
def mad(x,axis=None):
    mean = np.mean(x,axis=axis)
    return np.sum(np.abs(x-mean))/(1.0 * len(x))

print "col_names,mad"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,mad(x[:,i]))
# 6. Median absolute deviation calculation
def mdad(x,axis=None):
    median = np.median(x,axis=axis)
    return np.median(np.abs(x-median))
print "col_names,median,median abs dev,inter quartile range"
for i,col_name in enumerate(col_names):
    # Both percentiles must be taken over column i (x[:,i]), not over row i
    iqr = np.percentile(x[:,i],75) - np.percentile(x[:,i],25)
    print "%s,%0.2f,%0.2f,%0.2f"%(col_name,np.median(x[:,i]),mdad(x[:,i]),iqr)

Output:
col name,mean value
sepal length (cm),5.84
sepal width (cm),3.05
petal length (cm),3.76
petal width (cm),1.20

col name,trimmed mean value
sepal length (cm),5.81
sepal width (cm),3.04
petal length (cm),3.76
petal width (cm),1.18

col_names,max,min,range
sepal length (cm),7.90,4.30,3.60
sepal width (cm),4.40,2.00,2.40
petal length (cm),6.90,1.00,5.90
petal width (cm),2.50,0.10,2.40

col_names,variance,std-dev
sepal length (cm),0.68,0.83
sepal width (cm),0.19,0.43
petal length (cm),3.09,1.76
petal width (cm),0.58,0.76

col_names,mad
sepal length (cm),0.69
sepal width (cm),0.33
petal length (cm),1.56
petal width (cm),0.66

col_names,median,median abs dev,inter quartile range
sepal length (cm),5.80,0.70,1.30
sepal width (cm),3.00,0.25,0.50
petal length (cm),4.35,1.25,3.50
petal width (cm),1.30,0.70,1.50
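The mad and mdad helpers in the listing accept an axis argument but are only ever called on one column at a time, and mad divides by len(x), which is the row count regardless of axis. As a sketch under stated assumptions, the variants below honor axis so that every column can be summarized in one call. The function names and the small table are hypothetical and are not part of the original recipe; the code is written for Python 3 with NumPy.

```python
import numpy as np

def mean_abs_dev(x, axis=None):
    """Mean absolute deviation along `axis` (all elements if None)."""
    x = np.asarray(x, dtype=float)
    # keepdims=True lets the center broadcast back against x
    center = np.mean(x, axis=axis, keepdims=True)
    return np.mean(np.abs(x - center), axis=axis)

def median_abs_dev(x, axis=None):
    """Median absolute deviation along `axis` (all elements if None)."""
    x = np.asarray(x, dtype=float)
    center = np.median(x, axis=axis, keepdims=True)
    return np.median(np.abs(x - center), axis=axis)

def interquartile_range(x, axis=None):
    """75th percentile minus 25th percentile along `axis`."""
    q25, q75 = np.percentile(x, [25, 75], axis=axis)
    return q75 - q25

# Hypothetical (rows, columns) table standing in for a feature matrix.
table = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0],
                  [5.0, 50.0]])

# axis=0 reduces over rows, giving one value per column.
print("mean abs dev:  ", mean_abs_dev(table, axis=0))
print("median abs dev:", median_abs_dev(table, axis=0))
print("iqr:           ", interquartile_range(table, axis=0))
```

Calling these with axis=0 on a feature matrix such as x would replace the per-column loops, and calling them on a single column without axis gives the same value as the corresponding entry of the axis=0 result.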
