


Table of contents

  1. Introduction


  2. Data types


  3. Analyzing Quantitative Data
I. Measures of Center
II. Measures of Spread
III. Shape of distribution
IV. Outliers

  4. Descriptive vs. Inferential Statistics


  5. Looking Ahead


  6. Summary


Introduction

The word “data” is defined as distinct pieces of information. You may think of data as simply numbers on a spreadsheet, but it comes in many forms: text, video, spreadsheets and databases, images, audio, and more. Data is used to understand and improve nearly every facet of our lives, from early disease detection to social networks that let us connect and communicate with people around the world. Whatever field you work in, whether insurance, banking, medicine, education, or agriculture, you can use data to make better decisions and accomplish your goals.

Running descriptive statistics on your datasets is crucial before you begin inferential statistics. Many people, especially those new to research, skip this step: they do not take the time to run descriptive statistics carefully, clean the data, and check that the data meet the assumptions required by the more advanced statistical tests. It is imperative that this process is done correctly.

Data Types

  • Quantitative data takes on numeric values that allow us to perform mathematical operations.
- Continuous data can be split into smaller and smaller units, and a still smaller unit always exists (for example, we can measure age in years, months, days, hours, or seconds, and there are still smaller units we could use).
- Discrete data only takes on countable values.

  • Categorical data are used to label a group or set of items.
- Categorical Ordinal: data that take on a ranked ordering (for example, a rating on a scale from Very Bad to Very Good).
- Categorical Nominal: data that do not have an order or ranking.



Analyzing Quantitative Data

Measures of Center

  1. The Mean
The mean is often called the average or the expected value in mathematics. We calculate the mean by adding all of our values together and dividing by the number of values in our dataset.

  2. The Median
The median splits our data so that 50% of our values are lower and 50% are higher.
- Median for an odd number of values: If we have an odd number of observations, the median is simply the number in the direct middle.
- Median for an even number of values: If we have an even number of observations, the median is the average of the two values in the middle.
Note: In order to compute the median we MUST sort our values first.

  3. The Mode
The mode is the most frequently observed value in our dataset.
Note 1: There might be multiple modes for a particular dataset or no mode at all.
Note 2: The mode of a distribution is essentially the tallest bar in a histogram. There may be multiple modes depending on the number of peaks in our histogram.
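The three measures of center above can be sketched with Python's standard `statistics` module. This is only an illustration: the sample values are made up, and `multimode` is used because it returns every most-frequent value rather than a single one.

```python
import statistics

# Hypothetical sample, deliberately left unsorted.
values = [7, 3, 5, 3, 9, 1, 3, 8, 6]

mean = statistics.mean(values)        # sum of the values / number of values
median = statistics.median(values)    # sorts internally, then takes the middle value
modes = statistics.multimode(values)  # list of all most frequently observed values

# Even number of observations: the median averages the two middle values.
even_median = statistics.median([1, 3, 5, 7])

# A dataset can have more than one mode.
two_modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean, median, modes, even_median, two_modes)
```

For this sample the mean and the median are both 5 and the only mode is 3; the second list has two modes, 1 and 2.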

Measures of Spread

One of the most common ways to measure the spread of our data is by looking at the Five Number Summary. It consists of five values:

  1. The minimum: The smallest number in the dataset.


  2. The first quartile Q1: The value such that 25% of the data falls below.


  3. The second quartile Q2 (the median): The value such that 50% of the data falls below.


  4. The third quartile Q3: The value such that 75% of the data falls below.


  5. Maximum: The largest value in the dataset.
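As a rough sketch, the five values can be computed with the standard `statistics` module. The helper name `five_number_summary` and the data are hypothetical, and note that several quartile conventions exist, so other tools may report slightly different Q1 and Q3 for the same data.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(data)
    # method="inclusive" treats the data as the whole population of interest;
    # the default "exclusive" method can give slightly different quartiles.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

values = [9, 1, 4, 7, 2, 8, 3, 6, 5]  # hypothetical data
summary = five_number_summary(values)
print(summary)
```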


The five-number summary is commonly drawn as a boxplot.

Measures of spread give us an idea of how spread out our data are. Common measures of spread include:

  1. Range:
The range is the difference between the maximum and the minimum.

  2. Interquartile Range (IQR):
The interquartile range is calculated as the difference between Q3 and Q1.

  3. Variance:
The variance is the average squared difference of each observation from the mean. It is used to compare the spread of two different groups: a dataset with a higher variance is more spread out than a dataset with a lower variance. Be careful though, a single outlier (or a few outliers) can inflate the variance even when most of the data are actually very close together.


  4. Standard Deviation:
The standard deviation is one of the most common measures for talking about the spread of data. It is defined as the square root of the variance. The standard deviation is used more often in practice than the variance because it shares the units of the original dataset.
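The four measures of spread can be sketched as follows. The data are hypothetical, and the variance is computed exactly as described in the text (the average squared difference from the mean), which is the population variance. Many tools default to the sample variance instead, which divides by n - 1 rather than n.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

# Range: maximum minus minimum.
value_range = max(values) - min(values)

# IQR: Q3 minus Q1 (quartile conventions vary; "inclusive" is one choice).
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Variance: average squared difference of each observation from the mean.
mean = statistics.mean(values)
variance = sum((x - mean) ** 2 for x in values) / len(values)

# Standard deviation: square root of the variance, in the data's own units.
std_dev = variance ** 0.5

print(value_range, iqr, variance, std_dev)
```

The hand-computed variance matches `statistics.pvariance(values)`; `statistics.variance(values)` would return the larger sample variance.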



The shape of the Distribution

From a histogram, we can quickly identify the shape of our data, which can actually tell us a lot about the measures of center and spread. The distribution of data is frequently associated with one of the three shapes:

  1. Right-skewed
A histogram that has shorter bins on the right and taller bins on the left is considered right-skewed. In this distribution, the mean is typically greater than the median.
Real-world examples: The amount of drug left in your bloodstream over time, human athletic abilities …

  2. Left-skewed
A histogram that has shorter bins on the left and taller bins on the right is considered left-skewed. In this distribution, the mean is typically less than the median.
Real-world examples: The age of death, asset price changes …

  3. Symmetric
Any distribution where you can draw a line down the middle and the right side mirrors the left side is considered symmetric. One of the most common symmetric distributions is the normal distribution, also called the 'bell curve'.
A symmetric distribution has a mean equal to its median; when it has a single peak, as the normal distribution does, the mode equals them too. Its box plot is also symmetric.
Real-world examples: Heights, weights, precipitation amount …


Note 1: Data in the real world can be messy and it might not follow any of these distributions.
Note 2: In a skewed distribution, the mean is pulled by the tail of the distribution while the median stays closer to the mode.
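To see the tail pulling the mean, here is a small sketch with made-up numbers. The left-skewed list is simply the mirror image of the right-skewed one, so the relationship between mean and median flips.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]  # hypothetical: long tail to the right
left_skewed = [21 - x for x in right_skewed]     # mirror image: long tail to the left

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print(f"right-skewed: mean={right_mean}, median={right_median}")
print(f"left-skewed:  mean={left_mean}, median={left_median}")
```

In the right-skewed list the mean (5.2) sits above the median (3); in the mirrored list the mean falls below the median.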


Outliers

Outliers are data points that fall very far from the rest of the values in a dataset. There are a number of different methods for deciding what counts as "very far". The method I usually use isn’t very scientific: I just plot the data and see whether any point lies really far from all the others.

A quick plot of your data can often help you understand a lot in a short amount of time.
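When plotting is not practical, one common rule of thumb (not the plotting approach described above) flags any point lying more than 1.5 times the IQR below Q1 or above Q3. The sketch below assumes that rule; the function name `iqr_outliers`, the multiplier default, and the data are all illustrative choices.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return the points lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

values = [10, 12, 12, 13, 14, 15, 15, 16, 95]  # hypothetical data
flagged = iqr_outliers(values)
print(flagged)
```

A flagged point is only a candidate: whether it is an error or a genuine extreme value still needs a human decision.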



To illustrate the impact that outliers can have on the summary statistics we report, let's consider the earnings of startups and companies. Imagine I select ten companies: nine are startups with earnings in the thousands of dollars, and the tenth is Facebook or Tesla. The mean, variance, and standard deviation are then incredibly misleading: none of the ten earnings figures is even close to the calculated mean. A better measure of center would certainly be the median.
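The effect is easy to reproduce with invented numbers. In the sketch below every figure is hypothetical, including the giant's earnings, which merely stand in for a company the size of Facebook or Tesla.

```python
import statistics

# Hypothetical yearly earnings in thousands of dollars for nine small startups.
earnings = [50, 60, 65, 70, 75, 80, 85, 90, 100]

# Add one giant (an invented figure, also in thousands of dollars).
earnings_with_giant = earnings + [20_000_000]

mean = statistics.mean(earnings_with_giant)
median = statistics.median(earnings_with_giant)
std_dev = statistics.pstdev(earnings_with_giant)

# Distance from each company's earnings to the mean: all are enormous.
closest_gap = min(abs(x - mean) for x in earnings_with_giant)

print(f"mean={mean}, median={median}, std_dev={std_dev:.0f}, closest_gap={closest_gap}")
```

The median stays at 77.5, close to the typical startup, while the mean jumps above two million and no single company is anywhere near it.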

Working with outliers

If you’re the one doing the reporting, here are some of my personal guidelines when analyzing data:


  1. Plot your data

  2. If there are outliers, determine how you should handle them. This might require an expert in the field. Should you remove them? Should you fix them? Should you keep them?

  3. If you're working with data that are normally distributed (the bell shape that we saw before), the mean and the standard deviation together fully describe the distribution. This may seem surprising but it's true. However, if you're working with skewed data, the five-number summary provides much more information than the mean and the standard deviation can.
Note: If you aren't sure whether your data are normally distributed, there are statistical methods, like the Kolmogorov-Smirnov test, that are designed to help you check.
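The claim about normally distributed data can be made concrete with `statistics.NormalDist` from the standard library: given only a mean and a standard deviation, it yields quartiles and the share of values in any interval. The parameters below (mean 100, standard deviation 15) are arbitrary choices for illustration.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Quartiles follow directly from the two parameters.
q1 = dist.inv_cdf(0.25)
q3 = dist.inv_cdf(0.75)

# Share of values within one and two standard deviations of the mean.
within_1sd = dist.cdf(115) - dist.cdf(85)
within_2sd = dist.cdf(130) - dist.cdf(70)

print(f"Q1={q1:.2f}, Q3={q3:.2f}")
print(f"within 1 sd: {within_1sd:.4f}, within 2 sd: {within_2sd:.4f}")
```

About 68% of values fall within one standard deviation and about 95% within two, whatever mean and standard deviation are chosen. Nothing comparable holds for skewed data, which is why the five-number summary is more informative there.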

How should we work with these outliers in practice?


At the very least, we should note that they exist and recognize the impact they have on our summary statistics. If the outliers are typos or data entry errors, that is a reason to remove these points or, if we know what they should be, to update them with the correct values. In cases like the example above (startups and Facebook), we might try to understand what was so different about the outlier compared to the other startups. How did this company become so successful? And why were its earnings so large in comparison? There is an entire field aimed at this idea, called anomaly detection.

A single number can be very misleading about what is actually happening in our data. Some statistics are more misleading than others. If you are the consumer of information based on data, which we all are, it's important to know how to ask the right questions regarding the statistics around you.

Descriptive vs. Inferential Statistics

The topics covered thus far have all been aimed at descriptive statistics, that is, describing the data we’ve collected. There’s an entire other field of statistics, known as inferential statistics, that’s aimed at drawing conclusions about a population of individuals based only on a sample of individuals from that population.

The vocabulary you need to know:


  1. Population: The entire collection of individuals or measurements you want to draw conclusions about.

  2. Sample: A subset of the population.

  3. Statistic: Any numeric summary calculated from the sample.

  4. Parameter: A numeric summary of the population. It is the target of inferential statistics: we usually don't know this number, because computing it requires information from the entire population.

Drawing conclusions regarding a parameter based on our statistics is known as inference.
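The four terms can be illustrated with a toy setup in which, unlike in real studies, the whole population is known. Everything here is hypothetical: the population is just the integers 1 to 10,000, and the seed is fixed only to make the sketch repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: fully known here, so the parameter can be computed.
population = list(range(1, 10_001))
parameter = statistics.mean(population)  # population mean

# In practice we only observe a sample and compute a statistic from it.
sample = random.sample(population, k=100)
statistic = statistics.mean(sample)      # sample mean

print(f"parameter={parameter}, statistic={statistic}")
```

The sample mean will usually land near the population mean but not exactly on it; quantifying that gap is what inferential statistics is about.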


Looking Ahead

Although this article does not dive deep into inferential statistics, you are now aware of the difference between these two branches of statistics. The way we perform inferential statistics is changing as technology evolves. Many career paths involving Machine Learning and Artificial Intelligence are aimed at using collected data to draw conclusions about entire populations at an individual level.

Summary

We started with identifying data types as either categorical or quantitative. Then we learned that we could identify quantitative data as either continuous or discrete, and categorical data as either ordinal or nominal.

When analyzing categorical variables, we commonly just look at the count or percent of a group that falls into each level of a category. When analyzing quantitative data there are four main aspects:

  1. Measures of Center
I. Means
II. Medians
III. Modes

  2. Measures of Spread
I. Range
II. Interquartile Range (IQR)
III. Variance
IV. Standard Deviation

  3. Shape of distribution
I. Right-skewed
II. Left-skewed
III. Symmetric

  4. Outliers
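For categorical variables, the counts and percentages mentioned at the start of this summary can be sketched with `collections.Counter`. The survey answers below are invented for illustration.

```python
from collections import Counter

# Hypothetical ordinal survey answers.
ratings = ["Good", "Very Good", "Bad", "Good", "Good", "Very Bad", "Very Good", "Good"]

counts = Counter(ratings)  # number of answers in each level
total = len(ratings)
percents = {level: 100 * n / total for level, n in counts.items()}

for level, n in counts.most_common():
    print(f"{level}: {n} ({percents[level]:.1f}%)")
```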


There are two types of statistics:
1. Descriptive statistics: Present, organize, summarize, and describe the collected data using the measures discussed throughout this article: measures of center, measures of spread, the shape of the distribution, and outliers. We can also use plots of our data to gain a better understanding.

2. Inferential statistics: This is where we run different tests on a sample and draw conclusions that we can generalize to a larger population. Performing inferential statistics well requires that we take a sample that accurately represents our population of interest.

Thanks For Reading! 😄


Khelifi Ahmed Aziz


Original article: https://medium.com/dataseries/understand-descriptive-statistics-c29282b7a62e






