Descriptive Statistics

Preamble

I've recently been trying to deploy an LLM privately for work, and I found it necessary to brush up on some statistics fundamentals in order to pull off customized quantization of models. This article is a good place to start.

Source and Resources

main/summary: Descriptive Statistics: Definition, Overview, Types, and Examples

kurtosis: Kurtosis: Definition, Types, and Importance

skew: Right Skewed vs. Left Skewed Distribution

scipy kurtosis: kurtosis — SciPy v1.14.0 Manual

scipy skew: skew — SciPy v1.14.0 Manual

Descriptive Statistics: Definition, Overview, Types, and Examples

By Adam Hayes

Updated June 27, 2024

Reviewed by Thomas Brock

Fact checked by Vikki Velasquez

What Are Descriptive Statistics?

Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include standard deviation, variance, minimum and maximum variables, kurtosis, and skewness.

Key Takeaways

  • Descriptive statistics summarize or describe the characteristics of a data set.
  • Descriptive statistics consist of three basic categories of measures: measures of central tendency, measures of variability (or spread), and frequency distribution.
  • Measures of central tendency describe the center of the data set (mean, median, mode).
  • Measures of variability describe the dispersion of the data set (variance, standard deviation).
  • Measures of frequency distribution describe the occurrence of data within the data set (count).

Understanding Descriptive Statistics

Descriptive statistics help describe and explain the features of a specific data set by giving short summaries about the sample and measures of the data. The most recognized types of descriptive statistics are measures of center. For example, the mean, median, and mode, which are used at almost all levels of math and statistics, are used to define and describe a data set. The mean, or the average, is calculated by adding all the figures within the data set and then dividing by the number of figures within the set.

For example, the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a data set is the value appearing most often, and the median is the figure situated in the middle of the sorted data set, separating the higher figures from the lower figures. In the data set above, the median is also 4. However, there are less common types of descriptive statistics that are still very important.
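
The numbers above can be checked with a minimal sketch using Python's standard `statistics` module. The second list, `with_repeat`, is a made-up data set added here only because (2, 3, 4, 5, 6) has no repeated value and therefore no meaningful mode.

```python
import statistics

data = [2, 3, 4, 5, 6]          # data set from the text
print(sum(data))                # 20
print(statistics.mean(data))    # 4
print(statistics.median(data))  # 4

# Hypothetical data set with a repeated value, to illustrate the mode.
with_repeat = [2, 3, 3, 5, 6]
print(statistics.mode(with_repeat))  # 3
```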

People use descriptive statistics to repurpose hard-to-understand quantitative insights across a large data set into bite-sized descriptions. A student's grade point average (GPA), for example, provides a good understanding of descriptive statistics. The idea of a GPA is that it takes data points from a range of individual course grades, and averages them together to provide a general understanding of a student's overall academic performance. A student's personal GPA reflects their mean academic performance.

Types of Descriptive Statistics

All descriptive statistics are either measures of central tendency or measures of variability, also known as measures of dispersion.

Central Tendency

Measures of central tendency focus on the average or middle values of data sets, whereas measures of variability focus on the dispersion of data. These two measures use graphs, tables, and general discussions to help people understand the meaning of the analyzed data.

Measures of central tendency describe the center position of a distribution for a data set. A person analyzes the frequency of each data point in the distribution and describes it using the mean, median, or mode, which measures the most common patterns of the analyzed data set.

Measures of Variability

Measures of variability (or measures of spread) aid in analyzing how dispersed the distribution is for a set of data. For example, while the measures of central tendency may give a person the average of a data set, they do not describe how the data is distributed within the set.

So while the average of the data might be 65 out of 100, there can still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set. Range, quartiles, absolute deviation, and variance are all examples of measures of variability.

Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which is calculated by subtracting the lowest number (5) in the data set from the highest (100).
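
As a small sketch of the range calculation above, together with two of the other spread measures mentioned earlier (variance and standard deviation), again using only the standard library. Treating the six numbers as a whole population rather than a sample is an assumption made here for illustration.

```python
import statistics

data = [5, 19, 24, 62, 91, 100]   # data set from the text

data_range = max(data) - min(data)
print(data_range)                 # 95

# Population variance and standard deviation (divide by n).
# For a sample, statistics.variance / statistics.stdev divide by n - 1 instead.
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)
print(pop_var, pop_std)
```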

Distribution

Distribution (or frequency distribution) refers to the number of times a data point occurs. Alternatively, it can be how many times a data point fails to occur. Consider this data set: male, male, female, female, female, other. The distribution of this data can be classified as:

  • The number of males in the data set is 2.
  • The number of females in the data set is 3.
  • The number of individuals identifying as other is 1.
  • The number of non-males is 4.
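
The counts in the list above can be reproduced with `collections.Counter`; this is only a sketch of the idea, not part of the source article.

```python
from collections import Counter

data = ["male", "male", "female", "female", "female", "other"]

counts = Counter(data)
print(counts["male"])     # 2
print(counts["female"])   # 3
print(counts["other"])    # 1

# "Fails to occur": every entry that is not "male".
non_male = len(data) - counts["male"]
print(non_male)           # 4
```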

Univariate vs. Bivariate

In descriptive statistics, univariate data analyzes only one variable. It is used to identify characteristics of a single trait and is not used to analyze any relationships or causations.

For example, imagine a room full of high school students. Say you wanted to gather the average age of the individuals in the room. This univariate data is only dependent on one factor: each person's age. By gathering this one piece of information from each person, adding the ages together, and dividing by the total number of people, you can determine the average age.

Bivariate data, on the other hand, attempts to link two variables by searching for correlation. Two types of data are collected, and the relationship between the two pieces of information is analyzed together. Because multiple variables are analyzed, this approach may also be referred to as multivariate.

Let's say each high school student in the example above takes a college assessment test, and we want to see whether older students are testing better than younger students. In addition to gathering the ages of the students, we need to find out each student's test score. Then, using data analytics, we mathematically or graphically depict whether there is a relationship between student age and test scores.
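
One way to depict such a relationship mathematically is the Pearson correlation coefficient. The sketch below computes it by hand; the ages and scores are invented for illustration, and `pearson_r` is a helper name chosen here, not something from the source.

```python
import math

# Hypothetical data: one (age, score) pair per student.
ages = [14, 15, 15, 16, 17, 18]
scores = [61, 64, 70, 68, 75, 80]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(ages, scores)
# A value near +1 means older students tend to score higher,
# near -1 the opposite, and near 0 no linear relationship.
print(round(r, 3))
```

A high correlation in this made-up data would still say nothing about causation; it only describes how the two variables move together.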

Note

The preparation and reporting of financial statements is an example of descriptive statistics. Analyzing that financial information to make decisions on the future is inferential statistics.

Descriptive Statistics and Visualizations

One essential aspect of descriptive statistics is graphical representation. Visualizing data distributions effectively can be incredibly powerful, and this is done in several ways.

Histograms are tools for displaying the distribution of numerical data. They divide the data into bins or intervals and represent the frequency or count of data points falling into each bin through bars of varying heights. Histograms help identify the shape of the distribution, central tendency, and variability of the data.

Another visualization is boxplots. Boxplots, also known as box-and-whisker plots, provide a concise summary of a data distribution by highlighting key summary statistics including the median (middle line inside the box), quartiles (edges of the box), the whiskers (lines extending from the box toward the smallest and largest non-outlying values), and potential outliers (individual points plotted beyond the whiskers). Boxplots visually depict the spread and skewness of the data and are particularly useful for comparing distributions across different groups or variables.
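
Both plots are built from simple numbers, which the following sketch computes without any plotting library. The scores are hypothetical, the 10-point bin width is an arbitrary choice, and the quartile convention (`method="inclusive"`) is one of several in use.

```python
import statistics
from collections import Counter

# Hypothetical test scores.
scores = [52, 55, 61, 64, 64, 68, 70, 71, 75, 79, 83, 98]

# Histogram: count how many scores fall into each 10-point bin.
bin_width = 10
bins = Counter((s // bin_width) * bin_width for s in scores)
for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")

# Boxplot ingredients: minimum, quartiles, and maximum.
# Other quartile conventions give slightly different q1 and q3.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
print(min(scores), q1, median, q3, max(scores))
```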

Descriptive Statistics and Outliers

Whenever descriptive statistics are being discussed, it's important to note outliers. Outliers are data points that significantly differ from other observations in a dataset. These could be errors, anomalies, or rare events within the data.

Detecting and managing outliers is a step in descriptive statistics to ensure accurate and reliable data analysis. To identify outliers, you can use graphical techniques (such as boxplots or scatter plots) or statistical methods (such as Z-score or IQR method). These approaches help pinpoint observations that deviate substantially from the overall pattern of the data.

The presence of outliers can have a notable impact on descriptive statistics, skewing results and affecting the interpretation of data. Outliers can disproportionately influence measures of central tendency, such as the mean, pulling it towards their extreme values. For example, the mean of the dataset (1, 1, 1, 997) is 250, even though that is hardly representative of the dataset. This distortion can lead to misleading conclusions about the typical behavior of the dataset.
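
The sketch below shows the pull on the mean for the data set above, then applies the IQR method mentioned earlier to a second, invented data set. `iqr_outliers`, the multiplier `k=1.5`, and the quartile convention are assumptions for illustration; 1.5 is the conventional boxplot fence but not the only choice.

```python
import statistics

data = [1, 1, 1, 997]            # data set from the text
print(statistics.mean(data))     # 250
print(statistics.median(data))   # 1.0  (the median is not pulled by the outlier)

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings with one suspicious value.
readings = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]
print(iqr_outliers(readings))    # [95]
```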

Depending on the context, outliers can often be treated by removing them (if they are genuinely erroneous or irrelevant). Alternatively, outliers may hold important information and should be kept for the value they may be able to demonstrate. As you analyze your data, consider the relevance of what outliers can contribute and whether it makes more sense to just strike those data points from your descriptive statistic calculations.

Descriptive Statistics vs. Inferential Statistics

Descriptive statistics have a different function from inferential statistics, which use the data in one data set to make decisions about, or apply characteristics to, another data set.

Imagine another example where a company sells hot sauce. The company gathers data such as the count of sales, average quantity purchased per transaction, and average sale per day of the week. All of this information is descriptive, as it tells a story of what actually happened in the past. In this case, it is not being used beyond being informational.

Now let's say that the company wants to roll out a new hot sauce. It gathers the same sales data above, but it uses the information to make predictions about what the sales of the new hot sauce will be. Taking descriptive statistics and applying their characteristics to a different data set is what makes this inferential statistics. We are no longer simply summarizing data; we are using it to predict what will happen regarding an entirely different body of data (in this case, the new hot sauce product).

What Is Skewness?

Skewness is the degree of asymmetry observed in a probability distribution. When data points on a bell curve are not distributed symmetrically to the left and right sides of the median, the bell curve is skewed. Distributions can be positive and right-skewed, or negative and left-skewed. A normal distribution exhibits zero skewness.

Key Takeaways

  • Skewness is the degree of asymmetry observed in a probability distribution.
  • Distributions can be positive and right-skewed, or negative and left-skewed. A normal distribution exhibits zero skewness.
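
As a rough sketch of how skewness can be computed, the function below implements the Fisher-Pearson coefficient (third central moment divided by the 1.5 power of the second). As far as I can tell from the SciPy manual listed in the sources, this is the same biased quantity that `scipy.stats.skew` returns with its default arguments. The three data sets are hypothetical.

```python
def skewness(values):
    """Fisher-Pearson coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # long tail toward large values
left_skewed = [-v for v in right_skewed]    # mirror image, long tail to the left

print(skewness(symmetric))      # 0.0
print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
```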

What Is Kurtosis?

Kurtosis is a statistical measure used to describe a characteristic of a dataset. When normally distributed data is plotted on a graph, it generally takes the form of a bell. This is called the bell curve. The plotted data that are farthest from the mean of the data usually form the tails on each side of the curve. Kurtosis indicates how much data resides in the tails.

Key Takeaways

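  • Kurtosis describes the “fatness” of the tails found in probability distributions. (Blogger's note: or their length. Kurtosis averages the fourth power of each point's deviation from the mean, scaled by the standard deviation, so points far from the mean carry most of the weight.)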
  • There are three kurtosis categories: mesokurtic (normal), platykurtic (less than normal), and leptokurtic (more than normal).
  • Kurtosis risk is a measurement of how often an investment’s price moves dramatically.
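
The first takeaway can be made concrete with one common definition of kurtosis for a data set of n values, the fourth standardized moment. The notation below is an assumption of this write-up rather than something taken from the source article.

```latex
\mathrm{Kurt} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}
                     {\left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\right)^2},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

Because the deviations are raised to the fourth power, values near the mean contribute almost nothing, while values in the tails dominate the sum. Under this definition a normal distribution has a kurtosis of 3.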

Types of Kurtosis

There are three categories of kurtosis that a set of data can display: mesokurtic, leptokurtic, and platykurtic. All measures of kurtosis are compared against a normal distribution curve.


Mesokurtic (Kurtosis = 3.0)

The first category of kurtosis is mesokurtic distribution. This distribution has a kurtosis similar to that of the normal distribution, meaning the extreme value characteristic of the distribution is similar to that of a normal distribution. Therefore, a stock with a mesokurtic distribution generally depicts a moderate level of risk.

Leptokurtic (Kurtosis > 3.0)

The second category is leptokurtic distribution. Any distribution that is leptokurtic displays greater kurtosis than a mesokurtic distribution. This distribution appears as a curve with long tails (outliers). The “skinniness” of a leptokurtic distribution is a consequence of the outliers, which stretch the horizontal axis of the histogram graph, making the bulk of the data appear in a narrow (“skinny”) vertical range.

A stock with a leptokurtic distribution generally depicts a high level of risk but the possibility of higher returns, because the stock has typically demonstrated large price movements.

While a leptokurtic distribution may be “skinny” in the center, it also features “fat tails.”

Platykurtic (Kurtosis < 3.0)

The final type of distribution is platykurtic distribution. These types of distributions have short tails (fewer outliers). Platykurtic distributions have demonstrated more stability than other curves because extreme price movements rarely occurred in the past. This translates into a less-than-moderate level of risk.
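
The three categories can be illustrated with a short sketch. The `kurtosis` function uses the convention where a normal distribution scores 3, matching the headings above. Note that, per the SciPy manual listed in the sources, `scipy.stats.kurtosis` defaults to excess kurtosis (`fisher=True`), where a normal distribution scores 0, so `fisher=False` would be needed to match this article. The data sets and the tolerance band `tol` in `classify` are arbitrary choices for illustration.

```python
import random

def kurtosis(values, excess=False):
    """Pearson kurtosis m4 / m2**2 (normal = 3); excess=True subtracts 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    k = m4 / m2 ** 2
    return k - 3.0 if excess else k

def classify(k, tol=0.5):
    """Label a Pearson kurtosis value; tol is an arbitrary band around 3."""
    if k > 3.0 + tol:
        return "leptokurtic"
    if k < 3.0 - tol:
        return "platykurtic"
    return "mesokurtic"

# Hypothetical data sets.
rng = random.Random(0)                                  # fixed seed for repeatability
normal_like = [rng.gauss(0, 1) for _ in range(10_000)]  # roughly normal
flat = list(range(1, 101))                              # evenly spread, short tails
heavy = [0] * 96 + [-10, 10, -1, 1]                     # mostly identical, a few extremes

for name, sample in [("normal_like", normal_like), ("flat", flat), ("heavy", heavy)]:
    k = kurtosis(sample)
    print(name, round(k, 2), classify(k))
```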
