After reading this blog, you will know:
- That exploratory data analysis, data summarization, and data visualization can be used to help frame your predictive modeling problem and better understand the data.
- That statistical methods can be used to clean and prepare data ready for modeling.
- That statistical hypothesis tests and estimation statistics can aid in model selection and in presenting the skill and predictions of final models.
1.1 Overview
We are going to look at 10 examples of where statistical methods are used in an applied machine learning project. This will demonstrate that a working knowledge of statistics is essential for successfully working through a predictive modeling problem.
- Problem Framing
- Data Understanding
- Data Cleaning
- Data Selection
- Data Preparation
- Model Evaluation
- Model Configuration
- Model Selection
- Model Presentation
- Model Predictions
1.2 Problem Framing
Perhaps the point of greatest leverage in a predictive modeling problem is the framing of the problem.
Statistical methods that can aid in the exploration of the data during the framing of a problem include:
- Exploratory Data Analysis. Summarization and visualization in order to explore ad hoc views of the data.
- Data Mining. Automatic discovery of structured relationships and patterns in the data.
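As a minimal sketch of exploratory summarization, the snippet below computes an ad hoc view of a tiny made-up dataset (the `hours_studied` and `passed` variables are purely illustrative):

```python
# EDA sketch: an ad hoc view of a small, made-up dataset.
data = [
    {"hours_studied": 1.0, "passed": 0},
    {"hours_studied": 2.5, "passed": 0},
    {"hours_studied": 3.0, "passed": 1},
    {"hours_studied": 4.5, "passed": 1},
    {"hours_studied": 5.0, "passed": 1},
]

# Ad hoc view: mean hours studied within each outcome group.
groups = {}
for row in data:
    groups.setdefault(row["passed"], []).append(row["hours_studied"])
group_means = {k: sum(v) / len(v) for k, v in groups.items()}
print(group_means)
```

Views like this, built quickly and discarded just as quickly, are what help you decide whether the data can support the framing you have in mind.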
1.3 Data Understanding
Data understanding means having an intimate grasp of both the distributions of variables and the relationships between variables.
Two large branches of statistical methods are used to aid in understanding data:
- Summary Statistics. Methods used to summarize the distribution and relationships between variables using statistical quantities.
- Data Visualizations. Methods used to summarize the distribution and relationships between variables using visualizations such as charts, plots, and graphs.
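A brief sketch of summary statistics, using two illustrative variables and the standard library's `statistics` module (the Pearson correlation is computed by hand to keep every step visible):

```python
import statistics

# Summary statistics on two illustrative variables.
x = [2.1, 2.9, 3.2, 4.8, 5.0, 6.3, 7.1]
y = [1.0, 1.4, 1.7, 2.6, 2.4, 3.1, 3.6]

mean_x = statistics.mean(x)
stdev_x = statistics.stdev(x)     # sample standard deviation
median_x = statistics.median(x)

# Pearson correlation summarizes the linear relationship between x and y.
mean_y = statistics.mean(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
corr = cov / (stdev_x * statistics.stdev(y))
```

The same quantities are what a box plot or scatter plot would show visually: the location and spread of each variable, and how strongly the two move together.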
1.4 Data Cleaning
Data used for modeling often contains issues such as:
- Data corruption
- Data errors
- Data loss
The process of identifying and repairing these issues is called data cleaning. Statistical methods are used for data cleaning; for example:
- Outlier detection. Methods for identifying observations that are far from the expected value in a distribution.
- Imputation. Methods for repairing or filling in corrupt or missing values in observations.
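Both ideas can be sketched in a few lines. Here, a z-score rule flags a suspect observation and the mean fills in a missing value; the data, the `None` marker for missingness, and the two-standard-deviation threshold are all illustrative choices (two or three standard deviations are common conventions):

```python
import statistics

# Sketch: flag outliers with a z-score rule, impute missing values with the mean.
values = [5.1, 4.8, 5.3, 5.0, None, 4.9, 25.0, 5.2]  # None = missing, 25.0 = suspect

observed = [v for v in values if v is not None]
mu = statistics.mean(observed)
sigma = statistics.stdev(observed)

outliers = [v for v in observed if abs(v - mu) / sigma > 2]  # z-score rule
imputed = [mu if v is None else v for v in values]           # mean imputation
```

In practice you would inspect flagged outliers rather than delete them automatically, and choose an imputation strategy that suits the variable's distribution.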
1.5 Data Selection
The process of reducing the scope of data to those elements that are most useful for making predictions is called data selection. Two types of statistical methods that are used for data selection include:
- Data Sample. Methods to systematically create smaller representative samples from larger datasets.
- Feature Selection. Methods to automatically identify those variables that are most relevant to the outcome variable.
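Both can be sketched on synthetic data. Below, a random sample shrinks the dataset, and a simple filter-style selection ranks two features by their absolute correlation with the outcome; the data, weights, and seed are all made up for illustration:

```python
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va ** 0.5 * vb ** 0.5)

random.seed(1)
rows = [[random.random(), random.random()] for _ in range(100)]
# Synthetic outcome: driven mostly by feature 0, only weakly by feature 1.
outcome = [r[0] * 2.0 + r[1] * 0.1 for r in rows]

# Data sampling: a smaller random sample of the rows.
sample = random.sample(rows, 20)

# Feature selection: rank features by absolute correlation with the outcome.
scores = [abs(pearson([r[j] for r in rows], outcome)) for j in range(2)]
ranked = sorted(range(2), key=lambda j: -scores[j])
```

A correlation filter is only one of many selection methods, but it shows the shape of the idea: score each variable against the outcome, then keep the strongest.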
1.6 Data Preparation
Data often cannot be used directly for modeling. Some transformation is usually required to change the shape or structure of the data, making it more suitable for the chosen framing of the problem or the learning algorithms. Data preparation is performed using statistical methods. Some common examples include:
- Scaling. Methods such as standardization and normalization.
- Encoding. Methods such as integer encoding and one hot encoding.
- Transforms. Methods such as power transforms like the Box-Cox method.
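The first two kinds of preparation can be sketched directly on illustrative values:

```python
import statistics

raw = [2.0, 4.0, 6.0, 8.0]

# Standardization: zero mean, unit (sample) standard deviation.
mu, sigma = statistics.mean(raw), statistics.stdev(raw)
standardized = [(v - mu) / sigma for v in raw]

# Normalization: rescale to the [0, 1] range.
lo, hi = min(raw), max(raw)
normalized = [(v - lo) / (hi - lo) for v in raw]

# Integer and one hot encoding of a categorical variable.
labels = ["red", "green", "blue", "green"]
vocab = sorted(set(labels))                # ['blue', 'green', 'red']
integer = [vocab.index(l) for l in labels]
one_hot = [[1 if i == code else 0 for i in range(len(vocab))]
           for code in integer]
```

Power transforms such as Box-Cox involve estimating a transform parameter from the data and are typically applied with a library routine such as `scipy.stats.boxcox` rather than written by hand.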
1.7 Model Evaluation
A crucial part of a predictive modeling problem is evaluating a learning method. This often requires estimating the skill of the model when making predictions on data not seen during training.
This estimation is the subject of a whole subfield of statistical methods:
- Experimental Design. Methods to design systematic experiments to compare the effect of independent variables on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.
As part of implementing an experimental design, methods are used to resample a dataset to make economical use of the available data when estimating the skill of the model.
- Resampling Methods. Methods for systematically splitting a dataset into subsets for the purposes of training and evaluating a predictive model.
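The core of a resampling method like k-fold cross-validation fits in a few lines. This is a minimal sketch (the dataset size, number of folds, and seed are arbitrary):

```python
import random

def kfold_indices(n, k, seed=7):
    """Split indices 0..n-1 into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 3)

# Each fold serves once as the evaluation set; the rest form the training set.
splits = []
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    splits.append((sorted(train), sorted(test_fold)))
```

Each observation is evaluated on exactly once, which is what makes the resulting skill estimate an economical use of the data.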
1.8 Model Configuration
It is common to vary the configuration of a learning algorithm and interpret the effect on estimated model skill. Two classes of statistical methods can be used for this:
- Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
- Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
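As a sketch of the hypothesis-testing approach, the snippet below computes a paired Student's t-statistic over per-fold accuracies of two configurations. The scores are made up, and 2.262 is the two-tailed t critical value for 9 degrees of freedom at a 0.05 significance level:

```python
import statistics

# Paired t-statistic comparing per-fold accuracy of two configurations.
config_a = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82, 0.81, 0.80]
config_b = [0.84, 0.86, 0.83, 0.85, 0.88, 0.84, 0.82, 0.87, 0.85, 0.84]

diffs = [b - a for a, b in zip(config_a, config_b)]
n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / n ** 0.5)
significant = abs(t_stat) > 2.262  # reject "no difference" if True
```

A library routine (e.g. `scipy.stats.ttest_rel`) would also return a p-value; the manual version just makes the arithmetic behind the test visible.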
1.9 Model Selection
The process of selecting one method as the solution is called model selection.
As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection.
- Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
- Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
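The estimation-statistics view of the same comparison reports a confidence interval on the difference in mean skill rather than a yes/no test result. A minimal sketch, with illustrative scores and 2.262 as the t critical value for 9 degrees of freedom at 95% confidence:

```python
import statistics

# Confidence interval on the difference in mean per-fold skill of two models.
model_a = [0.75, 0.78, 0.74, 0.77, 0.76, 0.75, 0.79, 0.74, 0.76, 0.77]
model_b = [0.79, 0.80, 0.78, 0.81, 0.77, 0.80, 0.82, 0.78, 0.79, 0.81]

diffs = [b - a for a, b in zip(model_a, model_b)]
mean_diff = statistics.mean(diffs)
margin = 2.262 * statistics.stdev(diffs) / len(diffs) ** 0.5
interval = (mean_diff - margin, mean_diff + margin)

# If the interval excludes zero, the skill difference is unlikely to be chance.
better = interval[0] > 0
```

The interval communicates both the direction and the size of the difference, which is often more useful for choosing a model than a p-value alone.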
1.10 Model Presentation
Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals.
- Estimation Statistics. Methods that quantify the uncertainty in the skill of a model via confidence intervals.
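A common way to present classification skill is a confidence interval on accuracy, via the normal approximation to a binomial proportion. A sketch with illustrative counts, where 1.96 is the z critical value for 95% confidence:

```python
# 95% confidence interval on classification accuracy (normal approximation).
correct, total = 88, 100
acc = correct / total
margin = 1.96 * (acc * (1 - acc) / total) ** 0.5
interval = (acc - margin, acc + margin)
```

This supports a statement like "the model achieved an accuracy of 88%, with a 95% confidence interval of roughly 82% to 94%", which conveys uncertainty alongside the headline number.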
1.11 Model Predictions
The predictions made by a final model carry uncertainty. We can use methods from the field of estimation statistics to quantify this uncertainty, such as confidence intervals and prediction intervals.
- Estimation Statistics. Methods that quantify the uncertainty for a prediction via prediction intervals.
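As a simplified sketch of a prediction interval, the snippet below fits a least-squares line and bounds a new prediction using the standard deviation of the residuals. The data are made up, and this ignores the extra variance terms a full treatment of regression prediction intervals includes:

```python
import statistics

# Illustrative data with a roughly linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9]

# Least-squares fit of y = b0 + b1 * x.
mx, my = statistics.mean(x), statistics.mean(y)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

# Approximate 95% prediction interval from the residual spread (1.96 = z value).
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
stdev_res = statistics.stdev(residuals)

x_new = 9
y_hat = b0 + b1 * x_new
interval = (y_hat - 1.96 * stdev_res, y_hat + 1.96 * stdev_res)
```

Where a confidence interval bounds an estimated population quantity such as mean skill, a prediction interval bounds a single future observation, so it is the right object to report alongside an individual prediction.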