About Examples of Statistics in Machine Learning

After reading this post, you will know:

  • That exploratory data analysis, data summarization, and data visualizations can be used to help frame your predictive modeling problem and better understand the data.
  • That statistical methods can be used to clean and prepare data ready for modeling.
  • That statistical hypothesis tests and estimation statistics can aid in model selection and in presenting the skill and predictions from final models.

1.1 Overview

We are going to look at 10 examples of where statistical methods are used in an applied machine learning project. This will demonstrate that a working knowledge of statistics is essential for successfully working through a predictive modeling problem.

  1. Problem Framing
  2. Data Understanding
  3. Data Cleaning
  4. Data Selection
  5. Data Preparation
  6. Model Evaluation
  7. Model Configuration
  8. Model Selection
  9. Model Presentation
  10. Model Predictions

1.2 Problem Framing

Perhaps the point of biggest leverage in a predictive modeling problem is the framing of the problem.

Statistical methods that can aid in the exploration of the data during the framing of a problem include:

  • Exploratory Data Analysis. Summarization and visualization in order to explore ad hoc views of the data.
  • Data Mining. Automatic discovery of structured relationships and patterns in the data.

1.3 Data Understanding

Data understanding means having an intimate grasp of both the distributions of variables and the relationships between variables.

Two large branches of statistical methods are used to aid in understanding data:

  • Summary Statistics. Methods used to summarize the distribution and relationships between variables using statistical quantities.
  • Data Visualizations. Methods used to summarize the distribution and relationships between variables using visualizations such as charts, plots, and graphs.
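
As a minimal sketch of the summary statistics idea, the snippet below computes a few descriptive quantities for one variable and the Pearson correlation between two variables, using only the Python standard library. The helper names (`summarize`, `pearson`) and the toy data are assumptions made for illustration, not part of the original post.

```python
import math
import statistics


def summarize(values):
    """Return basic summary statistics for one numeric variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical toy data: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]

summary = summarize(hours)   # distribution of a single variable
r = pearson(hours, scores)   # relationship between two variables
print(summary)
print(round(r, 3))
```

A correlation close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. A plot of the same data would usually be inspected alongside these numbers.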

1.4 Data Cleaning

Observations from a domain are rarely pristine. The data may suffer from problems such as:

  • Data corruption
  • Data errors
  • Data loss

The process of identifying and repairing issues with the data is called data cleaning. Statistical methods used for data cleaning include:

  • Outlier detection. Methods for identifying observations that are far from the expected value in a distribution.
  • Imputation. Methods for repairing or filling in corrupt or missing values in observations.
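
The two cleaning methods above could be sketched as follows. This is one common choice among many: outliers are flagged with the interquartile range (IQR) rule and missing values are filled with the median. The function names, the `k=1.5` threshold, and the sample readings are illustrative assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values that lie more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def impute_median(values):
    """Replace missing entries (None) with the median of observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0]
outliers = iqr_outliers(readings)

# Hypothetical column with missing entries.
with_gaps = [4.0, None, 6.0, 8.0, None]
filled = impute_median(with_gaps)

print(outliers)
print(filled)
```

Whether a flagged value is truly an error or a genuine rare observation is a judgment call that depends on the domain.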

1.5 Data Selection

The process of reducing the scope of data to those elements that are most useful for making predictions is called data selection. Two types of statistical methods used for data selection are:

  • Data Sample. Methods to systematically create smaller representative samples from larger datasets.
  • Feature Selection. Methods to automatically identify those variables that are most relevant to the outcome variable.
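
A rough sketch of both ideas: a simple random sample drawn without replacement, and a naive feature ranking by absolute Pearson correlation with the outcome. Correlation ranking is only one simple filter method and ignores nonlinear relationships and interactions. The feature names and values are made up for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_features(features, target):
    """Order feature names by absolute correlation with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Data sample: draw 100 rows without replacement from 1,000 row indices.
rng = random.Random(0)  # fixed seed so the sample is reproducible
population = list(range(1000))
sample = rng.sample(population, k=100)

# Feature selection: hypothetical housing features and prices.
features = {
    "size": [50, 60, 80, 100, 120],
    "noise": [3, 1, 4, 1, 5],
}
price = [150, 180, 240, 310, 360]
ranking = rank_features(features, price)
print(ranking)
```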

1.6 Data Preparation

Data often cannot be used directly for modeling. Some transformation is usually required to change the shape or structure of the data so that it better suits the chosen framing of the problem or the learning algorithm. Data preparation is performed using statistical methods. Some common examples include:

  • Scaling. Methods such as standardization and normalization.
  • Encoding. Methods such as integer encoding and one hot encoding.
  • Transforms. Methods such as power transforms like the Box-Cox method.
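
The scaling and encoding methods above can be illustrated with a short sketch in plain Python. In practice a library would normally be used; the helper names and sample values here are assumptions for illustration only.

```python
import statistics


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Normalize values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """One hot encode labels; columns follow the sorted category order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


ages = [20.0, 30.0, 40.0, 50.0]
z = standardize(ages)
scaled = min_max(ages)
encoded = one_hot(["red", "green", "red", "blue"])  # columns: blue, green, red

print(z)
print(scaled)
print(encoded)
```

The scaling parameters (mean, standard deviation, minimum, maximum) would normally be estimated on the training data only and then reused on new data.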

1.7 Model Evaluation

A crucial part of a predictive modeling problem is evaluating a learning method. This often requires the estimation of the skill of the model when making predictions on data not seen during the training of the model.

Planning how a predictive model is trained and evaluated is called experimental design, which is a whole subfield of statistical methods.

  • Experimental Design. Methods to design systematic experiments to compare the effect of independent variables on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.

As part of implementing an experimental design, a dataset is resampled to make economical use of the available data when estimating the skill of the model.

  • Resampling Methods. Methods for systematically splitting a dataset into subsets for the purposes of training and evaluating a predictive model.
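
One widely used resampling method is k-fold cross-validation. The sketch below only generates the train and test index splits; fitting and scoring a model on each split is left out. The function name `k_fold_indices` and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(sorted(train), sorted(test))
```

Each observation appears in exactly one test fold, so every row is used for both training and evaluation across the k runs.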

1.8 Model Configuration

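Machine learning algorithms often have hyperparameters that tailor the method to a specific problem. Two classes of statistical methods can be used to interpret and compare the results of different hyperparameter configurations:

  • Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).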
  • Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
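
As one possible illustration of a hypothesis test in this setting, the sketch below runs an exact paired sign-flip (permutation) test on the per-fold scores of two configurations. This is a stand-in chosen because it needs no external libraries; other tests, such as a paired t-test, are also commonly used. The scores are invented for the example.

```python
import itertools
import statistics


def paired_permutation_test(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired score differences.

    Under the null hypothesis of no difference, the sign of each paired
    difference is equally likely to be positive or negative.
    Enumerates 2**n sign patterns, so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.mean(diffs))
    count = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        flipped = abs(statistics.mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            count += 1
        total += 1
    return count / total


# Hypothetical accuracy of two hyperparameter settings on the same 5 folds.
config_a = [0.81, 0.79, 0.84, 0.80, 0.82]
config_b = [0.78, 0.77, 0.80, 0.79, 0.78]

p_value = paired_permutation_test(config_a, config_b)
print(p_value)
```

With n paired scores, the smallest two-sided p-value this test can produce is 2 / 2**n, so a handful of folds cannot give strong evidence on its own.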

1.9 Model Selection

The process of selecting one method as the solution is called model selection.

As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection.

  • Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
  • Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
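
For the estimation statistics side, a percentile bootstrap gives a simple confidence interval for the mean difference in skill between two candidate models. This is a sketch under assumptions: the scores are invented, the function name is hypothetical, and the percentile bootstrap is only one of several interval constructions.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical accuracy of two models on the same 10 folds.
model_a = [0.82, 0.85, 0.80, 0.83, 0.84, 0.81, 0.86, 0.82, 0.83, 0.85]
model_b = [0.79, 0.81, 0.78, 0.80, 0.80, 0.79, 0.82, 0.80, 0.80, 0.81]
diffs = [a - b for a, b in zip(model_a, model_b)]

lower, upper = bootstrap_ci(diffs)
print(lower, upper)
```

If the interval for the difference excludes zero, the data are consistent with one model being more skillful than the other, and the width of the interval shows how precisely that difference is estimated.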

1.10 Model Presentation

Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals.

  • Estimation Statistics. Methods that quantify the uncertainty in the skill of a model via confidence intervals.
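
As an illustrative sketch, a confidence interval for classification accuracy can be approximated with the normal approximation to the binomial distribution. The accuracy and test-set size below are hypothetical, and the approximation becomes unreliable for small samples or accuracies near 0 or 1.

```python
import math
from statistics import NormalDist


def accuracy_confidence_interval(accuracy, n_samples, confidence=0.95):
    """Normal-approximation confidence interval for a classification accuracy."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Hypothetical result: 88% accuracy measured on 500 held-out examples.
low, high = accuracy_confidence_interval(0.88, 500)
print(round(low, 3), round(high, 3))
```

Reporting the interval alongside the point estimate tells stakeholders how much the measured skill might vary on a different sample of the same size.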

1.11 Model Predictions

We can use methods from the field of estimation statistics, such as confidence intervals and prediction intervals, to quantify the uncertainty in a model's predictions.

  • Estimation Statistics. Methods that quantify the uncertainty for a prediction via prediction intervals.
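
A rough sketch of a prediction interval for a simple linear regression is shown below. It is deliberately simplified: it assumes Gaussian residuals, uses a normal quantile rather than a t quantile, and ignores the uncertainty in the fitted coefficients, so the interval will be somewhat too narrow for small samples. The data and function names are illustrative assumptions.

```python
import math
from statistics import NormalDist, mean


def fit_line(x, y):
    """Least-squares slope and intercept for a single input variable."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def prediction_interval(x, y, x_new, confidence=0.95):
    """Approximate prediction interval: y_hat +/- z * residual std. error."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    y_hat = slope * x_new + intercept
    return y_hat - z * s, y_hat, y_hat + z * s


# Hypothetical training data that is roughly linear.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

low, y_hat, high = prediction_interval(x, y, x_new=7.0)
print(round(low, 2), round(y_hat, 2), round(high, 2))
```

Unlike a confidence interval on model skill, a prediction interval describes the range in which a single new observation is expected to fall.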