After reading this blog, you will know:
- That exploratory data analysis, data summarization, and data visualization can be used to help frame your predictive modeling problem and better understand the data.
- That statistical methods can be used to clean and prepare data ready for modeling.
- That statistical hypothesis tests and estimation statistics can aid in model selection and in presenting the skill and predictions of final models.
1.1 Overview
We are going to look at 10 examples of where statistical methods are used in an applied machine learning project. This will demonstrate that a working knowledge of statistics is essential for successfully working through a predictive modeling problem.
- Problem Framing
- Data Understanding
- Data Cleaning
- Data Selection
- Data Preparation
- Model Evaluation
- Model Configuration
- Model Selection
- Model Presentation
- Model Predictions
1.2 Problem Framing
Perhaps the point of greatest leverage in a predictive modeling problem is the framing of the problem.
Statistical methods that can aid in the exploration of the data during the framing of a problem include:
- Exploratory Data Analysis. Summarization and visualization in order to explore ad hoc views of the data.
- Data Mining. Automatic discovery of structured relationships and patterns in the data.
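As a minimal sketch of exploratory summarization, the snippet below computes an ad hoc view of a tiny made-up dataset (the `hours_studied` and `passed` variables are purely illustrative):

```python
# EDA sketch: an ad hoc view of a small, made-up dataset.
data = [
    {"hours_studied": 1.0, "passed": 0},
    {"hours_studied": 2.5, "passed": 0},
    {"hours_studied": 3.0, "passed": 1},
    {"hours_studied": 4.5, "passed": 1},
    {"hours_studied": 5.0, "passed": 1},
]

# Ad hoc view: mean hours studied within each outcome group.
groups = {}
for row in data:
    groups.setdefault(row["passed"], []).append(row["hours_studied"])
group_means = {k: sum(v) / len(v) for k, v in groups.items()}
print(group_means)
```

Views like this, built quickly and discarded just as quickly, are what help you decide whether the data can support the framing you have in mind.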
1.3 Data Understanding
Data understanding means having an intimate grasp of both the distributions of variables and the relationships between variables.
Two large branches of statistical methods are used to aid in understanding data:
- Summary Statistics. Methods used to summarize the distribution and relationships between variables using statistical quantities.
- Data Visualizations. Methods used to summarize the distribution and relationships between variables using visualizations such as charts, plots, and graphs.
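A brief sketch of summary statistics, using two illustrative variables and the standard library's `statistics` module (the Pearson correlation is computed by hand to keep every step visible):

```python
import statistics

# Summary statistics on two illustrative variables.
x = [2.1, 2.9, 3.2, 4.8, 5.0, 6.3, 7.1]
y = [1.0, 1.4, 1.7, 2.6, 2.4, 3.1, 3.6]

mean_x = statistics.mean(x)
stdev_x = statistics.stdev(x)     # sample standard deviation
median_x = statistics.median(x)

# Pearson correlation summarizes the linear relationship between x and y.
mean_y = statistics.mean(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
corr = cov / (stdev_x * statistics.stdev(y))
```

The same quantities are what a box plot or scatter plot would show visually: the location and spread of each variable, and how strongly the two move together.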
1.4 Data Cleaning
Data used for modeling often contains issues such as:
- Data corruption
- Data errors
- Data loss
The process of identifying and repairing these issues is called data cleaning. Statistical methods are used for data cleaning; for example:
- Outlier detection. Methods for identifying observations that are far from the expected value in a distribution.
- Imputation. Methods for repairing or filling in corrupt or missing values in observations.
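Both ideas can be sketched in a few lines. Here, a z-score rule flags a suspect observation and the mean fills in a missing value; the data, the `None` marker for missingness, and the two-standard-deviation threshold are all illustrative choices (two or three standard deviations are common conventions):

```python
import statistics

# Sketch: flag outliers with a z-score rule, impute missing values with the mean.
values = [5.1, 4.8, 5.3, 5.0, None, 4.9, 25.0, 5.2]  # None = missing, 25.0 = suspect

observed = [v for v in values if v is not None]
mu = statistics.mean(observed)
sigma = statistics.stdev(observed)

outliers = [v for v in observed if abs(v - mu) / sigma > 2]  # z-score rule
imputed = [mu if v is None else v for v in values]           # mean imputation
```

In practice you would inspect flagged outliers rather than delete them automatically, and choose an imputation strategy that suits the variable's distribution.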
1.5 Data Selection
The process of reducing the scope of data to those elements that are most useful for making predictions is called data selection. Two types of statistical methods that are used for data selection include:
- Data Sample. Methods to systematically create smaller representative samples from larger datasets.
- Feature Selection. Methods to automatically identify those variables that are most relevant to the outcome variable.
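Both can be sketched on synthetic data. Below, a random sample shrinks the dataset, and a simple filter-style selection ranks two features by their absolute correlation with the outcome; the data, weights, and seed are all made up for illustration:

```python
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va ** 0.5 * vb ** 0.5)

random.seed(1)
rows = [[random.random(), random.random()] for _ in range(100)]
# Synthetic outcome: driven mostly by feature 0, only weakly by feature 1.
outcome = [r[0] * 2.0 + r[1] * 0.1 for r in rows]

# Data sampling: a smaller random sample of the rows.
sample = random.sample(rows, 20)

# Feature selection: rank features by absolute correlation with the outcome.
scores = [abs(pearson([r[j] for r in rows], outcome)) for j in range(2)]
ranked = sorted(range(2), key=lambda j: -scores[j])
```

A correlation filter is only one of many selection methods, but it shows the shape of the idea: score each variable against the outcome, then keep the strongest.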
1.6 Data Preparation
Data often cannot be used directly for modeling. Some transformation is usually required to change the shape or structure of the data, making it more suitable for the chosen framing of the problem or the learning algorithms. Data preparation is performed using statistical methods. Some common examples include:
- Scaling. Methods such as standardization and normalization.
- Encoding. Methods such as integer encoding and one hot encoding.
- Transforms. Methods such as power transforms like the Box-Cox method.
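The first two kinds of preparation can be sketched directly on illustrative values:

```python
import statistics

raw = [2.0, 4.0, 6.0, 8.0]

# Standardization: zero mean, unit (sample) standard deviation.
mu, sigma = statistics.mean(raw), statistics.stdev(raw)
standardized = [(v - mu) / sigma for v in raw]

# Normalization: rescale to the [0, 1] range.
lo, hi = min(raw), max(raw)
normalized = [(v - lo) / (hi - lo) for v in raw]

# Integer and one hot encoding of a categorical variable.
labels = ["red", "green", "blue", "green"]
vocab = sorted(set(labels))                # ['blue', 'green', 'red']
integer = [vocab.index(l) for l in labels]
one_hot = [[1 if i == code else 0 for i in range(len(vocab))]
           for code in integer]
```

Power transforms such as Box-Cox involve estimating a transform parameter from the data and are typically applied with a library routine such as `scipy.stats.boxcox` rather than written by hand.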
1.7 Model Evaluation
A crucial part of a predictive modeling problem is evaluating a learning method. This often requires estimating the skill of the model when making predictions on data not seen during training.
This estimation is the subject of a whole subfield of statistical methods:
- Experimental Design. Methods to design systematic experiments to compare the effect of independent variables on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.
As part of implementing an experimental design, methods are used to resample a dataset to make economical use of the available data when estimating the skill of the model.
- Resampling Methods. Methods for systematically splitting a dataset into subsets for the purposes of training and evaluating a predictive model.
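The core of a resampling method like k-fold cross-validation fits in a few lines. This is a minimal sketch (the dataset size, number of folds, and seed are arbitrary):

```python
import random

def kfold_indices(n, k, seed=7):
    """Split indices 0..n-1 into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 3)

# Each fold serves once as the evaluation set; the rest form the training set.
splits = []
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    splits.append((sorted(train), sorted(test_fold)))
```

Each observation is evaluated on exactly once, which is what makes the resulting skill estimate an economical use of the data.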
1.8 Model Configuration
It is common to vary the configuration of a learning algorithm and interpret the effect on estimated model skill. Two classes of statistical methods can be used for this:
- Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
- Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
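As a sketch of the hypothesis-testing approach, the snippet below computes a paired Student's t-statistic over per-fold accuracies of two configurations. The scores are made up, and 2.262 is the two-tailed t critical value for 9 degrees of freedom at a 0.05 significance level:

```python
import statistics

# Paired t-statistic comparing per-fold accuracy of two configurations.
config_a = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82, 0.81, 0.80]
config_b = [0.84, 0.86, 0.83, 0.85, 0.88, 0.84, 0.82, 0.87, 0.85, 0.84]

diffs = [b - a for a, b in zip(config_a, config_b)]
n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / n ** 0.5)
significant = abs(t_stat) > 2.262  # reject "no difference" if True
```

A library routine (e.g. `scipy.stats.ttest_rel`) would also return a p-value; the manual version just makes the arithmetic behind the test visible.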
1.9 Model Selection
The process of selecting one method as the solution is called model selection.
As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection.
- Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
- Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
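The estimation-statistics view of the same comparison reports a confidence interval on the difference in mean skill rather than a yes/no test result. A minimal sketch, with illustrative scores and 2.262 as the t critical value for 9 degrees of freedom at 95% confidence:

```python
import statistics

# Confidence interval on the difference in mean per-fold skill of two models.
model_a = [0.75, 0.78, 0.74, 0.77, 0.76, 0.75, 0.79, 0.74, 0.76, 0.77]
model_b = [0.79, 0.80, 0.78, 0.81, 0.77, 0.80, 0.82, 0.78, 0.79, 0.81]

diffs = [b - a for a, b in zip(model_a, model_b)]
mean_diff = statistics.mean(diffs)
margin = 2.262 * statistics.stdev(diffs) / len(diffs) ** 0.5
interval = (mean_diff - margin, mean_diff + margin)

# If the interval excludes zero, the skill difference is unlikely to be chance.
better = interval[0] > 0
```

The interval communicates both the direction and the size of the difference, which is often more useful for choosing a model than a p-value alone.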
1.10 Model Presentation
Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals.
- Estimation Statistics. Methods that quantify the uncertainty in the skill of a model via confidence intervals.
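A common way to present classification skill is a confidence interval on accuracy, via the normal approximation to a binomial proportion. A sketch with illustrative counts, where 1.96 is the z critical value for 95% confidence:

```python
# 95% confidence interval on classification accuracy (normal approximation).
correct, total = 88, 100
acc = correct / total
margin = 1.96 * (acc * (1 - acc) / total) ** 0.5
interval = (acc - margin, acc + margin)
```

This supports a statement like "the model achieved an accuracy of 88%, with a 95% confidence interval of roughly 82% to 94%", which conveys uncertainty alongside the headline number.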
1.11 Model Predictions
The predictions made by a final model carry uncertainty. We can use methods from the field of estimation statistics to quantify this uncertainty, such as confidence intervals and prediction intervals.
- Estimation Statistics. Methods that quantify the uncertainty for a prediction via prediction intervals.
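As a simplified sketch of a prediction interval, the snippet below fits a least-squares line and bounds a new prediction using the standard deviation of the residuals. The data are made up, and this ignores the extra variance terms a full treatment of regression prediction intervals includes:

```python
import statistics

# Illustrative data with a roughly linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9]

# Least-squares fit of y = b0 + b1 * x.
mx, my = statistics.mean(x), statistics.mean(y)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

# Approximate 95% prediction interval from the residual spread (1.96 = z value).
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
stdev_res = statistics.stdev(residuals)

x_new = 9
y_hat = b0 + b1 * x_new
interval = (y_hat - 1.96 * stdev_res, y_hat + 1.96 * stdev_res)
```

Where a confidence interval bounds an estimated population quantity such as mean skill, a prediction interval bounds a single future observation, so it is the right object to report alongside an individual prediction.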