Review 4995 Applied Machine Learning

Table of contents

Lecture 1

Basic concept

  • Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data.
  • Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so.

  • Until the 1990s…
    Machine Learning: Supervised Learning, Unsupervised Learning, Other (Reinforcement Learning, Active Learning, Self-supervised Learning, Transfer Learning…)

  • Recently (because of the development of Deep Learning):
    Machine Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning, Other (Active Learning, Self-supervised Learning, Transfer Learning…)


  • Supervised learning algorithms learn a function that maps inputs to an output from a set of labeled training data.
  • Unsupervised learning algorithms learn patterns from unlabeled data samples.
  • Deep learning is a class of ML algorithms that uses multiple layers to progressively extract higher-level features/abstractions from raw inputs.

  • Drivers of rising model complexity: computational power, big data, and breakthroughs in deep learning

Exploratory Data Analysis & Visualization

Exploratory Data Analysis (EDA) is an approach of analyzing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

  • Data types:
    ● Quantitative/numerical continuous - 1, 3.5, 100, 10^10, 3.14
    ● Quantitative/numerical discrete - 1, 2, 3, 4
    ● Qualitative/categorical unordered - cat, dog, whale
    ● Qualitative/categorical ordered - good, better, best
    ● Date or time - 09/15/2021, Jan 8th 2020 15:00:00
    ● Text - The quick brown fox jumps over the lazy dog

  • Data Visualization

  • Figures can be ugly (aesthetic problems), bad (unclear or confusing), or wrong (mathematically misleading)

  • A typical data visualization chart may combine several scales:
    ● Two position scales:
      ○ displacement (x-axis)
      ○ fuel efficiency (y-axis)
    ● One color scale: power
    ● One shape scale: cylinders
    ● One size scale: weight

Position scale
Nonlinear axes (logarithmic scale)

Visualization Collections
Typically, we would like to visualize the following kinds of data:
○ Amounts
○ Distributions
○ Proportions
○ X-Y relationships
○ Uncertainty
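A minimal matplotlib sketch of two of these chart types (amounts as a bar chart, an X-Y relationship on a logarithmic position scale); the data here is made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for illustration
categories = ["cat", "dog", "whale"]
counts = [30, 45, 5]
x = np.logspace(0, 3, 50)          # spans several orders of magnitude
y = x ** 0.5 + np.random.rand(50)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, counts)        # amounts
ax1.set_title("Amounts")
ax2.scatter(x, y)                  # X-Y relationship
ax2.set_xscale("log")              # nonlinear (logarithmic) position scale
ax2.set_title("X-Y on a log axis")
plt.tight_layout()
plt.show()
```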

Lecture 2

Supervised Learning

  • Supervised learning algorithms learn a function that maps inputs to an output from a set of labeled training data.
  • Goal of supervised learning: the deployed ML model generalizes well to unseen data

  • Supervised learning framework
    ● Development-test split
    ● Hyperparameter tuning
    ● Optimal model training
    ● Model evaluation
    ● Model deployment

k-nearest neighbors

  • A simple non-parametric supervised learning method
  • Assigns the value of the nearest neighbor(s) to the unseen data point
  • Prediction is computationally expensive, while training is trivial
  • Generally performs poorly at high dimensions
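A minimal k-NN sketch with scikit-learn on synthetic data (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# "Training" just stores the data; prediction does the expensive neighbor search
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```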

Development-test split

  • Typically the dataset is split into a development dataset and a test dataset in the ratio of 4:1 (also called an 80/20 split) or 3:1.
  • The purpose of the test dataset is to evaluate the performance of the final optimal model.
  • Model evaluation is supposed to give a pulse on how the model would perform in the wild.
  • Splitting strategies:
    • Random splitting
    • Stratified splitting
    • Structured splitting
Random Splitting
  • The dataset is split at random to create development & test datasets
  • The size of the test dataset is generally determined by the total number of samples available.
Stratified Splitting
  • Stratified splitting ensures that the ratio of classes in the development and test datasets equals that of the original dataset.
  • Generally employed when performing classification tasks on highly imbalanced datasets.
Structured Splitting
  • The structured splitting is generally employed to prevent data leakage.
  • Examples:
    • Stock price predictions
    • Time-series predictions
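A short sketch of random vs. stratified splitting with scikit-learn's train_test_split, assuming a feature matrix X and labels y:

```python
from sklearn.model_selection import train_test_split

# Random 80/20 split
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: class ratios in dev/test match the full dataset
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```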

Hyperparameter tuning

  • Model hyperparameters control the model complexity
  • Model complexity vs. model performance
  • Bias-Variance Tradeoff


  • Model hyperparameters vs. model parameters
    • Model hyperparameters are set before training and tuned via a search procedure
    • Model parameters are learned during the training process
  • Hyperparameter search
    • Grid search
    • Random search
    • Bayesian optimization
    • Evolutionary optimization

Grid search vs. Random search

  • Grid search and random search are both considered uninformed search strategies
  • It is possible to combine both strategies:
    • Search a larger space using random search
    • Find promising areas
    • Perform grid search in the smaller area
    • Continue until the optimal score is obtained
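A hedged sketch of the combined strategy with scikit-learn, assuming development data X_dev, y_dev from the earlier split (the estimator and search spaces are illustrative):

```python
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Random search over a wide space first...
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-3, 1e3), "gamma": loguniform(1e-4, 1e1)},
    n_iter=20, cv=5, random_state=0)
rand.fit(X_dev, y_dev)

# ...then grid search in the promising region around the best value
best_C = rand.best_params_["C"]
grid = GridSearchCV(SVC(), {"C": [best_C / 3, best_C, best_C * 3]}, cv=5)
grid.fit(X_dev, y_dev)
print(grid.best_params_)
```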

Bayesian optimization

  • Bayesian optimization works by constructing a probability distribution over possible functions (a Gaussian process) that best describes the function you want to optimize.
  • A utility (acquisition) function helps explore the parameter space by trading off exploration and exploitation.
  • The probability distribution over functions is updated (Bayesian updating) based on the observations so far.
  • An informed search strategy
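As an illustration of informed search, here is a hedged sketch using Optuna (one popular library, not prescribed by the notes; its default sampler is TPE rather than a Gaussian process, but the explore/exploit idea is the same). X_dev, y_dev are assumed from earlier:

```python
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(trial):
    # Suggest hyperparameters; the sampler balances exploration and exploitation
    C = trial.suggest_float("C", 1e-3, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e1, log=True)
    return cross_val_score(SVC(C=C, gamma=gamma), X_dev, y_dev, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```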

Model selection

  • Once a hyperparameter search strategy is chosen, we need a way to know how good a hyperparameter value is.
  • Theoretically, we could use the test dataset to evaluate the performance of the model trained using the hyperparameter value.
  • However, this could lead to overfitting and/or providing an overly optimistic model performance estimate.
  • We will need a validation dataset to estimate the effectiveness of a hyperparameter value, thereby eventually helping with the model selection.
Model selection strategies
  • Three-way holdout
    • Model selection corresponds to the hyperparameter with the best performance on the validation data.
  • (Repeated) K-fold CV and Leave-one-out CV
    • The development data is split into k folds at random.
    • For every hyperparameter value, the model is trained on k-1 folds and evaluated on the k-th fold.
    • The process is repeated for all k folds and an average model performance is estimated.
    • Model selection corresponds to the hyperparameter with the best average performance.
    • In repeated k-fold CV, steps (1)-(3) are repeated n times and performance is estimated over k*n iterations.
    • In leave-one-out CV, k is set equal to the number of samples.
  • (Repeated) Stratified K-fold CV
    • Similar to (repeated) k-fold CV, with one difference: each fold (and the test data) is now a stratified sample.
  • Random permutation CV (shuffle-split)
    • A user-specified number of train/validation datasets are created.
    • For every pair, the development data is shuffled before creating the training & validation datasets.
    • The remaining steps are similar to K-fold CV.
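A sketch of these CV strategies with scikit-learn (estimator and fold counts are illustrative; X_dev, y_dev assumed from earlier):

```python
from sklearn.model_selection import (
    KFold, StratifiedKFold, ShuffleSplit, cross_val_score)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# Plain k-fold CV
scores = cross_val_score(model, X_dev, y_dev,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified k-fold: each fold preserves the class ratio
scores = cross_val_score(model, X_dev, y_dev, cv=StratifiedKFold(n_splits=5))

# Random permutation CV (shuffle-split)
scores = cross_val_score(model, X_dev, y_dev,
                         cv=ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))
print(scores.mean())
```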
Guidelines on model selection strategies
  • Three-way holdout can give a reasonable approximation of test performance on large, balanced datasets
  • Leave-one-out CV generally has high variance and is applied to small datasets
  • K-fold CV works well in general, is more stable, and is used when model selection is not computationally expensive
  • Stratified sampling is used when working with highly imbalanced datasets
  • Repeated versions of K-fold CV and Stratified K-fold CV have lower variance, but are computationally expensive.

Model evaluation

Model complexity vs. model performance

Data preprocessing

Numerical data

  • Standard Scaler
  • Min-max Scaler
  • Max absolute Scaler
  • Robust Scaler (Outliers)
  • Normalizer (Normalization is the process of scaling individual samples to have unit norm.)
  • The scaler fitted during the training step should be used to transform the test data (never fit the scaler on test data).

https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35
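A minimal sketch of the fit-on-train, transform-on-test pattern (X_train, X_test assumed from the earlier split):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the fitted scaler
```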

Handling missing data

  • It is important to know how missing values are encoded in the dataset
  • Often missingness is informative (commonly captured by adding missing-indicator columns)
  • Several ways to handle missing values:
    • Drop column (typically used as baseline)
    • Drop rows (if there are only a few with missing values)
    • Impute using mean or median (SimpleImputer in sklearn API)
    • kNN (neighbors are found using the nan_euclidean_distances metric)
    • Regression models
    • Matrix factorization
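A short sketch of two of these options with scikit-learn's imputers (toy data for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# Mean imputation, plus a missing-indicator column to keep the missingness signal
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imp.fit_transform(X)

# kNN imputation (distances are computed with nan_euclidean_distances)
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```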

Categorical data

  • Ordinal encoding
  • One-hot encoding
    • One-hot encoding introduces multi-collinearity
      • E.g., x3 = 1 - x1 - x2 when there are three categories
      • Possible to remove one feature (In practice, we typically use all features and add regularization.)
      • Has implications on model interpretation
    • Could be problematic for some models
      • Non-regularized regression techniques
    • Some modeling techniques handle categorical features as-is
      • Tree-based models
      • Naive Bayes models
    • Leads to high-dimensional datasets
  • Target encoding
    • Generally applicable to high-cardinality categorical features
    • The encoding is specific to the problem type.
    • Regression:
      • Average target value for each category
    • Binary classification:
      • Probability of being in class 1
    • Multiclass classification:
      • One feature per class that gives the probability distribution
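A hedged sketch of the three encodings (toy data; note that OneHotEncoder's sparse_output parameter name varies across scikit-learn versions, and the target-encoding lines are a hand-rolled illustration rather than a library call):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"animal": ["cat", "dog", "whale", "dog"]})

# Ordinal encoding: one integer per category
ord_enc = OrdinalEncoder().fit_transform(df)

# One-hot encoding; drop="first" removes one column to avoid collinearity
oh_enc = OneHotEncoder(drop="first", sparse_output=False).fit_transform(df)

# Target encoding (regression flavor): mean target value per category
y = pd.Series([1.0, 2.0, 3.0, 4.0])
target_means = y.groupby(df["animal"]).mean()
encoded = df["animal"].map(target_means)
```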

Lecture 3

Linear models for regression

Linear regression

  • Assumptions:
    • Linearity
    • Independence
    • Homoscedasticity
    • Normality

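The model and ordinary-least-squares objective in standard form, reconstructed since the original figure is missing:

$$\hat{y} = w^\top x + b, \qquad \min_{w,b} \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2$$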

Ridge regression

  • L2-norm

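Ridge adds an L2 penalty to the least-squares objective (standard form, reconstructed since the figure is missing; α controls the regularization strength):

$$\min_{w,b} \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2 + \alpha \lVert w \rVert_2^2$$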

Lasso regression

  • L1-norm
  • Feature selection
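Lasso uses an L1 penalty instead (standard form, reconstructed). The L1 penalty drives some coefficients exactly to zero, which is why lasso performs feature selection:

$$\min_{w,b} \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2 + \alpha \lVert w \rVert_1$$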

Elastic-net regression

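Elastic net mixes the two penalties; one common parameterization (e.g., scikit-learn's, with mixing ratio ρ), reconstructed since the figure is missing:

$$\min_{w,b} \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1-\rho)}{2} \lVert w \rVert_2^2$$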

Linear models for classification

Binary classification


Logistic Regression

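The logistic model and its negative log-likelihood loss in standard form, reconstructed since the figures are missing:

$$p(y=1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1+e^{-z}}$$

$$\min_{w,b} \; -\sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log (1-p_i) \right]$$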

Support Vector Machines (SVMs)
Kernel Function


Gaussian kernel: if we tune γ enough, we can perfectly separate the dataset. The intuition behind the Gaussian kernel is that it implicitly projects the data into an infinite-dimensional space.

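The Gaussian (RBF) kernel in its standard form, reconstructed since the figure is missing; larger γ makes the kernel more local, letting the decision boundary bend around individual points:

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$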

Multi-class classification

One-vs-One (OVO), One-vs-Rest (OVR)


Multinomial Logistic Regression

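The softmax formulation in standard form, reconstructed since the figure is missing:

$$p(y = k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{K} \exp(w_j^\top x + b_j)}$$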

Lecture 4

Decision Trees

  • Greedy algorithm (Building an optimal decision tree is NP hard.)
  • Applicable to both classification & regression problems
  • Easy to interpret & deploy
  • Non-linear decision boundary
  • Minimal preprocessing
  • Invariant to the scale of the data (splits depend on the ordering of feature values, not their absolute magnitudes)

Classification Trees - Measure of Impurity

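The standard impurity measures, reconstructed since the figures are missing; p_k is the proportion of class k in the node:

$$\text{Gini} = 1 - \sum_{k} p_k^2, \qquad \text{Entropy} = -\sum_{k} p_k \log_2 p_k$$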

Classification Trees - Information Gain


  • Information gain measures the reduction in impurity from a split; the notation is defined below.
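A standard formulation (reconstructed, since the original figure is missing):

$$IG = H(\text{parent}) - \sum_{c \,\in\, \text{children}} \frac{N_c}{N}\, H(c)$$

where H is the impurity (e.g., entropy or Gini), N is the number of samples in the parent node, and N_c the number in child c.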

Regression Trees - Measure of Impurity


Regression Trees - Sum of Squared Error (SSE)

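For regression trees, impurity within a node is typically the sum of squared errors around the node mean (standard form, reconstructed):

$$SSE = \sum_{i \,\in\, \text{node}} \left(y_i - \bar{y}_{\text{node}}\right)^2$$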

Classification v.s. Regression Trees


Node splitting

  • Continuous features: an exhaustive search over candidate split thresholds (see the sketch after this subsection)

  • Categorical features: target encoding
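A minimal sketch of the exhaustive threshold search for one continuous feature using entropy (hand-rolled for illustration; not the course's reference code):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    """Try the midpoint between every pair of consecutive sorted feature values."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    parent = entropy(y)
    best_gain, best_thr = 0.0, None
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        thr = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - child   # information gain of this threshold
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain
```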

Overfitting and Preventing Overfitting

  • In a fully grown tree, all leaf nodes have zero entropy (i.e., pure nodes)
  • If all leaf nodes are pure, accuracy on the training data is 100%, a telltale sign of overfitting.

Prevent Overfitting


  • Early stopping
    • Maximum depth
    • Maximum leaf nodes
    • Minimum samples split
    • Minimum impurity decrease
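A hedged sketch of these controls with scikit-learn's DecisionTreeClassifier (values are illustrative; X_train, y_train assumed from earlier):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # maximum depth
    max_leaf_nodes=20,           # maximum leaf nodes
    min_samples_split=10,        # minimum samples to split a node
    min_impurity_decrease=1e-3)  # minimum impurity decrease per split
tree.fit(X_train, y_train)
```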

Ensemble methods

Motivation

  • The decision trees are highly unstable and can structurally change with slight variation in input data.
  • Decision trees perform poorly on continuous outcomes (regression) due to limited model capacity (e.g., a tree with four leaves can only predict four distinct values).


Ensemble methods

  • Several weak/simple learners are combined to make the final prediction
  • Generally ensemble methods aim to reduce model variance.
  • Ensemble methods improve performance especially if the individual learners are not correlated.
  • Depending on training sample construction and output aggregation, there are two categories:
    • Bagging (Bootstrap aggregation)
    • Boosting

Bagging (Bootstrap aggregation)

  • Several training samples (of the same size) are created by sampling the dataset with replacement
  • Each training sample is then used to train a model
  • The outputs from each of the models are averaged to make the final prediction.
Random Forests
  • Applicable to both classification and regression problems
  • Smarter bagging for trees
  • Motivated by theory that generalization improves with uncorrelated trees
  • Bootstrapped samples and random subset of features are used to train each tree
  • The outputs from each of the models are averaged to make the final prediction.

Random Forest Hyperparameter tuning

  • Random Forest hyperparameters:
    • # of trees
    • # of features considered per split
      • Classification - sqrt(# of features)
      • Regression - # of features
    • Decision tree hyperparameters (splitting criteria, maximum depth, etc.)
  • Uses out-of-bag (OOB) error for model selection
  • OOB error is the average error of a data point calculated using predictions from the trees that do not contain it in their respective bootstrap sample
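A hedged sketch with scikit-learn, covering the hyperparameters above, the OOB estimate, and the impurity-based importances discussed next (values illustrative; X_train, y_train assumed):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features="sqrt",   # features considered per split (classification default)
    oob_score=True,        # estimate generalization error from OOB samples
    random_state=0)
rf.fit(X_train, y_train)
print(rf.oob_score_)            # OOB accuracy
print(rf.feature_importances_)  # impurity-based feature importances
```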

Random Forest Feature importances

  • Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node.
  • The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples.
  • The higher the value the more important the feature.

Boosting

● Applicable to both regression as well as classification problems
● Includes a family of ML algorithms that convert weak learners to strong ones.
● The weak learners are learned sequentially with early learners fitting simple models to the data and then analysing data for errors.
● When an input is misclassified by one tree, its weight (or target) is adjusted so that the next tree is more likely to learn it correctly.

Adaptive Boosting

● The first ensemble boosting algorithm applied to classification tasks
● Initially, a decision stump classifier (a tree that just splits the data into two regions) is fit to the data
● Correctly classified data points are given less weight, while misclassified data points are given higher weight in the next iteration
● A new decision stump classifier is then fit to the data with the weights determined in the previous iteration
● Weights (ρt) for each classifier (estimated during the training process) are used to combine the outputs and make the final prediction.
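A hedged AdaBoost sketch with scikit-learn (the base-estimator parameter is named estimator in recent scikit-learn versions and base_estimator in older ones; values are illustrative):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stump (the default base learner)
    n_estimators=100,
    learning_rate=0.5,
    random_state=0)
ada.fit(X_train, y_train)
```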

Algorithm and Loss function


  • Exponential loss function
Adaboost - Tuning Parameters
  • Classification:
    • # of estimators
    • learning rate
    • base estimator (a decision stump by default, but it can be changed)
  • Regression:
    • loss function
    • learning rate
    • # of estimators
    • base estimator

Gradient Boosting

● Works for both classification and regression tasks
● Trains regression trees in a sequential manner on modified versions of the datasets.
● Every tree is trained on the residuals of the data points obtained by subtracting the predictions from the previous tree.
● Weights for each classifier (estimated during the training process) are used to combine the outputs and make the final prediction.
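A minimal from-scratch sketch of the residual-fitting loop for squared-loss regression, to make the sequential idea concrete (assumes NumPy arrays X, y; an illustration, not the course's reference implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    pred = np.full(len(y), y.mean())  # initial constant prediction
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred                      # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # take a small step toward the targets
        trees.append(tree)
    return trees
```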

Algorithm and Loss function


Gradient Descent


Big learning rate: the updates oscillate around the minimum before reaching it (and may diverge).
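The standard gradient descent update rule, reconstructed since the figure is missing (η is the learning rate):

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$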

Tuning Parameters

● # of estimators
● Learning rate
● Decision tree parameters (max depth, min number of samples etc.)
● Regularization parameters
● Row sampling
● Column sampling

Gradient Boosting implementations
GradientBoostingClassifier

● Early implementation of Gradient Boosting in sklearn
● Typically slow on large datasets
● Most important parameters are # of estimators and learning rate
● Supports both binary & multi-class classification
● Supports sparse data

HistGradientBoostingClassifier

● Orders of magnitude faster than GradientBoostingClassifier on large datasets
● Inspired by the LightGBM implementation
● Histogram-based split finding in tree learning
● Does not support sparse data
● Supports both binary & multi-class classification
● Natively supports categorical features
● Does not support monotonicity constraints

XGBoost

● One of most popular implementations of gradient boosting
● Fast approximate split finding based on histograms
● Supports GPU training, sparse data & missing values
● Adds L1 and L2 penalties on leaf weights
● Monotonicity & feature interaction constraints
● Works well with pipelines in sklearn due to a compatible interface
● Does not support categorical variables natively

LightGBM

● Supports GPU training, sparse data & missing values
● Histogram-based node splitting
● Uses Gradient-based One-Sided Sampling (GOSS) for tree learning
● Exclusive feature bundling to handle sparse features
● Generally faster than XGBoost on CPUs
● Supports distributed training on different frameworks like Ray, Spark, Dask etc.
● Provides a CLI version

CatBoost

● Optimized for categorical features
● Uses target encoding to handle categorical features
● Uses ordered boosting to build “symmetric” trees
● Overfitting detector
● Tooling support (Jupyter notebook & TensorBoard visualization)
● Supports GPU training, sparse data & missing values
● Monotonicity constraints
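For reference, a hedged sketch of how these libraries are typically instantiated through their scikit-learn-style interfaces (assumes the xgboost, lightgbm, and catboost packages are installed; the cat_features column name is made up):

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=6)
lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1)
cat = CatBoostClassifier(iterations=200, learning_rate=0.1,
                         cat_features=["color"], verbose=0)  # handles categoricals natively
```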

Lecture 5

Model Evaluation

● Evaluation metrics are generally used to measure the performance of an ML model
● Evaluation metrics indicate how well the model would do when deployed
● The choice of metric is very task-specific and determines what the model learns
● It is important to know what you are willing to trade off when training ML models for a task

Model Evaluation Metrics

Binary Classification
Accuracy

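The standard definition, reconstructed since the figure is missing:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$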

  • Limitations
    • Classification accuracy can be misleading on imbalanced datasets
    • Accuracy paradox (higher accuracy does not necessarily mean a better model)
Confusion Matrix

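The standard 2×2 layout, reconstructed since the figure is missing:

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |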

Precision, Recall & F1-score

● Precision, Recall & F1-score are better metrics for imbalanced datasets
● Precision is defined as the fraction of relevant instances among retrieved instances
● Recall is defined as the fraction of relevant instances that were retrieved
● F1-score is the harmonic mean of precision & recall
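In terms of confusion-matrix counts (standard definitions, reconstructed since the figure is missing):

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$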
As a best practice, the minority class is treated as the positive class.

Averaging Metrics - Macro & Weighted


  • Weighted: the majority class still dominates the average, so it is generally not suitable for imbalanced datasets.
  • Macro: every class contributes equally, giving a more balanced metric.
Averaging Metrics - Balanced Accuracy

  • Same as macro-averaged recall
  • Equal to standard accuracy on balanced datasets
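For the binary case (standard form, reconstructed):

$$\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$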

Precision-Recall (PR) Curve


Lowering the threshold below 0.5 means a negative prediction is made only when the model is very confident the sample is negative. This reduces false negatives and therefore increases recall.

● A precision-recall curve shows the relationship between precision and recall at every cut-off point.
● Visualize effect of selected threshold on performance.

Receiver Operating Curve (ROC)

● Another useful tool to visualize the performance of a classification model
● ROC depicts the relationship between False Positive Rate (FPR) (FP / (TN + FP)) and True Positive Rate/Recall (TPR)
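A short sketch of computing both curves and their summary scores with scikit-learn (assumes a fitted model with predict_proba and the test split from earlier):

```python
from sklearn.metrics import (precision_recall_curve, roc_curve,
                             average_precision_score, roc_auc_score)

# Predicted probabilities for the positive class
y_score = model.predict_proba(X_test)[:, 1]

prec, rec, pr_thresholds = precision_recall_curve(y_test, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)

print("AP:",    average_precision_score(y_test, y_score))
print("AUROC:", roc_auc_score(y_test, y_score))
```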

Comparing models using curves


AUROC = 0.5: equivalent to a random or constant predictor.
AUROC < 0.5: worse than random; inverting the predictions would perform better.

AP vs. AUROC
  • AP & AUROC are both ranking metrics

  • AP indicates whether your model can correctly identify all positive examples without accidentally marking too many negative examples as positive

  • AUROC measures whether the model is able to rank positive examples higher than negative examples (another way to look at it: sample one positive and one negative example from the dataset; AUROC is the probability that the positive sample receives a higher score)

  • It is easier to tell whether the model performs better than random using AUROC (baseline 0.5) than using AP (whose baseline depends on class prevalence)

  • On imbalanced datasets, AP is the better indicator of model performance (AUROC is built from rates, TPR = TP/P and FPR = FP/N, so even if P and N change a lot, AUROC does not change much)

  • For ranking metrics, the exact probabilities do not actually matter: as long as the ranking does not change, the ranking metrics do not change.

Multi-class Classification

Largely the same as binary classification; metrics are computed per class and then averaged.

Choosing the Right Metric
  • Problem-specific
  • Balanced accuracy is better than accuracy (most of the time)
  • Cost associated with misclassification
    • Predicting that an individual has no cancer when he/she has cancer (a false negative) is far costlier than the other way round
    • Predicting an email as spam when it is not (a false positive) has a higher cost than predicting a spam email as not spam
  • Choose precision when cost of false positives is high (Type I error)
  • Choose recall when cost of false negatives is high (Type II error)
Evaluation Metrics for Regression

MSE is sensitive to outliers, since errors are squared.
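Common regression metrics in standard form, reconstructed since the figure is missing:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad MAE = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \quad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$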

Calibration

Motivation

  • Many supervised learning algorithms predict probabilities (as a precursor to predicted labels)
  • In many classification tasks, the probability of belonging to a class is as important as the predicted label itself
  • However, we need these probabilities to be well calibrated
  • Calibrated probabilities mean that the predicted probability reflects the true likelihood of the event
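A hedged sketch of probability calibration with scikit-learn (method="sigmoid" is Platt scaling; "isotonic" is the nonparametric alternative; model and splits assumed from earlier):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Wrap a classifier with probability calibration (Platt scaling here)
calibrated = CalibratedClassifierCV(model, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Reliability diagram data: observed frequency vs. mean predicted probability
prob_pos = calibrated.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
```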

Automatic Machine Learning
