Review 4995 Applied Machine Learning

Table of contents

Lecture 1

Basic concept

  • Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data.
  • Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so.

  • Until the 1990s…
    Machine Learning: Supervised Learning, Unsupervised Learning, Other (Reinforcement Learning, Active Learning, Self-supervised Learning, Transfer Learning…)

  • Recently (because of the development of Deep Learning):
    Machine Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning, Other (Active Learning, Self-supervised Learning, Transfer Learning…)


  • Supervised learning algorithms learn a function that maps inputs to an output from a set of labeled training data.
  • Unsupervised learning algorithms learn patterns from unlabeled data samples.
  • Deep learning is a class of ML algorithms that uses multiple layers to progressively extract higher-level features/abstractions from raw inputs.

  • Drivers of rising model complexity: computational power, big data, and breakthroughs in deep learning

Exploratory Data Analysis & Visualization

Exploratory Data Analysis (EDA) is an approach of analyzing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

  • Data types:
    ● Quantitative/numerical continuous - 1, 3.5, 100, 10^10, 3.14
    ● Quantitative/numerical discrete - 1, 2, 3, 4
    ● Qualitative/categorical unordered - cat, dog, whale
    ● Qualitative/categorical ordered - good, better, best
    ● Date or time - 09/15/2021, Jan 8th 2020 15:00:00
    ● Text - The quick brown fox jumps over the lazy dog

  • Data Visualization

  • Figures can be ugly (aesthetic problems), bad (unclear or confusing), or wrong (mathematically misleading)

  • A typical data visualization chart may combine several scales:
    ● Two position scales:
      ○ displacement (x-axis)
      ○ fuel efficiency (y-axis)
    ● One color scale: power
    ● One shape scale: cylinders
    ● One size scale: weight

Position scale
Nonlinear axes (logarithmic scale)

Visualization Collections
Typically, we would like to visualize the following kinds of data:
○ Amounts
○ Distributions
○ Proportions
○ X-Y relationships
○ Uncertainty
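A minimal matplotlib sketch of two of these chart types (amounts as a bar chart, an X-Y relationship on a logarithmic position scale); the data here is made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for illustration
categories = ["cat", "dog", "whale"]
counts = [30, 45, 5]
x = np.logspace(0, 3, 50)          # spans several orders of magnitude
y = x ** 0.5 + np.random.rand(50)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, counts)        # amounts
ax1.set_title("Amounts")
ax2.scatter(x, y)                  # X-Y relationship
ax2.set_xscale("log")              # nonlinear (logarithmic) position scale
ax2.set_title("X-Y on a log axis")
plt.tight_layout()
plt.show()
```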

Lecture 2

Supervised Learning

  • Supervised learning algorithms learn a function that maps inputs to an output from a set of labeled training data.
  • Goal of supervised learning: the deployed ML model generalizes well to unseen data

  • Supervised learning framework
    ● Development-test split
    ● Hyperparameter tuning
    ● Optimal model training
    ● Model evaluation
    ● Model deployment

k-nearest neighbors

  • A simple non-parametric supervised learning method
  • Assigns the value of the nearest neighbor(s) to the unseen data point
  • Prediction is computationally expensive, while training is trivial
  • Generally performs poorly at high dimensions
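A minimal k-NN sketch with scikit-learn on synthetic data (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# "Training" just stores the data; prediction does the expensive neighbor search
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```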

Development-test split

  • Typically the dataset is split into a development dataset and a test dataset in the ratio of 4:1 (also called an 80/20 split) or 3:1.
  • The purpose of the test dataset is to evaluate the performance of the final optimal model.
  • Model evaluation is supposed to give a pulse on how the model would perform in the wild.
  • Splitting strategies:
    • Random splitting
    • Stratified splitting
    • Structured splitting
Random Splitting
  • The dataset is split at random to create development & test datasets
  • The size of the test dataset is generally determined by the total number of samples available.
Stratified Splitting
  • Stratified splitting ensures that the ratio of classes in the development and test datasets equals that of the original dataset.
  • Generally employed when performing classification tasks on highly imbalanced datasets.
Structured Splitting
  • The structured splitting is generally employed to prevent data leakage.
  • Examples:
    • Stock price predictions
    • Time-series predictions
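A short sketch of random vs. stratified splitting with scikit-learn's train_test_split, assuming a feature matrix X and labels y:

```python
from sklearn.model_selection import train_test_split

# Random 80/20 split
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: class ratios in dev/test match the full dataset
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```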

Hyperparameter tuning

  • Model hyperparameters control the model complexity
  • Model complexity vs. model performance
  • Bias-Variance Tradeoff


  • Model hyperparameters vs. model parameters
    • Model hyperparameters are set before training and tuned via a search procedure
    • Model parameters are learned during the training process
  • Hyperparameter search
    • Grid search
    • Random search
    • Bayesian optimization
    • Evolutionary optimization

Grid search vs. Random search

  • Grid search and random search are both considered uninformed search strategies
  • It is possible to combine both strategies:
    • Search a larger space using random search
    • Find promising areas
    • Perform grid search in the smaller area
    • Continue until the optimal score is obtained
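A hedged sketch of the combined strategy with scikit-learn, assuming development data X_dev, y_dev from the earlier split (the estimator and search spaces are illustrative):

```python
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Random search over a wide space first...
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-3, 1e3), "gamma": loguniform(1e-4, 1e1)},
    n_iter=20, cv=5, random_state=0)
rand.fit(X_dev, y_dev)

# ...then grid search in the promising region around the best value
best_C = rand.best_params_["C"]
grid = GridSearchCV(SVC(), {"C": [best_C / 3, best_C, best_C * 3]}, cv=5)
grid.fit(X_dev, y_dev)
print(grid.best_params_)
```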

Bayesian optimization

  • Bayesian optimization works by constructing a probability distribution over possible functions (a Gaussian process) that best describes the function you want to optimize.
  • A utility (acquisition) function helps explore the parameter space by trading off exploration and exploitation.
  • The probability distribution over functions is updated (Bayesian updating) based on the observations so far.
  • An informed search strategy
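As an illustration of informed search, here is a hedged sketch using Optuna (one popular library, not prescribed by the notes; its default sampler is TPE rather than a Gaussian process, but the explore/exploit idea is the same). X_dev, y_dev are assumed from earlier:

```python
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(trial):
    # Suggest hyperparameters; the sampler balances exploration and exploitation
    C = trial.suggest_float("C", 1e-3, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e1, log=True)
    return cross_val_score(SVC(C=C, gamma=gamma), X_dev, y_dev, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```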

Model selection

  • Once a hyperparameter search strategy is chosen, we need a way to know how good a hyperparameter value is.
  • Theoretically, we could use the test dataset to evaluate the performance of the model trained using the hyperparameter value.
  • However, this could lead to overfitting and/or providing an overly optimistic model performance estimate.
  • We will need a validation dataset to estimate the effectiveness of a hyperparameter value, thereby eventually helping with the model selection.
Model selection strategies
  • Three-way holdout
    • Model selection corresponds to the hyperparameter with the best performance on the validation data.
  • (Repeated) K-fold CV and Leave-one-out CV
    • The development data is split into k folds at random.
    • For every hyperparameter value, the model is trained on k-1 folds and evaluated on the k-th fold.
    • The process is repeated for all k folds and an average model performance is estimated.
    • Model selection corresponds to the hyperparameter with the best average performance.
    • In repeated k-fold CV, steps (1)-(3) are repeated n times and performance is estimated over k*n iterations.
    • In leave-one-out CV, k is set equal to the number of samples.
  • (Repeated) Stratified K-fold CV
    • Similar to (repeated) k-fold CV, with one difference: each fold (and the test data) is now a stratified sample.
  • Random permutation CV (shuffle-split)
    • A user-specified number of train/validation datasets are created.
    • For every pair, the development data is shuffled before creating the training & validation datasets.
    • The remaining steps are similar to K-fold CV.
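A sketch of these CV strategies with scikit-learn (estimator and fold counts are illustrative; X_dev, y_dev assumed from earlier):

```python
from sklearn.model_selection import (
    KFold, StratifiedKFold, ShuffleSplit, cross_val_score)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# Plain k-fold CV
scores = cross_val_score(model, X_dev, y_dev,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified k-fold: each fold preserves the class ratio
scores = cross_val_score(model, X_dev, y_dev, cv=StratifiedKFold(n_splits=5))

# Random permutation CV (shuffle-split)
scores = cross_val_score(model, X_dev, y_dev,
                         cv=ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))
print(scores.mean())
```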
Guidelines on model selection strategies
  • Three-way holdout can give a reasonable approximation of test performance on large, balanced datasets
  • Leave-one-out CV generally has high variance and is applied to small datasets
  • K-fold CV works well in general, is more stable, and is used when model selection is not computationally expensive
  • Stratified sampling is used when working with highly imbalanced datasets
  • Repeated versions of K-fold CV and Stratified K-fold CV have lower variance, but are computationally expensive.

Model evaluation

Model complexity vs. model performance

Data preprocessing

Numerical data

  • Standard Scaler
  • Min-max Scaler
  • Max absolute Scaler
  • Robust Scaler (Outliers)
  • Normalizer (Normalization is the process of scaling individual samples to have unit norm.)
  • The scaler fitted during the training step should be used to transform the test data (never fit the scaler on test data).

https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35
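A minimal sketch of the fit-on-train, transform-on-test pattern (X_train, X_test assumed from the earlier split):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the fitted scaler
```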

Handling missing data

  • It is important to know how missing values are encoded in the dataset
  • Often missingness is informative (commonly captured by adding missing-indicator columns)
  • Several ways to handle missing values:
    • Drop column (typically used as baseline)
    • Drop rows (if there are only a few with missing values)
    • Impute using mean or median (SimpleImputer in sklearn API)
    • kNN (neighbors are found using the nan_euclidean_distances metric)
    • Regression models
    • Matrix factorization
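A short sketch of two of these options with scikit-learn's imputers (toy data for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# Mean imputation, plus a missing-indicator column to keep the missingness signal
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imp.fit_transform(X)

# kNN imputation (distances are computed with nan_euclidean_distances)
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```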

Categorical data

  • Ordinal encoding
  • One-hot encoding
    • One-hot encoding introduces multi-collinearity
      • E.g., x3 = 1 - x1 - x2 when there are three categories
      • Possible to remove one feature (In practice, we typically use all features and add regularization.)
      • Has implications on model interpretation
    • Could be problematic for some models
      • Non-regularized regression techniques
    • Some modeling techniques handle categorical features as-is
      • Tree-based models
      • Naive Bayes models
    • Leads to high-dimensional datasets
  • Target encoding
    • Generally applicable to high-cardinality categorical features
    • The encoding is specific to the problem type.
    • Regression:
      • Average target value for each category
    • Binary classification:
      • Probability of being in class 1
    • Multiclass classification:
      • One feature per class that gives the probability distribution
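A hedged sketch of the three encodings (toy data; note that OneHotEncoder's sparse_output parameter name varies across scikit-learn versions, and the target-encoding lines are a hand-rolled illustration rather than a library call):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"animal": ["cat", "dog", "whale", "dog"]})

# Ordinal encoding: one integer per category
ord_enc = OrdinalEncoder().fit_transform(df)

# One-hot encoding; drop="first" removes one column to avoid collinearity
oh_enc = OneHotEncoder(drop="first", sparse_output=False).fit_transform(df)

# Target encoding (regression flavor): mean target value per category
y = pd.Series([1.0, 2.0, 3.0, 4.0])
target_means = y.groupby(df["animal"]).mean()
encoded = df["animal"].map(target_means)
```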

Lecture 3

Linear models for regression

Linear regression

  • Assumptions:
    • Linearity
    • Independence
    • Homoscedasticity
    • Normality

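The model and ordinary-least-squares objective in standard form, reconstructed since the original figure is missing:

$$\hat{y} = w^\top x + b, \qquad \min_{w,b} \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2$$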

Ridge regression

  • L2-norm

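Ridge adds an L2 penalty to the least-squares objective (standard form, reconstructed since the figure is missing; α controls the regularization strength):

$$\min_{w,b} \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2 + \alpha \lVert w \rVert_2^2$$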

Lasso regression

  • L1-norm
  • Feature selection
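Lasso uses an L1 penalty instead (standard form, reconstructed). The L1 penalty drives some coefficients exactly to zero, which is why lasso performs feature selection:

$$\min_{w,b} \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2 + \alpha \lVert w \rVert_1$$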

Elastic-net regression

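Elastic net mixes the two penalties; one common parameterization (e.g., scikit-learn's, with mixing ratio ρ), reconstructed since the figure is missing:

$$\min_{w,b} \sum_{i=1}^{n} \left(y_i - w^\top x_i - b\right)^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1-\rho)}{2} \lVert w \rVert_2^2$$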

Linear models for classification

Binary classification


Logistic Regression

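The logistic model and its negative log-likelihood loss in standard form, reconstructed since the figures are missing:

$$p(y=1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1+e^{-z}}$$

$$\min_{w,b} \; -\sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i) \log (1-p_i) \right]$$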

Support Vector Machines (SVMs)
Kernel Function


Gaussian kernel: if we tune γ enough, we can perfectly separate the dataset. The intuition behind the Gaussian kernel is that it implicitly projects the data into an infinite-dimensional space.

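The Gaussian (RBF) kernel in its standard form, reconstructed since the figure is missing; larger γ makes the kernel more local, letting the decision boundary bend around individual points:

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$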

Multi-class classification

One-vs-One (OVO), One-vs-Rest (OVR)


Multinomial Logistic Regression

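The softmax formulation in standard form, reconstructed since the figure is missing:

$$p(y = k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{K} \exp(w_j^\top x + b_j)}$$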

Lecture 4

Decision Trees

  • Greedy algorithm (Building an optimal decision tree is NP hard.)
  • Applicable to both classification & regression problems
  • Easy to interpret & deploy
  • Non-linear decision boundary
  • Minimal preprocessing
  • Invariant to the scale of the data (splits depend on the ordering of feature values, not their absolute magnitudes)

Classification Trees - Measure of Impurity

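The standard impurity measures, reconstructed since the figures are missing; p_k is the proportion of class k in the node:

$$\text{Gini} = 1 - \sum_{k} p_k^2, \qquad \text{Entropy} = -\sum_{k} p_k \log_2 p_k$$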

Classification Trees - Information Gain


  • Information gain measures the reduction in impurity from a split; the notation is defined below.
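A standard formulation (reconstructed, since the original figure is missing):

$$IG = H(\text{parent}) - \sum_{c \,\in\, \text{children}} \frac{N_c}{N}\, H(c)$$

where H is the impurity (e.g., entropy or Gini), N is the number of samples in the parent node, and N_c the number in child c.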

Regression Trees - Measure of Impurity


Regression Trees - Sum of Squared Error (SSE)

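For regression trees, impurity within a node is typically the sum of squared errors around the node mean (standard form, reconstructed):

$$SSE = \sum_{i \,\in\, \text{node}} \left(y_i - \bar{y}_{\text{node}}\right)^2$$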

Classification v.s. Regression Trees


Node splitting

  • Continuous features: an exhaustive search over candidate split thresholds (see the sketch after this subsection)

  • Categorical features: target encoding
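A minimal sketch of the exhaustive threshold search for one continuous feature using entropy (hand-rolled for illustration; not the course's reference code):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    """Try the midpoint between every pair of consecutive sorted feature values."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    parent = entropy(y)
    best_gain, best_thr = 0.0, None
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        thr = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - child   # information gain of this threshold
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain
```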

Overfitting and Preventing Overfitting

  • In a fully grown tree, all leaf nodes have zero entropy (i.e., pure nodes)
  • If all leaf nodes are pure, accuracy on the training data is 100%, a telltale sign of overfitting.

Prevent Overfitting


  • Early stopping
    • Maximum depth
    • Maximum leaf nodes
    • Minimum samples split
    • Minimum impurity decrease
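A hedged sketch of these controls with scikit-learn's DecisionTreeClassifier (values are illustrative; X_train, y_train assumed from earlier):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # maximum depth
    max_leaf_nodes=20,           # maximum leaf nodes
    min_samples_split=10,        # minimum samples to split a node
    min_impurity_decrease=1e-3)  # minimum impurity decrease per split
tree.fit(X_train, y_train)
```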

Ensemble methods

Motivation

  • The decision trees are highly unstable and can structurally change with slight variation in input data.
  • Decision trees perform poorly on continuous outcomes (regression) due to limited model capacity (e.g., a tree with four leaves can only predict four distinct values).


Ensemble methods

  • Several weak/simple learners are combined to make the final prediction
  • Generally ensemble methods aim to reduce model variance.
  • Ensemble methods improve performance especially if the individual learners are not correlated.
  • Depending on training sample construction and output aggregation, there are two categories:
    • Bagging (Bootstrap aggregation)
    • Boosting

Bagging (Bootstrap aggregation)

  • Several training samples (of the same size) are created by sampling the dataset with replacement
  • Each training sample is then used to train a model
  • The outputs from each of the models are averaged to make the final prediction.
Random Forests
  • Applicable to both classification and regression problems
  • Smarter bagging for trees
  • Motivated by theory that generalization improves with uncorrelated trees
  • Bootstrapped samples and random subset of features are used to train each tree
  • The outputs from each of the models are averaged to make the final prediction.

Random Forest Hyperparameter tuning

  • Random Forest hyperparameters:
    • # of trees
    • # of features considered per split
      • Classification - sqrt(# of features)
      • Regression - # of features
    • Decision tree hyperparameters (splitting criteria, maximum depth, etc.)
  • Uses out-of-bag (OOB) error for model selection
  • OOB error is the average error of a data point calculated using predictions from the trees that do not contain it in their respective bootstrap sample
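A hedged sketch with scikit-learn, covering the hyperparameters above, the OOB estimate, and the impurity-based importances discussed next (values illustrative; X_train, y_train assumed):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features="sqrt",   # features considered per split (classification default)
    oob_score=True,        # estimate generalization error from OOB samples
    random_state=0)
rf.fit(X_train, y_train)
print(rf.oob_score_)            # OOB accuracy
print(rf.feature_importances_)  # impurity-based feature importances
```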

Random Forest Feature importances

  • Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node.
  • The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples.
  • The higher the value the more important the feature.

Boosting

● Applicable to both regression as well as classification problems
● Includes a family of ML algorithms that convert weak learners to strong ones.
● The weak learners are learned sequentially with early learners fitting simple models to the data and then analysing data for errors.
● When an input is misclassified by one tree, its weight (or target) is adjusted so that the next tree is more likely to learn it correctly.

Adaptive Boosting

● The first ensemble boosting algorithm applied to classification tasks
● Initially, a decision stump classifier (a tree that just splits the data into two regions) is fit to the data
● Correctly classified data points are given less weight, while misclassified data points are given higher weight in the next iteration
● A new decision stump classifier is then fit to the data with the weights determined in the previous iteration
● Weights (ρt) for each classifier (estimated during the training process) are used to combine the outputs and make the final prediction.
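A hedged AdaBoost sketch with scikit-learn (the base-estimator parameter is named estimator in recent scikit-learn versions and base_estimator in older ones; values are illustrative):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stump (the default base learner)
    n_estimators=100,
    learning_rate=0.5,
    random_state=0)
ada.fit(X_train, y_train)
```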

Algorithm and Loss function


  • Exponential loss function
Adaboost - Tuning Parameters
  • Classification:
    • # of estimators
    • learning rate
    • base estimator (a decision stump by default, but it can be changed)
  • Regression:
    • loss function
    • learning rate
    • # of estimators
    • base estimator

Gradient Boosting

● Works for both classification and regression tasks
● Trains regression trees in a sequential manner on modified versions of the datasets.
● Every tree is trained on the residuals of the data points obtained by subtracting the predictions from the previous tree.
● Weights for each classifier (estimated during the training process) are used to combine the outputs and make the final prediction.
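A minimal from-scratch sketch of the residual-fitting loop for squared-loss regression, to make the sequential idea concrete (assumes NumPy arrays X, y; an illustration, not the course's reference implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    pred = np.full(len(y), y.mean())  # initial constant prediction
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred                      # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # take a small step toward the targets
        trees.append(tree)
    return trees
```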

Algorithm and Loss function


Gradient Descent


Big learning rate: the updates oscillate around the minimum before reaching it (and may diverge).
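The standard gradient descent update rule, reconstructed since the figure is missing (η is the learning rate):

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$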

Tuning Parameters

● # of estimators
● Learning rate
● Decision tree parameters (max depth, min number of samples etc.)
● Regularization parameters
● Row sampling
● Column sampling

Gradient Boosting implementations
GradientBoostingClassifier

● Early implementation of Gradient Boosting in sklearn
● Typically slow on large datasets
● Most important parameters are # of estimators and learning rate
● Supports both binary & multi-class classification
● Supports sparse data

HistGradientBoostingClassifier

● Orders of magnitude faster than GradientBoostingClassifier on large datasets
● Inspired by the LightGBM implementation
● Histogram-based split finding in tree learning
● Does not support sparse data
● Supports both binary & multi-class classification
● Natively supports categorical features
● Does not support monotonicity constraints

XGBoost

● One of most popular implementations of gradient boosting
● Fast approximate split finding based on histograms
● Supports GPU training, sparse data & missing values
● Adds L1 and L2 penalties on leaf weights
● Monotonicity & feature interaction constraints
● Works well with pipelines in sklearn due to a compatible interface
● Does not support categorical variables natively

LightGBM

● Supports GPU training, sparse data & missing values
● Histogram-based node splitting
● Uses Gradient-based One-Sided Sampling (GOSS) for tree learning
● Exclusive feature bundling to handle sparse features
● Generally faster than XGBoost on CPUs
● Supports distributed training on different frameworks like Ray, Spark, Dask etc.
● Provides a CLI version

CatBoost

● Optimized for categorical features
● Uses target encoding to handle categorical features
● Uses ordered boosting to build “symmetric” trees
● Overfitting detector
● Tooling support (Jupyter notebook & TensorBoard visualization)
● Supports GPU training, sparse data & missing values
● Monotonicity constraints
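For reference, a hedged sketch of how these libraries are typically instantiated through their scikit-learn-style interfaces (assumes the xgboost, lightgbm, and catboost packages are installed; the cat_features column name is made up):

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=6)
lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1)
cat = CatBoostClassifier(iterations=200, learning_rate=0.1,
                         cat_features=["color"], verbose=0)  # handles categoricals natively
```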

Lecture 5

Model Evaluation

● Evaluation metrics are generally used to measure the performance of an ML model
● Evaluation metrics indicate how well the model would do when deployed
● The choice of metric is very task-specific and determines what the model learns
● It is important to know what you are willing to trade off when training ML models for a task

Model Evaluation Metrics

Binary Classification
Accuracy

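The standard definition, reconstructed since the figure is missing:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$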

  • Limitations
    • Classification accuracy can be misleading on imbalanced datasets
    • Accuracy paradox (higher accuracy does not necessarily mean a better model)
Confusion Matrix

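The standard 2×2 layout, reconstructed since the figure is missing:

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |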

Precision, Recall & F1-score

● Precision, Recall & F1-score are better metrics for imbalanced datasets
● Precision is defined as the fraction of relevant instances among retrieved instances
● Recall is defined as the fraction of relevant instances that were retrieved
● F1-score is the harmonic mean of precision & recall
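In terms of confusion-matrix counts (standard definitions, reconstructed since the figure is missing):

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$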
As a best practice, the minority class is treated as the positive class.

Averaging Metrics - Macro & Weighted


  • Weighted: the majority class still dominates the average, so it is generally not suitable for imbalanced datasets.
  • Macro: every class contributes equally, giving a more balanced metric.
Averaging Metrics - Balanced Accuracy

  • Same as macro-averaged recall
  • Equal to standard accuracy on balanced datasets
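For the binary case (standard form, reconstructed):

$$\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$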

Precision-Recall (PR) Curve


Lowering the threshold below 0.5 means a negative prediction is made only when the model is very confident the sample is negative. This reduces false negatives and therefore increases recall.

● A precision-recall curve shows the relationship between precision and recall at every cut-off point.
● Visualize effect of selected threshold on performance.

Receiver Operating Curve (ROC)

● Another useful tool to visualize the performance of a classification model
● ROC depicts the relationship between False Positive Rate (FPR) (FP / (TN + FP)) and True Positive Rate/Recall (TPR)
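A short sketch of computing both curves and their summary scores with scikit-learn (assumes a fitted model with predict_proba and the test split from earlier):

```python
from sklearn.metrics import (precision_recall_curve, roc_curve,
                             average_precision_score, roc_auc_score)

# Predicted probabilities for the positive class
y_score = model.predict_proba(X_test)[:, 1]

prec, rec, pr_thresholds = precision_recall_curve(y_test, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)

print("AP:",    average_precision_score(y_test, y_score))
print("AUROC:", roc_auc_score(y_test, y_score))
```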

Comparing models using curves


AUROC = 0.5: equivalent to a random or constant predictor.
AUROC < 0.5: worse than random; inverting the predictions would perform better.

AP vs. AUROC
  • AP & AUROC are both ranking metrics

  • AP indicates whether your model can correctly identify all positive examples without accidentally marking too many negative examples as positive

  • AUROC measures whether the model is able to rank positive examples higher than negative examples (another way to look at it: sample one positive and one negative example from the dataset; AUROC is the probability that the positive sample receives a higher score)

  • It is easier to tell whether the model performs better than random using AUROC (baseline 0.5) than using AP (whose baseline depends on class prevalence)

  • On imbalanced datasets, AP is the better indicator of model performance (AUROC is built from rates, TPR = TP/P and FPR = FP/N, so even if P and N change a lot, AUROC does not change much)

  • For ranking metrics, the exact probabilities do not actually matter: as long as the ranking does not change, the ranking metrics do not change.

Multi-class Classification

Largely the same as binary classification; metrics are computed per class and then averaged.

Choosing the Right Metric
  • Problem-specific
  • Balanced accuracy is better than accuracy (most of the time)
  • Cost associated with misclassification
    • Predicting that an individual has no cancer when he/she has cancer (a false negative) is far costlier than the other way round
    • Predicting an email as spam when it is not (a false positive) has a higher cost than predicting a spam email as not spam
  • Choose precision when cost of false positives is high (Type I error)
  • Choose recall when cost of false negatives is high (Type II error)
Evaluation Metrics for Regression

MSE is sensitive to outliers, since errors are squared.
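Common regression metrics in standard form, reconstructed since the figure is missing:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad MAE = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \quad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$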

Calibration

Motivation

  • Many supervised learning algorithms predict probabilities (as a precursor to predicted labels)
  • In many classification tasks, the probability of belonging to a class is as important as the predicted label itself
  • However, we need these probabilities to be well calibrated
  • Calibrated probabilities mean that the predicted probability reflects the true likelihood of the event
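A hedged sketch of probability calibration with scikit-learn (method="sigmoid" is Platt scaling; "isotonic" is the nonparametric alternative; model and splits assumed from earlier):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Wrap a classifier with probability calibration (Platt scaling here)
calibrated = CalibratedClassifierCV(model, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Reliability diagram data: observed frequency vs. mean predicted probability
prob_pos = calibrated.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
```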

Automatic Machine Learning
