DS Wannabe Prep Study Notes: Machine Learning Algo 1

This article covers the concepts of model underfitting and overfitting, emphasizing the role of regularization in preventing overfitting, including the differences between L1 regularization, L2 regularization, and Elastic Net. It also discusses methods for handling imbalanced data such as data augmentation, oversampling, and undersampling, as well as the different application scenarios of supervised, unsupervised, and reinforcement learning.

First, a review of the basics.

Defining Model Underfitting and Overfitting

Underfitting: the model isn't able to capture the relationship between the dataset's independent variables (e.g., weight, height) and the dependent variable (e.g., price). How to reduce:

  1. Add more variables or model features to help the model learn more patterns from the training data.
  2. Increase the number of iterations the model trains for before training is stopped.

Overfitting: the model fits the training data too closely, finding patterns that happen to be in the training set but not elsewhere. How to reduce: regularization (covered next).
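
To make the two failure modes concrete, here is a minimal sketch (the synthetic dataset and scikit-learn pipeline are my own illustration, not from the original notes): a degree-1 polynomial underfits a curved relationship, while a degree-15 polynomial overfits the training noise.

```python
# Hypothetical illustration: underfitting vs. overfitting via polynomial degree.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy nonlinear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple / about right / too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(model.score(X_train, y_train), 2),  # fit on training data
          round(model.score(X_test, y_test), 2))    # generalization to new data
# Expect: degree 1 scores poorly on both sets (underfitting); degree 15 scores
# well on train but worse on test (overfitting); degree 4 balances the two.
```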

Regularization

Regularization in machine learning is a technique used to prevent a model from overfitting. Overfitting occurs when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the model becomes too complex, capturing patterns that may not be present in the test data or in new data it encounters after deployment.

Here are the key points about regularization:

  1. Purpose: Regularization techniques are used to simplify models without substantially decreasing their accuracy. They do this by adding some form of penalty or constraint to the model optimization process.

  2. Types of Regularization:

    • L1 Regularization (Lasso):  Adds a penalty equivalent to the absolute value of the magnitude of coefficients. This can lead to some coefficients being zero, which is useful for feature selection.
    • L2 Regularization (Ridge): Adds a penalty equivalent to the square of the magnitude of coefficients. This doesn't reduce coefficients to zero but makes them smaller, leading to a less complex model.
    • Elastic Net: Combines L1 and L2 regularization and can be used to balance between feature selection (L1) and feature shrinkage (L2).
  3. Effect on Model Complexity: Regularization typically leads to a decrease in model complexity, which can reduce overfitting. This is done by penalizing the weights of the model, thereby discouraging overly complex models that fit the noise in the training data.

  4. Choosing the Regularization Term: The strength of the regularization is controlled by a hyperparameter, often denoted as lambda (λ) or alpha. The higher the value of this hyperparameter, the stronger the regularization effect. Selecting the right value is critical and is usually done using cross-validation.

  5. Bias-Variance Tradeoff: Regularization is a key technique in managing the bias-variance tradeoff in machine learning. By adding regularization, we increase the bias but decrease the variance, hopefully leading to a better overall model performance on unseen data.

  6. Application in Different Algorithms: While regularization is most commonly talked about in the context of linear models (like linear regression and logistic regression), it's also applicable to other algorithms, including neural networks, where techniques like dropout and weight decay are forms of regularization.
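
As a minimal sketch of the three penalty types (assuming scikit-learn; the synthetic dataset and alpha values are hypothetical choices):

```python
# Sketch: L1 (Lasso), L2 (Ridge), and Elastic Net penalties in scikit-learn.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic data where only 5 of the 20 features actually carry signal.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

models = [("L1 / Lasso", Lasso(alpha=1.0)),
          ("L2 / Ridge", Ridge(alpha=1.0)),
          ("Elastic Net", ElasticNet(alpha=1.0, l1_ratio=0.5))]

for name, model in models:
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name}: {n_zero} of {model.coef_.size} coefficients exactly zero")
# Expect Lasso (and Elastic Net) to zero out irrelevant features, while Ridge
# only shrinks them. alpha plays the role of lambda and is usually tuned with
# cross-validation (e.g., LassoCV / RidgeCV / ElasticNetCV).
```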

Interview question 3-1: What is L1 versus L2 regularization?

Example answer

L1 regularization, also known as lasso regularization, adds a penalty term proportional to the absolute value of the model's coefficients; this shrinks the parameters toward zero and can set some of them exactly to zero, which effectively performs feature selection. L2 regularization (also known as ridge regularization) adds a penalty term to the objective function that is proportional to the square of the coefficients of the model. This penalty term also shrinks the coefficients toward zero, but unlike L1 (lasso) regularization, it does not make any of the coefficients exactly equal to zero.

L2 regularization can help reduce overfitting and improve the stability of the model by keeping coefficients from becoming too large. Both L1 and L2 regularization are commonly used to prevent overfitting and improve the generalization of ML models.
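
To make the two penalty terms concrete, here is a tiny numeric sketch; the coefficient vector w and the lambda value are hypothetical:

```python
# Hypothetical coefficient vector and regularization strength.
import numpy as np

w = np.array([0.0, -2.0, 0.5, 3.0])  # model coefficients
lam = 0.1                            # lambda/alpha, the regularization strength

l1_penalty = lam * np.sum(np.abs(w))  # lasso adds  lam * sum(|w_i|)
l2_penalty = lam * np.sum(w ** 2)     # ridge adds  lam * sum(w_i**2)

# The regularized objective is: data loss (e.g., MSE) + penalty.
# The L1 term's constant slope pushes small coefficients to exactly zero;
# the L2 term's slope vanishes near zero, so it only shrinks them smoothly.
print(l1_penalty, l2_penalty)  # 0.55 and 1.325
```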

Interview question 3-2: How do you deal with the challenges that come with an imbalanced dataset?

Example answer

Imbalanced datasets in ML refer to datasets in which some classes or categories outweigh others. Techniques to deal with imbalanced datasets include data augmentation, oversampling, undersampling, ensemble methods, and so on:

Data augmentation

Data augmentation involves generating more examples for the ML model to train on, such as rotating images so that the dataset includes images of humans turned upside down as well as the normal upright orientation. Without data augmentation, the model might not be able to correctly recognize images of humans who are lying sideways or doing headstands, since the data is imbalanced toward humans in an upright pose.

Data augmentation is a technique used in machine learning to increase the diversity of the training dataset. The core idea is to create new, synthetic training samples by applying a series of transformations to existing data. This is important for improving the model's generalization, i.e., its performance on new, unseen data.
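
As a minimal sketch of such augmentations using plain NumPy (the random "image" array is a stand-in; real pipelines often use libraries such as torchvision):

```python
# Sketch: simple image augmentations as array transformations.
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in photo

flipped = np.fliplr(image)       # mirror: "looking right" becomes "looking left"
sideways = np.rot90(image)       # 90-degree turn: upright becomes lying sideways
headstand = np.rot90(image, 2)   # 180-degree turn: upright becomes upside down

augmented = [image, flipped, sideways, headstand]  # 1 example became 4
```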

Oversampling

Oversampling is a technique to increase the number of data points of a minority class via synthetic generation. As an example, SMOTE (synthetic minority oversampling technique) uses the feature vectors of the minority classes to generate synthetic data points that are located between real data points and their k-nearest neighbors. This can synthetically increase the size of the minority class(es) and improve the performance of the ML model trained on a dataset with oversampling treatment.
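
A minimal sketch, assuming the third-party imbalanced-learn package (imblearn), which provides a SMOTE implementation:

```python
# Sketch: oversampling a minority class with SMOTE (requires imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic binary dataset with a roughly 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print(Counter(y))  # roughly {0: 900, 1: 100}

# SMOTE interpolates between minority points and their k nearest neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes now balanced with synthetic minority samples
```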

Undersampling

Undersampling does the opposite: it reduces examples from the majority class to balance the number of data points of the majority class and minority class(es). Oversampling is generally preferred in practice since undersampling may cause useful data to be discarded, which is exacerbated when the dataset is already small.
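
Correspondingly, a minimal undersampling sketch, again assuming imbalanced-learn:

```python
# Sketch: random undersampling of the majority class (requires imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Randomly drop majority-class points until the classes are balanced.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # both classes now the size of the original minority
```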

Ensemble methods

Ensemble methods can also be used to increase model performance when dealing with an imbalanced dataset. Each model in the ensemble can be trained on a different subset of the data and can help learn the nuances of each class better.

Interview question 3-3: Explain boosting and bagging and what they can help with.

Example answer

Bagging and boosting are ensemble techniques used to improve the performance of ML models:

Bagging

Bagging trains multiple models on different subsets of the training data and combines their predictions to make a final prediction.

Boosting

Boosting trains a series of models, where each model tries to correct the mistakes made by the previous one. The final prediction is made by combining the predictions of all the models. Ensemble techniques can help with a variety of issues encountered during ML training; for example, they can help with imbalanced data and reduce overfitting.
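
A minimal sketch of both techniques, assuming scikit-learn and synthetic data:

```python
# Sketch: bagging vs. boosting with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)

# Bagging: many models (decision trees by default) trained independently on
# bootstrap samples of the data; their predictions are combined by voting.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: trees trained sequentially, each focusing on the errors of the
# ensemble built so far; predictions are a weighted combination of all trees.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))
```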

Supervised, Unsupervised, and Reinforcement Learning

Defining Labeled Data

An example of unlabeled data is when you have the prices and weights of the apples but not the apple variants, yet you try to deduce commonalities within different variants of apples. Because you don’t initially have the correct or expected “label”—in this case, the apple variant—you would be using unlabeled data and conducting unsupervised learning.

Summarizing Supervised Learning 

Supervised learning is the first type of machine learning, defined by its use of labeled data. Supervised learning uses correct or expected outcomes from the past to predict the dependent variables for new or future data points.

Defining Unsupervised Learning

Unsupervised learning is training a model with unlabeled data: when you do not have the "labels" available (the labels being the correct or expected values that you are looking for). In this setting, you would likely use unsupervised learning to find patterns, commonalities, or anomalies in the dataset, without the ML model having prior knowledge of correct or expected result labels.

Common usage of unsupervised learning includes clustering and dimensionality reduction (see fig 3.6).

Summarizing Semisupervised and Self-Supervised Learning

Semisupervised learning uses a small amount of labeled data (usually manually labeled) to train a separate ML model specifically meant to machine-label previously unlabeled data. The initial labeled dataset is then combined with the machine-generated labels that have the highest confidence to create a larger labeled dataset, as illustrated in fig 3.8.
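
One way to sketch this idea is scikit-learn's self-training wrapper, which pseudo-labels the most confident predictions; the dataset and threshold below are hypothetical:

```python
# Sketch: semisupervised self-training. Unlabeled points are marked with -1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Pretend 90% of the labels were never collected: mark them as -1 (unlabeled).
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.9] = -1

# The wrapper trains on the labeled 10%, pseudo-labels unlabeled points whose
# predicted probability exceeds the threshold, then retrains on the larger set.
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
model.fit(X, y_partial)
print(round((model.predict(X) == y).mean(), 3))  # accuracy vs. the full truth
```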

Summarizing Reinforcement Learning

RL learns through trial and error: an agent takes actions in an environment and adjusts its behavior based on the rewards it receives.

RL is commonly used in gaming, robotics, and self-driving cars, but RL can also be used for a growing number of applications that previously used supervised learning, such as a system that recommends videos on YouTube.
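
As a minimal sketch of trial-and-error learning, here is tabular Q-learning on a toy, hypothetical "chain" environment (my own illustration, not from the original notes):

```python
# Sketch: tabular Q-learning on a 5-state chain; the goal is the last state.
import numpy as np

n_states, n_actions = 5, 2             # actions: 0 = step left, 1 = step right
Q = np.zeros((n_states, n_actions))    # the table of action values to learn
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:           # reaching the last state ends an episode
        # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == n_states - 1 else 0.0  # reward only at the goal
        # Q-learning update: nudge Q toward reward + discounted future value.
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))  # learned values favor action 1 (right) in every state
```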

Sample Interview Questions on Supervised and Unsupervised Learning

Interview question 3-4: What are common algorithms in supervised learning?

Example answer

The regression family of algorithms includes linear regression and logistic regression (which, despite its name, is commonly used for classification, especially binary classification), among other algorithms such as generalized linear models (GLMs) and various time-series regression models such as autoregressive integrated moving average (ARIMA).

The decision tree family of algorithms can be used for both classification and regression tasks within supervised learning; these include XGBoost, LightGBM, CatBoost, and so on. Decision trees can be combined in random forest algorithms, which ensemble (combine) a multitude of decision trees to improve prediction accuracy and stability. Like individual decision trees, random forests can be used for both classification and regression tasks under supervised learning.

Neural networks can be used for supervised learning tasks as well as unsupervised learning. In supervised learning, they are widely applied to many of the tasks in this section, such as image classification, object detection, speech recognition, and natural language processing (NLP).

Other algorithms include naive Bayes, which is a supervised classification algorithm that uses Bayes' theorem. Applications of Bayes' theorem in ML include Bayesian neural networks, which predict a distribution of results (for example, a normal model might predict the price is $100, but the Bayesian model will predict the price is $100 with a standard deviation of 5).
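
A quick sketch running one representative of each family on the same synthetic task (scikit-learn stand-ins; XGBoost, LightGBM, and CatBoost are separate libraries not shown here):

```python
# Sketch: one model from each supervised family on the same dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)

models = {
    "logistic regression (regression family)": LogisticRegression(max_iter=1000),
    "random forest (decision tree family)": RandomForestClassifier(random_state=0),
    "naive Bayes (Bayes' theorem)": GaussianNB(),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```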

Interview question 3-5: What are some common algorithms used in unsupervised learning? How do they work?

Example answer

Unsupervised learning is commonly used for clustering, anomaly detection, and dimensionality reduction. I’ll group the algorithms by those categories. 

Clustering is often done with algorithms such as k-means clustering and density-based clustering (DBSCAN algorithm). 

K-means clustering groups the data into k clusters: the algorithm iteratively assigns each data point to the nearest cluster centroid, then updates the centroids, and continues until the cluster assignments reach a stable state and no longer shift or change.

DBSCAN is a popular algorithm that groups together data points that are close to one another (high density) and separates clusters from one another depending on their distance. Because unsupervised learning algorithms can handle large class imbalances, algorithms such as DBSCAN are also commonly used for anomaly detection.
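
A minimal sketch of both clustering algorithms, assuming scikit-learn and synthetic blob data (the cluster count and eps values are hypothetical):

```python
# Sketch: k-means and DBSCAN on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means: assign points to the nearest of k centroids, update, repeat.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN needs no cluster count; eps is the neighborhood radius, and points
# in low-density regions are labeled -1 (useful for anomaly detection).
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(set(dbscan_labels))  # clusters 0..k-1, plus -1 for outliers (if any)
```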

There are many algorithms that can be used for dimensionality reduction. Principal component analysis (PCA) can “flatten” datasets into a lower-dimensional space. This is useful for data preprocessing since it can reduce the number of redundant features that are used while keeping the variance in the data so that enough signals and patterns are preserved in the data.
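
A minimal PCA sketch, assuming scikit-learn; the synthetic data is built from 5 hidden factors so that most of the 50 features are redundant:

```python
# Sketch: dimensionality reduction with PCA on mostly redundant features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))            # 5 true underlying factors
X = latent @ rng.normal(size=(5, 50))         # 50 mostly redundant features
X += 0.01 * rng.normal(size=X.shape)          # a little noise

pca = PCA(n_components=0.95)     # keep enough components for 95% of variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # expect roughly (200, 50) -> (200, 5)

X_approx = pca.inverse_transform(X_reduced)  # near-lossless reconstruction here
```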

Autoencoders are a type of unsupervised learning with a broad range of applications, notably in NLP, but not limited to it. They can be used to encode a compressed representation of input text, which is also a form of dimensionality reduction, and then decode the compressed representation to generate the next chunk of text data. This is useful for text completion and text summarization tasks. As a subset of unsupervised learning, self-supervised learning is also a case where autoencoders can be used; examples include self-supervised learning to fill in missing parts of images or to fix audio and video.
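
A tiny autoencoder sketch, assuming PyTorch; the layer sizes and training loop are hypothetical choices for illustration:

```python
# Sketch: an autoencoder compresses inputs, then reconstructs them (no labels).
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(50, 8), nn.ReLU())  # compress 50 -> 8 dims
decoder = nn.Sequential(nn.Linear(8, 50))             # reconstruct 8 -> 50 dims
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, 50)               # unlabeled data: the input is the target
for step in range(200):
    reconstruction = model(X)
    loss = loss_fn(reconstruction, X)  # reconstruction error, no labels needed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

codes = encoder(X)                     # 8-dim compressed representations
print(codes.shape, float(loss))
```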

Interview question 3-6: What are the differences between supervised and unsupervised learning?

Example answer

The major difference between the two types of machine learning is related to the training data that is used.

Supervised learning uses labeled data while unsupervised learning uses unlabeled data. 

Labeled data means that the correct output or result the ML model should produce is already included in the training dataset.

Supervised and unsupervised learning also differ in terms of the ML model outputs.

In supervised learning, the ML model aims to predict what the label would be.

Unsupervised learning doesn’t predict specific label(s) but rather tries to find latent patterns and groupings within the dataset, which can be used to cluster new data points.

In terms of evaluation, the two types of ML are assessed differently.

Supervised learning is evaluated by comparing its outputs with the correct output (with the test/holdout/validation datasets). 

In unsupervised learning, the model is evaluated based on how well it groups or captures patterns within the data, via metrics such as the Jaccard score or silhouette index for clustering, and receiver operating characteristic (ROC) curve / area under the curve (AUC) metrics for comparing positive rates in anomaly detection.

Finally, supervised learning and unsupervised learning are generally used for different types of tasks. Supervised learning is often used for classification (predicting the correct category) or regression (predicting the correct value) tasks while unsupervised learning is often used for clustering, anomaly detection, and dimensionality reduction tasks.
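
A minimal sketch of the two evaluation styles, assuming scikit-learn (accuracy against held-out labels for the supervised model, silhouette score for the clustering):

```python
# Sketch: evaluating a supervised vs. an unsupervised model differently.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, silhouette_score

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised evaluation: compare predictions against held-out true labels.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", round(accuracy_score(y_test, clf.predict(X_test)), 3))

# Unsupervised evaluation: no labels used; score cohesion/separation instead.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", round(silhouette_score(X, labels), 3))
```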

Interview question 3-7: What are scenarios where you would use supervised learning but not unsupervised learning, and vice versa? Please illustrate with some real-world examples.

Example answer

Unsupervised learning and supervised learning differ in their use of results or labels. Hence, unsupervised learning is most suitable for cases where labeled data is not available, or when the task isn't to predict a "correct" output but rather to find patterns or anomalies in the data.

As a real-world example, supervised learning can be used for classification and object detection, such as in image recognition tasks. In the training dataset I’ll have the correct objects labeled, and the algorithms will then know if they’re learning to detect objects correctly based on comparing their predictions with the ground truth.

In other words, if the algorithm isn’t correctly boxing faces in images, I’d know since I’ll have each image (with the faces correctly boxed) to compare to. Other scenarios for supervised learning could include predicting the price of a rare trading card based on its features, such as its age, its series name, and the condition of the card. Given that I have a dataset with fraudulent data already correctly labeled, fraud detection could also be an application of supervised learning. If I didn’t already have labeled data about fraudulent behavior, I might opt to use unsupervised learning instead, via detecting anomalous behaviors.

[Thought: if unsupervised learning is more suitable than supervised learning for general warning signals, such as abnormal bank transactions, which one would a trust & safety (T&S) product use?]

My guess: Clustering is an unsupervised learning task, and a real-world application could be to group customers into segments based on their features (e.g., behavior, preferences), something that businesses might use to identify how they can tailor products to users in a cluster or to target marketing campaigns. If I investigate a cluster and it shows that young professionals have similar behaviors via the clustering algorithm, then we might know that we can give them similar promotional materials in the company's next digital ad campaign.

Interview question 3-8: What is a common issue that you might run into while implementing supervised learning, and how would you address it?

Example answer

One common problem that can affect supervised learning is the lack of labeled data. For example, when I want to classify specific cartoon and anime characters in images with ML, I don't have labeled data available on the internet to download and use. There are open source datasets such as CIFAR, which are labeled for general objects and items, but when it comes to more specific use cases, I would have to acquire and label images myself (for personal use).

I had to address the issue of not having enough labeled data; in this case, hand labeling a few examples worked as a starting point. However, there still weren’t enough labeled examples, which resulted in an imbalanced dataset. To artificially increase the amount of labeled data, I used data augmentation, creating synthetic data and variations on existing data to make the ML model more robust. An example of data augmentation in image recognition is to randomly flip or rotate images. To illustrate why this can increase samples, if I flip one upright anime character looking to the right, it becomes two data points for the model to learn from: one looking right and one looking left. Rotation can also help: can the ML algorithm correctly identify anime characters who are leaning sideways or who are even upside down, doing a headstand?
