Chapter 4 Content
Advice for beginners with no ML work experience:
APPROXIMATE WORK EXPERIENCE WITH PERSONAL PROJECTS
If you don’t have ML work experience, you can still approximate the answer of someone who built a similar project at work by doing the following:
- Try out a few simple non-ML baseline rules and heuristics and note their performance.
- Think about potential ways an ML approach could improve on a heuristic approach. Common real-world reasons include saving on manual work or making the work more generalizable, since ML can take more features into account after model retraining.
Defining a Machine Learning Problem
Introduction to Feature Engineering
Handling missing data with imputation
There are common imputation techniques for handling missing data that you should be able to mention in an interview, along with their pros and cons. These include filling in with the mean or median value and using a tree-based model.
Technique | Pros | Cons |
---|---|---|
Mean/median/mode | Simple to implement | Might not account for outliers compared to tree-based methods; not as suitable for categorical variables |
Tree-based models | Can capture more underlying patterns; suitable for both numerical and categorical variables | Adds a level of complexity during data preprocessing; model needs to be retrained if the underlying distribution of data changes |
However, if you use the mean of all the available data before splitting it into training, validation, and test sets, then it inevitably captures traits of the test set. Hence, the ML model will be trained on imputed data that contains latent information about the test set, which sometimes causes the accuracy to increase for no reason other than the way the data was imputed.
This is called data leakage. If you want to use imputation, then be sure to split the training, validation, and test sets first, and impute missing values in the training set with the summary statistics of the training set only. If you don’t mention this in the interview or explain it correctly, that’s a pretty obvious oversight in your ML model, unless you can defend your reasoning.
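The split-then-impute order described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a production recipe; the function name `split_then_impute`, the `n_train` parameter, and the sample values are assumptions made for the example.

```python
from statistics import mean

def split_then_impute(values, n_train):
    """Split first, then fill missing values (None) with the training mean only."""
    train, test = values[:n_train], values[n_train:]
    # The fill value is computed from observed training values only,
    # so nothing about the test split leaks into it.
    train_mean = mean(v for v in train if v is not None)

    def fill(split):
        return [train_mean if v is None else v for v in split]

    return fill(train), fill(test), train_mean

values = [1.0, None, 3.0, 5.0, None, 100.0]
train, test, train_mean = split_then_impute(values, n_train=4)

# For contrast: the leaky mean over all observed values, including the test split.
leaky_mean = mean(v for v in values if v is not None)
```

Here the training mean is 3.0, while the mean over all observed values is pulled far higher by the test-set value 100.0, which is exactly the information that should not reach the training data.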
Standardizing data
After you handle missing and duplicate data, the data should be standardized. This includes handling outliers, scaling features, and ensuring that data types and formats are consistent:
Handle outliers
Techniques for handling outliers include removing extreme outliers from the dataset, replacing them with less extreme values (known as winsorizing), and applying logarithmic scale transforms. I’d caution against removing outliers without domain knowledge, since in some domains doing so has severe consequences. For example, removing horse-carriage images from a self-driving car training dataset just because horse carriages aren’t a common type of vehicle might cause the model to fail to recognize them in the real world. Hence, carefully evaluate the impact before deciding on a particular technique.
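As a rough sketch of two of these techniques, the snippet below winsorizes a list by clipping the `k` smallest and `k` largest values to their nearest remaining neighbors, and applies a log transform. The count-based parameter `k` is an assumption chosen for simplicity; percentile-based bounds are another common way to set the limits.

```python
import math

def winsorize(values, k=1):
    """Clip the k smallest and k largest values to the nearest remaining value."""
    ordered = sorted(values)
    lower, upper = ordered[k], ordered[-k - 1]
    return [min(max(v, lower), upper) for v in values]

def log_transform(values):
    """Compress large values with log(1 + x); assumes non-negative inputs."""
    return [math.log1p(v) for v in values]

raw = [3, 1, 500, 4, 2, 5]
clipped = winsorize(raw, k=1)   # 500 is pulled down to 5, and 1 is pulled up to 2
logged = log_transform(raw)
```

Unlike deletion, both options keep every observation in the dataset, which matters when rare cases carry real meaning.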
Scale features
For datasets with multiple numerical features, ML algorithms might misconstrue larger values as having more impact. For example, one column is price, which ranges from $50 to $5,000, while another feature is the number of times an ad is shown, which ranges from 0 to 10. The two features are in different units, but both are numerical, so it is possible that the price column will be treated as having a higher magnitude of impact. Some models, such as gradient-descent-based models, are more sensitive to the scale of features. Hence, it’s better to scale the features so that they fall within a range such as [-1, 1] or [0, 1].
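A minimal sketch of min-max scaling to [0, 1], using the price and ad-count ranges from the example above. The function names and the three sample values per column are assumptions for illustration; in a real pipeline the minimum and maximum would be taken from the training split only, for the same leakage reasons discussed earlier.

```python
def fit_min_max(train_column):
    """Learn the scaling bounds from the training data."""
    return min(train_column), max(train_column)

def min_max_scale(column, lower, upper):
    """Map values into [0, 1] using previously learned bounds."""
    span = upper - lower
    return [(v - lower) / span if span else 0.0 for v in column]

prices = [50, 2525, 5000]   # dollars
ad_views = [0, 5, 10]       # number of times the ad was shown

scaled_prices = min_max_scale(prices, *fit_min_max(prices))
scaled_views = min_max_scale(ad_views, *fit_min_max(ad_views))
```

After scaling, both columns span the same range, so neither dominates purely because of its units.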
Data type consistency
I was once working on an ML model and got results that weren’t what I expected, and it took me a while to debug. Finally, I identified the issue: a numerical column was formatted as a string! Surveying your final data types to ensure that they will make sense once fed into your ML model will be useful before you go through the rest of the process; consider it a part of quality assurance (QA).
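One way to catch this kind of issue early is a small check that flags columns whose values are all strings that parse as numbers. The sketch below assumes the data is a list of dictionaries; the helper name `numeric_string_columns` and the sample rows are hypothetical.

```python
def numeric_string_columns(rows):
    """Return column names whose values are all strings that parse as numbers."""
    def parses_as_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    return [
        column
        for column in rows[0].keys()
        if all(isinstance(r[column], str) and parses_as_number(r[column]) for r in rows)
    ]

rows = [
    {"price": "19.99", "city": "Toronto"},
    {"price": "5", "city": "Lima"},
]
suspect = numeric_string_columns(rows)  # columns that probably should be numeric
```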
Data preprocessing
Preprocessing data will allow for features to make sense to the ML model in the context of the type of algorithm you’re using. Preprocessing for structured data can include one-hot encoding, label encoding, binning, feature selection, and so on.
One-hot encoding of categorical data
You may want to represent categorical data as numerical data. Each category becomes a feature, with 0 or 1 representing the state of that feature in each observation. For example, imagine a simple weather dataset covering March 1, March 2, and March 3, where it’s only possible to have sunny or cloudy weather. Each date gets a 1 in the column for the weather it had and a 0 in the other column.
One-hot encoding is often used because many ML algorithms work only with numerical inputs and don’t take in categorical values. This has improved over the years: some implementations now accept categorical values and transform them behind the scenes.
One downside of one-hot encoding is that for features originally with high cardinality (there are lots of unique values in that feature), one-hot encoding can cause the feature count to increase drastically, which can be computationally more expensive.
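A minimal sketch of one-hot encoding for the sunny/cloudy example. The three weather values are made up for illustration, and the `one_hot` helper is an assumption rather than any library's API.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per observation."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

weather = ["sunny", "cloudy", "sunny"]  # hypothetical values for three days
categories, encoded = one_hot(weather)
```

The number of columns equals the number of unique categories, which is why a high-cardinality feature can inflate the feature count so quickly.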
Sample Interview Questions on Data Preprocessing and Feature Engineering
Interview question 4-1: What’s the difference between feature engineering and feature selection?
Example answer
Feature engineering is about creating or transforming features from raw data. This is done to better represent the data and make the data more suitable for ML compared to its raw format. Common techniques include handling missing data, standardizing data formats, and so on.
Feature selection is about narrowing down relevant ML features to simplify the model and prevent overfitting. Common techniques include PCA (principal component analysis) or using tree-based models’ feature importance to see which features contribute more useful signals.
Interview question 4-2: How do you prevent data leakage issues while conducting data preprocessing?
Example answer
Being cautious with training, validation, and test data splits is one of the most common ways to prevent data leakage. However, things aren’t always so simple. For example, in the case when data imputation is done with the mean value of all observations in the feature, that means the mean value contains information about all observations, not just the training split. In that case, make sure to conduct data imputation with only information about the training split, on the training split. Other examples of data leakage could include time-series splits; we should be careful that we don’t accidentally shuffle and split the time series incorrectly (e.g., using tomorrow to predict today instead of the other way around).
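For the time-series case, a simple safeguard is to sort by time and hold out the most recent records instead of shuffling. The sketch below assumes ISO-formatted date strings (which sort correctly as text); the record layout and function name are hypothetical.

```python
def chronological_split(records, n_test):
    """Train on the earliest records and test on the latest n_test (n_test >= 1)."""
    ordered = sorted(records, key=lambda r: r["date"])
    return ordered[:-n_test], ordered[-n_test:]

records = [
    {"date": "2024-01-03", "y": 7},
    {"date": "2024-01-01", "y": 5},
    {"date": "2024-01-04", "y": 8},
    {"date": "2024-01-02", "y": 6},
]
train, test = chronological_split(records, n_test=1)
```

Every training record is older than every test record, so the model is never asked to use tomorrow to predict today.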
Interview question 4-3: How do you handle a skewed data distribution during feature engineering, assuming that the minority data class is required for the machine learning problem?
Example answer
Sampling techniques, such as oversampling the minority data classes, could help during preprocessing and feature engineering (for example, using techniques like SMOTE). It’s important to note that for oversampling, any duplicate or synthetic instances should be generated only from the training data to avoid data leakage with the validation or test set.
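The sketch below shows plain random oversampling applied to the training split only. Note that this is simple duplication, not SMOTE, which synthesizes new points by interpolating between neighbors; the function name and the toy data are assumptions for illustration.

```python
import random

def oversample_minority(features, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())

    out_x, out_y = [], []
    for y, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out_x.extend(rows + extra)
        out_y.extend([y] * target)
    return out_x, out_y

# Training split only; validation and test splits are left untouched.
train_x = [[0.1], [0.2], [0.3], [0.4], [0.9]]
train_y = [0, 0, 0, 0, 1]
balanced_x, balanced_y = oversample_minority(train_x, train_y)
```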
Interview considerations
Interviewers will want to make sure of the following:
- You are knowledgeable about common ML tasks in their field.
- You are knowledgeable about common algorithms related to said tasks.
- You know how to evaluate those models.
Defining the ML Task
Overview of Model Selection
Classification
Algorithms include decision trees, random forest, and the like. Example Python libraries to start with include scikit-learn, CatBoost, and LightGBM.
Regression
Algorithms include linear regression, decision trees, and the like. Example Python libraries to start with are scikit-learn and statsmodels.
Clustering (unsupervised learning)
Algorithms include k-means clustering, DBSCAN, and the like. An example Python library to start with is scikit-learn.
Time-series prediction
Algorithms include ARIMA, LSTM, and the like. Example Python libraries to start with include statsmodels, Prophet, Keras/TensorFlow, and so on.
Recommender systems
Algorithms include collaborative filtering techniques such as matrix factorization. Example libraries and tools to start with include Spark’s MLlib or Amazon Personalize on AWS.
Reinforcement learning
Algorithms include multiarmed bandit, Q-learning, and policy gradient. Example libraries to start with include Vowpal Wabbit, TorchRL (PyTorch), and TensorFlow-RL.
Computer vision
Deep learning techniques are common starting points for computer vision tasks. OpenCV is an important computer vision library that also supports some ML models. Popular deep learning frameworks include TensorFlow, Keras, PyTorch, and Caffe.
Natural language processing
All the deep learning frameworks mentioned before can also be used for NLP. In addition, it’s common to try out transformer-based methods or find something on Hugging Face. Nowadays, using the OpenAI API and GPT models is also common. LangChain is a fast-growing library for NLP workflows. There is also Google’s recently launched Bard.
Overview of Model Training
Hyperparameter tuning
Hyperparameter tuning is where you select the optimal hyperparameters for the model via manual tweaks, grid search, or even AutoML. Hyperparameters include traits or architecture of the model itself, such as learning rate, batch size, the number of hidden layers in a neural network, and so on. Each specific model might have its own hyperparameters, such as changepoint and seasonality prior scale in Prophet. The goal of hyperparameter tuning is to see, for example, whether a higher learning rate makes the model converge faster and perform better.
It is important to have a good system to keep track of hyperparameter-tuning experiments so that the experiments can be reproducible. Imagine the pain if you saw a model run that yielded great results, but because the edits were made directly to the script, you lost the exact changes and weren’t able to reproduce the good results! Tracking will be discussed more in “Experiment tracking”.
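A minimal sketch of grid search that records every run, so that the best configuration can be reproduced later. The `validation_loss` function is a hypothetical stand-in for training a model and scoring it on the validation set, and the grid values are arbitrary.

```python
from itertools import product

def validation_loss(learning_rate, batch_size):
    # Hypothetical stand-in for "train the model, then score it on validation data".
    return (learning_rate - 0.01) ** 2 + (batch_size - 32) ** 2 * 1e-6

grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}

experiments = []  # one record per run, instead of editing the script in place
for lr, bs in product(grid["learning_rate"], grid["batch_size"]):
    experiments.append(
        {"learning_rate": lr, "batch_size": bs, "val_loss": validation_loss(lr, bs)}
    )

best = min(experiments, key=lambda run: run["val_loss"])
```

Keeping the full `experiments` list (or writing it to a file or a tracking tool) is what makes a good run recoverable.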
ML loss functions
Loss functions in ML measure the difference between the model’s predicted outputs and the ground truth. A goal of the model is to minimize the loss function since by doing so, the model is making the most accurate predictions based on your definition of accuracy in the model. Examples of ML loss functions include mean squared error (MSE) and mean absolute error (MAE).
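The two losses named above can be written out directly. The sample values are made up; the last prediction is deliberately far off to show that MSE penalizes large errors much more heavily than MAE does.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 12.0, 11.0, 13.0]
y_pred = [11.0, 11.0, 12.0, 23.0]  # the last prediction is off by 10

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```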
ML optimizers
Optimizers are how the ML model’s parameters are adjusted to minimize the loss function. Many frameworks let you swap the optimizer; for example, PyTorch provides over a dozen optimizers to select from. Adam and Adagrad are popular choices. You have likely already tuned the model’s hyperparameters to improve performance, so changing the optimizer could be an additional lever to pull, depending on the structure of your model and any hypothesized reasons why your current optimizer isn’t working out.
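To make the idea concrete, here is a sketch of the simplest optimizer, plain gradient descent, fitting a one-parameter model `y = w * x` by minimizing MSE. It is not Adam or Adagrad, which add per-parameter adaptive step sizes on top of this basic loop; the learning rate, step count, and data are assumptions for the example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    """Fit y = w * x by repeatedly stepping against the gradient of the MSE."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * gradient
    return w

# Data generated by y = 2x, so the fitted weight should approach 2.
w = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```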
Sample Interview Questions on Model Selection and Training
Interview question 4-4: In what scenario would you use a reinforcement learning algorithm rather than, say, a tree-based method?
Example answer
RL algorithms are useful when it’s important to learn from trial and error and the sequence of actions is important. RL is also useful when the outcome can be delayed but we want the RL agent to be continuously improving. Examples include game playing, robotics, recommender systems, and so on.
In contrast, tree-based methods, such as decision trees or random forests, are useful when the problem is static and nonsequential. In other words, it’s not as useful to account for delayed rewards or sequential decision making, and a static dataset (at the time of training) is sufficient.
Interview question 4-5: What are some common mistakes made during model training, and how would you avoid them?
Example answer
Overfitting is a common problem: the resulting model captures overly complex information in the training data and doesn’t generalize well to new observations. Regularization techniques can be used to prevent overfitting.
Not tuning common hyperparameters could cause models to underperform, since the default hyperparameters often aren’t the best choice out of the box.
Overengineering the problem could also cause issues during model training; sometimes it’s best to try out a simple baseline model before jumping right into very complex models or combinations of models.
Interview question 4-6: In what scenario might ensemble models be useful?
Example answer
When working with imbalanced datasets, where one class significantly outnumbers the others, ensemble methods can help improve the accuracy of results on minority data classes. By combining multiple models in an ensemble, we can reduce model bias toward the majority data class.
Summary of Common ML Evaluation Metrics
Classification metrics
Classification metrics are used to measure the performance of classification models. As a shorthand, note that TP = true positive, TN = true negative, FP = false positive, and FN = false negative, as illustrated in Figure 4-5. Here are some other terms and values to know:
- Precision = TP / (TP + FP) (as illustrated in Figure 4-6)
- Recall = TP / (TP + FN) (as illustrated in Figure 4-6)
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
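The formulas above can be checked on a small example. The labels and predictions below are made up, and the helper name `confusion_counts` is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

With these inputs the three metrics come out to 0.6, 0.75, and 0.625 respectively, which shows that they can disagree noticeably on the same predictions.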
With these terms, we can then construct various evaluations:
Confusion matrix
A summary of the TP/TN/FP/FN values in matrix form (as illustrated in Figure 4-7).
F1 score
Harmonic mean of precision and recall.
AUC (area under the ROC curve) and ROC (receiver operating characteristic)
The curve plots the true positive rate against the false positive rate at various thresholds.
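A sketch of the F1 score and of building ROC points by sweeping the threshold over the predicted scores, with AUC computed by the trapezoidal rule. The scores are made up, and treating a score greater than or equal to the threshold as a positive prediction is a convention chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, starting from (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking; this toy example lands in between because one negative outranks one positive.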
Regression metrics
Regression metrics are used to measure the performance of regression models. Here are some terms and values to know:
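- MAE: mean absolute error ($\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, the average absolute difference between the $n$ true values $y_i$ and the predictions $\hat{y}_i$)
- MSE: mean squared error
- RMSE: root mean squared error
- R²: R-squared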
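RMSE and R-squared can be computed from the same residuals. This sketch uses the standard definitions (RMSE as the square root of MSE, R-squared as one minus the ratio of residual to total sum of squares) on made-up values.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared residual."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]

error = rmse(y_true, y_pred)
fit = r_squared(y_true, y_pred)
```

RMSE is in the same units as the target, while R-squared is unitless, which makes the two useful together.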
Clustering metrics
Clustering metrics are used to measure the performance of clustering models. Using clustering metrics may depend on whether you have ground truth labels or not. Here I assume you do not, but if you do, then classification metrics can also be used. Here is a list of terms to be aware of:
Silhouette coefficient
Measures the cohesion of an item to other items in its cluster and separation with items in other clusters; ranges from -1 to 1
Calinski-Harabasz Index
A score meant to determine the quality of clusters; when the score is higher, it means clusters are dense and well separated
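As an illustration of the silhouette coefficient, the sketch below scores one-dimensional points using absolute difference as the distance. For each point, `a` is the mean distance to its own cluster and `b` is the mean distance to the nearest other cluster. It assumes at least two clusters and at least some distinct points; the data and labels are made up.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points with absolute-difference distance."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [
            abs(p - q)
            for j, (q, other) in enumerate(zip(points, labels))
            if other == label and j != i
        ]
        if not same:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        a = sum(same) / len(same)  # cohesion: mean distance within own cluster
        b = min(                   # separation: nearest other cluster's mean distance
            sum(abs(p - q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels)
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [1.0, 2.0, 10.0, 11.0]
good = silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # clusters that mix near and far points
```

The sensible grouping scores close to 1, while the mixed grouping scores below 0.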
Ranking metrics
Ranking metrics are used for recommender or ranking systems. Here are some terms to be aware of:
Mean reciprocal rank (MRR)
Measures the accuracy of a ranking system by how high or low the first relevant document appears
Precision at K
Calculates the proportion of the top K recommended items that are relevant
Normalized discounted cumulative gain (NDCG)
Compares the importance/rank that the ML model predicted to the actual relevance
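The three ranking metrics can be sketched as follows. Each list holds the relevance of the items in the order the system ranked them. The DCG here uses the relevance value directly as the gain with a log2 position discount, which is one common formulation and an assumption of this example; the input lists are made up.

```python
import math

def reciprocal_rank(relevances):
    """1 / position of the first relevant item, or 0 if none is relevant."""
    for position, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

def precision_at_k(relevances, k):
    """Share of the top k ranked items that are relevant."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k

def dcg(relevances):
    """Discounted cumulative gain: relevance discounted by log2 of the position."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

ranked = [0, 1, 1]  # the first result is irrelevant, the next two are relevant
mrr = mean_reciprocal_rank([ranked, [1, 0, 0]])
p_at_2 = precision_at_k(ranked, k=2)
score = ndcg(ranked)
```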
I skipped the other offline evaluation sections, since I don’t consider them the most important.
Sample Interview Questions on Model Evaluation
Interview question 4-7: What is the ROC metric, and when is it useful?
Example answer
The ROC (receiver operating characteristic) curve can be used to evaluate a binary classification model. The curve plots the true positive rate against the false positive rate at various thresholds—the threshold being the probability between 0 and 1, above which the prediction is considered to be that class. For example, if the threshold is set to 0.6, then the probability predictions of the model that are above 0.6 probability of being class 1 will be labeled as class 1.
Using ROC can help us determine the trade-off in the true positive rate and the false positive rate at various thresholds, and we can then decide what is the optimal threshold to use.
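An ideal classifier pushes the true positive rate as high as possible while keeping the false positive rate low, which means its ROC curve sits as close to the top-left corner as possible.
The ROC curve is especially useful for comparing the performance of different models or for choosing the best threshold for a single model. When we care about a model’s ability to distinguish two classes (such as “normal” and “anomalous”), especially when the two kinds of error differ in cost or severity, the ROC curve is a very useful tool. With it, we can keep the true positive rate high while reducing the false positive rate as far as possible.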
Interview question 4-8: What is the difference between precision and recall; when would you use one over the other in a classification task?
Example answer
Precision measures what fraction of the model’s positive predictions are actually correct (quality), and recall measures what fraction of the actual positive items the model finds (quantity). Mathematically, precision is TP / (TP + FP) while recall is TP / (TP + FN).
Precision can be more important than recall when it is more critical to reduce FPs and keep them low. One example is malware detection or email spam detection, where too many false positives can lead to user distrust. FPs in email spam detection can move legitimate business emails to the spam folder, causing delays and loss of business.
On the other hand, recall can be more important than precision in high-stakes predictions such as medical diagnostics. Increased recall means that there are fewer false negatives, even if that potentially causes some accidental FPs. In this situation, it’s a higher priority to minimize the chances of missing true cases.