

Overview

This post provides a technical guide to machine learning theory as it comes up in data science interviews. It is by no means comprehensive but aims to highlight key technical points within each topic. The problems discussed are from this data science interview newsletter, which features questions from top tech companies, and will be included in an upcoming book.

Mathematical Prerequisites

Random Variables

Random variables are a core topic within probability and statistics, and interviewers are generally looking for an understanding of the principles and basic ability to manipulate them.


Any random variable X has the following properties (below we assume X is continuous with density f_X, but the analogous results hold for discrete random variables). The expectation (average value) is given by:

$$E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx$$

and the variance is given by:

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$$

For any given random variables X and Y, the covariance, a measure of their linear relationship, is defined by:

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]$$

and normalizing the covariance gives the correlation between X and Y:

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
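As a quick sanity check on these definitions, here is a minimal sketch that estimates each quantity from simulated samples with NumPy. The distributions, the sample size, and the variable names are illustrative assumptions, not part of the original post.

```python
import numpy as np

# Illustrative data (assumed): X ~ N(2, 3^2) and Y = 0.5 * X + standard normal noise.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)
y = 0.5 * x + rng.normal(size=x.size)

# Sample analogues of the expectation, variance, covariance and correlation.
mean_x = x.mean()
var_x = np.mean((x - mean_x) ** 2)
var_y = np.mean((y - y.mean()) ** 2)
cov_xy = np.mean((x - mean_x) * (y - y.mean()))
corr_xy = cov_xy / np.sqrt(var_x * var_y)

print(mean_x, var_x, cov_xy, corr_xy)
```

The sample values should land close to the theoretical ones for this setup (mean 2, variance 9, covariance 4.5), and the correlation should agree with `np.corrcoef(x, y)[0, 1]`.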

Probability Distributions

There are many probability distributions, and interviewers generally aren’t testing whether you’ve memorized specific properties of each (although it is helpful to know the basics), but rather whether you can apply them properly to specific situations. Because of this, the most commonly discussed one in data science interviews is the Normal distribution, which has many real-life applications. For a single variable with mean μ and variance σ², the probability density is given by:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

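For fitting parameters, there are two general methods. In maximum likelihood estimation (MLE), the goal is to estimate the most likely parameters θ given a likelihood function for the observed data x_1, ..., x_n:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta), \qquad L(\theta) = f(x_1, \ldots, x_n \mid \theta)$$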

Since the values of X are assumed to be i.i.d., the likelihood function becomes:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

It is convenient to take logs (since log is a monotonically increasing function, maximizing the log-likelihood is equivalent to maximizing the likelihood):

$$\log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

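Another way of fitting parameters is through maximum a posteriori estimation (MAP), which assumes a prior distribution g(θ) on the parameters:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \; g(\theta) \, f(x_1, \ldots, x_n \mid \theta)$$

Taking logs as before, this is equivalent to maximizing the log-likelihood plus the log-prior, log g(θ).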
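To make the two estimators concrete, here is a minimal sketch for a univariate Gaussian. The MLE of the mean and variance has the well-known closed form (sample mean and average squared deviation). The MAP part assumes a Gaussian prior N(mu0, tau2) on the mean and treats the variance as known; that prior, its values, and all names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """Log-likelihood of i.i.d. samples x under N(mu, sigma2)."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Illustrative data (assumed): 10,000 draws from N(5, 2^2).
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)
n = x.size

# MLE: closed-form maximizers of the Gaussian log-likelihood.
mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)

# MAP for the mean under an assumed N(mu0, tau2) prior, variance treated as known.
mu0, tau2 = 0.0, 1.0
mu_map = (mu0 / tau2 + x.sum() / sigma2_mle) / (1.0 / tau2 + n / sigma2_mle)

print(mu_mle, sigma2_mle, mu_map)
```

With this much data the prior has little effect, so the MAP estimate sits only slightly closer to mu0 than the MLE does; with fewer samples the shrinkage toward the prior mean would be stronger.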


Linear Algebra

Generally, interviewers won’t expect you to delve deeply into linear algebra unless the role has a specific machine learning emphasis. However, it is still helpful to review the basics, since they help with understanding various algorithms and their theoretical underpinnings. There are many sub-topics within linear algebra, but one worth discussing briefly is eigenvalues and eigenvectors. Mechanically, for some square matrix A, a nonzero vector x is an eigenvector of A with eigenvalue λ if:

$$A x = \lambda x$$

Since a matrix is a linear transformation, eigenvectors are cases whereby the resulting transformation of the matrix on that vector results in the same direction as before, although with some scaling factor (the eigenvalues). There are many real-life use cases of eigenvalues and eigenvectors: for example, identifying the orientation of large datasets (discussed in PCA), or for dynamical systems (how a system oscillates and how quickly it will stabilize).

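The decomposition of a (diagonalizable) square matrix in terms of its eigenvalues and eigenvectors is called an eigendecomposition. Note that while not all matrices are square, every matrix has a decomposition through the Singular Value Decomposition (SVD):

$$A = U \Sigma V^T$$

where U and V have orthonormal columns and Σ is a diagonal matrix of non-negative singular values.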

Although the mathematical details are beyond the scope of this discussion, both eigendecomposition and SVD are worth looking into in detail before your technical interview.
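Both decompositions are available in NumPy, and a small sketch can help build intuition before an interview. The matrices below are arbitrary examples chosen for illustration; `np.linalg.eigh` is used because the first matrix is symmetric.

```python
import numpy as np

# Eigendecomposition of a small symmetric matrix (example values are assumed).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order

# Each column v of eigvecs satisfies A v = lambda v.
print(np.allclose(A @ eigvecs, eigvecs * eigvals))

# SVD works for any matrix, including non-square ones.
B = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
U, s, Vt = np.linalg.svd(B, full_matrices=False)
B_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(B, B_rebuilt))
```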


Bias-Variance Tradeoff

This topic is occasionally asked about in interviews due to its relevance to overfitting and model selection. With any model, we are generally trying to estimate a true underlying function f from observations of the form:

$$y = f(x) + w$$

where w is noise, usually modeled as a zero-mean Gaussian random variable. As mentioned earlier, MLE and MAP are reasonable ways to estimate parameters. To assess how well the model fits, we can decompose the expected prediction error into three parts:

  1. bias (how close the model's average prediction comes to the true underlying f(x) values)

  2. variance (how much the prediction changes based on the training inputs)

  3. irreducible error (due to inherently noisy observation processes)


There is a trade-off between bias and variance, and this is a useful framework for thinking about how different models operate. The overall goal is to control overfitting (and not generalizing well out of sample) to produce stable and accurate models.
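The three components listed above can be written as a single equation. The following is the standard decomposition of expected squared error at a point x, stated here as a sketch: the symbols \hat{f} (the model fitted on a random training set) and \sigma^2 (the variance of the noise w) are notation introduced for this illustration, and the expectation is taken over training sets and noise.

```latex
E\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Flexible models tend to shrink the first term while growing the second, and simple models do the opposite, which is the trade-off described above.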

Linear Regression

This is one of the most frequently taught methods and has many real-life applications, ranging from predicting housing prices to studying the efficacy of medical trials. Interviewers asking about it are generally trying to assess your understanding of the basic formulation and, occasionally, how the theory relates to real-life applications.

In linear regression, the goal is to estimate y = f(x) in the following form:

$$y = X\beta$$

where X is a matrix of data points and β is the vector of weights. In the least-squares context, linear regression minimizes the residual sum-of-squares (RSS), which is given by:

$$RSS(\beta) = (y - X\beta)^T (y - X\beta)$$

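In regression, one can use MLE to estimate the β values by modeling the n observations with a multivariate Gaussian with mean Xβ and covariance σ²I:

$$f(y \mid X, \beta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta)\right)$$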

which leads to the same results as minimizing the RSS. In a MAP context, one can place a prior on β; a zero-mean Gaussian prior leads to Ridge regression, which penalizes the weights to prevent overfitting. In Ridge regression, the objective becomes minimizing, for some penalty strength λ ≥ 0:

$$(y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$$
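Both objectives have closed-form minimizers, which makes them easy to try out. The sketch below solves the normal equations for least squares and the penalized version for Ridge on simulated data; the data-generating values and the choice of lam = 10 are assumptions for illustration only.

```python
import numpy as np

# Simulated data (assumed): 200 points, 3 features, small Gaussian noise.
rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Least squares: solve (X^T X) beta = X^T y, which minimizes the RSS.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: solve (X^T X + lam * I) beta = X^T y for an assumed penalty lam.
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(beta_ols)
print(beta_ridge)
```

The Ridge weights are shrunk toward zero relative to the least-squares weights, which is the penalizing effect described above.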

Dimensionality Reduction

Principal Components Analysis

This topic is less common in interviews but is often alluded to during discussions about data preprocessing or feature engineering. Decomposing data into a smaller set of variables is very useful for summarizing and visualizing data. This overall process is called dimensionality reduction. One common method of dimensionality reduction is Principal Components Analysis (PCA), which projects data into a lower-dimensional setting. It looks for a small number of linear combinations of a vector x (say it is p-dimensional) that explain the variance within x. More specifically, we want to find weight vectors w_i such that we can define the following linear combinations:

$$y_i = w_i^T x = \sum_{j=1}^{p} w_{ij} x_j$$

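subject to the following:

$$\mathrm{Var}(y_i) \text{ is maximized}, \qquad w_i^T w_i = 1, \qquad \mathrm{Cov}(y_i, y_j) = 0 \text{ for all } j < i$$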

Hence we have the following procedure: first we find the component with maximal variance, then the second component that is uncorrelated with the first, and we continue iteratively. The idea is to end with, say, k dimensions such that y_1, ..., y_k explain most of the variance, with

$$k \ll p$$

Using some algebra, the final result is an eigendecomposition of the covariance matrix of X, whereby the first principal component is the eigenvector corresponding to the largest eigenvalue and so on.
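The eigendecomposition view translates directly into a few lines of NumPy. This is a minimal sketch on simulated two-dimensional data in which the second feature is roughly twice the first; the data, the choice k = 1, and the variable names are assumptions for illustration.

```python
import numpy as np

# Simulated data (assumed): points scattered around the direction (1, 2).
rng = np.random.default_rng(3)
z = rng.normal(size=(500, 1))
X = np.hstack([z, 2.0 * z]) + 0.1 * rng.normal(size=(500, 2))

# Center the data and form the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (Xc.shape[0] - 1)

# Eigendecomposition, sorted so the largest eigenvalue comes first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the first k principal components and project onto them.
k = 1
scores = Xc @ eigvecs[:, :k]
explained = eigvals[:k].sum() / eigvals.sum()
print(eigvecs[:, 0], explained)
```

Here the first principal component should point along (1, 2) up to sign and normalization, and it should account for nearly all of the variance.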


Classification

General Framework

Classification is commonly asked about during interviews due to the abundance of real-life applications. Tech companies love to ask about classifying customers and users into different segments.

The goal of classification is to assign a given data point to one of K classes, instead of a continuous value (as in regression), and there are two types of models. The first is generative, which models the joint probability distribution between X and Y. That is, we want to classify an arbitrary data point x with the following class label:

$$\hat{y} = \arg\max_{k} \; p(x, Y = k)$$

This joint distribution between X and Y is given by:

$$p(X, Y) = p(Y \mid X) \, p(X) = p(X \mid Y) \, p(Y)$$

and for each given class k we have:

$$p(x, Y = k) = p(x \mid Y = k) \, p(Y = k)$$

The result of maximizing the posterior means there will be decision boundaries between classes where the resulting posterior probability is equal.


The second is discriminative, which directly learns a decision boundary by choosing the class that maximizes the posterior probability distribution:

$$\hat{y} = \arg\max_{k} \; p(Y = k \mid x)$$

So both methods end up choosing the predicted class that maximizes the posterior; the difference is just in the approach.

Logistic Regression

One popular classification algorithm is logistic regression, which is often asked about in conjunction with linear regression as a way to assess basic knowledge of classification algorithms. In logistic regression, we take a linear output and convert it to a probability between 0 and 1 using the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

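In matrix form, the decision looks like the following (with the sigmoid applied elementwise), where the target class 1 is predicted if the output is at least 0.5:

$$\hat{p} = \sigma(X\beta) = \frac{1}{1 + e^{-X\beta}}, \qquad \hat{y}_i = 1 \text{ if } \hat{p}_i \ge 0.5, \text{ otherwise } 0$$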

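The loss function for logistic regression is the log-loss, where p_i is the predicted probability that observation i belongs to class 1:

$$L(\beta) = -\sum_{i=1}^{n} \big[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\big]$$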

Note that the posterior is being modeled directly and hence logistic regression is a discriminative model.
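The pieces above (sigmoid, 0.5 threshold, log-loss) can be combined into a small from-scratch fit. This is a minimal sketch that minimizes the log-loss with plain gradient descent on simulated data; the step size, iteration count, data-generating weights, and function names are all assumptions for illustration rather than a recommended training setup.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(beta, X, y):
    """Sum of negative Bernoulli log-likelihoods, clipped for numerical safety."""
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simulated data (assumed): intercept column plus one standard normal feature.
rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

# Gradient descent: the gradient of the log-loss is X^T (p - y).
beta = np.zeros(2)
initial_loss = log_loss(beta, X, y)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ beta) - y)
    beta -= 0.1 * grad / n
final_loss = log_loss(beta, X, y)

# Predict class 1 when the estimated probability is at least 0.5.
predictions = (sigmoid(X @ beta) >= 0.5).astype(int)
print(beta, initial_loss, final_loss)
```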


Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is not commonly asked about during interviews but is an interesting topic to know, since it is a generative model rather than a discriminative one (as logistic regression is). It assumes that, given some class k, data from that class follows a multivariate Gaussian with class mean μ_k and covariance Σ (for p-dimensional x):

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$$

Recall from Bayes’ rule that maximizing the joint probability over labels is equivalent to maximizing the posterior probability, so LDA aims to maximize:

$$\hat{y} = \arg\max_{k} \; P(Y = k \mid X = x)$$

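In particular, writing π_k = P(Y = k) for the prior probability of class k, we have:

$$P(Y = k \mid X = x) = \frac{f_k(x) \, \pi_k}{\sum_{l=1}^{K} f_l(x) \, \pi_l}$$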

where f(x) for each k is the class density function. LDA assumes that the densities are multivariate Gaussian, and additionally assumes that the covariance matrix is common among all classes. The resulting decision boundary is linear (hence the name). This is in contrast to Quadratic Discriminant Analysis, which allows each class its own covariance matrix and yields a quadratic boundary.

Decision Trees

Decision trees and random forests are commonly asked about during interviews since they are flexible and often perform well in practice. In particular, it helps to have a basic understanding of how both are trained and used, as well as how the feature splits occur (entropy and information gain).

Training

A decision tree is a model that can be represented as a tree: each internal node splits the data based on a feature, and each leaf node holds a result (a class or a regression value). For this discussion, we will focus on the classification setting. Trees are trained in a greedy and recursive fashion starting at the root, where the goal is to choose the splits that most increase the certainty about which class a particular data point belongs to.

The entropy of a random variable Y quantifies the uncertainty of its values. For a discrete variable Y which takes on k states, it is given by:

$$H(Y) = -\sum_{i=1}^{k} P(Y = y_i) \log P(Y = y_i)$$

For a simple Bernoulli random variable, this quantity is highest when p = 0.5 and lowest when p = 0 or p = 1, which aligns intuitively with the definition since if p = 0 or 1, then there is no uncertainty on the result. Generally, if a random variable has high entropy, then its distribution is closer to a uniform one than a skewed one.

Consider an arbitrary split. We have H(Y) from the initial training labels, and say we have some feature X that we want to split on. We can characterize the reduction in uncertainty by the information gain, which is given by:

$$IG(Y, X) = H(Y) - H(Y \mid X)$$

The larger this quantity, the higher the reduction in uncertainty in Y by splitting on X. Therefore, the general process is to assess all features in consideration and choose the feature that maximizes this information gain. Then, recursively continue the process for the two resulting branches.
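Entropy and information gain are simple enough to compute by hand, and interviewers sometimes ask for exactly that. Below is a minimal sketch for discrete labels and a discrete feature; it uses base-2 logarithms (so entropy is in bits), which is a choice made here, and the toy arrays and function names are assumptions for illustration.

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of the empirical distribution of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    """H(Y) - H(Y | X) for a discrete feature X."""
    h_cond = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each branch's entropy by the fraction of points it receives.
        h_cond += mask.mean() * entropy(labels[mask])
    return entropy(labels) - h_cond

# Toy data (assumed): x_good separates the classes perfectly, x_bad only weakly.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
x_good = np.array([1, 1, 1, 0, 0, 0, 1, 0])
x_bad = np.array([1, 0, 1, 0, 1, 0, 1, 0])

print(entropy(y))
print(information_gain(y, x_good), information_gain(y, x_bad))
```

A greedy tree builder would pick `x_good` here, since splitting on it removes all uncertainty about the label.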

Random Forests

An individual decision tree is prone to overfitting, so in practice random forests usually yield better out-of-sample predictions. A random forest is an ensemble method that trains many decision trees and combines their decisions (by averaging or majority vote). It reduces overfitting and the correlation between trees in two ways: 1) bagging (bootstrap aggregation), whereby each tree is trained on m data points sampled at random with replacement from the n available (m ≤ n, and m = n is a common choice), and 2) considering only a random subset of the features at each split (to prevent always splitting on any particular feature).
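The two sources of randomness, plus the final vote, can be sketched without building any actual trees. The snippet below only shows the sampling and aggregation steps; the sizes, the square-root rule for the number of candidate features, and the hard-coded votes are assumptions for illustration, and the tree-fitting step itself is left out.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples, n_features = 100, 9

# 1) Bagging: draw row indices with replacement to form one tree's training set.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# 2) Feature subsetting: candidate features considered at a single split
#    (sqrt of the feature count is a common heuristic, assumed here).
max_features = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=max_features, replace=False)

def majority_vote(votes):
    """Combine class votes of shape (n_trees, n_points) into one label per point."""
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

# Hypothetical votes from three trees on three data points.
votes = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [1, 1, 0]])
print(majority_vote(votes))
```

Because sampling is done with replacement, a bootstrap sample of size n typically contains only about 63% of the distinct original rows, which is part of what makes the trees differ from one another.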

Clustering

Clustering is a popular interview topic since there are many real-life applications. It is often done for data visualization, and can be used to identify outliers, which is useful in cases like fraud detection. It also helps to have a basic understanding of how the parameters are learned in this context, compared with the MLE/MAP approach discussed earlier.

Overview

The goal of clustering is to partition a dataset into various clusters looking only at the input features. This is an example of unsupervised learning. Ideally, the clustering has two properties:

  1. points within a given cluster are similar to one another (high intra-cluster similarity)

  2. points in different clusters are not similar to one another (low inter-cluster similarity).


K-means Clustering

K-means clustering partitions data into k clusters and starts by choosing centroids of each of the k clusters arbitrarily. Iteratively, it updates partitions by assigning points to the closest cluster, updating centroids, and repeating until convergence.

Mathematically, K-means minimizes the following loss function, given points x_i, clusters S_j, and centroid values μ_j:

$$L = \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2$$

The iterative process continues until updating the cluster assignments no longer decreases the objective function.
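The assign-then-update loop is short enough to write from scratch. Here is a minimal sketch in NumPy that initializes centroids from randomly chosen data points and stops once the centroids no longer move; the function name `kmeans`, its parameters, and the two-blob test data are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: returns centroids, cluster labels and the final loss."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    loss = np.sum((X - centroids[labels]) ** 2)
    return centroids, labels, loss

# Simulated data (assumed): two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(6)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
])
centroids, labels, loss = kmeans(X, k=2)
print(centroids, loss)
```

K-means only finds a local minimum of the loss, so in practice it is common to run it from several random initializations and keep the best result.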


Gaussian Mixture Model

A Gaussian Mixture Model (GMM) is a model whereby for any given data point x, we assume that it comes from one of K clusters, each with its own Gaussian distribution.

That is, among the K classes we have:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where the π coefficients are the mixing coefficients on the clusters and are normalized so that they sum to 1. Let θ denote the unknown mean and variance parameters for each of the K classes, along with the K mixing coefficients. Then the likelihood of n observed points is given by:

$$p(X \mid \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$

and therefore the log-likelihood is:

$$\log p(X \mid \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$

The parameters can be calculated iteratively using Expectation-Maximization (EM), which is discussed below.


Expectation Maximization

Expectation Maximization (EM) is a method for estimating parameters in models with latent variables, such as the two examples of K-means and GMMs above, whereby some variables can be observed directly whereas others are latent and cannot. In particular, for clustering, the cluster assignment is the latent variable since it is not directly observed. The general steps are as follows, using Z as the latent variables, X as the observed variables, and θ as the unknown parameters.

Assume the current parameters are given by θ’. The first step (the E-step) is to estimate the posterior distribution of the latent variables:

$$P(Z \mid X, \theta')$$

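using the current parameter estimates. The second step (the M-step) is to find the θ* that maximizes the expected log-likelihood of the data under that posterior, which is given by:

$$\theta^{*} = \arg\max_{\theta} \; E_{Z \mid X, \theta'}\big[\log P(X, Z \mid \theta)\big]$$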

And continue iteratively until convergence.
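For a GMM, both steps have closed forms, so the whole procedure fits in a short function. The following is a minimal sketch for a one-dimensional mixture; the function names, the quantile-based initialization, the fixed iteration count, and the simulated data are assumptions for illustration rather than a production implementation.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x (broadcasts over arrays)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, log-likelihoods."""
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over the data
    var = np.full(k, x.var())
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each point.
        weighted = pi * normal_pdf(x[:, None], mu, var)  # shape (n, k)
        log_liks.append(np.log(weighted.sum(axis=1)).sum())
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibility-weighted data.
        nk = resp.sum(axis=0)
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, log_liks

# Simulated data (assumed): 30% from N(-2, 0.5^2) and 70% from N(3, 1^2).
rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2.0, 0.5, size=300), rng.normal(3.0, 1.0, size=700)])
pi, mu, var, log_liks = em_gmm_1d(x, k=2)
print(pi, mu, var)
```

Tracking the log-likelihood across iterations is a useful check, since EM should never decrease it from one iteration to the next.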


20 Machine Learning Interview Problems

  1. Assume we have a classifier that produces a score between 0 and 1 for the probability of a particular loan application being fraudulent. In this scenario: a) what are false positives, b) what are false negatives, and c) what are the trade-offs between them in terms of dollars and how should the model be weighted accordingly?

  2. Say you need to produce a binary classifier for fraud detection. What metrics would you look at, how is each defined, and what is the interpretation of each one?

  3. You are given a very large corpus of words. How would you identify synonyms?

  4. Describe both generative and discriminative models and give an example of each.

  5. What is the bias-variance tradeoff? How is it expressed using an equation?

  6. Define the cross validation process. What is the motivation behind using it?

  7. Say you are modeling the yearly revenue of new listings. What kinds of features would you use? What data processing steps need to be taken, and what kind of model would you run?

  8. What are L1 and L2 regularization? What are the differences between the two?

  9. Define what it means for a function to be convex. Give an example of a machine learning algorithm that is not convex, and describe why that is so.

  10. Describe gradient descent and the motivations behind stochastic gradient descent.

  11. Explain what Information Gain and Entropy are in a Decision Tree.

  12. Describe the idea behind boosting. Give an example of one method and describe one advantage and disadvantage it has.

  13. Say we are running a probabilistic linear regression which does a good job modeling the underlying relationship between some y and x. Now assume all inputs have some noise ε added, which is independent of the training data. What is the new objective function? How do you compute it?

  14. What is the loss function used in k-means clustering for k clusters and n sample points? Compute the update formula using 1) batch gradient descent, 2) stochastic gradient descent for the cluster mean for cluster k using a learning rate ε.

  15. Say we are using a Gaussian Mixture Model (GMM) for anomaly detection on fraudulent transactions to classify incoming transactions into K classes. Describe the model setup formulaically and how to evaluate the posterior probabilities and log likelihood. How can we determine if a new transaction should be deemed fraudulent?

  16. What is Expectation-Maximization and when is it useful? Describe the setup algorithmically with formulas.

  17. Formulate the background behind an SVM, and show the optimization problem it aims to solve.

  18. Describe entropy in the context of machine learning, and show mathematically how to maximize it assuming N states.

  19. Suppose you are running a linear regression and model the error terms as being normally distributed. Show that in this setup, maximizing the likelihood of the data is equivalent to minimizing the sum of squared residuals.

  20. Say X is a univariate Gaussian random variable. What is the entropy of X?

Thanks for reading!

If you’re interested in further exploring probability and statistics in data science interviews, check out this newsletter that sends you practice problems three times a week. Also, be on the lookout for an upcoming book!

Translated from: https://towardsdatascience.com/data-science-interviews-machine-learning-d9080e7185fb


