

Probabilistic generative algorithms — such as Naive Bayes, linear discriminant analysis, and quadratic discriminant analysis — have become popular tools for classification. These methods can be easily implemented in Python through scikit-learn or in R through e1071. But how do the methods actually work? This article derives them from scratch.

(Note that this article is adapted from a chapter of my book Machine Learning from Scratch, which is available online for free).

We’ll use the following conventions for this article.


  • Let v[i] be the ith entry in a vector v.

  • The target is the variable we are trying to model. The predictors are the variables we use to model the target.

  • The target is a scalar and it is written as y. The predictors are combined into a vector and they are written as x. We also assume that the first entry in x is a 1, corresponding to the intercept term.

  • p(x) refers to the distribution of x, but P(y = k) refers to the probability that y equals k.

Most classification algorithms fall into one of two categories: discriminative and generative classifiers. Discriminative classifiers model the target variable, y, as a direct function of the predictor variables, x. For instance, logistic regression uses the following model, where 𝜷 is a length-D vector of coefficients and x is a length-D vector of predictors:

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-\boldsymbol{\beta}^\top x}}
$$

The logistic regression model


Generative classifiers instead view the predictors as being generated according to their class — i.e., they see x as a function of y, rather than the other way around. They then use Bayes’ rule to get from p(x|y = k) to P(y = k|x), as explained below.

Generative models can be broken down into the three following steps. Suppose we have a classification task with K unordered classes, represented by k = 1, 2, …, K.

  1. Estimate the prior probability that a target belongs to any given class. I.e., estimate P(y = k) for k = 1, 2, …, K.

  2. Estimate the density of the predictors conditional on the target belonging to each class. I.e., estimate p(x|y = k) for k = 1, 2, …, K.

  3. Calculate the posterior probability that the target belongs to any given class. I.e., calculate P(y = k|x), which is proportional to p(x|y = k)P(y = k) by Bayes’ rule.

We then classify an observation as belonging to the class k for which the following expression is greatest:


$$
p(x \mid y = k)\, P(y = k)
$$

Note that we do not need p(x), which would be the denominator in the Bayes’ rule formula, since it would be equal across classes.
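The three steps and the final decision rule can be sketched in a few lines of Python. This is only a minimal illustration: it assumes a single predictor with a Normal density in each class, and the priors and class parameters are hypothetical values picked by hand rather than estimated from data.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate Normal(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def classify(x, priors, class_params):
    """Return the class k that maximizes p(x | y = k) * P(y = k)."""
    scores = {
        k: normal_pdf(x, *class_params[k]) * priors[k]
        for k in priors
    }
    # p(x) is never computed: it is the same for every class.
    return max(scores, key=scores.get), scores

# Hypothetical, hand-picked parameters (not estimated from data).
priors = {1: 0.7, 2: 0.3}                       # P(y = k)
class_params = {1: (0.0, 1.0), 2: (3.0, 1.0)}   # (mean, sd) of x | y = k

label, scores = classify(2.5, priors, class_params)
```

Even though class 1 has the larger prior, the observation x = 2.5 sits much closer to the class 2 mean, so the product of density and prior is larger for class 2.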

A generative classifier models two sources of randomness. First, we assume that out of the 𝐾 possible classes, each observation belongs to class 𝑘 independently with probability given by the kth entry in the vector 𝝅. I.e., 𝝅[k] gives P(y = k).

Second, we assume some distribution of x conditional on y. We typically assume x comes from the same family of distributions regardless of y, though its parameters depend on the class. For instance, we might assume

$$
x \mid (y = k) \sim \text{MVN}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
$$

though we wouldn’t assume x is distributed MVN if y = 1 but distributed Multivariate-t otherwise. Note that it is possible, however, for the individual variables within the vector x to follow different distributions. For instance, we might assume the ith and jth variables in x to be distributed as follows

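[Equation image: example in which x[i] and x[j], conditional on y = k, follow two different families of distributions]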

The machine learning task is then to estimate the parameters of these distributions: 𝝅 for the target variable y and whatever parameters index the assumed distributions of x|y = k (in the first case above, 𝝁_k and 𝚺_k for k = 1, 2, …, K). Once that’s done, we can calculate P(y = k) and p(x|y = k) for each class. Then through Bayes’ rule, we choose the class k that maximizes P(y = k|x).

3. Parameter Estimation

Now let’s get to estimating the model’s parameters. Recall that we calculate P(y = k|x) with

$$
P(y = k \mid x) \propto p(x \mid y = k)\, P(y = k)
$$

To calculate this probability, we first need to estimate 𝝅 (which tells us P(y = k)) and then the parameters of the distribution p(x|y = k). These are referred to as the class priors and the data likelihood, respectively.

Note: Since we’ll talk about the data across observations, let y_n and x_n be the target and predictors for the nth observation, respectively. (The math below is a little neater in the original book.)

Let’s start by deriving the estimates for 𝝅, the class priors. Let I_nk be an indicator which equals 1 if y_n = k and 0 otherwise. We want to find an expression for the likelihood of 𝝅 given the data. We can write the probability that the first observation has the target value it does as follows:

$$
P(y_1) = \prod_{k=1}^{K} \pi_k^{I_{1k}}
$$

This is equivalent to the likelihood of 𝝅 given a single observation's target. To find the likelihood across all our observations, we simply take the product:

$$
L(\boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{I_{nk}}
$$

This gives us the class prior likelihood. To estimate 𝝅 through maximum likelihood, let’s first take the log. This gives

$$
\log L(\boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} I_{nk} \log \pi_k = \sum_{k=1}^{K} N_k \log \pi_k,
$$

where the number of observations in class k is given by


$$
N_k = \sum_{n=1}^{N} I_{nk}.
$$

Now we are ready to find the MLE of 𝝅 by optimizing the log likelihood. To do this, we’ll need to use a Lagrangian since we have the constraint that the sum of the entries in 𝝅 must equal 1. The Lagrangian for this optimization problem looks as follows:

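$$
\mathcal{L}(\lambda, \boldsymbol{\pi}) = \sum_{k=1}^{K} N_k \log \pi_k + \lambda \left(1 - \sum_{k=1}^{K} \pi_k\right)
$$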

The Lagrangian optimization. The first expression represents the log likelihood and the second represents the constraint.

More on the Lagrangian can be found in the original book. Next, we take the derivative of the Lagrangian with respect to 𝜆 and each entry in 𝝅:

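$$
\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{N_k}{\pi_k} - \lambda = 0 \quad (k = 1, \dots, K), \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = 1 - \sum_{k=1}^{K} \pi_k = 0
$$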

This system of equations gives the intuitive solution below, which says that our estimate of P(y = k) is just the sample fraction of the observations from class k.

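$$
\hat{\pi}_k = \frac{N_k}{N}
$$

Here N_k is the number of observations in class k and N is the total number of observations. The first set of equations makes each 𝝅[k] proportional to N_k, and the constraint that the entries sum to 1 fixes the constant of proportionality at 1/N.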
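As a quick illustration of this estimate, the class priors can be computed by counting labels. The sketch below uses only the standard library; the function name `estimate_priors` and the toy labels are made up for this example.

```python
from collections import Counter

def estimate_priors(y):
    """MLE of the class priors: pi_k = N_k / N."""
    counts = Counter(y)   # N_k for each observed class k
    n = len(y)            # N, the total number of observations
    return {k: counts[k] / n for k in sorted(counts)}

# Toy labels for illustration only.
y = [1, 2, 2, 3, 2, 1, 3, 2]
priors = estimate_priors(y)
```

With these labels, class 2 makes up half of the sample and classes 1 and 3 a quarter each, so the estimated priors are 0.5, 0.25, and 0.25.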

The next step is to model the conditional distribution of x given y so that we can estimate this distribution’s parameters. This of course depends on the family of distributions we choose to model x. Three common approaches are detailed below.

3.2.1 Linear Discriminant Analysis (LDA)


In LDA, we assume the following distribution for x


$$
x \mid (y = k) \sim \text{MVN}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})
$$

for k = 1, 2, …, K. Note that each class has the same covariance matrix but a unique mean vector.

Let’s derive the parameter estimates in this case. First, let’s find the likelihood and log likelihood. Note that we can write the joint likelihood of all the observations as

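$$
\begin{aligned}
L(\{\boldsymbol{\mu}_k\}, \boldsymbol{\Sigma}) &= \prod_{n=1}^{N} p(x_n \mid y_n) \\
&= \prod_{n=1}^{N} \prod_{k=1}^{K} p(x_n \mid y_n = k)^{I_{nk}}
\end{aligned}
$$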

Then, we plug in the Multivariate Normal PDF (dropping multiplicative constants) and take the log:


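$$
\log L(\{\boldsymbol{\mu}_k\}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \sum_{k=1}^{K} I_{nk} \left[ -\frac{1}{2} \log |\boldsymbol{\Sigma}| - \frac{1}{2} (x_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu}_k) \right]
$$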

Finally, we have our data likelihood. Now we estimate the parameters by maximizing this expression.

Let’s start with 𝚺. First, simplify the log-likelihood to make the gradient with respect to 𝚺 more apparent.

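$$
\log L(\{\boldsymbol{\mu}_k\}, \boldsymbol{\Sigma}) = -\frac{N}{2} \log |\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} I_{nk}\, (x_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu}_k)
$$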

Then, we take the derivative. Note that this uses matrix derivatives (2) and (3) introduced in the “math note” here.

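$$
\frac{\partial \log L}{\partial \boldsymbol{\Sigma}} = -\frac{N}{2} \boldsymbol{\Sigma}^{-1} + \frac{1}{2} \boldsymbol{\Sigma}^{-1} \left[ \sum_{n=1}^{N} \sum_{k=1}^{K} I_{nk}\, (x_n - \boldsymbol{\mu}_k)(x_n - \boldsymbol{\mu}_k)^\top \right] \boldsymbol{\Sigma}^{-1}
$$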

Then we set this gradient equal to 0 and solve for 𝚺.


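$$
N \boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}^{-1} \left[ \sum_{n=1}^{N} \sum_{k=1}^{K} I_{nk}\, (x_n - \boldsymbol{\mu}_k)(x_n - \boldsymbol{\mu}_k)^\top \right] \boldsymbol{\Sigma}^{-1}
$$

$$
\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} I_{nk}\, (x_n - \boldsymbol{\mu}_k)(x_n - \boldsymbol{\mu}_k)^\top
$$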

Halfway there! Now to estimate 𝝁_k (the kth class’s mean vector), let’s look at each class individually. Let C_k be the set of observations in class k. Looking only at terms involving 𝝁_k, we get

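$$
\log L(\boldsymbol{\mu}_k) = -\frac{1}{2} \sum_{n \in C_k} (x_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu}_k)
$$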

Using equation (4) from the “math note” here, we get the gradient to be

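$$
\frac{\partial \log L}{\partial \boldsymbol{\mu}_k} = \sum_{n \in C_k} \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu}_k)
$$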

Finally, we set this gradient equal to 0 and find our estimate of the mean vector:


$$
\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k} \sum_{n \in C_k} x_n = \bar{x}_k,
$$

where the last term gives the sample mean of the x in class k.
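The two LDA estimates translate directly into NumPy. The sketch below is illustrative rather than a reference implementation: the function name `fit_lda` and the simulated data are assumptions made for this example, and X is assumed to hold only the predictors themselves (no leading column of 1s, which would make the covariance matrix singular).

```python
import numpy as np

def fit_lda(X, y):
    """MLEs for LDA: one mean vector per class and one shared covariance."""
    classes = np.unique(y)
    N, D = X.shape
    # mu_k: sample mean of the rows of X that belong to class k.
    means = {k: X[y == k].mean(axis=0) for k in classes}
    # Sigma: sum of outer products of class-centered rows, divided by N.
    Sigma = np.zeros((D, D))
    for k in classes:
        centered = X[y == k] - means[k]
        Sigma += centered.T @ centered
    return means, Sigma / N

# Simulated two-class data, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
means, Sigma = fit_lda(X, y)
```

Note that the covariance estimate divides by N, as in the maximum likelihood derivation, rather than by N minus the number of classes as an unbiased pooled estimate would.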

3.2.2 Quadratic Discriminant Analysis (QDA)


QDA looks very similar to LDA but assumes each class has its own covariance matrix. I.e.,

$$
x \mid (y = k) \sim \text{MVN}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
$$

The log-likelihood is the same as in LDA, except that we replace 𝚺 with 𝚺_k:


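$$
\log L(\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}) = \sum_{n=1}^{N} \sum_{k=1}^{K} I_{nk} \left[ -\frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (x_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (x_n - \boldsymbol{\mu}_k) \right]
$$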

Again, let’s look at the parameters for the kth class individually. The log-likelihood for class k is given by

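$$
\log L(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = -\frac{N_k}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} \sum_{n \in C_k} (x_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (x_n - \boldsymbol{\mu}_k)
$$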

We could take the gradient of this log-likelihood with respect to 𝝁_k and set it equal to 0 to solve for our estimate of 𝝁_k. However, the estimate from the LDA approach still holds, since that solution did not depend on the covariance matrix (which is the only thing we’ve changed). Therefore, we again get

$$
\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k} \sum_{n \in C_k} x_n = \bar{x}_k
$$

To estimate the 𝚺_k, we take the gradient of the log-likelihood for class k.


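$$
\frac{\partial \log L}{\partial \boldsymbol{\Sigma}_k} = -\frac{N_k}{2} \boldsymbol{\Sigma}_k^{-1} + \frac{1}{2} \boldsymbol{\Sigma}_k^{-1} \left[ \sum_{n \in C_k} (x_n - \boldsymbol{\mu}_k)(x_n - \boldsymbol{\mu}_k)^\top \right] \boldsymbol{\Sigma}_k^{-1}
$$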

Then we set this equal to 0 to get our estimate:


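$$
N_k \boldsymbol{\Sigma}_k^{-1} = \boldsymbol{\Sigma}_k^{-1} \left[ \sum_{n \in C_k} (x_n - \boldsymbol{\mu}_k)(x_n - \boldsymbol{\mu}_k)^\top \right] \boldsymbol{\Sigma}_k^{-1}
$$

$$
\hat{\boldsymbol{\Sigma}}_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \boldsymbol{\mu}_k)(x_n - \boldsymbol{\mu}_k)^\top
$$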
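In code, the only change from LDA is that the covariance is estimated separately within each class. The following NumPy sketch makes the same assumptions as before: `fit_qda` is a name invented for this example, the data are simulated, and X contains only the predictors (no column of 1s).

```python
import numpy as np

def fit_qda(X, y):
    """MLEs for QDA: a mean vector and a covariance matrix for every class."""
    means, covs = {}, {}
    for k in np.unique(y):
        X_k = X[y == k]                    # rows belonging to class k
        means[k] = X_k.mean(axis=0)        # mu_k: class sample mean
        centered = X_k - means[k]
        # Sigma_k: class scatter matrix divided by N_k (the MLE, not N_k - 1).
        covs[k] = centered.T @ centered / X_k.shape[0]
    return means, covs

# Simulated data in which the two classes have different spreads.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(2.0, 3.0, size=(60, 2))])
y = np.array([0] * 40 + [1] * 60)
means, covs = fit_qda(X, y)
```

Because each class gets its own D-by-D covariance matrix, QDA has many more parameters than LDA, which is worth keeping in mind when some classes have few observations.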

3.2.3 Naive Bayes


Naive Bayes assumes the random variables within x are independent conditional on the class of the observation. That is, if x is D-dimensional,

$$
p(x \mid y = k) = \prod_{j=1}^{D} p(x[j] \mid y = k)
$$

This makes calculating p(x|y = k) very easy — to estimate the parameters of p(x[j]|y), we can ignore all variables in x other than the jth.

As an example, suppose x is two-dimensional and we use the following model, where for simplicity σ is known.


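$$
x_{n1} \mid (y_n = k) \sim \mathcal{N}(\mu_k, \sigma_k^2), \qquad x_{n2} \mid (y_n = k) \sim \text{Bern}(p_k)
$$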

As before, we estimate the parameters in each class by looking only at the terms in that class. Let θ_k = (μ_k, σ_k, p_k) contain the relevant parameters for class k. The likelihood for class k is given by the following,

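$$
L(\theta_k) = \prod_{n \in C_k} p(x_{n1}, x_{n2} \mid y_n = k) = \prod_{n \in C_k} p(x_{n1} \mid y_n = k)\, p(x_{n2} \mid y_n = k),
$$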

where the two are equal due to the assumed independence between the entries in x. Subbing in the Normal and Bernoulli densities for x_n1 and x_n2, respectively, we get

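$$
L(\theta_k) = \prod_{n \in C_k} \frac{1}{\sqrt{2 \pi \sigma_k^2}} \exp\left( -\frac{(x_{n1} - \mu_k)^2}{2 \sigma_k^2} \right) \cdot p_k^{x_{n2}} (1 - p_k)^{1 - x_{n2}}
$$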

Then we can take the log likelihood as follows


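$$
\log L(\theta_k) = \sum_{n \in C_k} \left[ -\frac{1}{2} \log(2 \pi \sigma_k^2) - \frac{(x_{n1} - \mu_k)^2}{2 \sigma_k^2} + x_{n2} \log p_k + (1 - x_{n2}) \log(1 - p_k) \right]
$$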

Finally we’re ready to find our estimates. Taking the derivative with respect to p_k, we’re left with

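$$
\frac{\partial \log L(\theta_k)}{\partial p_k} = \sum_{n \in C_k} \left[ \frac{x_{n2}}{p_k} - \frac{1 - x_{n2}}{1 - p_k} \right],
$$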

which will give us the sensible result that


$$
\hat{p}_k = \frac{1}{N_k} \sum_{n \in C_k} x_{n2}
$$

Notice this is just the average of the x_2’s. The same process will give us the typical results for μ_k and σ_k.
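For the two-variable example, the per-class estimates can be sketched with the standard library alone. This sketch assumes we estimate σ_k along with μ_k and p_k, using the maximum likelihood form that divides by N_k; the function name and the toy values are made up for illustration.

```python
import math

def fit_naive_bayes_class(x1, x2):
    """MLEs for one class: x1 ~ Normal(mu, sigma^2), x2 ~ Bernoulli(p).

    x1 and x2 hold the first and second predictor for the observations
    in a single class, so each parameter is fit from its own variable only.
    """
    n = len(x1)
    mu = sum(x1) / n                                        # mean of x1
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x1) / n)   # divides by N_k
    p = sum(x2) / n                                         # share of x2 == 1
    return mu, sigma, p

# Toy observations from one class, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1, 0, 1, 1]
mu, sigma, p = fit_naive_bayes_class(x1, x2)
```

Because of the independence assumption, the Normal parameters never look at x2 and the Bernoulli parameter never looks at x1, which is what keeps the fitting step so cheap.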

Regardless of our modeling choices for p(x|y = k), classifying new observations is easy. Consider a test observation x_0. For k = 1, 2, …, K, we use Bayes’ rule to calculate

$$
\hat{P}(y_0 = k \mid x_0) \propto \hat{p}(x_0 \mid y_0 = k)\, \hat{\pi}_k,
$$

where 𝑝̂ gives the estimated density of x_0 conditional on y_0. We then predict y_0 = k for whichever value k maximizes the above expression.
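As a rough sketch of this prediction step, the snippet below scores a test observation under Multivariate Normal class densities, as in LDA or QDA. Working with log densities is a choice made here for numerical stability, not something the derivation requires, and the priors, means, and covariances are hypothetical stand-ins for fitted estimates.

```python
import numpy as np

def log_mvn_density(x, mean, cov):
    """Log density of a Multivariate Normal(mean, cov) evaluated at x."""
    D = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)          # log |cov|
    maha = diff @ np.linalg.solve(cov, diff)    # (x - mean)^T cov^{-1} (x - mean)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def predict(x0, priors, means, covs):
    """Return the class k maximizing log p(x0 | y = k) + log P(y = k)."""
    scores = {
        k: log_mvn_density(x0, means[k], covs[k]) + np.log(priors[k])
        for k in priors
    }
    return max(scores, key=scores.get)

# Hypothetical estimates standing in for fitted parameters.
priors = {0: 0.5, 1: 0.5}
means = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}
covs = {0: np.eye(2), 1: np.eye(2)}

label = predict(np.array([2.5, 2.0]), priors, means, covs)
```

Passing the same covariance matrix for every class corresponds to LDA, while passing class-specific matrices corresponds to QDA. Since the logarithm is increasing, maximizing the log score picks the same class as maximizing the product of density and prior.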

Generative models like LDA, QDA, and Naive Bayes are among the most common methods for classification. However, the (albeit arduous) details of their fitting process are often swept under the rug. The aim of this article is to make those details clear.

While the low-level details of estimating parameters for a generative model can be quite complex, the high-level intuition is quite straightforward. Let’s recap this intuition in a few simple steps.

  1. Estimate the prior probability that an observation is from any given class k. In math, estimate P(y = k) for each value of k.

  2. Estimate the density of the predictors conditional on the observation’s class. I.e., estimate p(x|y = k) for each value of k.

  3. Use Bayes’ Rule to obtain the probability that an observation is from any class given its predictors (up to a constant of proportionality): P(y = k|x).

  4. Choose which value k maximizes the probability in step 3 (call it k*) and estimate that y = k*.

And that’s it! To see more derivations from scratch like this one, please check out my free online book! I promise that most of them have less math.


