08_09_3_Dimensionality Reduction_Mixture Models and EM_K-means_Image segmentation_compression

9.1. K-means Clustering

     We begin by considering the problem of identifying groups, or clusters, of data points in a multidimensional space. Suppose we have a data set consisting of N observations of a random D-dimensional Euclidean variable x. Our goal is to partition the data set into some number K of clusters, where we shall suppose for the moment that the value of K is given. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster. We can formalize this notion by first introducing a set of D-dimensional vectors μ_k, where k = 1, . . . , K, in which μ_k is a prototype associated with the kth cluster. As we shall see shortly, we can think of the μ_k as representing the centres of the clusters. Our goal is then to find an assignment of data points to clusters, as well as a set of vectors {μ_k}, such that the sum of the squares of the distances of each data point to its closest vector μ_k is a minimum.

     It is convenient at this point to define some notation to describe the assignment of data points to clusters. For each data point x_n, we introduce a corresponding set of binary indicator variables r_nk ∈ {0, 1}, where k = 1, . . . , K, describing which of the K clusters the data point x_n is assigned to, so that if data point x_n is assigned to cluster k then r_nk = 1, and r_nj = 0 for j ≠ k. This is known as the 1-of-K coding scheme. We can then define an objective function, sometimes called a distortion measure, given by
J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖x_n − μ_k‖²    (9.1)    OR see https://blog.csdn.net/Linli522362242/article/details/105813977
     which represents the sum of the squares of the distances of each data point to its assigned vector μ_k. Our goal is to find values for the {r_nk} and the {μ_k} so as to minimize J. We can do this through an iterative procedure in which each iteration involves two successive steps corresponding to successive optimizations with respect to the r_nk and the μ_k.

  1. First we choose some initial values for the μ_k.
    rng = np.random.RandomState(rseed)
    k_centroids_indices = rng.permutation( X.shape[0] )[:n_clusters]
    centroids = X[k_centroids_indices]
    
  2. Then in the first phase we minimize J with respect to the r_nk, keeping the μ_k fixed.
     # from sklearn.metrics import pairwise_distances_argmin
     X_labeled_centroids = pairwise_distances_argmin(X, centroids)
  3. In the second phase we minimize J with respect to the μ_k, keeping r_nk fixed.
    new_centers = np.array([ X[X_labeled_centroids==i].mean(axis=0) for i in range(n_clusters) ])

This two-stage optimization is then repeated until convergence. We shall see that these two stages of updating r_nk and updating μ_k correspond respectively to the E (expectation) and M (maximization) steps of the EM algorithm, and to emphasize this we shall use the terms E step and M step in the context of the K-means algorithm.

     Consider first the determination of the r_nk. Because J in (9.1) is a linear function of r_nk, this optimization can be performed easily to give a closed form solution. The terms involving different n are independent and so we can optimize for each n separately by choosing r_nk to be 1 for whichever value of k gives the minimum value of ‖x_n − μ_k‖². In other words, we simply assign the nth data point to the closest cluster centre. More formally, this can be expressed as
r_nk = 1 if k = argmin_j ‖x_n − μ_j‖², and r_nk = 0 otherwise.   (9.2)
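To make (9.2) concrete, here is a minimal NumPy sketch (assuming, as in the code fragments above, a data array X of shape (N, D) and a centroid array centroids of shape (K, D); the function name is just for illustration) that builds the full N × K indicator matrix r_nk rather than only the index of the closest centre:

import numpy as np

def e_step_indicators(X, centroids):
    # squared Euclidean distances from every point to every centre: shape (N, K)
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dists.argmin(axis=1)              # index of the closest centre for each point
    r = np.zeros((X.shape[0], centroids.shape[0]))
    r[np.arange(X.shape[0]), nearest] = 1          # 1-of-K coding: r_nk = 1 only for the closest centre
    return r

The distortion J of (9.1) is then simply (r * sq_dists).sum().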

     Now consider the optimization of the μ_k with the r_nk held fixed. The objective function J is a quadratic function of μ_k, and it can be minimized by setting its derivative with respect to μ_k to zero, giving
2 Σ_{n=1}^{N} r_nk (x_n − μ_k) = 0   (9.3)

which we can easily solve for μ_k to give   μ_k = Σ_n r_nk x_n / Σ_n r_nk   (9.4)

     The denominator in this expression is equal to the number of points assigned to cluster k, and so this result has a simple interpretation, namely set μ_k equal to the mean of all of the data points x_n assigned to cluster k. For this reason, the procedure is known as the K-means algorithm.

     The two phases of re-assigning data points to clusters and re-computing the cluster means are repeated in turn until there is no further change in the assignments

    while True:
        # 2. Assign each sample to the nearest centroid
        # X --> nearest_centroids_indices
        X_labeled_centroids = pairwise_distances_argmin(X, centroids)
        
        #3. Move the centroids to the center of the samples that were assigned to it
        #3.1 Find new centers from means of points        #axis=0: mean( n_rows x 1_column )
        new_centers = np.array([ X[X_labeled_centroids==i].mean(axis=0) for i in range(n_clusters) ])
        
        #4. Repeat steps 2 and 3 until the cluster assignments do not change 
        #   or a user-defined tolerance or a maximum number of iterations is reached.
        #4.1 check for convergence
        if np.all( centroids == new_centers ):
            break
        centroids = new_centers #3.2 assign the new center to the centroids

(or until some maximum number of iterations is exceeded). Because each phase reduces  the value of the objective function J, convergence of the algorithm is assured. However, it may converge to a local rather than global minimum of J. The convergence properties of the K-means algorithm were studied by MacQueen (1967).
 Figure 9.1 Illustration of the K-means algorithm using the re-scaled Old Faithful data set. (a) Green points denote the data set in a two-dimensional Euclidean space. The initial choices for centres μ1 and μ2 are shown by the red and blue crosses, respectively. (b) In the initial E step, each data point is assigned either to the red cluster or to the blue cluster, according to which cluster centre is nearer. This is equivalent to classifying the points according to which side of the perpendicular bisector of the two cluster centres, shown by the magenta line, they lie on. (c) In the subsequent M step, each cluster centre is re-computed to be the mean of the points assigned to the corresponding cluster. (d)–(i) show successive E and M steps through to final convergence of the algorithm.

     The K-means algorithm is illustrated using the Old Faithful data set in Figure 9.1. For the purposes of this example, we have made a linear re-scaling of the data, known as standardizing, such that each of the variables has zero mean and unit standard deviation. For this example, we have chosen K = 2, and so in this case, the assignment of each data point to the nearest cluster centre is equivalent to a classification of the data points according to which side of the perpendicular bisector of the two cluster centres they lie on. A plot of the cost function J given by (9.1) for the Old Faithful example is shown in Figure 9.2.
 Figure 9.2 Plot of the cost function J given by (9.1) after each E step (blue points) and M step (red points) of the K-means algorithm for the example shown in Figure 9.1. The algorithm has converged after the third M step, and the final EM cycle produces no changes in either the assignments or the prototype vectors (μ_k is the prototype associated with the kth cluster).
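A curve like the one in Figure 9.2 can be reproduced by evaluating J from (9.1) twice per iteration: once after the E step (old centres, new assignments) and once after the M step (new centres). A minimal sketch, assuming X, centroids and n_clusters are defined as in the fragments above, and that numpy (np), matplotlib.pyplot (plt) and pairwise_distances_argmin have already been imported:

J_history = []   # alternating values of J after each E step and each M step

while True:
    # E step: assign each point to its nearest centroid, then record J
    labels = pairwise_distances_argmin(X, centroids)
    J_history.append(((X - centroids[labels]) ** 2).sum())

    # M step: move each centroid to the mean of its assigned points, then record J again
    new_centers = np.array([X[labels == i].mean(axis=0) for i in range(n_clusters)])
    J_history.append(((X - new_centers[labels]) ** 2).sum())

    if np.all(centroids == new_centers):
        break
    centroids = new_centers

plt.plot(J_history[0::2], 'bo-', label='after E step')   # blue points, as in Figure 9.2
plt.plot(J_history[1::2], 'ro-', label='after M step')   # red points
plt.xlabel('EM cycle'); plt.ylabel('J'); plt.legend(); plt.show()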

     Note that we have deliberately chosen poor initial values for the cluster centres so that the algorithm takes several steps before convergence. In practice, a better initialization procedure would be to choose the cluster centres μ_k to be equal to a random subset of K data points.

rng = np.random.RandomState(rseed)
k_centroids_indices = rng.permutation( X.shape[0] )[:n_clusters]
centroids = X[k_centroids_indices]

It is also worth noting that the K-means algorithm itself is often used to initialize the parameters in a Gaussian mixture model before applying the EM algorithm.

     A direct implementation of the K-means algorithm as discussed here can be relatively slow, because in each E step it is necessary to compute the Euclidean distance between every prototype vector and every data point. Various schemes have been proposed for speeding up the K-means algorithm, some of which are based on precomputing a data structure such as a tree such that nearby points are in the same subtree (Ramasubramanian and Paliwal, 1990; Moore, 2000). Other approaches make use of the triangle inequality for distances (a cluster centre and two data points form a triangle, so the triangle inequality gives bounds on distances that have not been computed explicitly), thereby avoiding unnecessary distance calculations (Hodgson, 1998; Elkan, 2003).

     So far, we have considered a batch version of K-means in which the whole data set is used together to update the prototype vectors (μ_k is the prototype associated with the kth cluster). We can also derive an on-line stochastic algorithm (MacQueen, 1967) by applying the Robbins-Monro procedure to the problem of finding the roots of the regression function given by the derivatives of J in (9.1) with respect to μ_k. This leads to a sequential update in which, for each data point x_n in turn, we update the nearest prototype μ_k using   μ_k^new = μ_k^old + η_n (x_n − μ_k^old)   (9.5)   where η_n is the learning rate parameter, which is typically made to decrease monotonically as more data points are considered.
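A minimal sketch of this sequential update (9.5), assuming X is an (N, D) data array, centroids is a (K, D) float array of initial prototypes, and using the simple schedule η_n = 1 / (number of points the prototype has seen), one common choice satisfying the Robbins-Monro conditions:

import numpy as np

counts = np.zeros(len(centroids))                 # how many points each prototype has "seen" so far
for x_n in X:                                     # present the data points one at a time
    k = ((centroids - x_n) ** 2).sum(axis=1).argmin()   # nearest prototype
    counts[k] += 1
    eta = 1.0 / counts[k]                         # learning rate, decreasing monotonically
    centroids[k] += eta * (x_n - centroids[k])    # update (9.5)

With this particular schedule each prototype is kept equal to the running mean of the points assigned to it so far, which is MacQueen's original on-line algorithm.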

     The K-means algorithm is based on the use of squared Euclidean distance as the measure of dissimilarity between a data point and a prototype vector. Not only does this limit the type of data variables that can be considered (it would be inappropriate for cases where some or all of the variables represent categorical labels for instance), but it can also make the determination of the cluster means nonrobust to outliers. We can generalize the K-means algorithm by introducing a more general dissimilarity measure V(x, x′) between two vectors x and x′ and then minimizing the following distortion measure
J̃ = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk V(x_n, μ_k)   (9.6)

     which gives the K-medoids algorithm. The E step again involves, for given cluster prototypes μ_k, assigning each data point to the cluster for which the dissimilarity to the corresponding prototype is smallest. The computational cost of this is O(KN), as is the case for the standard K-means algorithm. For a general choice of dissimilarity measure, the M step is potentially more complex than for K-means, and so it is common to restrict each cluster prototype to be equal to one of the data vectors assigned to that cluster, as this allows the algorithm to be implemented for any choice of dissimilarity measure V(·, ·) so long as it can be readily evaluated. Thus the M step involves, for each cluster k, a discrete search over the N_k points assigned to that cluster, which requires O(N_k²) evaluations of V(·, ·).
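Under the restriction just described, the K-medoids M step is a discrete search. A sketch, where V can be any readily evaluated dissimilarity (the L1 distance is used purely as an example) and labels is assumed to hold the current cluster index of each point:

import numpy as np

def V(a, b):                                   # any dissimilarity measure; L1 distance as an example
    return np.abs(a - b).sum()

def k_medoids_m_step(X, labels, n_clusters):
    medoids = []
    for k in range(n_clusters):
        members = X[labels == k]                               # the N_k points assigned to cluster k
        # discrete search over the members: O(N_k^2) evaluations of V
        costs = [sum(V(m, x) for x in members) for m in members]
        medoids.append(members[int(np.argmin(costs))])         # the member with the smallest total dissimilarity
    return np.array(medoids)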

     One notable feature of the K-means algorithm is that at each iteration, every data point is assigned uniquely to one, and only one, of the clusters. Whereas some data points will be much closer to a particular centre than to any other centre, there may be other data points that lie roughly midway between cluster centres. In the latter case, it is not clear that the hard assignment to the nearest cluster is the most appropriate. We shall see in the next section that by adopting a probabilistic approach, we obtain ‘soft’ assignments of data points to clusters in a way that reflects the level of uncertainty over the most appropriate assignment. This probabilistic formulation brings with it numerous benefits.

code link : https://blog.csdn.net/Linli522362242/article/details/105813977

9.1.1 Image segmentation and compression

     As an illustration of the application of the K-means algorithm, we consider the related problems of image segmentation and image compression. The goal of segmentation is to partition an image into regions each of which has a reasonably homogeneous visual appearance or which corresponds to objects or parts of objects (Forsyth and Ponce, 2003). Each pixel in an image is a point in a 3-dimensional space comprising the intensities of the red, blue, and green channels, and our segmentation algorithm simply treats each pixel in the image as a separate data point. Note that strictly this space is not Euclidean because the channel intensities are bounded by the interval [0, 1].

08_Dimensionality Reduction_02_Gaussian mixture_kmeans++_extent_tick_params_silhouette_image segment: https://blog.csdn.net/Linli522362242/article/details/105722461

from matplotlib.image import imread
import os

# matplotlib's imread returns a float array in [0, 1] with channels in RGB order
image = imread( os.path.join(os.path.abspath(os.path.pardir), "images",
                             "unsupervised_learning", "ladybug.png") )

# If the image were loaded with OpenCV instead, cv2.imread returns BGR order,
# so it would need converting first:
# import cv2
# image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)

image.shape  # (533, 800, 3): height x width x 3 RGB channels

     Nevertheless, we can apply the K-means algorithm without difficulty. We illustrate the result of running K-means to convergence, for any particular value of K, by re-drawing the image replacing each pixel vector with the {R, G, B} intensity triplet given by the centre μ_k to which that pixel has been assigned.

from sklearn.cluster import KMeans

X = image.reshape(-1,3) # Reshape the image into a 2D array of pixels x 3 colour values (RGB)
kmeans = KMeans( n_clusters=8, random_state=42 ).fit(X)

# np.unique(kmeans.labels_) # array([0, 1, 2, 3, 4, 5, 6, 7]) # since n_clusters=8
# kmeans.labels_            # array([3, 3, 3, ..., 7, 7, 3]) # for each instance in X
# each instance in X ~ each val kmeans.labels_ ~ cluster index ~ cluster_centers_[index]

# kmeans.cluster_centers_  # one [R, G, B] colour centre per cluster
# array([[0.20698585, 0.37355077, 0.0522274 ],
#        [0.984066  , 0.9406676 , 0.02602943],
#        [0.34464604, 0.5229904 , 0.15256725],
#        [0.02093545, 0.10603908, 0.00559799],
#        [0.60578346, 0.6335118 , 0.39201766],
#        [0.9010906 , 0.7415247 , 0.03171456],
#        [0.6244905 , 0.38521746, 0.08882673],
#        [0.09142023, 0.24330519, 0.01521175]], dtype=float32)

# for any particular value of K, by re-drawing the image replacing each pixel vector with
# the {R,G,B} intensity triplet given by the centre μk to which that pixel has been 
# assigned.
segmented_img = kmeans.cluster_centers_[ kmeans.labels_ ]
# array([[0.02093545, 0.10603908, 0.00559799],
#        [0.02093545, 0.10603908, 0.00559799],
#        [0.02093545, 0.10603908, 0.00559799],
#        ...,
#        [0.09142023, 0.24330519, 0.01521175],
#        [0.09142023, 0.24330519, 0.01521175],
#        [0.02093545, 0.10603908, 0.00559799]], dtype=float32)
# segmented_img.shape # (426400, 3) # 426400=533*800
segmented_img = segmented_img.reshape( image.shape )
# the 2D array of (pixels, 3 colour values) is reshaped back to a 3D array of (height, width, 3 colour values)
segmented_img.shape # (533, 800, 3)  # height x width x 3 RGB channels

#X = image.reshape(-1,3) # 3D(533, 800, 3) convert to 2D(426400, 3)
segmented_imgs = [] #for saving 3D image
n_colors = (10,8,6,4,2)
for n_clusters in n_colors:
    kmeans = KMeans( n_clusters=n_clusters, random_state=42 ).fit(X)
    segmented_img = kmeans.cluster_centers_[ kmeans.labels_ ]
    segmented_imgs.append( segmented_img.reshape(image.shape) )
 
 
plt.figure( figsize=(10,5) )
plt.subplots_adjust(wspace=0.05, hspace=0.5)
 
plt.subplot(231)
plt.imshow(image)
plt.title("Original")
# plt.axis('off')
 
for idx, n_clusters in enumerate(n_colors):
    plt.subplot(232 + idx)
    plt.imshow( segmented_imgs[idx] )
    plt.title( "{} colors".format(n_clusters) )
    plt.axis('off')
 
plt.show()

 

     Figure 9.3 Two examples of the application of the K-means clustering algorithm to image segmentation showing the initial images together with their K-means segmentations obtained using various values of K. This also illustrates the use of vector quantization for data compression, in which smaller values of K give higher compression at the expense of poorer image quality.

     Results for various values of K are shown in Figure 9.3. We see that for a given value of K, the algorithm is representing the image using a palette of only K colours. It should be emphasized that this use of K-means is not a particularly sophisticated approach to image segmentation, not least because it takes no account of the spatial proximity of different pixels. The image segmentation problem is in general extremely difficult and remains the subject of active research; it is introduced here simply to illustrate the behaviour of the K-means algorithm.

     We can also use the result of a clustering algorithm to perform data compression. It is important to distinguish between lossless data compression, in which the goal is to be able to reconstruct the original data exactly from the compressed representation, and lossy data compression, in which we accept some errors in the reconstruction in return for higher levels of compression than can be achieved in the lossless case. We can apply the K-means algorithm to the problem of lossy data compression as follows. For each of the N data points, we store only the identity k of the cluster to which it is assigned. We also store the values of the K cluster centres μ_k, which typically requires significantly less data, provided we choose K << N. Each data point is then approximated by its nearest centre μ_k. New data points can similarly be compressed by first finding the nearest μ_k and then storing the label k instead of the original data vector. This framework is often called vector quantization, and the vectors μ_k are called code-book vectors.

     The image segmentation problem discussed above also provides an illustration of the use of clustering for data compression. Suppose the original image has N pixels comprising {R, G, B} values (each channel taking one of 2^8 = 256 levels), each of which is stored with 8 bits of precision. Then to transmit the whole image directly would cost 24N bits ### (8+8+8)*N pixels ###. Now suppose we first run K-means on the image data, and then instead of transmitting the original pixel intensity vectors we transmit the identity of the nearest vector μ_k. Because there are K such vectors, this requires log2(K) bits per pixel ### K is the total number of cluster labels, so each pixel only needs log2(K) bits to record which cluster centroid μ_k replaces it ###. We must also transmit the K code book vectors μ_k, which requires 24K bits ### (8+8+8)*K cluster centroids ###, and so the total number of bits required to transmit the image is 24K + N log2(K) (rounding log2(K) up to the nearest integer). The original image shown in Figure 9.3 has 240 × 180 = 43,200 pixels and so requires 24 × 43,200 = 1,036,800 bits ### (8+8+8)*N pixels ### to transmit directly. By comparison, the compressed images require 43,248 bits (K = 2), 86,472 bits (K = 3), and 173,040 bits (K = 10), respectively, to transmit. These represent compression ratios compared to the original image of 4.2%, 8.3%, and 16.7%, respectively. We see that there is a trade-off between degree of compression and image quality. Note that our aim in this example is to illustrate the K-means algorithm. If we had been aiming to produce a good image compressor, then it would be more fruitful to consider small blocks of adjacent pixels, for instance 5×5, and thereby exploit the correlations that exist in natural images between nearby pixels.
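The bit-count bookkeeping above is easy to reproduce. A small sketch, assuming a 240 × 180 image stored with 8 bits per colour channel as in the example:

import numpy as np

N = 240 * 180                        # number of pixels = 43,200
original_bits = 24 * N               # (8+8+8) bits per pixel = 1,036,800 bits

for K in (2, 3, 10):
    bits_per_pixel = int(np.ceil(np.log2(K)))       # label of the nearest code-book vector
    compressed_bits = N * bits_per_pixel + 24 * K   # pixel labels + the K code-book vectors
    print(K, compressed_bits, round(100 * compressed_bits / original_bits, 1))
    # prints: 2 43248 4.2,  3 86472 8.3,  10 173040 16.7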

9.2. Mixtures of Gaussians Models & EM

     In the previous section we saw the k-means algorithm, which is a hard clustering technique: each point is allocated to exactly one cluster. In k-means, a cluster is described only by its centroid. This is not very flexible, as we may have problems with clusters that are overlapping, or ones that are not of circular shape.

     Here, we will introduce a model-based clustering technique, which is Expectation Maximization (EM). We will apply it using Gaussian Mixture Models (GMM).

     With EM Clustering, we can go a step further and describe each cluster by its centroid (mean), covariance (so that we can have elliptical clusters), and weight (the size of the cluster). The probability that a point belongs to a cluster is now given by a multivariate Gaussian probability distribution (multivariate - depending on multiple variables). That also means that we can calculate the probability of a point being under a Gaussian ‘bell’, i.e. the probability of a point belonging to a cluster. A comparison between the performances of k-means and EM clustering on an artificial dataset is shown in Figure 8.1.

Figure 8.1: Comparison of k-means and EM on artificial data known as the Mouse dataset. Using the variances, the EM algorithm can describe the normal distributions exactly, while k-means splits the data into Voronoi cells.

2.3. The Gaussian Distribution

     The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable x, the Gaussian distribution can be written in the form
N(x | μ, σ²) = (1 / (2πσ²)^{1/2}) exp{ −(x − μ)² / (2σ²) }   (2.42)
where μ is the mean and σ² is the variance.

For a D-dimensional vector x, the multivariate Gaussian distribution takes the form

N(x | μ, Σ) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp{ −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) }   (2.43)

where μ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes the determinant of Σ.
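Equation (2.43) is straightforward to evaluate numerically; the sketch below can be cross-checked against scipy.stats.multivariate_normal.pdf:

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma), equation (2.43)."""
    D = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
print(gaussian_pdf(np.array([0.0, 0.0]), mu, Sigma))   # density at the mean: 1 / (2*pi*sqrt(|Sigma|))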

https://blog.csdn.net/Linli522362242/article/details/105196037

     Constructing the covariance matrix: the symmetric d × d covariance matrix, where d is the number of dimensions in the dataset, stores the pairwise covariances between the different features. For example, the covariance between two features x_j and x_k on the population level can be calculated via the following equation:

σ_jk = (1/n) Σ_{i=1}^{n} (x_j^(i) − μ_j)(x_k^(i) − μ_k)

     Here, μ_j and μ_k are the sample means of features j and k, respectively. Note that the sample means are zero if we standardize the dataset. A positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. For example, a covariance matrix of three features can then be written as (note that Σ stands for the Greek uppercase letter sigma, which is not to be confused with the sum symbol):

Σ = [ σ_1²  σ_12  σ_13 ]
    [ σ_21  σ_2²  σ_23 ]
    [ σ_31  σ_32  σ_3² ]
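A quick sketch of this computation on a toy standardized data matrix (np.cov expects variables in rows, hence the transpose; bias=True divides by n as in the population formula above):

import numpy as np

X = np.random.RandomState(0).randn(5, 3)            # toy data: 5 samples, 3 features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize: zero mean, unit variance

cov_manual = (X_std.T @ X_std) / X_std.shape[0]     # pairwise covariances, population version
cov_np = np.cov(X_std.T, bias=True)                 # the same matrix via NumPy
print(np.allclose(cov_manual, cov_np))              # True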

The determinant of a matrix

For a 2 × 2 matrix A = [[a, b], [c, d]] the determinant is |A| = ad − bc; determinants of larger matrices can be computed by cofactor expansion along a row or column.
https://jingyan.baidu.com/album/ce09321b9177222bff858f30.html?picindex=1
https://blog.csdn.net/jackghq/article/details/90716172

     The Gaussian distribution arises in many different contexts and can be motivated from a variety of different perspectives.  For example, we have already seen that for a single real variable, the distribution that maximizes the entropy is the Gaussian. This property applies also to the multivariate Gaussian.

     Another situation in which the Gaussian distribution arises is when we consider the sum of multiple random variables. The central limit theorem (due to Laplace) tells us that, subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases (Walker, 1969).

2.3.9 Mixtures of Gaussians
Figure 2.21 Plots of the ‘Old Faithful’ data in which the blue curves show contours of constant probability density. On the left is a single Gaussian distribution which has been fitted to the data using maximum likelihood. Note that this distribution fails to capture the two clumps in the data and indeed places much of its probability mass in the central region between the clumps where the data are relatively sparse. On the right the distribution is given by a linear combination of two Gaussians which has been fitted to the data by maximum likelihood using techniques discussed in Chapter 9, and which gives a better representation of the data.

     While the Gaussian distribution has some important analytical properties, it suffers from significant limitations when it comes to modelling real data sets. Consider the example shown in Figure 2.21. This is known as the ‘Old Faithful’ data set, and comprises 272 measurements of the eruption of the Old Faithful geyser at Yellowstone National Park in the USA. Each measurement comprises the duration of the eruption in minutes (horizontal axis) and the time in minutes to the next eruption (vertical axis). We see that the data set forms two dominant clumps, and that a simple Gaussian distribution is unable to capture this structure, whereas a linear superposition of two Gaussians gives a better characterization of the data set.

     Such superpositions, formed by taking linear combinations of more basic distributions such as Gaussians, can be formulated as probabilistic models known as mixture distributions (McLachlan and Basford, 1988; McLachlan and Peel, 2000). In Figure 2.22 we see that a linear combination of Gaussians can give rise to very complex densities. By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy.
 Figure 2.22 Example of a Gaussian mixture distribution in one dimension showing three Gaussians (each scaled by a coefficient) in blue and their sum in red.

     We therefore consider a superposition of K Gaussian densities of the form

p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)   (2.188)
     which is called a mixture of Gaussians. Each Gaussian density N(x | μ_k, Σ_k) is called a component of the mixture and has its own mean μ_k and covariance Σ_k. Contour and surface plots for a Gaussian mixture having 3 components are shown in Figure 2.23.
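A picture in the spirit of Figure 2.22 can be generated directly from (2.188) in one dimension. A small sketch with three components whose means, standard deviations and mixing coefficients are made-up values chosen only for illustration:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-5, 10, 500)
pis    = [0.3, 0.5, 0.2]              # mixing coefficients, summing to 1
mus    = [-1.0, 3.0, 6.0]
sigmas = [1.0, 1.5, 0.8]

components = [pi * norm.pdf(x, mu, sigma) for pi, mu, sigma in zip(pis, mus, sigmas)]
mixture = np.sum(components, axis=0)  # p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)

for comp in components:
    plt.plot(x, comp, 'b--')          # each scaled component in blue
plt.plot(x, mixture, 'r')             # their sum in red
plt.show()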
Figure 2.23 Illustration of a mixture of 3 Gaussians in a two-dimensional space. (a) Contours of constant density for each of the mixture components, in which the 3 components are denoted red, blue and green, and the values of the mixing coefficients are shown below each component. (b) Contours of the marginal probability density p(x) of the mixture distribution. (c) A surface plot of the distribution p(x).

     In this section we shall consider Gaussian components to illustrate the framework of mixture models. More generally, mixture models can comprise linear combinations of other distributions. For instance, in Section 9.3.3 we shall consider
mixtures of Bernoulli distributions as an example of a mixture model for discrete variables.

     The parameters π_k in (2.188) are called mixing coefficients. If we integrate both sides of (2.188) with respect to x, and note that both p(x) and the individual Gaussian components are normalized, we obtain

Σ_{k=1}^{K} π_k = 1   (2.189)
#####################################
Why are the individual Gaussian components normalized?

The single Gaussian model (Gaussian single model, GSM)
     The most familiar single Gaussian model (i.e. a single Gaussian distribution) is the bell curve, which is simply the one-dimensional case of the Gaussian. The Gaussian distribution, also called the normal distribution, is a probability distribution of great importance in mathematics, physics and engineering, and it has a major influence on many areas of statistics. Because its curve is bell-shaped, it is often simply called the bell curve; the standard normal distribution is the canonical example.

     Its basic definition: if a random variable X follows a Gaussian distribution with mathematical expectation μ and variance σ², we write X ~ N(μ, σ²). The expectation μ is the mean (the arithmetic average, corresponding to the centre of the normal curve), and σ is the standard deviation (the square root of the variance, which measures how widely the data are spread around the mean).
The probability density function (pdf) of the Gaussian distribution is

p(x) = (1 / (σ (2π)^{1/2})) exp{ −(x − μ)² / (2σ²) }

where μ and σ are parameters satisfying −∞ < μ < ∞ and 0 < σ < ∞, and where exp[v] means e^v. Briefly, we say that X is N(μ, σ²).
     The formula above is the probability density function: given the parameters, we plug in a value x and obtain the corresponding probability density. Note also that, before a probability distribution can be used in practice, it must be normalized, i.e. the total area under the curve must equal 1, which guarantees that the returned probability densities lie in the allowed range.

     If we need the probability of a specified interval, we can compute the area under the curve between the two end points of the interval. Besides computing the area directly, an easier way to obtain the same result is to take the difference of the cumulative distribution function (CDF) evaluated at the end points, because the CDF gives the probability of observing a value less than or equal to x.

#####################################
##############################
Supposing K = 1, the left-hand side of (2.189) is 1 (the area under a Gaussian curve is 1), and the right-hand side is π_1 times a single normalized Gaussian, so 1 = π_1 · 1 and hence π_1 = 1.
For K > 1, integrating (2.188) over x gives

1 = ∫ p(x) dx = Σ_{k=1}^{K} π_k ∫ N(x | μ_k, Σ_k) dx = Σ_{k=1}^{K} π_k · 1

and hence Σ_{k=1}^{K} π_k = 1.
##############################

     The parameters π_k in (2.188) (where each N(x | μ_k, Σ_k) is a component of the mixture with its own mean μ_k and covariance Σ_k) are called mixing coefficients. If we integrate both sides of (2.188) with respect to x, and note that both p(x) and the individual Gaussian components are normalized, we obtain

Σ_{k=1}^{K} π_k = 1   (2.189)

Also, the requirement that p(x) >= 0, together with N(x | μ_k, Σ_k) >= 0, implies π_k >= 0 for all k. Combining this with the condition (2.189) we obtain

0 ≤ π_k ≤ 1   (2.190)

     We therefore see that the mixing coefficients π_k satisfy the requirements to be probabilities.

     From the sum and product rules, the marginal density is given by

p(x) = Σ_{k=1}^{K} p(k) p(x|k)   (2.191)

which is equivalent to (2.188), in which we can view π_k = p(k) as the prior probability of picking the kth component, and the density N(x | μ_k, Σ_k) = p(x|k) as the probability of x conditioned on k. As we shall see in later chapters, an important role is played by the posterior probabilities p(k|x), which are also known as responsibilities. From Bayes’ theorem these are given by

γ_k(x) ≡ p(k|x) = p(k) p(x|k) / Σ_l p(l) p(x|l) = π_k N(x | μ_k, Σ_k) / Σ_l π_l N(x | μ_l, Σ_l)   (2.192)

### using p(x, y) = p(x) p(y|x) = p(y) p(x|y) ### https://blog.csdn.net/Linli522362242/article/details/93034532

https://www.mghassany.com/MLcourse/gaussian-mixture-models-em.html
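Computing the responsibilities of (2.192) for a whole batch of points is a short exercise once the component densities are available. A sketch, assuming parameter arrays pis of shape (K,), mus of shape (K, D) and Sigmas of shape (K, D, D):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    # weighted densities pi_k * N(x_n | mu_k, Sigma_k), shape (N, K)
    weighted = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                                for pi, mu, S in zip(pis, mus, Sigmas)])
    return weighted / weighted.sum(axis=1, keepdims=True)    # normalize over k: gamma_k(x_n)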

We shall discuss the probabilistic interpretation of the mixture distribution in greater detail in Chapter 9.

     The form of the Gaussian mixture distribution is governed by the parameters π, μ and Σ, where we have used the notation π ≡ {π1, . . . , πK}, μ ≡ {μ1, . . . ,μK} and Σ ≡ {Σ1, . . .ΣK}. One way to set the values of these parameters is to use
maximum likelihood. From (2.188) the log of the likelihood function is given by

ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) }   (2.193)

     where X = {x_1, . . . , x_N}. We immediately see that the situation is now much more complex than with a single Gaussian, due to the presence of the summation over k inside the logarithm. As a result, the maximum likelihood solution for the parameters no longer has a closed-form analytical solution. One approach to maximizing the likelihood function is to use iterative numerical optimization techniques (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008). Alternatively
we can employ a powerful framework called expectation maximization, which will be discussed at length in Chapter 9.
#######################################################

9.2. Mixtures of Gaussians

     In Section 2.3.9 we motivated the Gaussian mixture model as a simple linear superposition of Gaussian components, aimed at providing a richer class of density models than the single Gaussian. We now turn to a formulation of Gaussian mixtures in terms of discrete latent variables. This will provide us with a deeper insight into this important distribution, and will also serve to motivate the expectation-maximization algorithm.

     Recall from (2.188) that the Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form
p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)   (9.7)

Figure 9.4 Graphical representation of a mixture model, in which the joint distribution is expressed in the form p(x, z) = p(z) p(x|z).

     Let us introduce a K-dimensional binary random variable z having a 1-of-K representation in which a particular element z_k is equal to 1 and all other elements are equal to 0. The values of z_k therefore satisfy z_k ∈ {0, 1} and Σ_k z_k = 1, and we see that there are K possible states for the vector z according to which element is nonzero. We shall define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x|z), corresponding to the graphical model in Figure 9.4. The marginal distribution over z is specified in terms of the mixing coefficients π_k, such that

p(z_k = 1) = π_k

where the parameters {π_k} must satisfy   0 ≤ π_k ≤ 1   (9.8)        together with   Σ_{k=1}^{K} π_k = 1   (9.9)
in order to be valid probabilities. Because z uses a 1-of-K representation, we can also write this distribution in the form

p(z) = Π_{k=1}^{K} π_k^{z_k}   (9.10)
Similarly, the conditional distribution of x given a particular value for z is a Gaussian

p(x | z_k = 1) = N(x | μ_k, Σ_k)

which can also be written in the form

p(x | z) = Π_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k}   (9.11)

     The joint distribution is given by p(z) p(x|z), and the marginal distribution of x is then obtained by summing the joint distribution over all possible states of z to give

p(x) = Σ_z p(z) p(x|z) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)   (9.12)

where we have made use of (9.10) and (9.11). Thus the marginal distribution of x is a Gaussian mixture of the form (9.7). If we have several observations x_1, . . . , x_N, then, because we have represented the marginal distribution in the form p(x) = Σ_z p(x, z), it follows that for every observed data point x_n there is a corresponding latent variable z_n.

     We have therefore found an equivalent formulation of the Gaussian mixture involving an explicit latent variable. It might seem that we have not gained much by doing so. However, we are now able to work with the joint distribution p(x, z) instead of the marginal distribution p(x), and this will lead to significant simplifications, most notably through the introduction of the expectation-maximization (EM) algorithm.

     Another quantity that will play an important role is the conditional probability of z given x. We shall use γ(z_k) to denote p(z_k = 1 | x), whose value can be found using Bayes’ theorem        ### p(x, y) = p(x) p(y|x) = p(y) p(x|y) ###

γ(z_k) ≡ p(z_k = 1 | x) = p(z_k = 1) p(x | z_k = 1) / Σ_{j=1}^{K} p(z_j = 1) p(x | z_j = 1) = π_k N(x | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x | μ_j, Σ_j)   (9.13)

     We shall view π_k as the prior probability of z_k = 1  ### p(z_k = 1) = π_k ###, and the quantity γ(z_k) as the corresponding posterior probability once we have observed x. As we shall see later, γ(z_k) can also be viewed as the responsibility that component k takes for ‘explaining’ the observation x.

     We can use the technique of ancestral sampling to generate random samples distributed according to the Gaussian mixture model. 

  • To do this, we first generate a value for z, which we denote ẑ, from the marginal distribution p(z),
  • and then generate a value for x from the conditional distribution p(x | ẑ).

     Techniques for sampling from standard distributions are discussed in Chapter 11. We can depict samples from the joint distribution p(x, z) by plotting points at the corresponding values of x ###( p(x, z) ~ x )### and then colouring them according to the value of z, in other words according to which Gaussian component was responsible for generating them, as shown in Figure 9.5(a). Similarly samples from the marginal distribution p(x) ###( p(x)~x )### are obtained by taking the samples from the joint distribution and ignoring the values of z. These are illustrated in Figure 9.5(b) by plotting the x values without any coloured labels.
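Ancestral sampling from a Gaussian mixture is exactly the two-step recipe listed above: draw the component index from p(z), then draw x from the corresponding Gaussian. A sketch producing a data set in the spirit of Figure 9.5, with parameter values invented purely for illustration:

import numpy as np

rng = np.random.RandomState(42)

pis    = np.array([0.5, 0.3, 0.2])                              # mixing coefficients p(z_k = 1)
mus    = np.array([[0., 0.], [3., 3.], [0., 4.]])
Sigmas = np.array([0.5 * np.eye(2), 1.0 * np.eye(2), 0.3 * np.eye(2)])

N = 500
z = rng.choice(len(pis), size=N, p=pis)                         # step 1: sample z-hat from p(z)
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])   # step 2: x ~ p(x | z-hat)
# (X, z) are samples from the joint p(x, z) (the "complete" data set);
# discarding z leaves samples from the marginal p(x) (the "incomplete" data set)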
Figure 9.5 Example of 500 points drawn from the mixture of 3 Gaussians shown in Figure 2.23. (a) Samples from the joint distribution p(z) p(x|z) in which the three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue,
and (b) the corresponding samples from the marginal distribution p(x), which is obtained by simply ignoring the values of z and just plotting the x values. The data set in (a) is said to be complete, whereas that in (b) is incomplete.
(c) The same samples in which the colours represent the value of the responsibilities γ(z_nk) associated with data point x_n, obtained by plotting the corresponding point using proportions of red, blue, and green ink given by γ(z_nk) for k = 1, 2, 3, respectively.

     We can also use this synthetic data set to illustrate the ‘responsibilities’ by evaluating, for every data point, the posterior probability for each component in the mixture distribution from which this data set was generated. In particular, we can represent the value of the responsibilities γ(z_nk) associated with data point x_n by plotting the corresponding point using proportions of red, blue, and green ink given by γ(z_nk) for k = 1, 2, 3, respectively, as shown in Figure 9.5(c). So, for instance, a data point for which γ(z_n1) = 1 will be coloured red, whereas one for which γ(z_n2) = γ(z_n3) = 0.5 will be coloured with equal proportions of blue and green ink and so will appear cyan. This should be compared with Figure 9.5(a) in which the data points were labelled using the true identity of the component from which they were generated.
https://www.mghassany.com/MLcourse/gaussian-mixture-models-em.html

9.2.1 Maximum likelihood

  Figure 9.6 Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}, where n = 1, . . . , N.

     Suppose we have a data set of observations {x_1, . . . , x_N}, and we wish to model this data using a mixture of Gaussians. We can represent this data set as an N × D matrix X in which the nth row is given by x_nᵀ. Similarly, the corresponding latent variables will be denoted by an N × K matrix Z with rows z_nᵀ. If we assume that the data points are drawn independently from the distribution, then we can express the Gaussian mixture model for this i.i.d. data set using the graphical representation shown in Figure 9.6. From (9.7) the log of the likelihood function is given by
ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) }   (9.14)

     Before discussing how to maximize this function, it is worth emphasizing that there is a significant problem associated with the maximum likelihood framework applied to Gaussian mixture models, due to the presence of singularities. For simplicity, consider a Gaussian mixture whose components have covariance matrices given by Σ_k = σ_k² I, where I is the unit matrix, although the conclusions will hold for general covariance matrices. Suppose that one of the components of the mixture model, let us say the jth component, has its mean μ_j exactly equal to one of the data points so that μ_j = x_n for some value of n. This data point will then contribute a term in the likelihood function of the form

N(x_n | x_n, σ_j² I) = 1 / ((2π)^{D/2} σ_j^D)   (9.15)
#######################################

     For a D-dimensional vector x, the multivariate Gaussian distribution takes the form

N(x | μ, Σ) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp{ −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) }   (2.43)

where μ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes the determinant of Σ.
#######################################

     If we consider the limit σ_j → 0, then we see that this term goes to infinity and so the log likelihood function will also go to infinity. Thus the maximization of the log likelihood function is not a well posed problem because such singularities will always be present and will occur whenever one of the Gaussian components ‘collapses’ onto a specific data point. Recall that this problem did not arise in the case of a single Gaussian distribution. To understand the difference, note that if a single Gaussian collapses onto a data point it will contribute multiplicative factors to the likelihood function arising from the other data points and these factors will go to zero exponentially fast, giving an overall likelihood that goes to zero rather than infinity. However, once we have (at least) two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points while the other component can shrink onto one specific data point and thereby contribute an ever increasing additive value to the log likelihood. This is illustrated in Figure 9.7. These singularities provide another example of the severe over-fitting that can occur in a maximum likelihood approach. We shall see in Section 10.1 that this difficulty does not occur if we adopt a Bayesian approach. For the moment, however, we simply note that in applying maximum likelihood to Gaussian mixture models we must take steps to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved. We can hope to avoid the singularities by using suitable heuristics, for instance by detecting when a Gaussian component is collapsing and resetting its mean to a randomly chosen value while also resetting its covariance to some large value, and then continuing with the optimization.
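The heuristic just described can be sketched in a few lines: after each M step one can check whether any covariance has (nearly) collapsed and, if so, re-initialize that component. The determinant threshold and the reset scale below are arbitrary choices, not part of the algorithm itself:

import numpy as np

def reset_collapsing_components(mus, Sigmas, X, rng, det_threshold=1e-10, reset_scale=1.0):
    """Detect components whose covariance has shrunk towards a singularity and reset them."""
    D = X.shape[1]
    for k in range(len(mus)):
        if np.linalg.det(Sigmas[k]) < det_threshold:       # component k is collapsing onto a data point
            mus[k] = X[rng.randint(len(X))]                # reset its mean to a randomly chosen data point
            Sigmas[k] = reset_scale * np.eye(D)            # reset its covariance to some large value
    return mus, Sigmas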

 Figure 9.7 Illustration of how singularities in the likelihood function arise with mixtures of Gaussians. This should be compared with the case of a single Gaussian shown in Figure 1.14 for which no singularities arise.
########################################
 Figure 1.14 Illustration of the likelihood function for a Gaussian distribution, shown by the red curve. Here the black points denote a data set of values {x_n}, and the likelihood function given by (1.53) corresponds to the product of the blue values. Maximizing the likelihood involves adjusting the mean and variance of the Gaussian so as to maximize this product.

     Now suppose that we have a data set of observations x = (x_1, . . . , x_N)ᵀ, representing N observations of the scalar variable x. Note that we are using the typeface x for the data set to distinguish it from a single observation of the vector-valued variable (x_1, . . . , x_D)ᵀ, which we denote by x. We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean μ and variance σ² are unknown, and we would like to determine these parameters from the data set. Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d. We have seen that the joint probability of two independent events is given by the product of the marginal probabilities for each event separately. Because our data set x is i.i.d., we can therefore write the probability of the data set, given μ and σ², in the form

p(x | μ, σ²) = Π_{n=1}^{N} N(x_n | μ, σ²)   (1.53)

     When viewed as a function of μ and σ², this is the likelihood function for the Gaussian and is interpreted diagrammatically in Figure 1.14.
########################################

     A further issue in finding maximum likelihood solutions arises from the fact that for any given maximum likelihood solution, a K-component mixture will have a total of K! equivalent solutions corresponding to the K! ways of assigning K sets of parameters to K components. In other words, for any given (nondegenerate) point in the space of parameter values there will be a further K!−1 additional points all of which give rise to exactly the same distribution. This problem is known as identifiability (Casella and Berger, 2002) and is an important issue when we wish to interpret the parameter values discovered by a model. Identifiability will also arise when we discuss models having continuous latent variables in Chapter 12. However,
for the purposes of finding a good density model, it is irrelevant because any of the equivalent solutions is as good as any other.

     Maximizing the log likelihood function (9.14 ) for a Gaussian mixture model turns out to be a more complex problem than for the case of a single Gaussian. The difficulty arises from the presence of the summation over k that appears inside the logarithm in (9.14), so that the logarithm function no longer acts directly on the Gaussian. If we set the derivatives of the log likelihood to zero, we will no longer obtain a closed form solution, as we shall see shortly.

     One approach is to apply gradient-based optimization techniques (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008). Although gradient-based techniques are feasible, and indeed will play an important role when we discuss
mixture density networks in Chapter 5, we now consider an alternative approach known as the EM algorithm which has broad applicability and which will lay the foundations for a discussion of variational inference techniques in Chapter 10.

9.2.2 EM for Gaussian mixtures

     An elegant and powerful method for finding maximum likelihood solutions for models with latent variables is called the expectation-maximization algorithm, or EM algorithm (Dempster et al., 1977; McLachlan and Krishnan, 1997). Later we shall give a general treatment of EM, and we shall also show how EM can be generalized to obtain the variational inference framework. Initially, we shall motivate the EM algorithm by giving a relatively informal treatment in the context of the Gaussian
mixture model. We emphasize, however, that EM has broad applicability, and indeed it will be encountered in the context of a variety of different models in this book.

     Let us begin by writing down the conditions that must be satisfied at a maximum of the likelihood function. Setting the derivatives of ln p(X|π,μ,Σ) in (9.14) with respect to the means μ_k of the Gaussian components to zero, we obtain

0 = − Σ_{n=1}^{N} [ π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j) ] Σ_k⁻¹ (x_n − μ_k)   (9.16)

where we have made use of the form (2.43) for the Gaussian distribution. Note that the posterior probabilities, or responsibilities, given by (9.13) appear naturally on the right-hand side. Multiplying by Σ_k (which we assume to be nonsingular) and rearranging we obtain

μ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n   (9.17)      where we have defined      N_k = Σ_{n=1}^{N} γ(z_nk)   (9.18)
We can interpret N_k as the effective number of points assigned to cluster k. Note carefully the form of this solution. We see that the mean μ_k for the kth Gaussian component is obtained by taking a weighted mean of all of the points in the data set, in which the weighting factor for data point x_n is given by the posterior probability γ(z_nk) that component k was responsible for generating x_n.

     If we set the derivative of ln p(X|π,μ,Σ) with respect to Σ_k to zero, and follow a similar line of reasoning, making use of the result for the maximum likelihood solution for the covariance matrix of a single Gaussian (Section 2.3.4), we obtain

Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − μ_k)(x_n − μ_k)ᵀ   (9.19)

     which has the same form as the corresponding result for a single Gaussian fitted to the data set, but again with each data point weighted by the corresponding posterior probability and with the denominator given by the effective number of points
associated with the corresponding component.

     Finally, we maximize ln p(X|π,μ,Σ) with respect to the mixing coefficients π_k. Here we must take account of the constraint (9.9), which requires the mixing coefficients to sum to one. This can be achieved using a Lagrange multiplier and maximizing the following quantity

ln p(X|π,μ,Σ) + λ ( Σ_{k=1}^{K} π_k − 1 )   (9.20)

which gives

0 = Σ_{n=1}^{N} [ N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j) ] + λ   (9.21)
##################################
note: (9.21) follows from taking the derivative of (9.20) with respect to π_k and setting it to zero.
##################################
where again we see the appearance of the responsibilities. If we now multiply both sides of (9.21) by π_k and sum over k, making use of the constraint (9.9), we find λ = −N. Using this to eliminate λ and rearranging we obtain

π_k = N_k / N   (9.22)

so that the mixing coefficient for the kth component is given by the average responsibility which that component takes for explaining the data points.

     It is worth emphasizing that the results (9.17), (9.19), and (9.22) do not constitute a closed-form solution for the parameters of the mixture model because the responsibilities depend on those parameters in a complex way through (9.13). However, these results do suggest a simple iterative scheme for finding a solution to the maximum likelihood problem, which as we shall see turns out to be an instance of the EM algorithm for the particular case of the Gaussian mixture model.

  • We first choose some initial values for the means, covariances, and mixing coefficients.
  • Then we alternate between the following two updates that we shall call the E step and the M step, for reasons that will become apparent shortly.

In the expectation step, or E step, we use the current values for the parameters to evaluate the posterior probabilities, or responsibilities γ(z_nk), given by (9.13). We then use these probabilities in the maximization step, or M step, to re-estimate the means, covariances, and mixing coefficients using the results (9.17), (9.19), and (9.22). Note that in so doing we first evaluate the new means using (9.17) and then use these new values to find the covariances using (9.19), in keeping with the corresponding result for a single Gaussian distribution. We shall show that each update to the parameters resulting from an E step followed by an M step is guaranteed to increase the log likelihood function (Section 9.4). In practice, the algorithm is deemed to have converged when the change in the log likelihood function, or alternatively in the parameters, falls below some threshold. We illustrate the EM algorithm for a mixture of two Gaussians applied to the rescaled Old Faithful data set in Figure 9.8.
Figure 9.8 Illustration of the EM algorithm using the Old Faithful set as used for the illustration of the K-means algorithm in Figure 9.1. See the text for details.

Here a mixture of two Gaussians is used, with centres initialized using the same values as for the K-means algorithm in Figure 9.1, and with precision matrices initialized to be proportional to the unit matrix. Plot (a) shows the data points in green, together with the initial configuration of the mixture model in which the one standard-deviation contours for the two Gaussian components are shown as blue and red circles. Plot (b) shows the result of the initial E step, in which each data point is depicted using a proportion of blue ink equal to the posterior probability of having been generated from the blue component, and a corresponding proportion of red ink given by the posterior probability of having been generated by the red component. Thus, points that have a significant probability for belonging to either cluster appear purple. The situation after the first M step is shown in plot (c), in which the mean of the blue Gaussian has moved to the mean of the data set, weighted by the probabilities of each data point belonging to the blue cluster, in other words it has moved to the centre of mass of the blue ink. Similarly, the covariance of the blue Gaussian is set equal to the covariance of the blue ink. Analogous results hold for the red component. Plots (d), (e), and (f) show the results after 2, 5, and 20 complete cycles of EM, respectively. In plot (f) the algorithm is close to convergence.

     Note that the EM algorithm takes many more iterations to reach (approximate) convergence compared with the K-means algorithm, and that each cycle requires significantly more computation. It is therefore common to run the K-means algorithm in order to find a suitable initialization for a Gaussian mixture model that is subsequently adapted using EM. The covariance matrices can conveniently be initialized to the sample covariances of the clusters found by the K-means algorithm, and the mixing coefficients can be set to the fractions of data points assigned to the respective clusters. As with gradient-based approaches for maximizing the log likelihood, techniques must be employed to avoid singularities of the likelihood function in which a Gaussian component collapses onto a particular data point. It should be emphasized that there will generally be multiple local maxima of the log likelihood function, and that EM is not guaranteed to find the largest of these maxima. Because the EM algorithm for Gaussian mixtures plays such an important role, we summarize it below.
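In scikit-learn this whole recipe is available off the shelf: sklearn.mixture.GaussianMixture initializes the means with k-means by default (init_params='kmeans') and then runs EM. A minimal usage sketch on a data matrix X (for instance the standardized Old Faithful data used above):

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, covariance_type='full',
                      init_params='kmeans',    # k-means supplies the initial means (the default)
                      random_state=42).fit(X)

gmm.weights_               # mixing coefficients pi_k
gmm.means_                 # component means mu_k
gmm.covariances_           # component covariances Sigma_k
gmm.predict_proba(X)[:5]   # responsibilities gamma(z_nk) for the first five points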

EM for Gaussian Mixtures

https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html

https://github.com/mcdickenson/em-gaussian

https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html

From (2.188) the log of the likelihood function is given by

ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) }   (9.14)

Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters (comprising the means and covariances of the components and the mixing coefficients).

1. Initialize the means μ_k, covariances Σ_k and mixing coefficients π_k, and evaluate the initial value of the log likelihood.

2. E step. Evaluate the responsibilities using the current parameter values
γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j)   (9.23)

3. M step. Re-estimate the parameters using the current responsibilities

μ_k^new = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n   (9.24)

Σ_k^new = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − μ_k^new)(x_n − μ_k^new)ᵀ   (9.25)

π_k^new = N_k / N   (9.26)

where   N_k = Σ_{n=1}^{N} γ(z_nk)   (9.27)

4. Evaluate the log likelihood
ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) }   (9.28)

and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied return to step 2.
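Putting steps 1-4 together, here is a compact NumPy/SciPy sketch of EM for a Gaussian mixture. It is a didactic version, not a hardened implementation: the log-sum-exp trick and the covariance regularization used by production code such as scikit-learn's are omitted.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=42):
    N, D = X.shape
    rng = np.random.RandomState(seed)
    # 1. initialize means (random data points), covariances and mixing coefficients
    mus = X[rng.permutation(N)[:K]].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    old_ll = -np.inf

    for _ in range(n_iter):
        # weighted densities pi_k * N(x_n | mu_k, Sigma_k) under the current parameters, shape (N, K)
        weighted = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                    for k in range(K)])

        # 4. log likelihood (9.28) under the current parameters; check for convergence
        ll = np.log(weighted.sum(axis=1)).sum()
        if np.abs(ll - old_ll) < tol:
            break
        old_ll = ll

        # 2. E step: responsibilities gamma(z_nk), equation (9.23)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # 3. M step: re-estimate the parameters, equations (9.24)-(9.27)
        Nk = gamma.sum(axis=0)                          # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N

    return pis, mus, Sigmas

For example, pis, mus, Sigmas = em_gmm(X, K=2) on the standardized Old Faithful data should recover the two clumps seen in Figure 9.8.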

https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

https://github.com/scikit-learn/scikit-learn/blob/483cd3eaa/sklearn/mixture/_gaussian_mixture.py#L434

https://github.com/rushter/MLAlgorithms/blob/master/mla/gaussian_mixture.py

08_Dimensionality Reduction

https://blog.csdn.net/Linli522362242/article/details/106214887
