A MATLAB Example of Gaussian Mixture Models


Introduction to Gaussian Mixture Models

Gaussian mixture models are formed by combining multivariate normal density components. For information on individual multivariate normal densities, see Multivariate Normal Distribution and related distribution functions listed under Multivariate Distributions.

In Statistics Toolbox software, use the gmdistribution class to fit data using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation.

Gaussian mixture models are often used for data clustering. Clusters are assigned by selecting the component that maximizes the posterior probability. Like k-means clustering, Gaussian mixture modeling uses an iterative algorithm that converges to a local optimum. Gaussian mixture modeling may be more appropriate than k-means clustering when clusters have different sizes and different correlation structures within them. Clustering with Gaussian mixture models is sometimes considered a soft clustering method, because the posterior probabilities give each data point some degree of membership in every cluster rather than a single hard assignment.

Creation of Gaussian mixture models is described in the Gaussian Mixture Models section of Probability Distributions. This section describes their application in cluster analysis.

Clustering with Gaussian Mixtures

Gaussian mixture distributions can be used for clustering data, by realizing that the multivariate normal components of the fitted model can represent clusters.

  1. To demonstrate the process, first generate some simulated data from a mixture of two bivariate Gaussian distributions using the mvnrnd function:

    mu1 = [1 2];
    sigma1 = [3 .2; .2 2];
    mu2 = [-1 -2];
    sigma2 = [2 0; 0 1];
    X = [mvnrnd(mu1,sigma1,200);mvnrnd(mu2,sigma2,100)];
    
    scatter(X(:,1),X(:,2),10,'ko')
    

  2. Fit a two-component Gaussian mixture distribution. Here, you know the correct number of components to use. In practice, with real data, this decision would require comparing models with different numbers of components (a brief model-selection sketch follows the output below).

    options = statset('Display','final');
    gm = gmdistribution.fit(X,2,'Options',options);

    This displays

     49 iterations, log-likelihood = -1207.91
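
    With real data, you would typically fit several candidate models and compare them. As a rough sketch of that step (not part of the original example), the fitted gmdistribution object exposes AIC and BIC properties that can be compared across choices of the number of components:

    candidateBIC = zeros(1,4);
    for k = 1:4
        gmk = gmdistribution.fit(X,k);   % fit a k-component mixture
        candidateBIC(k) = gmk.BIC;       % Bayesian information criterion
    end
    [~,bestK] = min(candidateBIC)        % smaller BIC indicates a better fit/complexity trade-off
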
  3. Plot the estimated probability density contours for the two-component mixture distribution. The two bivariate normal components overlap, but their peaks are distinct. This suggests that the data could reasonably be divided into two clusters:

    hold on
    ezcontour(@(x,y)pdf(gm,[x y]),[-8 6],[-8 6]);
    hold off
    

  4. Partition the data into clusters using the cluster method for the fitted mixture distribution. The cluster method assigns each point to one of the two components in the mixture distribution.

    idx = cluster(gm,X);
    cluster1 = (idx == 1);
    cluster2 = (idx == 2);
    
    scatter(X(cluster1,1),X(cluster1,2),10,'r+');
    hold on
    scatter(X(cluster2,1),X(cluster2,2),10,'bo');
    hold off
    legend('Cluster 1','Cluster 2','Location','NW')
    

    Each cluster corresponds to one of the bivariate normal components in the mixture distribution. cluster assigns points to clusters based on the estimated posterior probability that a point came from a component; each point is assigned to the cluster corresponding to the highest posterior probability. The posterior method returns those posterior probabilities.

    For example, plot the posterior probability of the first component for each point:

    P = posterior(gm,X);
    
    scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
    hold on
    scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
    hold off
    legend('Cluster 1','Cluster 2','Location','NW')
    clrmap = jet(80); colormap(clrmap(9:72,:))
    ylabel(colorbar,'Component 1 Posterior Probability')
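
    As a quick consistency check (a minimal sketch, not part of the original example), the hard assignments returned by cluster should coincide with picking the largest posterior probability in each row of P:

    [~,idxFromP] = max(P,[],2);   % component with the highest posterior for each point
    isequal(idxFromP,idx)         % expected to return 1 (true)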
    

Soft Clustering Using Gaussian Mixture Distributions

An alternative to the previous example is to use the posterior probabilities for "soft clustering". Each point is assigned a membership score to each cluster. Membership scores are simply the posterior probabilities, and describe how similar each point is to each cluster's archetype, i.e., the mean of the corresponding component. The points can be ranked by their membership score in a given cluster:

[~,order] = sort(P(:,1));
plot(1:size(X,1),P(order,1),'r-',1:size(X,1),P(order,2),'b-');
legend({'Cluster 1 Score' 'Cluster 2 Score'},'location','NW');
ylabel('Cluster Membership Score');
xlabel('Point Ranking');

Although a clear separation of the data is hard to see in a scatter plot of the data, plotting the membership scores indicates that the fitted distribution does a good job of separating the data into groups. Very few points have scores close to 0.5.
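
As a rough numeric check (a sketch, using an arbitrary 0.6 cutoff), you can count the points whose largest membership score falls below that threshold:

sum(max(P,[],2) < 0.6)   % number of points without a strong assignment to either cluster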

Soft clustering using a Gaussian mixture distribution is similar to fuzzy K-means clustering, which also assigns each point to each cluster with a membership score. The fuzzy K-means algorithm assumes that clusters are roughly spherical in shape, and all of roughly equal size. This is comparable to a Gaussian mixture distribution with a single covariance matrix that is shared across all components, and is a multiple of the identity matrix. In contrast, gmdistribution allows you to specify different covariance options. The default is to estimate a separate, unconstrained covariance matrix for each component. A more restricted option, closer to K-means, would be to estimate a shared, diagonal covariance matrix:

gm2 = gmdistribution.fit(X,2,'CovType','Diagonal',...
  'SharedCov',true);

This covariance option is similar to fuzzy K-means clustering, but provides more flexibility by allowing unequal variances for different variables.
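
To see those unequal variances (a minimal check), inspect the fitted covariance; with 'CovType','Diagonal' the diagonal entries are stored directly rather than as full matrices:

gm2.Sigma   % per-variable variances shared by both components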

You can compute the soft cluster membership scores without computing hard cluster assignments, using posterior, or as part of hard clustering, as the third output from cluster:

P2 = posterior(gm2,X); % equivalently [idx,nlogL,P2] = cluster(gm2,X)
[~,order] = sort(P2(:,1));
plot(1:size(X,1),P2(order,1),'r-',1:size(X,1),P2(order,2),'b-');
legend({'Cluster 1 Score' 'Cluster 2 Score'},'location','NW');
ylabel('Cluster Membership Score');
xlabel('Point Ranking');

Assigning New Data to Clusters

In the previous example, fitting the mixture distribution to data using fit, and clustering those data using cluster, are separate steps. However, the same data are used in both steps. You can also use the cluster method to assign new data points to the clusters (mixture components) found in the original data.

  1. Given a data set X, first fit a Gaussian mixture distribution. The previous code has already done that.

    gm
    
    gm = 
    Gaussian mixture distribution with 2 components in 2 dimensions
    Component 1:
    Mixing proportion: 0.312592
    Mean:    -0.9082   -2.1109
    
    Component 2:
    Mixing proportion: 0.687408
    Mean:     0.9532    1.8940
    
  2. You can then use cluster to assign each point in a new data set, Y, to one of the clusters defined for the original data:

    Y = [mvnrnd(mu1,sigma1,50);mvnrnd(mu2,sigma2,25)];
    
    idx = cluster(gm,Y);
    cluster1 = (idx == 1);
    cluster2 = (idx == 2);
    
    scatter(Y(cluster1,1),Y(cluster1,2),10,'r+');
    hold on
    scatter(Y(cluster2,1),Y(cluster2,2),10,'bo');
    hold off
    legend('Class 1','Class 2','Location','NW')
    

    As with the previous example, the posterior probabilities for each point can be treated as membership scores rather than determining "hard" cluster assignments.
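
    For example (a minimal sketch), the scores for the new points can be obtained directly with posterior:

    PY = posterior(gm,Y);   % membership scores of the new points for each component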

For cluster to provide meaningful results with new data, Y should come from the same population as X, the original data used to create the mixture distribution. In particular, the estimated mixing probabilities for the Gaussian mixture distribution fitted to X are used when computing the posterior probabilities for Y.
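
For reference, here is a minimal sketch of that computation done by hand from the fitted parameters, using mvnpdf and Bayes' rule (this assumes the full-covariance fit gm from above; in this class the mixing proportions are stored in the PComponents property):

pri = gm.PComponents;                                % estimated mixing proportions
w1 = pri(1)*mvnpdf(Y,gm.mu(1,:),gm.Sigma(:,:,1));    % weighted density of component 1
w2 = pri(2)*mvnpdf(Y,gm.mu(2,:),gm.Sigma(:,:,2));    % weighted density of component 2
Pmanual = [w1 w2]./repmat(w1+w2,1,2);                % Bayes' rule; matches posterior(gm,Y)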

An Additional Example Using fitgmdist

The following is a further example that fits a Gaussian mixture model with the newer fitgmdist function, the recommended successor to gmdistribution.fit:

```matlab
% Generate some random data for training
rng(1);  % set the random seed so the results are reproducible
data = [mvnrnd([1 2], [1 0.5], 200); mvnrnd([4 5], [1 0.5], 200)];

% Set the Gaussian mixture model parameters
numComponents = 2;                   % number of Gaussian components
options = statset('MaxIter', 500);   % cap the number of EM iterations

% Train the Gaussian mixture model
gmModel = fitgmdist(data, numComponents, 'Options', options);

% Scatter plot of the original data
figure;
scatter(data(:, 1), data(:, 2), 'filled');
hold on;

% Plot contours of the fitted mixture probability density
x = min(data(:, 1)):0.1:max(data(:, 1));
y = min(data(:, 2)):0.1:max(data(:, 2));
[X, Y] = meshgrid(x, y);
probs = pdf(gmModel, [X(:), Y(:)]);              % mixture density on the grid
contour(X, Y, reshape(probs, size(X)), 'LineColor', 'r');

% Mark the mean of each fitted component
mu = gmModel.mu;
for i = 1:numComponents
    plot(mu(i, 1), mu(i, 2), 'ko', 'MarkerSize', 8, 'LineWidth', 2);
end
hold off;
title('Gaussian Mixture Model');
xlabel('X');
ylabel('Y');
legend('Data', 'Mixture density', 'Component means', 'Location', 'best');
```

This code first generates some random training data (mvnrnd accepts a row vector of variances as a diagonal covariance), then trains a mixture of numComponents Gaussian components with fitgmdist. Finally, pdf evaluates the fitted mixture density on a grid, contour draws its level curves, and the component means are marked on top of the data.