CSCI 567 Homework #4 Programming Assignment (Python)

CSCI 567, Fall '18, Haipeng Luo
Homework #4 Programming Assignment
Due: 11:59 pm, November 4, 2018

General instructions

Your repository will now have a directory "P4/". Please do not change the name of this repository or the names of any files we have added to it. Please perform a git pull to retrieve these files. You will find within it: baboon.tiff, data_loader.py, gmm.py, gmm.sh, gmmTest.py, kmeans.py, kmeans.sh, kmeansTest.py, utils.py. Depending on your environment, you may need to install the Python library named "pillow", which is used by matplotlib to process some of the images needed for this assignment. You do not need to import pillow in the files we provide; it only needs to be installed in your environment. You can install it by running 'pip install pillow' on your command line.

High Level Description

0.1 Tasks. In this assignment you are asked to implement K-means clustering to identify the main clusters in the data, use the discovered cluster centroids for classification, and implement Gaussian Mixture Models to learn a generative model of the data. Specifically, you will:

- Implement the K-means clustering algorithm to identify clusters in a two-dimensional toy dataset.
- Implement image compression using the K-means clustering algorithm.
- Implement classification using the centroids identified by clustering.
- Implement Gaussian Mixture Models to learn a generative model and generate samples from a mixture distribution.

0.2 Running the code. We have provided two scripts to run the code. Run kmeans.sh after you finish the implementation of K-means clustering, classification, and compression. Run gmm.sh after you finish implementing Gaussian Mixture Models.

0.3 Dataset. Throughout the assignment we will use two datasets (see Figure 1): the Toy Dataset and the Digits Dataset (you do not need to download them).

The Toy Dataset is a two-dimensional dataset generated from 4 Gaussian distributions. We will use this dataset to visualize the results of our algorithm in two dimensions.

We will use the digits dataset from sklearn [1] to test the K-means based classifier and to generate digits using the Gaussian Mixture Model. Each data point is an 8x8 image of a digit. This is similar to MNIST but less complex.

[Figure 1: Datasets. (a) Digits; (b) 2-D Toy Dataset]

0.4 Cautions. Please DO NOT import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup (format, names, initializations, etc.) will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and the generated output files.

0.5 Final submission. After you have solved Problems 1 and 2, execute the bash kmeans.sh and bash gmm.sh commands. Git add, commit, and push the plots and results folders and all the *.py files.

Problem 1: K-means Clustering

Recall that for a dataset $x_1, \ldots, x_N \in \mathbb{R}^D$, the K-means distortion objective is

$$J(\{m_k\}, \{r_{ik}\}) = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \, \|x_i - m_k\|_2^2, \tag{1}$$

where $m_1, \ldots, m_K$ are the cluster centroids and $r_{ik} \in \{0, 1\}$ indicates whether example $i$ is assigned to cluster $k$. Minimizing Eq. (1) alternates between updating the memberships,

$$r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_i - m_j\|_2^2, \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

and updating the means,

$$m_k = \frac{\sum_{i=1}^{N} r_{ik} \, x_i}{\sum_{i=1}^{N} r_{ik}}. \tag{3}$$

Algorithm 1 K-means clustering
1: Inputs:
   An array of size N x D denoting the training set, x
   Number of clusters, K
   Maximum number of iterations, max_iter
   Error tolerance, e
2: Outputs:
   Array of size K x D of means, $\{m_k\}_{k=1}^{K}$
   Membership vector R of size N, where $R[i] \in [K]$ is the index of the cluster that example i belongs to
3: Initialize:
   Set the means $\{m_k\}_{k=1}^{K}$ to be K points selected from x uniformly at random (with replacement), and J to be a large number (e.g. $10^{10}$)
4: repeat
5:   Compute the memberships $r_{ik}$ using Eq. (2)
6:   Compute the distortion measure $J_{new}$ using Eq. (1)
7:   if $|J - J_{new}| \le e$ then STOP
8:   end if
9:   Set $J := J_{new}$
10:  Compute the means $m_k$ using Eq. (3)
11: until the maximum number of iterations is reached

1.1 Implementing the K-means clustering algorithm. Implement Algorithm 1 by filling out the TODO parts in the class KMeans of the file kmeans.py. Note the following:

- Use numpy.random.choice for the initialization step.
- If at some iteration there exists a cluster k with no points assigned to it, then do not update the centroid of this cluster for that round.
- While assigning a sample to a cluster, if there is a tie (i.e. the sample is equidistant from two centroids), you should choose the centroid with the smaller index (as numpy.argmin does).

After you complete the implementation, execute the bash kmeans.sh command to run K-means on the toy dataset. You should see three images generated in the plots folder. In particular, you can open toy_dataset_predicted_labels.png and toy_dataset_real_labels.png and compare the clusters identified by the algorithm against the real clusters. Your implementation should be able to recover the correct clusters sufficiently well. Representative images are shown in Figure 2; red dots are cluster centroids. Note that the color coding of the recovered clusters may not match that of the correct clusters; this is due to a mismatch in the ordering of the retrieved clusters and the correct clusters (which is fine).

[Figure 2: Clustering on the toy dataset. (a) Predicted Clusters; (b) Real Clusters]
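The starter code in kmeans.py fixes the exact interface you must fill in; purely as a rough illustration, the sketch below shows how the membership update of Eq. (2), the distortion of Eq. (1), the mean update of Eq. (3), and the tie-breaking and empty-cluster rules from Section 1.1 fit together in plain numpy. The names assign_step and update_step are made up for this sketch and are not part of the provided code.

```python
import numpy as np

def assign_step(x, means):
    # Squared Euclidean distances from every point to every centroid, shape (N, K)
    dists = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # np.argmin returns the first minimum, so ties go to the smaller cluster index
    membership = np.argmin(dists, axis=1)
    # Distortion (Eq. 1): sum of squared distances to the assigned centroids
    distortion = dists[np.arange(x.shape[0]), membership].sum()
    return membership, distortion

def update_step(x, membership, means):
    new_means = means.copy()
    for k in range(means.shape[0]):
        members = x[membership == k]
        if len(members) > 0:          # leave an empty cluster's centroid unchanged
            new_means[k] = members.mean(axis=0)
    return new_means

# Toy usage mirroring the loop of Algorithm 1 on random 2-D data with K = 4
np.random.seed(0)
x = np.random.randn(100, 2)
means = x[np.random.choice(len(x), size=4, replace=True)]
J = 1e10
for _ in range(100):                  # max_iter
    membership, J_new = assign_step(x, means)
    if abs(J - J_new) <= 1e-6:        # error tolerance e
        break
    J = J_new
    means = update_step(x, membership, means)
```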
1.2 Image compression with K-means. In the next part, we will look at lossy image compression as an application of clustering. The idea is simply to treat each pixel of an image as a point $x_i$, then perform the K-means algorithm to cluster these points, and finally replace each pixel with its centroid.

What you need to implement is compressing an image given K centroids. Specifically, complete the function transform_image in the file kmeansTest.py.

After your implementation, execute bash kmeans.sh again and you should see an image baboon_compressed.png in the plots folder. You can see that this image is distorted compared to the original baboon.tiff.

1.3 Classification with K-means. Another application of clustering is to obtain a faster version of the nearest neighbor algorithm. Recall that nearest neighbor evaluates the distance of a test sample from every training point to predict its class, which can be very slow. Instead, we can compress the entire training set to just the K centroids, where each centroid is now labeled as the majority class of the corresponding cluster. After this compression, the prediction time of nearest neighbor is reduced from O(N) to just O(K) (see Algorithm 2 for the pseudocode).

Algorithm 2 Classification with K-means clustering
1: Inputs:
   Training data {X, Y}
   Parameters for running K-means clustering
2: Training:
   Run K-means clustering to find the centroids and memberships (reuse your code from Problem 1.1)
   Label each centroid by majority vote of its members, i.e. $\arg\max_c \sum_i r_{ik} \, \mathbb{I}\{y_i = c\}$
3: Prediction:
   Predict the same label as the nearest centroid (that is, 1-NN on the centroids)
Note: 1) break ties in the same way as in the previous problems; 2) if some centroid does not contain any point, set the label of this centroid to 0.

Complete the fit and predict functions in KMeansClassifier in the file kmeans.py. Once completed, run kmeans.sh to evaluate the classifier on a test set. For comparison, the script will also print the accuracy of a logistic classifier and a nearest neighbor classifier. (Note: a naive K-means classifier may not do well, but it can be an effective unsupervised component in a classification pipeline [2].)
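Again as a sketch only (the required fit and predict methods live in KMeansClassifier and their signatures are fixed by the starter code), the centroid-labeling and 1-NN prediction steps of Algorithm 2 might look like the following in numpy; label_centroids and predict_labels are hypothetical helper names.

```python
import numpy as np

def label_centroids(membership, y, n_clusters):
    # Majority-vote label per centroid; an empty cluster gets label 0
    centroid_labels = np.zeros(n_clusters, dtype=int)
    for k in range(n_clusters):
        member_labels = y[membership == k]
        if len(member_labels) > 0:
            counts = np.bincount(member_labels)
            centroid_labels[k] = np.argmax(counts)   # ties go to the smaller class index
    return centroid_labels

def predict_labels(x_test, centroids, centroid_labels):
    # 1-NN on the centroids: each test point takes its nearest centroid's label
    dists = ((x_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argmin(dists, axis=1)               # ties go to the smaller centroid index
    return centroid_labels[nearest]
```

Training amounts to running the K-means code from Problem 1.1 to obtain centroids and memberships and then labeling each centroid once; prediction cost then depends only on the number of centroids K, which is the speed-up the section describes.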
Problem 2: Gaussian Mixture Model

Next you will implement a Gaussian Mixture Model (GMM) for clustering and also generate data after learning the model. Recall the key steps of the EM algorithm for learning GMMs from Slide 52 of Lecture 8 (we change the notation $w_k$ to $\pi_k$): Eq. (4) is the E-step responsibility, Eqs. (6)-(8) are the M-step updates for the means, variances, and mixture weights, and Eq. (9) is the log-likelihood,

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid m_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid m_j, \Sigma_j)}, \tag{4}$$

$$m_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, x_i}{\sum_{i=1}^{N} \gamma_{ik}}, \tag{6}$$

$$\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \,(x_i - m_k)(x_i - m_k)^\top}{\sum_{i=1}^{N} \gamma_{ik}}, \tag{7}$$

$$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \tag{8}$$

$$\ell = \sum_{i=1}^{N} \log \Big( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid m_k, \Sigma_k) \Big), \tag{9}$$

where $\mathcal{N}(x \mid m, \Sigma)$ denotes the multivariate Gaussian density.

2.1 Implementing EM. Implement the EM algorithm (the class Gaussian_pdf, the function fit, and the function compute_log_likelihood) in the file gmm.py to estimate the mixture model parameters. Please note the following:

- For K-means initialization, the inputs to K-means are the same as those of EM.
- When computing the density of a Gaussian with covariance matrix $\Sigma$, use $\Sigma' = \Sigma + 10^{-3} I$ when $\Sigma$ is not invertible (in case it is still not invertible, keep adding $10^{-3} I$ until it is invertible).

After the implementation, execute the bash gmm.sh command to estimate the mixture parameters for the toy dataset. You should see a Gaussian fitted to each cluster in the data; a representative image is shown in Figure 3. We evaluate both initialization methods, and you should observe that initialization with K-means usually converges faster.

[Figure 3: Gaussian Mixture Model on the toy dataset]

Algorithm 3 EM algorithm for estimating GMM parameters
1: Inputs:
   An array of size N x D denoting the training set, x
   Maximum number of iterations, max_iter
   Number of clusters, K
   Error tolerance, e
   Init method: K-means or random
2: Outputs:
   Array of size K x D of means, $\{m_k\}_{k=1}^{K}$
   Variance matrix $\Sigma_k$, of size K x D x D
   A vector of size K denoting the mixture weights, pi_k
3: Initialize:
   For the "random" init method: initialize the means uniformly at random from [0, 1) for each dimension (use numpy.random.rand), initialize the variance to be the identity matrix for each component, and initialize the mixture weights to be uniform.
   For the "K-means" init method: run K-means, initialize the means as the centroids found by K-means, and initialize the variance and mixture weights according to Eq. (7) and Eq. (8), where $\gamma_{ik}$ is the binary membership found by K-means.
4: Compute the log-likelihood $\ell$ using Eq. (9)
5: repeat
6:   E step: compute the responsibilities using Eq. (4)
7:   M step:
     Estimate the means using Eq. (6)
     Estimate the variance using Eq. (7)
     Estimate the mixture weights using Eq. (8)
8:   Compute the new log-likelihood $\ell_{new}$
9:   if $|\ell - \ell_{new}| \le e$ then STOP
10:  end if
11:  Set $\ell := \ell_{new}$
12: until the maximum number of iterations is reached

2.2 Implementing sampling. We also fit a GMM with K = 30 on the digits dataset. An advantage of GMM compared to K-means is that we can sample from the learned distribution to generate new synthetic examples which look similar to the actual data.

To do this, implement the sample function in gmm.py, which uses self.means, self.variances, and self.pi_k to generate digits. Recall that sampling from a GMM is a two-step process: 1) first sample a component k according to the mixture weights; 2) then sample from a Gaussian distribution with mean $m_k$ and variance $\Sigma_k$. Use numpy.random for these sampling steps.

After the implementation, execute bash gmm.sh again. This should produce a visualization of the means $m_k$ and some generated samples from the learned GMM. Representative images are shown in Figure 4.

[Figure 4: (a) Means of the GMM learned on digits; (b) Random digit samples generated from the GMM]
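The sample function in gmm.py must use the attributes self.means, self.variances, and self.pi_k; purely as an illustration of the two-step procedure above, a standalone numpy sketch could look like the following (draw_samples is a made-up name for this example).

```python
import numpy as np

def draw_samples(n, pi_k, means, variances):
    # Draw n samples from a GMM with weights pi_k (K,), means (K, D), covariances (K, D, D)
    K, D = means.shape
    samples = np.zeros((n, D))
    for i in range(n):
        # Step 1: pick a component according to the mixture weights
        k = np.random.choice(K, p=pi_k)
        # Step 2: sample from that component's Gaussian
        samples[i] = np.random.multivariate_normal(means[k], variances[k])
    return samples

# Example: 100 draws from a 2-component GMM in 2-D
pi_k = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.stack([np.eye(2), 0.5 * np.eye(2)])
x = draw_samples(100, pi_k, means, variances)
```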


