Clustering is an unsupervised learning method, and unsupervised methods have no training step. "Unsupervised" means we do not know in advance what we are looking for: there is no target variable. Clustering groups data points into clusters, where similar points end up in the same cluster and dissimilar points end up in different clusters. Many different measures can be used to compute similarity.
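To make "similarity" concrete, here is a small sketch (not from the original text; the point values are made up) comparing two common distance measures in NumPy. The implementation below uses Euclidean distance, but other measures such as Manhattan distance are equally valid choices:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance between the two points
euclid = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 16) = 5.0

# Manhattan (city-block) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))        # 3 + 4 = 7.0

print(euclid, manhattan)
```

Which measure is "right" depends on the data; the choice of distance function directly changes which points count as similar, and therefore which clusters come out.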
A widely used clustering algorithm is K-means, where k is the user-specified number of clusters to create. K-means starts with k random centroids. The algorithm computes the distance from each point to every centroid, assigns each point to the cluster of its nearest centroid, and then updates each centroid from the points newly assigned to it. This process repeats until the centroids no longer change. This simple algorithm is quite effective, but it is sensitive to the choice of initial centroids. To obtain better clusterings, a variant called bisecting K-means can be used. Bisecting K-means first treats all points as a single cluster and splits it with K-means (k = 2). On each subsequent iteration, it selects the cluster whose split gives the largest reduction in error and divides it. The process repeats until k clusters have been created. Bisecting K-means generally clusters better than plain K-means. K-means and its variants are not the only clustering algorithms; hierarchical clustering is also widely used.
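The sensitivity to initial centroids mentioned above can be demonstrated with a minimal sketch (illustrative code, not part of the implementation below; the data points and starting centroids are made up). Starting the same K-means loop from two different initializations on the same data converges to very different sums of squared error (SSE):

```python
import numpy as np

def kmeans_fixed_init(data, centroids, n_iter=10):
    """Plain K-means starting from the given centroids (illustrative sketch)."""
    for _ in range(n_iter):
        # assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        centroids = np.array([data[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    sse = ((data - centroids[labels]) ** 2).sum()
    return centroids, sse

# Two tight groups of points; k = 2
data = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])

# A good start: one centroid near each group
_, sse_good = kmeans_fixed_init(data, np.array([[0., 0.], [10., 0.]]))

# A poor start: both centroids inside the same group
_, sse_bad = kmeans_fixed_init(data, np.array([[0., 0.], [0., 1.]]))

print(sse_good, sse_bad)  # the poor start converges to a much larger SSE
```

With the good start the centroids settle in the middle of each group (SSE = 1.0); with the poor start the algorithm converges to a split along the wrong axis (SSE = 100.0). Bisecting K-means reduces this risk by always splitting from a deterministic starting point, the mean of the cluster being divided.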
Below is an implementation of bisecting K-means (Python code):
from numpy import *

# Load the data set: general function to parse a tab-delimited file of floats
def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        dataMat.append(list(map(float, curLine)))  # convert each field to a float
    return dataMat

# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))  # or: la.norm(vecA - vecB)

# Choose k random centroids, each within the bounds of its dimension
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # create a k x n matrix of centroids
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids

# K-means algorithm
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # column 0: index of the assigned centroid; column 1: squared error of the point
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)  # create k centroids
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign point i to the cluster of its nearest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        print(centroids)
        for cent in range(k):  # recalculate centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # all points in this cluster
            centroids[cent, :] = mean(ptsInClust, axis=0)  # set centroid to the mean
    return centroids, clusterAssment

# Bisecting K-means clustering
def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]  # number of rows
    clusterAssment = mat(zeros((m, 2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]  # initialize a single centroid
    centList = [centroid0]  # create a list with one centroid
    for j in range(m):  # compute the initial sum of squared errors (SSE)
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :])**2
    while (len(centList) < k):  # add one centroid per iteration until there are k
        lowestSSE = inf  # initialize to infinity
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]  # all points in cluster i
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])  # SSE of the cluster after splitting
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])  # SSE of all points outside cluster i
            print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # relabel the cluster assignments (note: the data points)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        # update the centroid list (note: the centroids)
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss  # reassign new clusters and SSE
    return mat(centList), clusterAssment

data = mat(loadDataSet('testSet2.txt'))  # loadDataSet already converts fields to floats
print(data)
centList, clusterAssment = biKmeans(data, 3)
print(centList)
Data in testSet2.txt:
3.275154 2.957587
-3.344465 2.603513
0.355083 -3.376585
1.852435 3.547351
-2.078973 2.552013
-0.993756 -0.884433
2.682252 4.007573
-3.087776 2.878713
-1.565978 -1.256985
2.441611 0.444826
-0.659487 3.111284
-0.459601 -2.618005
2.177680 2.387793
-2.920969 2.917485
-0.028814 -4.168078
3.625746 2.119041
-3.912363 1.325108
-0.551694 -2.814223
2.855808 3.483301
-3.594448 2.856651
0.421993 -2.372646
1.650821 3.407572
-2.082902 3.384412
-0.718809 -2.492514
4.513623 3.841029
-4.822011 4.607049
-0.656297 -1.449872
1.919901 4.439368
-3.287749 3.918836
-1.576936 -2.977622
3.598143 1.975970
-3.977329 4.900932
-1.791080 -2.184517
3.914654 3.559303
-1.910108 4.166946
-1.226597 -3.317889
1.148946 3.345138
-2.113864 3.548172
0.845762 -3.589788
2.629062 3.535831
-1.640717 2.990517
-1.881012 -2.485405
4.606999 3.510312
-4.366462 4.023316
0.765015 -3.001270
3.121904 2.173988
-4.025139 4.652310
-0.559558 -3.840539
4.376754 4.863579
-1.874308 4.032237
-0.089337 -3.026809
3.997787 2.518662
-3.082978 2.884822
0.845235 -3.454465
1.327224 3.358778
-2.889949 3.596178
-0.966018 -2.839827
2.960769 3.079555
-3.275518 1.577068
0.639276 -3.412840