Stanford Algorithms Study Notes: Clustering 2

likecool21

Categories: Java, Algorithms, Data Structures
Tags: clustering, Hashmap, algorithms, Union-Find, JAVA

Article link: https://blog.csdn.net/likecool21/article/details/11859621

This problem follows directly from the previous one, and it is a rather interesting one.

Question 2

In this question your task is again to run the clustering algorithm from lecture, but on a MUCH bigger graph. So big, in fact, that the distances (i.e., edge costs) are only defined implicitly, rather than being provided as an explicit list.

The data set is here. The format is:
[# of nodes] [# of bits for each node's label]
[first bit of node 1] ... [last bit of node 1]
[first bit of node 2] ... [last bit of node 2]
...
For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.

The distance between two nodes u and v in this problem is defined as the Hamming distance--- the number of differing bits --- between the two nodes' labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?

NOTE: The graph implicitly defined by the data file is so big that you probably can't write it out explicitly, let alone sort the edges by cost. So you will have to be a little creative to complete this part of the question. For example, is there some way you can identify the smallest distances without explicitly looking at every pair of nodes?

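This problem again relies on the idea behind Kruskal's MST algorithm. This time, however, the cost of each edge is not given, so the edges cannot be sorted by cost. In fact, the cost itself is defined differently here (see the Hamming distance in the problem statement).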

First of all, the points are stored as binary labels, so representing each node with a BitSet is a very direct choice.
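As a small illustration of the BitSet representation, the sketch below parses a label line into a BitSet and computes the Hamming distance as the number of set bits in the XOR of two labels. The class and method names (`HammingDemo`, `parseLabel`, `hammingDistance`) are hypothetical and are not part of the solution in this post; the two labels are the ones quoted in the problem statement.

```java
import java.util.BitSet;
import java.util.StringTokenizer;

// Hypothetical helper class, for illustration only.
class HammingDemo {
	// Parse a line such as "0 1 1 0" into a BitSet: bit i is set when the i-th token is "1".
	static BitSet parseLabel(String line) {
		StringTokenizer tokenizer = new StringTokenizer(line);
		BitSet label = new BitSet();
		for (int i = 0; tokenizer.hasMoreTokens(); i++) {
			if (tokenizer.nextToken().equals("1")) {
				label.set(i);
			}
		}
		return label;
	}

	// Hamming distance = number of set bits in (a XOR b).
	static int hammingDistance(BitSet a, BitSet b) {
		BitSet diff = (BitSet) a.clone(); // clone so that a is left unmodified
		diff.xor(b);
		return diff.cardinality();
	}

	public static void main(String[] args) {
		BitSet node2 = parseLabel("0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1");
		BitSet other = parseLabel("0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1");
		// The problem statement gives 3 for this pair.
		System.out.println(hammingDistance(node2, other));
	}
}
```

BitSet implements `equals` and `hashCode` based on its set bits, which is what makes it usable as a hash map key later on.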

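The problem can be restated as: after merging all points whose Hamming distance is 0, 1, or 2, how many clusters are left? A spacing of at least 3 means that any two points at distance 2 or less must end up in the same cluster. So, quite naturally, we use Union Find: union every pair of points at distance 0, 1, or 2, and the number of components left in the structure is the answer.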
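The UnionFind class itself is not reproduced in this post. As a rough sketch only, a structure with the interface the code in this post relies on (a constructor taking the number of nodes, `union(int, int)`, and `count()`) could look like the following. The name `UnionFindSketch` and the use of union by rank with path halving are assumptions, not necessarily what the previous post does.

```java
// Hypothetical stand-in for the UnionFind class from the previous post.
class UnionFindSketch {
	private final int[] parent;
	private final int[] rank;
	private int count; // number of disjoint sets currently in the structure

	UnionFindSketch(int n) {
		parent = new int[n];
		rank = new int[n];
		count = n;
		for (int i = 0; i < n; i++) {
			parent[i] = i; // every node starts as its own root
		}
	}

	// Return the root of p, halving the path along the way.
	int find(int p) {
		while (p != parent[p]) {
			parent[p] = parent[parent[p]];
			p = parent[p];
		}
		return p;
	}

	// Merge the sets containing p and q; a no-op if they are already merged.
	void union(int p, int q) {
		int rootP = find(p);
		int rootQ = find(q);
		if (rootP == rootQ) {
			return;
		}
		if (rank[rootP] < rank[rootQ]) {
			parent[rootP] = rootQ;
		} else if (rank[rootP] > rank[rootQ]) {
			parent[rootQ] = rootP;
		} else {
			parent[rootQ] = rootP;
			rank[rootP]++;
		}
		count--; // one fewer set after a successful merge
	}

	int count() {
		return count;
	}
}
```

The important detail for this problem is that `count()` only decreases when two previously separate sets are merged, so repeated unions of the same pair do not distort the answer.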

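The remaining question is how to quickly find the points at distance 0, 1, or 2. A brute-force search over every pair is clearly impractical: the file contains 200000 points with 24 bits each, which means roughly 2 × 10^10 pairs, and that takes far too long. On closer thought, for each node we only need to look for nodes that match it in every bit, differ from it in exactly one bit, or differ from it in exactly two bits. That is not many possibilities (1 + 24 + 276 = 301, where 276 is the number of ways to choose 2 of the 24 bits). So with the help of the "cheat weapon" hashmap, a "brute-force" search becomes feasible and does not take long.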
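The count of 301 can be checked with a short sketch that enumerates every label within Hamming distance 2 of a given label. The class and method names here (`NeighborDemo`, `candidates`) are hypothetical; the solution in this post does not materialize such a list, it looks each candidate up in the hash map directly.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper class, for illustration only.
class NeighborDemo {
	// All labels within Hamming distance 2 of label, including label itself.
	static List<BitSet> candidates(BitSet label, int numberOfBits) {
		List<BitSet> result = new ArrayList<>();
		result.add((BitSet) label.clone()); // distance 0
		for (int i = 0; i < numberOfBits; i++) {
			BitSet one = (BitSet) label.clone();
			one.flip(i); // distance 1: bit i flipped
			result.add(one);
			for (int j = i + 1; j < numberOfBits; j++) {
				BitSet two = (BitSet) one.clone();
				two.flip(j); // distance 2: bits i and j flipped, with i < j
				result.add(two);
			}
		}
		return result;
	}

	public static void main(String[] args) {
		// 1 + 24 + 276 = 301 candidates for a 24-bit label
		System.out.println(candidates(new BitSet(24), 24).size());
	}
}
```

With 200000 nodes this amounts to about 6 × 10^7 hash lookups in total, compared with about 2 × 10^10 pairwise comparisons.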

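Build a hashmap that uses each BitSet as the key and the corresponding index (that is, the node's number in the union find structure) as the value, and the answer can be obtained quickly. The UnionFind used here is the same as the one in the previous post.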

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;

public class Clustering2 {
	private HashMap<BitSet, Integer> dataSet;
	private UnionFind uf;
	private int numberOfBits;
	private int numberOfNodes;
	private String filename = "/Users/Zhusong/Documents/Study/AlgorithmsDesignAndAnalysis/Assignments/Ass2/Ass2Prob2/clustering_big.txt";
	public static void main(String[] args) {
		Clustering2 clustering2 = new Clustering2();
		clustering2.run();
	}
	
	private void run(){
		dataSet = new HashMap<BitSet, Integer>();
		build();
		calDis1();
		calDis2();
		System.out.println(uf.count());
	}
	
	/**
	 * find nodes that have a distance of 2
	 */
	private void calDis2(){
		Set<BitSet> keySet = dataSet.keySet();
		Iterator<BitSet> iterator = keySet.iterator();
		while (iterator.hasNext()) {
			BitSet bitSet = (BitSet) iterator.next();
			// visit every unordered pair of distinct bit positions (i, j) exactly once
			for (int i = 0; i < numberOfBits; i++) {
				for (int j = i + 1; j < numberOfBits; j++) {
					BitSet temp = (BitSet) bitSet.clone();
					temp.flip(i);
					temp.flip(j);
					if (dataSet.containsKey(temp)) {
						uf.union(dataSet.get(bitSet), dataSet.get(temp));
					}
				}
			}
		}
	}
	
	/**
	 * find nodes that have a distance of 1
	 */
	private void calDis1(){
		Set<BitSet> keySet = dataSet.keySet();
		Iterator<BitSet> iterator = keySet.iterator();
		while (iterator.hasNext()) {
			BitSet bitSet = (BitSet) iterator.next();
			for (int i = 0; i < numberOfBits; i++) {
				BitSet temp = (BitSet) bitSet.clone();
				temp.flip(i);
				if (dataSet.containsKey(temp)) {
					uf.union(dataSet.get(bitSet), dataSet.get(temp));
				}
			}
		}
	}
	
	/**
	 * 1. read in the text file
	 * 2. create a union find structure
	 * 3. build the hash map and union the nodes that have 0 distances
	 */
	private void build(){
		File file = new File(filename);
		// try-with-resources closes the reader even if parsing fails
		try (BufferedReader rd = new BufferedReader(new FileReader(file))) {
			String line = rd.readLine();
			StringTokenizer tokenizer = new StringTokenizer(line);
			numberOfNodes = Integer.parseInt(tokenizer.nextToken());
			uf = new UnionFind(numberOfNodes);
			numberOfBits = Integer.parseInt(tokenizer.nextToken());
			// index of the current node in the union find structure
			int index = 0;
			while ((line = rd.readLine()) != null) {
				tokenizer = new StringTokenizer(line);
				// skip blank lines (e.g. a trailing empty line at the end of the file)
				if (!tokenizer.hasMoreTokens()) {
					continue;
				}
				BitSet bitSet = new BitSet(numberOfBits);
				// create a bit set representing the current node
				for (int i = 0; i < numberOfBits; i++) {
					if (tokenizer.nextToken().equals("1")) {
						bitSet.set(i);
					}
				}
				// put it in the hash map if no identical node is already in there
				if (!dataSet.containsKey(bitSet)) {
					dataSet.put(bitSet, index);
				}
				// otherwise union the current node with its duplicate in the hash map
				else {
					uf.union(index, dataSet.get(bitSet));
				}
				index++;
			}
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}
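
To sanity-check the overall approach without the 200000-node data file, here is a self-contained toy version that can be verified by hand. It is a sketch and not the code above: labels are packed into an `int` instead of a BitSet (which assumes fewer than 31 bits per label), the union find is reduced to a bare parent array, and all names (`ToyClustering`, `countClusters`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy version of the approach, for illustration only.
class ToyClustering {
	// Return the root of p, halving the path along the way.
	static int find(int[] parent, int p) {
		while (p != parent[p]) {
			parent[p] = parent[parent[p]];
			p = parent[p];
		}
		return p;
	}

	// Merge the sets of p and q; returns true only if they were separate before.
	static boolean union(int[] parent, int p, int q) {
		int rootP = find(parent, p);
		int rootQ = find(parent, q);
		if (rootP == rootQ) return false;
		parent[rootP] = rootQ;
		return true;
	}

	// Number of clusters left once all labels at Hamming distance <= 2 are merged.
	// Assumes numberOfBits < 31 so that each label fits in a non-negative int.
	static int countClusters(int[] labels, int numberOfBits) {
		int n = labels.length;
		int[] parent = new int[n];
		for (int i = 0; i < n; i++) parent[i] = i;
		int clusters = n;
		Map<Integer, Integer> indexOf = new HashMap<>();
		// distance 0: merge duplicates while building the label -> index map
		for (int i = 0; i < n; i++) {
			Integer seen = indexOf.get(labels[i]);
			if (seen == null) indexOf.put(labels[i], i);
			else if (union(parent, i, seen)) clusters--;
		}
		// distance 1 and 2: flip one bit, then a second higher bit, and look the result up
		for (int i = 0; i < n; i++) {
			for (int a = 0; a < numberOfBits; a++) {
				int one = labels[i] ^ (1 << a);
				Integer hit = indexOf.get(one);
				if (hit != null && union(parent, i, hit)) clusters--;
				for (int b = a + 1; b < numberOfBits; b++) {
					hit = indexOf.get(one ^ (1 << b));
					if (hit != null && union(parent, i, hit)) clusters--;
				}
			}
		}
		return clusters;
	}

	public static void main(String[] args) {
		// 6-bit labels: {000000, 000001, 000011}, {111000, 111000}, {111111}
		int[] labels = {0b000000, 0b000001, 0b000011, 0b111000, 0b111000, 0b111111};
		System.out.println(countClusters(labels, 6)); // prints 3
	}
}
```

In the toy input the first three labels are pairwise within distance 2, the two copies of 111000 are duplicates, and 111111 is at distance 3 or more from everything else, so three clusters remain.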
