Stanford Algorithms Study Notes: Clustering 2


This problem follows directly from the previous one, and it is a rather interesting one.

Question 2

In this question your task is again to run the clustering algorithm from lecture, but on a MUCH bigger graph. So big, in fact, that the distances (i.e., edge costs) are only defined implicitly, rather than being provided as an explicit list.

The data set is here. The format is:
[# of nodes] [# of bits for each node's label]
[first bit of node 1] ... [last bit of node 1]
[first bit of node 2] ... [last bit of node 2]
...
For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.

The distance between two nodes u and v in this problem is defined as the Hamming distance (the number of differing bits) between the two nodes' labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?

NOTE: The graph implicitly defined by the data file is so big that you probably can't write it out explicitly, let alone sort the edges by cost. So you will have to be a little creative to complete this part of the question. For example, is there some way you can identify the smallest distances without explicitly looking at every pair of nodes?

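This problem also builds on the idea behind Kruskal's MST algorithm, except that this time the cost of each edge is not given, so the edges cannot be sorted by cost. In fact, the cost itself is defined differently here (see the Hamming distance in the problem statement).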


First, each node is stored as a string of bits, so representing each node with a BitSet is a straightforward choice.
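As a small illustration of this representation, here is a sketch that parses a label line into a BitSet and computes the Hamming distance with xor and cardinality. The class name HammingDemo and the helper names parseLabel and hammingDistance are made up for this example; they are not part of the solution further down. The two labels are the ones quoted in the problem statement.

```java
import java.util.BitSet;

class HammingDemo {
    // Parse a space-separated label such as "0 1 1 0" into a BitSet.
    static BitSet parseLabel(String line) {
        String[] tokens = line.trim().split("\\s+");
        BitSet bits = new BitSet(tokens.length);
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals("1")) {
                bits.set(i);
            }
        }
        return bits;
    }

    // Hamming distance = number of positions where the two labels differ.
    static int hammingDistance(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone(); // clone so that a is not modified
        diff.xor(b);                      // a bit stays set only where a and b differ
        return diff.cardinality();
    }

    public static void main(String[] args) {
        BitSet node2 = parseLabel("0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1");
        BitSet other = parseLabel("0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1");
        System.out.println(hammingDistance(node2, other)); // prints 3
    }
}
```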

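The question can be rephrased as: once all points at Hamming distance 0, 1, or 2 from each other have been merged into the same group, how many groups remain? Naturally, we use Union Find: union every pair of points at distance 0, 1, or 2, and the number of components left is the answer.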
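The UnionFind class itself is not listed in this post. For readers who do not have it at hand, the following is a minimal sketch of what such a structure might look like, assuming only the three operations this post relies on: a constructor taking the number of nodes, union, and count. The name UnionFindSketch and the internals (union by rank, path halving) are my own choices and may differ from the original class.

```java
class UnionFindSketch {
    private final int[] parent; // parent[i] == i means i is the root of its set
    private final int[] rank;   // upper bound on the height of each tree
    private int count;          // number of disjoint sets remaining

    UnionFindSketch(int n) {
        parent = new int[n];
        rank = new int[n];
        count = n;
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
    }

    // Return the root of the set containing x, shortening the path on the way.
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    // Merge the sets containing a and b; do nothing if they are already merged.
    void union(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) {
            return;
        }
        if (rank[ra] < rank[rb]) {
            int t = ra; ra = rb; rb = t;
        }
        parent[rb] = ra; // attach the shallower tree under the deeper one
        if (rank[ra] == rank[rb]) {
            rank[ra]++;
        }
        count--;
    }

    int count() {
        return count;
    }

    public static void main(String[] args) {
        UnionFindSketch uf = new UnionFindSketch(5);
        uf.union(0, 1);
        uf.union(1, 2);
        uf.union(0, 2); // already connected, so the count does not change
        System.out.println(uf.count()); // prints 3
    }
}
```

Because union only decrements the counter when two different sets are merged, calling it repeatedly on nodes that are already connected is harmless, which matters here since the same pair can be found more than once.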

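The remaining problem is how to quickly find the nodes at distance 0, 1, or 2. Brute-force comparison of every pair is clearly impractical: the file contains 200000 nodes with 24 bits each, which means roughly 2 × 10^10 pairs, and that takes too long. On closer thought, for each node we only need to look for nodes that match it in every bit, differ from it in exactly one bit, or differ from it in exactly two bits. That is not many candidates (1 + 24 + 276 = 301 per node). So with a hashmap, the handy cheat, we can run a "brute-force" search over just these candidates, and it does not take long.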
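The count of 301 can be checked with a short sketch that generates every label within distance 2 of a given label. NeighborDemo and withinDistanceTwo are hypothetical names used only for this illustration; the solution below flips bits inline instead of building a list.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class NeighborDemo {
    // All labels within Hamming distance 2 of the given label, including itself.
    static List<BitSet> withinDistanceTwo(BitSet label, int numberOfBits) {
        List<BitSet> result = new ArrayList<>();
        result.add((BitSet) label.clone());          // distance 0
        for (int i = 0; i < numberOfBits; i++) {
            BitSet one = (BitSet) label.clone();
            one.flip(i);
            result.add(one);                         // distance 1: bit i flipped
            for (int j = i + 1; j < numberOfBits; j++) {
                BitSet two = (BitSet) one.clone();
                two.flip(j);
                result.add(two);                     // distance 2: bits i and j flipped
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 1 + 24 + 276 candidates for a 24-bit label
        System.out.println(withinDistanceTwo(new BitSet(24), 24).size()); // prints 301
    }
}
```

Starting the inner loop at i + 1 visits each unordered pair of bit positions once, which is where the 276 = 24 × 23 / 2 comes from.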


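Build a hashmap that uses each BitSet as the key and the corresponding index (that is, the node's number in the union find structure) as the value, and the answer can be obtained quickly. The UnionFind used here is the same as in the previous post.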
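Using a BitSet as a hashmap key works because BitSet defines equals and hashCode in terms of which bits are set, so two separately constructed BitSets with the same bits hit the same entry. The sketch below shows this on a toy map; BitSetKeyDemo and buildIndex are names invented for this example. One caveat: a BitSet is mutable, so a key must not be modified after it has been put into the map, which is why a candidate is always built by cloning before flipping bits.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class BitSetKeyDemo {
    // Build a toy index holding one label (bits 1 and 2 set) mapped to node index 0.
    static Map<BitSet, Integer> buildIndex() {
        Map<BitSet, Integer> index = new HashMap<>();
        BitSet label = new BitSet(24);
        label.set(1);
        label.set(2);
        index.put(label, 0);
        return index;
    }

    public static void main(String[] args) {
        Map<BitSet, Integer> index = buildIndex();

        // A different BitSet object with the same bits finds the same entry.
        BitSet probe = new BitSet(24);
        probe.set(1);
        probe.set(2);
        System.out.println(index.get(probe)); // prints 0

        // Clone before flipping so that no key stored in the map is ever mutated.
        BitSet neighbor = (BitSet) probe.clone();
        neighbor.flip(5);
        System.out.println(index.containsKey(neighbor)); // prints false
    }
}
```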

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;

public class Clustering2 {
	private HashMap<BitSet, Integer> dataSet;
	private UnionFind uf;
	private int numberOfBits;
	private int numberOfNodes;
	private String filename = "/Users/Zhusong/Documents/Study/AlgorithmsDesignAndAnalysis/Assignments/Ass2/Ass2Prob2/clustering_big.txt";
	public static void main(String[] args) {
		Clustering2 clustering2 = new Clustering2();
		clustering2.run();
	}
	
	private void run(){
		dataSet = new HashMap<BitSet, Integer>();
		build();
		calDis1();
		calDis2();
		System.out.println(uf.count());
	}
	
	/**
	 * find nodes that have a distance of 2
	 */
	private void calDis2(){
		Set<BitSet> keySet = dataSet.keySet();
		Iterator<BitSet> iterator = keySet.iterator();
		while (iterator.hasNext()) {
			BitSet bitSet = (BitSet) iterator.next();
			for (int i = 0; i < numberOfBits; i++) {
				// visit each unordered pair of bit positions (i, j) exactly once
				for (int j = i + 1; j < numberOfBits; j++) {
					BitSet temp = (BitSet) bitSet.clone();
					temp.flip(i);
					temp.flip(j);
					if (dataSet.containsKey(temp)) {
						uf.union(dataSet.get(bitSet), dataSet.get(temp));
					}
				}
			}
		}
	}
	
	/**
	 * find nodes that have a distance of 1
	 */
	private void calDis1(){
		Set<BitSet> keySet = dataSet.keySet();
		Iterator<BitSet> iterator = keySet.iterator();
		while (iterator.hasNext()) {
			BitSet bitSet = (BitSet) iterator.next();
			for (int i = 0; i < numberOfBits; i++) {
				BitSet temp = (BitSet) bitSet.clone();
				temp.flip(i);
				if (dataSet.containsKey(temp)) {
					uf.union(dataSet.get(bitSet), dataSet.get(temp));
				}
			}
		}
	}
	
	/**
	 * 1. read in the text file
	 * 2. create a union find structure
	 * 3. build the hash map and union the nodes that are at distance 0 (duplicates)
	 */
	private void build(){
		File file = new File(filename);
		try {
			BufferedReader rd = new BufferedReader(new FileReader(file));
			String line = rd.readLine();
			StringTokenizer tokenizer = new StringTokenizer(line);
			numberOfNodes = Integer.parseInt(tokenizer.nextToken());
			uf = new UnionFind(numberOfNodes);
			numberOfBits = Integer.parseInt(tokenizer.nextToken());
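			// index of the current node in the union find structure
			int index = 0;
			while ((line = rd.readLine()) != null) {
				// skip blank lines (e.g. a trailing empty line at the end of the file)
				if (line.trim().isEmpty()) {
					continue;
				}
				tokenizer = new StringTokenizer(line);
				BitSet bitSet = new BitSet(numberOfBits);
				// create a bit set representing the current node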
				for (int i = 0; i < numberOfBits; i++) {
					if (tokenizer.nextToken().equals("1")) {
						bitSet.set(i);
					}
				}
				//put it in the hash map if no identical nodes are already in there
				if (!dataSet.containsKey(bitSet)) {
					dataSet.put(bitSet, index);
				}
				//union the two nodes if the current node has a duplicate in the hash map
				else {
					uf.union(index, dataSet.get(bitSet));										
				}
				index++;		
			}
			rd.close();
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

