Week 1 Assignment - Wordnet - Princeton - Algorithms Part II

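Preface

I had planned to finish what was on my plate by April 2 before coming back to write up the Programming Assignments for Princeton's "Algorithms, Part II", but when I opened my blog today I found a reader nudging me for the next post. I happen to be in a great mood today, so here is the write-up for Week 1. I hope it helps.

As for Week 2, that really will have to wait until I am done on April 2.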

Problem

WordNet is a semantic lexicon for the English language that is used extensively by computational linguists and cognitive scientists; for example, it was a key component in IBM's Watson. WordNet groups words into sets of synonyms called synsets and describes semantic relationships between them. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, locomotion is a hypernym of running and running is a hypernym of dash.

The WordNet digraph. Your first task is to build the wordnet digraph: each vertex v is an integer that represents a synset, and each directed edge v→w represents that w is a hypernym of v. The wordnet digraph is a rooted DAG: it is acyclic and has one vertex that is an ancestor of every other vertex. However, it is not necessarily a tree because a synset can have more than one hypernym. A small subgraph of the wordnet digraph is illustrated below.

The WordNet input file formats. We now describe the two data files that you will use to create the wordnet digraph. The files are in CSV format: each line contains a sequence of fields, separated by commas.

  • List of noun synsets. The file synsets.txt lists all the (noun) synsets in WordNet. The first field is the synset id (an integer), the second field is the synonym set (or synset), and the third field is its dictionary definition (or gloss). For example, the line

    36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire  
    
    means that the synset { AND_circuit, AND_gate } has an id number of 36 and its gloss is "a circuit in a computer that fires only when all of its inputs fire". The individual nouns that comprise a synset are separated by spaces (and a synset element is not permitted to contain a space). The S synset ids are numbered 0 through S − 1; the id numbers will appear consecutively in the synset file.
  • List of hypernyms. The file hypernyms.txt contains the hypernym relationships: the first field is a synset id; subsequent fields are the id numbers of the synset's hypernyms. For example, the following line
    164,21012,56099
    

    means that the synset 164 ("Actifed") has two hypernyms: 21012 ("antihistamine") and 56099 ("nasal_decongestant"), representing that Actifed is both an antihistamine and a nasal decongestant. The synsets are obtained from the corresponding lines in the file synsets.txt.

    164,Actifed,trade name for a drug containing an antihistamine and a decongestant...
    21012,antihistamine,a medicine used to treat allergies...
    56099,nasal_decongestant,a decongestant that provides temporary relief of nasal...
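
The two line formats above can be split with plain string operations. The following is only a minimal sketch, assuming well-formed lines; the class name WordNetLineParser and its methods are hypothetical helpers, not part of the assignment API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the assignment API.
class WordNetLineParser {
    // Returns the synset id stored in the first CSV field.
    static int synsetId(String line) {
        return Integer.parseInt(line.split(",", 3)[0].trim());
    }

    // Returns the nouns of a synsets.txt line (second field, space-separated).
    static List<String> synsetNouns(String line) {
        // A limit of 3 keeps commas inside the gloss from producing extra fields.
        String[] fields = line.split(",", 3);
        return Arrays.asList(fields[1].split(" "));
    }

    // Returns the hypernym ids of a hypernyms.txt line (every field after the first).
    static List<Integer> hypernymIds(String line) {
        String[] fields = line.split(",");
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            ids.add(Integer.parseInt(fields[i].trim()));
        }
        return ids;
    }

    public static void main(String[] args) {
        String synset = "36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire";
        System.out.println(synsetId(synset) + " " + synsetNouns(synset));
        System.out.println(hypernymIds("164,21012,56099"));
    }
}
```

A line that carries only an id and no hypernyms yields an empty list here, which is how a vertex without outgoing edges would show up.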
    

WordNet data type.Implement an immutable data type WordNet with the following API:

// constructor takes the name of the two input files
public WordNet(String synsets, String hypernyms)

// the set of nouns (no duplicates), returned as an Iterable
public Iterable<String> nouns()

// is the word a WordNet noun?
public boolean isNoun(String word)

// distance between nounA and nounB (defined below)
public int distance(String nounA, String nounB)

// a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB
// in a shortest ancestral path (defined below)
public String sap(String nounA, String nounB)

// for unit testing of this class
public static void main(String[] args)
The constructor should throw a java.lang.IllegalArgumentException if the input does not correspond to a rooted DAG. The distance() and sap() methods should throw a java.lang.IllegalArgumentException unless both of the noun arguments are WordNet nouns.

Your data type should use space linear in the input size (size of synsets and hypernyms files). The constructor should take time linearithmic (or better) in the input size. The method isNoun() should run in time logarithmic (or better) in the number of nouns. The methods distance() and sap() should run in time linear in the size of the WordNet digraph.

Shortest ancestral path. An ancestral path between two vertices v and w in a digraph is a directed path from v to a common ancestor x, together with a directed path from w to the same ancestor x. A shortest ancestral path is an ancestral path of minimum total length. For example, in the digraph at left (digraph1.txt), the shortest ancestral path between 3 and 11 has length 4 (with common ancestor 1). In the digraph at right (digraph2.txt), one ancestral path between 1 and 5 has length 4 (with common ancestor 5), but the shortest ancestral path has length 2 (with common ancestor 0).


SAP data type.Implement an immutable data type SAP with the following API:

// constructor takes a digraph (not necessarily a DAG)
public SAP(Digraph G)

// length of shortest ancestral path between v and w; -1 if no such path
public int length(int v, int w)

// a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int v, int w)

// length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
public int length(Iterable<Integer> v, Iterable<Integer> w)

// a common ancestor that participates in shortest ancestral path; -1 if no such path
public int ancestor(Iterable<Integer> v, Iterable<Integer> w)

// for unit testing of this class (such as the one below)
public static void main(String[] args)
All methods should throw a java.lang.IndexOutOfBoundsException if one (or more) of the input arguments is not between 0 and G.V() - 1. You may assume that the iterable arguments contain at least one integer. All methods (and the constructor) should take time at most proportional to E + V in the worst case, where E and V are the number of edges and vertices in the digraph, respectively. Your data type should use space proportional to E + V.

Test client. The following test client takes the name of a digraph input file as a command-line argument, constructs the digraph, reads in vertex pairs from standard input, and prints out the length of the shortest ancestral path between the two vertices and a common ancestor that participates in that path:

public static void main(String[] args) {
    In in = new In(args[0]);
    Digraph G = new Digraph(in);
    SAP sap = new SAP(G);
    while (!StdIn.isEmpty()) {
        int v = StdIn.readInt();
        int w = StdIn.readInt();
        int length   = sap.length(v, w);
        int ancestor = sap.ancestor(v, w);
        StdOut.printf("length = %d, ancestor = %d\n", length, ancestor);
    }
}
Here is a sample execution:
% more digraph1.txt             % java SAP digraph1.txt
13                              3 11
11                              length = 4, ancestor = 1
 7  3                            
 8  3                           9 12
 3  1                           length = 3, ancestor = 5
 4  1
 5  1                           7 2
 9  5                           length = 4, ancestor = 0
10  5
11 10                           1 6
12 10                           length = -1, ancestor = -1
 1  0
 2  0

Measuring the semantic relatedness of two nouns. Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, most of us agree that George Bush and John Kennedy (two U.S. presidents) are more related than are George Bush and chimpanzee (two primates). However, not most of us agree that George Bush and Eric Arthur Blair are related concepts. But if one is aware that George Bush and Eric Arthur Blair (aka George Orwell) are both communicators, then it becomes clear that the two concepts might be related.

We define the semantic relatedness of two wordnet nouns A and B as follows:

  • distance(A, B) = the minimum length of any ancestral path between any synset v of A and any synset w of B.

This is the notion of distance that you will use to implement the distance() and sap() methods in the WordNet data type.

Outcast detection. Given a list of wordnet nouns $A_1, A_2, \ldots, A_n$, which noun is the least related to the others? To identify an outcast, compute the sum of the distances between each noun and every other one:

$d_i = \mathrm{dist}(A_i, A_1) + \mathrm{dist}(A_i, A_2) + \cdots + \mathrm{dist}(A_i, A_n)$

and return a noun $A_t$ for which $d_t$ is maximum.

Implement an immutable data type Outcast with the following API:

// constructor takes a WordNet object
public Outcast(WordNet wordnet)

// given an array of WordNet nouns, return an outcast
public String outcast(String[] nouns)

// for unit testing of this class (such as the one below)
public static void main(String[] args)
Assume that the argument array to the outcast() method contains only valid wordnet nouns (and that it contains at least two such nouns).

The following test client takes from the command line the name of a synset file, the name of a hypernym file, followed by the names of outcast files, and prints out an outcast in each file:

public static void main(String[] args) {
    WordNet wordnet = new WordNet(args[0], args[1]);
    Outcast outcast = new Outcast(wordnet);
    for (int t = 2; t < args.length; t++) {
        In in = new In(args[t]);
        String[] nouns = in.readAllStrings();
        StdOut.println(args[t] + ": " + outcast.outcast(nouns));
    }
}
Here is a sample execution:
% more outcast5.txt
horse zebra cat bear table

% more outcast8.txt
water soda bed orange_juice milk apple_juice tea coffee

% more outcast11.txt
apple pear peach banana lime lemon blueberry strawberry mango watermelon potato


% java Outcast synsets.txt hypernyms.txt outcast5.txt outcast8.txt outcast11.txt
outcast5.txt: table
outcast8.txt: bed
outcast11.txt: potato

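Analysis

First of all, this problem is an application of directed graphs, so the natural basic tool is the Digraph class provided in algs4.jar. Since the problem calls for breadth-first search, we will also use the digraph version of BreadthFirstPaths (named BreadthFirstDirectedPaths in algs4). A later step needs a third digraph class, which we will get to below.

How should we get started? I think the official checklist from Princeton lays out a good approach; see Possible Process Steps at http://coursera.cs.princeton.edu/algs4/checklists/wordnet.html. I reproduce the official steps here for reference, and I also followed this process step by step in my own implementation.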

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  • Download wordnet-testing.zip. It contains sample input files for testing.
  • Create the data type SAP. This part of the assignment involves only graph algorithms (and you don't need to know anything about WordNet nouns, synsets, or hypernyms). First, think carefully about designing a correct and efficient algorithm for computing the shortest ancestral path. Ask in the Discussion Forums if you're unsure. In addition to the digraph*.txt files, design small DAGs to test and debug your code.
  • Read in and parse the files described in the assignment, synsets.txt and hypernyms.txt. Don't worry about storing the data in any data structures yet. Test that you are parsing the input correctly before proceeding.
  • Create a data type WordNet. Divide the constructor into two subtasks.
    • Read in the synsets.txt file and build appropriate data structures. You shouldn't need to design any data structures here, but choosing how to represent the data for efficient access is important. Think about what operations you need to support.
    • Read in the hypernyms.txt file and build a Digraph.

    If you read in synsets.txt first, you can identify the largest id before constructing the Digraph. Check that it is 82,191 but do not hardwire this number into your program because your program must work with any valid input file.

  • Create the client Outcast. This is probably the easiest of the three components.

Following the suggested steps, let us implement them one by one.

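DeluxeBFS.java

The way to find the common ancestor of two vertices is fairly obvious: list all ancestors of each of the two given vertices, find the ancestors they share, and among those pick the one whose total distance to the two given vertices is smallest.

So the first question is: which vertices are ancestors of a given vertex? Look at a property of the WordNet graph: it is effectively one-directional, in that an upper-level vertex can never have an edge pointing to a lower-level vertex (that would create a cycle and make the WordNet invalid). Therefore, if we run a breadth-first search from the given vertex along the directed edges, then (1) every vertex the search reaches is an ancestor of the given vertex, and (2) no ancestor of the given vertex is left unreached. So we need to modify the breadth-first search slightly. The change is simple: add a method that returns the set of vertices reached by the search. That set is exactly the boolean[] marked array of the breadth-first search!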
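
To make the idea concrete, here is a small sketch that runs a breadth-first search on the digraph1.txt edges from the sample execution above and lists the ancestors of vertex 11 with their distances. It uses plain java.util collections rather than algs4 so that it runs on its own; the class name AncestorBfs and its methods are made up for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch using java.util collections instead of the algs4 classes.
class AncestorBfs {
    // Builds adjacency lists for a digraph with n vertices from {v, w} pairs (edge v -> w).
    static List<List<Integer>> digraph(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) adj.get(e[0]).add(e[1]);
        return adj;
    }

    // Returns dist[x] = number of edges on a shortest path source -> x, or -1 if x is unreachable.
    static int[] distances(List<List<Integer>> adj, int source) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);
        Queue<Integer> queue = new ArrayDeque<>();
        dist[source] = 0;
        queue.add(source);
        while (!queue.isEmpty()) {
            int v = queue.remove();
            for (int w : adj.get(v)) {
                if (dist[w] == -1) {      // first visit gives the shortest distance
                    dist[w] = dist[v] + 1;
                    queue.add(w);
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        // Edges of digraph1.txt, copied from the sample execution in the problem statement.
        int[][] edges = {{7, 3}, {8, 3}, {3, 1}, {4, 1}, {5, 1}, {9, 5},
                         {10, 5}, {11, 10}, {12, 10}, {1, 0}, {2, 0}};
        int[] dist = distances(digraph(13, edges), 11);
        for (int x = 0; x < dist.length; x++) {
            if (dist[x] >= 0) System.out.println("ancestor " + x + " at distance " + dist[x]);
        }
    }
}
```

In this sketch a distance of -1 plays the role of an unmarked vertex, so the reached set and the distances live in one array.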

Here is the modified DeluxeBFS.java; the only change is the added getMarked method.

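// Imports for the packaged algs4.jar (edu.princeton.cs.algs4).
// Older default-package builds of the jar do not need them.
import edu.princeton.cs.algs4.Digraph;
import edu.princeton.cs.algs4.Queue;
import edu.princeton.cs.algs4.Stack;

/**
 *  A modified BFS for solving the WordNet puzzle.
 *
 *  @author Weiran Liu
 *  @author Robert Sedgewick
 *  @author Kevin Wayne
 */

public class DeluxeBFS {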
    private static final int INFINITY = Integer.MAX_VALUE;
    private boolean[] marked;  // marked[v] = is there an s->v path?
    private int[] edgeTo;      // edgeTo[v] = last edge on shortest s->v path
    private int[] distTo;      // distTo[v] = length of shortest s->v path

    /**
     * Computes the shortest path from <tt>s</tt> and every other vertex in graph <tt>G</tt>.
     * @param G the digraph
     * @param s the source vertex
     */
    public DeluxeBFS(Digraph G, int s) {
        marked = new boolean[G.V()];
        distTo = new int[G.V()];
        edgeTo = new int[G.V()];
        for (int v = 0; v < G.V(); v++) distTo[v] = INFINITY;
        bfs(G, s);
    }

    /**
     * Computes the shortest path from any one of the source vertices in <tt>sources</tt>
     * to every other vertex in graph <tt>G</tt>.
     * @param G the digraph
     * @param sources the source vertices
     */
    public DeluxeBFS(Digraph G, Iterable<Integer> sources) {
        marked = new boolean[G.V()];
        distTo = new int[G.V()];
        edgeTo = new int[G.V()];
        for (int v = 0; v < G.V(); v++) distTo[v] = INFINITY;
        bfs(G, sources);
    }

    // BFS from single source
    private void bfs(Digraph G, int s) {
        Queue<Integer> q = new Queue<Integer>();
        marked[s] = true;
        distTo[s] = 0;
        q.enqueue(s);
        while (!q.isEmpty()) {
            int v = q.dequeue();
            for (int w : G.adj(v)) {
                if (!marked[w]) {
                    edgeTo[w] = v;
                    distTo[w] = distTo[v] + 1;
                    marked[w] = true;
                    q.enqueue(w);
                }
            }
        }
    }

    // BFS from multiple sources
    private void bfs(Digraph G, Iterable<Integer> sources) {
        Queue<Integer> q = new Queue<Integer>();
        for (int s : sources) {
            marked[s] = true;
            distTo[s] = 0;
            q.enqueue(s);
        }
        while (!q.isEmpty()) {
            int v = q.dequeue();
            for (int w : G.adj(v)) {
                if (!marked[w]) {
                    edgeTo[w] = v;
                    distTo[w] = distTo[v] + 1;
                    marked[w] = true;
                    q.enqueue(w);
                }
            }
        }
    }

    /**
     * Is there a directed path from the source <tt>s</tt> (or sources) to vertex <tt>v</tt>?
     * @param v the vertex
     * @return <tt>true</tt> if there is a directed path, <tt>false</tt> otherwise
     */
    public boolean hasPathTo(int v) {
        return marked[v];
    }
    
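    /**
     * Returns the marked array: marked[v] is true if and only if v is reachable
     * from the source (or sources). The internal array is returned without copying,
     * so callers must treat it as read-only.
     * @return the marked array
     */
    public boolean[] getMarked(){
    	return this.marked;
    }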

    /**
     * Returns the number of edges in a shortest path from the source <tt>s</tt>
     * (or sources) to vertex <tt>v</tt>?
     * @param v the vertex
     * @return the number of edges in a shortest path
     */
    public int distTo(int v) {
        return distTo[v];
    }

    /**
     * Returns a shortest path from <tt>s</tt> (or sources) to <tt>v</tt>, or
     * <tt>null</tt> if no such path.
     * @param v the vertex
     * @return the sequence of vertices on a shortest path, as an Iterable
     */
    public Iterable<Integer> pathTo(int v) {
        if (!hasPathTo(v)) return null;
        Stack<Integer> path = new Stack<Integer>();
        int x;
        for (x = v; distTo[x] != 0; x = edgeTo[x])
            path.push(x);
        path.push(x);
        return path;
    }
}

SAP.java

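The algorithm in SAP is then very simple: run a breadth-first search from each of the two given vertices v and w, and take the boolean[] marked array from each search. Then, for every vertex marked in both searches, compute the sum of its distances to v and to w; the smallest sum is the answer, and the vertex that achieves it is the ancestor.

How, then, do we find the shortest ancestor for a set of vertices v (Iterable<Integer> v) and a set of vertices w (Iterable<Integer> w)? Simple: compute the shortest length and ancestor for each pair of vertices from v and w in turn, compare the results, and return the smallest length together with its ancestor. Note that this runs one pair of breadth-first searches for every (v, w) pair, so its cost grows with the product of the two set sizes instead of staying proportional to E + V. Also note that there are four functions to implement: public int length(int v, int w), public int ancestor(int v, int w), public int length(Iterable<Integer> v, Iterable<Integer> w), and public int ancestor(Iterable<Integer> v, Iterable<Integer> w). Their implementations are very similar, so to avoid duplicated code we add two private functions, both named shortest, that hold the shared code. Each returns two ints: the first is shortestLength and the second is shortestAncestor.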

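// Import for the packaged algs4.jar; older default-package builds do not need it.
import edu.princeton.cs.algs4.Digraph;

public class SAP {
	private final Digraph G;   // defensive copy keeps SAP immutable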
	
	// constructor takes a digraph (not necessarily a DAG)
	public SAP(Digraph G){
		this.G = new Digraph(G);
	}

	// length of shortest ancestral path between v and w; -1 if no such path
	public int length(int v, int w){
		int[] result = shortest(v, w);
		return result[0];
	}

	// a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
	public int ancestor(int v, int w){
		int[] result = shortest(v, w);
		return result[1];
	}

	// length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
	public int length(Iterable<Integer> v, Iterable<Integer> w){
		int[] result = shortest(v, w);
		return result[0];
	}

	// a common ancestor that participates in shortest ancestral path; -1 if no such path
	public int ancestor(Iterable<Integer> v, Iterable<Integer> w){
		int[] result = shortest(v, w);
		return result[1];
	}
	
	private int[] shortest(int v, int w){
		int[] result = new int[2];
		DeluxeBFS vDeluexBFS = new DeluxeBFS(G, v);
		DeluxeBFS wDeluexBFS = new DeluxeBFS(G, w);
		boolean[] vMarked = vDeluexBFS.getMarked();
		boolean[] wMarked = wDeluexBFS.getMarked();
		int shortestLength = Integer.MAX_VALUE;
		int tempLength = Integer.MAX_VALUE;
		int shortestAncestor = Integer.MAX_VALUE;
		for (int i=0; i<vMarked.length; i++){
			if (vMarked[i] && wMarked[i]){
				tempLength = vDeluexBFS.distTo(i) + wDeluexBFS.distTo(i);
				if (tempLength < shortestLength){
					shortestLength = tempLength;
					shortestAncestor = i;
				}
			}
		}
		if (shortestLength == Integer.MAX_VALUE){
			result[0] = -1;
			result[1] = -1;
			return result;
		}
		result[0] = shortestLength;
		result[1] = shortestAncestor;
		return result;			
	}
	
	private int[] shortest(Iterable<Integer> v, Iterable<Integer> w){
		int shortestAncestor = Integer.MAX_VALUE;
		int shortestLength = Integer.MAX_VALUE;
		int[] result = new int[2];
		for (int vNode : v){
			for (int wNode : w){
				int[] tempResult = shortest(vNode, wNode);
				if (tempResult[0] != -1 && tempResult[0] < shortestLength){
					shortestLength = tempResult[0];
					shortestAncestor = tempResult[1];
				}
			}
		}
		if (shortestLength == Integer.MAX_VALUE){
			result[0] = -1;
			result[1] = -1;
			return result;
		}
		result[0] = shortestLength;
		result[1] = shortestAncestor;
		return result;
	}
}
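
The set versions of length() and ancestor() above run one pair of searches per vertex pair. Since DeluxeBFS already has a multiple-source constructor, one possible alternative is to run a single multi-source BFS per set and then scan the vertices once. The sketch below illustrates that idea with java.util collections so it runs standalone; it is not the code I submitted, and the class name MultiSourceSap and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: shortest ancestral path between two vertex sets with two BFS runs.
class MultiSourceSap {
    // Builds adjacency lists for a digraph with n vertices from {v, w} pairs (edge v -> w).
    static List<List<Integer>> digraph(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) adj.get(e[0]).add(e[1]);
        return adj;
    }

    // Multi-source BFS: dist[x] = fewest edges from any source to x, or -1 if unreachable.
    static int[] distances(List<List<Integer>> adj, Iterable<Integer> sources) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);
        Queue<Integer> queue = new ArrayDeque<>();
        for (int s : sources) {
            if (dist[s] == -1) {          // skip duplicate sources
                dist[s] = 0;
                queue.add(s);
            }
        }
        while (!queue.isEmpty()) {
            int v = queue.remove();
            for (int w : adj.get(v)) {
                if (dist[w] == -1) {
                    dist[w] = dist[v] + 1;
                    queue.add(w);
                }
            }
        }
        return dist;
    }

    // Returns {length, ancestor} of a shortest ancestral path between the sets, or {-1, -1}.
    static int[] shortest(List<List<Integer>> adj, Iterable<Integer> v, Iterable<Integer> w) {
        int[] fromV = distances(adj, v);
        int[] fromW = distances(adj, w);
        int bestLength = -1;
        int bestAncestor = -1;
        for (int x = 0; x < adj.size(); x++) {
            if (fromV[x] < 0 || fromW[x] < 0) continue;   // x is not a common ancestor
            int length = fromV[x] + fromW[x];
            if (bestLength == -1 || length < bestLength) {
                bestLength = length;
                bestAncestor = x;
            }
        }
        return new int[] {bestLength, bestAncestor};
    }

    public static void main(String[] args) {
        // Edges of digraph1.txt from the sample execution in the problem statement.
        int[][] edges = {{7, 3}, {8, 3}, {3, 1}, {4, 1}, {5, 1}, {9, 5},
                         {10, 5}, {11, 10}, {12, 10}, {1, 0}, {2, 0}};
        List<List<Integer>> adj = digraph(13, edges);
        int[] r = shortest(adj, Arrays.asList(3), Arrays.asList(11));
        System.out.println("length = " + r[0] + ", ancestor = " + r[1]);
    }
}
```

With single-element sets this reproduces the pair queries of the sample execution (for 3 and 11 it reports length 4 and ancestor 1), and the work per query stays proportional to E + V no matter how many synsets a noun belongs to.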

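WordNet.java

WordNet is the hardest class to implement, with many fiddly details. We will tackle them one by one.

Choosing the data types

Let us first look at the relationship between nouns and ids. The input is laid out much like a database table: each vertex is identified by an id, an id may have several nouns, and there is a gloss explaining those nouns (which this problem does not need to store). However, since all later functions take a noun rather than an id as input, the key of our data structure has to be the noun.

Next, which data type should we use? While adding nouns we frequently need to check whether a noun has already been created (because several ids may map to the same noun): if it has, the id is added to the existing noun; if not, the new noun is added to the data structure. In other words, we need fast insertion, fast lookup, and no duplicates. Given these requirements, you can probably guess which data structure to choose: SET. Insertion, deletion, and lookup in a SET each take log n time, which fully meets the requirements. It follows that the constructor runs in n log n time, as the problem requires.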
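
The same noun-to-ids bookkeeping can also be sketched with a sorted map from the standard library. This is an alternative illustration, not the SET<Noun> structure used in the code below: java.util.TreeMap is a red-black tree with logarithmic insertion and lookup, and the class name NounIndex, its methods, and the second synset in main are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: noun -> list of synset ids, kept in a sorted map.
class NounIndex {
    private final Map<String, List<Integer>> idsByNoun = new TreeMap<>();

    // Records that synset `id` contains each space-separated noun in `synset`.
    void addSynset(int id, String synset) {
        for (String noun : synset.split(" ")) {
            // Creates the id list on first sight of the noun, then appends the id.
            idsByNoun.computeIfAbsent(noun, k -> new ArrayList<>()).add(id);
        }
    }

    // Is the word a known noun? One logarithmic lookup.
    boolean isNoun(String word) {
        return idsByNoun.containsKey(word);
    }

    // Returns the synset ids of a noun, or an empty list if the noun is unknown.
    List<Integer> ids(String noun) {
        return idsByNoun.getOrDefault(noun, Collections.emptyList());
    }

    // Returns all nouns without duplicates, in sorted order.
    Iterable<String> nouns() {
        return idsByNoun.keySet();
    }

    public static void main(String[] args) {
        NounIndex index = new NounIndex();
        index.addSynset(36, "AND_circuit AND_gate");   // line quoted in the problem statement
        index.addSynset(37, "AND_gate logic_gate");    // made-up synset, only to show a shared noun
        System.out.println(index.ids("AND_gate"));
        System.out.println(index.isNoun("OR_gate"));
    }
}
```

Keeping the ids as the map value avoids the lookup-then-mutate step on a set element, at the cost of not using the algs4 SET class.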

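Detecting an invalid graph

Another troublesome issue is detecting an invalid graph. The input must be a valid WordNet for the properties relied on in SAP.java to hold. So what counts as an invalid WordNet? According to the problem statement there are only two cases: (1) the graph has a cycle; (2) the graph has two or more roots. The first case is handled by calling the algs4.jar class that finds cycles in a digraph, namely DirectedCycle. The second case is trickier, and I spent quite a while working out how to find the root. Think about it carefully: if a vertex is the root, it must be an end point of the graph, that is, it points to no other vertex. In that case hypernyms.txt contains no entry that starts with this vertex's id and lists a hypernym for it! So we proceed by elimination: strike out every id that starts such an entry in hypernyms.txt and count the ids that remain. More than one is invalid (condition 2, more than one root), and none is invalid too (condition 1, no root, which means there is a cycle). In the implementation I still test the two conditions separately: first for a cycle, then for more than one root.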
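
As a standalone illustration of the two checks, the sketch below counts the vertices with no outgoing edge and tests for cycles. Note the swap: instead of the algs4 DirectedCycle class it uses Kahn's topological-sort algorithm for the cycle test, only so that the example runs without the jar. The class name RootedDagCheck and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: is the digraph acyclic with exactly one vertex of out-degree 0?
class RootedDagCheck {
    // Builds adjacency lists for a digraph with n vertices from {v, w} pairs (edge v -> w).
    static List<List<Integer>> digraph(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) adj.get(e[0]).add(e[1]);
        return adj;
    }

    static boolean isRootedDag(List<List<Integer>> adj) {
        int n = adj.size();
        int roots = 0;
        int[] inDegree = new int[n];
        for (int v = 0; v < n; v++) {
            if (adj.get(v).isEmpty()) roots++;          // no outgoing edge: candidate root
            for (int w : adj.get(v)) inDegree[w]++;
        }
        if (roots != 1) return false;

        // Kahn's algorithm: repeatedly remove vertices whose in-degree has dropped to 0.
        // Every vertex gets removed if and only if the digraph has no directed cycle.
        Queue<Integer> queue = new ArrayDeque<>();
        for (int v = 0; v < n; v++) {
            if (inDegree[v] == 0) queue.add(v);
        }
        int removed = 0;
        while (!queue.isEmpty()) {
            int v = queue.remove();
            removed++;
            for (int w : adj.get(v)) {
                if (--inDegree[w] == 0) queue.add(w);
            }
        }
        return removed == n;
    }

    public static void main(String[] args) {
        System.out.println(isRootedDag(digraph(3, new int[][] {{0, 1}, {1, 2}})));         // chain
        System.out.println(isRootedDag(digraph(3, new int[][] {{0, 1}, {0, 2}})));         // two roots
        System.out.println(isRootedDag(digraph(3, new int[][] {{0, 1}, {1, 0}, {1, 2}}))); // cycle
    }
}
```

In an acyclic digraph every vertex eventually reaches some vertex without outgoing edges, so having exactly one such vertex is enough for it to be an ancestor of all the others.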

The rest is fairly simple, so on to the code!

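import java.util.ArrayList;

// Imports for the packaged algs4.jar; older default-package builds do not need them.
import edu.princeton.cs.algs4.Digraph;
import edu.princeton.cs.algs4.DirectedCycle;
import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.Queue;
import edu.princeton.cs.algs4.SET;

public class WordNet {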
	private class Noun implements Comparable<Noun>{
		private String noun;
		private ArrayList<Integer> id = new ArrayList<Integer>();
		
		public Noun(String noun){
			this.noun = noun;
		}

		@Override
		public int compareTo(Noun that) {
			return this.noun.compareTo(that.noun);
		}
		
		public ArrayList<Integer> getId(){
			return this.id;
		}
		
		public void addId(Integer x){
			this.id.add(x);
		}
	}
	
	private SET<Noun> nounSET;
	private Digraph G;
	private SAP sap;
	private ArrayList<String> idList;
	
	// constructor takes the name of the two input files
	public WordNet(String synsets, String hypernyms){
		In inSynsets = new In(synsets);
		In inHypernyms = new In(hypernyms);
		//counting the total number of vertex
		int maxVertex = 0; 
		idList = new ArrayList<String>();
		nounSET = new SET<Noun>();
		 
		//start to read synsets.txt
		String line = inSynsets.readLine();
		
		while (line != null) {
			maxVertex++;
			String[] synsetLine = line.split(",");
			//String[0] is id
			Integer id = Integer.parseInt(synsetLine[0]);
			//String[1] is noun, split it and add to the set
			String[] nounSet = synsetLine[1].split(" ");
			for (String nounName : nounSet){
				Noun noun = new Noun(nounName);
				if (nounSET.contains(noun)){
					noun = nounSET.ceil(noun);
					noun.addId(id);
				} else {
					noun.addId(id);
					nounSET.add(noun);
				}
				
			}
			//add it to the idList
			idList.add(synsetLine[1]);
			//continue reading synsets
			line = inSynsets.readLine();
	    }
		
		G = new Digraph(maxVertex);
		//the candidate root
		boolean[] isNotRoot = new boolean[maxVertex];
		//start to read hypernyms.txt
		line = inHypernyms.readLine();
		while (line != null){
			String[] hypernymsLine = line.split(",");
			//String[0] is id
			int v = Integer.parseInt(hypernymsLine[0]);
			//String[1] and others are its hypernyms, constructing G
			for (int i=1; i<hypernymsLine.length;i++){
				//v has an outgoing edge, so it cannot be the root;
				//a line holding only an id leaves v as a root candidate
				isNotRoot[v] = true;
				G.addEdge(v, Integer.parseInt(hypernymsLine[i]));
			}
			line = inHypernyms.readLine();
		}
		//test for root: no cycle
		DirectedCycle directedCycle = new DirectedCycle(G);
		if (directedCycle.hasCycle()){
			throw new java.lang.IllegalArgumentException();
		}
		//test for root: no more than one candidate root
		int rootCount = 0;
		for (boolean notRoot : isNotRoot){
			if (!notRoot){
				rootCount++;
			}
		}
		if (rootCount > 1){
			throw new java.lang.IllegalArgumentException();
		}
		sap = new SAP(G);

	}

	// the set of nouns (no duplicates), returned as an Iterable
	public Iterable<String> nouns(){
		Queue<String> nouns = new Queue<String>();
		for (Noun noun : nounSET){
			nouns.enqueue(noun.noun);
		}
		return nouns;
	}

	// is the word a WordNet noun?
	public boolean isNoun(String word){
		Noun noun = new Noun(word);
		return nounSET.contains(noun);
	}

	// distance between nounA and nounB (defined below)
	public int distance(String nounA, String nounB){
		if (!isNoun(nounA)){
			throw new java.lang.IllegalArgumentException();
		}
		if (!isNoun(nounB)){
			throw new java.lang.IllegalArgumentException();
		}
		Noun nodeA = nounSET.ceil(new Noun(nounA));
		Noun nodeB = nounSET.ceil(new Noun(nounB));
		return sap.length(nodeA.getId(), nodeB.getId());
	}

	// a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB
	// in a shortest ancestral path (defined below)
	public String sap(String nounA, String nounB){
		if (!isNoun(nounA)){
			throw new java.lang.IllegalArgumentException();
		}
		if (!isNoun(nounB)){
			throw new java.lang.IllegalArgumentException();
		}
		Noun nodeA = nounSET.ceil(new Noun(nounA));
		Noun nodeB = nounSET.ceil(new Noun(nounB));
		return idList.get(sap.ancestor(nodeA.getId(), nodeB.getId()));
	}
}

Outcast.java

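The last class, Outcast.java, looks trivial, but during testing it annoyingly runs into a timeout. Here is how the problem arises. According to the problem statement:

Given a list of wordnet nouns $A_1, A_2, \ldots, A_n$, which noun is the least related to the others? To identify an outcast, compute the sum of the distances between each noun and every other one:

$d_i = \mathrm{dist}(A_i, A_1) + \mathrm{dist}(A_i, A_2) + \cdots + \mathrm{dist}(A_i, A_n)$

and return a noun $A_t$ for which $d_t$ is maximum.

So we can simply compute it the way the problem describes. But think about it: what is the complexity of doing so? Clearly this is like filling in a large symmetric N*N table, which takes N*N computations of shortestLength in total. If we implement it exactly this way and submit, the timing test fails with the following result: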

Running 1 total tests.

5.00 seconds to build WordNet

Computing time to find outcasts. Total time must not exceed 5 seconds.


    filename       N     time
-----------------------------
   outcast4.txt    4     0.95
   outcast5.txt    5     0.27
  outcast5a.txt    5     0.12
   outcast5.txt    5     0.15
   outcast7.txt    7     0.17
   outcast8.txt    8     0.33
  outcast8a.txt    8     0.21
  outcast8b.txt    8     0.20
  outcast8c.txt    8     0.22
   outcast9.txt    9     0.20
  outcast9a.txt    9     0.38
  outcast10.txt   10     0.52
 outcast10a.txt   10     0.34
  outcast11.txt   11     0.34
  outcast12.txt   12     0.62
 outcast12a.txt   12     0.45
  outcast20.txt   20     0.84
  outcast29.txt   29     2.51
=> FAILED, total elapsed time: 8.80

Total: 0/1 tests passed!
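In other words, the limit is 5 s, but the computation took about 9 s.

So how do we optimize? Observe that the N*N table is symmetric, so about N*N/2 computations are enough to fill in the whole table. The estimated time should then drop by about half, and that is indeed what happens: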

Running 1 total tests.

5.69 seconds to build WordNet

Computing time to find outcasts. Total time must not exceed 5 seconds.


    filename       N     time
-----------------------------
   outcast4.txt    4     0.59
   outcast5.txt    5     0.02
  outcast5a.txt    5     0.09
   outcast5.txt    5     0.02
   outcast7.txt    7     0.05
   outcast8.txt    8     0.07
  outcast8a.txt    8     0.11
  outcast8b.txt    8     0.16
  outcast8c.txt    8     0.08
   outcast9.txt    9     0.11
  outcast9a.txt    9     0.07
  outcast10.txt   10     0.07
 outcast10a.txt   10     0.14
  outcast11.txt   11     0.17
  outcast12.txt   12     0.21
 outcast12a.txt   12     0.41
  outcast20.txt   20     0.54
  outcast29.txt   29     0.94
=> PASSED, total elapsed time: 3.85

Total: 1/1 tests passed!
That is the only problem you are likely to run into in Outcast.

public class Outcast {
	private WordNet wordnet;
	// constructor takes a WordNet object
	public Outcast(WordNet wordnet){
		this.wordnet = wordnet;
	}

	// given an array of WordNet nouns, return an outcast
	public String outcast(String[] nouns){
		int[] distance = new int[nouns.length];
		for (int i=0; i<nouns.length; i++){
			for (int j=i; j<nouns.length; j++){
				int dist = wordnet.distance(nouns[i], nouns[j]);
				distance[i] += dist;
				if (i != j){
					distance[j] += dist;
				}
			}
		}
		int maxDistance = 0;
		int maxIndex = 0;
		for (int i=0; i<distance.length; i++){
			if (distance[i] > maxDistance){
				maxDistance = distance[i];
				maxIndex = i;
			}
		}
		return nouns[maxIndex];
	}
}

