Java
Asymptotics
Intuitive Runtime Characterizations
Technique 1: Measure execution time in seconds using a client program.
Technique 2: Count possible operations.
- 2-1: Count possible operations for an array of size N = 10000
  - Pro: Machine independent. Input dependence captured in the model.
  - Con: The array size was arbitrary. Does not tell you the actual time.
- 2-2: Count possible operations in terms of the input array size N
  - Pro: Machine independent. Input dependence captured in the model. Tells us how the algorithm scales.
  - Con: Does not tell you the actual time.
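Technique 1 can be sketched as a small timing client. This is a minimal sketch, not from the notes: the class name `TimingClient` is illustrative, and the dup1 method (defined later in these notes) is inlined so the example compiles on its own.

```java
// Technique 1: measure execution time in seconds with a client program.
public class TimingClient {
    /** Returns true if A contains a duplicate (nested-loop approach). */
    public static boolean dup1(int[] A) {
        for (int i = 0; i < A.length; i += 1) {
            for (int j = i + 1; j < A.length; j += 1) {
                if (A[i] == A[j]) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int N = 10000;
        int[] A = new int[N];
        for (int i = 0; i < N; i += 1) {
            A[i] = i;  // all values distinct: the worst case for dup1
        }
        long start = System.nanoTime();
        boolean result = dup1(A);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println("dup1 on N=" + N + ": " + result + " in " + seconds + " s");
    }
}
```

The drawback the notes mention is visible here: the measured seconds depend on the machine and the chosen N, which motivates counting operations instead.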
// dup1: compare every pair of elements
for (int i = 0; i < A.length; i += 1) {
    for (int j = i + 1; j < A.length; j += 1) {
        if (A[i] == A[j]) {
            return true;
        }
    }
}
return false;
operation (dup1) | 2-1 Count, N=10000 | 2-2 Symbolic count
---|---|---
i = 0 | 1 | 1
j = i+1 | 1 to 10000 | 1 to N
less than (<) | 2 to 50015001 | 2 to $\frac{N^2+3N+2}{2}$
increment (+=1) | 0 to 50005000 | 0 to $\frac{N^2+N}{2}$
equals (==) | 1 to 49995000 | 1 to $\frac{N^2-N}{2}$
array accesses | 2 to 99990000 | 2 to $N^2-N$
// dup2: compare only neighbors
for (int i = 0; i < A.length - 1; i += 1) {
    if (A[i] == A[i + 1]) {
        return true;
    }
}
return false;
operation (dup2) | 2-1 Count, N=10000 | 2-2 Symbolic count
---|---|---
i = 0 | 1 | 1
less than (<) | 0 to 10000 | 0 to N
increment (+=1) | 0 to 9999 | 0 to N-1
equals (==) | 1 to 9999 | 1 to N-1
array accesses | 2 to 19998 | 2 to 2N-2
If we want to choose the better algorithm, we need to consider:
- Which does fewer operations to do the same work
- Which algorithm scales better in the worst case
Worst Case Order of Growth
- Intuitive Simplification 1: Consider only the worst case
- Intuitive Simplification 2: Restrict attention to one operation (pick some representative operation to act as a proxy for the overall runtime)
- Intuitive Simplification 3: Eliminate low-order terms
- Intuitive Simplification 4: Eliminate multiplicative constants
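As a worked example (using dup1's `==` counts from the table above), the four simplifications turn the exact worst-case count into an order of growth:

```latex
\underbrace{\frac{N^2 - N}{2}}_{\text{worst-case \texttt{==} count}}
\;\longrightarrow\; \frac{N^2}{2} \;\;\text{(drop low-order terms)}
\;\longrightarrow\; N^2 \;\;\text{(drop multiplicative constants)}
```

So dup1's worst-case order of growth is $N^2$, while the same process applied to dup2's count of $N-1$ gives $N$.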
Big Theta
Suppose we have a function R(N) with order of growth f(N). In "Big Theta" notation we write this as $R(N) \in \Theta(f(N))$, e.g. $N^3 + N^4 \in \Theta(N^4)$.
$R(N) \in \Theta(f(N))$ means there exist positive constants $k_1$ and $k_2$ such that:

$$k_1 \cdot f(N) \leq R(N) \leq k_2 \cdot f(N)$$

for all values of N greater than some $N_0$.
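For example, the claim $N^3 + N^4 \in \Theta(N^4)$ can be checked directly against this definition by choosing $k_1 = 1$, $k_2 = 2$, and $N_0 = 1$:

```latex
1 \cdot N^4 \;\leq\; N^3 + N^4 \;\leq\; 2 \cdot N^4
\qquad \text{for all } N \geq 1,
```

which holds because $0 \leq N^3 \leq N^4$ whenever $N \geq 1$.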
Using Big-Theta does not change anything about runtime analysis. The only difference is that we use the $\Theta$ symbol anywhere we would have said "order of growth".
Big O Notation
Whereas Big Theta can informally be thought of as something like “equals”, Big O can be thought of as “less than or equal”.
$R(N) \in O(f(N))$ means there exists a positive constant $k_2$ such that:

$$R(N) \leq k_2 \cdot f(N)$$

for all values of N greater than some $N_0$, i.e. for very large N.
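Because Big O gives only an upper bound, one function belongs to many O classes at once, but only one $\Theta$ class:

```latex
N^3 + N^4 \in O(N^4), \qquad
N^3 + N^4 \in O(N^6), \qquad
\text{but } N^3 + N^4 \notin \Theta(N^6).
```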
Big Omega
While Big Theta can informally be thought of as runtime equality and Big O represents "less than or equal", Big Omega can be thought of as "greater than or equal".
All of the following statements are true:
- $N^3 + N^4 \in \Theta(N^4)$
- $N^3 + N^4 \in \Omega(N^4)$
- $N^3 + N^4 \in \Omega(N^3)$
- $N^3 + N^4 \in \Omega(1)$
Common uses for Big Omega:
- It is used to prove Big Theta runtimes: if $R(N) = O(f(N))$ and $R(N) = \Omega(f(N))$, then $R(N) = \Theta(f(N))$.
- It is used to prove the difficulty of a problem, e.g. any duplicate-finding algorithm must be $\Omega(N)$, because the algorithm must at least look at each element.
Big Theta vs. Big O vs. Big Omega
Notation | Informal meaning | Family | Family members
---|---|---|---
Big Theta $\Theta(f(N))$ | Order of growth is f(N) | $\Theta(N^2)$ | $N^2/2$, $2N^2$, $N^2+38N+N$
Big O $O(f(N))$ | Order of growth is less than or equal to f(N) | $O(N^2)$ | $N^2/2$, $2N^2$, $\lg(N)$
Big Omega $\Omega(f(N))$ | Order of growth is greater than or equal to f(N) | $\Omega(N^2)$ | $N^2/2$, $2N^2$, $5^N$
Amortized Analysis
- A more rigorous amortized analysis proceeds in three steps:
  - Pick a cost model (as in regular runtime analysis)
  - Compute the average cost of the i-th operation
  - Show that this average (amortized) cost is bounded by a constant
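These three steps can be illustrated with the classic resizing-array example. This sketch is not from the notes, and the cost model (one unit per array write) is an assumption:

```java
// Amortized analysis sketch for a doubling array list.
// Cost model (assumption): each array write costs 1. The i-th add costs 1,
// plus (i-1) extra writes if it triggers a resize that copies the old items.
public class AmortizedDemo {
    /** Total write cost of performing n adds, starting from capacity 1. */
    public static long totalCost(int n) {
        long cost = 0;
        int size = 0;
        int capacity = 1;
        for (int i = 0; i < n; i += 1) {
            if (size == capacity) {  // resize: copy all existing items
                cost += size;
                capacity *= 2;
            }
            cost += 1;               // the write for the new item itself
            size += 1;
        }
        return cost;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // n writes for the items plus copies of 1 + 2 + 4 + ... < 2n,
        // so the average (amortized) cost per add is bounded by the constant 3.
        System.out.println((double) totalCost(n) / n);
    }
}
```

Step 3 is the key: even though a single resize can cost N, the average over all N adds stays below a constant.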
Disjoint Sets
Problem abstraction
- Basic Problem: Deriving the Disjoint Sets data structure for solving the "Dynamic Connectivity Problem"
  - How a data structure design can evolve from basic to sophisticated
  - How our choice of underlying abstraction can affect asymptotic runtime (using the formal Big-Theta notation) and code complexity
- Two operations of the Disjoint Sets data structure
  - connect(x, y): Connects x and y
  - isConnected(x, y): Returns true if x and y are connected. Connections can be transitive, i.e. they do not need to be direct.
- Two assumptions to keep things simple
  - Force all items to be integers instead of arbitrary data (which means we discuss Disjoint Sets on integers), e.g. ListOfSetsDS, a List<Set<Integer>>. For instance, if we have N = 6 elements and nothing has been connected yet, our list of sets looks like [{0}, {1}, {2}, {3}, {4}, {5}]. Then isConnected(4, 5) requires iterating through N-1 sets to find 4, then N sets to find 5. This is the worst case, and the overall runtime is $\Theta(N)$.
  - Declare the number of items in advance; everything is disconnected at the start.
Design an efficient DisjointSets implementation:
- The number of elements N can be huge
- The number of method calls M can be huge
- Calls to methods may be interspersed (we cannot assume that all connect operations come before all isConnected operations)
The Disjoint Set Interface
public interface DisjointSets {
    /** Connects two items P and Q. */
    void connect(int p, int q);

    /** Checks to see if two items are connected. */
    boolean isConnected(int p, int q);
}
Naive Approach vs. Connected Components
- Naive Approach
  - Connecting two things: Record every single connecting line in some data structure
  - Checking connectedness: Do some sort of iteration over the lines to see if one thing can be reached from the other
- Connected Components
  - For each item, its connected component is the set of all items connected to it. We only record the set that each item belongs to.
  - Model connectedness in terms of sets
  - How things are connected is not something we need to know
  - Only need to keep track of which connected component each item belongs to
  - eg: {0, 1, 2, 4}, {3, 5}, {6}
Quick Find
Challenge: Pick a data structure to support tracking of sets.
- Let's consider another approach using a single array of integers:
  - The indices of the array represent the elements of our set.
  - The value at an index is the number of the set that element belongs to.
  - eg: we represent [{0, 1, 2, 4}, {3, 5}, {6}] as the int[] [4, 4, 4, 5, 4, 5, 6]. The array indices (0, ..., 6) are the elements, and the value at id[i] is the set that element i belongs to.
  - The specific set number does not matter as long as all elements in the same set share the same id. So the int array could also be [2, 2, 2, 3, 2, 3, 6].
- connect(x, y)
  - With [{0, 1, 2, 4}, {3, 5}, {6}] represented as [4, 4, 4, 5, 4, 5, 6], we have id[2] = 4 and id[3] = 5. After calling connect(2, 3), all the elements with id 4 or 5 should share the same id. The sets become [{0, 1, 2, 3, 4, 5}, {6}] and id becomes [5, 5, 5, 5, 5, 5, 6].
  - We need to iterate through the whole array, so the overall runtime is $\Theta(N)$.
- isConnected(x, y)
  - To check isConnected(x, y), we simply check whether id[x] == id[y]. Notice that this is a constant-time operation, so the overall runtime is $\Theta(1)$.
public class QuickFindDS implements DisjointSets {
    private int[] id;

    /** Constructor: Theta(N). */
    public QuickFindDS(int N) {
        id = new int[N];
        for (int i = 0; i < N; i++) {
            id[i] = i;
        }
    }

    /** connect: Theta(N). */
    public void connect(int p, int q) {
        int pid = id[p];
        int qid = id[q];
        for (int i = 0; i < id.length; i++) {
            if (id[i] == pid) {
                id[i] = qid;
            }
        }
    }

    /** isConnected: Theta(1). */
    public boolean isConnected(int p, int q) {
        return id[p] == id[q];
    }
}
Quick Union
Basic idea: This approach lets us imagine each of our sets as a tree. Instead of an id, we assign each item the index of its parent. If an item has no parent, then it is a 'root' and we assign it a negative value.
So we represent [{0, 1, 2, 4}, {3, 5}, {6}] as the int[] parent = [-1, 0, 1, -1, 0, 3, -1]. We now represent the sets using only an array.
For this method, we define a helper function find(int item), which returns the root of the tree containing item, e.g. find(5) == 3 and find(2) == 0.
- connect(x, y)
  - To connect two items, we find the set that each item belongs to (i.e. find the roots of their respective trees), and make one root the child of the other.
  - eg: connect(5, 2)
    - find(5) -> 3
    - find(2) -> 0
    - Set find(5)'s value to find(2), that is, parent[3] = 0
    - Now element 3 points to element 0, combining the two trees/sets into one.
- isConnected(x, y)
  - If two elements are part of the same set, then they will be in the same tree. So for isConnected(x, y), we simply check whether find(x) == find(y).
- Performance / defect
  - There is a potential performance issue with QuickUnion: the tree can become very tall. In that case, finding the root of an item (find(item)) becomes very expensive.
  - In the worst case, we have to traverse all the items to get to the root, which is a $\Theta(N)$ runtime. Since we call find(item) in both the connect and isConnected methods, the runtime of both is upper bounded by $O(N)$.
public class QuickUnionDS implements DisjointSets {
    private int[] parent;

    /** Constructor: Theta(N). */
    public QuickUnionDS(int num) {
        parent = new int[num];
        for (int i = 0; i < num; i++) {
            parent[i] = -1;
        }
    }

    /** Helper function -- find is O(N); the worst case is Theta(N). */
    private int find(int p) {
        int r = p;
        while (parent[r] >= 0) {
            r = parent[r];
        }
        return r;
    }

    /** connect: calls the find method, so the runtime is O(N). */
    @Override
    public void connect(int p, int q) {
        int i = find(p);
        int j = find(q);
        parent[i] = j;
    }

    /** isConnected: calls the find method, so the runtime is O(N). */
    @Override
    public boolean isConnected(int p, int q) {
        return find(p) == find(q);
    }
}
Weighted Quick Union
Improving on Quick Union relies on a key insight: whenever we call find(int item), we have to climb to the root of a tree. Thus, the shorter the tree, the faster find runs.
- New rule
  - Whenever we call connect, we always link the root of the smaller tree to the root of the larger tree.
  - We need to bound the maximum height of the tree; the worst case is $\Theta(\log N)$.
- Maximum height: log N
  - N is the number of elements in our Disjoint Sets
  - The runtimes of connect and isConnected are bounded by $O(\log N)$
  - Why log N?
    - Imagine any element $x$ in tree $T_1$. The depth of $x$ increases by 1 only when tree $T_1$ is placed below another tree $T_2$.
    - When that happens, the size of the resulting tree is at least double the size of $T_1$, because $size(T_2) \geq size(T_1)$.
    - The tree containing $x$ can double at most $\log_2 N$ times before we reach a total of $N$ items (since $2^{\log_2 N} = N$).
    - So the tree can double up to $\log_2 N$ times, and each doubling adds at most one level $\rightarrow$ the maximum height is $\log_2 N$.
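The notes give no code for this variant, so here is a minimal sketch of a WeightedQuickUnionDS, assuming the same negative-value-for-roots convention as QuickUnionDS, extended so that each root stores the negated size of its tree:

```java
// Weighted Quick Union sketch: a root stores -(size of its tree), so one
// array tracks both parents and set sizes. Not code from the notes.
public class WeightedQuickUnionDS {
    private int[] parent;

    public WeightedQuickUnionDS(int num) {
        parent = new int[num];
        java.util.Arrays.fill(parent, -1);  // every item is a root of size 1
    }

    /** Climbs to the root of the tree containing p. */
    private int find(int p) {
        while (parent[p] >= 0) {
            p = parent[p];
        }
        return p;
    }

    public void connect(int p, int q) {
        int i = find(p);
        int j = find(q);
        if (i == j) {
            return;                       // already in the same set
        }
        if (parent[i] <= parent[j]) {     // tree i is at least as large
            parent[i] += parent[j];       // (sizes are negated) combine sizes
            parent[j] = i;                // smaller tree j links below i
        } else {
            parent[j] += parent[i];
            parent[i] = j;
        }
    }

    public boolean isConnected(int p, int q) {
        return find(p) == find(q);
    }
}
```

Because the smaller tree always goes below the larger one, the height stays at most $\log_2 N$, matching the argument above.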
Path Compression
- Performing M operations on a DisjointSets object with N elements:
  - For the naive implementation, the runtime is $O(MN)$
  - For the best implementation, the runtime is $O(N + M \log N)$
- Path compression results in union/connected operations that are very close to amortized constant time (amortized constant means constant on average).
  - M operations on N nodes take $O(N + M \log^* N)$, where $\log^*$ is the iterated logarithm.
  - Clever idea: when we do isConnected(x, y), tie all nodes seen to the root (that is, make x, y, and their ancestors point directly to the root).
  - A tighter bound: $O(N + M \alpha(N))$, where $\alpha$ is the inverse Ackermann function.
  - The inverse Ackermann function is less than 5 for all practical inputs.
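Path compression changes only the find helper: after locating the root, every node on the traversed path is pointed directly at it. A sketch combining this with the weighting rule described above; the class name is illustrative:

```java
// WQU with path compression: find() re-points every visited node at the
// root, flattening the tree. Sketch only; not code from the notes.
public class WQUPathCompressionDS {
    private int[] parent;  // roots store -(tree size)

    public WQUPathCompressionDS(int num) {
        parent = new int[num];
        java.util.Arrays.fill(parent, -1);
    }

    private int find(int p) {
        int root = p;
        while (parent[root] >= 0) {       // first pass: locate the root
            root = parent[root];
        }
        while (p != root) {               // second pass: compress the path
            int next = parent[p];
            parent[p] = root;             // p now points directly at the root
            p = next;
        }
        return root;
    }

    public void connect(int p, int q) {
        int i = find(p);
        int j = find(q);
        if (i == j) {
            return;
        }
        if (parent[i] <= parent[j]) {     // tree i at least as large (negated)
            parent[i] += parent[j];
            parent[j] = i;
        } else {
            parent[j] += parent[i];
            parent[i] = j;
        }
    }

    public boolean isConnected(int p, int q) {
        return find(p) == find(q);
    }
}
```

Every later find on a compressed node is then a short hop to the root, which is what drives the amortized cost down toward $\alpha(N)$.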
Summary
- Method Summary
N is the number of elements in Disjoint Sets
Implementation | Constructor | connect | isConnected
---|---|---|---
ListOfSetsDS | $\Theta(N)$ | $O(N)$ | $O(N)$
QuickFindDS | $\Theta(N)$ | $\Theta(N)$ | $\Theta(1)$
QuickUnionDS | $\Theta(N)$ | $O(N)$ | $O(N)$
WeightedQuickUnionDS | $\Theta(N)$ | $O(\log N)$ | $O(\log N)$
WQU with path compression | $\Theta(N)$ | $O(\alpha(N))$ | $O(\alpha(N))$
A Summary of Our Iterative Design Process
- Represent sets as connected components (do not track individual connections)
  - ListOfSetsDS: Store connected components as a List of Sets
  - QuickFindDS: Store connected components as set ids
  - QuickUnionDS: Store connected components as parent ids
  - WeightedQuickUnionDS: Also track the size of each set, and use size to decide on the new tree root
  - WeightedQuickUnionWithPathCompressionDS: On calls to connect and isConnected, set the parent id to the root for all items seen
Performance Summary
- Runtimes are given assuming:
  - We have a DisjointSets object of size N
  - We perform M operations, where an operation is defined as either a call to connect or isConnected
Implementation | Runtime
---|---
ListOfSetsDS | $O(NM)$
QuickFindDS | $\Theta(NM)$
QuickUnionDS | $O(NM)$
WeightedQuickUnionDS | $O(N + M \log N)$
WeightedQuickUnionWithPathCompressionDS | $O(N + M \alpha(N))$