Java
Asymptotics
Intuitive Runtime Characterizations
Technique 1: Measure execution time in seconds using a client program.
Technique 2: Count possible operations.
- 2-1: Count possible operations for an array of size N = 10000
  - Pro: Machine independent. Input dependence captured in the model.
  - Con: The array size was arbitrary. Does not tell you the actual time.
- 2-2: Count possible operations in terms of the input array size N
  - Pro: Machine independent. Input dependence captured in the model. Tells us how the algorithm scales.
  - Con: Does not tell you the actual time.
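Technique 1 can be sketched as a small timing client. This is a minimal sketch, not from the notes: the class name `TimingClient` is illustrative, and the dup1 method (defined later in these notes) is inlined so the example compiles on its own.

```java
// Technique 1: measure execution time in seconds with a client program.
public class TimingClient {
    /** Returns true if A contains a duplicate (nested-loop approach). */
    public static boolean dup1(int[] A) {
        for (int i = 0; i < A.length; i += 1) {
            for (int j = i + 1; j < A.length; j += 1) {
                if (A[i] == A[j]) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int N = 10000;
        int[] A = new int[N];
        for (int i = 0; i < N; i += 1) {
            A[i] = i;  // all values distinct: the worst case for dup1
        }
        long start = System.nanoTime();
        boolean result = dup1(A);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println("dup1 on N=" + N + ": " + result + " in " + seconds + " s");
    }
}
```

The drawback the notes mention is visible here: the measured seconds depend on the machine and the chosen N, which motivates counting operations instead.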
// dup1: compare every pair of elements
for (int i = 0; i < A.length; i += 1) {
    for (int j = i + 1; j < A.length; j += 1) {
        if (A[i] == A[j]) {
            return true;
        }
    }
}
return false;
operation (dup1) | 2-1 Count, N=10000 | 2-2 Symbolic count
---|---|---
i = 0 | 1 | 1
j = i+1 | 1 to 10000 | 1 to N
less than (<) | 2 to 50015001 | 2 to $\frac{N^2+3N+2}{2}$
increment (+=1) | 0 to 50005000 | 0 to $\frac{N^2+N}{2}$
equals (==) | 1 to 49995000 | 1 to $\frac{N^2-N}{2}$
array accesses | 2 to 99990000 | 2 to $N^2-N$
// dup2: compare only neighbors
for (int i = 0; i < A.length - 1; i += 1) {
    if (A[i] == A[i + 1]) {
        return true;
    }
}
return false;
operation (dup2) | 2-1 Count, N=10000 | 2-2 Symbolic count
---|---|---
i = 0 | 1 | 1
less than (<) | 0 to 10000 | 0 to N
increment (+=1) | 0 to 9999 | 0 to N-1
equals (==) | 1 to 9999 | 1 to N-1
array accesses | 2 to 19998 | 2 to 2N-2
If we want to choose the better algorithm, we need to consider:
- Which does fewer operations to do the same work
- Which algorithm scales better in the worst case
Worst Case Order of Growth
- Intuitive Simplification 1: Consider only the worst case
- Intuitive Simplification 2: Restrict attention to one operation (pick some representative operation to act as a proxy for the overall runtime)
- Intuitive Simplification 3: Eliminate low-order terms
- Intuitive Simplification 4: Eliminate multiplicative constants
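As a worked example (using dup1's `==` counts from the table above), the four simplifications turn the exact worst-case count into an order of growth:

```latex
\underbrace{\frac{N^2 - N}{2}}_{\text{worst-case \texttt{==} count}}
\;\longrightarrow\; \frac{N^2}{2} \;\;\text{(drop low-order terms)}
\;\longrightarrow\; N^2 \;\;\text{(drop multiplicative constants)}
```

So dup1's worst-case order of growth is $N^2$, while the same process applied to dup2's count of $N-1$ gives $N$.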
Big Theta
Suppose we have a function R(N) with order of growth f(N). In "Big Theta" notation we write this as $R(N) \in \Theta(f(N))$, e.g. $N^3 + N^4 \in \Theta(N^4)$.
$R(N) \in \Theta(f(N))$ means there exist positive constants $k_1$ and $k_2$ such that:

$$k_1 \cdot f(N) \leq R(N) \leq k_2 \cdot f(N)$$

for all values of N greater than some $N_0$.
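For example, the claim $N^3 + N^4 \in \Theta(N^4)$ can be checked directly against this definition by choosing $k_1 = 1$, $k_2 = 2$, and $N_0 = 1$:

```latex
1 \cdot N^4 \;\leq\; N^3 + N^4 \;\leq\; 2 \cdot N^4
\qquad \text{for all } N \geq 1,
```

which holds because $0 \leq N^3 \leq N^4$ whenever $N \geq 1$.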
Using Big-Theta does not change anything about runtime analysis. The only difference is that we use the $\Theta$ symbol anywhere we would have said "order of growth".
Big O Notation
Whereas Big Theta can informally be thought of as something like “equals”, Big O can be thought of as “less than or equal”.
$R(N) \in O(f(N))$ means there exists a positive constant $k_2$ such that:

$$R(N) \leq k_2 \cdot f(N)$$

for all values of N greater than some $N_0$, i.e. for very large N.
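Because Big O gives only an upper bound, one function belongs to many O classes at once, but only one $\Theta$ class:

```latex
N^3 + N^4 \in O(N^4), \qquad
N^3 + N^4 \in O(N^6), \qquad
\text{but } N^3 + N^4 \notin \Theta(N^6).
```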
Big Omega
While Big Theta can informally be thought of as runtime equality and Big O represents "less than or equal", Big Omega can be thought of as "greater than or equal".
All of the following statements are true:
- $N^3 + N^4 \in \Theta(N^4)$
- $N^3 + N^4 \in \Omega(N^4)$
- $N^3 + N^4 \in \Omega(N^3)$
- $N^3 + N^4 \in \Omega(1)$
Common uses for Big Omega:
- It is used to prove Big Theta runtimes: if $R(N) = O(f(N))$ and $R(N) = \Omega(f(N))$, then $R(N) = \Theta(f(N))$.
- It is used to prove the difficulty of a problem, e.g. any duplicate-finding algorithm must be $\Omega(N)$, because the algorithm must at least look at each element.
Big Theta vs. Big O vs. Big Omega
Notation | Informal meaning | Family | Family members
---|---|---|---
Big Theta $\Theta(f(N))$ | Order of growth is f(N) | $\Theta(N^2)$ | $N^2/2$, $2N^2$, $N^2+38N+N$
Big O $O(f(N))$ | Order of growth is less than or equal to f(N) | $O(N^2)$ | $N^2/2$, $2N^2$, $\lg(N)$
Big Omega $\Omega(f(N))$ | Order of growth is greater than or equal to f(N) | $\Omega(N^2)$ | $N^2/2$, $2N^2$, $5^N$
Amortized Analysis
- A more rigorous amortized analysis proceeds in three steps:
  - Pick a cost model (as in regular runtime analysis)
  - Compute the average cost of the i-th operation
  - Show that this average (amortized) cost is bounded by a constant
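These three steps can be illustrated with the classic resizing-array example. This sketch is not from the notes, and the cost model (one unit per array write) is an assumption:

```java
// Amortized analysis sketch for a doubling array list.
// Cost model (assumption): each array write costs 1. The i-th add costs 1,
// plus (i-1) extra writes if it triggers a resize that copies the old items.
public class AmortizedDemo {
    /** Total write cost of performing n adds, starting from capacity 1. */
    public static long totalCost(int n) {
        long cost = 0;
        int size = 0;
        int capacity = 1;
        for (int i = 0; i < n; i += 1) {
            if (size == capacity) {  // resize: copy all existing items
                cost += size;
                capacity *= 2;
            }
            cost += 1;               // the write for the new item itself
            size += 1;
        }
        return cost;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // n writes for the items plus copies of 1 + 2 + 4 + ... < 2n,
        // so the average (amortized) cost per add is bounded by the constant 3.
        System.out.println((double) totalCost(n) / n);
    }
}
```

Step 3 is the key: even though a single resize can cost N, the average over all N adds stays below a constant.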
Disjoint Sets
Problem abstraction
- Basic Problem: Deriving the Disjoint Sets data structure for solving the "Dynamic Connectivity Problem"
  - How a data structure design can evolve from basic to sophisticated
  - How our choice of underlying abstraction can affect asymptotic runtime (using the formal Big-Theta notation) and code complexity
- Two operations of the Disjoint Sets data structure
  - connect(x, y): Connects x and y
  - isConnected(x, y): Returns true if x and y are connected. Connections can be transitive, i.e. they do not need to be direct.
- Two assumptions to keep things simple
  - Force all items to be integers instead of arbitrary data (which means we discuss Disjoint Sets on integers), e.g. ListOfSetsDS, a List<Set<Integer>>. For instance, if we have N = 6 elements and nothing has been connected yet, our list of sets looks like [{0}, {1}, {2}, {3}, {4}, {5}]. Then isConnected(4, 5) requires iterating through N-1 sets to find 4, then N sets to find 5. This is the worst case, and the overall runtime is $\Theta(N)$.
  - Declare the number of items in advance; everything is disconnected at the start.
Design an efficient DisjointSets implementation:
- The number of elements N can be huge
- The number of method calls M can be huge
- Calls to methods may be interspersed (we cannot assume that all connect operations come before all isConnected operations)
The Disjoint Set Interface
public interface DisjointSets {
    /** Connects two items P and Q. */
    void connect(int p, int q);

    /** Checks to see if two items are connected. */
    boolean isConnected(int p, int q);
}
Naive Approach vs. Connected Components
- Naive Approach
  - Connecting two things: Record every single connecting line in some data structure
  - Checking connectedness: Do some sort of iteration over the lines to see if one thing can be reached from the other
- Connected Components
  - For each item, its connected component is the set of all items connected to it. We only record the set that each item belongs to.
  - Model connectedness in terms of sets
  - How things are connected is not something we need to know
  - Only need to keep track of which connected component each item belongs to
  - eg: {0, 1, 2, 4}, {3, 5}, {6}
Quick Find
Challenge: Pick a data structure to support tracking of sets.
- Let's consider another approach using a single array of integers:
  - The indices of the array represent the elements of our set.
  - The value at an index is the number of the set that element belongs to.
  - eg: we represent [{0, 1, 2, 4}, {3, 5}, {6}] as the int[] [4, 4, 4, 5, 4, 5, 6]. The array indices (0, ..., 6) are the elements, and the value at id[i] is the set that element i belongs to.
  - The specific set number does not matter as long as all elements in the same set share the same id. So the int array could also be [2, 2, 2, 3, 2, 3, 6].
- connect(x, y)
  - With [{0, 1, 2, 4}, {3, 5}, {6}] represented as [4, 4, 4, 5, 4, 5, 6], we have id[2] = 4 and id[3] = 5. After calling connect(2, 3), all the elements with id 4 or 5 should share the same id. The sets become [{0, 1, 2, 3, 4, 5}, {6}] and id becomes [5, 5, 5, 5, 5, 5, 6].
  - We need to iterate through the whole array, so the overall runtime is $\Theta(N)$.
- isConnected(x, y)
  - To check isConnected(x, y), we simply check whether id[x] == id[y]. Notice that this is a constant-time operation, so the overall runtime is $\Theta(1)$.
public class QuickFindDS implements DisjointSets {
    private int[] id;

    /** Constructor: Theta(N). */
    public QuickFindDS(int N) {
        id = new int[N];
        for (int i = 0; i < N; i++) {
            id[i] = i;
        }
    }

    /** connect: Theta(N). */
    public void connect(int p, int q) {
        int pid = id[p];
        int qid = id[q];
        for (int i = 0; i < id.length; i++) {
            if (id[i] == pid) {
                id[i] = qid;
            }
        }
    }

    /** isConnected: Theta(1). */
    public boolean isConnected(int p, int q) {
        return id[p] == id[q];
    }
}
Quick Union
Basic idea: This approach lets us imagine each of our sets as a tree. Instead of an id, we assign each item the index of its parent. If an item has no parent, then it is a 'root' and we assign it a negative value.
So we represent [{0, 1, 2, 4}, {3, 5}, {6}] as the int[] parent = [-1, 0, 1, -1, 0, 3, -1]. We now represent the sets using only an array.
For this method, we define a helper function find(int item), which returns the root of the tree containing item, e.g. find(5) == 3 and find(2) == 0.
- connect(x, y)
  - To connect two items, we find the set that each item belongs to (i.e. find the roots of their respective trees), and make one root the child of the other.
  - eg: connect(5, 2)
    - find(5) -> 3
    - find(2) -> 0
    - Set find(5)'s value to find(2), that is, parent[3] = 0
    - Now element 3 points to element 0, combining the two trees/sets into one.
- isConnected(x, y)
  - If two elements are part of the same set, then they will be in the same tree. So for isConnected(x, y), we simply check whether find(x) == find(y).
- Performance / defect
  - There is a potential performance issue with QuickUnion: the tree can become very tall. In that case, finding the root of an item (find(item)) becomes very expensive.
  - In the worst case, we have to traverse all the items to get to the root, which is a $\Theta(N)$ runtime. Since we call find(item) in both the connect and isConnected methods, the runtime of both is upper bounded by $O(N)$.
public class QuickUnionDS implements DisjointSets {
    private int[] parent;

    /** Constructor: Theta(N). */
    public QuickUnionDS(int num) {
        parent = new int[num];
        for (int i = 0; i < num; i++) {
            parent[i] = -1;
        }
    }

    /** Helper function -- find is O(N); the worst case is Theta(N). */
    private int find(int p) {
        int r = p;
        while (parent[r] >= 0) {
            r = parent[r];
        }
        return r;
    }

    /** connect: calls the find method, so the runtime is O(N). */
    @Override
    public void connect(int p, int q) {
        int i = find(p);
        int j = find(q);
        parent[i] = j;
    }

    /** isConnected: calls the find method, so the runtime is O(N). */
    @Override
    public boolean isConnected(int p, int q) {
        return find(p) == find(q);
    }
}
Weighted Quick Union
Improving on Quick Union relies on a key insight: whenever we call find(int item), we have to climb to the root of a tree. Thus, the shorter the tree, the faster find runs.
- New rule
  - Whenever we call connect, we always link the root of the smaller tree to the root of the larger tree.
  - We need to bound the maximum height of the tree; the worst case is $\Theta(\log N)$.
- Maximum height: log N
  - N is the number of elements in our Disjoint Sets
  - The runtimes of connect and isConnected are bounded by $O(\log N)$
  - Why log N?
    - Imagine any element $x$ in tree $T_1$. The depth of $x$ increases by 1 only when tree $T_1$ is placed below another tree $T_2$.
    - When that happens, the size of the resulting tree is at least double the size of $T_1$, because $size(T_2) \geq size(T_1)$.
    - The tree containing $x$ can double at most $\log_2 N$ times before we reach a total of $N$ items (since $2^{\log_2 N} = N$).
    - So the tree can double up to $\log_2 N$ times, and each doubling adds at most one level $\rightarrow$ the maximum height is $\log_2 N$.
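The notes give no code for this variant, so here is a minimal sketch of a WeightedQuickUnionDS, assuming the same negative-value-for-roots convention as QuickUnionDS, extended so that each root stores the negated size of its tree:

```java
// Weighted Quick Union sketch: a root stores -(size of its tree), so one
// array tracks both parents and set sizes. Not code from the notes.
public class WeightedQuickUnionDS {
    private int[] parent;

    public WeightedQuickUnionDS(int num) {
        parent = new int[num];
        java.util.Arrays.fill(parent, -1);  // every item is a root of size 1
    }

    /** Climbs to the root of the tree containing p. */
    private int find(int p) {
        while (parent[p] >= 0) {
            p = parent[p];
        }
        return p;
    }

    public void connect(int p, int q) {
        int i = find(p);
        int j = find(q);
        if (i == j) {
            return;                       // already in the same set
        }
        if (parent[i] <= parent[j]) {     // tree i is at least as large
            parent[i] += parent[j];       // (sizes are negated) combine sizes
            parent[j] = i;                // smaller tree j links below i
        } else {
            parent[j] += parent[i];
            parent[i] = j;
        }
    }

    public boolean isConnected(int p, int q) {
        return find(p) == find(q);
    }
}
```

Because the smaller tree always goes below the larger one, the height stays at most $\log_2 N$, matching the argument above.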
Path Compression
- Performing M operations on a DisjointSets object with N elements:
  - For the naive implementation, the runtime is $O(MN)$
  - For the best implementation, the runtime is $O(N + M \log N)$
- Path compression results in union/connected operations that are very close to amortized constant time (amortized constant means constant on average).
  - M operations on N nodes take $O(N + M \log^* N)$, where $\log^*$ is the iterated logarithm.
  - Clever idea: when we do isConnected(x, y), tie all nodes seen to the root (that is, make x, y, and their ancestors point directly to the root).
  - A tighter bound: $O(N + M \alpha(N))$, where $\alpha$ is the inverse Ackermann function.
  - The inverse Ackermann function is less than 5 for all practical inputs.
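Path compression changes only the find helper: after locating the root, every node on the traversed path is pointed directly at it. A sketch combining this with the weighting rule described above; the class name is illustrative:

```java
// WQU with path compression: find() re-points every visited node at the
// root, flattening the tree. Sketch only; not code from the notes.
public class WQUPathCompressionDS {
    private int[] parent;  // roots store -(tree size)

    public WQUPathCompressionDS(int num) {
        parent = new int[num];
        java.util.Arrays.fill(parent, -1);
    }

    private int find(int p) {
        int root = p;
        while (parent[root] >= 0) {       // first pass: locate the root
            root = parent[root];
        }
        while (p != root) {               // second pass: compress the path
            int next = parent[p];
            parent[p] = root;             // p now points directly at the root
            p = next;
        }
        return root;
    }

    public void connect(int p, int q) {
        int i = find(p);
        int j = find(q);
        if (i == j) {
            return;
        }
        if (parent[i] <= parent[j]) {     // tree i at least as large (negated)
            parent[i] += parent[j];
            parent[j] = i;
        } else {
            parent[j] += parent[i];
            parent[i] = j;
        }
    }

    public boolean isConnected(int p, int q) {
        return find(p) == find(q);
    }
}
```

Every later find on a compressed node is then a short hop to the root, which is what drives the amortized cost down toward $\alpha(N)$.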
Summary
- Method Summary
N is the number of elements in Disjoint Sets
Implementation | Constructor | connect | isConnected
---|---|---|---
ListOfSetsDS | $\Theta(N)$ | $O(N)$ | $O(N)$
QuickFindDS | $\Theta(N)$ | $\Theta(N)$ | $\Theta(1)$
QuickUnionDS | $\Theta(N)$ | $O(N)$ | $O(N)$
WeightedQuickUnionDS | $\Theta(N)$ | $O(\log N)$ | $O(\log N)$
WQU with path compression | $\Theta(N)$ | $O(\alpha(N))$ | $O(\alpha(N))$
A Summary of Our Iterative Design Process
- Represent sets as connected components (do not track individual connections)
  - ListOfSetsDS: Store connected components as a List of Sets
  - QuickFindDS: Store connected components as set ids
  - QuickUnionDS: Store connected components as parent ids
  - WeightedQuickUnionDS: Also track the size of each set, and use size to decide on the new tree root
  - WeightedQuickUnionWithPathCompressionDS: On calls to connect and isConnected, set the parent id to the root for all items seen
Performance Summary
- Runtimes are given assuming:
  - We have a DisjointSets object of size N
  - We perform M operations, where an operation is defined as either a call to connect or isConnected
Implementation | Runtime
---|---
ListOfSetsDS | $O(NM)$
QuickFindDS | $\Theta(NM)$
QuickUnionDS | $O(NM)$
WeightedQuickUnionDS | $O(N + M \log N)$
WeightedQuickUnionWithPathCompressionDS | $O(N + M \alpha(N))$