Problem 1
Complete the proof that T(n, h) = O(n log h) and find the best constant in the big-O bound.
Proof 1
As stated in the lecture, the running time satisfies the recurrence

T(n, h) = T(n/2, h1) + T(n/2, h2) + tangent(n/2, n/2),

where h = h1 + h2 + 1 and tangent(n/2, n/2) = O(n).

Since tangent(n/2, n/2) = O(n), there is a constant C1 such that

T(n, h) ≤ T(n/2, h1) + T(n/2, h2) + C1 n.

Assume T(n, h) ≤ C2 n log h holds for all n ≤ N, h ≤ H. Then we have

T(n, h) ≤ C2 (n/2) log h1 + C2 (n/2) log h2 + C1 n
        ≤ C2 n log((h1 + h2)/2) + C1 n        (by concavity of log)
        ≤ C2 n log(h/2) + C1 n
        = C2 n log h − C2 n + C1 n,

which is at most C2 n log h as long as C2 ≥ C1. Then if C2 = max(C1, T(3, 2)), both the inductive step (since C2 ≥ C1) and the base case n = 3, h = 2 (where C2 n log h = 3 C2 ≥ T(3, 2)) hold, so we have T(n, h) ≤ C2 n log h. By the principle of induction, T(n, h) = O(n log h) is true for all n, h, with constant C2 = max(C1, T(3, 2)).
Problem 2
Consider a set of 2D linear constraints {a_i x + b_i y ≤ c_i, i = 1, 2, ⋯}. Given a point (x∗, y∗), how do you prove that it satisfies all the constraints, or find a violating inequality?
Proof 2
Since we need to look through all the constraints before giving an answer, the complexity is at least linear in N (i.e., Ω(N)), where N is the number of constraints. Then, for one query, we simply check all the constraints one by one. If the point satisfies a_i x∗ + b_i y∗ ≤ c_i for every i, it satisfies all the constraints; otherwise, the first index i with a_i x∗ + b_i y∗ > c_i gives a violating inequality.
Problem 3
What is the time complexity of the above algorithm?
Proof 3
Since we simply check all the constraints in a single pass, with no recursion or nested loops, the time complexity is O(N).
Problem 4
Consider a computer system with memory of size √n and hard disk of size n. How do you maintain a database that supports find-median, insertion, and delete-median operations? Do it with the best complexity you can achieve.
Proof 4
First, since the computer system involves hard disks, the time consumed by the program is dominated by communication with the hard disk. The time complexity is therefore roughly linear in the number of disk accesses.
Moreover, since we need to analyze the complexity, some assumptions about the database are needed. We assume that the input of the database is random, so the probability that the inserted sequence is sorted is very small. We also assume that the operations are chosen randomly, so the probability of a long run of successive delete-median operations is small. Both assumptions are a kind of uniform-distribution assumption, which is commonly used when no prior knowledge exists.
The algorithm is designed as follows. Since memory operations take far less time than hard disk operations, we want to keep the most useful data in memory. Specifically, the database is divided into three parts:

1. L: the numbers smaller than everything in S, stored on the hard disk;
2. S: a buffer of roughly √n numbers around the median, kept in memory;
3. H: the numbers larger than everything in S, stored on the hard disk.

For find-median operations, the complexity is O(1): as long as ||L| − |H|| ≤ |S| − 1, the median lies in S, which is in memory.
For insertion operations, as stated in the lecture, the probability of ||L| − |H|| ≥ |S| − 1 is small if |S| is roughly √n. In the usual case, we simply insert the number into either L, S, or H, depending on the comparison between the inserted number and the smallest and largest elements of S.
For the worst cases, where ||L| − |H|| ≥ |S| − 1, we can either sacrifice accuracy for efficiency or spend more time to guarantee a correct answer. If we prefer efficiency, we simply ignore the worst cases, and the output of the algorithm is still correct with high probability. If we need an absolutely correct answer, after inserting the number into either L or H, we rebalance by moving elements between S and the disk until ||L| − |H|| ≤ |S| − 1 holds again, at the cost of extra disk accesses.
For delete-median operations, we simply find the median and delete it in memory. Since the operations are taken randomly, the probability of successive deletions reducing the size of S from roughly √n to zero is small; when S does become empty, we refill it from the disk.
In conclusion, if we need an absolutely correct answer, the average complexities of the three operations are O(1), but the worst-case complexities of insertion and delete-median are O(log N). If we can accept an answer that is correct with high probability, the complexities of all three operations are O(1).
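The three-part layout above can be sketched as a toy in-memory model, where sorted Python lists stand in for the on-disk halves L and H; the class name and the rebalancing policy (rebuild S around the median on the slow path) are illustrative assumptions, not the lecture's exact scheme:

```python
import bisect

class MedianDB:
    def __init__(self):
        self.L = []   # sorted, "on disk": elements below S
        self.S = []   # sorted, in memory: window of ~sqrt(n) around the median
        self.H = []   # sorted, "on disk": elements above S

    def _rebalance(self):
        # Slow path ("worst case" in the text): rebuild S around the median.
        items = sorted(self.L + self.S + self.H)   # simulates disk traffic
        n = len(items)
        w = max(1, int(n ** 0.5))                  # window of roughly sqrt(n)
        mid = (n - 1) // 2
        lo, hi = max(0, mid - w), min(n, mid + w + 1)
        self.L, self.S, self.H = items[:lo], items[lo:hi], items[hi:]

    def insert(self, x):
        if not self.S or self.S[0] <= x <= self.S[-1]:
            bisect.insort(self.S, x)   # usual case: stays in memory
        elif x < self.S[0]:
            bisect.insort(self.L, x)   # below the window -> "disk"
        else:
            bisect.insort(self.H, x)   # above the window -> "disk"

    def find_median(self):
        n = len(self.L) + len(self.S) + len(self.H)
        mid = (n - 1) // 2
        if not (len(self.L) <= mid < len(self.L) + len(self.S)):
            self._rebalance()          # median drifted out of S
        return self.S[mid - len(self.L)]

    def delete_median(self):
        m = self.find_median()         # guarantees the median is in S
        self.S.remove(m)
        return m

db = MedianDB()
for x in [5, 1, 9, 3, 7]:
    db.insert(x)
print(db.find_median())   # 5
```

In the common case, find-median and delete-median touch only S, matching the O(1) average claim; the rebalance simulates the expensive disk round-trip of the worst case.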
Problem 6
Consider a series-parallel graph. Design your database for shortest-path queries on this graph.
Proof 6
Firstly, the professor said in the WeChat group that designing a database here means designing a data structure, so this solution focuses only on the data structure design.
Since only the information of shortest paths is needed, all self-loops can be removed and each bundle of parallel edges replaced by its shortest member. Then the maximum size of storing a graph is O(N²), where N is the number of vertices. A simple way to design the database is to store the adjacency matrix and compute the shortest path for every query. The space needed is O(N²), and each query runs a shortest-path computation (e.g., Dijkstra's algorithm on the matrix) in O(N²) time.
However, the previous design doesn't make use of the properties of series-parallel graphs. Since there is no formal definition of series-parallel graphs in the lecture, I searched the Internet and found two equivalent definitions. The first defines two combining operations: combining in series and combining in parallel. The combination of two series-parallel graphs is also a series-parallel graph, and the base case is a single edge. The other definition constructs a series-parallel graph by applying a sequence of the following operations, starting from a single edge: 1. add a self-loop, 2. add a parallel edge, 3. subdivide an edge (http://www.cse.iitd.ernet.in/~naveen/courses/CSL851/jatin.pdf).
From the second definition, we can derive that the number of edges of a series-parallel graph with no parallel edges and self-loops is at most 2N. The proof is as follows. Because subdividing an edge is the only operation that adds a vertex, the number of subdivision operations is at most N − 2. Moreover, since no self-loops or parallel edges remain, each add-self-loop or add-parallel-edge operation must be matched by a subdivision operation that removes the self-loop or parallel edge it created, so the total number of the first two operations is also at most N. Each operation creates one edge, so the number of edges of such a series-parallel graph is at most 2N.
Now, using the first definition of series-parallel graphs, we can derive that the number of combining operations is at most 3N. The proof is as follows. Since the number of edges is at most 2N, before any combining operations the graph decomposes into at most 2N single edges with at most 4N endpoints. Combining in series reduces the vertex count by 1 and combining in parallel reduces it by 2, so the maximum number of operations is 3N.
Finally, we design the database. The database only contains the sequence of combining operations, each of which can be recorded by three numbers: the operation type and the indices of its two components. For simplicity, the index of a combining operation's output is the smaller of its two components' indices. Then the graph can be stored in O(N) space, which is much smaller than O(N²).
Moreover, by storing the sequence of combining operations, the time complexity of each shortest-path query is reduced to O(N). Suppose we want to find the shortest path between v and w. We replay the O(N) combining operations in order, maintaining the shortest paths between v, w, and the sources and sinks of the components that currently contain them; each combining operation updates this information in O(1) time using the rules described below, so the total query time is O(N).
In conclusion, by storing the sequences of combining operations, the database requires only O(N) space, and the time complexity for each query is O(N) .
The update rules are simple but tedious, so only the time complexity was stated above; we now describe them in detail.
For a combining-in-series operation on sp-graphs i and j (i < j), suppose we glue the sink t_i of sp-graph i to the source s_j of sp-graph j (vice versa is similar). The shortest path between the combined sp-graph's source s′ and sink t′ is the concatenation of the two components' shortest paths: P(s′, t′) = P(s_i, t_i) ∪ P(s_j, t_j). If v is in sp-graph i (similarly for j), then P(v, t′) is updated to P(v, t_i) ∪ P(s_j, t_j), and P(v, s′) remains P(v, s_i); the same holds for w. If v is in sp-graph i and w is in sp-graph j (vice versa), P(v, w) is updated to P(v, t_i) ∪ P(s_j, w).
For a combining-in-parallel operation on sp-graphs i and j (i < j), suppose we glue the sources s_i, s_j together and the sinks t_i, t_j together. The shortest path between the combined sp-graph's source s′ and sink t′ is the shorter of the two components' shortest paths: P(s′, t′) = min(P(s_i, t_i), P(s_j, t_j)). If v is in sp-graph i (similarly for j), P(v, s′) is updated to min(P(v, s_i), P(v, t_i) ∪ P(t_j, s_j)), and symmetrically for P(v, t′); the same holds for w. If v is in sp-graph i and w is in sp-graph j (vice versa), the shortest path between v and w goes through either the common source or the common sink: P(v, w) = min(P(v, s_i) ∪ P(s_j, w), P(v, t_i) ∪ P(t_j, w)). If v and w are both in sp-graph i (similarly for j), P(v, w) is updated to min(P(v, w), P(v, s_i) ∪ P(s_j, t_j) ∪ P(t_i, w), P(v, t_i) ∪ P(t_j, s_j) ∪ P(s_i, w)).
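The source-to-sink part of these rules, P(s′, t′), is the simplest to illustrate. The sketch below replays a combination sequence and tracks only the source-to-sink shortest distance of each component; the encoding of the operation records (op type, component indices, result keeps the smaller index) follows the design above, while the function name and example weights are assumptions:

```python
def st_distance(ops, edge_weight):
    """ops: list of ("S"|"P", i, j) with i < j; the result keeps index i.
    edge_weight: source-to-sink weight of each base edge, keyed by index."""
    d = dict(edge_weight)
    for op, i, j in ops:
        if op == "S":                 # series: the path must cross both parts
            d[i] = d[i] + d[j]
        else:                         # parallel: take the shorter side
            d[i] = min(d[i], d[j])
        del d[j]                      # component j is merged into i
    return d

# Two edges in series (1 + 2), then an edge of weight 5 in parallel.
ops = [("S", 0, 1), ("P", 0, 2)]
weights = {0: 1.0, 1: 2.0, 2: 5.0}
print(st_distance(ops, weights)[0])   # min(1 + 2, 5) = 3.0
```

The full v-to-w query maintains a constant number of such path values per step, which is why each operation costs O(1) and a query costs O(N) overall.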