Big Data Algorithms Homework Notes 1

Problem 1

Complete the proof that $T(n) = O(n \log h)$ and find the best constant in the big-$O$.

Proof 1

As stated in the lecture,

$$T(n,h) \le T(n/2, h_1) + T(n/2, h_2) + \mathrm{tangent}(n/2, n/2),$$

where $h = h_1 + h_2 + 1$ and $\mathrm{tangent}(n/2, n/2) = O(n)$.

Since $\mathrm{tangent}(n/2, n/2) = O(n)$, there exists a constant $C_1$ such that

$$\mathrm{tangent}(n/2, n/2) \le C_1 n.$$

Assume $T(n,h) \le C_2\, n \log h$ holds for all $n < N$, $h < H$; then we have

$$
\begin{aligned}
T(N,H) &\le C_2 \tfrac{N}{2}\log h_1 + C_2 \tfrac{N}{2}\log h_2 + C_1 N \\
&= C_2 \tfrac{N}{2}\log(h_1 h_2) + C_1 N \\
&\le C_2 \tfrac{N}{2}\log\!\left(\tfrac{H}{2}\cdot\tfrac{H}{2}\right) + C_1 N \\
&= C_2 N(\log H - 1) + C_1 N \\
&= C_2 N \log H + (C_1 - C_2) N,
\end{aligned}
$$

where the second inequality uses $h_1 h_2 \le \left(\tfrac{h_1 + h_2}{2}\right)^2 \le \left(\tfrac{H}{2}\right)^2$.

Then, if $C_2 = \max(C_1, T(3,2))$, the last term $(C_1 - C_2)N$ is non-positive and the base case holds, so we have

$$T(N,H) \le C_2 N \log H.$$

By the principle of induction, $T(n,h) = O(n \log h)$ holds for all $n, h$.
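As a sanity check, the recurrence can also be evaluated numerically. The Python sketch below takes the worst case over all splits $h = h_1 + h_2 + 1$; the base case $T = C_1 n$ for small $n$ or $h$ is an assumption for illustration, since the lecture does not specify it. The printed ratio $T(n,h) / (n \log h)$ stays bounded, consistent with the proof:

```python
import math
from functools import lru_cache

C1 = 1.0  # assumed constant with tangent(n/2, n/2) <= C1 * n

@lru_cache(maxsize=None)
def T(n, h):
    """Worst case of the recurrence over all splits h = h1 + h2 + 1."""
    if n <= 3 or h <= 2:
        return C1 * n                    # assumed base case
    return max(T(n // 2, h1) + T(n // 2, h - 1 - h1) + C1 * n
               for h1 in range(1, h - 1))

# The ratio T(n, h) / (n log h) stays bounded:
for k in range(3, 11):
    n, h = 2 ** k, 2 ** (k - 1)
    print(n, h, round(T(n, h) / (n * math.log2(h)), 3))
```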

Problem 2

Consider a set of 2D linear constraints $\{a_i x + b_i y \le c_i,\ i = 1, 2, \ldots\}$. Given a point $(x, y)$, how do you prove that it satisfies all the constraints, or find a violated inequality?

Proof 2

Since we need to look through all the constraints before reaching a conclusion, the complexity is at least linear, i.e. $\Omega(N)$, where $N$ is the number of constraints.

Then, for one query, we simply check all the constraints. If the point $(x, y)$ violates any constraint, the program outputs that constraint; otherwise, the point satisfies all the constraints. The complexity of this program is $O(N)$.
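A minimal sketch of this check in Python (representing each constraint as an $(a_i, b_i, c_i)$ triple is an assumption for illustration):

```python
def find_violation(point, constraints):
    """Scan all constraints once; return a violated (a, b, c) triple,
    or None if the point satisfies every a*x + b*y <= c."""
    x, y = point
    for a, b, c in constraints:
        if a * x + b * y > c:
            return (a, b, c)   # violating inequality found
    return None                # all N constraints satisfied: O(N) scan

# Example: (1, 1) satisfies x - y <= 2 but violates x + y <= 1.
print(find_violation((1, 1), [(1, -1, 2), (1, 1, 1)]))  # -> (1, 1, 1)
```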

Problem 3

What is the time complexity of the solution to the previous problem?

Proof 3

Since we simply check all the constraints in a single pass, with one loop and no recursion, the time complexity is $O(N)$.

Problem 4

Consider a computer system with memory of size $n$ and a hard disk of size $n$. How do you maintain a database that always supports find-median, insertion, and delete-median operations? Or, do it with the best complexity you can achieve.

Proof 4

First, since the computer system involves a hard disk, the time consumed by the program is dominated by communication with the hard disk, so the time complexity is roughly proportional to the number of disk accesses.

Moreover, since we need to analyze the complexity, some assumptions about the database are needed. We assume that the input is random, which implies that the probability of inserting an already-sorted array is very small. We also assume that the operations are chosen randomly, which implies that the probability of many successive delete-median operations is small. Both assumptions are a kind of uniform-distribution assumption, which is commonly used when no prior knowledge exists.

The algorithm is designed as follows. Since memory operations take far less time than hard disk operations, we want to keep the most useful data in memory. Specifically, the database is divided into three parts $L, S, H$, where $L < S < H$ and $||L| - |H|| \le |S| - 1$; $S$ is kept in memory, while $L$ and $H$ reside on disk. Then the median of all the data is the $\left(\lceil (|L| + |S| + |H|)/2 \rceil - |L|\right)$-th smallest number in $S$.
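For illustration, the median's rank inside $S$ can be computed from the three part sizes alone (a sketch; the 1-indexed rank convention is an assumption):

```python
def median_rank_in_S(size_L, size_S, size_H):
    """1-indexed rank, inside the sorted S, of the global median
    (valid whenever ||L| - |H|| <= |S| - 1, so the median lies in S)."""
    n_total = size_L + size_S + size_H
    return (n_total + 1) // 2 - size_L   # = ceil(n_total / 2) - |L|
```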

For find-median operations, the complexity is $O(1)$, since no communication with the hard disk is needed.

For insertion operations, as stated in the lecture, the probability of reaching the boundary case $||L| - |H|| \ge |S| - 1$ is small if $|S|$ is roughly $n$. In the usual case, we simply insert the number into $L$, $S$, or $H$, depending on how it compares with $\max(S)$ and $\min(S)$. If we insert the number into $S$ and $|S| = n$ (memory is full), we move either $\max(S)$ or $\min(S)$ to the disk, whichever makes the sizes of $H$ and $L$ more balanced.
In the worst case, where $||L| - |H|| \ge |S| - 1$, we can either sacrifice accuracy for efficiency or spend more time to guarantee a correct answer. If we prefer efficiency, we simply ignore the worst case, and the output of the algorithm is still correct with high probability. If we need an absolutely correct answer, then after inserting the number into $L$ or $H$, the maximum of $L$ or the minimum of $H$ should be brought back into memory, and furthermore the maximum or minimum of $S$ should be moved to the disk, to guarantee that $||L| - |H|| \le |S| - 1$ and $L < S < H$. The complexity of finding the maximum or minimum is at most $O(N)$, where $N$ is the total number of data items; if we maintain a data structure such as a heap, the complexity is reduced to $O(\log N)$. In conclusion, since the probability of the worst case is very small, the average complexity of insertion is $O(1)$, but the worst-case complexity is $O(\log N)$.

For delete-median operations, we simply find the median and delete it in memory. Since the operations are chosen randomly, the probability of reducing the size of $S$ from $n$ to $n/2$ is $(2/3)^{n/2}$, which is quite small when $n$ is large. But if we need the correct answer, in the worst case we must bring the maximum of $L$ or the minimum of $H$ into memory to maintain $||L| - |H|| \le |S| - 1$. So the average complexity of delete-median is $O(1)$, while in the worst case the complexity is $O(\log N)$.
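The following Python sketch simulates the whole scheme in memory: the heaps `L` and `H` stand in for the data on disk, and the sorted list `S` for the data in memory. The capacity constant and the eviction rule are assumptions for illustration, not the lecture's exact procedure:

```python
import heapq
from bisect import insort

class MedianDB:
    """In-memory simulation of the L < S < H scheme (a sketch)."""

    MEM_CAP = 16            # assumed memory capacity n (illustrative)

    def __init__(self):
        self.L = []         # max-heap (values negated), "on disk"
        self.S = []         # sorted middle part, "in memory"
        self.H = []         # min-heap, "on disk"

    def _median_index(self):
        # 0-indexed position of the global median inside S.
        n = len(self.L) + len(self.S) + len(self.H)
        return (n + 1) // 2 - len(self.L) - 1

    def find_median(self):              # O(1): no "disk" access
        return self.S[self._median_index()]

    def insert(self, x):
        if self.S and x < self.S[0]:    # below min(S): one "disk" write
            heapq.heappush(self.L, -x)
        elif self.S and x > self.S[-1]: # above max(S): one "disk" write
            heapq.heappush(self.H, x)
        else:
            insort(self.S, x)
            if len(self.S) > self.MEM_CAP:   # memory full: evict one end
                if len(self.L) <= len(self.H):
                    heapq.heappush(self.L, -self.S.pop(0))
                else:
                    heapq.heappush(self.H, self.S.pop())
        self._rebalance()

    def delete_median(self):
        m = self.S.pop(self._median_index())
        self._rebalance()
        return m

    def _rebalance(self):
        # Worst case of the text: restore ||L| - |H|| <= |S| - 1 by
        # pulling items back from "disk" (O(log N) per heap operation).
        while abs(len(self.L) - len(self.H)) > max(len(self.S) - 1, 0):
            if len(self.L) > len(self.H):
                insort(self.S, -heapq.heappop(self.L))
            else:
                insort(self.S, heapq.heappop(self.H))
```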

In conclusion, if we need an absolutely correct answer, the average complexity of all three operations is $O(1)$, but the worst-case complexity of insertion and delete-median is $O(\log N)$. If we can accept an answer that is correct with high probability, the complexity of all three operations is $O(1)$.

Problem 6

Consider a series-parallel graph. Design your database for shortest-path queries on this graph.

Proof 6

First, the professor said in the WeChat group that designing a database here means designing a data structure, so this solution focuses only on the data structure design.

Since only shortest-path information is needed, all parallel edges and self-loops can be removed. Then the maximum space for storing the graph is $O(N^2)$, where $N$ is the number of vertices. A simple design is to store the adjacency matrix and compute the shortest path for every query. The space needed is $O(N^2)$, and the time complexity of each query is roughly $O(N \log N)$ (heap-based Dijkstra over the graph's $O(N)$ edges).
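A sketch of this baseline in Python, assuming edge lengths are stored in the matrix with `None` marking a missing edge; note that scanning full matrix rows makes each query $O(N^2)$ in the worst case, while adjacency lists over the graph's $O(N)$ edges would give the $O(N \log N)$ bound:

```python
import heapq

def matrix_dijkstra(adj, src, dst):
    """Heap-based Dijkstra over a stored adjacency matrix;
    adj[u][v] is the edge length, or None if there is no edge."""
    n = len(adj)
    dist = [float('inf')] * n
    dist[src] = 0
    pq = [(0, src)]                      # (distance, vertex)
    while pq:
        du, u = heapq.heappop(pq)
        if u == dst:
            return du                    # settled: shortest distance found
        if du > dist[u]:
            continue                     # stale heap entry
        for v, w in enumerate(adj[u]):   # scanning the row costs O(N)
            if w is not None and du + w < dist[v]:
                dist[v] = du + w
                heapq.heappush(pq, (dist[v], v))
    return float('inf')                  # dst unreachable
```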

However, the previous design does not exploit the properties of series-parallel graphs. Since there is no formal definition of a series-parallel graph in the lecture, I searched the Internet and found two equivalent definitions. The first defines two combining operations, combining in series and combining in parallel: the combination of two series-parallel graphs is again a series-parallel graph, and the base case is a single edge. The other definition constructs a series-parallel graph by applying a sequence of the following operations, starting from a single edge: 1. add a self-loop; 2. add a parallel edge; 3. subdivide an edge (http://www.cse.iitd.ernet.in/~naveen/courses/CSL851/jatin.pdf).

From the second definition, we can derive that the number of edges of a series-parallel graph with no parallel edges and self-loops is at most $2N$. The proof is as follows. Because subdividing an edge is the only operation that adds a vertex, the number of subdivision operations is at most $N - 2$. Moreover, since no self-loops or parallel edges exist, each add-self-loop or add-parallel-edge operation must be followed by a subdivision that removes the self-loop or parallel edge, so the total number of the first two operations is at most $N$. Each operation creates one edge, so the number of edges of a series-parallel graph is at most $2N$.

Now, using the first definition of a series-parallel graph, we can derive that the number of combining operations is at most $3N$. The proof is as follows. Since the number of edges is at most $2N$, before any combining operations the graph decomposes into at most $2N$ single edges with $4N$ endpoints in total. A combining-in-series operation reduces the number of vertices by $1$, and a combining-in-parallel operation reduces it by $2$. Hence the maximum number of operations is $3N$.

Finally, we design the database. The database only needs to contain the sequence of combining operations, each of which can be recorded by three numbers: the operation type and the indices of its two components. For simplicity, the index of a combining operation's output is the smaller index of its two components. The graph can therefore be stored in $O(N)$ space, which is much smaller than $O(N^2)$.
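For example, the stored sequence could look like this (the tuple layout is a hypothetical encoding consistent with the description above):

```python
# Hypothetical encoding: one record per combining operation,
# (op_type, i, j), where 'S' = series, 'P' = parallel, and the
# merged sp-graph keeps the smaller index min(i, j).
ops = [
    ('S', 0, 1),   # combine sp-graphs 0 and 1 in series   -> stored as 0
    ('P', 0, 2),   # combine sp-graphs 0 and 2 in parallel -> stored as 0
]
```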

Moreover, by storing the sequence of combining operations, the time complexity of each shortest-path query is reduced to $O(N)$. Suppose we want to find the shortest path between $v$ and $w$. Before applying the combining operations, there are at most $3N$ sp-graphs (series-parallel graphs), all of which are single edges. We first initialize the following variables in $O(N)$ time: the shortest path from $s_i$ to $t_i$ for each sp-graph; the shortest paths from $v$ to its sp-graph's source $s_v$ and sink $t_v$; and the shortest paths from $w$ to its sp-graph's source $s_w$ and sink $t_w$. The shortest path from $v$ to $w$ is initialized to NULL. For each combining operation, the above variables are updated in $O(1)$ time, so the total time complexity of each shortest-path query is $O(N)$.

In conclusion, by storing the sequence of combining operations, the database requires only $O(N)$ space, and the time complexity of each query is $O(N)$.

The update rules are simple but tedious, so only the time complexity was stated above; we describe them here in detail, followed by a code sketch.
For a combining-in-series operation on sp-graphs $i$ and $j$ ($i < j$), suppose we glue the sink $t_i$ of sp-graph $i$ to the source $s_j$ of sp-graph $j$ (the other orientation is symmetric). The shortest path between the new combined sp-graph's source and sink (relabeled $s_i$ and $t_i$, since the result keeps index $i$) is the concatenation of the two components' shortest paths: $P(s_i, t_i) \leftarrow P(s_i, t_i) \circ P(s_j, t_j)$. If $v$ is in sp-graph $i$ (similarly for $j$), then $P(v, t_v)$ is updated to $P(v, t_i) \circ P(s_j, t_j)$, and $P(v, s_v)$ remains $P(v, s_i)$. The updates are similar if $w$ is in sp-graph $i$ or $j$. If $v$ is in sp-graph $i$ and $w$ is in sp-graph $j$ (or vice versa), $P(v, w)$ is updated to $P(v, t_i) \circ P(s_j, w)$.
For a combining-in-parallel operation on sp-graphs $i$ and $j$ ($i < j$), suppose we glue the sources $s_i, s_j$ together and the sinks $t_i, t_j$ together. The shortest path between the new combined sp-graph's source and sink is the shorter of the two components' shortest paths: $P(s_i, t_i) \leftarrow \min(P(s_i, t_i), P(s_j, t_j))$. If $v$ is in sp-graph $i$ (similarly for $j$), $P(v, s_v)$ is updated to $\min(P(v, s_i),\ P(v, t_i) \circ P(t_j, s_j))$, and symmetrically for $P(v, t_v)$. The updates are similar if $w$ is in sp-graph $i$ or $j$. If $v$ is in sp-graph $i$ and $w$ is in sp-graph $j$ (or vice versa), the shortest path between $v$ and $w$ passes through either the common source or the common sink: $P(v, w) \leftarrow \min(P(v, s_i) \circ P(s_j, w),\ P(v, t_i) \circ P(t_j, w))$. If $v$ and $w$ are both in sp-graph $i$ (similarly for $j$), $P(v, w)$ should be updated to $\min(P(v, w),\ P(v, s_i) \circ P(s_j, t_j) \circ P(t_i, w),\ P(v, t_i) \circ P(t_j, s_j) \circ P(s_i, w))$.
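The sketch below implements these update rules in Python for distances only (recovering the actual paths would need extra bookkeeping). The input format, in particular that $v$ and $w$ come with the index of their initial single-edge sp-graph and their distances to its source and sink, is an assumption for illustration:

```python
import math

def sp_query(lengths, ops, v_info, w_info):
    """Shortest v-w distance from the stored operation sequence (a sketch).

    lengths -- lengths[i]: length of the initial single-edge sp-graph i
    ops     -- list of ('S', i, j) or ('P', i, j); in a series merge the
               sink t_i is glued to the source s_j, and the merged graph
               keeps index min(i, j)
    v_info  -- (initial sp-graph of v, dist(v, source), dist(v, sink));
               w_info likewise
    """
    d = list(lengths)                    # d[i] = P(s_i, t_i) so far
    cv, dvs, dvt = v_info
    cw, dws, dwt = w_info
    # If v and w start on the same single edge, their distance is known.
    ans = min(dvs + dws, dvt + dwt) if cv == cw else math.inf

    for op, i, j in ops:
        k = min(i, j)
        if op == 'S':                              # series: t_i glued to s_j
            if cv == i and cw == j:                # P(v,t_i) then P(s_j,w)
                ans = min(ans, dvt + dws)
            if cw == i and cv == j:
                ans = min(ans, dwt + dvs)
            if cv == i:                            # v keeps its source,
                dvt, cv = dvt + d[j], k            # sink moves to t_j
            elif cv == j:                          # v keeps its sink,
                dvs, cv = dvs + d[i], k            # source moves to s_i
            if cw == i:
                dwt, cw = dwt + d[j], k
            elif cw == j:
                dws, cw = dws + d[i], k
            d[k] = d[i] + d[j]
        else:                                      # parallel: glue s's, t's
            if (cv, cw) in ((i, j), (j, i)):       # via common source/sink
                ans = min(ans, dvs + dws, dvt + dwt)
            if cv == cw == i:                      # detour through branch j
                ans = min(ans, dvs + d[j] + dwt, dvt + d[j] + dws)
            if cv == cw == j:                      # detour through branch i
                ans = min(ans, dvs + d[i] + dwt, dvt + d[i] + dws)
            if cv == i:
                dvs, dvt = min(dvs, dvt + d[j]), min(dvt, dvs + d[j])
            elif cv == j:
                dvs, dvt = min(dvs, dvt + d[i]), min(dvt, dvs + d[i])
            if cw == i:
                dws, dwt = min(dws, dwt + d[j]), min(dwt, dws + d[j])
            elif cw == j:
                dws, dwt = min(dws, dwt + d[i]), min(dwt, dws + d[i])
            if cv in (i, j):
                cv = k
            if cw in (i, j):
                cw = k
            d[k] = min(d[i], d[j])
    return ans

# Two parallel edges of lengths 3 and 5 between the same endpoints:
# the source-sink distance is min(3, 5) = 3.
print(sp_query([3, 5], [('P', 0, 1)], (0, 0, 3), (0, 3, 0)))  # -> 3
```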
