Big Data Algorithms Homework Notes 1

Problem 1

Complete the proof that $T(n) = O(n \log h)$ and find the best constant in the big-$O$.

Proof 1

As stated in the lecture,

$$T(n,h) \le T(n/2, h_1) + T(n/2, h_2) + \mathrm{tangent}(n/2, n/2),$$

where $h = h_1 + h_2 + 1$ and $\mathrm{tangent}(n/2, n/2) = O(n)$.

Since $\mathrm{tangent}(n/2, n/2) = O(n)$, there exists a constant $C_1$ such that

$$\mathrm{tangent}(n/2, n/2) \le C_1 n.$$

Assume $T(n,h) \le C_2\, n \log h$ holds for all $n < N$, $h < H$; then we have

$$
\begin{aligned}
T(N,H) &\le C_2 \tfrac{N}{2}\log h_1 + C_2 \tfrac{N}{2}\log h_2 + C_1 N \\
&= C_2 \tfrac{N}{2}\log(h_1 h_2) + C_1 N \\
&\le C_2 \tfrac{N}{2}\log\!\left(\tfrac{H}{2}\cdot\tfrac{H}{2}\right) + C_1 N \\
&= C_2 N(\log H - 1) + C_1 N \\
&= C_2 N \log H + (C_1 - C_2) N,
\end{aligned}
$$

where the second inequality uses $h_1 h_2 \le \left(\tfrac{h_1 + h_2}{2}\right)^2 \le \left(\tfrac{H}{2}\right)^2$.

Then, if $C_2 = \max(C_1, T(3,2))$, the last term $(C_1 - C_2)N$ is non-positive and the base case holds, so we have

$$T(N,H) \le C_2 N \log H.$$

By the principle of induction, $T(n,h) = O(n \log h)$ holds for all $n, h$.
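As a sanity check, the recurrence can also be evaluated numerically. The Python sketch below takes the worst case over all splits $h = h_1 + h_2 + 1$; the base case $T = C_1 n$ for small $n$ or $h$ is an assumption for illustration, since the lecture does not specify it. The printed ratio $T(n,h) / (n \log h)$ stays bounded, consistent with the proof:

```python
import math
from functools import lru_cache

C1 = 1.0  # assumed constant with tangent(n/2, n/2) <= C1 * n

@lru_cache(maxsize=None)
def T(n, h):
    """Worst case of the recurrence over all splits h = h1 + h2 + 1."""
    if n <= 3 or h <= 2:
        return C1 * n                    # assumed base case
    return max(T(n // 2, h1) + T(n // 2, h - 1 - h1) + C1 * n
               for h1 in range(1, h - 1))

# The ratio T(n, h) / (n log h) stays bounded:
for k in range(3, 11):
    n, h = 2 ** k, 2 ** (k - 1)
    print(n, h, round(T(n, h) / (n * math.log2(h)), 3))
```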

Problem 2

Consider a set of 2D linear constraints $\{a_i x + b_i y \le c_i,\ i = 1, 2, \ldots\}$. Given a point $(x, y)$, how do you prove that it satisfies all the constraints, or find a violated inequality?

Proof 2

Since we need to look through all the constraints before reaching a conclusion, the complexity is at least linear, i.e. $\Omega(N)$, where $N$ is the number of constraints.

Then, for one query, we simply check all the constraints. If the point $(x, y)$ violates any constraint, the program outputs that constraint; otherwise, the point satisfies all the constraints. The complexity of this program is $O(N)$.
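A minimal sketch of this check in Python (representing each constraint as an $(a_i, b_i, c_i)$ triple is an assumption for illustration):

```python
def find_violation(point, constraints):
    """Scan all constraints once; return a violated (a, b, c) triple,
    or None if the point satisfies every a*x + b*y <= c."""
    x, y = point
    for a, b, c in constraints:
        if a * x + b * y > c:
            return (a, b, c)   # violating inequality found
    return None                # all N constraints satisfied: O(N) scan

# Example: (1, 1) satisfies x - y <= 2 but violates x + y <= 1.
print(find_violation((1, 1), [(1, -1, 2), (1, 1, 1)]))  # -> (1, 1, 1)
```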

Problem 3

What is the time complexity of the solution to the previous problem?

Proof 3

Since we simply check all the constraints in a single pass, with one loop and no recursion, the time complexity is $O(N)$.

Problem 4

Consider a computer system with memory of size $n$ and a hard disk of size $n$. How do you maintain a database that always supports find-median, insertion, and delete-median operations? Or, do it with the best complexity you can achieve.

Proof 4

First, since the computer system involves a hard disk, the time consumed by the program is dominated by communication with the hard disk, so the time complexity is roughly proportional to the number of disk accesses.

Moreover, since we need to analyze the complexity, some assumptions about the database are needed. We assume that the input is random, which implies that the probability of inserting an already-sorted array is very small. We also assume that the operations are chosen randomly, which implies that the probability of many successive delete-median operations is small. Both assumptions are a kind of uniform-distribution assumption, which is commonly used when no prior knowledge exists.

The algorithm is designed as follows. Since memory operations take far less time than hard disk operations, we want to keep the most useful data in memory. Specifically, the database is divided into three parts $L, S, H$, where $L < S < H$ and $||L| - |H|| \le |S| - 1$; $S$ is kept in memory, while $L$ and $H$ reside on disk. Then the median of all the data is the $\left(\lceil (|L| + |S| + |H|)/2 \rceil - |L|\right)$-th smallest number in $S$.
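For illustration, the median's rank inside $S$ can be computed from the three part sizes alone (a sketch; the 1-indexed rank convention is an assumption):

```python
def median_rank_in_S(size_L, size_S, size_H):
    """1-indexed rank, inside the sorted S, of the global median
    (valid whenever ||L| - |H|| <= |S| - 1, so the median lies in S)."""
    n_total = size_L + size_S + size_H
    return (n_total + 1) // 2 - size_L   # = ceil(n_total / 2) - |L|
```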

For find-median operations, the complexity is $O(1)$, since no communication with the hard disk is needed.

For insertion operations, as stated in the lecture, the probability of reaching the boundary case $||L| - |H|| \ge |S| - 1$ is small if $|S|$ is roughly $n$. In the usual case, we simply insert the number into $L$, $S$, or $H$, depending on how it compares with $\max(S)$ and $\min(S)$. If we insert the number into $S$ and $|S| = n$ (memory is full), we move either $\max(S)$ or $\min(S)$ to the disk, whichever makes the sizes of $H$ and $L$ more balanced.
In the worst case, where $||L| - |H|| \ge |S| - 1$, we can either sacrifice accuracy for efficiency or spend more time to guarantee a correct answer. If we prefer efficiency, we simply ignore the worst case, and the output of the algorithm is still correct with high probability. If we need an absolutely correct answer, then after inserting the number into $L$ or $H$, the maximum of $L$ or the minimum of $H$ should be brought back into memory, and furthermore the maximum or minimum of $S$ should be moved to the disk, to guarantee that $||L| - |H|| \le |S| - 1$ and $L < S < H$. The complexity of finding the maximum or minimum is at most $O(N)$, where $N$ is the total number of data items; if we maintain a data structure such as a heap, the complexity is reduced to $O(\log N)$. In conclusion, since the probability of the worst case is very small, the average complexity of insertion is $O(1)$, but the worst-case complexity is $O(\log N)$.

For delete-median operations, we simply find the median and delete it in memory. Since the operations are chosen randomly, the probability of reducing the size of $S$ from $n$ to $n/2$ is $(2/3)^{n/2}$, which is quite small when $n$ is large. But if we need the correct answer, in the worst case we must bring the maximum of $L$ or the minimum of $H$ into memory to maintain $||L| - |H|| \le |S| - 1$. So the average complexity of delete-median is $O(1)$, while in the worst case the complexity is $O(\log N)$.
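The following Python sketch simulates the whole scheme in memory: the heaps `L` and `H` stand in for the data on disk, and the sorted list `S` for the data in memory. The capacity constant and the eviction rule are assumptions for illustration, not the lecture's exact procedure:

```python
import heapq
from bisect import insort

class MedianDB:
    """In-memory simulation of the L < S < H scheme (a sketch)."""

    MEM_CAP = 16            # assumed memory capacity n (illustrative)

    def __init__(self):
        self.L = []         # max-heap (values negated), "on disk"
        self.S = []         # sorted middle part, "in memory"
        self.H = []         # min-heap, "on disk"

    def _median_index(self):
        # 0-indexed position of the global median inside S.
        n = len(self.L) + len(self.S) + len(self.H)
        return (n + 1) // 2 - len(self.L) - 1

    def find_median(self):              # O(1): no "disk" access
        return self.S[self._median_index()]

    def insert(self, x):
        if self.S and x < self.S[0]:    # below min(S): one "disk" write
            heapq.heappush(self.L, -x)
        elif self.S and x > self.S[-1]: # above max(S): one "disk" write
            heapq.heappush(self.H, x)
        else:
            insort(self.S, x)
            if len(self.S) > self.MEM_CAP:   # memory full: evict one end
                if len(self.L) <= len(self.H):
                    heapq.heappush(self.L, -self.S.pop(0))
                else:
                    heapq.heappush(self.H, self.S.pop())
        self._rebalance()

    def delete_median(self):
        m = self.S.pop(self._median_index())
        self._rebalance()
        return m

    def _rebalance(self):
        # Worst case of the text: restore ||L| - |H|| <= |S| - 1 by
        # pulling items back from "disk" (O(log N) per heap operation).
        while abs(len(self.L) - len(self.H)) > max(len(self.S) - 1, 0):
            if len(self.L) > len(self.H):
                insort(self.S, -heapq.heappop(self.L))
            else:
                insort(self.S, heapq.heappop(self.H))
```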

In conclusion, if we need an absolutely correct answer, the average complexity of all three operations is $O(1)$, but the worst-case complexity of insertion and delete-median is $O(\log N)$. If we can accept an answer that is correct with high probability, the complexity of all three operations is $O(1)$.

Problem 6

Consider a series-parallel graph. Design your database for shortest-path queries on this graph.

Proof 6

First, the professor said in the WeChat group that designing a database here means designing a data structure, so this solution focuses only on the data structure design.

Since only shortest-path information is needed, all parallel edges and self-loops can be removed. Then the maximum space for storing the graph is $O(N^2)$, where $N$ is the number of vertices. A simple design is to store the adjacency matrix and compute the shortest path for every query. The space needed is $O(N^2)$, and the time complexity of each query is roughly $O(N \log N)$ (heap-based Dijkstra over the graph's $O(N)$ edges).
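A sketch of this baseline in Python, assuming edge lengths are stored in the matrix with `None` marking a missing edge; note that scanning full matrix rows makes each query $O(N^2)$ in the worst case, while adjacency lists over the graph's $O(N)$ edges would give the $O(N \log N)$ bound:

```python
import heapq

def matrix_dijkstra(adj, src, dst):
    """Heap-based Dijkstra over a stored adjacency matrix;
    adj[u][v] is the edge length, or None if there is no edge."""
    n = len(adj)
    dist = [float('inf')] * n
    dist[src] = 0
    pq = [(0, src)]                      # (distance, vertex)
    while pq:
        du, u = heapq.heappop(pq)
        if u == dst:
            return du                    # settled: shortest distance found
        if du > dist[u]:
            continue                     # stale heap entry
        for v, w in enumerate(adj[u]):   # scanning the row costs O(N)
            if w is not None and du + w < dist[v]:
                dist[v] = du + w
                heapq.heappush(pq, (dist[v], v))
    return float('inf')                  # dst unreachable
```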

However, the previous design does not exploit the properties of series-parallel graphs. Since there is no formal definition of a series-parallel graph in the lecture, I searched the Internet and found two equivalent definitions. The first defines two combining operations, combining in series and combining in parallel: the combination of two series-parallel graphs is again a series-parallel graph, and the base case is a single edge. The other definition constructs a series-parallel graph by applying a sequence of the following operations, starting from a single edge: 1. add a self-loop; 2. add a parallel edge; 3. subdivide an edge (http://www.cse.iitd.ernet.in/~naveen/courses/CSL851/jatin.pdf).

From the second definition, we can derive that the number of edges of a series-parallel graph with no parallel edges and self-loops is at most $2N$. The proof is as follows. Because subdividing an edge is the only operation that adds a vertex, the number of subdivision operations is at most $N - 2$. Moreover, since no self-loops or parallel edges exist, each add-self-loop or add-parallel-edge operation must be followed by a subdivision that removes the self-loop or parallel edge, so the total number of the first two operations is at most $N$. Each operation creates one edge, so the number of edges of a series-parallel graph is at most $2N$.

Now, using the first definition of a series-parallel graph, we can derive that the number of combining operations is at most $3N$. The proof is as follows. Since the number of edges is at most $2N$, before any combining operations the graph decomposes into at most $2N$ single edges with $4N$ endpoints in total. A combining-in-series operation reduces the number of vertices by $1$, and a combining-in-parallel operation reduces it by $2$. Hence the maximum number of operations is $3N$.

Finally, we design the database. The database only needs to contain the sequence of combining operations, each of which can be recorded by three numbers: the operation type and the indices of its two components. For simplicity, the index of a combining operation's output is the smaller index of its two components. The graph can therefore be stored in $O(N)$ space, which is much smaller than $O(N^2)$.
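For example, the stored sequence could look like this (the tuple layout is a hypothetical encoding consistent with the description above):

```python
# Hypothetical encoding: one record per combining operation,
# (op_type, i, j), where 'S' = series, 'P' = parallel, and the
# merged sp-graph keeps the smaller index min(i, j).
ops = [
    ('S', 0, 1),   # combine sp-graphs 0 and 1 in series   -> stored as 0
    ('P', 0, 2),   # combine sp-graphs 0 and 2 in parallel -> stored as 0
]
```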

Moreover, by storing the sequence of combining operations, the time complexity of each shortest-path query is reduced to $O(N)$. Suppose we want to find the shortest path between $v$ and $w$. Before applying the combining operations, there are at most $3N$ sp-graphs (series-parallel graphs), all of which are single edges. We first initialize the following variables in $O(N)$ time: the shortest path from $s_i$ to $t_i$ for each sp-graph; the shortest paths from $v$ to its sp-graph's source $s_v$ and sink $t_v$; and the shortest paths from $w$ to its sp-graph's source $s_w$ and sink $t_w$. The shortest path from $v$ to $w$ is initialized to NULL. For each combining operation, the above variables are updated in $O(1)$ time, so the total time complexity of each shortest-path query is $O(N)$.

In conclusion, by storing the sequence of combining operations, the database requires only $O(N)$ space, and the time complexity of each query is $O(N)$.

The update rules are simple but tedious, so only the time complexity was stated above; we describe them here in detail, followed by a code sketch.
For a combining-in-series operation on sp-graphs $i$ and $j$ ($i < j$), suppose we glue the sink $t_i$ of sp-graph $i$ to the source $s_j$ of sp-graph $j$ (the other orientation is symmetric). The shortest path between the new combined sp-graph's source and sink (relabeled $s_i$ and $t_i$, since the result keeps index $i$) is the concatenation of the two components' shortest paths: $P(s_i, t_i) \leftarrow P(s_i, t_i) \circ P(s_j, t_j)$. If $v$ is in sp-graph $i$ (similarly for $j$), then $P(v, t_v)$ is updated to $P(v, t_i) \circ P(s_j, t_j)$, and $P(v, s_v)$ remains $P(v, s_i)$. The updates are similar if $w$ is in sp-graph $i$ or $j$. If $v$ is in sp-graph $i$ and $w$ is in sp-graph $j$ (or vice versa), $P(v, w)$ is updated to $P(v, t_i) \circ P(s_j, w)$.
For a combining-in-parallel operation on sp-graphs $i$ and $j$ ($i < j$), suppose we glue the sources $s_i, s_j$ together and the sinks $t_i, t_j$ together. The shortest path between the new combined sp-graph's source and sink is the shorter of the two components' shortest paths: $P(s_i, t_i) \leftarrow \min(P(s_i, t_i), P(s_j, t_j))$. If $v$ is in sp-graph $i$ (similarly for $j$), $P(v, s_v)$ is updated to $\min(P(v, s_i),\ P(v, t_i) \circ P(t_j, s_j))$, and symmetrically for $P(v, t_v)$. The updates are similar if $w$ is in sp-graph $i$ or $j$. If $v$ is in sp-graph $i$ and $w$ is in sp-graph $j$ (or vice versa), the shortest path between $v$ and $w$ passes through either the common source or the common sink: $P(v, w) \leftarrow \min(P(v, s_i) \circ P(s_j, w),\ P(v, t_i) \circ P(t_j, w))$. If $v$ and $w$ are both in sp-graph $i$ (similarly for $j$), $P(v, w)$ should be updated to $\min(P(v, w),\ P(v, s_i) \circ P(s_j, t_j) \circ P(t_i, w),\ P(v, t_i) \circ P(t_j, s_j) \circ P(s_i, w))$.
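The sketch below implements these update rules in Python for distances only (recovering the actual paths would need extra bookkeeping). The input format, in particular that $v$ and $w$ come with the index of their initial single-edge sp-graph and their distances to its source and sink, is an assumption for illustration:

```python
import math

def sp_query(lengths, ops, v_info, w_info):
    """Shortest v-w distance from the stored operation sequence (a sketch).

    lengths -- lengths[i]: length of the initial single-edge sp-graph i
    ops     -- list of ('S', i, j) or ('P', i, j); in a series merge the
               sink t_i is glued to the source s_j, and the merged graph
               keeps index min(i, j)
    v_info  -- (initial sp-graph of v, dist(v, source), dist(v, sink));
               w_info likewise
    """
    d = list(lengths)                    # d[i] = P(s_i, t_i) so far
    cv, dvs, dvt = v_info
    cw, dws, dwt = w_info
    # If v and w start on the same single edge, their distance is known.
    ans = min(dvs + dws, dvt + dwt) if cv == cw else math.inf

    for op, i, j in ops:
        k = min(i, j)
        if op == 'S':                              # series: t_i glued to s_j
            if cv == i and cw == j:                # P(v,t_i) then P(s_j,w)
                ans = min(ans, dvt + dws)
            if cw == i and cv == j:
                ans = min(ans, dwt + dvs)
            if cv == i:                            # v keeps its source,
                dvt, cv = dvt + d[j], k            # sink moves to t_j
            elif cv == j:                          # v keeps its sink,
                dvs, cv = dvs + d[i], k            # source moves to s_i
            if cw == i:
                dwt, cw = dwt + d[j], k
            elif cw == j:
                dws, cw = dws + d[i], k
            d[k] = d[i] + d[j]
        else:                                      # parallel: glue s's, t's
            if (cv, cw) in ((i, j), (j, i)):       # via common source/sink
                ans = min(ans, dvs + dws, dvt + dwt)
            if cv == cw == i:                      # detour through branch j
                ans = min(ans, dvs + d[j] + dwt, dvt + d[j] + dws)
            if cv == cw == j:                      # detour through branch i
                ans = min(ans, dvs + d[i] + dwt, dvt + d[i] + dws)
            if cv == i:
                dvs, dvt = min(dvs, dvt + d[j]), min(dvt, dvs + d[j])
            elif cv == j:
                dvs, dvt = min(dvs, dvt + d[i]), min(dvt, dvs + d[i])
            if cw == i:
                dws, dwt = min(dws, dwt + d[j]), min(dwt, dws + d[j])
            elif cw == j:
                dws, dwt = min(dws, dwt + d[i]), min(dwt, dws + d[i])
            if cv in (i, j):
                cv = k
            if cw in (i, j):
                cw = k
            d[k] = min(d[i], d[j])
    return ans

# Two parallel edges of lengths 3 and 5 between the same endpoints:
# the source-sink distance is min(3, 5) = 3.
print(sp_query([3, 5], [('P', 0, 1)], (0, 0, 3), (0, 3, 0)))  # -> 3
```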
