Review of 4246 Algorithms for Data Science

Alex Tech Bolg

Last modified: 2022-05-08 22:14:52

Category: CU_Courses. Tags: algorithms, data structures, sorting algorithms, algorithm

First published: 2021-12-28 00:40:38

Article link: https://blog.csdn.net/qq_41103204/article/details/122164257

Important algorithms
Note
Lecture1: Insertion sort, efficient algorithm
Lecture2: Merge sort
Lecture3: Binary search, quicksort
Lecture5: Graphs, Breadth-First Search (BFS)
Lecture6: Depth-first search, topological sorting
Lecture7&8: Strongly connected components, single-origin shortest paths in weighted graphs
Lecture12: Data compression and Huffman coding
Lecture9: The dynamic programming principle; segmented least squares
After midterm
Lecture 11 Shortest paths in weighted graphs (Bellman-Ford)
Lecture 15&16 Network flows
Lecture 20&21 Reductions; independent set and vertex cover; decision problems
Lecture 22 Satisfiability problems: SAT, 3SAT, Circuit-SAT
Lecture 23&25 Representative NP-complete problems: TSP, Set Cover
Lecture18&19 Linear programming

Important algorithms

Sort: Insertion sort, merge sort, quick sort
Binary search
Graph: BFS, DFS and their applications
Greedy: Dijkstra’s algorithm and improved implementation
Dynamic programming: Bellman-Ford
Network flow: Ford-Fulkerson algorithm
NPC: Vertex Cover(D), Independent Set(D), SAT, 3SAT, integer programming
Linear programming (integer programming)

Note

The focus is on understanding the applications and the time complexity of each algorithm.

Lecture1: Insertion sort, efficient algorithm

Insertion sort

Analysis of algorithms

Correctness: Proof by induction.
Running time: best case 5n − 4, i.e. Θ(n); worst case 3n^2 + 7n − 4, i.e. Θ(n^2)
Space: in-place algorithm
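
As a reminder of what is being analyzed, here is a minimal Python sketch of insertion sort. The function name and the list-based interface are my own choices, not the lecture's pseudocode.

```python
def insertion_sort(a):
    """Sort the list a in place and return it."""
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift the larger elements of the sorted prefix a[0..i-1] one slot right.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
```

On an already sorted input the inner loop never runs (the linear best case); on a reversed input it runs i times for each i (the quadratic worst case).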

Efficient algorithms (polynomial running time)

Lecture2: Merge sort

Asymptotic notation

Asymptotic upper bounds: Big-O notation
Asymptotic lower bounds: Big-Ω notation
Asymptotic tight bounds: Θ notation
Asymptotic upper bounds that are not tight: little-o
Asymptotic lower bounds that are not tight: little-ω

Divide & conquer principle, application: merge sort

Correctness
Running time: O(nlogn)
Space: Θ(n)
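
A short sketch of merge sort, to make the divide, conquer and combine steps concrete. This version returns a new list instead of sorting in place; the helper name merge is an assumption of this sketch.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list in linear time."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # At most one of the two tails is non-empty.
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_sort(a):
    """Return a sorted copy of a."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))
```

The extra list built by merge is where the Θ(n) space comes from.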

Solving recurrences and running time of merge sort

recursion trees
Master theorem
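
For reference, the merge sort recurrence and a simplified form of the Master theorem, written out. The constants c, a, b, k are generic symbols; the lecture may state the theorem in a slightly different form.

```latex
% Merge sort: two subproblems of half the size plus linear merge work.
T(n) = 2\,T(n/2) + cn, \qquad T(1) = c
% Recursion tree: \log_2 n + 1 levels, each costing cn in total, so
T(n) = cn(\log_2 n + 1) = \Theta(n \log n)

% Master theorem (simplified): for T(n) = a\,T(n/b) + c\,n^k
% with a \ge 1, b > 1, k \ge 0:
T(n) =
\begin{cases}
\Theta(n^{\log_b a}) & \text{if } a > b^k \\
\Theta(n^k \log n)   & \text{if } a = b^k \\
\Theta(n^k)          & \text{if } a < b^k
\end{cases}
```

Merge sort has a = 2, b = 2, k = 1, so a = b^k and the middle case applies.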

Lecture3: Binary search, quicksort

Binary search

Running time: O(log n)
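
A minimal iterative sketch of binary search on a sorted list. Returning -1 for a missing target is a convention chosen here, not something fixed by the lecture.

```python
def binary_search(a, target):
    """Return an index of target in the sorted list a, or -1 if it is absent."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid
        if a[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1
```

Each iteration halves the search range, which gives the O(log n) bound.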

Quicksort

divide and conquer
Space: in-place
Running time:
Its worst-case running time is Θ(n^2) but its average-case running time is Θ(nlogn)
Correctness:
- strong induction (the induction step at n requires that the inductive hypothesis holds at all steps 1, 2, …, n−1 and not just at step n−1, as with simple induction.)
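
A sketch of in-place quicksort. It uses the last element as the pivot (Lomuto-style partition), which is one common choice and not necessarily the lecture's; with this pivot rule an already sorted input triggers the Θ(n^2) worst case.

```python
def partition(a, lo, hi):
    """Partition a[lo..hi] around the pivot a[hi]; return the pivot's final index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i


def quicksort(a, lo=0, hi=None):
    """Sort the list a in place and return it."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = partition(a, lo, hi)
        # Both recursive calls are on strictly smaller subarrays of any size,
        # which is why the correctness proof needs strong induction.
        quicksort(a, lo, p - 1)
        quicksort(a, p + 1, hi)
    return a
```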

Lecture5: Graphs, Breadth-First Search (BFS)

Definition

Undirected / directed
Simple path: all vertices on the path are distinct.

An undirected graph is connected when there is a path between every pair of vertices.
The connected component of a node u is the set of all nodes in the graph reachable by a path from u.
A directed graph is strongly connected if for every pair of vertices u, v, there is a path from u to v and from v to u.
The strongly connected component of a node u in a directed graph is the set of nodes v in the graph such that there is a path from u to v and from v to u.

Tree: connected acyclic graph
Bipartite graphs
Degree theorem
Linear graph algorithms run in O(n+m) time

Representing graphs

adjacency matrix
adjacency list

Breadth-first search (BFS)

queue: FIFO data structure
s-t connectivity
shortest s-v paths in unweighted graphs.
Connected components in undirected graphs
Testing bipartiteness & graph 2-colorability
SCC(u): the SCC of a node u
All SCCs: possible by repeating SCC(u), but not in linear time
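
A minimal sketch of BFS computing shortest path lengths (number of edges) from a source s. The graph is assumed to be a dict mapping each vertex to a list of neighbours; that representation and the function name are assumptions of this sketch.

```python
from collections import deque


def bfs_distances(adj, s):
    """Return {v: number of edges on a shortest s-v path} for every v reachable from s."""
    dist = {s: 0}
    queue = deque([s])          # FIFO queue
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:   # first discovery is via a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

The keys of the returned dict form the connected component of s in an undirected graph, and t is reachable from s exactly when t is a key. Every vertex and edge is handled a constant number of times, so the running time is O(n+m).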

Lecture6: Depth-first search, topological sorting

Depth-first search (DFS)

stack: LIFO (Last-In First-Out)
s-t connectivity
Cycle detection
Topological sorting in DAGs (directed acyclic graph)
- Run DFS(G); compute finish times.
- Process the tasks in decreasing order of finish times.
- Running time: O(m+n)
- Edge types: tree, forward, back, cross; they can be told apart from the discovery/finish time intervals of their endpoints
Undirected graphs: find all connected components
SCC(u): SCC of node u
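All SCCs: linear time
- Step 1: Compute G^r, the reverse graph of G (every edge reversed).
- Step 2: Run DFS(G^r); compute finish(u) for all u.
- Step 3: Run DFS(G), starting new trees in decreasing order of finish(u).
- Step 4: Output the vertices of each tree in the DFS forest of step 3 as an SCC.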
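
The topological sorting and SCC steps above can be sketched with one shared DFS helper. This is a recursive sketch under two assumptions of mine: the graph is a dict that has every vertex as a key, and it is small enough not to hit Python's recursion limit.

```python
def dfs_forest(adj, order):
    """Run DFS over the vertices in the given order.

    Returns (finished, forest): vertices in increasing finish time,
    and the vertex list of each DFS tree.
    """
    visited, finished, forest = set(), [], []

    def visit(u, tree):
        visited.add(u)
        tree.append(u)
        for v in adj[u]:
            if v not in visited:
                visit(v, tree)
        finished.append(u)

    for u in order:
        if u not in visited:
            tree = []
            visit(u, tree)
            forest.append(tree)
    return finished, forest


def topological_order(adj):
    """Vertices of a DAG in decreasing order of finish time."""
    finished, _ = dfs_forest(adj, list(adj))
    return finished[::-1]


def strongly_connected_components(adj):
    """SCCs via DFS on the reverse graph, then DFS on G by decreasing finish time."""
    rev = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            rev[v].append(u)
    finished, _ = dfs_forest(rev, list(rev))
    _, forest = dfs_forest(adj, finished[::-1])
    return forest
```

Both functions do a constant number of DFS passes, so they run in O(m+n).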

Lecture7&8: Strongly connected components, single-origin shortest paths in weighted graphs

Applications of DFS: Strongly connected components

(combine it with the above)

Shortest paths in graphs with non-negative edge weights (Dijkstra’s algorithm)

Greedy principle
implementation and improved implementation
running time
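
A sketch of the heap-based ("improved") implementation of Dijkstra's algorithm, using Python's heapq with lazy deletion of stale entries. The adjacency format, a dict mapping each vertex to a list of (neighbour, weight) pairs, is an assumption of this sketch, and all weights must be non-negative.

```python
import heapq


def dijkstra(adj, s):
    """Return {v: length of a shortest s-v path} for every v reachable from s."""
    dist = {s: 0}
    done = set()                 # vertices whose distance is final
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue             # stale heap entry, skip it
        done.add(u)              # greedy step: closest unfinished vertex
        for v, w in adj.get(u, []):
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

With a binary heap this runs in O(m log n); scanning an array for the minimum instead gives O(n^2).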

Lecture12: Data compression and Huffman coding

Prefix codes and trees
The Huffman algorithm
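
A compact sketch of the Huffman algorithm with a min-heap: repeatedly merge the two least frequent subtrees. It assumes the symbols are strings (tuples are used internally for tree nodes); the counter is only a tie-breaker so that the heap never has to compare subtrees.

```python
import heapq
from itertools import count


def huffman_codes(freq):
    """Return {symbol: bit string} for a prefix code built by Huffman's algorithm."""
    tiebreak = count()
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                      # a single symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):         # internal node: (left, right)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                               # leaf: a symbol
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes
```

Symbols sit only at the leaves of the tree, so no codeword is a prefix of another.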

Lecture9: The dynamic programming principle; segmented least squares

Overlapping subproblems
An easy-to-compute recurrence
Iterative, bottom-up computations

Segmented least squares
Sequence alignment
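
To illustrate the three ingredients on sequence alignment, here is a bottom-up sketch. The default costs (gap = 1, mismatch = 1, which gives edit distance) are placeholder values chosen for this sketch, not the lecture's.

```python
def alignment_cost(x, y, gap=1, mismatch=1):
    """Minimum cost of aligning strings x and y, computed bottom-up."""
    m, n = len(x), len(y)
    # opt[i][j] = minimum cost of aligning x[:i] with y[:j]
    opt = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        opt[i][0] = i * gap
    for j in range(1, n + 1):
        opt[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else mismatch
            opt[i][j] = min(
                opt[i - 1][j - 1] + sub,   # match x[i-1] with y[j-1]
                opt[i - 1][j] + gap,       # x[i-1] is aligned to a gap
                opt[i][j - 1] + gap,       # y[j-1] is aligned to a gap
            )
    return opt[m][n]
```

Each of the (m+1)(n+1) subproblems is solved once from three already computed entries, so time and space are both O(mn).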

After midterm

Lecture 11 Shortest paths in weighted graphs (Bellman-Ford)

Bellman-Ford algorithm (DP solution)
- OPT(i, v) = cost of a shortest s-v path using at most i edges
- 2D array: time O(nm), space O(n^2)
  - Pseudocode: M[i, v]
- 1D array: time O(nm), space O(n)
  - Early termination condition: if at some iteration i no value in M changed, then stop.
  - Pseudocode: M[v], similar to Dijkstra's algorithm
- Detecting negative cycles
  - Update all edges n times (one more round than the n−1 needed); if some value still changes in the n-th round, there is a negative cycle.
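
A sketch of the 1D-array version with early termination and the extra round for negative cycle detection. It assumes vertices are numbered 0..n−1 and edges are (u, v, w) triples, and it computes distances from a source s; the course's formulation may instead fix the destination.

```python
def bellman_ford(n, edges, s):
    """Return (dist, has_negative_cycle) for shortest paths from s."""
    inf = float("inf")
    dist = [inf] * n
    dist[s] = 0
    for _ in range(n - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:          # early termination: nothing changed this round
            return dist, False
    # One more round: any further improvement means a negative cycle
    # reachable from s.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return dist, True
    return dist, False
```

When a negative cycle is reported, the returned distances are not meaningful for the vertices it can reach.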

Lecture 15&16 Network flows

Definition:

Capacity constraints
Flow conservation
Value of a flow: |f| = f^out(s), the total flow leaving the source s
Max flow and min cut

Residual graph and augmenting paths

Residual graph (forward, backward)
P is a simple s-t path in the residual graph. Augment f by pushing extra flow along P
Bottleneck

Ford-Fulkerson algorithm

Running time: O(nmU), where U is the largest edge capacity; this is pseudo-polynomial
Correctness
Application: max bipartite matching
Reduction (Forward direction, Reverse direction)
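
A sketch of Ford-Fulkerson that finds augmenting paths by DFS in the residual graph. It assumes integer capacities given as a dict of dicts and s != t; the representation and names are choices made for this sketch.

```python
def max_flow(capacity, s, t):
    """Return the value of a maximum s-t flow (integer capacities, s != t)."""
    # residual[u][v] = remaining capacity of the residual edge (u, v)
    residual = {s: {}, t: {}}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)     # backward edge starts at 0

    def augment(u, bottleneck, visited):
        """Push flow along one simple u-t path; return the amount pushed."""
        if u == t:
            return bottleneck
        visited.add(u)
        for v, r in residual[u].items():
            if r > 0 and v not in visited:
                pushed = augment(v, min(bottleneck, r), visited)
                if pushed > 0:
                    residual[u][v] -= pushed   # use up forward capacity
                    residual[v][u] += pushed   # allow undoing it later
                    return pushed
        return 0

    flow = 0
    while True:
        pushed = augment(s, float("inf"), set())
        if pushed == 0:          # no augmenting path left: flow is maximum
            return flow
        flow += pushed
```

For max bipartite matching, add a source with a unit-capacity edge to every left vertex, direct each original edge left to right with capacity 1, and give every right vertex a unit-capacity edge to a sink; the max flow value is then the size of a maximum matching.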

Lecture 20&21 Reductions; independent set and vertex cover; decision problems

Reduction:

An instance x of X
is transformed into an instance y = R(x) of Y

Reduction as a means to design efficient algorithm

X can be solved using a polynomial number of computational steps, plus
a polynomial number of calls to an algorithm for Y
Then X <= p Y

Reduction as a means to argue about hard problems

X <= p Y
Y is at least as hard as X
If X cannot be solved in polynomial time, then Y cannot.
Relative level of difficulty: X <= pY, Y <= pX

Two hard problems

Independent set
Vertex cover

Optimization versions for IS and VC

Maximum independent set (MIS)
Min vertex cover

Decision version of optimization problems

Yes/no answer
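Maximization problems get a lower bound k (is there a solution of size at least k?); minimization problems get an upper bound k (is there a solution of size at most k?)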

Rough equivalence of decision & optimization problems

Given an algorithm that solves MIS, we can use it to solve IS(D).
Given an algorithm that solves IS(D), we can use it to solve MIS.

Reduction from Independent Set to Vertex Cover

Forward direction
Reverse direction
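
Both directions rest on one fact: S is an independent set of G exactly when V − S is a vertex cover of G. The sketch below checks that fact by brute force on a small graph and shows the resulting instance mapping (G, k) to (G, n − k). All function names are made up for this illustration.

```python
from itertools import combinations


def is_independent_set(edges, chosen):
    """No edge has both endpoints in chosen."""
    return all(not (u in chosen and v in chosen) for u, v in edges)


def is_vertex_cover(edges, chosen):
    """Every edge has at least one endpoint in chosen."""
    return all(u in chosen or v in chosen for u, v in edges)


def reduce_is_to_vc(vertices, edges, k):
    """Map the IS(D) instance (G, k) to the VC(D) instance (G, n - k)."""
    return vertices, edges, len(vertices) - k


def complement_property_holds(vertices, edges):
    """Check: S is independent iff V - S is a vertex cover, for every subset S."""
    vs = set(vertices)
    for size in range(len(vs) + 1):
        for subset in combinations(sorted(vs), size):
            s = set(subset)
            if is_independent_set(edges, s) != is_vertex_cover(edges, vs - s):
                return False
    return True
```

The brute-force check is exponential and only a sanity test; the reduction itself just copies the graph and changes the target size, which takes polynomial time.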

Class P

Set of decision problems that can be solved by a polynomial-time algorithm.

(Motivation for NP)

If we were given a solution S for such a problem X(D), we could check if it is correct quickly.
Such an S is a succinct certificate that x ∈ X(D), i.e. that x is a yes-instance.

Class NP

An efficient certifier (or verification algorithm) B for a problem X(D) is a polynomial-time algorithm that
- Takes two input arguments: an instance x (a specific input of the problem) and a short certificate t
- Satisfies: x ∈ X(D) if and only if there is a certificate t with |t| <= poly(|x|) and B(x, t) = yes
Set of decision problems that have an efficient certifier.

P vs NP

P ⊆ NP
P = NP ?
Why would NP contain more problems than P? Intuitively, the hardest problems in NP are the least likely to belong to P

NPC

The hardest problems
Definition: X(D) is NP-complete if
- X(D) ∈ NP
- For all Y ∈ NP, Y <= p X(D)

Show a problem is NP-complete

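Suppose we already have an NP-complete problem X. To show that Y is NP-complete, we only need to show:
- Y ∈ NP
- X <= p Y
(Compared with proving NP-completeness from the definition, this needs only a single reduction, which is much simpler.)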

Lecture 22 Satisfiability problems: SAT, 3SAT, Circuit-SAT

Definition

truth assignment
A truth assignment satisfies a clause if it causes the clause to evaluate to 1.
A formula φ is satisfiable if it has a satisfying truth assignment.

Satisfiability (SAT) and 3SAT

SAT: Given a formula φ in CNF with n variables and m clauses, is φ satisfiable?
3SAT: Given a formula φ in CNF with n variables and m clauses such that each clause has exactly 3 literals, is φ satisfiable?
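
A small sketch of these definitions. A CNF formula is encoded as a list of clauses, each a list of non-zero integers (i for variable x_i, −i for its negation); this encoding is an assumption of the sketch. The function satisfies is also an efficient certifier for SAT, with the truth assignment as the certificate; the brute-force search is exponential and only there for illustration.

```python
from itertools import product


def satisfies(clauses, assignment):
    """True if the assignment makes every clause evaluate to 1."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def brute_force_sat(clauses, n):
    """Try all 2^n truth assignments; return a satisfying one or None."""
    for values in product([False, True], repeat=n):
        assignment = {i + 1: val for i, val in enumerate(values)}
        if satisfies(clauses, assignment):
            return assignment
    return None
```

Checking one assignment takes time linear in the size of the formula, which is what puts SAT in NP.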

The art of proving NP-completeness

Circuit-SAT ≤p SAT
SAT ≤p 3SAT
3SAT ≤p IS(D)

Lecture 23&25 Representative NP-complete problems: TSP, Set Cover

Circuit SAT
TSP
Integer programming

Lecture18&19 Linear programming

Definition

feasible solutions
Feasible region
Optimal solution

Duality

We can alternatively solve the dual to find the optimal objective value.
An optimal dual solution can be used to derive an optimal primal solution (complementary slackness).
The dual may have structure making it easier to solve at scale (e.g., via parallel optimization).
7-step dualization
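
For orientation, one standard primal/dual pair and the relations between them. The symbols A, b, c, x, y are generic, and the canonical form used in the course's 7-step dualization may differ from this one.

```latex
% Primal (P) and its dual (D)
\text{(P)}\quad \max\; c^{T}x \quad \text{s.t.}\quad Ax \le b,\; x \ge 0
\text{(D)}\quad \min\; b^{T}y \quad \text{s.t.}\quad A^{T}y \ge c,\; y \ge 0

% Weak duality: for any feasible x and y
c^{T}x \;\le\; y^{T}Ax \;\le\; b^{T}y

% Strong duality: if (P) has an optimal solution x^*, then (D) has an
% optimal solution y^* with c^{T}x^* = b^{T}y^*

% Complementary slackness at optimality
y_i^{*}\,\bigl(b_i - (Ax^{*})_i\bigr) = 0 \;\;\forall i, \qquad
x_j^{*}\,\bigl((A^{T}y^{*})_j - c_j\bigr) = 0 \;\;\forall j
```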