MST Review
Input: Undirected graph G = (V, E),edges costs
ce
c
e
.
Output: Min-cost spanning tree(no cycles, connected).
Assumptions : G is connected, distinct edge costs.
Cut Property: If e is the cheapest edge crossing some cut(A, B), then e belongs to the MST.
Kruskal’s MST Algorithm
Sort edges in order of increasing cost.
[Rename edges 1,2,…,m so that
c1<c2<...<cm
c
1
<
c
2
<
.
.
.
<
c
m
.]
−T=∅
−
T
=
∅
−
−
For to
m
m
O(m) time
If
T∪{i}
T
∪
{
i
}
has no cycles. O(n) time to check for cycle.
[ Use BFS or DFS in the graph(V,T)which contains
≤
≤
n-1 edges.
−−− − − − Add i i to .
−
−
Return .
Running time of straightforward implementation:
(m = # of edges, n= # of vertices] O(mlogn)+O(mn) = O(mn).
Plan: Data structure for O(1)-time cycle checks
→
→
O(mlogn)
O
(
m
l
o
g
n
)
time.
The Union-Find Data structure.
Maintain partition of a set of objects.
FIND(X): Return name of group that X belongs to.
UNION(
Ci
C
i
,
Cj
C
j
): Fuse groups
Ci,CJ
C
i
,
C
J
into a single one.
Objects = vertices
−
−
Groups = Connected components chosen edges T.
Adding new edge(u,v) to T.
Motivation: O(1)-time cycle checks in Kruskal’s algorithm.
Idea#1: -Maintain one linked structure per connected component of(V,T).
−
−
Each component has an arbitrary leader vertex.
Invariant: Each vertex points to the leaders of its component.
Key point: Given edge(u,v), can check if u&v already in same component in O(1) time.
[if and only if leader pointers of u, v match,i,e.,FIND(u) = FIND(v) ] O(1)-time cycle checks!
Runing Time of Fast Implementation
Scored:
O(mlogn) time for sorting .
O(m) times for cycle checks[O(1) per iteration]
O(nlog(n)) time overall for leader pointer updates.
→ → total (Matching Prim’s algorithm)
Clustering [unsupervised learning]
Informal goal: Given n “points”,[Web pages, images, genome fragments,etc] classfify into “coherent groups”;
Assumptions:
(1) As input, given a (dis) similarity measure
a distance
d(p,q)
d
(
p
,
q
)
between each point pair.
(2)Symmetric [d(p,q) = d(q,p)]
Examples: Euclidean distance, genome similarity, etc.
Goal: Same cluster
⇔
⇔
“nearby”
Max-Spacing k-Clusterings
Assume: We know
k:=
k
:=
# of clusters desired.
[In practice, can experiment with a range of values]
Call points p & q separated is they’re assigned to different clusters.
Definition: The spacing of a k-clustering is
minseparatedp,q
m
i
n
s
e
p
a
r
a
t
e
d
p
,
q
d(p,q)
d
(
p
,
q
)
.
Problem statement: Given a distance measure
d
d
and , compute the
k−clustering
k
−
c
l
u
s
t
e
r
i
n
g
with maximum spacing.
A Greedy Algorithm
−
−
Initially, each point in a sepatate cluster
Repeat until only k clusters.
−−
−
−
Let
p,q=
p
,
q
=
closest pair of separated points.
(determines the current spacing)
−
−
Merge the cluster containg p & q into a single cluster.
Note: Just like Krusskal’s MST algorithm, but stopped early.
Points
⇔
⇔
vertices, distances
↔
↔
edges costs, point pairs.
↔
↔
edges.
⇒
⇒
Called single-link clustering.