Author: 孙相国
E-mail: sunxiangguodut@qq.com
1. Introduction
As we discussed previously, in practical problems the number of atomic events in the joint distribution of the variables is usually enormous, and neither we nor our data can possibly cover every case. This means we can rarely recover the true distribution in full; the best we can do is to use the available data to uncover as large a subset of the independencies of the true distribution as possible, and then construct an I-map that satisfies this set of independencies. The question for this post is: given a distribution P, to what extent can we construct a graph G that captures its structure?
2. Review
Theorem 1: Let G be a Bayesian network structure over the variables X1, …, Xn, and let P be a joint distribution over the same space. If G is an I-map for P, then P factorizes according to G:

P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa_Xi)    (1)

Proof:
Assume, without loss of generality, that X1, X2, …, Xn is a topological ordering of G.

By the chain rule of probability:

P(X1, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) ⋯ P(Xn | X1, …, Xn−1)

Since G is an I-map for P, the local independencies of G, Iℓ(G) = {(Xi ⊥ NonDescendants_Xi | Pa_Xi) : i = 1, …, n}, satisfy Iℓ(G) ⊆ I(P). Because X1, …, Xn is a topological ordering of G, in each factor P(Xi | X1, …, Xi−1) of the chain rule, all parents of Xi lie in {X1, …, Xi−1}, and this set contains no descendant of Xi. That is, {X1, …, Xi−1} = Pa_Xi ∪ Z with Z ⊆ NonDescendants_Xi. By the local independencies and the decomposition property of conditional independence,

P(Xi | X1, …, Xi−1) = P(Xi | Pa_Xi ∪ Z) = P(Xi | Pa_Xi),

and substituting each factor into the chain rule yields the factorization (1). ∎
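As a quick numeric sanity check of the key proof step, the following sketch builds a three-node chain X → Y → Z (all variable names and CPD values here are our own illustrative choices, not from the text) and verifies that each chain-rule factor collapses onto the parent set, i.e. P(Z | X, Y) = P(Z | Pa_Z) = P(Z | Y):

```python
from itertools import product

# Chain X -> Y -> Z with illustrative binary CPDs (assumed values).
pX = {0: 0.6, 1: 0.4}
pY_X = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # pY_X[x][y] = P(Y=y | X=x)
pZ_Y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}  # pZ_Y[y][z] = P(Z=z | Y=y)

# Joint built from the CPDs, so the local independencies of the chain hold.
P = {(x, y, z): pX[x] * pY_X[x][y] * pZ_Y[y][z]
     for x, y, z in product((0, 1), repeat=3)}

def marg(P, keep):
    """Marginal over the variable positions listed in `keep`."""
    out = {}
    for assg, p in P.items():
        key = tuple(assg[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Key proof step: P(Z | X, Y) = P(Z | Y), because X is a nondescendant
# of Z outside its parent set.
Pxyz, Pxy, Pyz, Py = P, marg(P, (0, 1)), marg(P, (1, 2)), marg(P, (1,))
for x, y, z in product((0, 1), repeat=3):
    lhs = Pxyz[(x, y, z)] / Pxy[(x, y)]   # P(z | x, y)
    rhs = Pyz[(y, z)] / Py[(y,)]          # P(z | y)
    assert abs(lhs - rhs) < 1e-12
print("each chain-rule factor collapses onto the parent set")
```

Here the joint is built from the factorization, so the check simply confirms numerically that the chain-rule factor P(Z | X, Y) equals the CPD P(Z | Y), which is the collapse the proof performs symbolically.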
Theorem 2: Let G be a Bayesian network structure over the variables X1, …, Xn, and let P be a joint distribution over the same space. If P factorizes according to G, then G is an I-map for P.

Proof:
We must show that every local independence (Xi ⊥ NonDescendants_Xi | Pa_Xi) holds in P. Fix Xi, and let U = Pa_Xi and Z = NonDescendants_Xi − Pa_Xi. By definition,

P(Xi | U, Z) = P(Xi, U, Z) / P(U, Z).    (2)

By the chain rule of the Bayesian network, the numerator of this fraction is obtained from the factorized joint by summing out the descendants of Xi. Summing them out in reverse topological order eliminates each descendant's CPD in turn (each CPD sums to 1), and since the parents of every nondescendant are themselves nondescendants, what remains is

P(Xi, U, Z) = P(Xi | Pa_Xi) ∏_{Xj ∈ U ∪ Z} P(Xj | Pa_Xj).

By further marginalizing Xi out of the joint distribution, the denominator is

P(U, Z) = ∏_{Xj ∈ U ∪ Z} P(Xj | Pa_Xj),

since Xi appears in no CPD of a nondescendant and Σ_{Xi} P(Xi | Pa_Xi) = 1. Thus, (2) can be written as

P(Xi | U, Z) = P(Xi | Pa_Xi),

which is exactly the required local independence. ∎
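The same cancellation can be checked numerically. The sketch below (our own toy example with assumed CPD values) builds a v-structure X → Z ← Y from its factorization and confirms the local independence (X ⊥ Y) that theorem 2 guarantees, since Y is a nondescendant of X and Pa_X = ∅:

```python
from itertools import product

# V-structure X -> Z <- Y with illustrative binary CPDs (assumed values).
pX = {0: 0.3, 1: 0.7}
pY = {0: 0.55, 1: 0.45}
pZ_XY = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},
         (1, 0): {0: 0.35, 1: 0.65}, (1, 1): {0: 0.05, 1: 0.95}}

# P factorizes over the graph by construction.
P = {(x, y, z): pX[x] * pY[y] * pZ_XY[(x, y)][z]
     for x, y, z in product((0, 1), repeat=3)}

# Marginalizing Z sums its CPD to 1, exactly as in the proof, so the
# local independence (X ⊥ Y) must hold in P.
Pxy = {}
for (x, y, z), p in P.items():
    Pxy[(x, y)] = Pxy.get((x, y), 0.0) + p
for x, y in product((0, 1), repeat=2):
    assert abs(Pxy[(x, y)] - pX[x] * pY[y]) < 1e-12
print("(X ⊥ Y) holds in the factorized P")
```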
3. Minimal I-Map
A graph K is a minimal I-map for a set of independencies I if it is an I-map for I, and if the removal of even a single edge from K renders it not an I-map.
Theorems 1 and 2 of Section 2 give us a basis for finding a minimal I-map. We assume we are given a predetermined variable ordering, say, X1, …, Xn. We now examine each variable Xi, i = 1, …, n in turn. For each Xi, we pick some minimal subset U of {X1, …, Xi−1} to be Xi's parents in G. More precisely, we require that U satisfy (Xi ⊥ {X1, …, Xi−1} − U | U), and that no node can be removed from U without violating this property. We then set U to be the parents of Xi.
The proof of theorem 1 tells us that, if each node Xi is independent of X1, …, Xi−1 given its parents in G, then P factorizes over G. We can then conclude from theorem 2 that G is an I-map for P. By construction, G is minimal, so G is a minimal I-map for P.
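The procedure above can be sketched in code. This is our own illustrative implementation, not the book's pseudocode: the names are ours, and the independence oracle is a brute-force test over an explicit joint table, which is feasible only for tiny discrete distributions.

```python
from itertools import product, combinations

def indep(P, A, B, C, tol=1e-9):
    """Brute-force test of (A ⊥ B | C) in a joint P given as a dict mapping
    full assignment tuples to probabilities; A, B, C are disjoint tuples of
    variable positions."""
    def marg(keep):
        out = {}
        for assg, p in P.items():
            key = tuple(assg[i] for i in keep)
            out[key] = out.get(key, 0.0) + p
        return out
    Pabc, Pac, Pbc, Pc = marg(A + B + C), marg(A + C), marg(B + C), marg(C)
    for assg, p in Pabc.items():
        a, b, c = assg[:len(A)], assg[len(A):len(A) + len(B)], assg[len(A) + len(B):]
        # (A ⊥ B | C)  iff  P(a,b,c) P(c) = P(a,c) P(b,c) for all assignments
        if abs(p * Pc[c] - Pac[a + c] * Pbc[b + c]) > tol:
            return False
    return True

def build_minimal_imap(P, order):
    """For each X_i in the ordering, pick a minimal subset U of its
    predecessors with (X_i ⊥ predecessors − U | U); U becomes Pa_{X_i}."""
    parents = {}
    for i, v in enumerate(order):
        pred, found = order[:i], None
        for k in range(i + 1):                   # try smallest subsets first
            for U in combinations(pred, k):
                rest = tuple(w for w in pred if w not in U)
                if not rest or indep(P, (v,), rest, U):
                    found = set(U)
                    break
            if found is not None:
                break
        parents[v] = found
    return parents

# Demo on the chain X -> Y -> Z (positions 0, 1, 2), illustrative CPDs.
pX = {0: 0.6, 1: 0.4}
pY_X = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
pZ_Y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}
P = {(x, y, z): pX[x] * pY_X[x][y] * pZ_Y[y][z]
     for x, y, z in product((0, 1), repeat=3)}

print(build_minimal_imap(P, (0, 1, 2)))   # X, Y, Z: recovers the chain
print(build_minimal_imap(P, (2, 1, 0)))   # Z, Y, X: the reversed chain
```

Note that both orderings return valid minimal I-maps (the chain and its reversal), which already hints at the dependence on the chosen ordering discussed below.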
In fact, even given a fixed ordering, the minimal parent set U for a node Xi is not necessarily unique: the same distribution may admit several different minimal parent sets for the same variable.
However, one can show that, if the distribution is positive (see the definition below), that is, if for any instantiation ξ of all the network variables X we have P(ξ) > 0, then the choice of parent set, given an ordering, is unique. Under this assumption, the procedure above (Build-Minimal-I-Map) can produce every minimal I-map for P: let G be any minimal I-map for P. If we call Build-Minimal-I-Map with an ordering ≺ that is topological for G, then, by the uniqueness argument, the algorithm must return G.
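To see why positivity matters, here is a minimal sketch (our own toy example, with assumed CPD values) of a non-positive distribution in which the minimal parent set is not unique: B is a deterministic copy of A, so a variable C that depends on A can equally take {A} or {B} as a minimal parent set.

```python
from itertools import product

# Toy non-positive distribution: B is a deterministic copy of A, and C
# depends on A.  Assignments with a != b have probability zero.
pC_A = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
P = {(a, b, c): (0.5 if a == b else 0.0) * pC_A[a][c]
     for a, b, c in product((0, 1), repeat=3)}

def indep(P, A, B, C, tol=1e-9):
    """Brute-force test of (A ⊥ B | C); A, B, C are tuples of positions."""
    def marg(keep):
        out = {}
        for assg, p in P.items():
            key = tuple(assg[i] for i in keep)
            out[key] = out.get(key, 0.0) + p
        return out
    Pabc, Pac, Pbc, Pc = marg(A + B + C), marg(A + C), marg(B + C), marg(C)
    for assg, p in Pabc.items():
        a, b, c = assg[:len(A)], assg[len(A):len(A) + len(B)], assg[len(A) + len(B):]
        if abs(p * Pc[c] - Pac[a + c] * Pbc[b + c]) > tol:
            return False
    return True

# Both (C ⊥ B | A) and (C ⊥ A | B) hold, so under the ordering A, B, C
# both {A} and {B} are minimal parent sets for C.
print(indep(P, (2,), (1,), (0,)), indep(P, (2,), (0,), (1,)))   # True True
# But C is not independent of {A, B} outright, so the empty set fails:
print(indep(P, (2,), (0, 1), ()))                               # False
```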
At first glance, the minimal I-map seems to be a reasonable candidate for capturing the structure in the distribution: It seems that if G is a minimal I-map for a distribution P, then we should be able to “read off” all of the independencies in P directly from G. Unfortunately, this intuition is false.
Definition (positive distribution): A distribution P is said to be positive if for all events α ∈ S such that α ≠ ∅, we have P(α) > 0.
4. The Problem with Minimal I-Maps
As we will see, the graphs of figure 3.8b,c constructed below really are minimal I-maps for the student distribution, yet they fail to capture some or all of the independencies that hold in it. Thus, the fact that G is a minimal I-map for P is far from a guarantee that G captures the independence structure in P.
Consider the distribution P_B^student, as defined in figure 3.4, and let us go through the process of constructing a minimal I-map for it. We note that the graph G_student precisely reflects the independencies in this distribution (that is, I(P_B^student) = I(G_student)), so we can use G_student to determine which independencies hold in P_B^student.
Our construction process starts with an arbitrary ordering on the nodes; we will go through this process for three different orderings. Throughout, it is important to remember that we are testing independencies relative to the distribution P_B^student. We can use G_student (figure 3.4) to guide our intuition about which independencies hold, but we can always resort to testing them in the joint distribution itself.
The first ordering is a very natural one: D, I, S, G, L. We add one node at a time and see which of the possible edges from the preceding nodes are redundant. We start by adding D, then I. We can now remove the edge from D to I because this particular distribution satisfies (I ⊥ D), so I is independent of D given its other parents (the empty set). Continuing on, we add S, but we can remove the edge from D to S because our distribution satisfies (S ⊥ D | I). We then add G, but we can remove the edge from S to G, because the distribution satisfies (G ⊥ S | I, D).
Finally, we add L, but we can remove all edges from D, I, S. Thus, our final output is the graph in figure 3.8a, which is precisely our original network for this distribution.
Now, consider a somewhat less natural ordering: L, S, G, I, D. In this case, the resulting I-map is not quite as natural or as sparse. To see this, let us consider the sequence of steps. We start by adding L to the graph. Since it is the first variable in the ordering, it must be a root. Next, we consider S. The decision is whether to have L as a parent of S. Clearly, we need an edge from L to S, because the quality of the student's letter is correlated with his SAT score in this distribution, and S has no other parents that help render it independent of L. Formally, (S ⊥ L) does not hold in the distribution. In the next iteration of the algorithm, we introduce G. Now, all possible subsets of {L, S} are potential parent sets for G. Clearly, G is dependent on L. Moreover, although G is independent of S given I, it is not independent of S given L. Hence, we must add the edge between S and G. Carrying out the procedure, we end up with the graph shown in figure 3.8b.
Finally, consider the ordering: L, D, S, I, G. In this case, a similar analysis results in the graph shown in figure 3.8c, which is almost a complete graph, missing only the edge from S to G, which we can remove because G is independent of S given I.
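The three orderings can be replayed in code. The sketch below is our own binary simplification of the student network with made-up CPD values (in figure 3.4, G even has three values, so the numbers here are purely illustrative); it reuses the brute-force independence test and the minimal-parent search from earlier and reports the edge count for each ordering.

```python
from itertools import product, combinations

# Binary simplification of the student network; positions 0..4 stand for
# D, I, S, G, L.  CPD values are illustrative, not figure 3.4's.
pD = {0: 0.6, 1: 0.4}
pI = {0: 0.7, 1: 0.3}
pS1 = {0: 0.05, 1: 0.8}                                      # P(S=1 | I=i)
pG1 = {(0, 0): 0.3, (0, 1): 0.05, (1, 0): 0.9, (1, 1): 0.5}  # P(G=1 | I=i, D=d)
pL1 = {0: 0.1, 1: 0.9}                                       # P(L=1 | G=g)

def bern(p1, x):
    return p1 if x == 1 else 1.0 - p1

P = {(d, i, s, g, l):
         pD[d] * pI[i] * bern(pS1[i], s) * bern(pG1[(i, d)], g) * bern(pL1[g], l)
     for d, i, s, g, l in product((0, 1), repeat=5)}

def indep(P, A, B, C, tol=1e-9):
    """Brute-force test of (A ⊥ B | C); A, B, C are tuples of positions."""
    def marg(keep):
        out = {}
        for assg, p in P.items():
            key = tuple(assg[i] for i in keep)
            out[key] = out.get(key, 0.0) + p
        return out
    Pabc, Pac, Pbc, Pc = marg(A + B + C), marg(A + C), marg(B + C), marg(C)
    for assg, p in Pabc.items():
        a, b, c = assg[:len(A)], assg[len(A):len(A) + len(B)], assg[len(A) + len(B):]
        if abs(p * Pc[c] - Pac[a + c] * Pbc[b + c]) > tol:
            return False
    return True

def minimal_imap(P, order):
    """Minimal parent set for each variable, smallest subsets tried first."""
    parents = {}
    for i, v in enumerate(order):
        pred, found = order[:i], None
        for k in range(i + 1):
            for U in combinations(pred, k):
                rest = tuple(w for w in pred if w not in U)
                if not rest or indep(P, (v,), rest, U):
                    found = set(U)
                    break
            if found is not None:
                break
        parents[v] = found
    return parents

names = "DISGL"
results = {}
for order in [(0, 1, 2, 3, 4),    # D, I, S, G, L: the original graph
              (4, 2, 3, 1, 0),    # L, S, G, I, D: figure 3.8b
              (4, 0, 2, 1, 3)]:   # L, D, S, I, G: figure 3.8c
    pa = minimal_imap(P, order)
    results[order] = pa
    print("".join(names[v] for v in order), "->",
          sum(len(u) for u in pa.values()), "edges")
```

With these CPDs the three runs report 4, 7 and 9 edges, matching the qualitative story above: the natural ordering recovers the original network, while the other two produce denser, less informative minimal I-maps.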
To address this problem, the next concept we will introduce is the P-map; see the next post in this series.