【机器学习】On the Identifiability of Nonlinear ICA： Sparsity and Beyond

小丫么小阿豪

已于 2023-11-23 01:16:38 修改

阅读量882

点赞数 1

分类专栏：论文阅读文章标签：机器学习人工智能经验分享

于 2023-11-23 01:09:22 首次发布

本文链接：https://blog.csdn.net/qq_43426078/article/details/134566566

版权

论文阅读专栏收录该内容

1 篇文章 0 订阅

订阅专栏

前言

本文是对On the Identifiability of Nonlinear ICA: Sparsity and Beyond (NIPS 2022)中两个结构稀疏假设的总结。原文链接在Reference中。

什么是ICA(Independent component analysis)？

独立成分分析简单来说，就是给定很多的样本X，通过样本分离出组合成样本的源S。

关于ICA的详细内容，可以参考Yifan Shen的博客：ICA简明攻略

非线性独立成分分析

非线性独立成分分析（ICA）旨在从其可观测的非线性混合物中恢复潜在的独立源。非线性独立成分分析（ICA）是无监督学习的基础。它推广了线性ICA（Comon, 1994），用于从观测数据中识别潜在源。这些观测数据是源的非线性混合。对于观测向量 $x$ ，非线性ICA将其表示为 $x = f (s)$ ， $f$ 是一个未知的可逆混合函数， $s$ 是表示（边缘上）独立源的潜在随机向量。非线性ICA问题的困难点在于，在没有额外假设的情况下是不可解的。即在没有额外假设或限制的情况下，把观测变量分解为独立的元素存在无穷种解。同时分离后的元素仍然是源的混合。

现有的研究引入了辅助变量 $u$ （例如，类别标签、域索引、时间索引），并假设在给定 $u$ 的条件下源是条件独立的。关于这部分工作如果感兴趣可以去看 Hyvärinen的talk。

条件独立

$\cdot P(B|C)$
一个很好的例子：
C指赖床，A指熬夜，B指迟到， $\cdot P(B|C)$ ，赖床的发生使得熬夜想通过赖床来影响迟到的可能为零，即没有别的可能来影响迟到了，所以此时熬夜和迟到就是条件独立了。

结构稀疏(Structural Sparsity)假设

Theorem 1. Let the observed data be sampled from a nonlinear ICA model as defined in Eqs. (1) and (2). Suppose the following assumptions hold:

i. Mixing function $f$ is invertible and smooth. Its inverse is also smooth.

ii. For all $\in \{1, \ldots, n\}$ and $\in \mathcal{F}_{i, :}$ , there exist $\{s^{(\ell)}\}_{\ell=1}^{|\mathcal{F}_{i, :}|}$ and $T$ s.t. $\text{span}\{J_f(s^{(\ell)})_{i, :}\}_{\ell=1}^{|\mathcal{F}_{i, :}|} = \mathbb{R}^n_{\mathcal{F}_{i, :}}$ and $[J_f(s^{(\ell)})^T]_{j, :} \in \mathbb{R}^{n}_{\hat{\mathcal{F}}_{i, :}}$ .

iii. $|\hat{\mathcal{F}}| \leq |\mathcal{F}|$ .

iv. (Structural Sparsity) For all $\in \{1, \ldots, n\}$ , there exists $C_k$ such that

$\bigcap_{i \in C_k} \mathcal{F}_{i, :} = \{k\}.$

Then $\hat{f}^{-1} \circ f$ is a composition of a component-wise invertible transformation and a permutation.
![[Pasted image 20231121180751.png]]

符号含义

${J_f(s)}$ 表示对源矩阵求偏导得到的雅可比矩阵，如上图所示。
$KaTeX parse error: Got function '\hat' with no arguments as subscript at position 4: {J_\̲h̲a̲t̲{f}(\hat{s})}$ 表示源矩阵经过变换后得到的矩阵求偏导得到的雅可比矩阵。其实就是表示nonlinear ICA求解后得到的矩阵求偏导。因为在有解的情况下，求解出的源只是真实源的排列组合，因此表示称这种形式。后面会用预测值来表示带hat的量。
$\mathcal{F}$ 表示雅各比矩阵 ${J_f(s)}$ 的support，也就是其不为0的点组成的集合。
$\hat{\mathcal{F}}$ 表示雅各比矩阵 $KaTeX parse error: Got function '\hat' with no arguments as subscript at position 4: {J_\̲h̲a̲t̲{f}(\hat{s})}$ 的support，也就是其不为0的点组成的集合。
$i, :$ 表示的是某一行中所有不为0点的索引。也就是横坐标固定，列坐标任取。
$s^{(l)}$ 表示的是第l个源
$C_k$

理论拆解：

第一条很容易理解。由观测样本X恢复出源S的过程是一个求逆的过程。在 $f$ 是可逆平滑的情况下才可以求逆，求导。
第二条表示的意思是源S可能的组合空间足够大（实数）。这条是为了避免ill-posed conditions。比如雅可比矩阵的某一个部分一直满足后面的假设条件，使得假设虽然成立但仍无法求解。
第三条表示预测雅可比矩阵support的数量小于等于真实雅可比矩阵support support的数量。
第四条表示的意思是，对于某一列 $C_k$ 中的行 $i$ 来说，从support集合 $\mathcal{F}$ 中取行坐标为 $i$ 的所有坐标点，也就是取了某一行上的所有support。依次取出所有行的support，在列方向上求交集。一定存在这样的列，它的列坐标是k，可以保证这列存在不止一个support（存在交集），同时这列存在这些support的对应行只在这列上有交集（意思是某些观测值的共同源只有一个）。概括来说就是，一定至少存在一个源，是某些观测变量的唯一共同源。
满足上述条件的情况下，可以得到对源 $s$ 做变换 $f$ 得到实际的 $x = f (s)$ ，再做逆变换 $f^{-1}$
整个过程是分量逐一逆变换加上排列，也就是component-wise invertible transformation and a permutation。

证明

问题简化

证明的目标是 $\hat{f}^{-1} \circ f$ 为源s的permutation with component-wise invertible变换。也就是： $\hat{f} = f \circ h^{-1}(s)$ . 假设 $D (s)$ 表示一个对角矩阵， $P$ 表示一个排列变换矩阵。通过使用链式法则, 我们可以把 $\hat{f} = f \circ h^{-1}(s)$ 表示为：

$\begin{aligned} J_{\hat{f}}(\hat{s}) &= J_{f \circ h^{-1}}(h(s)) \\ &= J_{f \circ g^{-1} \circ P^{-1}}(Pg(s)) \\ &= J_{f \circ g^{-1}}(P^{-1}Pg(s)) J_{P^{-1}}(Pg(s)) \\ &= J_{f \circ g^{-1}}(g(s)) J_{P^{-1}}(Pg(s)) \\ &= J_{f}(g^{-1}(g(s))) J_{g^{-1}}(g(s)) J_{P^{-1}}(Pg(s)) \\ &= J_{f}(s)D(s)P, \end{aligned} \tag{1}$

其中，g是一个元素可逆函数（invertible element-wise function）。
接着用 $T (s) = D (s) P$ 来表示 $D (s) P$ ， $T (s)$ 是可逆的。由此目标可以转化为证明（2）成立：
$J_{\hat{f}}(\hat{s}) = J_{f}(s)T(s), \tag{2}$
也就是我们希望证明预测得到的雅可比矩阵，只是真实雅可比矩阵的一个排列。
如果变换 $T (s)$ 不是一个简单的排列变换，也就是 $\neq D(s)P$ ，那么（2）就不成立。如果我们能证明 $T (s) = D (s) P$ ，也就能说明（2）成立。
接着我们的目标就转化为证明：
$\tag{3}$

找源雅可比矩阵，预测雅可比矩阵，变换矩阵T之间的关系

用 $\mathcal{F}$ 表示 $J_f(s)$ 的支撑集， $\hat{\mathcal{F}}$ 表示 $J_{\hat{f}}(\hat{s})$ 的支撑集， $\tau$ 表示 $T (s)$ 的支撑集， $T$ 表示具有支撑集 $\tau$ 的矩阵。根据假设 ii，我们可以得到：

$\text{span}\{J_f(s^{(\ell)})_i\}_{\ell=1}^{|\mathcal{F}_{i, ：}|} = \mathbb{R}^n_{\mathcal{F}_{i, ：}} \tag{4}$

由于 $\{J_f(s^{(\ell)})_i\}_{\ell=1}^{|\mathcal{F}_{i, ：}|}$ 形成了 $\mathbb{R}^n_{\mathcal{F}_{i, ：}}$ 的一组基，对于任意 $j_0 \in \mathcal{F}_{i, ：}$ ，我们可以将热独向量 $e_{j_0} \in \mathbb{R}^n_{\mathcal{F}_{i, ：}}$ 重写为：

$e_{j_0} = \sum_{\ell \in \mathcal{F}_{i, ：}} \alpha_{\ell} J_f(s^{(\ell)})_{i, :}, \tag{5}$

其中 $\alpha_{\ell}$ 是对应的系数。则有：

$T_{j_0, :} = e_{j_0} T = \sum_{\ell \in \mathcal{F}_{i, :}} \alpha_{\ell} J_f(s^{(\ell)})_{i, :} \cdot T \in \mathbb{R}^n_{\hat{\mathcal{F}}_{i, :}}, \tag{6}$

公式（6）的“ $\in$ ”符号来自于假设 ii，即（6）求和中的每个元素都属于 $\mathbb{R}^n_{\hat{\mathcal{F}}_{i, ：}}$ 。因此有：

$\forall j \in \mathcal{F}_{i, ：}, \quad T_{j, ：} \in \mathbb{R}^n_{\hat{\mathcal{F}}_{i, ：}} \tag{7}$
到这一步我们推出了 $T_{j, :}$ 是属于预测矩阵的支撑集的。

然后我们可以在支撑集之间建立如下关系：

$\forall (i, j) \in \mathcal{F}, \{i\} \times \tau_{j, :} \subseteq \hat{\mathcal{F}}. \tag{8}$

需要注意，这里的 $\times$ 表示的是combine。意思是横坐标 ${i}$ 和 $\tau_{j, :}$ 表示的纵坐标组合起来。

根据假设 i, $T (s)$ 是一个可逆矩阵，这意味着它有一个非零的行列式值。将矩阵 $T (s)$ 的行列式为它的 Leibniz 公式展开：

$\text{det}(T(s)) = \sum_{\sigma \in S_n} \{\text{sgn}(\sigma) \prod_{i=1}^n T(s)_{i,\sigma(i)}\} \neq 0. \tag{9}$

其中 $S_n$ 是 $n$ 排列的集合。

由于（9）不为0，则存在至少一个求和元素是非零的，表示为：

$\exists \sigma \in S_n, \forall i \in \{1, \ldots, n\}, \text{sgn}(\sigma) \prod_{i=1}^n T(s)_{i,\sigma(i)} \neq 0. \tag{10}$

（10）等价于：

$\exists \sigma \in S_n, \forall i \in \{1, \ldots, n\}, T(s)_{i,\sigma(i)} \neq 0. \tag{11}$

由此我们可以得到 $\sigma(·)$ 在 $T (s)$ 的支持集中，进而有：

$\forall j \in \{1, \ldots, n\}, \sigma(j) \in T_{j, ：} \tag{12}$

结合 (8)，可以得到：

$\forall (i, j) \in \mathcal{F}, (i, \sigma(j)) \in \{i\} \times T_{j, :} \subseteq \hat{\mathcal{F}}. \tag{13}$

表示

$\sigma(\mathcal{F}) = \{(i, \sigma(j)) | (i, j) \in \mathcal{F}\}. \tag{14}$

以上我们得到了 $\mathcal{F}$ , $\hat{\mathcal{F}}$ , $T$ 之间的变换关系。

反证法证明 $T (s) = D (s) P$ 成立

假设 $\neq D(s)P$ ，那么：

$\exists j_1 \neq j_2, \tau_{j_1, :} \cap \tau_{j_2, :} \neq \varnothing. \tag{15}$

此外，考虑 $j_3 \in \{1, \ldots, n\}$ 使得：

$\sigma(j_3) \in \tau_{j_1, :} \cap \tau_{j_2, :}. \tag{16}$

因为 $j_1 \neq j_2$ ，我们可以假设 $j_3 \neq j_1$ 而不失一般性（矩阵通常很大， $j_3 = j_1$ 只是概率非常非常小的一种情况，概率小到 $\frac{1}{n}$ ）。

根据假设 iv, 存在 $j_1 \in C_{j_1}$ 使得：

$\bigcap_{i \in C_{j_1}} \mathcal{F}_{i, :} = \{j_1\}. \tag{17}$

由于：

$j_3 \not\in \{j_1\} = \bigcap_{i \in C_{j_1}} \mathcal{F}_{i, :}, \tag{18}$

那么一定存在 $i_3 \in C_{j_1}$ 使得：

$j_3 \not\in \mathcal{F}_{i_3, :} \tag{19}$

因为 $j_1 \in \mathcal{F}_{i_3, :}$ ，我们有 $(i_3, j_1) \in \mathcal{F}$ 。然后根据公式 (8)，可以得到：

$\{i_3\} \times \tau_{j_1, :} \subseteq \hat{\mathcal{F}}. \tag{20}$

由 $\sigma(j_3) \in \tau_{j_1, ：} \cap \tau_{j_2, ：}$ 可以得到：

$(i_3, \sigma(j_3)) \in \{i_3\} \times \tau_{j_1, ：}. \tag{21}$

通过公式(20) 和 (21)，我们可以得到：

$(i_3, \sigma(j_3)) \in \hat{\mathcal{F}}, \tag{22}$

结合公式 (14) 可知 $(i_3, j_3) \in \mathcal{F}$ ，与等式 (19) 矛盾。

则可知假设 $\neq D(s)P$ 不成立。

则 $T (s) = D (s) P$ ，得证。

不完备情况下的结构稀疏假设

不完备指观测样本的数量远大于源的数量

Theorem 2. Let the observed data be sampled from a linear ICA model defined in Eqs. $(1)$ and $(3)$ with Gaussian sources. Differently, the number of observed variables (denoted as $m$ ) could be larger than that of the sources $n$ , i.e., $\geq n$ . Suppose the following assumptions hold:

i. The nonzero coefficients of the mixing matrix $\mathbf{A}$ are randomly drawn from a distribution that is absolutely continuous with respect to Lebesgue measure.

ii. The estimated mixing matrix $\hat{A}$ has the minimal $L_0$ norm during estimation.

iii. (Structural Sparsity) Given $\subseteq \{1,2,\ldots,n\}$ where $∣ C ∣ > 1$ , let $A_C \in \mathbb{R}^{m \times |C|}$ represents a submatrix of $\in \mathbb{R}^{m \times n}$ consisting of columns with indices $C$ . Then, for all $\in C$ , we have

$\left| \bigcup_{k' \in C} \text{supp}(A_{k'}) \right| - \text{rank}(\text{overlap}(A_C)) > \left| \text{supp}(A_k) \right|.$

Then $\hat{A} = ADP$ with probability one, where $D$ is a diagonal matrix and $P$ is a column permutation matrix.

![[Pasted image 20231122193737.png]]

理论拆解：

第一条：保证了混合矩阵的结构足够复杂，不是特殊的单一结构。
第二条：L0正则化表示的是矩阵中不为0的点的数量。这条表示预测的混合矩阵越稀疏越好。
第三条：support的并集（按行求并，也就是向行投影）中support的数量 - 行重叠子矩阵的秩 > 每一列的support support数量。假设的目的是：每个source影响的观测越少越好，source共同作用越小越好。图中是一个例子， $k = 1$ 并且 $C = \{1,2,3\}$ 。让 $A_C$ 表示图中所示的子矩阵，其中 $\left| \bigcup_{k' \in C} \text{supp}(A_{k'}) \right| = 7.$ 蓝色虚线方块表示 $\text{overlap}(A_C)$ ， $\text{rank}(\text{overlap}(A_C)) = 2$ 。因此， $\left| \bigcup_{k' \in C} \text{supp}(A_{k'}) \right| - \text{rank}(\text{overlap}(A_C)) = 5 \text{（黑点的数量）},$ 比 $\left| \text{supp}(A_1) \right| = 4$ 大。

证明

根据假设（ii），我们考虑以下组合优化问题：

$\hat{U} := \arg \min_{U \in \mathbb{R}^{s \times s}, UU^T=I_s} \|AU\|_0,\tag{1}$

其中 $A$ 是真实的混合矩阵， $\hat{U}$ 表示对应于优化问题解的旋转矩阵。设 $\hat{A} = A\hat{U}$ 。

用反证法，假设 $\hat{A} \neq ADP$ ，那么 $\hat{U} \neq DP$ 。
这意味着存在某个 $\in \{1, \ldots, s\}$ 和它对应的行索引集合 $\mathcal{I}_{j'}$ （ $|\mathcal{I}_{j'}| > 1$ ），使得 $\hat{U}_{i,j'} \neq 0$ 对所有 $\in \mathcal{I}_{j'}$ 成立，并且 $\hat{U}_{i,j'} = 0$ 对所有 $\notin \mathcal{I}_{j'}$ 成立。
由于 $\hat{U}$ 是可逆的并且具有完整的行秩，为了避免列之间的线性依赖，存在唯一对应于 $j^{'}$ 的行索引 $i^{'}$ 。设：

$\hat{U} := \begin{bmatrix} \hat{U}_1 & \ldots & \hat{U}_s \end{bmatrix},$

我们可以得到：

$\|\hat{A}_{j'}\|_0 = \|A\hat{U}_{j'}\|_0 = \left\| \sum_{i \in \mathcal{I}_{j'}} A_{i} \hat{U}_{i,j'} \right\|_0. \tag{2}$

这里的 $\|\cdot\|_0$ 表示 $L_0$ 范数，即矩阵中非零元素的数量。

设 $A_{\mathcal{I}_j}$ 表示由 $A$ 的列索引 $\mathcal{I}_j$ 组成的子矩阵。注意 $A_i$ 表示矩阵 $A$ 的第 $i$ 列。根据假设 (ii, iii)，由于 $|\mathcal{I}_{j'}| > 1$ 并且 $U_{i,j'} \neq 0$ ，我们有:

$\left\|\sum_{i \in \mathcal{I}_{j'}} A_i \hat{U}_{i,j'}\right\|_0 \geq \left| \bigcup_{i \in \mathcal{I}_{j'}}\text{supp}(A_{i}) \right| - \text{rank}(\text{overlap}(A_{\mathcal{I}_{j'}})) > \left| \text{supp}(A_{i'}) \right|,\tag{3}$

其中，项 $\text{rank}(\text{overlap}(A_{\mathcal{I}_{j'}}))$ 表示其所有非零项可能被矩阵的线性组合 $\sum_{i \in \mathcal{I}_{j'}, A_i}$ 消除的行的最大数量。假设 i 排除了违反上述公式的情况，例如， $A$ 的两列具有相同的值和support。进而可以保证：

$\left\|\sum_{i \in \mathcal{I}_{j'}} A_i \hat{U}_{i,j'}\right\|_0 > \left| \text{supp}(A_{i'}) \right| = \left\|A_{i'}\right\|_0 = \left\| A_{i'} \hat{U}_{i',j'} \right\|_0.\tag{4}$

然后我们可以构造 $\tilde{U} := \begin{bmatrix} \tilde{U}_1 & \ldots & \tilde{U}_s \end{bmatrix}$ 。
首先，我们将 $\tilde{U}_{i,j'}$ 设置为列 $\tilde{U}_{j'}$ 中的唯一非零项。为了简便起见，我们可以将 $\tilde{U}_{i,j'}$ 设为 1。对于其他 $\neq j'$ 并且 $\hat{U}_{i,j} \neq 0$ 的列 $\tilde{U}_{j}$ ，我们设 $\tilde{U}_{i,j} = 1$ 。因此：

$\begin{align} \left\|A\hat{U_{j}}\right\|_0 &> \left\|A\tilde{U}_{j}\right\|_0, & j = j', \\ \left\|A\hat{U}_{j}\right\|_0 &= \left\|A\tilde{U}_{j}\right\|_0, & j \neq j'. \end{align} \tag{5}$

由于假设 iii 涵盖了所有列，公式 (4) 对于任何 $\in \{1, \ldots, s\}$ 都成立。如果存在多个 $\hat{U}$ 的列有多于一个非零项，为每一个列计算得到公式（4）。将不同的目标列索引 $j^{'}$ 的集合表示为 $J$ 。对于 $\in J$ ，设 $\hat{U}_{i,j} = 1$ ，其中 $i_j$ 是列 $U_{j}$ 中对应的非零项的唯一索引。对于其他 $\not\in J$ 并且 $U_{i,j} \neq 0$ 的列 $U_{j}$ ，设 $U_{i,j} = 1$ 。则有：

$\left\{ \begin{aligned} \|A{\hat{U}_j}\|_0 &> \|A{\tilde{U}_j}\|_0, & j \in J, \\ \|A{\hat{U}_j}\|_0 &= \|A{\tilde{U}_j}\|_0, & j \notin J. \end{aligned} \right.$

如前所述，每个列索引 $j$ 对应一个唯一的行索引。 $\tilde{U}$ 是一个置换矩阵且 $\tilde{U}\tilde{U}^T = I_s$ 。因此有：

$\left\|A\hat{U}\right\|_0 > \left\|A\tilde{U}\right\|_0，\tag{7}$

根据（1）， $\hat{U}$ 应该是最小值，而（7）说明当 $\hat{U} \neq DP$ 时， $\hat{U}$ 不是最小值。与定义相违背。因此证明 $\hat{U} = DP$ ，也就是 $\hat{A} = ADP$ 成立。

Reference

Zheng, Yujia, Ignavier Ng, and Kun Zhang. “On the identifiability of nonlinear ica: Sparsity and beyond.” Advances in Neural Information Processing Systems 35 (2022): 16411-16422.

小丫么小阿豪

关注

1
点赞
踩
2

收藏

觉得还不错? 一键收藏
打赏
0
评论
【机器学习】On the Identifiability of Nonlinear ICA： Sparsity and Beyond

本文是对On the Identifiability of Nonlinear ICA: Sparsity and Beyond (NIPS 2022)中两个结构稀疏假设的总结。原文链接在Reference中。独立成分分析简单来说，就是给定很多的样本X，通过样本分离出组合成样本的源S。关于ICA的详细内容，可以参考Yifan Shen的博客：ICA简明攻略非线性独立成分分析（ICA）旨在从其可观测的非线性混合物中恢复潜在的独立源。非线性独立成分分析（ICA）是无监督学习的基础。它推广了线性ICA（Comon
复制链接

扫一扫