Overview of Stochastic Grammar

花生啤酒八宝粥

Article link: https://blog.csdn.net/weixin_43823054/article/details/104615643

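Conventions (the following terms are used throughout this article):

Term	Explanation
terminal	A leaf node of the parse graph
nonterminal	A non-leaf node of the parse graph; nonterminals comprise And-nodes and Or-nodes
And–Or graph (AOG)	A graph made of And-nodes and Or-nodes; it acts like a template that enumerates all possible cases
And-Or tree	A tree structure that contains only the branching rules, without concrete content or probabilities
parse graph (pg)	An instantiation of the AOG
parse tree (pt)	A tree that contains only the branching rules, without concrete content or probabilities
visual dictionaries	The vocabulary formed by the primitives found in images
primitives	Basic elements; not necessarily terminal nodes
primal sketch representation	The primal sketch of an image, composed of primitives
configuration	The configuration of a parse graph, i.e. an ordered combination of terminal nodes
SCSG	Stochastic context-sensitive grammar
language	The set of terminal-node configurations that the grammar can generate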

Abstract

The grammar is intended as a unified framework for representing, learning, and recognizing a large number of object categories.

The grammar represents both the hierarchical decompositions from scenes, to objects, parts, primitives and pixels by terminal and nonterminal nodes and the contexts for spatial and functional relations by horizontal links between the nodes.
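- Vertical structure: the grammar describes how terminal and nonterminal nodes decompose a scene hierarchically into objects, parts, primitives, and pixels.
- Horizontal structure: horizontal links between nodes encode the context, i.e. spatial and functional relations.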
The grammar is embodied in a simple And–Or graph representation where each Or-node points to alternative sub-configurations and an And-node is decomposed into a number of components. This representation supports recursive top-down/bottom-up procedures for image parsing under the Bayesian framework and make it convenient to scale up in complexity.
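- The grammar is embodied in a simple AOG representation: each Or-node points to alternative sub-configurations, and each And-node decomposes into several components. Within a Bayesian framework, this representation supports recursive top-down and bottom-up image parsing and is easy to scale up.
- In other words, given an image, the output is the most probable parse graph, which is built from And-node decompositions and Or-node selections.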
A probabilistic model is defined on this And–Or graph representation to account for the natural occurrence frequency of objects and parts as well as their relations. This model is learned from a relatively small training set per category and then sampled to synthesize a large number of configurations to cover novel object instances in the test set. This generalization capability is mostly missing in discriminative machine learning methods and can largely improve recognition performance in experiments.
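- The probabilistic model is defined on the And-Or graph and captures how frequently nodes occur and how they relate to one another.
- The model is learned from a small training set and then sampled to compose new configurations that cover novel objects in the test set.
- This generalization ability sets the approach apart from discriminative machine learning methods, which cannot compose new object instances.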
To fill the well-known semantic gap between symbols and raw signals, the grammar includes a series of visual dictionaries and organizes them through graph composition. At the bottom-level the dictionary is a set of image primitives each having a number of anchor points with open bonds to link with other primitives. These primitives can be combined to form larger and larger graph structures for parts and objects. The ambiguities in inferring local primitives shall be resolved through top-down computation using larger structures. Finally these primitives forms a primal sketch representation which will generate the input image with every pixels explained.
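- To bridge the semantic gap between symbolic labels and raw signal data, the grammar includes a series of visual dictionaries and organizes them through graph composition.
- The bottom level of the dictionary holds the image primitives together with the bonds that link them. Combining primitives produces ever larger graph structures *(although, in my view, the result is not necessarily a valid object)*.
- When the inference of local primitives is ambiguous, the ambiguity should be resolved by top-down computation using larger, more global structures.
- Together, the primitives form the primal sketch representation.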

The proposal grammar integrates three prominent representations in the literature: stochastic grammars for composition, Markov (or graphical) models for contexts, and sparse coding with primitives (wavelets). It also combines the structure-based and appearance based methods in the vision literature.

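The grammar integrates three prominent representations from the literature:

Stochastic grammars for composition
Markov (graphical) models for context
Sparse coding with primitives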

1.Introduction

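Current problems:

Problem 1: the number of object categories is enormous, so how should objects be defined (that is, distinguished and recognized)?

Problem 2: the computational cost is huge, especially for multi-object recognition.

Problem 3: the biggest obstacle for early syntactic and structural methods was the gap between raw image pixels and symbolic labels: symbols extracted from raw images were unreliable. The field therefore turned to PCA, AAM, appearance-based recognition, image pyramids, wavelet transforms, machine learning, and similar methods.

Main algorithms:

1. Minimax entropy learning scheme

2. Maximum likelihood estimation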

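Image parsing example:

The relation between an image and its parse graph is as follows:

A pg is a tree structure containing terminal and nonterminal nodes. The ordered combination of its terminal nodes forms a configuration, and most terminal nodes represent image features.

The following is an example of the relations in an AOG: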

2.Background

2.1 The traditional Formulation of Grammar

Grammar

A grammar is defined as a 4-tuple:
$\mathcal{G}=\left(V_{N}, V_{T}, \mathrm{R}, S\right)$
where

$V_{N}$ is a finite set of nonterminal nodes;

$V_{T}$ is a finite set of terminal nodes;

$\mathrm{R}$ is a set of production rules: $\mathrm{R}=\{\gamma: \alpha \rightarrow \beta\}$ ;

$S$ is the start symbol at the root.

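Let $\alpha, \beta \in\left(V_{N} \cup V_{T}\right)^{+}$ be strings of terminal or nonterminal nodes, where $\alpha$ contains at least one nonterminal node. Four types of grammar are distinguished by the form of their rules:

type 3: $A \rightarrow a B$ or $A \rightarrow a$ , where $a \in V_{T}$ and $A, B \in V_{N}$ (finite-state or regular grammar)

type 2: $A \rightarrow \beta$ , where $A \in V_{N}$ and $\beta$ is a string of nodes $\beta=\left(\beta_{1}, \beta_{2}, \ldots, \beta_{m}\right)$ ; this is called a context-free grammar

type 1: $\xi A \eta \rightarrow \xi \beta \eta$ , i.e. $A$ is rewritten as $\beta$ in the context of $\xi$ and $\eta$ (context-sensitive grammar)

type 0: no constraint on $\alpha$ and $\beta$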

Language

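The set of all terminal strings that the grammar $\mathcal{G}$ can generate is called its language:
$\mathbf{L}(\mathcal{G})=\left\{\omega: S \stackrel{\mathrm{R}^{*}}{\longrightarrow} \omega, \omega \in V_{T}^{*}\right\}$
where $\omega=\left(\omega_{1}, \omega_{2}, \ldots, \omega_{n}\right)$ is a string of terminal nodes and $\mathrm{R}^{*}$ denotes a sequence of production rules leading from the start symbol $S$ to $\omega$ :
$S \stackrel{\gamma_{1}, \gamma_{2}, \ldots, \gamma_{n(\omega)}}{\longrightarrow} \omega$
The figure below illustrates a language.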
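The definition of $\mathbf{L}(\mathcal{G})$ can be made concrete with a minimal sketch that enumerates the language of a tiny grammar up to a length bound. The grammar below (symbols `S`, `A`, `a`, `b`) and the function name `language` are hypothetical illustrations, not taken from the article; the sketch also assumes that no rule has an empty right-hand side.

```python
from collections import deque

# Hypothetical toy grammar G = (V_N, V_T, R, S); it is not taken from the article.
V_N = {"S", "A"}
V_T = {"a", "b"}
R = {
    "S": [("a", "A"), ("b",)],  # S -> a A | b
    "A": [("a", "A"), ("b",)],  # A -> a A | b
}
START = "S"


def language(max_len):
    """Return every terminal string of length <= max_len derivable from START."""
    found = set()
    seen = {(START,)}
    queue = deque([(START,)])
    while queue:
        form = queue.popleft()
        # Index of the leftmost nonterminal, or None if the form is all terminals.
        idx = next((i for i, sym in enumerate(form) if sym in V_N), None)
        if idx is None:
            found.add("".join(form))
            continue
        for rhs in R[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # No rule shrinks a form, so over-long forms can be pruned safely.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(found, key=lambda w: (len(w), w))


print(language(3))  # ['b', 'ab', 'aab']
```

Every rule here has the form $A \rightarrow aB$ or $A \rightarrow a$, so this toy grammar is of type 3.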

Parse Tree

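For grammars of type 1 to 3, the parse tree of a terminal string $\omega$ is defined as the sequence of production rules that generates it:
$\mathbf{p} \mathbf{t}(\omega)=\left(\gamma_{1}, \gamma_{2}, \ldots, \gamma_{n(\omega)}\right)$
The figure below shows an And-Or tree (AOt), which is assembled from parse trees: all the individual parse trees are merged into one general And-Or tree.

Characteristic of the tree: it contains only rules, not concrete nodes.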

Overlapping Reusable Parts

Merging two parse graphs (pg) that share parts yields an And-Or graph (AOG).

2.2 Stochastic Grammar

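To connect the grammar to real-world signals, a set of probabilities $\mathcal{P}$ is added to the traditional grammar above as a fifth component.

Consider the most common case, the stochastic context-free grammar (SCFG). Let $A \in V_{N}$ be a nonterminal node with a number of alternative production rules:
$A \rightarrow \beta_{1}\left|\beta_{2}\right| \cdots | \beta_{n(A)}, \quad \gamma_{i}: A \rightarrow \beta_{i}$
Each production rule is assigned a probability
$p\left(\gamma_{i}\right)=p\left(A \rightarrow \beta_{i}\right)$
such that, for every nonterminal node $A$ ,
$\sum_{i=1}^{n(A)} p\left(A \rightarrow \beta_{i}\right)=1$
This corresponds to what statistics calls a random branching process, and it is analogous to a Markov chain.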
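The branching-process view can be illustrated by sampling a string from an SCFG: each nonterminal picks one of its rules according to the rule probabilities and then expands recursively. The grammar, the name `RULES`, and the function `sample` below are hypothetical choices made for this sketch.

```python
import random

# Hypothetical SCFG: S -> a S (p = 0.5) | b (p = 0.5).
RULES = {"S": [(("a", "S"), 0.5), (("b",), 0.5)]}


def sample(symbol, rng):
    """Expand `symbol` recursively, choosing each rule with its probability."""
    if symbol not in RULES:  # terminal node: nothing to expand
        return [symbol]
    options, weights = zip(*RULES[symbol])
    rhs = rng.choices(options, weights=weights, k=1)[0]
    return [t for child in rhs for t in sample(child, rng)]


rng = random.Random(0)  # fixed seed so that the run is repeatable
string = sample("S", rng)
```

For this particular grammar every sampled string is a run of `a` symbols followed by a single `b`, and the recursion stops with probability 1 because each expansion ends with probability 0.5.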

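The probability of a parse tree pt that yields the terminal string $\omega$ is defined as:
$p(\mathbf{p} \mathbf{t}(\omega))=\prod_{j=1}^{n(\omega)} p\left(\gamma_{j}\right)$
where $\omega$ is a string of terminal nodes, i.e. a configuration. The probability of a parse tree is the product of the probabilities of all branching rules used in the tree.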

Several parse trees may yield the same configuration $\omega \in \mathbf{L}(\mathcal{G})$ , so the probability of a terminal string is the sum over all of its parse trees:
$p(\omega)=\sum_{\mathrm{pt}(\omega)} p(\mathbf{p t}(\omega))$
The language of a stochastic grammar $\mathcal{G}=\left(V_{N}, V_{T}, \mathrm{R}, S, \mathcal{P}\right)$ can therefore be defined as:
$\mathbf{L}(\mathcal{G})=\left\{(\omega, p(\omega)): S \stackrel{\mathrm{R}^{*}}{\longrightarrow} \omega, \omega \in V_{T}^{*}\right\}$
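As a small worked example of the two formulas above, the sketch below uses a deliberately ambiguous, hypothetical SCFG ($S \rightarrow S\,S$ with probability 0.4, $S \rightarrow a$ with probability 0.6). It enumerates all parse trees of the string $a^n$, multiplies rule probabilities to get $p(\mathbf{pt})$, and sums over the trees to get $p(\omega)$. The rule table `P` and the function names are assumptions made for illustration.

```python
from math import prod

# Hypothetical SCFG: S -> S S (p = 0.4) | a (p = 0.6).
BRANCH = ("S", ("S", "S"))
LEAF = ("S", ("a",))
P = {BRANCH: 0.4, LEAF: 0.6}


def parse_trees(n):
    """All parse trees yielding a^n, each written as a tuple of rules."""
    if n == 1:
        return [(LEAF,)]
    trees = []
    for k in range(1, n):  # split a^n into a^k followed by a^(n-k)
        for left in parse_trees(k):
            for right in parse_trees(n - k):
                trees.append((BRANCH,) + left + right)
    return trees


def p_tree(pt):
    """p(pt): product of the probabilities of the rules in the tree."""
    return prod(P[rule] for rule in pt)


def p_string(n):
    """p(omega): sum of p(pt) over all parse trees of a^n."""
    return sum(p_tree(pt) for pt in parse_trees(n))
```

For $n = 3$ there are two parse trees, each using the branching rule twice and the leaf rule three times, so $p(\omega) = 2 \times 0.4^2 \times 0.6^3 = 0.06912$.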
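The probabilities $\mathcal{P}$ can be learned in a supervised way, by maximum likelihood estimation from a set of observed parse trees $\left\{\mathbf{p} \mathbf{t}_{m}: m=1,2, \ldots, M\right\}$ :
$\mathcal{P}^{*}=\arg \max \prod_{m=1}^{M} p\left(\mathbf{p} \mathbf{t}_{m}\right)$
The solution is quite intuitive. The branching probability of each nonterminal node $A$ is the relative frequency of the rule:
$p\left(A \rightarrow \beta_{i}\right)=\frac{\#\left(A \rightarrow \beta_{i}\right)}{\sum_{j=1}^{n(A)} \#\left(A \rightarrow \beta_{j}\right)}$
where $\#\left(A \rightarrow \beta_{i}\right)$ is the number of times the rule $A \rightarrow \beta_{i}$ occurs in all the observed parse trees.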
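The counting estimate can be sketched in a few lines. Parse trees are represented here as lists of `(lhs, rhs)` rules; this representation, the function name `estimate_rule_probs`, and the three toy trees are assumptions made for illustration, not data from the article.

```python
from collections import Counter, defaultdict


def estimate_rule_probs(parse_trees):
    """Maximum likelihood estimate: p(A -> beta) = #(A -> beta) / sum_j #(A -> beta_j)."""
    counts = Counter(rule for pt in parse_trees for rule in pt)
    totals = defaultdict(int)
    for (lhs, _rhs), c in counts.items():
        totals[lhs] += c  # total number of expansions of each nonterminal
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}


# Hypothetical observed parse trees.
observed = [
    [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("runs",))],
    [("S", ("NP", "VP")), ("NP", ("he",)), ("VP", ("runs",))],
    [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("sleeps",))],
]
probs = estimate_rule_probs(observed)
```

In this toy data `NP -> she` occurs in two of the three expansions of `NP`, so its estimated probability is 2/3, and the estimates for each left-hand side sum to 1.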

2.3 Stochastic Grammar with Context

Let $\omega=\left(\omega_{1}, \omega_{2}, \ldots, \omega_{n}\right)$ . Bi-gram statistics count the frequency $h\left(\omega_{i}, \omega_{i+1}\right)$ of all adjacent word pairs, which leads to a Markov chain model for $\omega$ :
$p(\omega)=h\left(\omega_{1}\right) \prod_{i=1}^{n-1} h\left(\omega_{i+1} | \omega_{i}\right)$
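A minimal sketch of this bi-gram model: the initial distribution $h(\omega_1)$ and the transition frequencies $h(\omega_{i+1} | \omega_i)$ are estimated by counting, and $p(\omega)$ is their product along the sequence. The toy corpus and the names `fit_bigram` and `p_markov` are hypothetical; unseen words or pairs simply get probability 0 here, with no smoothing.

```python
from collections import Counter


def fit_bigram(corpus):
    """Estimate h(w1) and h(next | prev) by relative frequencies."""
    first = Counter(seq[0] for seq in corpus)
    pairs = Counter((a, b) for seq in corpus for a, b in zip(seq, seq[1:]))
    prev = Counter(a for seq in corpus for a in seq[:-1])
    h1 = {w: c / len(corpus) for w, c in first.items()}
    h = {(a, b): c / prev[a] for (a, b), c in pairs.items()}  # h[(a, b)] = h(b | a)
    return h1, h


def p_markov(seq, h1, h):
    """p(w) = h(w1) * prod_i h(w_{i+1} | w_i)."""
    p = h1.get(seq[0], 0.0)
    for a, b in zip(seq, seq[1:]):
        p *= h.get((a, b), 0.0)
    return p


# Hypothetical training sequences.
corpus = [["a", "b", "c"], ["a", "b", "b"], ["b", "c", "c"]]
h1, h = fit_bigram(corpus)
```

With this corpus, $h(a) = 2/3$, $h(b | a) = 1$ and $h(c | b) = 2/3$, so `p_markov(["a", "b", "c"], h1, h)` evaluates to $4/9$.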
Adding the factors $h^{*}\left(\omega_{i+1} | \omega_{i}\right)$ and renormalizing combines the parse tree model with the bi-gram model:
$p(\mathbf{p} \mathbf{t}(\omega))=\frac{1}{Z} h^{*}\left(\omega_{1}\right) \prod_{i=1}^{n-1} h^{*}\left(\omega_{i+1} | \omega_{i}\right) \cdot \prod_{j=1}^{n(\omega)} p\left(\gamma_{j}\right)$
The probability of the whole parse tree can now be rewritten in Gibbs form:
$p(\mathbf{p} \mathbf{t}(\omega) ; \Theta)=\frac{1}{Z} \exp \left\{-\sum_{j=1}^{n(\omega)} \lambda\left(\gamma_{j}\right)-\sum_{i=1}^{n-1} \lambda\left(\omega_{i+1} | \omega_{i}\right)\right\}$
where $\lambda\left(\gamma_{j}\right)=-\log p\left(\gamma_{j}\right)$ and $\lambda\left(\omega_{i+1} | \omega_{i}\right)=-\log h^{*}\left(\omega_{i+1} | \omega_{i}\right)$ are parameters contained in $\Theta$ .
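The step from the product form to the Gibbs form is just a change of variables, $\lambda = -\log(\cdot)$. The sketch below checks this numerically for the unnormalized weight, ignoring the constant $Z$ and the initial factor $h^{*}(\omega_1)$, neither of which appears inside the exponent. The input numbers and function names are made up for illustration.

```python
import math


def product_form(rule_probs, pair_factors):
    """Unnormalized weight: prod_j p(gamma_j) * prod_i h*(w_{i+1} | w_i)."""
    return math.prod(rule_probs) * math.prod(pair_factors)


def gibbs_form(rule_probs, pair_factors):
    """Same weight written as exp(-energy), with lambda = -log(factor)."""
    energy = sum(-math.log(p) for p in rule_probs)
    energy += sum(-math.log(h) for h in pair_factors)
    return math.exp(-energy)


# Hypothetical rule probabilities and pairwise factors for one parse tree.
rule_probs = [0.5, 0.2]
pair_factors = [0.9, 0.3]
```

Both functions return the same value (0.027 for the inputs above, up to floating-point error), which is why the two expressions define the same model once normalized.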

3.The definition of concept

3.1 Visual vocabulary

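The visual vocabulary (visual dictionary) is defined as:
$\Delta=\left\{\left(\Phi_{i}\left(x, y ; \alpha_{i}\right), \beta_{i}\right):(x, y) \in \Lambda_{i}\left(\alpha_{i}\right) \subset \Lambda\right\}$
where $\Phi_{i}\left(x, y ; \alpha_{i}\right)$ is an image function (an image patch), $(x, y)$ are pixel coordinates, $\alpha_{i}$ is a vector of attributes covering geometry (scale, pose, deformation, etc.) and appearance (intensity, profile, reflectance, etc.), and $\beta_{i}=\left(\beta_{i, 1}, \ldots, \beta_{i, d(i)}\right)$ are the addresses of the bonds that link the element to other nodes.

For example, $\Delta_{\text {cloth }}=\left\{\left(\Phi_{i}^{\text {cloth }}\left(x, y ; \alpha_{i}\right), \beta_{i}\right): \forall i, \alpha_{i}, \beta_{i}\right\}$ is illustrated in the figure below: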

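Types of visual dictionary:

Image Primitives: the atomic elements of an image, which together make up the primal sketch
Basic Geometric Groupings: basic geometric groups composed of primitives
Parts and Objects: higher-level visual information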

3.2 Relations

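A relation is defined as:
$E=\{(s, t ; \gamma, \rho): s, t \in S\}$
where $\{(s, t)\} \subset S \times S$ are the links from node $s$ to node $t$ , $\gamma=\gamma(s, t)$ is the structure that binds $s$ and $t$ , and $\rho=\rho(s, t)$ measures the compatibility between $s$ and $t$ . $\langle S, E\rangle$ is the graph representation of the relation $E$ on the node set $S$ .

There are three main types of relations, of increasing abstraction, for the horizontal links and context: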

Relation type 1 : Bonds and connections

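For a node set $V=\left\{A_{i}: i=1,2, \ldots, n\right\}$ , each node has a number of bonds $\left\{\beta_{i j}: j=1,2, \ldots, n(i)\right\}$ ( $\beta_{i j}$ is the $j$ -th bond of the $i$ -th node). All bonds are collected into one set:
$S_{\mathrm{bond}}=\left\{\beta_{i j}: i=1,2, \ldots, n, j=1,2, \ldots, n(i)\right\}$
Two bonds $\beta_{i j}$ and $\beta_{k l}$ can be linked if their positions and orientations match:
$E_{\mathrm{bond}}(S)=\left\{\left(\beta_{i j}, \beta_{k l} ; \gamma, \rho\right)\right\}$
that is, the $j$ -th bond of the $i$ -th node is linked to the $l$ -th bond of the $k$ -th node. Here $\gamma=(x, y, \theta)$ gives the position and orientation, and $\rho$ is a function measuring the consistency of intensity and color between the two bonds.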
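One way to picture the bond relation is a simple geometric test: two bonds are linked when their positions and orientations agree within a tolerance. The sketch below is only an illustration under assumptions of my own: the `Bond` record, the tolerance parameters `pos_tol` and `ang_tol`, and the rule that bonds of the same node are never linked are all hypothetical, and the compatibility function $\rho$ (intensity and color) is left out.

```python
import math
from typing import NamedTuple


class Bond(NamedTuple):
    node: int     # index i of the owning node A_i
    slot: int     # index j of the bond on that node
    x: float
    y: float
    theta: float  # orientation in radians


def can_link(b1, b2, pos_tol=1.0, ang_tol=0.1):
    """True if the two bonds match in position and orientation (assumed criterion)."""
    if b1.node == b2.node:  # assumption: a node does not bond with itself
        return False
    dist = math.hypot(b1.x - b2.x, b1.y - b2.y)
    # Smallest absolute angle difference, wrapped into [0, pi].
    dtheta = abs((b1.theta - b2.theta + math.pi) % (2 * math.pi) - math.pi)
    return dist <= pos_tol and dtheta <= ang_tol


def bond_edges(bonds, **tol):
    """Build E_bond as a list of linked bond pairs."""
    return [(a, b) for i, a in enumerate(bonds) for b in bonds[i + 1:]
            if can_link(a, b, **tol)]


# Hypothetical bonds: the first two nearly coincide, the third is far away.
bonds = [Bond(1, 1, 0.0, 0.0, 0.0), Bond(2, 1, 0.5, 0.0, 0.05), Bond(3, 1, 10.0, 10.0, 0.0)]
edges = bond_edges(bonds)
```

Here only the first two bonds are linked, so `edges` contains a single pair.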

Relation type 2 : Joint and junctions

This type of relation mainly describes adjacency (shared planes, shared line segments, etc.).

Relation type 3 : Object interactions and semantics

This type mainly expresses direct relations between objects, such as support and occlusion.

The example figure can be described by the following relations:
$\begin{array}{l}{E_{\text {supp }}=\{\langle M, D\rangle,\langle M, E\rangle\}} \\ {E_{\text {occld }}=\{\langle D, M\rangle,\langle E, M\rangle,\langle D, N\rangle,\langle E, N\rangle\}}\end{array}$
Objects in a scene can be related in many ways, so more than one description may apply.

Low-level relations are densely connected, whereas high-level relations are sparse.

3.3 Configuration

Definition:

A configuration is a spatial layout at a certain level of abstraction, written as a graph $\mathcal{C}=\langle V, E\rangle$ .

If $V$ is a set of sketches, $\mathcal{C}$ is a primal sketch configuration;

if $E$ is a union of several relations, $\mathcal{C}$ is a mixed configuration.

Configurations likewise come at three levels:

Sketch level: $V$ is a set of image primitives with bonds and $E=E_{\mathrm{bond}}$ (see figure).
Parts-to-object level (see figure).
Scene level: the high-level scene configuration (see figure).

The scene in the figure involves two relations. The first is adjacency:
$E_{\mathrm{adj}}=\{(\text{sky}, \text{field}),(\text{head}, \text{body})\}$
The second is occlusion:
$E_{\text {contain }}=\{\langle\text{head}, \text{sky}\rangle,\langle\text{head}, \text{field}\rangle,\langle\text{body}, \text{field}\rangle\}$
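A configuration $\langle V, E\rangle$ maps naturally onto a small data structure. The sketch below stores the two relations of the scene example, reading the round brackets as unordered pairs and the angle brackets as ordered pairs; that reading, the class name `Configuration`, and its fields are my own assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    V: set
    # relation name -> set of unordered pairs (stored as frozensets)
    undirected: dict = field(default_factory=dict)
    # relation name -> set of ordered pairs (s, t)
    directed: dict = field(default_factory=dict)

    def related(self, name, s, t):
        """True if (s, t) belongs to the relation called `name`."""
        if name in self.undirected:
            return frozenset((s, t)) in self.undirected[name]
        return (s, t) in self.directed.get(name, set())


scene = Configuration(
    V={"sky", "field", "head", "body"},
    undirected={"adj": {frozenset(("sky", "field")), frozenset(("head", "body"))}},
    directed={"contain": {("head", "sky"), ("head", "field"), ("body", "field")}},
)
```

With this representation, `scene.related("adj", "field", "sky")` holds in either order, while `scene.related("contain", "sky", "head")` does not, because only the pair with `head` first is stored.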

3.4 Parse Graph

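Definition:

A parse graph (pg) consists of a vertical tree structure pt together with a number of relations $E$ :
$\mathbf{p} g=(\mathbf{p} t, E)$
All nonterminal nodes in pt are And-nodes. A production rule expands a node $A$ into a configuration:
$\gamma: A \rightarrow \mathcal{C}=\langle V, E\rangle$
The parse tree as a whole is given by the sequence of these production rules:
$\mathbf{p} t(\omega)=\left(\gamma_{1}, \gamma_{2}, \ldots, \gamma_{n}\right)$
The horizontal links comprise a number of directed or undirected relations:
$E=E_{r_{1}} \cup E_{r_{2}} \cup \cdots \cup E_{r_{k}}$
A pg can produce a series of configurations at different levels of abstraction:
$\mathbf{p} g \Longrightarrow \mathcal{C}$
In this way relations are passed down from high levels to low levels, and the final configuration is the pixel-level parse of the image.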

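The task of image parsing is to compute a pg from an image $I$ . The optimal solution can be expressed as the maximum a posteriori (MAP) estimate:
$\mathbf{p} g^{*}=\arg \max p(\mathbf{p} g | I)$
Alternatively, a set of parse graphs can be sampled from the posterior:
$\left\{\mathbf{p} g_{i}: i=1,2, \ldots, K\right\} \sim p(\mathbf{p} g | I)$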
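The difference between the two outputs can be shown on a toy posterior over a handful of candidate parse graphs. In practice the space of parse graphs is far too large to enumerate, so this is only a sketch of the two definitions; the candidate names and probabilities are invented.

```python
import random


def map_estimate(posterior):
    """pg* = argmax_pg p(pg | I) over an explicit candidate table."""
    return max(posterior, key=posterior.get)


def sample_parse_graphs(posterior, k, rng):
    """Draw K parse graphs from p(pg | I)."""
    names = list(posterior)
    weights = [posterior[name] for name in names]
    return rng.choices(names, weights=weights, k=k)


# Hypothetical posterior p(pg | I) for one image.
posterior = {"pg_round_clock": 0.6, "pg_square_clock": 0.3, "pg_other": 0.1}
best = map_estimate(posterior)
samples = sample_parse_graphs(posterior, k=5, rng=random.Random(0))
```

The MAP estimate returns the single most probable interpretation, whereas sampling keeps several plausible interpretations, which is useful when the image is ambiguous.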
Image parsing example:

The figure shows two clocks being parsed into two parse graphs, both of which are derived from the AOG.

3.5 And-Or graph

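A parse graph (pg) describes one specific image, whereas the AOG contains all possible parse graphs and thus embodies the image grammar.

The AOG is defined as the 5-tuple:
$G_{\text {and }-\text { or }}=\langle S, V_{N}, V_{T}, \mathcal{R}, \mathcal{P}\rangle$
where $S$ is the root node, representing a scene or an object;

$V_{N}=V^{\text {and }} \cup V^{\text {or }}$ is the set of nonterminal nodes;

$V_{T}$ is the set of terminal nodes (for example, objects at low resolution that cannot be decomposed further);

$\mathcal{R}$ is a set of relations between nodes;

$\mathcal{P}$ is the probability model defined on the AOG.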

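Each component is explained below:

1. Non-terminal nodes

The nonterminal nodes comprise And-nodes and Or-nodes: $V_{N}=V^{\text {and }} \cup V^{\text {or }}$

where $V^{\text {and }}=\left\{u_{1}, \ldots, u_{m(u)}\right\}, \quad V^{\text {or }}=\left\{v_{1}, \ldots, v_{m(v)}\right\}$

An Or-node acts as a switch that decides which branch to follow. A switch variable $\omega(v)$ is defined for each $v \in V^{\text {or }}$ ; its value is the index of the selected child:
$\omega(v) \in\{\emptyset, 1,2, \ldots, n(v)\}$
Fixing the switch variables at the Or-nodes extracts a parse graph from the AOG.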
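The role of the switch variables can be sketched on a toy And-Or graph: And-nodes expand into all of their children, Or-nodes follow the single child selected by $\omega(v)$, and the leaves reached form the configuration. The clock-like graph, the node names, and the function names below are hypothetical, and the sketch covers only the vertical tree structure (the horizontal relations $E$ are omitted).

```python
# Hypothetical toy And-Or graph; all node names are illustrative only.
AND = {
    "clock": ["frame", "hands"],
    "two_hands": ["hour_hand", "minute_hand"],
    "three_hands": ["hour_hand", "minute_hand", "second_hand"],
}
OR = {
    "frame": ["round_frame", "square_frame"],
    "hands": ["two_hands", "three_hands"],
}


def extract(node, omega):
    """Return the parse tree (node, children) selected by the switch values omega."""
    if node in AND:   # And-node: keep every component
        return (node, [extract(child, omega) for child in AND[node]])
    if node in OR:    # Or-node: keep only the child indexed by omega(v)
        choice = omega.get(node)  # None plays the role of the empty choice
        if choice is None:
            return (node, [])
        return (node, [extract(OR[node][choice - 1], omega)])
    return (node, [])  # terminal node


def terminals(tree):
    """Collect the terminal nodes of a parse tree from left to right."""
    node, children = tree
    if node not in AND and node not in OR:
        return [node]
    return [t for child in children for t in terminals(child)]


omega = {"frame": 1, "hands": 2}  # switch settings, 1-based as in the text
tree = extract("clock", omega)
```

With these switch settings, `terminals(tree)` is `['round_frame', 'hour_hand', 'minute_hand', 'second_hand']`; changing `omega` yields a different parse tree from the same graph.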

2. Terminal nodes

The terminal nodes $V_{T}=\left\{t_{1}, \ldots, t_{m(T)}\right\}$ are a set of instances drawn from the image dictionary $\Delta$ , usually represented by graph templates $(\Phi(x, y ; \alpha), \beta)$ .

3. Configuration

The set of configurations produced from the root node is the language of the grammar $G_{\text {and }-\text { or }}$ :
$\mathbf{L}\left(G_{\text {and }-\text { or }}\right)=\Sigma=\left\{\mathcal{C}_{k}: S \stackrel{G_{\text {and }-\text { or }}}{\Longrightarrow} \mathcal{C}_{k}, \; k=1,2, \ldots, N\right\}$
Each configuration satisfies $\mathcal{C} \in \Sigma$ . For the figure below,