2019.Jan
LWEC - Dong, 2018
MIA - Jinyu Chen & Shihua Zhang, 2018
⚪Joint NMF, 2012 (multi-dimensional data)
Discovery of multi-dimensional modules by integrative analysis of cancer genomic data - Shihua Zhang, 2012
Linear?
K<min(M,N)
⚪SNMNMF, 2011 (multi-dimensional & network data)
A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules - Shihua Zhang, 2011
Dataset: 1、TCGA (gene and miRNA expression data) 2、GO biological process 3、KEGG pathways 4、MicroCosm (miRNA-gene network)
Comodule assignment:
Z-score
Evaluation(Functional Analysis)
1、Statistical significance: (p-value of the Pearson's correlation coefficients)
2、Biological significance: (miRBase-miRNA enrichment; Gene Ontology(GO) biological process(BP)-Gene enrichment; KEGG-metabolism pathway; NCBI gene ID-gene index)
3、Network & literature analysis: (IPA-Ingenuity Pathway Analysis; Genes Dev, Cancer Res, BMC Cancer, Journal of Cancer)
4、Clinical data: from the TCGA portal website (Kaplan-Meier survival analysis method)
5、Compare with other methods: EBC method(Peng,X. et al. (2009) Computational identification of hepatitis C virus associated microRNA-mRNA regulatory modules in human livers)
⚪sMBPLS, 2012 (multi-dimensional labeled data)
Identifying multi-layer gene regulatory modules from multi-dimensional genomic data - Shihua Zhang, 2012
PLS
PLS works well for data with small sample sizes and a large number of parameters. It is both a dimension reduction and a regression approach.
Partial least squares: a versatile tool for the analysis of high-dimensional genomic data - Boulesteix, 2006
Purpose
1、Give a systematic overview of the PLS methods
2、Review the broad range of applications to genomic data
Modeling(PLS)
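First, center the variables
so that the columns of X and Y have zero mean. The model is $X = TP^T + E$, $Y = TQ^T + F$ with $T = XW$, hence $Y = XWQ^T + F = XB + F$ where $B = WQ^T$.
X holds the predictor variables and Y the response variables. T can be regarded as a latent component matrix used to construct both X and Y, and T is a linear combination of X with coefficient matrix W. Q and P are the loading matrices, E and F are the random error matrices, and B is the regression coefficient matrix.
The space spanned by the columns of T is more important than the columns themselves, because for any nonsingular matrix $M$, $X = TP^T + E = (TM)(M^{-1}P^T) + E$,
and likewise $Y = TQ^T + F = (TM)(M^{-1}Q^T) + F$.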
There are four forms of objective functions:
1、Univariate response (PLS1)
Maximize the squared sample covariance between each latent component and the response (i.e. the strongest linear association), subject to unit-length weight vectors and mutually uncorrelated latent components.
Multivariate response
2、PLS2
is the Moore-Penrose inverse
*3、Statistically Inspired Modification of PLS (SIMPLS)
Maximize the covariance between the latent variables of X & Y
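The PLS1 criterion in item 1 can be sketched for the first component only. With centered data, the unit-length weight vector that maximizes the squared sample covariance between t = Xw and y is proportional to X^T y. This is a minimal illustration, not the procedure of any particular PLS variant; the function name and the toy data are assumptions.

```python
import numpy as np

def pls1_first_component(X, y):
    """First PLS1 weight vector w and latent component t = Xw (illustrative)."""
    Xc = X - X.mean(axis=0)        # center the predictors
    yc = y - y.mean()              # center the response
    w = Xc.T @ yc                  # direction of maximal covariance with y
    w = w / np.linalg.norm(w)      # enforce the unit-length constraint
    t = Xc @ w                     # latent component
    return w, t

# Toy data with few samples and many predictors (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=20)
w, t = pls1_first_component(X, y)
```

Later components would additionally need the uncorrelatedness constraint, usually handled by deflating X, which this sketch leaves out.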
Modeling(sMBPLS)
sMBPLS identifies the linear (covariance) structure between multiple (3) predictor matrices and a response matrix with sparse regularization.
Objective function:
Derivation
the terms of and
are constants, so they do not affect the maximizer of the objective function, and they also make it convenient to optimize with soft thresholding.
1、for a fixed and
, setting the derivative to zero, and
2、for a fixed and
, with soft thresholding and
3、for a fixed and
, as above
Then, update until convergence.
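The soft-thresholding step used in these updates can be sketched as below. Only the generic operator is shown: the exact sMBPLS update and its penalty parameters are not given in these notes, so `lam` is a hypothetical threshold and the final rescaling to unit length is an assumption about how a sparse weight vector would be normalized.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink each entry of a toward zero by lam; entries with |a_i| <= lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

a = np.array([3.0, -0.5, 1.2, -2.0])   # unpenalized update (made-up numbers)
w = soft_threshold(a, 1.0)             # sparse version: the small entry is zeroed
w_unit = w / np.linalg.norm(w)         # assumed rescaling to unit length
```

Larger values of `lam` zero out more entries, which is how the sparsity of a module is controlled.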
Note that each run of the iterative procedure identifies one module. After it converges, deflate the matrices by subtracting the module's signal so that another module can be identified.
Remove the module's signal with:
⚪SNPLS, 2016
Integrative analysis for identifying joint modular patterns of gene-expression and drug-response data - Shihua Zhang, 2016
Purpose
PLS with sparsity and network regularization.
Modeling
Construct the summary vector
Objective function
subject to
Derivation
1、Fix g, with
then ,
▲2、Fix d; first write out the objective function in terms of g:
regularized form
gradient at
Setting the gradient to 0 gives
where and
is the jth vector of
update for k=1,2,...,n;
Supplement
⚪Determine k of NMF
1. Cophenetic correlation coefficient (Brunet et al., 2004; Zitnik and Zupan, 2015)
2. Distance between X & WH (Kim and Tidor, 2003; Zitnik and Zupan, 2015; Zitnik et al., 2015 )
3. The learned basis matrix W achieves the lowest instability across different initial starting points (Wu et al., 2016)
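Criterion 2 (distance between X and WH) can be sketched as a scan over candidate ranks. The NMF below uses plain Lee-Seung multiplicative updates as a stand-in, and the synthetic matrix, iteration count and seeds are assumptions for illustration; the cited papers may use other solvers and other distance measures.

```python
import numpy as np

def nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Rank-k NMF of a nonnegative matrix X via multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update coefficients
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update basis
    return W, H

# Synthetic nonnegative matrix with exact rank 3 (made up for illustration).
rng = np.random.default_rng(1)
X = rng.random((30, 3)) @ rng.random((3, 40))

errors = {}
for k in range(1, 6):
    W, H = nmf(X, k)
    errors[k] = np.linalg.norm(X - W @ H)      # Frobenius distance between X and WH
```

One would look for the rank after which the error stops dropping sharply. Since the error generally keeps shrinking as k grows, it is usually read together with a stability criterion such as 1 or 3.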
#Multi-view clustering
*Accounting for tumor purity improves cancer subtype classification - Zhang, 2017
Background
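1、Hypermethylated genes are not expressed or are expressed only at very low levels, which causes tumor suppressor genes to lose their function; hypomethylation can activate oncogenes. Almost all types of cancer cells show abnormal DNA methylation.
2、Tumor purity is the proportion of tumor cells in a tumor tissue sample. The gold standard for purity estimation is ABSOLUTE (or InfiniumPurify)
Absolute quantification of somatic DNA alterations in human cancer
3、The formation of a malignant tumor is a long-term, multi-factor, multi-stage process. It requires mutations in several proto-oncogenes and the inactivation of several tumor suppressor genes, together with alterations in apoptosis-regulation and DNA-repair genes.
4、*Tumors that originate at the same site are highly heterogeneous: they are identical only pathologically and still differ at a finer level. Dividing tumors into subtypes according to clinical and molecular data is a core step of the analysis.
Purpose
1、The purity of tumor samples biases clustering results: tumor samples with similar purity tend to cluster together.
2、Directly clustering mixed tissue made of cancer cells and normal cells gives biased results.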
Modeling
1、
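Solve for β with the normal equations: ,
After obtaining  and
we get:
Estimate: ,
A theoretical argument shows that introducing the purity factor reduces the within-group variance of the tumor samples.
2、The arcsine transform is more nearly linear than the logistic transform, and the transformed data are closer to a normal distribution.
Normal samples , pure tumor samples
This gives the distribution of the mixed tumor samples:
It becomes a K-component Gaussian mixture model; the parameters to estimate are
The clustering result can be obtained with the EM algorithm, with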
2、q-value
Evaluated metrics
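Between-class evaluation, clustering accuracy: the proportion of correctly clustered samples among all samples. Tabulate the clustering result against the reference set in a K*K table whose element (i,j) counts the samples that belong to true subtype i and to cluster j. Permute the rows and columns of the table until the diagonal sum reaches its maximum; that sum divided by the total number of samples is the clustering accuracy.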
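The clustering accuracy described above can be sketched by trying every matching of cluster labels to true subtypes and keeping the one with the largest diagonal sum. This brute-force version is only practical for small K, and it assumes the number of clusters does not exceed the number of true subtypes. For larger K the same maximization is an assignment problem, which `scipy.optimize.linear_sum_assignment` solves efficiently. The function name and example labels are made up.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Best fraction of samples on the diagonal of the K*K contingency table.

    Assumes no more clusters than true subtypes (small K only).
    """
    true_ids = sorted(set(true_labels))
    clus_ids = sorted(set(cluster_labels))
    # table[(i, j)] = number of samples with true subtype i and cluster j
    table = {(i, j): 0 for i in true_ids for j in clus_ids}
    for t, c in zip(true_labels, cluster_labels):
        table[(t, c)] += 1
    best = 0
    # Try every one-to-one assignment of clusters to true subtypes.
    for perm in permutations(true_ids, len(clus_ids)):
        matched = sum(table[(t, c)] for t, c in zip(perm, clus_ids))
        best = max(best, matched)
    return best / len(true_labels)

truth   = [0, 0, 0, 1, 1, 1, 2, 2]
cluster = [2, 2, 1, 0, 0, 0, 1, 1]
acc = clustering_accuracy(truth, cluster)   # 7 of 8 samples can be matched
```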
Self-Representative Manifold Concept Factorization with Adaptive Neighbors for Clustering - MA,2018
Purpose
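NMF cannot accept negative inputs, and the data structure it captures depends only on the original input data, so it cannot fit the structure of different inputs well. This paper proposes an algorithm that handles negative inputs and detects the intrinsic structure of the data.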
Modeling
Document Clustering by Concept Factorization - Xu, 2004
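First, Concept Factorization (CF) uses the data matrix $X$ itself as the feature matrix and factorizes out a coefficient matrix. $R = XW$ is the concept matrix formed by linear combinations of the data points in $X$, and $V$ can be viewed as the coordinates of $X$ projected into the concept space R, carrying the structural information of $X$. So the approximate factorization of $X$ (approximate because the dimensionality of the concept space may lose information in $X$) is:
$r_c = \sum_j w_{jc} x_j$,
$x_i \approx \sum_c v_{ic} r_c$, i.e. $X \approx XWV^T$,
The corresponding objective function is: $\min_{W, V \ge 0} \|X - XWV^T\|_F^2$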
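Graph Regularized Nonnegative Matrix Factorization for Data Representation - Cai, 2011
Although the CF above accepts negative inputs, it still cannot detect the intrinsic structure of the data, so a graph regularizer is introduced to best approximate the original structure of the data.
Graph regularization first requires defining the affinity matrix of the data, after which the familiar graph Laplacian
appears once again, giving the regularization term: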
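Advantages of GR: GR regularizes over the P nearest neighbors of each sample point, which preserves the similarity structure (local relations) among the data points. From the first equality above, the coefficient matrix is fixed during minimization and
keeps only the nearest points, which means that points adjacent in the sample space
are mapped to points in the feature space
that are also adjacent (because
is large).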
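Equally important, even if the distance metric is poorly chosen, GR can use the coefficient matrix to reduce the resulting error. For data distributed as in the figure below, Euclidean distance is not appropriate, but the neighborhoods selected by the coefficient matrix ensure that Euclidean distance is only applied over small ranges, where it remains valid. Overall, GR is useful because it captures the original structure of the data.
[Figure: manifold]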
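The graph regularizer discussed above can be sketched numerically. The code builds a P-nearest-neighbor affinity matrix, forms the graph Laplacian L = D - S, and checks the standard identity Tr(V^T L V) = 1/2 * sum_ij S_ij ||v_i - v_j||^2. The names S and V, the 0/1 weighting and the random data are assumptions for illustration; the paper may weight edges differently.

```python
import numpy as np

def knn_affinity(X, p):
    """Symmetric 0/1 affinity linking each point to its p nearest neighbors."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(d2[i])[1:p + 1]   # position 0 is the point itself
        S[i, nearest] = 1.0
    return np.maximum(S, S.T)                  # symmetrize

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))    # samples in the original space
V = rng.normal(size=(12, 2))    # their low-dimensional representation
S = knn_affinity(X, p=3)
L = np.diag(S.sum(axis=1)) - S  # graph Laplacian

trace_form = np.trace(V.T @ L @ V)
pair_form = 0.5 * sum(S[i, j] * np.sum((V[i] - V[j]) ** 2)
                      for i in range(12) for j in range(12))
```

The pairwise form shows why minimizing the term pulls the representations of neighboring samples together: only pairs with nonzero affinity contribute.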
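One problem remains: the coefficient matrix used is predefined and is affected by the input data. So the authors further propose a probabilistic coefficient matrix rule with an adaptive neighborhood structure. Since, as above, nearby data points get large coefficients (a high probability of being neighbors), we have:
where the regularization term prevents the trivial solution in which all of the probability 1 is assigned to the nearest point. This completes the modeling part of the algorithm.
However, the evaluation section shows that NMF performs much better than CF (possibly because of the datasets), so an SRAN model built on NMF is worth considering.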
Evaluation
(PIN)Perturbation clustering for data integration - Tin, 2017
Purpose
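1、Integrate meaningful data types (integration of multiple data types)
2、Distinguish tumor subtypes (subtype discovery)
Curse of dimensionality: as the dimensionality grows, samples become increasingly sparse in the space. To keep sufficient coverage, the number of samples has to grow (exponentially); otherwise the model overfits and predictive performance drops. (For example, in an attribute space with 1000 in total, where the samples must cover 60% of the whole: using one attribute requires 600 samples, two attributes require 774, and three attributes require 843.)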
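The numbers in this example can be reproduced under one reading of it: to cover a fraction c of a d-dimensional unit cube, the covered range along each attribute must be c^(1/d), so with 1000 values per attribute one needs about 1000 * c^(1/d) of them. This interpretation and the function name are assumptions.

```python
import math

def samples_needed(total, coverage, d):
    """Values needed per attribute so that a d-dimensional cube covers `coverage`."""
    # The small offset guards against floating-point results just below an integer.
    return math.floor(total * coverage ** (1.0 / d) + 1e-9)

needed = [samples_needed(1000, 0.6, d) for d in (1, 2, 3)]
print(needed)
```

The required share per attribute climbs toward 100% as d grows, which is the sparsity effect described above.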
Modeling
⚪Perturbation clustering
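Partition the patients for every candidate number of clusters k = 2, ..., K
with k-means, giving (K-1) partitions
, then build the connectivity matrix
*Then generate H perturbed datasets by adding Gaussian noise to the original data E. ,
(Setting the noise variance to the median of the feature variances avoids both insufficient perturbation from too little noise and destruction of the data structure from too much noise. It also generalizes the data, and in my view it weakens the effect of the curse of dimensionality to some extent.)
Then build the connectivity matrix , and the perturbed connectivity matrix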
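The perturbation step can be sketched as follows. A connectivity matrix has entry 1 when two patients share a cluster and 0 otherwise. The noise variance is the median feature variance, as noted above. Averaging the H perturbed connectivity matrices, the minimal Lloyd-style `kmeans` stand-in, and the parameters `H` and `seed` are assumptions, since the exact formulas are missing from these notes.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means returning one label per row (illustrative stand-in)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):            # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def connectivity_matrix(labels):
    """Entry (i, j) is 1 if patients i and j are in the same cluster, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def perturbed_connectivity(E, cluster_fn, k, H=20, seed=0):
    """Average connectivity over H noisy copies of E (rows: patients)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(np.median(E.var(axis=0)))  # median feature variance
    A = np.zeros((E.shape[0], E.shape[0]))
    for _ in range(H):
        noisy = E + rng.normal(0.0, sigma, size=E.shape)
        A += connectivity_matrix(cluster_fn(noisy, k))
    return A / H

rng = np.random.default_rng(1)
E = np.vstack([rng.normal(0, 1, (15, 5)), rng.normal(6, 1, (15, 5))])
C = connectivity_matrix(kmeans(E, 2))          # original connectivity
A = perturbed_connectivity(E, kmeans, 2)       # perturbed connectivity
D = np.abs(C - A)                              # difference matrix
```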
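Stability assessment
Calculate the difference matrix ; the more its distribution shifts toward 1, the less robust the clustering is. Then compute the cumulative distribution function (CDF)
, and calculate the area (
) under the CDF curve . Choose the optimal
. Because the optimal number of clusters is less affected by perturbation, the nonzero entries of the difference matrix are small, the CDF curve jumps to 1 quickly, and its area under the curve (AUC) is the largest. This gives a first estimate of the optimal number of clusters and the optimal partition.
*This amounts to treating the connectivity matrix obtained under perturbation as the gold standard when computing the AUC.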
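The stability score can be sketched as the area under the empirical CDF of the difference-matrix entries over [0, 1]. Using all entries of the matrix and trapezoidal integration on a fixed grid are assumptions, and the two difference matrices below are random stand-ins rather than real results.

```python
import numpy as np

def cdf_auc(D, grid=1001):
    """Area under the empirical CDF of the entries of D, integrated over [0, 1]."""
    values = np.sort(np.asarray(D, dtype=float).ravel())
    xs = np.linspace(0.0, 1.0, grid)
    cdf = np.searchsorted(values, xs, side="right") / values.size
    return float(np.sum((cdf[1:] + cdf[:-1]) / 2.0 * np.diff(xs)))   # trapezoid rule

rng = np.random.default_rng(0)
# Hypothetical difference matrices for two candidate cluster numbers.
diffs = {
    2: rng.uniform(0.0, 0.1, size=(20, 20)),   # small differences: stable clustering
    3: rng.uniform(0.0, 0.8, size=(20, 20)),   # large differences: unstable clustering
}
auc = {k: cdf_auc(D) for k, D in diffs.items()}
best_k = max(auc, key=auc.get)                 # the k with the largest AUC
```

For entries in [0, 1] this area equals one minus the mean entry, so a larger AUC simply means the perturbation changed the connectivity less on average.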
⚪Subtyping multi-omic data
Step 1-data integration and subtyping
Input the matrices of T data types, then construct T original connectivity matrices
and T perturbed matrices
.
Hence we have the combined similarity matrix and the combined perturbed matrices
.
represents the distance between patients; dynamic tree cut, hierarchical clustering (HC), or partitioning around medoids (PAM) can then be applied to the pair-wise distances.
&
can determine the cluster number for HC and PAM.
Finally, similar to the pair-wise agreement of the Rand Index, we calculate the agreement between the data types ; if agree(Sc) > 50%, the T data types are said to agree strongly.
We then choose the result of the clustering algorithm that has the highest agreement.
Step 2-further splitting discovered groups
Case One
When the connectivity of the data types is consistent in step 1 (i.e. agree(Sc) > 50%), check the consistency within each group with the same procedure as in step 1, and split the group further if the optimal partitionings strongly agree.
Case Two
When the data types are not consistent, we avoid unbalanced clustering by attempting to further split each group based on two conditions.
First, the normalized entropy evaluates how balanced the groups are, if , with
,
.
*Second, we use the gap statistic to check whether the data can be further clustered. .
If the gap statistic returns 1, we do not have enough evidence to split the group.
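The first condition above can be illustrated with the usual definition of normalized entropy: the Shannon entropy of the group proportions divided by log(k), which is 1 for perfectly balanced groups and approaches 0 as one group dominates. The exact formula and threshold used in the paper are missing from these notes, so this standard definition is an assumption.

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Shannon entropy of the group sizes, scaled to [0, 1] by log(number of groups)."""
    counts = Counter(labels)
    k = len(counts)
    if k < 2:
        return 0.0                      # a single group carries no balance information
    n = len(labels)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)

balanced = normalized_entropy([0, 0, 1, 1])        # equal group sizes
unbalanced = normalized_entropy([0] * 9 + [1])     # one dominant group
```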
(SNF)Similarity network fusion for data integration - Wang, 2014
Purpose
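1、Integrate data by fusing the similarity networks of the samples: (1) can derive useful information from small samples; (2) robust to selection bias and data noise; (3) can extract complementary information in the data.
Evaluation Metrics
1、Silhouettes (evaluate how compact a partition clustering is & indicate a suitable number of clusters)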
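The silhouette value can be sketched from its standard definition: for sample i, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and s(i) = (b - a) / max(a, b). Euclidean distance, the singleton convention and the toy points are assumptions; at least two clusters are required.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette values with Euclidean distance (needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                  # exclude the sample itself
        if not same.any():
            continue                     # singleton cluster: s(i) = 0 by convention
        a = dist[i, same].mean()         # cohesion within the own cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])   # nearest other cluster
        s[i] = (b - a) / max(a, b)
    return s

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_values(points, [0, 0, 1, 1])   # compact, well-separated clusters
bad = silhouette_values(points, [0, 1, 0, 1])    # clusters that cut across the gap
```

Averaging the values for each candidate number of clusters and keeping the largest mean is the usual way to pick a cluster number with this metric.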