#Citation
##BibTeX
@article{GUMUS201323,
title = "Multi objective SNP selection using pareto optimality",
journal = "Computational Biology and Chemistry",
volume = "43",
pages = "23--28",
year = "2013",
issn = "1476-9271",
doi = "https://doi.org/10.1016/j.compbiolchem.2012.12.006",
url = "http://www.sciencedirect.com/science/article/pii/S1476927112001156",
author = "Ergun Gumus and Zeliha Gormez and Olcay Kursun",
keywords = "Feature selection, Principal component analysis (PCA), Mutual information (MI), Genomic–geographical distance, Human Genome Diversity Project SNP dataset"
}
##Normal
Ergun Gumus, Zeliha Gormez, Olcay Kursun,
Multi objective SNP selection using pareto optimality,
Computational Biology and Chemistry,
Volume 43,
2013,
Pages 23-28,
ISSN 1476-9271,
https://doi.org/10.1016/j.compbiolchem.2012.12.006.
(http://www.sciencedirect.com/science/article/pii/S1476927112001156)
Keywords: Feature selection; Principal component analysis (PCA); Mutual information (MI); Genomic–geographical distance; Human Genome Diversity Project SNP dataset
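#Abstract
Biomarker discovery
SNP: single nucleotide polymorphism
Traditional single-objective approach: maximize classification accuracy
Two objectives here:
1 High classification accuracy
2 Correlation between the genetic diversity of ethnic groups and geographical distance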
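The paper's title points to Pareto optimality as the way to trade off these two objectives. As a rough illustration of the idea (not the paper's actual procedure), the sketch below keeps the candidates that are not dominated on two scores, both assumed to be maximized. The function name `pareto_front` and the example scores are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming both objectives are maximized.

    A point p is dominated if some other point q is at least as good on both
    objectives and strictly better on at least one.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (accuracy score, geography-correlation score) pairs.
candidates = [(0.90, 0.40), (0.85, 0.60), (0.80, 0.55), (0.70, 0.75)]
print(pareto_front(candidates))
```

Here `(0.80, 0.55)` is dropped because `(0.85, 0.60)` is better on both scores; the remaining points each win on at least one objective.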
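#Main content
Dataset:
Human Genome Diversity Project (HGDP) SNP dataset
1064 individuals
52 populations
Raw data:
1043 individuals
Each individual: 660,918 SNPs (163 come from mitochondrial DNA and are excluded), so 660,755 are used
Each SNP: 2 alleles, encoded as: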
$\left\{ -1, 0, 1 \right\}$
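As a small illustration of such an encoding, the sketch below maps biallelic genotypes to $\{-1, 0, 1\}$. The specific mapping (homozygous for the first allele to -1, heterozygous to 0, homozygous for the second allele to 1) and the function name `encode_genotype` are assumptions made for illustration; the notes do not say which genotype receives which value.

```python
def encode_genotype(genotype: str, alleles: tuple) -> int:
    """Encode a biallelic genotype such as "AG" as -1, 0, or 1.

    Assumed mapping (not stated in the notes):
      homozygous for alleles[0] -> -1
      heterozygous              ->  0
      homozygous for alleles[1] ->  1
    """
    first, second = alleles
    if len(genotype) != 2 or any(a not in (first, second) for a in genotype):
        raise ValueError(f"unexpected genotype {genotype!r} for alleles {alleles}")
    # Count copies of the second allele (0, 1, or 2) and shift to -1, 0, 1.
    return genotype.count(second) - 1


print([encode_genotype(g, ("A", "G")) for g in ["AA", "AG", "GA", "GG"]])
```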
Objective 1:
High classification accuracy: mutual information (MI)
$H$: the entropy of a random variable
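Mutual information can be written in terms of entropies as $I(X;Y) = H(X) + H(Y) - H(X,Y)$. A minimal sketch of scoring one SNP against class labels follows. The function names and toy data are illustrative, and the plug-in (frequency-count) estimator is an assumption, since the notes do not say how the paper estimates MI.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Toy data: one SNP in the {-1, 0, 1} encoding and a population label per individual.
snp = [-1, -1, 0, 0, 1, 1]
labels = ["A", "A", "B", "B", "C", "C"]
print(mutual_information(snp, labels))
```

In this toy case the SNP determines the label exactly, so the MI equals the label entropy; an SNP that is independent of the labels would score close to zero.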
Objective 2:
Genomic-geographical correlation: principal component analysis (PCA)
Because the dimensionality is high, the "dimensionality trick" is applied to PCA
$C$: the $D \times D$ covariance matrix
$Y$: the $N \times D$ centered data matrix, with $N \ll D$
$k_i$: eigenvector $i$ (of $C$)
Multiply both sides by $Y$:
$v_i = Y k_i$ is the $i$-th eigenvector of the covariance matrix $YY^T$
Multiply both sides by $Y^T$,
which gives:
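Written out, and assuming $C = Y^T Y$ up to a constant factor (the eigenvalue symbol $\lambda_i$ is introduced here for illustration), the step reads:

$$YY^T v_i = \lambda_i v_i \;\Rightarrow\; Y^T Y \, (Y^T v_i) = \lambda_i \, (Y^T v_i)$$

so $Y^T v_i$ is an eigenvector of the $D \times D$ matrix $Y^T Y$ with the same eigenvalue, and only the much smaller $N \times N$ matrix $YY^T$ has to be decomposed. A minimal NumPy sketch of this check on random data follows; the sizes, seed, and variable names are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 500                       # few samples, many features (N << D)
X = rng.normal(size=(N, D))
Y = X - X.mean(axis=0)               # centered N x D data matrix

# Eigendecomposition of the small N x N matrix Y Y^T.
eigvals, V = np.linalg.eigh(Y @ Y.T)
order = np.argsort(eigvals)[::-1]    # sort eigenvalues in descending order
eigvals, V = eigvals[order], V[:, order]

# Map each v_i back to D dimensions: Y^T v_i is an eigenvector of Y^T Y.
top = 5
K = Y.T @ V[:, :top]
K /= np.linalg.norm(K, axis=0)       # normalize each column to unit length

# Check (Y^T Y) k_i = lambda_i k_i without ever forming the D x D matrix.
residual = Y.T @ (Y @ K) - K * eigvals[:top]
print(np.abs(residual).max())        # should be tiny relative to eigvals[0]
```

The design point is that the $D \times D$ matrix is never built: every product goes through $Y$ or $Y^T$, so memory stays on the order of $N \times D$.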