Gene definitions fall broadly into two categories:
GAD: Gene associated with Mendelian disorder; GADs include genes that meet criteria for definitive, strong, or moderate evidence for association with disease as described by ClinGen
GUS: Gene of uncertain significance; GUSs include genes that meet the ClinGen categories of limited or disputed evidence
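The Clinical Genome Resource (ClinGen, www.clinicalgenome.org) curates roughly 600 genes (https://search.clinicalgenome.org/kb/gene-validity) [8]. The database classifies each gene separately for each disease. Genes classified as GAD must be included in the panel.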
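As a minimal sketch of the GAD/GUS split described above, the following maps ClinGen validity labels to the two categories and keeps every gene that has at least one GAD-level curation. The function names and the example gene symbols are hypothetical, and treating a gene as GAD when any one of its disease curations is GAD-level is an assumption, not something ClinGen prescribes.

```python
# ClinGen gene-disease validity labels, grouped as in the definitions above.
GAD_LEVELS = {"definitive", "strong", "moderate"}
GUS_LEVELS = {"limited", "disputed"}


def classify_gene(validity):
    """Map one ClinGen validity label to 'GAD', 'GUS', or 'unclassified'."""
    level = validity.strip().lower()
    if level in GAD_LEVELS:
        return "GAD"
    if level in GUS_LEVELS:
        return "GUS"
    return "unclassified"


def select_panel_genes(curations):
    """Return sorted gene symbols that have at least one GAD-level curation.

    `curations` is an iterable of (gene, validity) pairs; a gene may appear
    several times because it is curated separately for each disease.
    Assumption: one GAD-level curation is enough to include the gene.
    """
    return sorted({gene for gene, validity in curations
                   if classify_gene(validity) == "GAD"})


# Hypothetical curations, for illustration only.
example = [
    ("GENE_A", "Definitive"),
    ("GENE_B", "Limited"),
    ("GENE_B", "Strong"),
    ("GENE_C", "Moderate"),
    ("GENE_D", "Disputed"),
]
panel = select_panel_genes(example)
```

Here `GENE_B` is kept because one of its two curations is GAD-level, while `GENE_D` is left out.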
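1: The other main consideration is which variant types you will detect: SNVs (required), indels (required), CNAs, and SVs. The panel must cover gene hotspot regions (for example, exons 9 and 20 of PIK3CA, exon 15 of BRAF, exons 18 to 21 of EGFR, or exons 12 and 14 of JAK2). You may also choose to cover the entire coding and non-coding regions of a few key genes (KRAS, NRAS, TP53).
2: If copy-number detection is in scope, several common events are clinically meaningful: losses of TP53, PTEN, CDKN2A, and RB1, and gains of ERBB2 (HER2), MET, RICTOR, and MDM2.
3: SV detection mainly concerns gene fusions, such as RET/PTC, TMPRSS2/ERG, and EML4/ALK. Whether working from DNA or RNA (ctDNA), the breakpoints fall in intronic regions, so the design should extend the targets outward by at least 20 bp.
4: Introns and exons can be tiled at different probe densities.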
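To make the exon/intron distinction concrete, here is a rough sketch of bait tiling. It uses the figures quoted from [11] further down in these notes: 120 bp baits, 60 bp overlap for exons, 20 bp overlap for introns, and at least three baits per target. Centering the tiling on the target is an assumption of this sketch, and `tile_baits` is a hypothetical helper; the article does not describe bait placement at this level of detail, and repeat filtering is not modeled.

```python
BAIT_LEN = 120                           # bait length in bp, as quoted from [11]
OVERLAP = {"exon": 60, "intron": 20}     # overlap between neighbouring baits
MIN_BAITS = 3                            # minimum number of baits per target


def tile_baits(start, end, kind):
    """Tile the half-open target [start, end) with overlapping baits.

    Returns a list of (bait_start, bait_end) half-open intervals.
    Assumption: the tiling is centered on the target, so baits may extend
    past either edge (and could go below 0 near a contig start).
    """
    step = BAIT_LEN - OVERLAP[kind]
    length = end - start
    # Baits needed to span the target: ceil((length - BAIT_LEN) / step) + 1.
    extra = max(length - BAIT_LEN, 0)
    n_baits = max(MIN_BAITS, -(-extra // step) + 1)
    span = BAIT_LEN + (n_baits - 1) * step
    first = start - (span - length) // 2
    return [(first + i * step, first + i * step + BAIT_LEN)
            for i in range(n_baits)]


exon_baits = tile_baits(1000, 1200, "exon")      # baits start every 60 bp
intron_baits = tile_baits(1000, 1200, "intron")  # baits start every 100 bp
```

For the same 200 bp target, the exon tiling places baits every 60 bp and the intron tiling every 100 bp, so the exon receives denser coverage from the same number of baits.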
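5: Titration test: test a gradient of DNA input amounts. One article used four input levels, given as 75bp, 100bp, 150bp, and 200bp, across 4 × 7 samples in total. Once the test is complete, you need to state both the minimum input amount and the recommended input amount for NGS. Higher input generally yields a lower duplication rate, so the titration test should produce three plots similar to the following: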
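6: Reproducibility
CAP accreditation generally requires you to do the sequencing yourself. The same sample can be sequenced three times, that is, in 3 runs, with variant allele frequencies spanning 0 to 0.7. In the example below, 17 samples were examined, each with 3 independent replicates, for a total of 17 × 3 × 3 = 153 experiments.
After the experiments, the results should look like the figure below: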
7: Lower limit of detection
Twelve samples with tumor purity of 80%-100% were diluted to 100%, 50%, and 20%, again in triplicate, giving the following results:
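When planning such a dilution series, it can help to estimate the allele frequency each variant should show at every dilution step. The sketch below assumes the diluent (for example, matched normal DNA) carries none of the variant, so the expected frequency scales linearly with the dilution fraction; the notes do not say what the samples were diluted with. The 5% calling threshold is borrowed from the base-substitution cutoff quoted from [11] later in these notes, and the function names are hypothetical.

```python
DILUTIONS = (1.0, 0.5, 0.2)   # the 100%, 50%, 20% series described above


def expected_vaf(undiluted_vaf, dilution):
    """Expected allele frequency after dilution.

    Assumption: the diluent contributes only reference alleles, so the
    frequency scales linearly with the fraction of original sample.
    """
    return undiluted_vaf * dilution


def dilution_table(undiluted_vaf, call_threshold=0.05):
    """Return (dilution, expected VAF, expected to be callable) per step."""
    rows = []
    for dilution in DILUTIONS:
        vaf = expected_vaf(undiluted_vaf, dilution)
        rows.append((dilution, vaf, vaf >= call_threshold))
    return rows


# A variant observed at 20% allele frequency in the undiluted sample.
rows = dilution_table(0.20)
```

Under these assumptions, a variant at 20% in the undiluted sample is expected near 10% at the 50% step and near 4% at the 20% step, which falls below the assumed 5% threshold.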
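8: Data traceability
FASTQ, BAM, VCF
9: Accepted sample types (see the expert consensus)
10: Target region description: present it as tables, following the FoundationOne description (Tables 2 and 3)
11: An example of sample sequencing QC metrics [10]
12: For variant interpretation, see reference [6]
13: According to reference [7], current bioinformatics pipelines generally analyze indels of length <= 21 bp
14: When no real data is available, you can use BAMSurgeon (https://github.com/adamewing/bamsurgeon/) to spike simulated variants into existing data and test your analysis pipeline first
15: One more article is especially worth consulting [11]: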
1: On panel design: "Baits were designed by taking overlapping 120 bp DNA sequence intervals covering target exons (60 bp overlap) and introns (20 bp overlap), with a minimum of three baits per target; SNP targets were allocated one bait each. Intronic baits were filtered for repetitive elements as defined by the UCSC Genome RepeatMasker track."
2: This article uses GATK for variant calling and filters SNPs and indels differently, so it is a useful reference. For base substitutions: "Final calls are made at MAF ≥ 5% (MAF ≥ 1% at hotspots)". For indels, the thresholds are: "Filtering of indel candidates was carried out as described for base substitutions above (strand bias P < 1e-10, MAF ≥ 3% at hotspots), with an empirically increased MAF threshold at repeats and adjacent sequence quality metrics as implemented in GATK: percentage of neighboring base mismatches <25%, average neighboring base quality >25, average number of supporting read mismatches ≤2."
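3: For gene fusions, this article requires support from at least 10 chimeric read pairs ("clusters containing at least 10 chimeric pairs")
4: To assess cross-sample contamination, the article selected 5,801 SNPs overlapping the panel ("marked coding-synonymous, missense, or nonsense") in a homozygous (MAF > 90%) or heterozygous (40% ≤ MAF ≤ 60%) state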
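The homozygous and heterozygous bands in the last point can be sketched as a small classifier. Only the two MAF bands come from the quoted text. Counting the SNPs that fall outside both bands is an illustrative use, not the article's stated contamination method, and `genotype_state` and `summarize_snps` are hypothetical names.

```python
from collections import Counter


def genotype_state(maf):
    """Classify one SNP by allele frequency, using the bands quoted from [11].

    Returns 'homozygous' (MAF > 90%), 'heterozygous' (40% <= MAF <= 60%),
    or None when the frequency falls outside both bands.
    """
    if maf > 0.90:
        return "homozygous"
    if 0.40 <= maf <= 0.60:
        return "heterozygous"
    return None


def summarize_snps(mafs):
    """Count SNPs per state; 'off_band' collects frequencies in neither band.

    Assumption: off-band SNPs are merely tallied here. How many of them
    would indicate contamination is not specified in these notes.
    """
    return Counter(genotype_state(maf) or "off_band" for maf in mafs)


# Hypothetical allele frequencies at panel SNPs in one sample.
summary = summarize_snps([0.98, 0.51, 0.47, 0.25, 0.93, 0.72])
```

For this hypothetical sample, two SNPs land in each of the homozygous and heterozygous bands and two fall in neither.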
References
1: CAP Accreditation Program-Molecular Pathology Checklist.pdf
2: De novo request for evaluation of automatic class III designation for the MSK-IMPACT