2021-01-17

最新推荐文章于 2022-12-17 23:11:53 发布

转载最新推荐文章于 2022-12-17 23:11:53 发布 · 1.8k 阅读

1 ·

CC 4.0 BY-SA版权

原文链接：https://blog.csdn.net/monk1992/article/details/82347464

机器学习同时被 2 个专栏收录

2 篇文章

订阅专栏

生信

1 篇文章

订阅专栏

本文介绍了一种评估PacBio MHC变异检测与Hiseq数据一致性的方法，并通过编写Python脚本来统计不同测序深度下真阳性率、假阳性率等关键指标。

医学、机器学习等等，在统计结果时时长会用到这两个指标来说明数据的特性。

定义

敏感性：在金标准判断有病（阳性）人群中，检测出阳性的几率。真阳性。（检测出确实有病的能力）
特异性：在金标准判断无病（阴性）人群中，检测出阴性的几率。真阴性。（检测出确实没病的能力）
假阳性率：得到了阳性结果，但这个阳性结果是假的。即在金标准判断无病（阴性）人群中，检测出为阳性的几率。（没病，但却检测结果说有病），为误诊率。
假阴性率：得到了阴性结果，但这个阴性结果是假的。即在金标准判断有病（阳性）人群中，检测出为阴性的几率。（有病，但却检测结果说没病），为漏诊率。

计算方法

Sensitivity and specificity：完整定义

True Positive （真正, TP）被模型预测为正的正样本；可以称作判断为真的正确率
 
True Negative（真负 , TN）被模型预测为负的负样本 ；可以称作判断为假的正确率
 
False Positive （假正, FP）被模型预测为正的负样本；可以称作误报率
 
False Negative（假负 , FN）被模型预测为负的正样本；可以称作漏报率
 
True Positive Rate（真正率 , TPR）或灵敏度（sensitivity）
TPR = TP /（TP + FN）
正样本预测结果数 / 正样本实际数
 
True Negative Rate（真负率 , TNR）或特指度（specificity）
TNR = TN /（TN + FP）
负样本预测结果数 / 负样本实际数
 
False Positive Rate （假正率, FPR）
FPR = FP /（FP + TN）
被预测为正的负样本结果数 /负样本实际数
 
False Negative Rate（假负率 , FNR）
FNR = FN /（TP + FN）
被预测为负的正样本结果数 / 正样本实际数

假阳性率=假阳性人数÷金标准阴性人数

即：假阳性率=b÷（b+d）

		金标准	金标准
		阳性（+）	阴性（-）	合计
某筛检方法	阳性（+）	a	b	a+b
某筛检方法	阴性（-）	c	d	c+d
合计		a+c	b+d	N

公式为:假阳性率=b/(b+d)×100%

(b:筛选为阳性,而标准分类为阴性的例数;d:阴性一致例数)

假阴性率=假阴性人数÷金标准阳性人数

即： β=c÷（a+c）

终于要用到这个玩意了，很激动，主要统计假阴假阳性率。

我的任务：

1. 评估Pacbio MHC variation calling 结果（CCS/non-CCS）与Hiseq数据结果的一致性。
2. 分别在不同深度梯度的区域完成以上评估，推断PB MHC做variation calling的最低深度。

这里要将一个位点分为SNP、REF 和 LowQual，然后只去 SNP 和 REF 进行统计。

这是我一下午写出来的统计代码：

#!/usr/bin/env python
# Author: LI ZHIXIN
 
import sys
import pysam
from collections import OrderedDict
 
def classify_DP(depth):
     if depth > 101 :
         return 21
     return ((depth - 1 ) / / 5 + 1 )
 
def parse_rec(rec):
     sample = list (rec.samples)[ 0 ]
     # filter the Invalid line
     if not ( 'GQ' or 'GT' or 'DP' ) in rec.samples[sample].keys() or len (rec.alleles) < = 1 :
         # continue
         return 1 , "LowQual" , rec.pos
     # filter the LowQual
     if rec.samples[sample][ 'GQ' ] < 30 :
         return rec.samples[sample][ 'DP' ], "LowQual" , rec.pos
     # filter the indel
     flag = 0
     for one in rec.alleles:
         if len (one) ! = len (rec.ref):
             flag = 1
     if flag = = 1 :
         return rec.samples[sample][ 'DP' ], "LowQual" , rec.pos
     if rec.samples[sample][ 'GT' ] ! = ( 0 , 0 ): # rec.qual > 30
         # variation_dict[rec.pos] = ["snp", rec.alleles]
         return rec.samples[sample][ 'DP' ], "snp" , rec.pos 
     elif rec.samples[sample][ 'GT' ] = = ( 0 , 0 ):
         # variation_dict[rec.pos] = ["ref", rec.alleles]
         return rec.samples[sample][ 'DP' ], "ref" , rec.pos
 
def read_gvcf(gvcf_file_path):
     variation_dict = OrderedDict()
     for i in range ( 1 , 22 ):
         variation_dict[i] = {}
         for j in ( 'LowQual' , 'snp' , 'ref' ):
             variation_dict[i][j] = []
     # pos_list = []
     gvcf_file = pysam.VariantFile(gvcf_file_path)
     for rec in gvcf_file.fetch( 'chr6' , 28477796 , 33448354 ):
         DP, pos_type, pos = parse_rec(rec)
         if DP < 1 or DP > 20 :
             continue
         # DP = classify_DP(DP)
         variation_dict[DP][pos_type].append(pos)
         # print(pos, DP, pos_type)
     gvcf_file.close()
     # return variation_dict, pos_list
     return variation_dict
 
def read_hiseq_gvcf(gvcf_file_path):
     variation_dict = OrderedDict()
     # for i in range(1,22):
     # variation_dict[i] = {}
     for j in ( 'LowQual' , 'snp' , 'ref' ):
         variation_dict[j] = []
     # pos_list = []
     gvcf_file = pysam.VariantFile(gvcf_file_path)
     for rec in gvcf_file.fetch( 'chr6' , 28477796 , 33448354 ):
         DP, pos_type, pos = parse_rec(rec)
         DP = classify_DP(DP)
         variation_dict[pos_type].append(pos)
         # print(pos, DP, pos_type)
     gvcf_file.close()
     # return variation_dict, pos_list
     return variation_dict
 
def show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2):
     for DP in range ( 1 , 21 ):
         Hiseq_snp = set (Hiseq_unified_variation_dict[ 'snp' ])
         Hiseq_ref = set (Hiseq_unified_variation_dict[ 'ref' ])
         Hiseq_lowqual = set (Hiseq_unified_variation_dict[ 'LowQual' ])
         PB_snp = PB_non_CCS_variation_dict[DP][ 'snp' ]
         PB_ref = PB_non_CCS_variation_dict[DP][ 'ref' ]
         PB_lowqual = PB_non_CCS_variation_dict[DP][ 'LowQual' ]
         total = set (PB_snp + PB_ref + PB_lowqual)
         Hiseq_snp = total & Hiseq_snp
         Hiseq_ref = total & Hiseq_ref
         Hiseq_lowqual = total & Hiseq_lowqual
         PB_snp = set (PB_snp)
         PB_ref = set (PB_ref)
         PB_lowqual = set (PB_lowqual)
         a = len (Hiseq_snp & PB_snp)
         b = len (Hiseq_ref & PB_snp)
         c = len (Hiseq_lowqual & PB_snp)
         d = len (Hiseq_snp & PB_ref)
         e = len (Hiseq_ref & PB_ref)
         f = len (Hiseq_lowqual & PB_ref)
         g = len (Hiseq_snp & PB_lowqual)
         h = len (Hiseq_ref & PB_lowqual)
         i = len (Hiseq_lowqual & PB_lowqual)
         Low_total = (g + h + i) / (a + b + c + d + e + f + g + h + i)
         if (a + b) = = 0 :
             PPV = "NA"
         else :
             PPV = a / (a + b)
             PPV = "%.4f" % (PPV)
         if (a + d) = = 0 :
             TPR = "NA"
         else :
             TPR = a / (a + d)
             TPR = "%.4f" % (TPR)
         print ( str (DP) + " :\n" , a,b,c, "\n" ,d,e,f, "\n" ,g,h,i, "\n" , file = outf2, sep = '\t' , end = '\n' )
         print (DP, TPR, PPV, "%.4f" % Low_total, file = outf, sep = '\t' , end = '\n' )
 
with open ( "./depth_stat.txt" , "w" ) as outf:
     print ( "Depth" , "TPR" , "PPV" , "Low_total" , file = outf, sep = '\t' , end = '\n' )
     outf2 = open ( "raw.txt" , "w" )
     Hiseq_unified_variation_dict = read_hiseq_gvcf( "./hiseq_call_gvcf/MHC_Hiseq.unified.gvcf.gz" )
     PB_non_CCS_variation_dict = read_gvcf( "./non_CCS_PB_call_gvcf/MHC_non_CCS.unified.gvcf.gz" )
     show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2)
     outf2