Sensitivity, Specificity, False Positives, and False Negatives

monk1992

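This article describes a method for evaluating how consistent PacBio MHC variant calls are with HiSeq data, and determines the minimum effective depth for PacBio MHC variant calling by tallying false negative and false positive rates at different depths.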

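In medicine, machine learning, and many other fields, sensitivity and specificity are routinely reported alongside results to describe how well a test or classifier performs.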

Definitions

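Sensitivity: among people the gold standard classifies as diseased (positive), the probability that the test comes back positive. This is the true positive rate (the ability to detect people who really have the disease).
Specificity: among people the gold standard classifies as healthy (negative), the probability that the test comes back negative. This is the true negative rate (the ability to clear people who really do not have the disease).
False positive rate: a positive result was obtained, but it is wrong. That is, among people the gold standard classifies as healthy (negative), the probability that the test comes back positive (healthy, but the test says diseased). This is the misdiagnosis rate.
False negative rate: a negative result was obtained, but it is wrong. That is, among people the gold standard classifies as diseased (positive), the probability that the test comes back negative (diseased, but the test says healthy). This is the missed-diagnosis rate.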
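The four definitions above are conditional proportions: each one looks only at the gold-standard positives or only at the gold-standard negatives. A minimal sketch of that reading, using made-up data for ten people (the lists `gold` and `test` and the helper `rate` are illustrative assumptions, not part of the original analysis):

```python
def rate(results):
    """Fraction of True values in a non-empty list of booleans."""
    return sum(results) / len(results)


# Hypothetical data: True means "diseased" for gold and "positive" for test.
gold = [True, True, True, True, False, False, False, False, False, False]
test = [True, True, True, False, True, False, False, False, False, False]

# Split the test results by gold-standard status.
test_among_diseased = [t for g, t in zip(gold, test) if g]
test_among_healthy = [t for g, t in zip(gold, test) if not g]

sensitivity = rate(test_among_diseased)          # positive results among the diseased
false_negative_rate = 1 - sensitivity            # negative results among the diseased
false_positive_rate = rate(test_among_healthy)   # positive results among the healthy
specificity = 1 - false_positive_rate            # negative results among the healthy

print(sensitivity, specificity, false_positive_rate, false_negative_rate)
```

Because each pair shares a denominator, sensitivity and the false negative rate sum to 1, as do specificity and the false positive rate.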

Calculation

Sensitivity and specificity: full definitions

 
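True Positive (TP): a positive sample that the model predicts as positive, i.e. a correct positive call.

True Negative (TN): a negative sample that the model predicts as negative, i.e. a correct negative call.

False Positive (FP): a negative sample that the model predicts as positive, i.e. a false alarm.

False Negative (FN): a positive sample that the model predicts as negative, i.e. a miss.

TP, TN, FP and FN are counts. The rates below are computed from them.

True Positive Rate (TPR), or sensitivity:

TPR = TP / (TP + FN)

positive samples predicted as positive / actual positive samples

True Negative Rate (TNR), or specificity:

TNR = TN / (TN + FP)

negative samples predicted as negative / actual negative samples

False Positive Rate (FPR):

FPR = FP / (FP + TN)

negative samples predicted as positive / actual negative samples

False Negative Rate (FNR):

FNR = FN / (TP + FN)

positive samples predicted as negative / actual positive samples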
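The four formulas can be written down directly as code. The following is a small sketch rather than a reference implementation: the function names `safe_ratio` and `rates`, the choice to return `None` for an empty denominator, and the counts in the usage line are all assumptions made for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def rates(tp, tn, fp, fn):
    """Compute TPR, TNR, FPR and FNR from the four confusion-matrix counts."""
    return {
        "TPR": safe_ratio(tp, tp + fn),  # sensitivity
        "TNR": safe_ratio(tn, tn + fp),  # specificity
        "FPR": safe_ratio(fp, fp + tn),
        "FNR": safe_ratio(fn, tp + fn),
    }


# Hypothetical counts: 100 actual positives and 1000 actual negatives.
r = rates(tp=80, tn=900, fp=100, fn=20)
for name, value in r.items():
    print(name, value)
```

TPR and FNR share the denominator TP + FN, and TNR and FPR share TN + FP, so each pair sums to 1. Reporting one value from each pair is therefore enough.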

False positive rate = number of false positives ÷ number of gold-standard negatives

That is: false positive rate = b ÷ (b + d)

                              Gold standard    Gold standard
                              positive (+)     negative (-)     Total
Screening test positive (+)   a                b                a+b
Screening test negative (-)   c                d                c+d
Total                         a+c              b+d              N

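Expressed as a percentage: false positive rate = b / (b + d) × 100%

(b: the number of cases the screening test calls positive but the gold standard classifies as negative; d: the number of cases both classify as negative)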

False negative rate = number of false negatives ÷ number of gold-standard positives

That is: β = c ÷ (a + c)
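As a worked example of the two formulas, take a hypothetical screening study with a = 90, b = 30, c = 10 and d = 870, so that N = 1000. These numbers are invented purely for illustration.

```latex
\begin{aligned}
\text{False positive rate} &= \frac{b}{b+d} = \frac{30}{30+870} = \frac{30}{900} \approx 3.33\% \\
\text{False negative rate } \beta &= \frac{c}{a+c} = \frac{10}{90+10} = \frac{10}{100} = 10\% \\
\text{Sensitivity} &= 1 - \beta = 90\% \\
\text{Specificity} &= 1 - \frac{b}{b+d} \approx 96.67\%
\end{aligned}
```

The column totals a + c and b + d are the denominators here. The row totals a + b and c + d are not used by any of these four rates.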

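I finally get to put these metrics to use, which is exciting. The main goal is to measure the false negative and false positive rates.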

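My task:

1. Evaluate how consistent the PacBio MHC variant calling results (CCS/non-CCS) are with the HiSeq results.
2. Repeat that evaluation over regions at different depth levels to infer the minimum depth at which PacBio (PB) can be used for MHC variant calling.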

Each site is classified as SNP, REF, or LowQual, and only the SNP and REF sites are used for the statistics.
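The idea of keeping only SNP and REF sites can be sketched on toy data, with HiSeq treated as the truth set. The dictionaries `hiseq_calls` and `pacbio_calls`, the positions in them, and the helper `confident_sites` are hypothetical and exist only to show the bookkeeping.

```python
from collections import Counter

# Hypothetical per-site classifications: position -> class.
hiseq_calls = {101: "snp", 102: "ref", 103: "ref", 104: "snp", 105: "LowQual",
               106: "ref", 107: "snp", 108: "ref", 109: "ref"}
pacbio_calls = {101: "snp", 102: "snp", 103: "ref", 104: "ref", 105: "snp",
                106: "LowQual", 107: "snp", 108: "ref", 109: "ref"}


def confident_sites(calls):
    """Positions classified as SNP or REF, i.e. everything except LowQual."""
    return {pos for pos, site_class in calls.items() if site_class in ("snp", "ref")}


# Keep only sites that are SNP or REF in both datasets.
shared = confident_sites(hiseq_calls) & confident_sites(pacbio_calls)
pairs = Counter((hiseq_calls[pos], pacbio_calls[pos]) for pos in shared)

tp = pairs[("snp", "snp")]  # SNP in both
fp = pairs[("ref", "snp")]  # REF in HiSeq, SNP in PacBio
fn = pairs[("snp", "ref")]  # SNP in HiSeq, REF in PacBio
tn = pairs[("ref", "ref")]  # REF in both

false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
print(tp, fp, fn, tn, false_positive_rate, false_negative_rate)
```

Sites 105 and 106 drop out because one of the two datasets marks them LowQual, so they affect neither rate.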

Here is the statistics script I wrote in one afternoon:

    #!/usr/bin/env python3
    # Author: LI ZHIXIN

    import sys
    import pysam
    from collections import OrderedDict

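    def classify_DP(depth):
        # Map a read depth to a bin of width 5: 1-5 -> 1, 6-10 -> 2, ..., 96-100 -> 20.
        # Anything above 100 falls into the last bin, 21.
        if depth > 101:
            return 21
        return (depth - 1) // 5 + 1
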
    def parse_rec(rec):
        # Return (depth, site class, position) for the first sample of one record.
        sample = list(rec.samples)[0]
        fmt = rec.samples[sample]
        # Treat records that lack GQ, GT or DP, or that have no ALT allele, as invalid.
        if not all(key in fmt.keys() for key in ('GQ', 'GT', 'DP')) or len(rec.alleles) <= 1:
            return 1, "LowQual", rec.pos
        # Filter out low genotype quality.
        if fmt['GQ'] < 30:
            return fmt['DP'], "LowQual", rec.pos
        # Filter out indels: any allele whose length differs from the reference allele.
        if any(len(one) != len(rec.ref) for one in rec.alleles):
            return fmt['DP'], "LowQual", rec.pos
        # Any genotype other than 0/0 counts as a SNP, otherwise the site is reference.
        if fmt['GT'] != (0, 0):
            return fmt['DP'], "snp", rec.pos
        return fmt['DP'], "ref", rec.pos

    def read_gvcf(gvcf_file_path):
        # Group PacBio site positions by exact depth (1-20) and by site class.
        variation_dict = OrderedDict()
        for i in range(1, 22):
            variation_dict[i] = {}
            for j in ('LowQual', 'snp', 'ref'):
                variation_dict[i][j] = []
        gvcf_file = pysam.VariantFile(gvcf_file_path)
        # MHC region on chr6
        for rec in gvcf_file.fetch('chr6', 28477796, 33448354):
            DP, pos_type, pos = parse_rec(rec)
            if DP < 1 or DP > 20:
                continue
            # DP = classify_DP(DP)
            variation_dict[DP][pos_type].append(pos)
        gvcf_file.close()
        return variation_dict

    def read_hiseq_gvcf(gvcf_file_path):
        # Collect HiSeq site positions by site class, regardless of depth.
        variation_dict = OrderedDict()
        for j in ('LowQual', 'snp', 'ref'):
            variation_dict[j] = []
        gvcf_file = pysam.VariantFile(gvcf_file_path)
        for rec in gvcf_file.fetch('chr6', 28477796, 33448354):
            DP, pos_type, pos = parse_rec(rec)
            variation_dict[pos_type].append(pos)
        gvcf_file.close()
        return variation_dict

    def show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2):
        # HiSeq calls serve as the truth set; PacBio calls are evaluated per depth.
        all_Hiseq_snp = set(Hiseq_unified_variation_dict['snp'])
        all_Hiseq_ref = set(Hiseq_unified_variation_dict['ref'])
        all_Hiseq_lowqual = set(Hiseq_unified_variation_dict['LowQual'])
        for DP in range(1, 21):
            PB_snp = set(PB_non_CCS_variation_dict[DP]['snp'])
            PB_ref = set(PB_non_CCS_variation_dict[DP]['ref'])
            PB_lowqual = set(PB_non_CCS_variation_dict[DP]['LowQual'])
            # Restrict the HiSeq sites to those PacBio covers at this depth.
            total = PB_snp | PB_ref | PB_lowqual
            Hiseq_snp = total & all_Hiseq_snp
            Hiseq_ref = total & all_Hiseq_ref
            Hiseq_lowqual = total & all_Hiseq_lowqual
            # 3x3 table: rows are PacBio (snp, ref, LowQual), columns are HiSeq (snp, ref, LowQual).
            # Note that these letters differ from the a, b, c, d of the 2x2 table above.
            a = len(Hiseq_snp & PB_snp)      # true positives
            b = len(Hiseq_ref & PB_snp)      # false positives
            c = len(Hiseq_lowqual & PB_snp)
            d = len(Hiseq_snp & PB_ref)      # false negatives
            e = len(Hiseq_ref & PB_ref)      # true negatives
            f = len(Hiseq_lowqual & PB_ref)
            g = len(Hiseq_snp & PB_lowqual)
            h = len(Hiseq_ref & PB_lowqual)
            i = len(Hiseq_lowqual & PB_lowqual)
            n = a + b + c + d + e + f + g + h + i
            # Fraction of shared sites that PacBio classifies as LowQual ("NA" if there are none).
            Low_total = "%.4f" % ((g + h + i) / n) if n else "NA"
            # Positive predictive value and sensitivity, from SNP and REF sites only.
            PPV = "%.4f" % (a / (a + b)) if (a + b) else "NA"
            TPR = "%.4f" % (a / (a + d)) if (a + d) else "NA"
            print(str(DP) + " :\n", a, b, c, "\n", d, e, f, "\n", g, h, i, "\n", file=outf2, sep='\t', end='\n')
            print(DP, TPR, PPV, Low_total, file=outf, sep='\t', end='\n')

    with open("./depth_stat.txt", "w") as outf, open("raw.txt", "w") as outf2:
        print("Depth", "TPR", "PPV", "Low_total", file=outf, sep='\t', end='\n')
        Hiseq_unified_variation_dict = read_hiseq_gvcf("./hiseq_call_gvcf/MHC_Hiseq.unified.gvcf.gz")
        PB_non_CCS_variation_dict = read_gvcf("./non_CCS_PB_call_gvcf/MHC_non_CCS.unified.gvcf.gz")
        show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2)
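The script writes one tab-separated row per depth with the columns Depth, TPR, PPV and Low_total. One possible way to turn such a table into a minimum depth is sketched below. The function name `minimum_reliable_depth`, the 0.95 thresholds, the rule that every deeper row must also pass, and the sample numbers are all assumptions for illustration, not results or criteria from the original analysis.

```python
import csv
import io


def minimum_reliable_depth(table_text, min_tpr=0.95, min_ppv=0.95):
    """Return the smallest depth from which every deeper row passes both thresholds.

    Rows with "NA" for TPR or PPV count as failing. Returns None if the
    deepest row itself fails.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    rows = []
    for row in reader:
        if row["TPR"] == "NA" or row["PPV"] == "NA":
            passed = False
        else:
            passed = float(row["TPR"]) >= min_tpr and float(row["PPV"]) >= min_ppv
        rows.append((int(row["Depth"]), passed))
    rows.sort()

    best = None
    # Walk from the deepest row down and stop at the first failing depth.
    for depth, passed in reversed(rows):
        if not passed:
            break
        best = depth
    return best


# Made-up table in the same layout as depth_stat.txt.
example = (
    "Depth\tTPR\tPPV\tLow_total\n"
    "1\tNA\tNA\t1.0000\n"
    "2\t0.6200\t0.7100\t0.8000\n"
    "3\t0.9600\t0.9400\t0.4000\n"
    "4\t0.9700\t0.9800\t0.2000\n"
    "5\t0.9900\t0.9900\t0.1000\n"
)
print(minimum_reliable_depth(example))  # → 4
```

In this made-up table depth 3 passes the TPR threshold but not the PPV threshold, so the answer is 4. Requiring all deeper rows to pass prevents a single lucky low-depth row from being picked.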