医学、机器学习等等,在统计结果时时长会用到这两个指标来说明数据的特性。
定义
敏感性:在金标准判断有病(阳性)人群中,检测出阳性的几率。真阳性。(检测出确实有病的能力)
特异性:在金标准判断无病(阴性)人群中,检测出阴性的几率。真阴性。(检测出确实没病的能力)
假阳性率:得到了阳性结果,但这个阳性结果是假的。即在金标准判断无病(阴性)人群中,检测出为阳性的几率。(没病,但却检测结果说有病),为误诊率。
假阴性率:得到了阴性结果,但这个阴性结果是假的。即在金标准判断有病(阳性)人群中,检测出为阴性的几率。(有病,但却检测结果说没病),为漏诊率。
计算方法
Sensitivity and specificity:完整定义
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
True Positive (真正, TP)被模型预测为正的正样本;可以称作判断为真的正确率
True Negative(真负 , TN)被模型预测为负的负样本 ;可以称作判断为假的正确率
False Positive (假正, FP)被模型预测为正的负样本;可以称作误报率
False Negative(假负 , FN)被模型预测为负的正样本;可以称作漏报率
True Positive Rate(真正率 , TPR)或灵敏度(sensitivity)
TPR = TP /(TP + FN)
正样本预测结果数 / 正样本实际数
True Negative Rate(真负率 , TNR)或特指度(specificity)
TNR = TN /(TN + FP)
负样本预测结果数 / 负样本实际数
False Positive Rate (假正率, FPR)
FPR = FP /(FP + TN)
被预测为正的负样本结果数 /负样本实际数
False Negative Rate(假负率 , FNR)
FNR = FN /(TP + FN)
被预测为负的正样本结果数 / 正样本实际数
|
假阳性率=假阳性人数÷金标准阴性人数
即: 假阳性率=b÷(b+d)
金标准 | 金标准 | |||
阳性(+) | 阴性(-) | 合计 | ||
某筛检方法 | 阳性(+) | a | b | a+b |
某筛检方法 | 阴性(-) | c | d | c+d |
合计 | a+c | b+d | N |
公式为:假阳性率=b/(b+d)×100%
(b:筛选为阳性,而标准分类为阴性的例数;d:阴性一致例数)
假阴性率=假阴性人数÷金标准阳性人数
即: β=c÷(a+c)
终于要用到这个玩意了,很激动,主要统计假阴假阳性率。
我的任务:
1. 评估Pacbio MHC variation calling 结果(CCS/non-CCS)与Hiseq数据结果的一致性。
2. 分别在不同深度梯度的区域完成以上评估,推断PB MHC做variation calling的最低深度。
这里要将一个位点分为SNP、REF 和 LowQual,然后只去 SNP 和 REF 进行统计。
这是我一下午写出来的统计代码:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
|
#!/usr/bin/env python
# Author: LI ZHIXIN
import
sys
import
pysam
from
collections
import
OrderedDict
def
classify_DP(depth):
if
depth >
101
:
return
21
return
((depth
-
1
)
/
/
5
+
1
)
def
parse_rec(rec):
sample
=
list
(rec.samples)[
0
]
# filter the Invalid line
if
not
(
'GQ'
or
'GT'
or
'DP'
)
in
rec.samples[sample].keys()
or
len
(rec.alleles) <
=
1
:
# continue
return
1
,
"LowQual"
, rec.pos
# filter the LowQual
if
rec.samples[sample][
'GQ'
] <
30
:
return
rec.samples[sample][
'DP'
],
"LowQual"
, rec.pos
# filter the indel
flag
=
0
for
one
in
rec.alleles:
if
len
(one) !
=
len
(rec.ref):
flag
=
1
if
flag
=
=
1
:
return
rec.samples[sample][
'DP'
],
"LowQual"
, rec.pos
if
rec.samples[sample][
'GT'
] !
=
(
0
,
0
):
# rec.qual > 30
# variation_dict[rec.pos] = ["snp", rec.alleles]
return
rec.samples[sample][
'DP'
],
"snp"
, rec.pos
elif
rec.samples[sample][
'GT'
]
=
=
(
0
,
0
):
# variation_dict[rec.pos] = ["ref", rec.alleles]
return
rec.samples[sample][
'DP'
],
"ref"
, rec.pos
def
read_gvcf(gvcf_file_path):
variation_dict
=
OrderedDict()
for
i
in
range
(
1
,
22
):
variation_dict[i]
=
{}
for
j
in
(
'LowQual'
,
'snp'
,
'ref'
):
variation_dict[i][j]
=
[]
# pos_list = []
gvcf_file
=
pysam.VariantFile(gvcf_file_path)
for
rec
in
gvcf_file.fetch(
'chr6'
,
28477796
,
33448354
):
DP, pos_type, pos
=
parse_rec(rec)
if
DP <
1
or
DP >
20
:
continue
# DP = classify_DP(DP)
variation_dict[DP][pos_type].append(pos)
# print(pos, DP, pos_type)
gvcf_file.close()
# return variation_dict, pos_list
return
variation_dict
def
read_hiseq_gvcf(gvcf_file_path):
variation_dict
=
OrderedDict()
# for i in range(1,22):
# variation_dict[i] = {}
for
j
in
(
'LowQual'
,
'snp'
,
'ref'
):
variation_dict[j]
=
[]
# pos_list = []
gvcf_file
=
pysam.VariantFile(gvcf_file_path)
for
rec
in
gvcf_file.fetch(
'chr6'
,
28477796
,
33448354
):
DP, pos_type, pos
=
parse_rec(rec)
DP
=
classify_DP(DP)
variation_dict[pos_type].append(pos)
# print(pos, DP, pos_type)
gvcf_file.close()
# return variation_dict, pos_list
return
variation_dict
def
show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2):
for
DP
in
range
(
1
,
21
):
Hiseq_snp
=
set
(Hiseq_unified_variation_dict[
'snp'
])
Hiseq_ref
=
set
(Hiseq_unified_variation_dict[
'ref'
])
Hiseq_lowqual
=
set
(Hiseq_unified_variation_dict[
'LowQual'
])
PB_snp
=
PB_non_CCS_variation_dict[DP][
'snp'
]
PB_ref
=
PB_non_CCS_variation_dict[DP][
'ref'
]
PB_lowqual
=
PB_non_CCS_variation_dict[DP][
'LowQual'
]
total
=
set
(PB_snp
+
PB_ref
+
PB_lowqual)
Hiseq_snp
=
total & Hiseq_snp
Hiseq_ref
=
total & Hiseq_ref
Hiseq_lowqual
=
total & Hiseq_lowqual
PB_snp
=
set
(PB_snp)
PB_ref
=
set
(PB_ref)
PB_lowqual
=
set
(PB_lowqual)
a
=
len
(Hiseq_snp & PB_snp)
b
=
len
(Hiseq_ref & PB_snp)
c
=
len
(Hiseq_lowqual & PB_snp)
d
=
len
(Hiseq_snp & PB_ref)
e
=
len
(Hiseq_ref & PB_ref)
f
=
len
(Hiseq_lowqual & PB_ref)
g
=
len
(Hiseq_snp & PB_lowqual)
h
=
len
(Hiseq_ref & PB_lowqual)
i
=
len
(Hiseq_lowqual & PB_lowqual)
Low_total
=
(g
+
h
+
i)
/
(a
+
b
+
c
+
d
+
e
+
f
+
g
+
h
+
i)
if
(a
+
b)
=
=
0
:
PPV
=
"NA"
else
:
PPV
=
a
/
(a
+
b)
PPV
=
"%.4f"
%
(PPV)
if
(a
+
d)
=
=
0
:
TPR
=
"NA"
else
:
TPR
=
a
/
(a
+
d)
TPR
=
"%.4f"
%
(TPR)
print
(
str
(DP)
+
" :\n"
, a,b,c,
"\n"
,d,e,f,
"\n"
,g,h,i,
"\n"
,
file
=
outf2, sep
=
'\t'
, end
=
'\n'
)
print
(DP, TPR, PPV,
"%.4f"
%
Low_total,
file
=
outf, sep
=
'\t'
, end
=
'\n'
)
with
open
(
"./depth_stat.txt"
,
"w"
) as outf:
print
(
"Depth"
,
"TPR"
,
"PPV"
,
"Low_total"
,
file
=
outf, sep
=
'\t'
, end
=
'\n'
)
outf2
=
open
(
"raw.txt"
,
"w"
)
Hiseq_unified_variation_dict
=
read_hiseq_gvcf(
"./hiseq_call_gvcf/MHC_Hiseq.unified.gvcf.gz"
)
PB_non_CCS_variation_dict
=
read_gvcf(
"./non_CCS_PB_call_gvcf/MHC_non_CCS.unified.gvcf.gz"
)
show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2)
outf2
|
又碰到一个高级python语法:在双层循环中如何退出外层循环? 我用了一个手动的flag,有其他好方法吗?
如何统计下机数据的覆盖度和深度?当然要比对之后才能统计,而且还要对比对做一些处理。
在计算一个位点是否是SNP、indel、Ref时,不仅要考虑ref、alts、qual、GQ,而且必须要把GT、DP考虑在内,所以说还是比较复杂的。
最后如何分析第二个问题,call variation的最低深度?
统计不同深度下的假阴假阳性率,看在什么深度下其达到饱和。