Data Quality Control in a Targeted Sequencing Pipeline

This post records an upgrade to the data QC step of a targeted sequencing pipeline.

Previously, the pipeline's QC figures came from a few tools:

  • fastp:

    fastp -w ${threads} \
    	-i ${data}/${pn}/${sn}_S*_R1_*.fastq.gz \
        -I ${data}/${pn}/${sn}_S*_R2_*.fastq.gz \
        -o ${result}/${sn}/trimmed/${sn}_trimmed_R1.fastq.gz     \
        -O ${result}/${sn}/trimmed/${sn}_trimmed_R2.fastq.gz     \
        --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA     \
        --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT  \
        --length_required 15 \
        --json ${result}/${sn}/trimmed/${sn}_fastp.json \
        --html ${result}/${sn}/trimmed/${sn}_fastp.html
    

    From the output file ${sn}_fastp.json, grab the pre- and post-filtering Q20/Q30 rates and total read counts.

  • samtools flagstat

    samtools flagstat --threads ${threads} \
    	${result}/${sn}/aligned/${sn}_sorted.bam  > \
        ${result}/${sn}/stat/${sn}_sorted.flagstat
    

    From the output file ${sn}_sorted.flagstat, grab mapping statistics such as the overall mapping rate (the fraction of reads aligned to the reference genome).

  • samtools depth

    samtools depth -a -b ${ref.bed} --threads ${threads} \
    	${result}/${sn}/aligned/${sn}_marked.bam > \
        ${result}/${sn}/stat/${sn}_marked.depth
    

    This writes the depth at every position of the region file ${ref.bed}; overall statistics are then computed from it: coverage at 1×, 10×, 20× and other cutoffs, plus the mean and median depth.

  • gatk CollectInsertSizeMetrics (actually Picard, bundled into GATK)

    gatk CollectInsertSizeMetrics \
      -I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
      -O ${result}/${sn}/stat/${sn}_normal_insertsize_metrics.txt \
      -H ${result}/${sn}/stat/${sn}_normal_insertsize_histogram.pdf
    

    Grab the insert size information from the metrics file.

A script then aggregates the above into the final QC report.
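For reference, the depth summary described above (mean/median depth and N× coverage from `samtools depth` output) can be sketched along these lines; the depth lines below are made up for illustration:

```python
import statistics

# Made-up sample of `samtools depth -a -b` output: chrom, pos, depth per line.
sample_depth = """\
chr1\t100\t0
chr1\t101\t12
chr1\t102\t25
chr1\t103\t25
chr1\t104\t8
"""

depths = [int(line.split("\t")[2]) for line in sample_depth.splitlines()]

mean_depth = sum(depths) / len(depths)    # average depth over the BED region
median_depth = statistics.median(depths)  # median depth
coverage = {                              # fraction of positions at >= N×
    cutoff: sum(d >= cutoff for d in depths) / len(depths)
    for cutoff in (1, 10, 20)
}
print(mean_depth, median_depth, coverage)
```

Note that `-a` matters here: without it, zero-depth positions are omitted and both the mean and the coverage denominators would be wrong.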

However, then a friend showed me GB/T 37872-2019《目标基因区域捕获质量评价通则》(general rules for quality evaluation of target gene region capture):

One of its metrics is capture specificity: after duplicate removal on the aligned data, the ratio of the number of bases mapped to the target region to the number of bases mapped to the whole genome.
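As a toy calculation (all counts invented for illustration), the metric is simply:

```python
# Capture specificity per GB/T 37872: bases mapped to the target region
# divided by bases mapped to the whole genome, both after duplicate removal.
# These counts are invented for illustration.
target_bases_rmdup = 8_500_000   # deduplicated bases on target
mapped_bases_rmdup = 50_000_000  # deduplicated bases mapped genome-wide

capture_specificity = target_bases_rmdup / mapped_bases_rmdup
print(f"capture specificity = {capture_specificity:.2%}")
```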

That set me thinking. On the principle of reusing an existing wheel rather than building my own, I found bamdst to replace the samtools outputs. Usage:

#download the source from GitHub
git clone https://github.com/shiquan/bamdst.git
#build
cd bamdst && make
#run
/opt/ref/bamdst/bamdst \
	-p /opt/ref/projects/lang_cancer_hg38.bed \
	-o ./bamqc \
	--cutoffdepth 20 \
	./aligned/RD1703007FFP_marked.bam

#Note: unfortunately, --cutoffdepth accepts only a single value, not multiple cutoffs
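Since `--cutoffdepth` accepts just one value, one workaround is to recompute any extra cutoffs from the per-base depth.tsv.gz that bamdst writes. A rough sketch; the column layout assumed here (chrom, pos, raw depth, then rmdup depth) should be verified against your bamdst version:

```python
import gzip

def coverage_at(path, cutoffs, depth_col=2):
    """Fraction of positions in depth.tsv.gz with depth >= each cutoff.

    Assumes tab-separated lines of chrom, pos, raw depth, rmdup depth;
    pass depth_col=3 to use the deduplicated depth instead.
    """
    total = 0
    hits = {c: 0 for c in cutoffs}
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):  # skip the header line
                continue
            depth = int(line.split("\t")[depth_col])
            total += 1
            for c in cutoffs:
                if depth >= c:
                    hits[c] += 1
    return {c: hits[c] / total for c in cutoffs} if total else {}
```

For example, `coverage_at('bamqc/depth.tsv.gz', (20, 50, 500))` fills in cutoffs that coverage.report does not list.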

The run produces seven output files:

  • chromosomes.report

  • coverage.report

  • depth_distribution.plot

  • depth.tsv.gz

  • insertsize.plot

  • region.tsv.gz

  • uncover.bed

    Reference fields in coverage.report:

    [Total] Raw Reads (All reads) // All reads in the bam file(s).
    [Total] QC Fail reads // Reads number failed QC, this flag is marked by other software,like bwa. See flag in the bam structure.
    [Total] Raw Data(Mb) // Total reads data in the bam file(s).
    [Total] Paired Reads // Paired reads numbers.
    [Total] Mapped Reads // Mapped reads numbers.
    [Total] Fraction of Mapped Reads // Ratio of mapped reads against raw reads.
    [Total] Mapped Data(Mb) // Mapped data in the bam file(s).
    [Total] Fraction of Mapped Data(Mb) // Ratio of mapped data against raw data.
    [Total] Properly paired // Paired reads with properly insert size. See bam format protocol for details.
    [Total] Fraction of Properly paired // Ratio of properly paired reads against mapped reads
    [Total] Read and mate paired // Read (read1) and mate read (read2) paired.
    [Total] Fraction of Read and mate paired // Ratio of read and mate paired against mapped reads
    [Total] Singletons // Read mapped but mate read unmapped, and vice versa.
    [Total] Read and mate map to diff chr // Read and mate read mapped to different chromosome, usually because mapping error and structure variants.
    [Total] Read1 // First reads in mate paired sequencing
    [Total] Read2 // Mate reads
    [Total] Read1(rmdup) // First reads after remove duplications.
    [Total] Read2(rmdup) // Mate reads after remove duplications.
    [Total] forward strand reads // Number of forward strand reads.
    [Total] backward strand reads // Number of backward strand reads.
    [Total] PCR duplicate reads // PCR duplications.
    [Total] Fraction of PCR duplicate reads // Ratio of PCR duplications.
    [Total] Map quality cutoff value // Cutoff map quality score, this value can be set by -q. default is 20, because some variants caller like GATK only consider high quality reads.
    [Total] MapQuality above cutoff reads // Number of reads with higher or equal quality score than cutoff value.
    [Total] Fraction of MapQ reads in all reads // Ratio of reads with higher or equal Q score against raw reads.
    [Total] Fraction of MapQ reads in mapped reads // Ratio of reads with higher or equal Q score against mapped reads.
    [Target] Target Reads // Number of reads covered target region (specified by bed file).
    [Target] Fraction of Target Reads in all reads // Ratio of target reads against raw reads.
    [Target] Fraction of Target Reads in mapped reads // Ratio of target reads against mapped reads.
    [Target] Target Data(Mb) // Total bases covered target region. If a read covered target region partly, only the covered bases will be counted.
    [Target] Target Data Rmdup(Mb) // Total bases covered target region after remove PCR duplications. 
    [Target] Fraction of Target Data in all data // Ratio of target bases against raw bases.
    [Target] Fraction of Target Data in mapped data // Ratio of target bases against mapped bases.
    [Target] Len of region // The length of target regions.
    [Target] Average depth // Average depth of target regions. Calculated by "target bases / length of regions".
    [Target] Average depth(rmdup) // Average depth of target regions after remove PCR duplications.
    [Target] Coverage (>0x) // Ratio of bases with depth greater than 0x in target regions, which also means the ratio of covered regions in target regions.
    [Target] Coverage (>=4x) // Ratio of bases with depth greater than or equal to 4x in target regions.
    [Target] Coverage (>=10x) // Ratio of bases with depth greater than or equal to 10x in target regions.
    [Target] Coverage (>=30x) // Ratio of bases with depth greater than or equal to 30x in target regions.
    [Target] Coverage (>=100x) // Ratio of bases with depth greater than or equal to 100x in target regions.
    [Target] Coverage (>=Nx) // This is an additional line for a user self-defined cutoff value, see --cutoffdepth
    [Target] Target Region Count // Number of target regions. In normal practice, it is the total number of exomes.
    [Target] Region covered > 0x // The number of these regions with average depth greater than 0x.
    [Target] Fraction Region covered > 0x // Ratio of these regions with average depth greater than 0x.
    [Target] Fraction Region covered >= 4x // Ratio of these regions with average depth greater than or equal to 4x.
    [Target] Fraction Region covered >= 10x // Ratio of these regions with average depth greater than or equal to 10x.
    [Target] Fraction Region covered >= 30x // Ratio of these regions with average depth greater than or equal to 30x.
    [Target] Fraction Region covered >= 100x // Ratio of these regions with average depth greater than or equal to 100x.
    [flank] flank size // The flank size will be count. 200 bp in default. Oligos could also capture the nearby regions of target regions.
    [flank] Len of region (not include target region) // The length of flank regions (target regions will not be count).
    [flank] Average depth // Average depth of flank regions.
    [flank] flank Reads // The total number of reads covered the flank regions. Note: some reads covered the edge of target regions, will be count in flank regions also. 
    [flank] Fraction of flank Reads in all reads // Ratio of reads covered in flank regions against raw reads.
    [flank] Fraction of flank Reads in mapped reads // Ration of reads covered in flank regions against mapped reads.
    [flank] flank Data(Mb) // Total bases in the flank regions.
    [flank] Fraction of flank Data in all data // Ratio of total bases in the flank regions against raw data.
    [flank] Fraction of flank Data in mapped data // Ratio of total bases in the flank regions against mapped data.
    [flank] Coverage (>0x) // Ratio of flank bases with depth greater than 0x.
    [flank] Coverage (>=4x) // Ratio of flank bases with depth greater than or equal to 4x.
    [flank] Coverage (>=10x) // Ratio of flank bases with depth greater than or equal to 10x.
    [flank] Coverage (>=30x) // Ratio of flank bases with depth greater than or equal to 30x.
    
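Each non-comment line of coverage.report is a tab-separated key/value pair, so the whole file can be loaded into a dict in a few lines (a minimal sketch; the sample text is abbreviated):

```python
def parse_coverage_report(text):
    """Parse bamdst coverage.report content into a {field: value} dict."""
    stats = {}
    for line in text.splitlines():
        if line.startswith("##") or "\t" not in line:
            continue  # skip comment lines and anything without a key/value pair
        key, value = line.split("\t", 1)
        stats[key.strip()] = value.strip()
    return stats

sample = (
    "## abbreviated coverage.report content\n"
    "[Total] Fraction of Mapped Reads\t99.98%\n"
    "[Target] Average depth(rmdup)\t154.77\n"
)
print(parse_coverage_report(sample))
```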

A script collects the above information into the final QC table (add or drop fields as needed). The script:

#!/usr/bin/python
#-*-coding:utf-8-*-
import os,sys,collections,pandas,numpy,getopt,logging,json

__author__ = '豆浆包子'
__all__    = [
    'SingleQcUtil', 
    'process', 
    'doProcess',
    'collectFastqInfo',
    'collectBamInfo',
    'collectInsertSize'
    'usage'
]


class SingleQcUtil(object):
    """ 
        根据以下软件输出结果,生成最终Qc文件
        fastp    fastp.json
        bamdst   coverage.report
        gatk     CollectInsertSizeMetrics(Picard)
    """


    def __init__(self):
        '''
        Initialization: validate tools, read script arguments, define the usage message, etc.
        '''
        logging.basicConfig(
            level   = logging.INFO,
            format  = '%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s'
        )
        self.log = logging.getLogger(__name__)
        
        self.sn                 ='Sample'

        self.bed                = None
        self.out                = None
        self.sample_fastp       = None
        self.sample_bamdst      = None
        self.sample_insertsize  = None


        try:
            self.opts = getopt.getopt(
                sys.argv[1:], 
                "o:dh", 
                [
                    "out=",
                    "sample-fastp=",
                    "sample-bamdst=",
                    "sample-insertsize=",
                    "document",
                    "help"
                ]
            )
        except getopt.GetoptError:
            print('Error: please check your input arguments\n\n')
            self.usage()
            sys.exit()
        
        if len(self.opts[0])==0:
            self.usage()
            sys.exit()

        #read the command-line argument values
        for arg, value in self.opts[0]:
            if arg=="-o" or arg == "--out":
                if value.endswith('.xls'):
                    self.out = value
                    if os.path.isfile(value) :
                        self.log.warn('[-o]/[--out] file %s already exists; the result will overwrite it' %(value))
                else:
                    print('[-o]/[--out] value %s is not a valid file (must end with %s)' %(value,'.xls'))
            elif arg == "--sample-fastp":
                if os.path.isfile(value) and value.endswith('.json'):
                    self.sample_fastp = value
                    fullpath     = os.path.realpath(value)
                    fileOnly     = os.path.basename(fullpath)
                    self.sn      = fileOnly[0:fileOnly.rindex('_fastp.json')]
                else:
                    print('[--sample-fastp] value %s is not a valid file' % value)
                    sys.exit()
            elif arg == "--sample-bamdst":
                if os.path.isfile(value) :
                    self.sample_bamdst=value
                else:
                    print('[--sample-bamdst] value %s is not a valid file' % value)
                    sys.exit()
            elif arg == "--sample-insertsize":
                if os.path.isfile(value):
                    self.sample_insertsize=value
                else:
                    print('[--sample-insertsize] value %s is not a valid file' % value)
                    sys.exit()
            elif arg=="-h" or arg == "--help":
                self.usage()
                sys.exit()
            elif arg=="-d" or arg=="--document":
                import SingleQcUtil
                help(SingleQcUtil)
                sys.exit()
        ps_stats = False
        if self.out is None:
            print('[-o]/[--out] value must not be empty')
            ps_stats = True
        if self.sample_fastp is None:
            print('[--sample-fastp] value must not be empty')
            ps_stats = True
        if self.sample_bamdst is None:
            print('[--sample-bamdst] value must not be empty')
            ps_stats = True
        if self.sample_insertsize is None:
            print('[--sample-insertsize] value must not be empty')
            ps_stats = True
        if ps_stats:
            sys.exit()

        self.cpath = os.getcwd()
        self.log.info('WorkingDir is : %s' % self.cpath)


    def doProcess(self):
        """
        处理传入的vcf文件和annovar注释文件,输出结果
        args:
            fastp              fastp    输出文件
            sample_bamdst      bamdst   输出文件
            sample_insertsize  gatk     CollectInsertSizeMetrics 输出文件
        return:
            无返回值
        """
        dic                                    = collections.OrderedDict()
        dic['Sample Number']                   = self.sn
        dic['Total Reads (M) Before Filtering']='NAN'
        dic['Q20 Rate Before Filtering']       ='NAN'
        dic['Q30 Rate Before Filtering']       ='NAN'
        dic['GC Rate  Before Filtering']       ='NAN'

        dic['Total Reads (M) After  Filtering']='NAN'
        dic['Q20 Rate After  Filtering']       ='NAN'
        dic['Q30 Rate After  Filtering']       ='NAN'
        dic['GC Rate  After  Filtering']       ='NAN'
        dic['Mapping Rate']                    ='NAN'
        dic['Duplicate Rate']                  ='NAN'
        dic['Target in Mapping Rate']          ='NAN'
        dic['Region Length']                   ='NAN'
        dic['Average Depth']                   ='NAN'
        dic['1× Coverage']                     ='NAN'
        dic['4× Coverage']                     ='NAN'
        dic['10× Coverage']                    ='NAN'
        dic['20× Coverage']                    ='NAN'
        dic['30× Coverage']                    ='NAN'
        dic['100× Coverage']                   ='NAN'
        #dic['500× Coverage']                   ='NAN'
        dic['mean insert size']                ='NAN'
        dic                                    = self.collectFastqInfo(self.sample_fastp,dic)
        dic                                    = self.collectBamInfo(self.sample_bamdst,dic)
        dic                                    = self.collectInsertSize(self.sample_insertsize,dic)
        
        result                                 = pandas.DataFrame([dic],index=[0])
        #result.drop(['covered','total'],axis=1,inplace=True)
        self.log.info('writing result to file %s',self.out)
        print(result.stack())
        result.to_csv(self.out,index=False,encoding='utf-8',sep='\t')
        

    def collectFastqInfo(self,filename,dic):
        '''
            Read the pre-/post-filtering total reads, Q20 rate, Q30 rate and GC content from the fastp JSON output
        '''
        if os.path.isfile(filename):
            try:
                with open(filename,'r') as load_f:
                    load_dict = json.load(load_f)
                    dic['Total Reads (M) Before Filtering'] = format(float(load_dict['summary']['before_filtering']['total_reads'])/1000000,'.2f')
                    dic['Q20 Rate Before Filtering'] = format(load_dict['summary']['before_filtering']['q20_rate'],'.2%')
                    dic['Q30 Rate Before Filtering'] = format(load_dict['summary']['before_filtering']['q30_rate'],'.2%')
                    dic['GC Rate  Before Filtering'] = format(load_dict['summary']['before_filtering']['gc_content'],'.2%')
                    dic['Total Reads (M) After  Filtering'] = format(float(load_dict['summary']['after_filtering']['total_reads'])/1000000,'.2f')
                    dic['Q20 Rate After  Filtering'] = format(load_dict['summary']['after_filtering']['q20_rate'],'.2%')
                    dic['Q30 Rate After  Filtering'] = format(load_dict['summary']['after_filtering']['q30_rate'],'.2%')
                    dic['GC Rate  After  Filtering'] = format(load_dict['summary']['after_filtering']['gc_content'],'.2%')
            except Exception as e:
                self.log.error(e)
        return dic

    def collectBamInfo(self,filename,dic):
        '''
            Read bamdst's coverage.report and collect the relevant fields:
            Mapping Rate, Duplicate Rate, Target in Mapping Rate, Region Length,
            Average Depth and the 1×/4×/10×/20×/30×/100× Coverage values
        '''
        if os.path.isfile(filename):
            with open(filename) as f:
                for line in f:
                    if line.startswith('##') or '\t' not in line:
                        continue
                    arrs = line.split('\t')
                    key  = arrs[0].strip()
                    val  = arrs[1].split('\n')[0]
                    if key=='[Total] Fraction of Mapped Reads':
                        dic['Mapping Rate']=val
                    elif key=='[Total] Fraction of PCR duplicate reads':
                        dic['Duplicate Rate']=val
                    elif key=='[Target] Fraction of Target Reads in mapped reads':
                        dic['Target in Mapping Rate']=val
                    elif key=='[Target] Len of region':
                        dic['Region Length']=val
                    elif key=='[Target] Average depth(rmdup)':
                        dic['Average Depth']=val
                    elif key=='[Target] Coverage (>0x)':
                        dic['1× Coverage']=val
                    elif key=='[Target] Coverage (>=4x)':
                        dic['4× Coverage']=val
                    elif key=='[Target] Coverage (>=10x)':
                        dic['10× Coverage']=val
                    elif key=='[Target] Coverage (>=20x)':
                        dic['20× Coverage']=val
                    elif key=='[Target] Coverage (>=30x)':
                        dic['30× Coverage']=val
                    elif key=='[Target] Coverage (>=100x)':
                        dic['100× Coverage']=val
                    #elif key=='[Target] Coverage (>=500x)':
                    #    dic['500× Coverage']=val
        return dic

    def collectInsertSize(self,filename,dic):
        '''
            Read the mean insert size from the gatk CollectInsertSizeMetrics output file
        '''
        if os.path.isfile(filename):
            temp_data = pandas.read_csv(
                filename,
                names=[
                    'MEDIAN_INSERT_SIZE',
                    'MODE_INSERT_SIZE',
                    'MEDIAN_ABSOLUTE_DEVIATION',
                    'MIN_INSERT_SIZE',
                    'MAX_INSERT_SIZE',
                    'MEAN_INSERT_SIZE',
                    'STANDARD_DEVIATION',
                    'READ_PAIRS',
                    'PAIR_ORIENTATION',
                    'WIDTH_OF_10_PERCENT',
                    'WIDTH_OF_20_PERCENT',
                    'WIDTH_OF_30_PERCENT',
                    'WIDTH_OF_40_PERCENT',
                    'WIDTH_OF_50_PERCENT',
                    'WIDTH_OF_60_PERCENT',
                    'WIDTH_OF_70_PERCENT',
                    'WIDTH_OF_80_PERCENT',
                    'WIDTH_OF_90_PERCENT',
                    'WIDTH_OF_95_PERCENT',
                    'WIDTH_OF_99_PERCENT',
                    'SAMPLE',
                    'LIBRARY',
                    'READ_GROUP'
                ],
                sep='\t',
                header=0,
                nrows=1,
                comment='#'
            )
            dic['mean insert size']=int(temp_data['MEAN_INSERT_SIZE'][0])
        return dic
    
    def usage(self):
        '''
        Print the usage message
        '''
        print('Usage : ./SingleQcUtil [OPTION]... or: python SingleQcUtil [OPTION]')
        print('''
            Combine the outputs of fastp, bamdst and gatk CollectInsertSizeMetrics(picard)
            into a QC result file with the .xls extension
            Example:
                ./SingleQcUtil.py \\
                    -o /opt/result/2019.result.QC.xls \\
                    --sample-fastp=/opt/result/2019_fastp.json  \\
                    --sample-bamdst=/opt/result/coverage.report \\
                    --sample-insertsize=/opt/result/2019_insertsize_metrics.txt
            or:
                python SingleQcUtil.py \\
                    --out=/opt/result/2019.result.QC.xls \\
                    --sample-fastp=/opt/result/2019_fastp.json  \\
                    --sample-bamdst=/opt/result/coverage.report \\
                    --sample-insertsize=/opt/result/2019_insertsize_metrics.txt
        ''')
        print('''Options:\n''')
        print('-o, --out=\t\toutput result file')
        print('--sample-fastp=\t\tfastp JSON output file')
        print('--sample-bamdst=\tbamdst coverage.report file')
        print('--sample-insertsize=\tgatk CollectInsertSizeMetrics(Picard) metrics file')
        print('-h, --help\t\tshow this help')
        print('-d, --document\t\tshow the development documentation')
        print('\n')
        print('Report bugs to <6041738@qq.com>.\n')


if __name__ == '__main__':
    f = SingleQcUtil()
    f.doProcess()

The final summary is a single wide row; transposed into a two-column layout it reads:

Item                                Value
Sample Number                       RD170300FFP
Total Reads (M) Before Filtering    1.54
Q20 Rate Before Filtering           96.92%
Q30 Rate Before Filtering           92.25%
GC Rate Before Filtering            48.99%
Total Reads (M) After Filtering     1.50
Q20 Rate After Filtering            96.92%
Q30 Rate After Filtering            92.25%
GC Rate After Filtering             48.99%
Mapping Rate                        99.98%
Duplicate Rate                      45.00%
Target in Mapping Rate              17.69%
Region Length                       55953
Average Depth                       154.77
1× Coverage                         25.06%
4× Coverage                         21.93%
10× Coverage                        21.16%
20× Coverage                        20.71%
30× Coverage                        20.36%
100× Coverage                       19.00%
mean insert size                    252
