fgbio Best Practice FASTQ -> Consensus Pipeline
This document describes the best practice(s) for building pipelines to go from FASTQ files that contain a mixture of template (genomic) bases and UMI bases, through to aligned and filtered consensus reads.
The pipeline is broken into two phases:
- Getting to aligned and grouped raw reads (identifying which reads are from the same source molecules)
- Going from grouped raw reads to aligned and filtered consensus reads
We provide two alternatives for the second phase. One is intended for R&D environments where it is desirable to branch off from the pipeline and test, for example, multiple consensus-calling or filtering parameters. The other is intended for high-throughput production environments where performance and throughput take precedence over flexibility.
Example Pipeline Implementations
A simple Snakemake pipeline that implements the R&D version of the best-practice pipeline is available in FastqToConsensus-RnD.smk; it is intended for demo purposes.
Tools used
The steps described below use a mixture of:
- fgbio (version 2.0 or higher)
- samtools (version 1.14 or higher)
- bwa mem
Notes on some common options
The fgbio toolkit exposes some common options in addition to those supported by each tool. The general form of running fgbio is:
fgbio <common options> Tool <tool options>
The following options are frequently used to optimize performance:
Compression
--compression can be used to control the GZIP compression level used when writing BAM files and other gzip-compressed files. The default value is 5. Lower values yield less compression (and larger file sizes) but less CPU usage and shorter runtimes. The recommendation is to set the compression level as follows:
- When outputting file(s) that will be retained as outputs of the pipeline, use the default --compression 5
- When outputting file(s) that are inputs to subsequent tools but are discarded at or before the end of the pipeline, use --compression 1. Compression level 1 uses a lot less CPU than level 5, but still generates substantially smaller files than turning compression off with level 0. We have found this to be a good tradeoff for intermediate files that must be written to disk, where both compression time and disk IO can be limiting.
- When outputting to stdout and piping into another tool, use --compression 0 to avoid compression
Async IO
--async-io causes additional threads to be used to i) read ahead and decompress/decode records when reading SAM, BAM and VCF files, and ii) encode and compress records when writing SAM, BAM and VCF files. Each file being read or written is allocated one additional thread. So, for example, a tool that reads one BAM file and writes one BAM file will end up using three threads (one reader, one worker, one writer). The additional threads off-load work from the main/worker thread; depending on the tool in use, they may offer anything from a very minor performance boost to a significant one.
We recommend using --async-io in places where compressed outputs are being written and the tool in use needs more than a nominal amount of CPU for non-IO activities.
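As a concrete illustration of the general form shown above, the sketch below combines these common options with a tool invocation. The tool chosen (fgbio's SortBam), memory setting, and file names are illustrative assumptions and not part of this pipeline:
# Illustrative only: common options precede the tool name, tool options follow it
fgbio -Xmx4g --compression 1 --async-io SortBam \
--input in.bam \
--output sorted.bam \
--sort-order Coordinate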
Phase 1: FASTQ -> Grouped BAM
The first phase deals with going from FASTQ files through to aligned and grouped BAM files; its general structure is laid out in the steps below.
In this example we’ll use a single sample with paired-end fastq files and an inline 8bp UMI at the beginning of read one (see the fgbio Read Structure page for how to specify other UMI setups).
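For reference, the sketch below shows how a couple of alternative UMI setups might be expressed with --read-structures; both setups are illustrative assumptions and are not used in this example:
# Hypothetical dual-UMI setup: an 8bp inline UMI at the start of both read one and read two
--read-structures 8M+T 8M+T
# Hypothetical setup: a 5bp UMI followed by a 2bp spacer to skip, at the start of read one only
--read-structures 5M2S+T +T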
Step 1.1: FASTQ -> uBam
This step extracts the UMI bases and stores them in standard SAM tags. It can extract any number of inline UMIs within the sequencing reads and, optionally, by providing the --extract-umis-from-read-names flag, UMIs from the read names. Downstream fgbio tools expect the UMIs to be stored in tags in the BAM file (e.g. RX:Z:ACTGCTAG); converting to a uBam as the first step is the easiest way to do this correctly.
When working with fully degenerate UMIs that cannot be error-corrected, the uBam should be generated as follows:
fgbio -Xmx1g --compression 1 --async-io FastqToBam \
--input r1.fq.gz r2.fq.gz \
--read-structures 8M+T +T \
--sample s1 \
--library library1 \
--platform-unit ABCDEAAXX.1 \
--output s1.unmapped.bam
If, however, the UMIs are a limited and non-fully degenerate set then they can be error-corrected after extraction from the reads as follows:
fgbio -Xmx1g --compression 0 FastqToBam \
--input r1.fq.gz r2.fq.gz \
--read-structures 8M+T +T \
--sample s1 \
--library library1 \
--platform-unit ABCDEAAXX.1 \
--output /dev/stdout \
| fgbio -Xmx2g --compression 1 --async-io CorrectUmis \
--input /dev/stdin \
--output s1.unmapped.bam \
--max-mismatches 1 \
--min-distance 2 \
--umi-files file-of-expected-umis-one-per-line.txt \
--dont-store-original-umis
Neither option above requires or performs any sorting.
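For reference, the file passed to --umi-files above simply lists the expected UMI sequences, one per line. The sequences below are hypothetical placeholders:
# file-of-expected-umis-one-per-line.txt (hypothetical contents)
AAGGATCC
CCTTAGGA
GGAACCTT
TTCCGGAA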
Step 1.2: uBam -> Mapped BAM
This step uses a combination of samtools (to convert the uBam back to FASTQ), bwa mem to do the actual alignment, and fgbio ZipperBams to carry metadata and UMIs forward from the unmapped BAM. No sorting is performed in this step.
# Align the data with bwa and recover headers and tags from the unmapped BAM
samtools fastq s1.unmapped.bam \
| bwa mem -t 16 -p -K 150000000 -Y ref.fa - \
| fgbio -Xmx4g --compression 1 --async-io ZipperBams \
--unmapped s1.unmapped.bam \
--ref ref.fa \
--output s1.mapped.bam
Some important bwa mem options used here are:
- -t 16 to set the number of threads to use to 16; this can be set to any number between 1 and the number of CPU cores available
- -K 150000000 to tell bwa to process reads/bases in fixed-size chunks regardless of the number of threads; this makes bwa deterministic and reproducible with different thread counts
- -Y to use soft-clipping (instead of hard clipping) for supplementary alignments
Under no circumstances should -M be used to mark shorter split hits as secondary, as this will cause problems with downstream tools.
Step 1.3: Mapped BAM -> Grouped BAM
This step identifies reads or read pairs that originate from the same source molecule based on genomic positions and UMI. For most data this is the recommended way to run grouping:
# Group the reads by position and UMI
fgbio -Xmx8g --compression 1 --async-io GroupReadsByUmi \
--input s1.mapped.bam \
--strategy Adjacency \
--edits 1 \
--output s1.grouped.bam \
--family-size-histogram s1.tag-family-sizes.txt
However:
- If UMIs are significantly longer (e.g. 20bp) or have more errors, --edits can be increased
- For multiplex PCR and similar data, where reads' genomic positions are fixed by the primers, it is recommended to use --strategy Identity to reduce runtime at the expense of lower accuracy (a sketch of this variant follows this list)
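The Identity-strategy variant mentioned above might look like the following sketch; the file names are carried over from the command above and the other parameters are unchanged:
# Group amplicon reads by position and exact UMI sequence (no edit tolerance)
fgbio -Xmx8g --compression 1 --async-io GroupReadsByUmi \
--input s1.mapped.bam \
--strategy Identity \
--output s1.grouped.bam \
--family-size-histogram s1.tag-family-sizes.txt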
Phase 2(a): GroupedBam -> Filtered Consensus, R&D Version
Step 2(a).1: GroupedBam -> Consensus uBam
This step generates unmapped consensus reads from the grouped reads. Depending on your data you may choose to raise or lower the --min-input-base-quality to accept or reject more bases from the grouped raw reads. In an R&D setting it is frequently useful to see all consensus reads, but there is a performance trade-off. Frequently a very large fraction of molecules will have only a single read[-pair]; if consensus reads with depth 1 will be discarded later, setting --min-reads 2 (or higher) can cause this step to run significantly faster.
fgbio -Xmx4g --compression 1 CallMolecularConsensusReads \
--input s1.grouped.bam \
--output s1.cons.unmapped.bam \
--min-reads 1 \
--min-input-base-quality 20 \
--threads 4
Step 2(a).2: Consensus uBam -> Consensus Mapped BAM
Remapping the consensus reads is nearly identical to mapping the raw reads, with the addition of the --tags-to-reverse and --tags-to-revcomp options. As specified below, these cause the per-base tags generated by consensus calling to be reversed and/or reverse-complemented on reads mapped to the negative strand, so that they stay in sync with the read's bases and quality scores.
The -Y flag to bwa (causing it to soft-clip instead of hard-clip supplementary alignments) is critical at this step; otherwise the per-base tags from the consensus caller will get out of sync with the read's bases and quality scores.
# Re-align the consensus reads
samtools fastq s1.cons.unmapped.bam \
| bwa mem -t 16 -p -K 150000000 -Y ref.fa - \
| fgbio -Xmx4g --compression 1 --async-io ZipperBams \
--unmapped s1.cons.unmapped.bam \
--ref ref.fa \
--tags-to-reverse Consensus \
--tags-to-revcomp Consensus \
--output s1.cons.mapped.bam
Step 2(a).3: Consensus Mapped -> Consensus Filtered & Sorted BAM
Finally we take the mapped consensus reads, filter them, and sort them ready for downstream analysis. The parameters to FilterConsensusReads will need tuning based on the specific library preparation technology and the downstream analysis needs. The filtering parameters presented here (--min-reads 3 --min-base-quality 45 --max-base-error-rate 0.2) aim for a balance between the quality and the yield of consensus reads. This step can be repeated one or more times with different filtering criteria using the same consensus mapped BAM to explore different filtering strategies; a sketch of such a re-run follows the command below.
# Filter and sort the consensus reads
fgbio -Xmx8g --compression 0 FilterConsensusReads \
--input s1.cons.mapped.bam \
--output /dev/stdout \
--ref ref.fa \
--min-reads 3 \
--min-base-quality 45 \
--max-base-error-rate 0.2 \
| samtools sort --threads 8 -o s1.cons.filtered.bam --write-index
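For example, a second, stricter filtering pass over the same consensus mapped BAM might look like the following; the thresholds and output file name are illustrative, not a recommendation:
# Re-filter the same consensus mapped BAM with stricter criteria
fgbio -Xmx8g --compression 0 FilterConsensusReads \
--input s1.cons.mapped.bam \
--output /dev/stdout \
--ref ref.fa \
--min-reads 5 \
--min-base-quality 50 \
--max-base-error-rate 0.1 \
| samtools sort --threads 8 -o s1.cons.filtered.strict.bam --write-index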
Phase 2(b): GroupedBam -> Filtered Consensus, HT Version
For a high-throughput production pipeline looking to maximize throughput and minimize turnaround time, the following pipeline is recommended:
Step 2(b).1: GroupedBam -> Filtered Consensus uBam
This step generates unmapped consensus reads from the grouped reads and immediately filters them.
min_reads=3 # Mirrors the setting in 2(a).3
fgbio -Xmx4g --compression 0 CallMolecularConsensusReads \
--input s1.grouped.bam \
--output /dev/stdout \
--min-reads $min_reads \
--min-input-base-quality 20 \
--threads 4 \
| fgbio -Xmx8g --compression 1 FilterConsensusReads \
--input /dev/stdin \
--output s1.cons.unmapped.bam \
--ref ref.fa \
--min-reads $min_reads \
--min-base-quality 45 \
--max-base-error-rate 0.2
Step 2(b).2: Consensus Filtered uBam -> Consensus Mapped & Filtered BAM
This step is similar to the re-mapping in 2(a).2, but since the uBam has already been filtered, we have ZipperBams output uncompressed BAM and pipe it immediately into samtools to coordinate-sort and index the resulting BAM ready for analysis.
The -Y flag to bwa (causing it to soft-clip instead of hard-clip supplementary alignments) is critical at this step; otherwise the per-base tags from the consensus caller will get out of sync with the read's bases and quality scores.
# Re-align the consensus reads
samtools fastq s1.cons.unmapped.bam \
| bwa mem -t 16 -p -K 150000000 -Y ref.fa - \
| fgbio -Xmx4g --compression 0 --async-io ZipperBams \
--unmapped s1.cons.unmapped.bam \
--ref ref.fa \
--tags-to-reverse Consensus \
--tags-to-revcomp Consensus \
| samtools sort --threads 8 -o s1.cons.filtered.bam --write-index