next-generation sequencing analysis method——paper3

最新推荐文章于 2022-09-20 20:59:37 发布

weixin_42953727

最新推荐文章于 2022-09-20 20:59:37 发布

阅读量598

点赞数 1

分类专栏： bioinformatics 文章标签： NGS paper

本文链接：https://blog.csdn.net/weixin_42953727/article/details/101897052

版权

bioinformatics 专栏收录该内容

27 篇文章 5 订阅

订阅专栏

在这里插入图片描述
Abstract: available software to align reads to a reference; use resulting alignments to call, annotate, view, and filter small sequence variants; variant calling includes read alignment with novoalign, removal of duplicates with samtools or bam Utils, detection of variants with Freebayes or bam2mpg software. annotate the predicted vatiants using gene models, population frequencies with ANNOVAR , and mutation severity, producing variant files with VarSifter.

1.1 Statistical Methods in Variant Analysis

(1)read alignment:
types of index: “Burrows-Wheeler Transform”
full read alignments algorithm: Smith–Waterman or Needleman–-Wunsch
(2)Variant calling:
Bayesian approach (Freebayes groups short-range haplotypes into single loci, bam2mpg considers each single base, insertion, or deletion variant within read alignments separately.

1.2 Choice of Analysis Software

(1)Best-known software pipeline: BWA aligner and the Genome Analysis Toolkit (GATK)
GATK pipeline:
highly accurate, handling a variety of experimental designs(without identifying barcodes before sequencing, low depth of coverage) , recognize false positives
extensive skill in software installation, computational time and memory, optimal choices for software parameters and settings,
(2) other choice: alignment——novoalign; calling variants——Freebayes or bam2mpg; annotate——ANNOVAR; view——VarSifter variant viewer.
sequence reads to be analyzed are from single samples, rather than unidentified pools, and these samples
have been sequenced to sufficient depth of coverage to yield high quality variant calls (generally 25X or greater), other choice are faster to run, without a demonstrable decrease in accuracy

1.3 Variant Filtering and Accuracy

calling variants will have reduced accuracy in regions of the genome that are prone to misalignment(due either to repetitive sequences or inaccuracies in the genome reference build)

Removal of variant calls overlapping low-complexity regions(LCRs) and variants in regions with significantly large read depth.

1.4 Software Pipelines and Reproducibility

Several publicly available pipelines:
Galaxy Project: https://usegalaxy.org
blue collar bioinformatics” pipeline: https://github.com/chapmanb/bcbio-nextgen

2 Materials

This protocol outlines the steps required to analyze sequence reads, either single- or paired-end, from Illumina GAII or HiSeq sequence analyzers, in order to create files in VCF format containing genotypes of SNVs and small insertions and deletion variants. The required materials to follow this protocol are:

1. A high-performance computer or computer cluster running the Linux operating system (see below for hardware requirements).
2. Sequence reads in FASTQ format for samples of interest. Alternatively, Subheading 3.6.1 gives instructions for downloading a sample dataset containing sequence from the National Center for Biotechnology Information (NCBI).
3. Installed software for sequence analysis:
SRA Toolkit—http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software (optional, for use in obtaining sample data from the NCBI Sequence Read Archive)
Samtools—http://www.htslib.org
bamUtil—http://genome.sph.umich.edu/wiki/BamUtil (only required for PCR duplicate removal if the installed
version of samtools is 1.0 or later)
Novoalign—http://www.novocraft.com/products/novoalign/
**Variant and genotype calling software **(one of the following):
– Freebayes—https://github.com/ekg/freebayes
– Bam2mpg—http://research.nhgri.nih.gov/software/bam2mpg/
tabix—http://sourceforge.net/projects/samtools/files/tabix/
ANNOVAR—http://www.openbioinformatics.org/annovar/
VarSifter—https://github.com/teerjk/VarSifter

computer requirements:
at least 8 gigabytes of physical memory…
In addition, example processing scripts and updates regarding software usage changes are maintained in the github repository available at http://github.com/nhansen/MMBVariantCalling.

3 Methods

在这里插入图片描述

3.1 Alignment of Reads

3.1.1 FASTQ Files

FASTQ-formatted files have four lines per sequence read, containing

The read name preceded by the character “@”.
The sequence of the read, directed from 50 to 30.
A line beginning with the character “+”.
A line containing ASCII-encoded quality scores for each base.

3.1.2 Running Novoalign

index reference genome using the “novoindex” command; generate the file extension “.ndx”
run Novoalign: licensed or unlicensed software; licensed——allows to read FASTQ files that have been compressed with the “gzip”, multiple treads of a single computer; FASTQ files can be split into multiple smaller files, run seqaretely on different processors.
resulting BMA files.

# Run novoalign, piping output to samtools to make a sorted BAM file:
novoalign -F STDFQ -o SAM -d hg38.ndx -f SRR1611184_1.fastq.gz
SRR1611184_2.fastq.gz | samtools view -uS - |
samtools sort - NA12878

When using an unlicensed copy of novoalign, or when analyzing a large dataset such as the one obtained from whole genome sequencing, it is usually better to split large FASTQ files into multiple smaller ones and then run novoalign on each smaller FASTQ file separately. A script “split_fastq.pl” is provided for this purpose in the scripts subdirectory of the github repository at http://github.com/nhansen/MMBVariantCalling.

# Split fastq file obtained from NCBI into 40 separate smaller files
# --nogzip option will create uncompressed files and should only
# be used when running novoalign without the paid license
split_fastq.pl SRR1611184_1.fastq.gz 40 --nogzip --dir split_40/
split_fastq.pl SRR1611184_2.fastq.gz 40 --nogzip --dir split_40/

Resulting BAM files can be merged using samtools:

samtools merge NA12878.merge.bam NA12878.1.bam NA12878.2.bam [...]

3.2 Removal of PCR Duplicates

most common way to remove duplicate reads was samtools’s “rmdup” command, but samtools version 1.0 and later no longer have a working rmdup command
below describes bamUtil command “dedup” as an alternative

3.2.1 Duplicate Removal Option 1

prior to version 1.0 (e.g., version 0.1.19)

# remove PCR duplicates using samtools
samtools rmdup NA12878.merge.bam NA12878.final

3.2.2 Duplicate Removal Option 2

bamUtil package, available at http://genome.sph.umich.edu/wiki/BamUtil

# remove PCR duplicates using bamUtil
bam dedup --in NA12878.merge.bam --out NA12878.final.bam --rmDups

After either Option 1 or Option 2 has been completed, the final BAM file “NA12878.final.bam” should be indexed for fast access to genomic regions of interest.

# Index BAM file for fast retrieval of genomic regions
samtools index NA12878.final.bam

3.3 Variant and Genotype Calling

software: Freebayes and bam2mpg（variant calls from duplicate-free BAM files）

3.3.1 Variant Calling with Freebayes

Freebayes is a haplotype-based caller for producing VCF formatted single nucleotide and small deletion/insertion variant calls from BAM-formatted sequences.

freebayes --genotype-qualities -f hg38.mfa NA12878.final.bam > NA12878.vcf

3.3.2 Variant Calling with bam2mpg

bam2mpg --bam_filter ’-q31’ --qual_filter 20 --only_nonref --snv_vcf NA12878.snv.vcf --div_vcf NA12878.div.vcf hg19.mfa NA12878.final.bam

To obtain a BED file of regions determined to have adequate depth of coverage for calling diploid variants, one must not use the “–only_nonref” option when running bam2mpg. This will create larger files than the “variant only” VCF files (on the order of 5 gigabytes for a whole exome sample, or 15 gigabytes for a whole
genome sample), but the resulting “MPG” formatted output filecan be used as input to the script “mpg2bed.pl” to create a BED formatted file of adequately covered regions:

bam2mpg --bam_filter ’-q31’ --qual_filter 20 --mpg NA12878.mpg.out.gz
hg19.mfa NA12878.final.bam
mpg2bed.pl --minscore 10 sample.mpg.out.gz > NA12878.mpg.bed

3.4 Annotation of VCF Files

ANNOVAR can be used to compare variant locations to annotation sets downloaded from the ANNOVAR website and the University of California at Santa Cruz’s (UCSC) Genome Browser .
Specifically, **lists of “gene-based” **(describing variants’ impact on gene models from NCBI’s RefSeq, UCSC’s known gene, or the ENSEMBL gene set) and **“filter-based” **(annotating specific variants with a wide variety of allele frequency, pathogenicity, and conservation-related measures) can be downloaded and used to add annotations to VCF files for viewing and filtering.

For example, to add gene annotation from the National Center of Biotechnology’s “RefGene” database to a variant VCF file, first download the “refGene” database from the ANNOVAR website:

# Download RefSeq gene annotations
annotate_variation.pl --buildver hg38 -downdb -webfrom annovar refGene humandb/

Annotating variants with allele frequencies from different populations can also be helpful in determining whether a particular variant might be pathogenic. So also download the Exome Aggregation Consortium’s “ExAC” database:

# Download ExAC variant annotations
annotate_variation.pl --buildver hg38 -downdb -webfrom annovar exac03 humandb/

Finally, run ANNOVAR’s table_annovar.pl script using these databases to add INFO fields to the variant VCF file:

# Add RefGene and ExAC annotations to the VCF file
table_annovar.pl --vcfinput NA12878.rmdup.freebayes.vcf.gz humandb/
-buildver hg38 -remove -protocol refGene,exac03
-out NA12878_freebayes -operation g,f humandb/

3.5 Viewing and Filtering Variants

Figure 2 shows the graphical interface of the viewing tool VarSifter. Using checkboxes and custom queries, VarSifter makes it possible to filter annotated VCF files based on variants’ impact on protein sequence, population allele frequency (e.g., from large sequencing projects such as ExomeGO), consistency with known Mendelian inheritance schemes, or potential to be pathogenic from databases like HGMD.
To open a VCF file in VarSifter, type:

java -Xmx1G -jar VarSifter<version>.jar

where “version” is the version number of VarSifter installed, and the memory requirement specified after “-Xmx” may need to be increased for larger VCF files. Open the VCF file to be viewed, and select the name of the INFO fields containing the mutation type and gene name (for RefSeq annotations added by ANNOVAR, these fields have identifiers “ExonicFunc.RefGene” and “Gene. RefGene,” respectively). VarSifter then allows the user to select which INFO annotations in the VCF file to display.

To filter based on genomic region, download the BED file of low-complexity repeat locations used in reference:

wget https://github.com/lh3/varcmp/blob/master/scripts/LCR-hs38.bed.gz?raw=true

3.6 Supplementary Protocols

3.6.1 Obtaining Sequence Data from NCBI

With the SRA Toolkit and Aspera Connect/ascp installed, a sample dataset for this method can be downloaded using the “prefetch” command, and converted to FASTQ format using fastq-dump:

# Download the .sra file and necessary NCBI reference sequences into
# your home directory’s "ncbi" subdirectory
prefetch SRR1611184
# Convert .sra file into fastq format, separating read 1 and read 2
# into two files, and gzip’ing to save space
fastq-dump -I --gzip --split-3 SRR1611184

If a novoalign license has not been purchased, the “–gzip” option should be omitted from the fastq-dump call, and larger, uncompressed FASTQ files will be generated.

3.6.2 Downloading a Genome’s Reference Sequence

use the rsync command:

# Find reference sequences for your genome build of interest at
# ftp://hgdownload.cse.ucsc.edu/goldenPath
# rsync places gzipped fasta file in current directory
rsync -a -P
rsync://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz ./

3.6.3 Removing Alternate Haplotypes and Masking Pseudo-Autosomal Regions in the Reference

Entries for alternate haplotypes should usually be removed from the reference FASTA file before formatting an alignment index. If a user is particularly interested in one of the regions for which common alternate haplotypes exist, he or she should explore possibilities for custom, region-specific analyses, which are not covered in this protocol.

It is also helpful to mask the pseudo-autosomal regions, or “PARs,” of the organism’s sex chromosomes. These PARs are identical for the X and the Y chromosomes, and like the alternate haplotypes above, they can confuse alignment software because they appear to be exact repeats. For human reference builds, the coordinates of these regions are available from the UCSC Genome Browser.

A simple Perl script for removing alternate haplotype sequences from a reference assembly and masking the PARs on chromosome Y for human is included in the “scripts” subdirectory of the github repository at http://github.com/nhansen/MMBVariantCalling, and can be run by typing:

# Specify output fasta as an "mfa", or "multi-FASTA" file:
prepare_reference_fasta.pl --build hg38 --input hg38.fa.gz
–output hg38.mfa

3.6.4 Formatting a Novoalign Database

A FASTA file of chromosome sequences can be downloaded from a resource like the UCSC Genome Browser. If
a reference database has not been created ahead of time, it can be created using the “novoindex” command of novoalign:

# Format an hg38 index for novoalign (requires 8Gb of memory
# for a human genome)
novoindex -n hg38 hg38.ndx hg38.mfa

Notes

The program novoalign is available with a free license. However, with a paid license, it can be run using multiple threads, which can considerably speed up processing time, especially for whole genome sequences. Without a paid license, it is also necessary to use uncompressed FASTQ files as input to novoalign, which necessitates a greater use of disk space.
When running novoalign on paired-end reads, the reads in the FASTQ file for read number one must be in the same order as their pairs in the FASTQ file for read number two. Most paired FASTQ files are correctly ordered, but a mismatch in ordering will result in an error of the form “Error: Read headers do not match at Record #…”.
If reads are greater than 100 bases in length, novoalign can be run with the option “-t 400” to decrease run time without significant loss of alignment accuracy.
The program Freebayes should be installed using “git clone” with the recursive option, which will download necessary external repositories that are not included in the distributed tarball.