接上一篇Chapter 11
Visualizing Alignments with samtools tview and the Integrated Genomics Viewer
samtools tview
- Samtools tview requires position-sorted and indexed BAM files as input.
- View these alignments with samtools tview
$ samtools tview NA12891_CEU_sample.bam human_g1k_v37.fasta
- go to a specific region with the option -p:
$ samtools tview -p 1:215906469-215906652 NA12891_CEU_sample.bam \
human_g1k_v37.fasta
press ? to see these options
IGV
Genomes → Load Genome from File
Genomes → Load Genome From Server
Pileups with samtools pileup, Variant Calling, and Base Alignment Quality
pileup format
The per-base summary of the alignment data created in a pileup can then be used to identify variants (regions different from the reference), and determine sample individuals’ genotypes
Samtools’s mpileup subcommand creates pileups from BAM files, and this tool is the first step in samtools-based variant calling pipelines.
Samtools’s clever Base Alignment Quality algorithm can prevent erroneous variant calls due to misalignment.
$ samtools mpileup --no-BAQ --region 1:215906528-215906567 \
--fasta-ref human_g1k_v37.fasta NA12891_CEU_sample.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1 215906528 G 21 ,,,,,,,,.,,,.,,..,.,, ;=?./:?>>;=7?>>@A?==:
1 215906529 A 18 ,,,,,,,,.,,,,,..,. D>AA:@A>9>?;;?>>@=
[...]
1 215906547 C 15 gGg$,GggGG,,.... <;80;><9=86=C>=
1 215906548 G 19 c$,ccC.,.,,,.,....,^]. ;58610=7=>75=7<463;
[...]
1 215906555 G 16 .$aaaaaA.AAAaAAA^:A 2@>?8?;<:335?:A>
[...]
-
Samtools mpileup requires an input BAM file (in this example, we use NA12891_CEU_sample.bam). We also supply a reference genome in FASTA format through the –fasta-ref or -f options (be sure to use the exact same reference used for mapping) so samtools mpileup knows each reference base. Additionally, we use the –region or -r options to limit our pileup to the same region we visualized with IGV. Lastly, we disable Base Alignment Quality (BAQ) with –no-BAQ or -B, an additional feature of samtools mpileup we’ll discuss later.
-
This is is a typical line in the pileup format. The columns are:
• Reference sequence name (e.g., chromosome 1 in this entry).
• Position in reference sequence, 1-indexed (position 215,906,528 in this entry).
• Reference sequence base at this position (G in this entry).
• Depth of aligned reads at this position (the depth or coverage, 21 in this entry).
• This column encodes the reference reads bases. Periods (.) indicate a reference sequence match on the forward strand, commas (,) indicate a reference sequence match to the reverse strand, an uppercase base (either A, T, C, G, or N) indicates a mismatch on the forward strand, and a lowercase base (a, t, c, g, or n) indicates a mismatch on the reverse strand. The ^ and $ characters indicate the start and end of reads, respectively. Insertions are denoted with a plus sign (+), followed by the length of the insertion, followed by the sequence of the insertion. An insertion on a line indicates it’s between the current line and the next line. Similarly, deletions are encoded with a minus sign (-), followed by the length of the deletion and the deleted sequence. Lastly, the mapping quality of each alignment is specified after the beginning of the alignment character, ^ as the ASCII character value minus 33.
• Finally, the last column indicates the base qualities (the ASCII value minus
33). -
On this line and the next line are the stacked mismatches we saw at positions 215,906,547 and 215,906,548 with IGV in Figure 11-2. These lines show the stacked mismatches as nucleotides in the fifth column. Note that variants in this column are both lower- and uppercase, indicating they are supported by both reads aligning to both the forward and reverse strands.
-
Finally, note at this position most reads disagree with the reference base. Also, this line contains the start of a new read (indicated with the ^ mark), with mapping quality 25 (the character : has ASCII code 58, and 58 - 33 = 25; try ord( : )- 33 in Python).
Calling variants with samtools and its companion tool bcftools is a two-step process
- In the first step, samtools mpileup
-v or -g: generate genotype likelihoods for every site in the genome (or all sites within a region if one is specified).
-v: return result to Variant Call Format(VCF)
-g: return result to BCF (the binary analog of VCF) - In the second step, bcftools call will filter these results so only variant sites remain, and call genotypes for all individuals at these sites. Let’s see how the first step works by calling samtools mpileup with -v in the region we investigated earlier with IGV:
$ samtools mpileup -v --no-BAQ --region 1:215906528-215906567 \
--fasta-ref human_g1k_v37.fasta NA12891_CEU_sample.bam \
> NA12891_CEU_sample.vcf.gz
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
-u: force samtools mpileup to output uncompressed
$ zgrep "^##" -v NA12891_CEU_sample.vcf.gz | \
awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1,$2,$4,$5,$6,a[1],$9,$10}'
#CHROM POS REF ALT QUAL INFO FORMAT NA12891
1 215906528 G <X> 0 DP=21 PL 0,63,236
1 215906529 A <X> 0 DP=22 PL 0,54,251
[...]
1 215906547 C G,<X> 0 DP=22 PL 123,0,103,144,127,233
1 215906548 G C,<X> 0 DP=22 PL 23,0,163,68,175,207
[...]
1 215906555 G A,<X> 0 DP=19 PL 184,7,0,190,42,204
[...]
zgrep -v to remove the header and awk to select some columns of interest.
Run this bcftools call step, and then do some exploring of where we lose variants and why:
$ bcftools call -v -m NA12891_CEU_sample.vcf.gz > NA12891_CEU_sample_calls.vcf.gz
-m