Bioinformatics Data Skills by Oreilly学习笔记-11-2

本文介绍了如何使用samtools tview和Integrated Genomics Viewer (IGV)来可视化比对,并通过samtools pileup进行变异调用和碱基对齐质量分析。详细阐述了samtools mpileup在识别变异和处理BAM文件中的作用,以及如何使用Pysam创建自定义的SAM/BAM处理工具。
摘要由CSDN通过智能技术生成

接上一篇Chapter 11

Visualizing Alignments with samtools tview and the Integrated Genomics Viewer
samtools tview
  1. Samtools tview requires position-sorted and indexed BAM files as input.
  2. View these alignments with samtools tview
$ samtools tview NA12891_CEU_sample.bam human_g1k_v37.fasta
  1. go to a specific region with the option -p:
$ samtools tview -p 1:215906469-215906652 NA12891_CEU_sample.bam \
human_g1k_v37.fasta

press ? to see these options
在这里插入图片描述

IGV

Genomes → Load Genome from File
Genomes → Load Genome From Server
在这里插入图片描述

Pileups with samtools pileup, Variant Calling, and Base Alignment Quality

pileup format
The per-base summary of the alignment data created in a pileup can then be used to identify variants (regions different from the reference), and determine sample individuals’ genotypes

Samtools’s mpileup subcommand creates pileups from BAM files, and this tool is the first step in samtools-based variant calling pipelines.
Samtools’s clever Base Alignment Quality algorithm can prevent erroneous variant calls due to misalignment.

$ samtools mpileup --no-BAQ --region 1:215906528-215906567 \
--fasta-ref human_g1k_v37.fasta NA12891_CEU_sample.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1 215906528 G 21 ,,,,,,,,.,,,.,,..,.,, ;=?./:?>>;=7?>>@A?==:
1 215906529 A 18 ,,,,,,,,.,,,,,..,. D>AA:@A>9>?;;?>>@=
[...]
1 215906547 C 15 gGg$,GggGG,,.... <;80;><9=86=C>=
1 215906548 G 19 c$,ccC.,.,,,.,....,^]. ;58610=7=>75=7<463;
[...]
1 215906555 G 16 .$aaaaaA.AAAaAAA^:A 2@>?8?;<:335?:A>
[...]
  1. Samtools mpileup requires an input BAM file (in this example, we use NA12891_CEU_sample.bam). We also supply a reference genome in FASTA format through the –fasta-ref or -f options (be sure to use the exact same reference used for mapping) so samtools mpileup knows each reference base. Additionally, we use the –region or -r options to limit our pileup to the same region we visualized with IGV. Lastly, we disable Base Alignment Quality (BAQ) with –no-BAQ or -B, an additional feature of samtools mpileup we’ll discuss later.

  2. This is is a typical line in the pileup format. The columns are:
    Reference sequence name (e.g., chromosome 1 in this entry).
    Position in reference sequence, 1-indexed (position 215,906,528 in this entry).
    Reference sequence base at this position (G in this entry).
    Depth of aligned reads at this position (the depth or coverage, 21 in this entry).
    • This column encodes the reference reads bases. Periods (.) indicate a reference sequence match on the forward strand, commas (,) indicate a reference sequence match to the reverse strand, an uppercase base (either A, T, C, G, or N) indicates a mismatch on the forward strand, and a lowercase base (a, t, c, g, or n) indicates a mismatch on the reverse strand. The ^ and $ characters indicate the start and end of reads, respectively. Insertions are denoted with a plus sign (+), followed by the length of the insertion, followed by the sequence of the insertion. An insertion on a line indicates it’s between the current line and the next line. Similarly, deletions are encoded with a minus sign (-), followed by the length of the deletion and the deleted sequence. Lastly, the mapping quality of each alignment is specified after the beginning of the alignment character, ^ as the ASCII character value minus 33.
    • Finally, the last column indicates the base qualities (the ASCII value minus
    33).

  3. On this line and the next line are the stacked mismatches we saw at positions 215,906,547 and 215,906,548 with IGV in Figure 11-2. These lines show the stacked mismatches as nucleotides in the fifth column. Note that variants in this column are both lower- and uppercase, indicating they are supported by both reads aligning to both the forward and reverse strands.

  4. Finally, note at this position most reads disagree with the reference base. Also, this line contains the start of a new read (indicated with the ^ mark), with mapping quality 25 (the character : has ASCII code 58, and 58 - 33 = 25; try ord( : )- 33 in Python).

Calling variants with samtools and its companion tool bcftools is a two-step process

  1. In the first step, samtools mpileup
    -v or -g: generate genotype likelihoods for every site in the genome (or all sites within a region if one is specified).
    -v: return result to Variant Call Format(VCF)
    -g: return result to BCF (the binary analog of VCF)
  2. In the second step, bcftools call will filter these results so only variant sites remain, and call genotypes for all individuals at these sites. Let’s see how the first step works by calling samtools mpileup with -v in the region we investigated earlier with IGV:
$ samtools mpileup -v --no-BAQ --region 1:215906528-215906567 \
--fasta-ref human_g1k_v37.fasta NA12891_CEU_sample.bam \
> NA12891_CEU_sample.vcf.gz
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

-u: force samtools mpileup to output uncompressed

$ zgrep "^##" -v NA12891_CEU_sample.vcf.gz | \
awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1,$2,$4,$5,$6,a[1],$9,$10}'
#CHROM POS REF ALT QUAL INFO FORMAT NA12891
1 215906528 G <X> 0 DP=21 PL 0,63,236
1 215906529 A <X> 0 DP=22 PL 0,54,251
[...]
1 215906547 C G,<X> 0 DP=22 PL 123,0,103,144,127,233
1 215906548 G C,<X> 0 DP=22 PL 23,0,163,68,175,207
[...]
1 215906555 G A,<X> 0 DP=19 PL 184,7,0,190,42,204
[...]

zgrep -v to remove the header and awk to select some columns of interest.

Run this bcftools call step, and then do some exploring of where we lose variants and why:

$ bcftools call -v -m NA12891_CEU_sample.vcf.gz > NA12891_CEU_sample_calls.vcf.gz

-m

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
This practical book teaches the skills that scientists need for turning large sequencing datasets into reproducible and robust biological findings. Many biologists begin their bioinformatics training by learning scripting languages like Python and R alongside the Unix command line. But there's a huge gap between knowing a few programming languages and being prepared to analyze large amounts of biological data. Rather than teach bioinformatics as a set of workflows that are likely to change with this rapidly evolving field, this book demsonstrates the practice of bioinformatics through data skills. Rigorous assessment of data quality and of the effectiveness of tools is the foundation of reproducible and robust bioinformatics analysis. Through open source and freely available tools, you'll learn not only how to do bioinformatics, but how to approach problems as a bioinformatician. Go from handling small problems with messy scripts to tackling large problems with clever methods and tools Focus on high-throughput (or "next generation") sequencing data Learn data analysis with modern methods, versus covering older theoretical concepts Understand how to choose and implement the best tool for the job Delve into methods that lead to easier, more reproducible, and robust bioinformatics analysis Table of Contents Part I. Ideology: Data Skills for Robust and Reproducible Bioinformatics Chapter 1. How to Learn Bioinformatics Part II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project Chapter 2. Setting Up and Managing a Bioinformatics Project Chapter 3. Remedial Unix Shell Chapter 4. Working with Remote Machines Chapter 5. Git for Scientists Chapter 6. Bioinformatics Data Part III. Practice: Bioinformatics Data Skills Chapter 7. Unix Data Tools Chapter 8. A Rapid Introduction to the R Language Chapter 9. Working with Range Data Chapter 10. Working with Sequence Data Chapter 11. Working with Alignment Data Chapter 12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks Chapter 13. Out-of-Memory Approaches: Tabix and SQLite Chapter 14. Conclusion
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值