Bioinformatics Data Skills by Oreilly学习笔记-11-2

最新推荐文章于 2024-05-20 09:36:30 发布

weixin_42953727

最新推荐文章于 2024-05-20 09:36:30 发布

阅读量650

点赞数 1

分类专栏： bioinformatics 文章标签： Bioinformatics

本文链接：https://blog.csdn.net/weixin_42953727/article/details/100674514

版权

本文介绍了如何使用samtools tview和Integrated Genomics Viewer (IGV)来可视化比对，并通过samtools pileup进行变异调用和碱基对齐质量分析。详细阐述了samtools mpileup在识别变异和处理BAM文件中的作用，以及如何使用Pysam创建自定义的SAM/BAM处理工具。

摘要由CSDN通过智能技术生成

接上一篇Chapter 11

Visualizing Alignments with samtools tview and the Integrated Genomics Viewer

samtools tview

Samtools tview requires position-sorted and indexed BAM files as input.
View these alignments with samtools tview

$ samtools tview NA12891_CEU_sample.bam human_g1k_v37.fasta

go to a specific region with the option -p:

$ samtools tview -p 1:215906469-215906652 NA12891_CEU_sample.bam \
human_g1k_v37.fasta

press ? to see these options
在这里插入图片描述

IGV

Genomes → Load Genome from File
Genomes → Load Genome From Server
在这里插入图片描述

Pileups with samtools pileup, Variant Calling, and Base Alignment Quality

pileup format
The per-base summary of the alignment data created in a pileup can then be used to identify variants (regions different from the reference), and determine sample individuals’ genotypes

Samtools’s mpileup subcommand creates pileups from BAM files, and this tool is the first step in samtools-based variant calling pipelines.
Samtools’s clever Base Alignment Quality algorithm can prevent erroneous variant calls due to misalignment.

$ samtools mpileup --no-BAQ --region 1:215906528-215906567 \
--fasta-ref human_g1k_v37.fasta NA12891_CEU_sample.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1 215906528 G 21 ,,,,,,,,.,,,.,,..,.,, ;=?./:?>>;=7?>>@A?==:
1 215906529 A 18 ,,,,,,,,.,,,,,..,. D>AA:@A>9>?;;?>>@=
[...]
1 215906547 C 15 gGg$,GggGG,,.... <;80;><9=86=C>=
1 215906548 G 19 c$,ccC.,.,,,.,....,^]. ;58610=7=>75=7<463;
[...]
1 215906555 G 16 .$aaaaaA.AAAaAAA^:A 2@>?8?;<:335?:A>
[...]

Samtools mpileup requires an input BAM file (in this example, we use NA12891_CEU_sample.bam). We also supply a reference genome in FASTA format through the –fasta-ref or -f options (be sure to use the exact same reference used for mapping) so samtools mpileup knows each reference base. Additionally, we use the –region or -r options to limit our pileup to the same region we visualized with IGV. Lastly, we disable Base Alignment Quality (BAQ) with –no-BAQ or -B, an additional feature of samtools mpileup we’ll discuss later.
This is is a typical line in the pileup format. The columns are:
• Reference sequence name (e.g., chromosome 1 in this entry).
• Position in reference sequence, 1-indexed (position 215,906,528 in this entry).
• Reference sequence base at this position (G in this entry).
• Depth of aligned reads at this position (the depth or coverage, 21 in this entry).
• This column encodes the reference reads bases. Periods (.) indicate a reference sequence match on the forward strand, commas (,) indicate a reference sequence match to the reverse strand, an uppercase base (either A, T, C, G, or N) indicates a mismatch on the forward strand, and a lowercase base (a, t, c, g, or n) indicates a mismatch on the reverse strand. The ^ and $ characters indicate the start and end of reads, respectively. Insertions are denoted with a plus sign (+), followed by the length of the insertion, followed by the sequence of the insertion. An insertion on a line indicates it’s between the current line and the next line. Similarly, deletions are encoded with a minus sign (-), followed by the length of the deletion and the deleted sequence. Lastly, the mapping quality of each alignment is specified after the beginning of the alignment character, ^ as the ASCII character value minus 33.
• Finally, the last column indicates the base qualities (the ASCII value minus
33).
On this line and the next line are the stacked mismatches we saw at positions 215,906,547 and 215,906,548 with IGV in Figure 11-2. These lines show the stacked mismatches as nucleotides in the fifth column. Note that variants in this column are both lower- and uppercase, indicating they are supported by both reads aligning to both the forward and reverse strands.
Finally, note at this position most reads disagree with the reference base. Also, this line contains the start of a new read (indicated with the ^ mark), with mapping quality 25 (the character : has ASCII code 58, and 58 - 33 = 25; try ord( : )- 33 in Python).

Calling variants with samtools and its companion tool bcftools is a two-step process

In the first step, samtools mpileup
-v or -g: generate genotype likelihoods for every site in the genome (or all sites within a region if one is specified).
-v: return result to Variant Call Format(VCF)
-g: return result to BCF (the binary analog of VCF)
In the second step, bcftools call will filter these results so only variant sites remain, and call genotypes for all individuals at these sites. Let’s see how the first step works by calling samtools mpileup with -v in the region we investigated earlier with IGV:

$ samtools mpileup -v --no-BAQ --region 1:215906528-215906567 \
--fasta-ref human_g1k_v37.fasta NA12891_CEU_sample.bam \
> NA12891_CEU_sample.vcf.gz
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

-u: force samtools mpileup to output uncompressed

$ zgrep "^##" -v NA12891_CEU_sample.vcf.gz | \
awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1,$2,$4,$5,$6,a[1],$9,$10}'
#CHROM POS REF ALT QUAL INFO FORMAT NA12891
1 215906528 G <X> 0 DP=21 PL 0,63,236
1 215906529 A <X> 0 DP=22 PL 0,54,251
[...]
1 215906547 C G,<X> 0 DP=22 PL 123,0,103,144,127,233
1 215906548 G C,<X> 0 DP=22 PL 23,0,163,68,175,207
[...]
1 215906555 G A,<X> 0 DP=19 PL 184,7,0,190,42,204
[...]

zgrep -v to remove the header and awk to select some columns of interest.

Run this bcftools call step, and then do some exploring of where we lose variants and why:

$ bcftools call -v -m NA12891_CEU_sample.vcf.gz > NA12891_CEU_sample_calls.vcf.gz

-m

最低0.47元/天解锁文章

weixin_42953727

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Bioinformatics Data Skills by Oreilly学习笔记-11-2

接上一篇Chapter 11Visualizing Alignments with samtools tview and the Integrated Genomics ViewerSamtools tview requires position-sorted and indexed BAM files as input.
复制链接

扫一扫

专栏目录