Chapter 11 Working with Alignment Data
突然觉得这是一本比较基础的且要有耐心才能看下去的书,但作者介绍的比较繁琐,没有直入主题,基本的分析流程和背景并不太成体系。有基础的人甚至可以直接跳到11章,想快点看完进入下一本了。
The Sequence Alignment/ Mapping (SAM) format for mapping data (and its binary analog, BAM)
Getting to Know Alignment Formats: SAM and BAM
The SAM Header
$ head -n 10 celegans.sam
@SQ SN:I LN:15072434
@SQ SN:II LN:15279421
@SQ SN:III LN:13783801
@SQ SN:IV LN:17493829
@SQ SN:MtDNA LN:13794
@SQ SN:V LN:20924180
@SQ SN:X LN:17718942
@RG ID:VB00023_L001 SM:celegans-01
@PG ID:bwa PN:bwa VN:0.7.10-r789 [...]
I_2011868_2012306_0:0:0_0:0:0_2489 83 I 2012257 40 50M [...]
- @SQ header entries store information about the reference sequences (e.g., the chromosomes if you’ve aligned to a reference genome). The required key-values are SN, which stores the sequence name (e.g., the C. elegans chromosome I), and LN, which stores the sequence length (e.g., 15,072,434 bases). All separate sequences in your reference have a corresponding entry in the header.
- @RG header entries contain important read group and sample metadata. The read group identifier ID is required and must be unique. This ID value contains information about the origin of a set of reads. Consequently, it’s beneficial to create read groups related to the specific sequencing run (e.g., ID could be related to the name of the sequencing run and lane).
- @PG header entries contain metadata about the programs used to create and process a set of SAM/BAM files. Each program must have a unique ID value, and metadata such as program version number (via the VN key) and the exact command line (via the CL key) can be saved in these header entries. Many programs will add these lines automatically.
Most aligners allow you to specify this important metadata through your alignment command. For example, BWA allows (using made-up files in this example):
$ bwa mem -R'@RG\tID:readgroup_id\tSM:sample_id' ref.fa
in.fq
Bowtie2 similarly allows read group and sample information to be set with the –rg-id and --rgoptions.
• head won’t always provide the entire header.
• It won’t work with binary BAM files.
Samtools
Look at an entire SAM/BAM header is with samtools view option -H:
$ samtools view -H celegans.sam
@SQ SN:I LN:15072434
@SQ SN:II LN:15279421
@SQ SN:III LN:13783801
[...]
Also works with BAM files
$ samtools view -H celegans.bam | grep "^@RG"
@RG ID:VB00023_L001 SM:celegans-01
Samtools view without any arguments returns the entire alignment section without the header:
$ samtools view celegans.sam | head -n 1
I_2011868_2012306_0:0:0_0:0:0_2489 83 I 2012257 40 50M
The SAM Alignment Section
$ samtools view celegans.sam | tr '\t' '\n' | head -n 11
I_2011868_2012306_0:0:0_0:0:0_2489
83
I
2012257
40
50M
=
2011868
-439
CAAAAAATTTTGAAAAAAAAAATTGAATAAAAATTCACGGATTTCTGGCT
22222222222222222222222222222222222222222222222222
tr : convert tabs
- QNAME, the query name (e.g., a sequence read’s name).
- FLAG, the bitwise flag, which contains information about the alignment.