Bioinformatics Data Skills (O'Reilly) Study Notes 11-1

Chapter 11 Working with Alignment Data

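It strikes me that this is a fairly basic book that takes patience to get through. The author's explanations are long-winded and slow to reach the point, and the basic analysis workflow and background are not presented very systematically. Readers with some experience could jump straight to Chapter 11. I want to finish it quickly and move on to the next book.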
This chapter covers the Sequence Alignment/Mapping (SAM) format for mapping data and its binary analog, BAM.

Getting to Know Alignment Formats: SAM and BAM

The SAM Header
$ head -n 10 celegans.sam
@SQ SN:I LN:15072434
@SQ SN:II LN:15279421
@SQ SN:III LN:13783801
@SQ SN:IV LN:17493829
@SQ SN:MtDNA LN:13794
@SQ SN:V LN:20924180
@SQ SN:X LN:17718942
@RG ID:VB00023_L001 SM:celegans-01
@PG ID:bwa PN:bwa VN:0.7.10-r789 [...]
I_2011868_2012306_0:0:0_0:0:0_2489 83 I 2012257 40 50M [...]
  1. @SQ header entries store information about the reference sequences (e.g., the chromosomes if you’ve aligned to a reference genome). The required key-values are SN, which stores the sequence name (e.g., the C. elegans chromosome I), and LN, which stores the sequence length (e.g., 15,072,434 bases). All separate sequences in your reference have a corresponding entry in the header.
  2. @RG header entries contain important read group and sample metadata. The read group identifier ID is required and must be unique. This ID value contains information about the origin of a set of reads. Consequently, it’s beneficial to create read groups related to the specific sequencing run (e.g., ID could be related to the name of the sequencing run and lane).
  3. @PG header entries contain metadata about the programs used to create and process a set of SAM/BAM files. Each program must have a unique ID value, and metadata such as program version number (via the VN key) and the exact command line (via the CL key) can be saved in these header entries. Many programs will add these lines automatically.
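
As a rough illustration of the key-value layout described above, here is a minimal Python sketch that parses header lines into dictionaries. The function name parse_header_line is made up for this example, and the sketch assumes tab-separated KEY:VALUE fields (so it would not handle free-text @CO comment lines).

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, {key: value})."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0].lstrip("@")
    # Each remaining field is KEY:VALUE. Split on the first colon only,
    # because values (e.g., a CL command line) may contain colons.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags


# Two header lines copied from the celegans.sam output above.
header_lines = [
    "@SQ\tSN:I\tLN:15072434",
    "@RG\tID:VB00023_L001\tSM:celegans-01",
]

parsed = [parse_header_line(line) for line in header_lines]

# Reference name -> length, built from the @SQ entries.
ref_lengths = {tags["SN"]: int(tags["LN"]) for kind, tags in parsed if kind == "SQ"}
# Read group ID -> all of its tags, built from the @RG entries.
read_groups = {tags["ID"]: tags for kind, tags in parsed if kind == "RG"}

print(ref_lengths)
print(read_groups)
```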

Most aligners allow you to specify this important metadata through your alignment command. For example, BWA allows (using made-up files in this example):

$ bwa mem -R '@RG\tID:readgroup_id\tSM:sample_id' ref.fa in.fq
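
When running many samples from a script, the read group string can be assembled programmatically. The following is only a sketch: read_group_arg is a hypothetical helper, and ref.fa and in.fq are the same made-up files as in the command above. It keeps the backslash-t sequences literal, as in the command above, rather than inserting real tab characters.

```python
import shlex


def read_group_arg(rg_id, sample):
    """Build a read group string in the form used with bwa mem -R above."""
    # "\\t" is a literal backslash followed by "t", not a tab character.
    return f"@RG\\tID:{rg_id}\\tSM:{sample}"


rg = read_group_arg("readgroup_id", "sample_id")
command = ["bwa", "mem", "-R", rg, "ref.fa", "in.fq"]

# shlex.join quotes the read group so the shell does not interpret the backslashes.
print(shlex.join(command))
```

The list form of command could be passed to subprocess.run as-is; the shlex.join output is only for logging or pasting into a terminal.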

Bowtie2 similarly allows read group and sample information to be set with the --rg-id and --rg options.

Using head to peek at the header has two limitations:
• head won’t always provide the entire header.
• It won’t work with binary BAM files.

Samtools
The standard way to look at an entire SAM/BAM header is with the samtools view option -H:

$ samtools view -H celegans.sam
@SQ SN:I LN:15072434
@SQ SN:II LN:15279421
@SQ SN:III LN:13783801
[...]

This also works with BAM files:

$ samtools view -H celegans.bam | grep "^@RG"
@RG ID:VB00023_L001 SM:celegans-01

Samtools view without any arguments returns the entire alignment section without the header:

$ samtools view celegans.sam | head -n 1
I_2011868_2012306_0:0:0_0:0:0_2489 83 I 2012257 40 50M
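
For a plain-text SAM file, the split between header and alignment section can be mimicked in a few lines of Python, since header lines start with "@" and alignment lines do not. This is a sketch for text SAM only (split_sam is a name invented here); binary BAM files still need samtools or a BAM-aware library.

```python
def split_sam(lines):
    """Return (header_lines, alignment_lines) from the lines of a text SAM file."""
    header, alignments = [], []
    for line in lines:
        if line.startswith("@"):
            header.append(line)
        else:
            alignments.append(line)
    return header, alignments


# A tiny SAM excerpt; the alignment line is truncated to its first six
# fields, exactly as in the output shown above.
sam_text = (
    "@SQ\tSN:I\tLN:15072434\n"
    "@RG\tID:VB00023_L001\tSM:celegans-01\n"
    "I_2011868_2012306_0:0:0_0:0:0_2489\t83\tI\t2012257\t40\t50M\n"
)

header, alignments = split_sam(sam_text.splitlines())

# Rough equivalent of: samtools view -H celegans.sam | grep "^@RG"
read_groups = [line for line in header if line.startswith("@RG")]
print(read_groups)

# Rough equivalent of: samtools view celegans.sam | head -n 1
print(alignments[0])
```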

The SAM Alignment Section
$ samtools view celegans.sam | tr '\t' '\n' | head -n 11
I_2011868_2012306_0:0:0_0:0:0_2489
83
I
2012257
40
50M
=
2011868
-439
CAAAAAATTTTGAAAAAAAAAATTGAATAAAAATTCACGGATTTCTGGCT
22222222222222222222222222222222222222222222222222

Here tr converts each tab into a newline, so every field of the alignment line is printed on its own line.
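
The same split can be sketched in Python, attaching a name to each of the eleven values printed above. The field names follow the SAM specification's mandatory columns; parse_alignment and the "TAGS" key are names chosen for this example only.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment(line):
    """Turn one tab-separated alignment line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Anything after the 11th column is an optional tag; keep it unparsed.
    record["TAGS"] = values[11:]
    return record


# The first alignment from celegans.sam, rebuilt from the tr output above.
line = "\t".join([
    "I_2011868_2012306_0:0:0_0:0:0_2489", "83", "I", "2012257", "40", "50M",
    "=", "2011868", "-439",
    "CAAAAAATTTTGAAAAAAAAAATTGAATAAAAATTCACGGATTTCTGGCT",
    "22222222222222222222222222222222222222222222222222",
])

record = parse_alignment(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["TLEN"])
```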

  1. QNAME, the query name (e.g., a sequence read’s name).
  2. FLAG, the bitwise flag, which contains information about the alignment.
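
To make the bitwise flag concrete, here is a small sketch that decodes the FLAG value 83 from the alignment above. The bit values come from the SAM specification; the short labels and the function name decode_flag are my own shorthand, not official names.

```python
# Bit value -> short label (labels are informal shorthand for this example).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the labels of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


# 83 = 64 + 16 + 2 + 1
print(decode_flag(83))
# ['paired', 'proper_pair', 'reverse_strand', 'first_in_pair']
```

So the example read is the first read of a properly aligned pair and maps to the reverse strand.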