only use good-quality or uniquely-mapped reads your bam files

only use good-quality or uniquely-mapped reads your bam files
java -jar picard.jar MarkDuplicates
I=input.bam
O=rmdup.bam
M=rmdup_metrics.txt
CREATE_INDEX=true
ASSUME_SORTED=false
REMOVE_DUPLICATES=true"
samtools view -hb -q 1 rmdup.bam > rmdupq1.bam 【因为picard去重了,所以不用加F 1024的参数再去重复序列】
bam files must contain a header section and an alignment section
samtools view -H rmdupq1.bam

Mapped reads were then marked duplicated with Picard (version
1.65). Only uniquely mapped reads (Samtools version 1.3.1, parameter ‘–q 1 –F 1024’) were used for analysis. We selected five highquality datasets, including 67 samples of six histone marks
-h print header for the SAM output
默认下输出的 sam 格式文件不带 header,该参数设定输出sam文件时带 header 信息
-H print header only (no alignments)
仅仅输出文件的头

flags

1 0x1 这序列是PE双端测序
2 0x2 这序列和参考序列完全匹配,没有错配和缺失
4 0x4 这序列没有mapping到参考序列上
8 0x8 这序列的mate序列没有mapping到参考序列上
16 0x10 这序列比对到参考序列的负链上
32 0x20 这序列的mate序列比对到参考序列的负链上
64 0x40 这序列是read1
128 0x80 这序列是read2
256 0x100 这序列不是主要的比对,因为序列可能比对到参考序列的多个位置上
512 0x200 这序列没有通过QC
1024 0x400 这序列是PCR重复序列
2048 0x800 这序列是补充比对
-F 过滤 ## -F 4 过滤掉没有mapping上的reads,也就是说提取出mapping上的reads
-h include header in SAM output
提取比对到参考序列的结果:

samtools view -bF 4 tmp.bam > tmp_F.bam
提取双端序列都比对到参考序列(4+8)的结果:

samtools view -bF 12 tmp.bam > tmp_F.bam
提取比对到chr1的结果

samtools view -b tmp.bam chr1 > tmp_chr1.bam
注:With no options or regions specified, prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header),也就是说,没有设定输出格式的话,默认是输出SAM格式,并且是没有header的SAM

samtools view

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region …]

Options:
-b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-u uncompressed BAM output (implies -b)
-h include header in SAM output
-H print SAM header only (no alignments)
-c print only the count of matching records
-o FILE output file name [stdout]
-U FILE output reads not selected by filters to FILE [null]
-t FILE FILE listing reference names and lengths (see long help) [null]
-L FILE only include reads overlapping this BED FILE [null]
-r STR only include reads in read group STR [null]
-R FILE only include reads with read group listed in FILE [null]
-q INT only include reads with mapping quality >= INT [0]
-l STR only include reads in library STR [null]
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
-f INT only include reads with all of the FLAGs in INT present [0]
-F INT only include reads with none of the FLAGS in INT present [0]
-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]
-s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
fraction of templates/read pairs to keep; INT part sets seed)
-M use the multi-region iterator (increases the speed, removes
duplicates and outputs the reads as they are ordered in the file)
-x STR read tag to strip (repeatable) [null]
-B collapse the backward CIGAR operation
-? print long help, including note about region specification
-S ignored (input format is auto-detected)
–input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]…
Specify output format (SAM, BAM, CRAM)
–output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
-T, --reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]

以上是官方,

Usage: samtools view [options] <in.bam>|<in.sam> [region1 […]]
默认情况下不加 region,则是输出所有的 region.

Options: -b output BAM
默认下输出是 SAM 格式文件,该参数设置输出 BAM 格式
-h print header for the SAM output
默认下输出的 sam 格式文件不带 header,该参数设定输出sam文件时带 header 信息
-H print header only (no alignments)
-S input is SAM
默认下输入是 BAM 文件,若是输入是 SAM 文件,则最好加该参数,否则有时候会报错。
-u uncompressed BAM output (force -b)
该参数的使用需要有-b参数,能节约时间,但是需要更多磁盘空间。
-c Instead of printing the alignments, only count them and print the
total number. All filter options, such as ‘-f’, ‘-F’ and ‘-q’ ,
are taken into account.
-1 fast compression (force -b)
-x output FLAG in HEX (samtools-C specific)
-X output FLAG in string (samtools-C specific)
-c print only the count of matching records
-L FILE output alignments overlapping the input BED FILE [null]
-t FILE list of reference names and lengths (force -S) [null]
使用一个list文件来作为header的输入
-T FILE reference sequence file (force -S) [null]
使用序列fasta文件作为header的输入
-o FILE output file name [stdout]
-R FILE list of read groups to be outputted [null]
-f INT required flag, 0 for unset [0]
-F INT filtering flag, 0 for unset [0]
Skip alignments with bits present in INT [0]
数字4代表该序列没有比对到参考序列上
数字8代表该序列的mate序列没有比对到参考序列上
-q INT minimum mapping quality [0]
-l STR only output reads in library STR [null]
-r STR only output reads in read group STR [null]
-s FLOAT fraction of templates to subsample; integer part as seed [-1]
-? longer help

unique mapped reads
就是指唯一比对的reads

现在人们已经开始避免使用unique mapped reads这个概念了,而转向使用mapq值来保留高质量的比对结果。因为mapq值反应了一组比对结果发生的可能性,MapQ = -10 log10§, 比如结果为10,那就是1/10的概率会出现这个比对结果,如果我们认为0.05%是一个小概率的话,那个mapq值为15就可以用于筛选了, 如果认为0.01%是个小概率的话,mapq值为20就可以用于筛选了。但是人们往往从30这个值开始试起(1/1000的概率),如果它的筛选结果符合你的测序要求,就可以使用它了。如果不行,可以适当的调整这个筛选值。

samtools view -bhS -q 30 input.sam > output.bam

#! /bin/bash
#sam转换为bam
samtools view -b -S -t ref.fai lane.sam > lane.bam
samtools view -bS lane.sam -o lane.bam
#提取比对到参考基因组上的数据
samtools view -b -F 4 lane.sam > lane.bam
#提取没有比对到参考基因组上的数据
samtools view -b -f 4 lane.sam > lane.bam
#提取paired reads中比对到参考基因组上的数据
samtools view -b -F 4 -f 8 lane.sam > only.read1.mapped.bam
samtools view -b -F 8 -f 4 lane.sam > only.read2.mapped.bam
samtools view -b -F 12 lane.sam > all.mapped.read1.read2.bam

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值