0. Preface
GitHub repository: https://github.com/lh3/seqtk
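seqtk is a toolkit for processing FASTQ/FASTA data. Randomly subsampling FASTQ reads (sample) is only one of its functions. Another is extracting subsequences (subseq) based on an input BED file or a list of read names (name list).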
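seqtk also covers some quality-control tasks, such as trimming low-quality bases (trimfq), cutting sequences at long runs of N (cutN), and QC summaries (fqchk), but it cannot remove adapters. Other functions include merging data (mergepe, mergefa), FASTQ-to-FASTA conversion (seq), and identifying high- or low-GC regions (gc).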
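This post only demonstrates the sample command, starting with how to install seqtk.
1. Installing seqtk
Installation commands for seqtk (from the v1.2 release tarball):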
$ wget https://github.com/lh3/seqtk/archive/v1.2.tar.gz
$ tar zxvf v1.2.tar.gz
$ cd seqtk-1.2
$ make
The GitHub page installs seqtk with git clone instead. Here git clone did not finish downloading after a long wait, so wget was used.
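2. seqtk sample example
The sample command in seqtk randomly picks a specified number of reads. For example, to randomly take 1000 reads from test_R1.fq (and from its mate file test_R2.fq) with the random seed set to 100: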
# `sample` Example:
$ ${seqtk_dir}/seqtk-1.2/seqtk sample -s100 test_R1.fq 1000 > test_sub_R1.fq
$ ${seqtk_dir}/seqtk-1.2/seqtk sample -s100 test_R2.fq 1000 > test_sub_R2.fq
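Because both commands use the same seed, the two output files should still contain matching pairs in the same order. One way to check this is sketched below. It assumes 4-line FASTQ records and read names that differ between mates only by a trailing /1 or /2 or by a comment after the first whitespace. The function names fastq_ids and pairs_in_sync are made up for this sketch and are not part of seqtk.

```shell
# fastq_ids FILE: print one read ID per record, without the leading "@",
# without any comment after the first whitespace, and without a trailing /1 or /2.
# Assumption: every record is exactly 4 lines long.
fastq_ids() {
    awk 'NR % 4 == 1 { id = $1; sub(/^@/, "", id); sub(/\/[12]$/, "", id); print id }' "$1"
}

# pairs_in_sync R1 R2: return 0 if both files list the same IDs in the same order.
pairs_in_sync() {
    ids1=$(fastq_ids "$1") || return 2
    ids2=$(fastq_ids "$2") || return 2
    [ "$ids1" = "$ids2" ]
}

# Possible usage with the files from the example above:
#   pairs_in_sync test_sub_R1.fq test_sub_R2.fq && echo "pairs OK"
```

The ID lists are held in shell variables, which is fine for a few thousand reads; for very large files it may be better to write them to temporary files and compare those.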
seqtk has several commands; sample is the one for taking a subset of reads. Its usage message:
$ ${seqtk_dir}/seqtk-1.2/seqtk sample
=============================================================================
Usage:   seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>
Options: -s INT       RNG seed [11]
         -2           2-pass mode: twice as slow but with much reduced memory
=============================================================================
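To see why the seed matters for paired files, the sketch below does seeded subsampling in awk using reservoir sampling, a standard way to pick k records from a stream. It is an illustrative stand-in, not seqtk's code, and seqtk's own implementation may differ. The function name reservoir_sample is made up here, and 4-line FASTQ records are assumed. The point is that with the same seed and the same number of records, the random draws are identical, so the same record positions are kept in both files.

```shell
# reservoir_sample SEED K FILE: print K randomly chosen 4-line FASTQ records.
# Illustration only; assumes every record is exactly 4 lines long.
reservoir_sample() {
    awk -v seed="$1" -v k="$2" '
        BEGIN { srand(seed) }                  # same seed, same sequence of draws
        { rec = rec $0 "\n" }                  # accumulate the current record
        NR % 4 == 0 {
            n++                                # n = records seen so far
            if (n <= k) {
                keep[n] = rec                  # fill the reservoir first
            } else {
                j = int(rand() * n) + 1        # uniform integer in 1..n
                if (j <= k) keep[j] = rec      # replace a slot with probability k/n
            }
            rec = ""
        }
        END { for (i = 1; i <= k && i <= n; i++) printf "%s", keep[i] }
    ' "$3"
}

# Possible usage (same seed for both mates):
#   reservoir_sample 100 1000 test_R1.fq > demo_sub_R1.fq
#   reservoir_sample 100 1000 test_R2.fq > demo_sub_R2.fq
```

The random number sequence depends on the awk implementation, so both files should be processed with the same awk. The records come out in reservoir order, not in input order, but that order is the same for both files.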
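The parameters in the `sample` Example above are:
sample: the seqtk subcommand being used; it randomly extracts reads.
-s100: sets the random seed to 100 (an integer). [ For paired-end data, use the same seed for both files so that the FASTQ IDs stay paired. ]
test_R1.fq / test_R2.fq: the input R1 / R2 FASTQ files. [ Gzipped (.gz) input is accepted, but the output is uncompressed. ]
1000: the number of reads to sample; here 1000 reads are drawn at random from the input FASTQ file.
test_sub_R1.fq / test_sub_R2.fq: the output FASTQ files with the sampled reads (uncompressed).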
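Since the output is uncompressed, it can be piped straight into gzip, and counting the records afterwards confirms that the requested number of reads was written. The following is a minimal sketch under a few assumptions: the SEQTK variable and the function names sample_gz and count_reads are made up here, gzip is available, and records are 4 lines long.

```shell
# Path to the seqtk binary (assumption: override SEQTK if seqtk is not on PATH).
SEQTK=${SEQTK:-seqtk}

# sample_gz SEED IN N OUT.gz: subsample IN (plain or .gz) and gzip the result.
sample_gz() {
    "$SEQTK" sample -s"$1" "$2" "$3" | gzip -c > "$4"
}

# count_reads FILE: number of 4-line FASTQ records in a plain or .gz file.
count_reads() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Possible usage:
#   sample_gz 100 test_R1.fq.gz 1000 test_sub_R1.fq.gz
#   count_reads test_sub_R1.fq.gz
```

If the input holds fewer reads than requested, the count will be lower than the number asked for, so comparing the two is a cheap sanity check.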
Other commands…
$ ${seqtk_dir}/seqtk-1.2/seqtk
==============================================================
Usage:   seqtk <command> <arguments>
Version: 1.2-r94
Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         fqchk     fastq QC (base/quality summary)
         mergepe   interleave two PE FASTA/Q files
         trimfq    trim FASTQ using the Phred algorithm
         hety      regional heterozygosity
         gc        identify high- or low-GC regions
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q
         rename    rename sequence names
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het
==============================================================