FASTX-Toolkit

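FASTX-Toolkit is a collection of command-line tools for processing short-read FASTA and FASTQ files, covering quality control, sequence conversion, trimming, renaming and more. Before high-throughput sequencing data are analysed, it helps with preprocessing steps such as adapter removal and filtering of low-quality reads, which can improve the subsequent alignment. When using it, pay attention to file format, the handling of N bases and the quality encoding.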

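Introduction to FASTX-Toolkit

Background

Raw FASTQ files coming off a high-throughput sequencer store each read as a four-line record: an identifier line starting with @, the sequence, a separator line starting with +, and a quality string with one character per base. Processing such data starts with quality control: removing adapters, filtering low-quality reads, trimming low-quality 3' and 5' ends, and discarding reads that contain too many N bases. Many QC tools exist for high-throughput sequencing data; this post introduces one of them, fastx_toolkit.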
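
As a quick illustration of the four-line record layout described above, the following shell sketch writes a tiny made-up FASTQ file and uses awk to print each read's identifier, length and quality string. The file toy.fastq, its two reads and the function name fastq_summary are invented for this example and are not part of FASTX-Toolkit.

```shell
#!/bin/sh
# Hypothetical toy data: two made-up reads, four lines per record.
workdir=$(mktemp -d)
cat > "$workdir/toy.fastq" <<'EOF'
@read1
ACGTN
+
IIII#
@read2
GGCCA
+
IIIII
EOF

# Print "<id> <length> <quality string>" for every record.
# Line 1 of each record is the identifier, line 2 the sequence,
# line 3 the '+' separator, line 4 the quality string.
fastq_summary() {
    awk 'NR % 4 == 1 { id = substr($0, 2) }
         NR % 4 == 2 { len = length($0) }
         NR % 4 == 0 { print id, len, $0 }' "$1"
}

fastq_summary "$workdir/toy.fastq"
```

Each output line pairs a read with its quality string, which makes it easy to check that the sequence and quality lines have the same length.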

FASTX-Toolkit

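FASTX-Toolkit is a collection of command-line tools for preprocessing short-read FASTA/FASTQ files. Next-generation sequencers usually produce FASTA or FASTQ files containing many short reads (possibly with quality information). The main processing of such files is mapping (also called aligning) the sequences to a reference genome or another database with a specialized program; examples of such mapping programs are Blat, SHRiMP, LastZ, MAQ and many others. However, it is sometimes more productive to preprocess the FASTA/FASTQ files before mapping, manipulating the sequences to produce better mapping results. The FASTX-Toolkit tools perform some of these preprocessing tasks.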

Available tools

  • FASTQ-to-FASTA converter
    Converts FASTQ files to FASTA files.
  • FASTQ Information
    Charts quality statistics and nucleotide distribution.
  • FASTQ/A Collapser
    Collapses identical sequences in a FASTQ/A file into a single sequence (while maintaining read counts).
  • FASTQ/A Trimmer
    Shortens reads in a FASTA or FASTQ file (removing barcodes or noise).
  • FASTQ/A Renamer
    Renames the sequence identifiers in a FASTQ/A file.
  • FASTQ/A Clipper
    Removes sequencing adapters / linkers.
  • FASTQ/A Reverse-Complement
    Produces the reverse-complement of each sequence in a FASTQ/FASTA file.
  • FASTQ/A Barcode splitter
    Splits a FASTQ/FASTA file containing multiple samples.
  • FASTA Formatter
    Changes the width of sequence lines in a FASTA file.
  • FASTA Nucleotide Changer
    Converts FASTA sequences from/to RNA/DNA.
  • FASTQ Quality Filter
    Filters sequences based on quality.
  • FASTQ Quality Trimmer
    Trims (cuts) sequences based on quality.
  • FASTQ Masker
    Masks nucleotides with 'N' (or another character) based on quality.

Download

Download: see the fastx_toolkit download page, or fetch and unpack the Linux binaries directly:

wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
tar xjvf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2

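Usage

Notes

fastx_toolkit consists of a series of commands, each providing one small, practical function. Keep the following points in mind when using it:

  • Compressed input files are not supported
  • Sequences containing N bases are discarded by default (fastq_to_fasta keeps them when -n is given)
  • The plotting commands depend on gnuplot and the Perl GD module
  • By default, the quality encoding of FASTQ files is assumed to be phred64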
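
Because the tools read from STDIN when no -i is given, one way around the lack of compressed-input support is to decompress on the fly and pipe the result in. The sketch below assumes gunzip is available; fastq_stream is a hypothetical helper name, and reads.fastq.gz is a placeholder file name.

```shell
#!/bin/sh
# fastq_stream is a hypothetical helper, not part of FASTX-Toolkit:
# it writes a FASTA/FASTQ file to STDOUT, decompressing it first
# when the file name ends in .gz.
fastq_stream() {
    case "$1" in
        *.gz) gunzip -c "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Intended use (assumes fastq_to_fasta is on PATH and reads.fastq.gz exists):
#   fastq_stream reads.fastq.gz | fastq_to_fasta -n -o reads.fasta
```

Streaming avoids writing a decompressed copy of the reads to disk; the same pattern should apply to the other commands that accept STDIN.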

If make fails while building the software with the error "fgets called with bigger size than length of destination buffer", installing a newer version solves the problem.

If fastx_quality_stats fails with "fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4", add "-Q 33" to the parameters.
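
The numbers in that error message follow from the quality offset: the character '#' has ASCII code 35, so under the default offset of 64 it decodes to 35 - 64 = -29, while under offset 33 it decodes to 2. A minimal shell sketch of this arithmetic (qual_value is a hypothetical helper name, not a FASTX-Toolkit command):

```shell
#!/bin/sh
# qual_value is a hypothetical helper: it prints the quality score that
# a single quality character encodes under a given ASCII offset.
qual_value() {
    # printf '%d' "'X" prints the numeric (ASCII) code of the character X.
    code=$(printf '%d' "'$1")
    echo $((code - $2))
}

qual_value '#' 64   # offset 64: 35 - 64 = -29, the value in the error message
qual_value '#' 33   # offset 33: 35 - 33 = 2, a valid score
```

A negative value under offset 64 is therefore a strong hint that the file is phred33-encoded.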

Parameters and usage

FASTQ-to-FASTA

usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-r]         = Rename sequence identifiers to numbers.
   [-n]         = keep sequences with unknown (N) nucleotides.
          Default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA output file. default is STDOUT.
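
To make the effect of the default N handling and of -n concrete, here is a rough awk approximation of the conversion. It is a sketch of the behaviour described in the help text above, not the tool's actual implementation; the function name fastq_to_fasta_sketch, its "keep" argument and the toy reads are all invented for illustration.

```shell
#!/bin/sh
# Hypothetical toy data: read1 contains an N, read2 does not.
toy=$(mktemp)
cat > "$toy" <<'EOF'
@read1
ACGTN
+
IIII#
@read2
GGCCA
+
IIIII
EOF

# $1 = input FASTQ; $2 = "keep" to retain reads containing N (like -n),
# anything else to discard them (like the default).
fastq_to_fasta_sketch() {
    awk -v keep="$2" '
        NR % 4 == 1 { id = substr($0, 2) }   # strip the leading "@"
        NR % 4 == 2 { if (keep == "keep" || $0 !~ /N/) print ">" id "\n" $0 }
    ' "$1"
}

fastq_to_fasta_sketch "$toy" drop   # only read2 is written
fastq_to_fasta_sketch "$toy" keep   # both reads are written
```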

FASTX Statistics

usage: fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]

version 0.0.6 (C) 2008 by Assaf Gordon (gordon@cshl.edu)
   [-h] = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
                  If FASTA file is given, only nucleotides
          distribution is calculated (there's no quality info).
   [-o OUTFILE] = TEXT output file. default is STDOUT.

The output TEXT file will have the following fields (one row per column):
    column    = column number (1 to 36 for a 36-cycles read solexa file)
    count   = number of bases found in this column.
    min     = Lowest quality score value found in this column.
    max     = Highest quality score value found in this column.
    sum     = Sum of quality score values for this column.
    mean    = Mean quality score value for this column.
    Q1    = 1st quartile quality score.
    med    = Median quality score.
    Q3    = 3rd quartile quality score.
    IQR    = Inter-Quartile range (Q3-Q1).
    lW    = 'Left-Whisker' value (for boxplotting).
    rW    = 'Right-Whisker' value (for boxplotting).
    A_Count    = Count of 'A' nucleotides found in this column.
    C_Count    = Count of 'C' nucleotides found in this column.
    G_Count    = Count of 'G' nucleotides found in this column.
    T_Count    = Count of 'T' nucleotides found in this column.
    N_Count = Count of 'N' nucleotides found in this column.
    max-count = max. number of bases (in all cycles)
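
The per-column table lends itself to simple post-processing. The sketch below lists the columns (cycles) whose mean quality falls under a threshold. It assumes the stats file is tab-separated and begins with a header row using the field names listed above; low_quality_columns is a hypothetical helper, and the three-row table is made-up illustrative data containing only a subset of the fields.

```shell
#!/bin/sh
# low_quality_columns is a hypothetical helper: it prints the column
# (cycle) numbers whose mean quality is below a threshold.
# Assumption: the stats file is tab-separated and starts with a header
# row that uses the field names listed above ("column", "mean", ...).
low_quality_columns() {
    awk -F '\t' -v min="$2" '
        NR == 1 {
            # Locate the needed fields by name instead of fixed position.
            for (i = 1; i <= NF; i++) {
                if ($i == "column") col = i
                if ($i == "mean")   mean = i
            }
            next
        }
        ($mean + 0) < (min + 0) { print $col }
    ' "$1"
}

# Made-up three-cycle table for illustration only.
stats=$(mktemp)
printf 'column\tcount\tmin\tmax\tsum\tmean\n'  > "$stats"
printf '1\t100\t30\t40\t3800\t38.00\n'        >> "$stats"
printf '2\t100\t20\t40\t3200\t32.00\n'        >> "$stats"
printf '3\t100\t2\t40\t1500\t15.00\n'         >> "$stats"

low_quality_columns "$stats" 20
```

Looking fields up by header name keeps the sketch working even if the column order differs from what is assumed here.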

FASTQ Quality Chart

Usage: /usr/local/bin/fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

  [-p]           - Generate PostScript (.PS) file. Default is PNG image.
  [-i INPUT.TXT] - Input file. Should be the output of "solexa_quality_statistics" program.
  [-o OUTPUT]    - Output file name. default is STDOUT.
  [-t TITLE]     - Title (usually the solexa file name) - will be plotted on the graph.

FASTA/Q Nucleotide Distribution

Usage: /usr/local/bin/fastx_nucleotide_distribution_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

  [-p]           - Generate PostScript (.PS) file. Default is PNG image.
  [-i INPUT.TXT] - Input file. Should be the output of "fastx_quality_statistics" program.
  [-o OUTPUT]    - Output file name. default is STDOUT.
  [-t TITLE]     - Title - will be plotted on the graph.

FASTA/Q Clipper

usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
   [-l N]       = discard sequences shorter than N nucleotides. default is 5.
   [-d N]       = Keep the adapter and N bases after it.
          (using '-d 0' is the same as not using '-d' at all. which is the default).
   [-c]         = Discard non-clipped sequences (i.e. - keep