FASTQ File Format Explained [Repost]

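FASTQ is a text-based standard format for storing a biological sequence (usually a nucleotide sequence) together with its sequencing quality scores. Each base and each quality score is encoded as a single ASCII character. The format was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence with its quality data, and it has since become the de facto standard for high-throughput sequencing output.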

Format description

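Each sequence in a FASTQ file normally occupies four lines:

  1. A line beginning with '@', followed by the sequence identifier and an optional description;
  2. The raw sequence;
  3. A line beginning with '+', optionally followed by the same sequence identifier and description;
  4. The quality string, which matches line 2 position by position: every base has one quality character, and the numeric score a character stands for depends on the encoding scheme in use.
For example: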
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
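
The four-line layout can be read with a few lines of Python. This is a minimal sketch that assumes every record occupies exactly four lines (no wrapped sequences, no blank lines); the names `FastqRecord` and `read_fastq` are made up for illustration and do not come from any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str
    quality: str     # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a strictly four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(example))
print(records[0].identifier, len(records[0].sequence))
```

The length check reflects rule 4 above: the quality string must have exactly one character per base.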

 ILLUMINA SEQUENCE IDENTIFIERS

@HWUSI-EAS100R:6:73:941:1973#0/1
HWUSI-EAS100R the unique instrument name
6 flowcell lane
73 tile number within the flowcell lane
941 ‘x’-coordinate of the cluster within the tile
1973 ‘y’-coordinate of the cluster within the tile
#0 index number for a multiplexed sample (0 for no indexing)
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
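
As a rough illustration, the fields listed above can be pulled out of the identifier with a regular expression. The sketch assumes the identifier has exactly the shape of the example (instrument, lane, tile, x, y, `#index`, `/pair`); `IlluminaId` and `parse_illumina_id` are hypothetical names.

```python
import re
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str  # kept as text, since it is not always a plain number
    pair: int


# instrument:lane:tile:x:y#index/pair, with an optional leading '@'
_OLD_ID = re.compile(r"^@?([^:]+):(\d+):(\d+):(\d+):(\d+)#([^/]+)/([12])$")


def parse_illumina_id(line: str) -> IlluminaId:
    match = _OLD_ID.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised identifier: {line!r}")
    instrument, lane, tile, x, y, index, pair = match.groups()
    return IlluminaId(instrument, int(lane), int(tile), int(x), int(y), index, int(pair))


parsed = parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(parsed)
```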

 

Casava 1.8 and later use a different identifier layout:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139 the unique instrument name
136 the run id
FC706VJ the flowcell id
2 flowcell lane
2104 tile number within the flowcell lane
15343 ‘x’-coordinate of the cluster within the tile
197393 ‘y’-coordinate of the cluster within the tile
1 the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y Y if the read fails filter (read is bad), N otherwise
18 0 when none of the control bits are on, otherwise it is an even number
ATCACG index sequence
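
This newer layout splits cleanly on the single space and then on colons. The following sketch assumes exactly seven colon-separated fields before the space and four after it, as in the example; `Casava18Id` and `parse_casava18_id` are illustrative names only.

```python
from typing import NamedTuple


class Casava18Id(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair: int
    fails_filter: bool  # True when the flag is 'Y'
    control_bits: int
    index_sequence: str


def parse_casava18_id(line: str) -> Casava18Id:
    text = line.strip().lstrip("@")
    location, _, read_info = text.partition(" ")
    loc = location.split(":")
    info = read_info.split(":")
    if len(loc) != 7 or len(info) != 4:
        raise ValueError(f"unrecognised identifier: {line!r}")
    return Casava18Id(
        loc[0], int(loc[1]), loc[2], int(loc[3]), int(loc[4]),
        int(loc[5]), int(loc[6]),
        int(info[0]), info[1] == "Y", int(info[2]), info[3],
    )


parsed18 = parse_casava18_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(parsed18)
```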

 NCBI SEQUENCE READ ARCHIVE

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Quality encoding formats

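A quality score is a logarithmic transform of the probability that a base call is wrong. It was first defined and used in the Phred base-calling program and has since been adopted by many other tools. The table below shows how quality scores map to error probabilities: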

Phred quality scores are logarithmically linked to error probabilities

Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %

Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error probabilities P:

$$Q = -10 \log_{10} P$$
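
The standard Phred relationship, Q = -10 log10(P), reproduces the table directly. A small sketch, with function names chosen here for illustration:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q={q}: P={p:g}, accuracy={100 * (1 - p):.3f} %")
```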

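Besides the Phred scoring scheme there is also the Solexa scheme, which is defined on the odds of an error rather than on its probability:

$$Q_{\text{Solexa}} = -10 \log_{10} \frac{P}{1 - P}$$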

Comparison of the two scoring schemes:

[Figure: relationship between Q and P using the Sanger (red) and Solexa (black) equations. The vertical dotted line indicates P = 0.05, or equivalently Q ≈ 13.]
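
Since the figure itself is not reproduced here, the comparison can be checked numerically. The sketch below uses the usual definitions, Phred Q = -10 log10(P) and Solexa Q = -10 log10(P / (1 - P)), from which the two conversion functions follow algebraically; the function names are illustrative.

```python
import math


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the Phred score for the same error probability."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred: float) -> float:
    """Inverse conversion; only defined for q_phred > 0."""
    return 10 * math.log10(10 ** (q_phred / 10) - 1)


for q in (-5, 0, 10, 20, 40):
    print(f"Solexa {q:>3} -> Phred {solexa_to_phred(q):.3f}")
```

The two scales differ noticeably at low quality and become practically identical at high scores, which is why the distinction matters mostly for poor-quality bases.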

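Different software encodes per-base quality in different ways. There are currently five schemes:

  • Sanger: Phred quality score from 0 to 93, using ASCII 33 to 126. Raw read data rarely exceeds a score of 60, but assemblies and read mappings may use higher scores.
  • Solexa/Illumina 1.0: Solexa/Illumina quality score from -5 to 62, using ASCII 59 to 126. Raw read scores generally fall between -5 and 40.
  • Illumina 1.3+: Phred quality score from 0 to 62, using ASCII 64 to 126. Raw read scores fall between 0 and 40.
  • Illumina 1.5+: Phred quality score with the same offset, but scores 0 to 2 carry special meanings; see http://solexaqa.sourceforge.net/questions.htm#illumina
  • Illumina 1.8+: Phred quality score with the Sanger offset (Phred+33); raw read scores typically fall between 0 and 41.

A more visual representation: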

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
    (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
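
Decoding a quality string is then a matter of subtracting the right offset from each character code. A minimal sketch, where the `OFFSETS` keys are labels invented for this example rather than names used by any tool:

```python
from typing import List

# ASCII offset for each variant in the legend above.
OFFSETS = {
    "sanger": 33,
    "solexa": 64,
    "illumina-1.3": 64,
    "illumina-1.5": 64,
    "illumina-1.8": 33,
}


def decode_quality(quality: str, variant: str) -> List[int]:
    """Return the numeric score for every character of a quality string."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in quality]


print(decode_quality("!5I", "sanger"))        # characters 33, 53, 73
print(decode_quality("@Th", "illumina-1.3"))  # characters 64, 84, 104
```

Note that for the Solexa variant the decoded numbers are Solexa scores, not Phred scores, so they can be negative.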

 File extensions

There is no fixed rule; .fq, .fastq, and .txt are all common.

Format conversion:

  • Biopython version 1.51 onwards (interconverts Sanger, Solexa and Illumina 1.3+)
  • EMBOSS version 6.1.0 patch 1 onwards (interconverts Sanger, Solexa and Illumina 1.3+)
  • BioPerl version 1.6.1 onwards (interconverts Sanger, Solexa and Illumina 1.3+)
  • BioRuby version 1.4.0 onwards (interconverts Sanger, Solexa and Illumina 1.3+)
  • BioJava version 1.7.1 to 1.8.x (interconverts Sanger, Solexa and Illumina 1.3+)
  • MAQ can convert from Solexa to Sanger (a separate patch adds support for Illumina 1.3+ files).
  • fastx_toolkit: the included fastq_quality_converter program can convert Illumina to Sanger
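
For the two Phred-based variants, the conversion these tools perform amounts to shifting the ASCII offset. The sketch below shows only that simplest case, Illumina 1.3+ (Phred+64) to Sanger (Phred+33), using the standard library; it does not reproduce any of the listed tools, and converting from Solexa additionally requires remapping the score itself. `illumina13_to_sanger` is a hypothetical helper name.

```python
def illumina13_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the Illumina 1.3+ offset
        if not 0 <= score <= 62:
            raise ValueError(f"character {ch!r} is outside the Phred+64 range")
        converted.append(chr(score + 33))  # encode with the Sanger offset
    return "".join(converted)


print(illumina13_to_sanger("@Th"))
```

The range check guards against feeding in a file that is already Sanger-encoded, where most characters fall below ASCII 64.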

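Because of the different quality encodings, FASTQ comes in several variants. Given a FASTQ file, how do you tell which encoding it uses (detection of FASTQ variants)? Many downstream analyses depend on knowing the encoding, so is there a simple, direct way to determine it? There are five quality encodings, listed here with the ASCII ranges typically seen in raw reads:

Sanger (Phred+33, 33 to 73)
Solexa (Solexa+64, 59 to 104)
Illumina (1.3+) (Phred+64, 64 to 104)
Illumina (1.5+) (Phred+64, 66 to 104)
Illumina (1.8+) (Phred+33, 33 to 74)

These are the character ranges shown in the diagram above. Scanning the range of quality characters in a FASTQ file is therefore enough to infer its encoding. QCToolkit detects the encoding automatically in this way; the Perl code is as follows: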

use List::Util qw(min max);

# Fragment from inside a subroutine. prtErrorExit() and $seqFormat (the
# user-specified variant, 1 to 5) are defined elsewhere in the toolkit.
my $file = $_[0];                       # path to the FASTQ file
my $isVariantIdntfcntOn = $_[1];        # true: auto-detect the variant
my $lines = 0;                          # total lines read so far
open(my $fh, '<', $file) or die "Can not open file $file\n";
my $counter = 0;                        # position within the 4-line record
my $minVal = 1000;                      # smallest ASCII code seen in quality lines
my $maxVal = 0;                         # largest ASCII code seen in quality lines
while(my $line = <$fh>) {
        $lines++;
        $counter++;
        next if($line =~ /^\n$/);
        if($counter == 1 && $line !~ /^\@/) {
                prtErrorExit("Invalid FASTQ file format.\n\t\tFile: $file");
        }
        if($counter == 3 && $line !~ /^\+/) {
                prtErrorExit("Invalid FASTQ file format.\n\t\tFile: $file");
        }
        # Only the first million lines are sampled for the quality range.
        if($counter == 4 && $lines < 1000000) {
                chomp $line;
                my @ASCII = unpack("C*", $line);
                $minVal = min(min(@ASCII), $minVal);
                $maxVal = max(max(@ASCII), $maxVal);
        }
        if($counter == 4) {
                $counter = 0;
        }
}
close($fh);
# Map the observed range to a variant; the order of the tests matters
# because the ranges overlap.
my $tseqFormat = 0;
if($minVal >= 33 && $minVal <= 73 && $maxVal >= 33 && $maxVal <= 73) {
        $tseqFormat = 1;                        # Sanger
}
elsif($minVal >= 66 && $minVal <= 105 && $maxVal >= 66 && $maxVal <= 105) {
        $tseqFormat = 4;                        # Illumina 1.5+
}
elsif($minVal >= 64 && $minVal <= 105 && $maxVal >= 64 && $maxVal <= 105) {
        $tseqFormat = 3;                        # Illumina 1.3+
}
elsif($minVal >= 59 && $minVal <= 105 && $maxVal >= 59 && $maxVal <= 105) {
        $tseqFormat = 2;                        # Solexa
}
elsif($minVal >= 33 && $minVal <= 74 && $maxVal >= 33 && $maxVal <= 74) {
        $tseqFormat = 5;                        # Illumina 1.8+
}
if($isVariantIdntfcntOn) {
        $seqFormat = $tseqFormat;
}
else {
        if($tseqFormat != $seqFormat) {
                print STDERR "Warning: It seems the specified variant of FASTQ doesn't match the quality values in input FASTQ files.\n";
        }
}
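
For readers who prefer Python, the same range-based heuristic can be sketched as follows. This is not the toolkit's code: `guess_fastq_variant` is a hypothetical function that assumes strictly four-line records with no blank lines, applies the same range tests in the same order, and returns `None` when no quality line was seen.

```python
import io
from typing import Iterable, Optional


def guess_fastq_variant(lines: Iterable[str], max_lines: int = 1_000_000) -> Optional[str]:
    """Guess the quality encoding from the range of quality characters."""
    min_val: Optional[int] = None
    max_val: Optional[int] = None
    for line_number, line in enumerate(lines, start=1):
        if line_number >= max_lines:
            break  # sample only the start of large files
        if line_number % 4 == 0:  # every fourth line is a quality string
            codes = [ord(ch) for ch in line.rstrip("\n")]
            if codes:
                min_val = min(codes) if min_val is None else min(min_val, min(codes))
                max_val = max(codes) if max_val is None else max(max_val, max(codes))
    if min_val is None or max_val is None:
        return None
    # The ranges overlap, so the order of these tests decides ambiguous cases.
    if min_val >= 33 and max_val <= 73:
        return "Sanger"
    if min_val >= 66 and max_val <= 105:
        return "Illumina 1.5+"
    if min_val >= 64 and max_val <= 105:
        return "Illumina 1.3+"
    if min_val >= 59 and max_val <= 105:
        return "Solexa"
    if min_val >= 33 and max_val <= 74:
        return "Illumina 1.8+"
    return None


sample = io.StringIO("@r1\nACG\n+\n@Th\n")
print(guess_fastq_variant(sample))
```

Because the decision rests only on the observed minimum and maximum, the result is a guess: an Illumina 1.3+ file whose lowest observed character happens to be 66 or above is reported as Illumina 1.5+, and a Sanger file is indistinguishable from Illumina 1.8+ unless character 74 appears.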



