Chapter 10 Working with Sequence Data
Nucleotide (and protein) sequences are stored in two plain-text formats widespread in bioinformatics: FASTA and FASTQ—pronounced fast-ah (or fast-A) and fast-Q, respectively. We’ll discuss each format and their limitations in this section, and then see some tools for working with data in these formats.
The FASTA Format
$ head -10 egfr_flank.fasta
>ENSMUSG00000020122|ENSMUST00000138518
CCCTCCTATCATGCTGTCAGTGTATCTCTAAATAGCACTCTCAACCCCCGTGAACTTGGT
TATTAAAAACATGCCCAAAGTCTGGGAGCCAGGGCTGCAGGGAAATACCACAGCCTCAGT
TCATCAAAACAGTTCATTGCCCAAAATGTTCTCAGCTGCAGCTTTCATGAGGTAACTCCA
GGGCCCACCTGTTCTCTGGT
>ENSMUSG00000020122|ENSMUST00000125984
GAGTCAGGTTGAAGCTGCCCTGAACACTACAGAGAAGAGAGGCCTTGGTGTCCTGTTGTC
TCCAGAACCCCAATATGTCTTGTGAAGGGCACACAACCCCTCAAAGGGGTGTCACTTCTT
CTGATCACTTTTGTTACTGTTTACTAACTGATCCTATGAATCACTGTGTCTTCTCAGAGG
CCGTGAACCACGTCTGCAAT
A common
Naming convention is to split the description line into two parts at the first space.
The FASTQ Format
The FASTQ format extends FASTA by including a numeric quality score to each base in the sequence.
@DJB775P1:248:D0MDGACXX:7:1202:12362:49613
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA
+
JJJJJIIJJJJJJHIHHHGHFFFFFFCEEEEEDBD?DDDDDDBDDDABDDCA
@DJB775P1:248:D0MDGACXX:7:1202:12782:49716
CTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTAGATCGG
+
IIIIIIIIIIIIIIIHHHHHHFFFFFFEECCCCBCECCCCCCCCCCCCCCCC
A common pitfall is to treat every line that begins with @ as a description line. However, @ is also a valid quality character.
The Ins and Outs of Counting FASTA/FASTQ Entries
$ grep -c "^>" egfr_flank.fasta
5
$ grep -c "^@" untreated1_chr4.fq
208779
$ wc -l untreated1_chr4.fq
817420 untreated1_chr4.fq
817,420/4 = 204,355
$ bioawk -cfastx 'END{print NR}' untreated1_chr4.fq
204355
Nucleotide Codes
- Lowercase bases are often used to indicate soft masked repeats or low complexity sequences
- Repeats and low-complexity sequences may also be hard masked, where nucleotides are replaced with N (or sometimes an X).
Base Qualities
Qualities are restricted to the printable ASCII characters, ranging from 33 to 126
ASCII decimal to character in Python
>>> qual = "JJJJJJJJJJJJGJJJJJIIJJJJJIGJJJJJIJJJJJJJIJIJJJJHHHHHFFFDFCCC"
>>> [ord(b) for b in qual]
[74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 71, 74, 74, 74, 74, 74, 73,
73, 74, 74, 74, 74, 74, 73, 71, 74, 74, 74