Bioinformatics Data Skills by Oreilly学习笔记-10

本文介绍了生物信息学中广泛使用的两种序列格式:FASTA和FASTQ。FASTA包含序列描述和核苷酸序列,而FASTQ则增加了每个碱基的质量分数。文章讨论了这些格式的特点,包括如何计数序列条目、核苷酸编码、质量评分,并提到了处理低质量碱基的软件如FastQC和sickle。还探讨了Python中的Biopython和readfq模块来解析FASTA/FASTQ文件并计算核苷酸计数。
摘要由CSDN通过智能技术生成

Chapter 10 Working with Sequence Data

Nucleotide (and protein) sequences are stored in two plain-text formats widespread in bioinformatics: FASTA and FASTQ—pronounced fast-ah (or fast-A) and fast-Q, respectively. We’ll discuss each format and their limitations in this section, and then see some tools for working with data in these formats.

The FASTA Format
$ head -10 egfr_flank.fasta
>ENSMUSG00000020122|ENSMUST00000138518
CCCTCCTATCATGCTGTCAGTGTATCTCTAAATAGCACTCTCAACCCCCGTGAACTTGGT
TATTAAAAACATGCCCAAAGTCTGGGAGCCAGGGCTGCAGGGAAATACCACAGCCTCAGT
TCATCAAAACAGTTCATTGCCCAAAATGTTCTCAGCTGCAGCTTTCATGAGGTAACTCCA
GGGCCCACCTGTTCTCTGGT
>ENSMUSG00000020122|ENSMUST00000125984
GAGTCAGGTTGAAGCTGCCCTGAACACTACAGAGAAGAGAGGCCTTGGTGTCCTGTTGTC
TCCAGAACCCCAATATGTCTTGTGAAGGGCACACAACCCCTCAAAGGGGTGTCACTTCTT
CTGATCACTTTTGTTACTGTTTACTAACTGATCCTATGAATCACTGTGTCTTCTCAGAGG
CCGTGAACCACGTCTGCAAT
A common 

Naming convention is to split the description line into two parts at the first space.

The FASTQ Format

The FASTQ format extends FASTA by including a numeric quality score to each base in the sequence.

@DJB775P1:248:D0MDGACXX:7:1202:12362:49613
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA
+
JJJJJIIJJJJJJHIHHHGHFFFFFFCEEEEEDBD?DDDDDDBDDDABDDCA
@DJB775P1:248:D0MDGACXX:7:1202:12782:49716
CTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTAGATCGG
+
IIIIIIIIIIIIIIIHHHHHHFFFFFFEECCCCBCECCCCCCCCCCCCCCCC

A common pitfall is to treat every line that begins with @ as a description line. However, @ is also a valid quality character.

The Ins and Outs of Counting FASTA/FASTQ Entries

$ grep -c "^>" egfr_flank.fasta
5
$ grep -c "^@" untreated1_chr4.fq
208779
$ wc -l untreated1_chr4.fq
817420 untreated1_chr4.fq

817,420/4 = 204,355

$ bioawk -cfastx 'END{print NR}' untreated1_chr4.fq
204355
Nucleotide Codes
  1. Lowercase bases are often used to indicate soft masked repeats or low complexity sequences
  2. Repeats and low-complexity sequences may also be hard masked, where nucleotides are replaced with N (or sometimes an X).
    在这里插入图片描述
    在这里插入图片描述
Base Qualities

Qualities are restricted to the printable ASCII characters, ranging from 33 to 126
ASCII decimal to character in Python

>>> qual = "JJJJJJJJJJJJGJJJJJIIJJJJJIGJJJJJIJJJJJJJIJIJJJJHHHHHFFFDFCCC"
>>> [ord(b) for b in qual]
[74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 71, 74, 74, 74, 74, 74, 73,
73, 74, 74, 74, 74, 74, 73, 71, 74, 74, 74
This practical book teaches the skills that scientists need for turning large sequencing datasets into reproducible and robust biological findings. Many biologists begin their bioinformatics training by learning scripting languages like Python and R alongside the Unix command line. But there's a huge gap between knowing a few programming languages and being prepared to analyze large amounts of biological data. Rather than teach bioinformatics as a set of workflows that are likely to change with this rapidly evolving field, this book demsonstrates the practice of bioinformatics through data skills. Rigorous assessment of data quality and of the effectiveness of tools is the foundation of reproducible and robust bioinformatics analysis. Through open source and freely available tools, you'll learn not only how to do bioinformatics, but how to approach problems as a bioinformatician. Go from handling small problems with messy scripts to tackling large problems with clever methods and tools Focus on high-throughput (or "next generation") sequencing data Learn data analysis with modern methods, versus covering older theoretical concepts Understand how to choose and implement the best tool for the job Delve into methods that lead to easier, more reproducible, and robust bioinformatics analysis Table of Contents Part I. Ideology: Data Skills for Robust and Reproducible Bioinformatics Chapter 1. How to Learn Bioinformatics Part II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project Chapter 2. Setting Up and Managing a Bioinformatics Project Chapter 3. Remedial Unix Shell Chapter 4. Working with Remote Machines Chapter 5. Git for Scientists Chapter 6. Bioinformatics Data Part III. Practice: Bioinformatics Data Skills Chapter 7. Unix Data Tools Chapter 8. A Rapid Introduction to the R Language Chapter 9. Working with Range Data Chapter 10. Working with Sequence Data Chapter 11. Working with Alignment Data Chapter 12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks Chapter 13. Out-of-Memory Approaches: Tabix and SQLite Chapter 14. Conclusion
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值