测序数据处理 —— 读取序列文件
前面我们介绍了测序中常见的一些序列文件,如
FASTA、
FASTQ 和
GenBank 等。这一节,我们简单介绍一些如何在
Python 和
R 中读取这些文件信息。
Python
Python 中有一个好用的生物信息学工具包 Biopython,提供了对多种数据文件的读写,以及访问在线服务器等工具。
我们主要介绍如何使用它来读取序列数据,先安装这个包。
pip install biopython
读取 FASTA
FASTA 文件是一种用于存储核酸或蛋白质序列的文本格式。每个序列以一个描述行(以 > 开头)和随后的序列行表示。
可以从 nucleotide 数据库中获取某个基因的 FASTA 序列
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
with Entrez.efetch(
db="nucleotide", rettype="fasta", retmode="text", id="NM_000546"
) as handle:
seq_record = SeqIO.read(handle, "fasta")
print("%s with %i features" % (seq_record.id, len(seq_record.features)))
# 保存到本地
SeqIO.write(seq_record, 'example.fa', 'fasta')
或者直接读取本地文件
from Bio import SeqIO
fasta_file = "example.fasta"
for record in SeqIO.parse(fasta_file, "fasta"):
print(f"ID: {record.id}")
print(f"Sequence: {record.seq}")
print(f"Description: {record.description}")
读取 FASTQ
FASTQ 文件用于存储带有质量分数的核酸序列,通常用于高通量测序数据。每个记录包含四行:序列标识符、序列、一个加号和质量分数行。
例如,example.fastq 为
@A00920:1238:HNKV7DSX5:1:1101:4128:1000 1:N:0:CGAGGCTG+CTCCTTAC
CATAAATACTTGTGGTGTTTAACATACGGTGAATATTTACAATTTCTCTTCGCATGTTTTCTCCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAGGCTGATCTCGTATGCCGTCTTCGGCTTGAAAAGGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,,F::F::,F,,,,,,F,,,F,,,,,,:,FFFFFFFFFFFFF
@A00920:1238:HNKV7DSX5:1:1101:12500:1000 1:N:0:CGAGGCTG+CTCCTTAC
CTTTTACATACCAAGGTAAATCATCGCCTTTTCCTATTAAATTATTTTTGCCCATGGCTAAAACTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAGGCTGATCTCGTATGACGTGTGAGGCTTGTAAAAGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,FF:,:F,,,,,,F,,F,,,:,,:,FFFFFFFFFFFFFF
使用下面代码读取
from Bio import SeqIO
fastq_file = "example.fastq"
for record in SeqIO.parse(fastq_file, "fastq"):
print(f"ID: {record.id}")
print(f"Sequence: {record.seq}")
print(f"Quality: {record.letter_annotations['phred_quality']}")
读取 GenBank
GenBank 文件是一种包含核酸序列以及注释的格式,通常包含丰富的元数据。
在线获取基因的 GenBank 文件信息
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
with Entrez.efetch(
db="nucleotide", rettype="gb", retmode="text", id="NM_000546"
) as handle:
seq_record = SeqIO.read(handle, "gb")
print("%s with %i features" % (seq_record.id, len(seq_record.features)))
# 保存到本地
SeqIO.write(seq_record, 'example.gb', 'gb')
获取读取本地文件
from Bio import SeqIO
genbank_file = "example.gb"
for record in SeqIO.parse(genbank_file, "genbank"):
print(f"ID: {record.id}")
print(f"Sequence: {record.seq}")
print(f"Description: {record.description}")
for feature in record.features:
print(f"Feature: {feature.type}")
print(f"Location: {feature.location}")
print(f"Qualifiers: {feature.qualifiers}")
R
在 R 中读取序列文件,可以从 Bioconductor 项目中安装几个包,包括 Biostrings、ShortRead 和 genbankr。
需要先安装 Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
读取 FASTA
使用 Biostrings 包来读取 FASTA 文件,先安装包:
# 安装 Biostrings 包(如果尚未安装)
if (!requireNamespace("Biostrings", quietly = TRUE)) {
BiocManager::install("Biostrings")
}
确保安装成功之后
library(Biostrings)
fasta_file <- "example.fasta"
fasta_sequences <- readDNAStringSet(fasta_file)
# 打印每个序列的ID和序列
for (i in seq_along(fasta_sequences)) {
cat("ID:", names(fasta_sequences)[i], "\n")
cat("Sequence:", as.character(fasta_sequences[i]), "\n\n")
}
读取 FASTQ
使用 ShortRead 包来读取 FASTQ 文件:
# 安装 ShortRead 包(如果尚未安装)
if (!requireNamespace("ShortRead", quietly = TRUE)) {
BiocManager::install("ShortRead")
}
library(ShortRead)
fastq_file <- "example.fastq"
fastq_sequences <- readFastq(fastq_file)
# 打印每个序列的ID、序列和质量分数
for (i in seq_along(fastq_sequences)) {
cat("ID:", as.character(id(fastq_sequences[i])), "\n")
print("Sequence:")
print(sread(fastq_sequences[i]))
print("Quality:")
print(quality(fastq_sequences[i]))
}
读取 GenBank
本来使用 genbankr 包来读取 GenBank 文件,但是这个包有问题。换成 ape,直接在线获取 GenBank 序列
if (!requireNamespace("ape", quietly = TRUE)) {
install.packages("ape")
}
library(ape)
tp53 <- read.GenBank("NM_000546", as.character = TRUE)
seq <- paste0(tp53$NM_000546, collapse = "")
也可以用它来读取 FASTA
read.FASTA("~/Downloads/example.fasta")
1 DNA sequence in binary format stored in a list.
Sequence length: 2512
Label:
NM_000546.6 Homo sapiens tumor protein p53 (TP53), transcrip...
Base composition:
a c g t
0.212 0.287 0.247 0.254
(Total: 2.51 kb)
2479

被折叠的 条评论
为什么被折叠?



