Chapter3-Next-Generation Sequencing（更新中）

用一生追杀

已于 2024-10-05 12:55:37 修改

阅读量151

点赞数 2

文章标签： python 笔记

于 2024-09-29 23:50:47 首次发布

本文链接：https://blog.csdn.net/deseqwings/article/details/142590513

版权

这一章中，我们将使用全基因组测序数据学习NGS常见分析方法

原文：Bioinformatics with Python Cookbook

Biopython 文档 — Biopython 1.84 文档

Accessing GenBank and moving around NCBI databases

这一部分我们将学习通过biopython访问genbank数据库，但是为了拓展学习，我会将biopython用户手册-Accessing NCBI’s Entrez databases代替本章内容。

Entrez是一个分子生物学数据库系统，提供对核苷酸和蛋白质序列数据、以基因为中心且基因组图谱信息、3D结构数据、PubMed MEDLINE等的综合访问。该系统由国家生物技术信息中心(NCBI)提供，可通过互联网连接。

EInfo：获取有关 Entrez 数据库的信息

from Bio import Entrez, SeqIO
Entrez.email = 'xxx@qq.com'

导入库并且留下邮箱，尽量遵守Entrez用户要求。

record = Entrez.read(Entrez.einfo())
record['DbList']

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

罗列NCBI上的所有数据库，每个数据库都能获取更详细的信息，下面以pubmed为例。

record = Entrez.read(Entrez.einfo(db="pubmed"))
record['DbInfo'].keys()
for field in record["DbInfo"]["FieldList"]:  
    print(f"{field['Name']}, {field['FullName']}, {field['Description']}")

得到的结果是一个嵌套字典，可以通过遍历字典查询结果，这些结果都可以配合 ESearch一起使用。

ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
TITL, Title, Words in title of publication
MESH, MeSH Terms, Medical Subject Headings assigned to publication

略

ESearch：搜索 Entrez 数据库

paper_record = Entrez.read(Entrez.esearch(db = 'pubmed',
                                         term = 'neuroinflammation[title]',
                                         retmax = 40))
paper_record['IdList']

['39361032', '39360420', '39359093', '39358233', '39357895', '39357894', '39357738', '39357591', '39356769', '39355174', '39351900', '39351804', '39351324', '39350557', '39347419', '39344133', '39341961', '39341300', '39340662', '39340661', '39340143', '39339745', '39338361', '39338346', '39337602', '39337316', '39337247', '39334733', '39334486', '39334416', '39334169', '39333683', '39332101', '39331469', '39328139', '39328052', '39326794', '39323437', '39322063', '39321702']

在pubmed中搜索标题中包含neuroinflammation的文章pmid。

cypri = Entrez.read(Entrez.esearch(db = 'nucleotide',
                                  term="Cypripedioideae[Orgn] AND matK[Gene]",
                                  idtype="acc"))
cypri

{'Count': '860', 'RetMax': '20', 'RetStart': '0', 'IdList': ['NC_087860.1', 'PP503063.1', 'PP448181.1', 'PP356909.1', 'OR348669.1', 'OM313455.1', 'OM313454.1', 'NC_084420.1', 'NC_084419.1', 'NC_084418.1', 'OR772093.1', 'OR772092.1', 'OR772091.1', 'OR772090.1', 'OR772089.1', 'OR772088.1', 'OR772087.1', 'OR772086.1', 'OR772085.1', 'OR772084.1'], 'TranslationSet': [{'From': 'Cypripedioideae[Orgn]', 'To': '"Cypripedioideae"[Organism]'}], 'TranslationStack': [{'Term': '"Cypripedioideae"[Organism]', 'Field': 'Organism', 'Count': '81965', 'Explode': 'Y'}, {'Term': 'matK[Gene]', 'Field': 'Gene', 'Count': '275521', 'Explode': 'N'}, 'AND'], 'QueryTranslation': '"Cypripedioideae"[Organism] AND matK[Gene]'}

在核苷酸数据库中搜索Cypripedioideae(种属)matK(基因名)，其中IdList是对应的 GenBank identifier

ESummary：从主 ID 检索摘要

record_cyp = Entrez.read(Entrez.esummary(db = 'nucleotide', id = 'NC_087860.1'))
record_cyp

[{'Item': [], 'Id': '2715319654', 'Caption': 'NC_087860', 'Title': 'Cypripedium x ventricosum chloroplast, complete genome', 'Extra': 'gi|2715319654|ref|NC_087860.1|[2715319654]', 'Gi': IntegerElement(2715319654, attributes={}), 'CreateDate': '2024/04/10', 'UpdateDate': '2024/04/10', 'Flags': IntegerElement(768, attributes={}), 'TaxId': IntegerElement(1088838, attributes={}), 'Length': IntegerElement(175385, attributes={}), 'Status': 'live', 'ReplacedBy': '', 'Comment': '  ', 'AccessionVersion': 'NC_087860.1'}]

以 NC_087860.1为例检索摘要

EFetch：从 Entrez 下载完整记录

stream = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb")
record = SeqIO.read(stream, "genbank")
record.seq

从上面的 Cypripedioideae 例子中，我们下载 GenBank 记录EU490707

Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA')

Performing basic sequence analysis

在这一章中，我们将学习一些DNA序列的基本操作

from Bio import Entrez, SeqIO, SeqRecord
hdl = Entrez.efetch(db='nucleotide', id=['NM_002299'], rettype='gb')
gb_rec = SeqIO.read(hdl, 'gb')

按照前面的方法获取人乳糖苷酶id:NM_002299的核苷酸信息。

for feature in gb_rec.features:
    if feature.type == 'CDS':
        location = feature.location
for feature in gb_rec.features:
    if feature.type == 'CDS':
        print(f"Feature type: {feature.type}")  
        print(f"Location: {feature.location}")

获取编码序列的信息。

Feature type: CDS
Location: [15:5799](+)

cds = SeqRecord.SeqRecord(gb_rec.seq[location.start:location.end],
                          'NM_002299',
                          description='LCT CDS only')
rna = cds.seq.transcribe()
print(rna)
prot = cds.seq.translate()
print(prot)

获得了编码序列之后，可对其进行转录或者翻译。

Working with modern sequence formats

在这部分中，我们将学习现代测序常见输出格式——fastq文件的分析；作者以千人基因组计划中的一个小片段为例(太大了个人电脑带不动) 。

rm -f SRR003265.filt.fastq.gz 2>/dev/null
wget -nd ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA18489/sequence_read/SRR003265.filt.fastq.gz
rm SRR003265.filt.fastq.gz SRR.fastq.gz

这是一个来自西非一个名为约鲁巴的国家的女性测序片段(瞎改名字并不是一个好习惯。。。)

zcat SRR.fastq.gz | head

使用zcat命令查看文件的前十行

@SRR003265.31 3042NAAXX:3:1:1252:1819 length=51
GGGAAAAGAAAAACAAACAAACAAAAACAAAACACAGAAACAAAAAAACCA
+
IIIIIIIIIIIIIIIIIII?8IAD>I1IIAD@IIH7I95=@-@+7=.;588
@SRR003265.216 3042NAAXX:3:1:433:1251 length=51
GAAATTTGTTTGCAGACCTCTGTGCAAACAAATTTCAGATTGGAAGAGCGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIC;6I?II
@SRR003265.404 3042NAAXX:3:1:1902:1672 length=51
GATAATGATCTGAAGTTTTATTTTTTCACCAGGTCTCTGCCACATTTTTGT

序列标识符：@SRR003265.31 3042NAAXX:3:1:1252:1819 length=51，这是测序读段的唯一标识符，后面的信息通常包括样本ID和其他测序信息。

核苷酸序列：GGGAAAAGAAAAACAAACAAACAAAAACAAAACACAGAAACAAAAAAACCA，这是从样本中提取的DNA序列，长度为51个碱基对。

分隔符：+，表示序列的结束。

质量值：IIIIIIIIIIIIIIIIIII?8IAD>I1IIAD@IIH7I95=@-@+7=.;588，这些字符代表每个碱基的质量分数，每个字母都表示一个phred质量分数。参考：Phred quality score - Wikipedia

import gzip
from Bio import SeqIO
from collections import defaultdict
recs = SeqIO.parse(gzip.open('srr.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
rec = next(recs)
print(rec.letter_annotations)

{'phred_quality': [40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 30, 23, 40, 32, 35, 29, 40, 16, 40, 40, 32, 35, 31, 40, 40, 39, 22, 40, 24, 20, 28, 31, 12, 31, 10, 22, 28, 13, 26, 20, 23, 23]}

SeqIO.parse返还迭代器，查看质量分数

cnt = defaultdict(int)
for rec in recs:
    for letter in rec.seq:
        cnt[letter] += 1
tot = sum(cnt.values())
for letter, cnt in cnt.items():
    print('%s: %.2f %d' % (letter, 100. * cnt / tot, cnt))

G: 20.68 5359329
A: 28.60 7411928
T: 29.58 7666885
C: 21.00 5444044
N: 0.14 37289

查看碱基分布，由于这是一个过滤后的数据片段，所以未知碱基含量不会太高。

recs = SeqIO.parse(gzip.open('srr.fastq.gz', 'rt', encoding='UTF-8'), 'fastq')
n_cnt = defaultdict(int)
for rec in recs:
    for i, letter in enumerate(rec.seq):
        pos = i + 1
        if letter == 'N':
            n_cnt[pos] += 1
seq_len = max(n_cnt.keys())
positions = range(1, seq_len + 1)

记录每个片段Ncalls出现的位置，将其存储在默认字典中，绘制Ncalls数量与位置的折线图，需要注意的是，Fastq文件是从1开始计数的，所以索引需要从1开始。

fig, ax = plt.subplots(figsize=(16, 9), tight_layout=True, dpi=300)
fig.suptitle('Number of N calls as a function of the distance from the start of the sequencer read', fontsize='xx-large')
ax.plot(positions, [n_cnt[x] for x in positions])
ax.set_xlim(1, seq_len)
ax.set_xlabel('Read distance', fontsize='xx-large')
ax.set_ylabel('Number of N Calls', fontsize='xx-large')
fig.savefig('n_calls.png')

recs = SeqIO.parse(gzip.open('srr.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
qual_pos = defaultdict(list)
for rec in recs:
    for i, qual in enumerate(rec.letter_annotations['phred_quality']):
        if i < 25 or qual == 40:
            continue
        pos = i + 1
        qual_pos[pos].append(qual)
vps = []
poses = list(qual_pos.keys())
poses.sort()
for pos in poses:
    vps.append(qual_pos[pos])

绘制质量分数分布图，采用与之前同样的策略，但需要注意的是可以直接略过前25个bp，因为千人基因组计划要求前25个bp的quality必须是40

fig, ax = plt.subplots(figsize=(16,9), dpi=300, tight_layout=True)
sns.boxplot(data=vps, ax=ax)
ax.set_xticklabels([str(x) for x in range(26, max(qual_pos.keys()) + 1)])
ax.set_xlabel('Read distance', fontsize='xx-large')
ax.set_ylabel('PHRED score', fontsize='xx-large')
fig.suptitle('Distribution of PHRED scores as a function of read distance', fontsize='xx-large')
fig.savefig('phred.png')