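Introduction to FASTA files
FASTA file: a file used to store sequence information.
Features:
- Each record consists of a header line and a sequence part
- The header line always starts with >
- The text from > up to the first space is the name of the sequence
- The rest of the header line is the description of the sequence
- A single FASTA file can contain multiple sequences
Example: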
>X17276.1|kraken:taxid|9646 Giant Panda satellite 1 DNA
GATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCTGGACGCTCTGCTTTGTTACCAATGAGAAGGGCGCTGAATCCTCGAAAATCCTGACCCTTTTAATTCATGCTCCCTTACTCACGAGAGATGATGATCGTTGATATTTCCCTGGACTGTGTGGGGTCTCAGAGACCACTATGGGGCACTCTCGTCAGGCTTCCGCGACCACGTTCCCTCATGTTTCCCTATTAACGAAGGGTGATGATAGTGCTAAGACGGTCCCTGTACGGTGTTGTTTCTGACAGACGTGTTTTGGGCCTTTTCGTTCCATTGCCGCCAGCAGTTTTGACAGGATTTCCCCAGGGAGCAAACTTTTCGATGGAAACGGGTTTTGGCCGAATTGTCTTTCTCAGTGCTGTGTTCGTCGTGTTTCACTCACGGTACCAAAACACCTTGATTATTGTTCCACCCTCCATAAGGCCGTCGTGACTTCAAGGGCTTTCCCCTCAAACTTTGTTTCTTGGTTCTACGGGCTG
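The layout described above can be read with a few lines of plain Python. The following is only a minimal sketch, not a replacement for a real parser such as Biopython's SeqIO; the helper name `parse_fasta` and the two toy records are made up for illustration.

```python
def parse_fasta(lines):
    """Yield (name, description, sequence) for each record in FASTA lines.

    `parse_fasta` is a hypothetical helper written for this illustration.
    """
    name, description, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if name is not None:
                yield name, description, "".join(chunks)
            # The name runs from '>' to the first space; the rest is the description
            name, _, description = line[1:].partition(" ")
            chunks = []
        else:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    # Emit the last record once the input is exhausted
    if name is not None:
        yield name, description, "".join(chunks)


# Made-up sample data: two records, the first wrapped over two lines
sample = """>seq1 first toy record
GATC
CTCC
>seq2 second toy record
ACGT
"""
records = list(parse_fasta(sample.splitlines()))
for name, description, sequence in records:
    print(name, "|", description, "|", len(sequence))
```

Joining the wrapped lines before returning the sequence matters because many FASTA files break long sequences into fixed-width lines.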
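Introduction to GFF/GTF files
GFF/GTF file: an annotation file that carries additional information about the sequences in a FASTA file. Each record has nine columns separated by tabs (\t). The meaning of each column is as follows: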
- seqname - name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
- source - name of the program that generated this feature, or the data source (database or project name)
- feature - feature type name, e.g. Gene, Variation, Similarity
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- frame - One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on…
- attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
Example:
##gff-version 3
NC_045512.2 RefSeq gene 266 21555 . + . ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=ORF1ab;gbkey=Gene;gene=ORF1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01
NC_045512.2 RefSeq CDS 266 13468 . + 0 ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1
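To make the nine-column layout concrete, here is a minimal sketch that splits one feature line into named fields and turns the attribute column into a dictionary. The helper name `parse_gff_line` is hypothetical, and the sample line is a shortened version of the CDS record above, rebuilt with real tab separators. It assumes GFF3-style `key=value` attributes, in which reserved characters are percent-encoded (for example `%3B` for `;`).

```python
from urllib.parse import unquote


def parse_gff_line(line):
    """Split one GFF3 feature line into its nine columns (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(columns))
    seqname, source, feature, start, end, score, strand, frame, attribute = columns
    attributes = {}
    for pair in attribute.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        attributes[key] = unquote(value)  # decode escapes such as %3B -> ';'
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Shortened version of the CDS line above, joined with real tabs
sample_line = "\t".join([
    "NC_045512.2", "RefSeq", "CDS", "266", "13468", ".", "+", "0",
    "ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;"
    "Note=pp1ab%3B translated by -1 ribosomal frameshift",
])
record = parse_gff_line(sample_line)
# Both coordinates are inclusive, so the feature length is end - start + 1
length = record["end"] - record["start"] + 1
print(record["feature"], record["attributes"]["ID"], length)
```

Because start and end are 1-based and inclusive, the matching Python slice of a sequence string is `sequence[start - 1:end]`.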
Exercises
1. Extract the CDS features. In the attribute column, keep only ID=, Name=, and locus_tag=, and drop everything else.
import re


def test1():
    # Open the source .gff file and the output file
    with open('./**.gff') as fr, open('./test1.gff', 'w') as fr2:
        # Read the file line by line
        for fr_line in fr:
            if fr_line.startswith('#'):
                # Copy the comment lines of the source file into test1.gff
                fr2.write(fr_line)
                continue
            # Split the record on tabs (\t)
            result = re.split(r'\t', fr_line.rstrip('\n'))
            # Skip blank or malformed lines and every feature that is not a CDS
            if len(result) < 9 or result[2] != 'CDS':
                continue
            # The ninth column (index 8) holds the attributes, separated by ';'
            kept = {}
            for temp in re.split(r';', result[8]):
                # Keep only ID, Name and locus_tag; the other attributes are not needed
                key, _, value = temp.strip().partition('=')
                if key in ('ID', 'Name', 'locus_tag'):
                    kept[key] = value
            # Write the first eight columns unchanged, then the kept attributes
            # in the fixed order ID, Name, locus_tag, ending the record with one newline
            attrs = ';'.join(k + '=' + kept[k]
                             for k in ('ID', 'Name', 'locus_tag') if k in kept)
            fr2.write('\t'.join(result[:8]) + '\t' + attrs + '\n')
    print('Finished writing test1.gff')


if __name__ == '__main__':
    test1()
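2. Extract the sequences that correspond to the records from exercise 1, and set the description of each sequence to its ID, Name, and locus_tag values.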
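import re
from Bio import SeqIO


def test2():
    # Open the FASTA file and read its first record
    # (this assumes the file holds a single reference sequence)
    with open('./**.fasta') as infile:
        seq1 = next(SeqIO.parse(infile, 'fasta'))
    genome = seq1.seq
    with open('./test1.gff') as fr, open('./test2.fasta', 'w') as fr2:
        # Read the .gff file line by line to get the start and end of each feature
        for fr_line in fr:
            # Skip comment lines and blank lines
            if fr_line.startswith('#') or not fr_line.strip():
                continue
            # Split on tabs
            result = re.split(r'\t', fr_line.rstrip('\n'))
            # The last column holds ID, Name and locus_tag in that order;
            # split it on ';' and drop everything up to the '='
            result2 = re.split(r';', result[8])
            values = [re.split(r'=', item, maxsplit=1)[1] for item in result2[:3]]
            fr2.write('>' + values[0] + ' name:' + values[1] +
                      ' Locus_tag:' + values[2] + '\n')
            # GFF coordinates are 1-based and inclusive, hence the slice [start - 1:end]
            seq2 = genome[int(result[3]) - 1:int(result[4])]
            # A feature on the reverse strand is read from the complementary strand
            if result[6] == '-':
                seq2 = seq2.reverse_complement()
            fr2.write(str(seq2) + '\n')
    print('Finished writing test2.fasta')


if __name__ == '__main__':
    test2()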
Results: