我们首先来看一下GenBank中序列的格式,我从http://www.ncbi.nlm.nih.gov/nuccore/JX118024.1中复制了这个信息到strawberry.gb中
LOCUS JX118024 460 bp DNA linear PLN 25-SEP-2012
DEFINITION Fragaria vesca subsp. americana RNA polymerase beta subunit (rpoC1)
gene, partial cds; plastid.
ACCESSION JX118024
VERSION JX118024.1 GI:402238751
KEYWORDS .
SOURCE plastid Fragaria vesca subsp. americana
ORGANISM Fragaria vesca subsp. americana
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
rosids; fabids; Rosales; Rosaceae; Rosoideae; Potentilleae;
Fragariinae; Fragaria.
REFERENCE 1 (bases 1 to 460)
AUTHORS Njuguna,W., Liston,A., Cronn,R., Ashman,T.L. and Bassil,N.
TITLE Insights into phylogeny, sex function and age of Fragaria based on
whole chloroplast genome sequencing
JOURNAL Mol. Phylogenet. Evol. (2012) In press
PUBMED 22982444
REMARK Publication Status: Available-Online prior to print
REFERENCE 2 (bases 1 to 460)
AUTHORS Njuguna,W., Liston,A., Cronn,R., Ashman,T.-L. and Bassil,N.
TITLE Direct Submission
JOURNAL Submitted (16-MAY-2012) Horticulture, Oregon State University, 4017
ALS, Corvallis, OR 97331, USA
COMMENT ##Assembly-Data-START##
Assembly Method :: yasra v. 2009; mulan v. 2005
Sequencing Technology :: Illumina
##Assembly-Data-END##
FEATURES Location/Qualifiers
source 1..460
/organism="Fragaria vesca subsp. americana"
/organelle="plastid"
/mol_type="genomic DNA"
/sub_species="americana"
/db_xref="taxon:101019"
gene complement(<1..>460)
/gene="rpoC1"
CDS complement(<1..>460)
/gene="rpoC1"
/codon_start=2
/transl_table=11
/product="RNA polymerase beta subunit"
/protein_id="AFQ39140.1"
/db_xref="GI:402238752"
/translation="ELVMCQEKLVQEAVDTLLDNGIRGQPMRDGHNKVYKSFSDVIEG
KEGRFRETLLGKRVDYSGRSVIVVGPSLSLHRCGLPREIAIELFQTFVIRGLIRQHFA
SNIGVAKSKIREKEPVVWEILQEVMQGHPVLLNRAPTLHRLGIQAFQPILV"
ORIGIN
1 tactaaaatg ggctggaacg cctgtatgcc taatctatgc agagtaggcg ccctattcag
61 caatacggga tgcccttgca taacttcctg aagtatttcc catacaaccg gctctttttc
121 ccgaatttta ctcttagcca ctcctatatt cgaggcaaaa tgctgcctaa ttaaaccacg
181 aattacaaat gtctggaaaa gctctattgc tatttcgcga ggcaatccac atcgatgtaa
241 tgaaagtgaa gggcctacaa caatgacaga acgccccgaa taatcgaccc gtttgccaag
301 cagagtctca cggaatcttc cctcttttcc ttcaattaca tctgaaaacg acttgtaaac
361 cttattatga ccgtccctca ttggttgtcc acggattcca ttatcaagaa gcgtatccac
421 agcttcttgt accaatttct cctgacacat gactaattct
//
这是一个很短的序列,我们来观察特征前面都是注释,不是我们想要的信息。从ORIGIN开始的下一行就是我们要获得的信息,但是每一行有数字标识,为了方便查看碱基的个数,这个也不是我们需要的。序列的最后是// 这个也就是结束的标识。
我们写这个程序的主要方法就是:
第一:忽略ORIGIN上面的所有内容
第二:当遇到//的时候停止
第三:把我们获取序列中的所有的数字替换为空
第四:注释和序列之间的区别,我们用了一个标志符in_sequence.在没有遇到ORIGIN的时候,值为0,当遇到ORIGIN是我们把它的值赋值为1.
第五:foreach中判断语句的顺序,我们按照从后往前的顺序,也就是也判读是不是最后的结束行,然后看如果是序列,然后ORI一行,然后是注释。
下面的程序我们把上边的所有注释内容页保存到@annotation中去了,并且也做了输出。
use strict;
use warnings;
my @annotation=( );
my $sequence ='';
my $filename ='';
parse1(\@annotation,\$sequence,\$filename);
print @annotation;
print_sequence($sequence,50);
#parse1
sub parse1
{
my ($annotation,$dna,$filename) = @_;
my $in_sequence = 0;
my @GenBankFile =( );
@GenBankFile =get_file_data($filename);
foreach my $line(@GenBankFile)
{
#用来匹配最后一行
if($line=~/^\/\/\n/)
{
last;
}
elsif($in_sequence)
{
$$dna .=$line;
}
elsif($line=~/^ORIGIN/)
{
$in_sequence = 1;
}
else
{
push(@annotation,$line);
}
}
#remove whitespace adn line numbers from DNA sequence
$$dna =~ s/[\s0-9]//g;
}
sub get_file_data
{
# A subroutine to get data from a file given its filename
#读取文件的子序列
my $dna_filename;
my @filedata;
print "please input the Path just like this f:\\\\perl\\\\data.txt\n";
chomp($dna_filename=<STDIN>);
open(DNAFILENAME,$dna_filename)||die("can not open the file!");
@filedata = <DNAFILENAME>;
close DNAFILENAME;
return @filedata;#子函数的返回值一定要记住写
}
sub print_sequence
{
# A subroutine to format and print sequence data
my ($sequence, $length) = @_;
for (my $pos =0; $pos<length($sequence);$pos+=$length)
{
print substr($sequence,$pos,$length),"\n";
}
}
结果如下:
F:\>perl\a.pl
please input the Path just like this f:\\perl\\data.txt
f:\\perl\strawberry.gb
LOCUS JX118024 460 bp DNA linear PLN 25-SEP-2012
DEFINITION Fragaria vesca subsp. americana RNA polymerase beta subunit (rpoC1)
gene, partial cds; plastid.
ACCESSION JX118024
VERSION JX118024.1 GI:402238751
KEYWORDS .
SOURCE plastid Fragaria vesca subsp. americana
ORGANISM Fragaria vesca subsp. americana
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
rosids; fabids; Rosales; Rosaceae; Rosoideae; Potentilleae;
Fragariinae; Fragaria.
REFERENCE 1 (bases 1 to 460)
AUTHORS Njuguna,W., Liston,A., Cronn,R., Ashman,T.L. and Bassil,N.
TITLE Insights into phylogeny, sex function and age of Fragaria based on
whole chloroplast genome sequencing
JOURNAL Mol. Phylogenet. Evol. (2012) In press
PUBMED 22982444
REMARK Publication Status: Available-Online prior to print
REFERENCE 2 (bases 1 to 460)
AUTHORS Njuguna,W., Liston,A., Cronn,R., Ashman,T.-L. and Bassil,N.
TITLE Direct Submission
JOURNAL Submitted (16-MAY-2012) Horticulture, Oregon State University, 4017
ALS, Corvallis, OR 97331, USA
COMMENT ##Assembly-Data-START##
Assembly Method :: yasra v. 2009; mulan v. 2005
Sequencing Technology :: Illumina
##Assembly-Data-END##
FEATURES Location/Qualifiers
source 1..460
/organism="Fragaria vesca subsp. americana"
/organelle="plastid"
/mol_type="genomic DNA"
/sub_species="americana"
/db_xref="taxon:101019"
gene complement(<1..>460)
/gene="rpoC1"
CDS complement(<1..>460)
/gene="rpoC1"
/codon_start=2
/transl_table=11
/product="RNA polymerase beta subunit"
/protein_id="AFQ39140.1"
/db_xref="GI:402238752"
/translation="ELVMCQEKLVQEAVDTLLDNGIRGQPMRDGHNKVYKSFSDVIEG
KEGRFRETLLGKRVDYSGRSVIVVGPSLSLHRCGLPREIAIELFQTFVIRGLIRQHFA
SNIGVAKSKIREKEPVVWEILQEVMQGHPVLLNRAPTLHRLGIQAFQPILV"
tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg
ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc
catacaaccggctctttttcccgaattttactcttagccactcctatatt
cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa
gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa
gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag
cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg
acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca
ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat
gactaattct
F:\>
如果我们把输入的@annotion删除了,我们就只得到我们想要的信息了。结果如下:
F:\>perl\a.pl
please input the Path just like this f:\\perl\\data.txt
f:\\perl\\strawberry.gb
tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg
ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc
catacaaccggctctttttcccgaattttactcttagccactcctatatt
cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa
gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa
gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag
cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg
acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca
ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat
gactaattct
F:\>