这几天在学习perl,其中碰到要自己创建新模块的问题,所以把这两天的学习心得写下来,以鼓励自己未来更深入的学习。由于很多脚本要运用到同一个或者几个相同的子程序,因此需要我们将其封装起来,形成一个整合的新模块,以便高效率进行脚本的编写和运行。
如有不当之处,请多多指教。
我们就以生信的一个例子为例,具体如下:
1 创建新模块
01)创建新模块的目的:perl可以对代码进行封装,比如著名的CPAN上就有非常好用的Module,可以极大减轻开发量。
02)模块命名规则:不能随便更换模块名。模块名要以大写字母为首字母来命名。program可使用小写字母进行命名。以pm为后缀来命名模块。
以下即构建的新模块,模块名必须为BeginPerlBioinfo.pm
其中2-9行分别含有注释信息,并且注释的行不可缺少。其它均为自己写好的各个子函数。
1 #!/usr/bin/perl
2
3 package BeginPerlBioinfo; # 模块名
4 use strict;
5 use vars qw($VERSION @ISA @EXPORT); #定义变量名
6 require Exporter; #引入Expporter module
7 @ISA=qw/Exporter/; #继承Exporter
8 @EXPORT=qw/get_file_data codon2aa extract_sequence_from_fasta_data print_sequence/; #查找@INC中合成的模块文件。即默认导出的模块。这些模块在很多脚本里都会用到。
9 $VERSION = '0.01';
10
11 #####################################################################
12 #codon2aa
13 #
14 #A subroutine to translate a DNA 3-charater codon to an amino acid
15 # Version2
16 #####################################################################
17 sub codon2aa {
18 my($codon) = @_;
19
20 if ($codon =~ /GC./i) {return 'A'} # Alanine
21 elsif ($codon =~ /TG[TC]/i) {return 'C'} # Cysteine
22 elsif ($codon =~ /GA[TC]/i) {return 'D'} # Aspartic acid
23 elsif ($codon =~ /GA[AG]/i) {return 'E'} # Glutamic Acid
24 elsif ($codon =~ /TT[TC]/i) {return 'F'} # Phenylalanine
25 elsif ($codon =~ /GG./i) {return 'G'} # Glycine
26 elsif ($codon =~ /CA[TC]/i) {return 'H'} # Histidine
27 elsif ($codon =~ /AT[TCA]/i){return 'I'} # Isoleucine
28 elsif ($codon =~ /AA[AG]/i) {return 'K'} # Lysine
29 elsif ($codon =~ /TT[AG]|CT./i){return 'L'} # Leucine
30 elsif ($codon =~ /ATG/i) {return 'M'} # Methionine
31 elsif ($codon =~ /AA[TC]/i) {return 'N'} # Asparagine
32 elsif ($codon =~ /CC./i) {return 'P'} # Proline
33 elsif ($codon =~ /CA[AG]/i) {return 'Q'} # Glutamine
34 elsif ($codon =~ /CG.|AG[AG]/i) {return 'R'} # Arginine
35 elsif ($codon =~ /TC.|AG[TC]/i) {return 'S'} # Serine
36 elsif ($codon =~ /AC./i) {return 'T'} # Threonine
37 elsif ($codon =~ /GT./i) {return 'V'} # Valine
38 elsif ($codon =~ /TGG/i) {return 'W'} # Tryptophan
39 elsif ($codon =~ /TA[TC]/i) {return 'Y'} # Tyrosine
40 elsif ($codon =~ /TA[AG]|TGA/i) {return '_'} # Stop
41 else {
42 print STDERR "Bad codon \"$codon\"!!\n";
43 exit;
44
45 }
46 }
47
48
49 #get_file_data
50 #
51 #A subroutine to get data from a file given its filename
52
53 sub get_file_data {
54
55 my($filename) = @_;
56
57 #use strict;
58 #use warnings;
59
60 # Initialize variables
61 my @filedata = ( );
62
63 unless( open(GET_FILE_DATA, $filename) ) {
64 print STDERR "Cannot open file \"$filename\"\n\n";
65 exit;
66
67 }
68
69 @filedata = <GET_FILE_DATA>;
70
71 close GET_FILE_DATA;
72
73 return @filedata;
74
75 }
76
77 #extract_sequence_from_fasta_data
78 #
79 # A subroutin to extract FASTA sequence data from an array
80
81
82 sub extract_sequence_from_fasta_data {
83
84 my (@fasta_file_data) = @_;
85
86 #Declare and initialize variables
87 my $sequence = '';
88
89 foreach my $line (@fasta_file_data) {
90
91 # discard blank line
92 if ($line =~ /^\s*$/) {
93 next;
94
95 #discard comment line
96 } elsif($line =~ /^\s*#/) {
97 next;
98
99 # discard fasta header line
100 } elsif($line =~ /^>/) {
101 next;
102 #keep line, add to sequence string
103 } else {
104 $sequence .= $line;
105 }
106 }
107
108 #remove non-sequence data( in this case, whitespace) from $sequence string
109 $sequence =~ s/\s//g;
110
111 return $sequence ;
112
113 }
114
115 #print_sequence
116 #
117 #A subroutine to format and print sequence data
118
119 sub print_sequence {
120
121 my($sequence, $length) = @_;
122
123 use strict;
124 use warnings;
125
126 # Print sequence in lines of $length
127 for ( my $pos = 0; $pos < length($sequence) ; $pos +=$length) {
128
129 print substr($sequence, $pos, $length), "\n";
130 }
131 }
2 导入及运用新模块
将我们上面写好的模块:BeginPrelBioinfo.pm 导入到我们的脚本里即可。在这里我们选用use来引入新模块。
01)使用use 只能引入模块,同时不需要加后缀名。
02)use引入模块的同时也引入了模块的子模块。use在默认@INC里面去寻找,如子程序不在@INC中的话,这时用use是不可以引入的,运行脚本时会报错。
运行以下脚本:脚本名,我命为Extract_annotation_and_sequence_from_GenBank_file.pl
#! /usr/bin/perl
# Extract annotation and sequence from GenBank file
use strict;
use warnings;
use BeginPerlBioinfo; #使用新模块BeginPerlBioinfo
# declare and initialize variables
my @annotation = ( );
my $sequence = '';
my $filename = 'record.gb'; #导入input file
parse1(\@annotation, \$sequence, $filename);
#print the annotation, and then
# print the DNA in new format just to check if we got it okay.
print @annotation;
print_sequence($sequence, 50);
exit;
#########################################################################
###########################
#subroutine
##########################################################################
########################
#parse1
#
#--parse annotation and sequence from GenBank record
sub parse1 {
my($annotation, $dna, $filename) = @_;
# $annotation--reference to array
# $dna --reference to scalar
# $filename --scalar
# declare and initialize variables
my $in_sequence = 0;
my @GenBankFile = ( );
# Get the GenBank data into an array from a file
@GenBankFile = get_file_data($filename);
# Extract all the sequence lines
foreach my $line (@GenBankFile) {
if($line =~ /^\/\/\n/ ) {#If $line is end-of-record line //\n,
last; #break out of the foreach loop.
} elsif($in_sequence) { # If we know we're in a sequence,
$$dna .= $line; # add the current line to $$dna.
} elsif ($line =~ /^ORIGIN/ ) { # If $line begins a sequence,
$in_sequence = 1; # set the $in_sequence flag.
} else{ # Otherwise
push( @annotation, $line); #add the current line to @annotation.
}
}
# remove whitespace and line numbers from DNA sequence
$$dna =~ s/[\s0-9]//g;
}