perl计算基因在基因组上的距离
介绍
根据输入注释文件及目的基因文件,找出目的基因在基因组上的距离并输出到文件内
一、文件格式
基因注释文件:为细菌全基因组利用prokka注释文件,格式如下
gene和CDS前为基因的起始和结束位置,locus_tag为基因编号,product为注释结果,以制表符分隔。
目的基因文件xxx.txt:
以制表符分隔,第一行为描述,第二行为全基因组长度bp,第三行及之后为所要计算距离的基因。
二、代码
这里输入注释文件为ZJSH63.tbl,目的基因文件为genes.txt
代码distance.pl如下:
#! /usr/bin/perl -w
use strict;
my @list = @ARGV;
#运行脚本示例:perl distance.pl ZJSH63.tbl genes.txt
#ZJSH63.tbl文件为prokka注释文件
#genes.txt为所要计算距离的基因编号文件
open(Genes,"$ARGV[1]") or die ("No genes file finded: $!");
my @gene;
my %location;
my @temp1;
my @temp2;
my @temp3;
my $num =0;
my $temp;
while(<Genes>){
chomp;
#geen.txt第0列为信息,第2列为菌株、基因组长度和基因编号
@temp1 = split(/\t/,$_);
#@gene 0位为菌名,1位为基因组长度bp,后续为基因编号
$temp = $temp1[1];
chomp($temp);
if($temp =~/\r$/){chop($temp);}
#print "$temp\t";
$gene[$num] = $temp;
#print "$gene[$num]\t";
$num++;
}
close (Genes) or die ("can not close genes file: $!");
print "number of genes: $num\n";
my $name;
my $left = "left";
my $right = "right";
my $l;
my $r;
foreach $name (@gene[2..$num-1]){
chomp($name);
#print "$name\t";
$l = $left.$name;
$r = $right.$name;
$location{$l} = 0;
$location{$r} = 0;
#print "$l\t";
#print "$location{$l}\t","$r\t","$location{$r}\n";
}
#my @key = keys(%location);
#print "@key\n";
my $judge =0;
my $word1 = "locus_tag";
my $word2 = "CDS";
open(Annotate,"$ARGV[0]") or die ("No annotate file finded: $!");
open(OUT,">result.txt") or die ("can not write result: $!");
print OUT "Genes\tleft\tright\n";
while(<Annotate>){
chomp;
if(/$word1/){
if($judge == 2){$judge =0; next;}
@temp3 = split(/\t/,$_);
$temp = $temp3[-1];
chomp($temp);
if($temp=~/\r$/){chop($temp);}
if(grep /$temp/, @gene){
$l = $left.$temp;
$r = $right.$temp;
print "find $temp\t";
print OUT "$temp\t";
#print "$l\t";
$judge =1;
#$temp_gene = $temp3[-1];
next;
}
}
if($judge ==1){
if(/$word2/){
@temp2 = split(/\t/, $_);
my $te1 = $temp2[0];
chomp($te1);
if($te1=~/\r$/){chop($te1);}
my $te2 = $temp2[1];
chomp($te2);
if($te2=~/\r$/){chop($te2);}
if($te1 > $ te2){$location{$l}=$te2;$location{$r}=$te1;}
else{$location{$l}=$te1;$location{$r}=$te2;}
#print "$l\t";
print "$location{$l}\t";
print OUT "$location{$l}\t";
print "$location{$r}\n";
print OUT "$location{$r}\n";
$judge =2;
#$temp_gene ="";
next;
}
}
}
close (Annotate) or die ("can not close annotate file: $!");
#my %distance;
#print "$gene[1]\n";
my $row1 = join("\t",@gene[2..$num-1]);
#print "$row1\n";
print OUT "\n";
print OUT "Distance\t$row1\n";
my $dis;
my $dis1;
my $dis2;
my $i;
my $j;
#print $num."\n";
foreach $name (@gene[2..$num-1]){
print OUT "$name\t";
#基因1的起始位置key值$l1
my $l1 = $left.$name;
#基因1的终止位置key值$r1
my $r1 = $right.$name;
foreach my $name1 (@gene[2..$num-1]){
#基因2的起始位置key值$l2
my $l2 = $left.$name1;
#基因2的起始位置key值$r2
my $r2 = $right.$name1;
if($name1 eq $gene[$num-1]){
if($location{$l1} >= $location{$r2}){
$dis1 = $location{$l1} - $location{$r2};
$dis2 = $gene[1] - ($location{$r1} - $location{$l2});
if($dis1 < $dis2){$dis = $dis1}else{$dis=$dis2}
#print $dis."\n";
print OUT "$dis\n";
}elsif($location{$l2} >= $location{$r1}){
$dis1 = $location{$l2} - $location{$r1};
$dis2 = $gene[1] - ($location{$r2} - $location{$l1});
if($dis1 < $dis2){$dis = $dis1}else{$dis=$dis2}
#print $dis."\n";
print OUT "$dis\n";
}else{
$dis = $location{$l1}-$location{$l2};
if($dis < 0){$dis = -$dis}
#print $dis."\n";
print OUT "$dis\n";
}
next;
}
if($location{$l1} >= $location{$r2}){
$dis1 = $location{$l1} - $location{$r2};
$dis2 = $gene[1] - ($location{$r1} - $location{$l2});
if($dis1 < $dis2){$dis = $dis1}else{$dis=$dis2}
#print $dis."\t";
print OUT "$dis\t";
}elsif($location{$l2} >= $location{$r1}){
$dis1 = $location{$l2} - $location{$r1};
$dis2 = $gene[1] - ($location{$r2} - $location{$l1});
if($dis1 < $dis2){$dis = $dis1}else{$dis=$dis2}
#print $dis."\t";
print OUT "$dis\t";
}else{
$dis = $location{$l1}-$location{$l2};
if($dis < 0){$dis = -$dis}
#print $dis."\t";
print OUT "$dis\t";
}
}
}
close (OUT) or die ("can not close result file: $!");
原理是利用正则表达式以locus_tag和CDS识别基因编号和起始至终止位置;利用hash数组储存不同目的基因的起始和终止位置;再判断两两基因位置,当两基因A、B不重叠或包含,则计算A终止至B起始碱基数、B终止至A起始之间碱基数,取两者最小值;当两基因A、B重叠或包含时取A起始至B起始之间碱基数。
三、运行代码
以linux为例:
perl distance.pl ZJSH63.tbl genes.txt
三、输出文件
输出文件为result.txt
内容如下:
会打印目的基因的起始和终止位置,再打印两基因距离。
不足
对于NCBI的注释文件,其结构与prokka注释文件结构不一致,如GenBank注释文件xxx.gbff:
该文件识仍可以采用locus_tag和CDS,但是locus_tag后又引号,需要去除,分隔方式也不为制表符;利用CDS识别起始和终止位置也需要特殊处理。
需要直接表征目的基因在基因组上前后关系时,需要更改代码内基因位置判断关系的语句。
(懒得做了…第一次发布,请见谅)