【perl计算基因在基因组上的距离】

jnope

已于 2022-04-18 15:36:33 修改

阅读量1.2k

点赞数 5

文章标签： perl

于 2022-04-17 23:30:41 首次发布

本文链接：https://blog.csdn.net/jnope/article/details/124238363

版权

perl计算基因在基因组上的距离

介绍

根据输入注释文件及目的基因文件，找出目的基因在基因组上的距离并输出到文件内

一、文件格式

基因注释文件：为细菌全基因组利用prokka注释文件，格式如下
gene和CDS前为基因的起始和结束位置，locus_tag为基因编号，product为注释结果，以制表符分隔。

目的基因文件xxx.txt：

以制表符分隔，第一行为描述，第二行为全基因组长度bp，第三行及之后为所要计算距离的基因。

二、代码

这里输入注释文件为ZJSH63.tbl，目的基因文件为genes.txt
代码distance.pl如下：

#! /usr/bin/perl -w
use strict;
my @list = @ARGV;
#运行脚本示例：perl distance.pl	ZJSH63.tbl	genes.txt
#ZJSH63.tbl文件为prokka注释文件
#genes.txt为所要计算距离的基因编号文件
open(Genes,"$ARGV[1]") or die ("No genes file finded: $!");
my @gene;
my %location;
my @temp1;
my @temp2;
my @temp3;
my $num =0;
my $temp;
while(<Genes>){
	chomp;
	#geen.txt第0列为信息，第2列为菌株、基因组长度和基因编号
    @temp1 = split(/\t/,$_);
	#@gene 0位为菌名，1位为基因组长度bp，后续为基因编号
	$temp = $temp1[1];
	chomp($temp);
	if($temp =~/\r$/){chop($temp);}
	#print "$temp\t";
	$gene[$num] = $temp;
	#print "$gene[$num]\t";
	$num++;
}
close (Genes) or die ("can not close genes file: $!");

print "number of genes: $num\n";
my $name;
my $left = "left";
my $right = "right";
my $l;
my $r;
foreach $name (@gene[2..$num-1]){
	chomp($name);
	#print "$name\t";
	$l = $left.$name;
	$r = $right.$name;
	$location{$l} = 0;
	$location{$r} = 0;
	#print "$l\t";
	#print "$location{$l}\t","$r\t","$location{$r}\n";
}
#my @key = keys(%location);
#print "@key\n";
my $judge =0;
my $word1 = "locus_tag";
my $word2 = "CDS";
open(Annotate,"$ARGV[0]") or die ("No annotate file finded: $!");
open(OUT,">result.txt") or die ("can not write result: $!");
print OUT "Genes\tleft\tright\n";
while(<Annotate>){
	chomp;
	if(/$word1/){
		if($judge == 2){$judge =0; next;}
		@temp3 = split(/\t/,$_);
		$temp = $temp3[-1];
		chomp($temp);
		if($temp=~/\r$/){chop($temp);}
		if(grep /$temp/, @gene){
			$l = $left.$temp;
			$r = $right.$temp;
			print "find $temp\t";
			print OUT "$temp\t";
			#print "$l\t";
			$judge =1;
			#$temp_gene = $temp3[-1];
			next;
		}
	}
	if($judge ==1){
		if(/$word2/){
			@temp2 = split(/\t/, $_);
			my $te1 = $temp2[0];
			chomp($te1);
			if($te1=~/\r$/){chop($te1);}
			my $te2 = $temp2[1];
			chomp($te2);
			if($te2=~/\r$/){chop($te2);}
			if($te1 > $ te2){$location{$l}=$te2;$location{$r}=$te1;}
			else{$location{$l}=$te1;$location{$r}=$te2;}
			#print "$l\t";
			print "$location{$l}\t";
			print OUT "$location{$l}\t";
			print "$location{$r}\n";
			print OUT "$location{$r}\n";
			$judge =2;
			#$temp_gene ="";
			next;
		}
	}
}
close (Annotate) or die ("can not close annotate file: $!");

#my %distance;
#print "$gene[1]\n";
my $row1 = join("\t",@gene[2..$num-1]);
#print "$row1\n";
print OUT "\n";
print OUT "Distance\t$row1\n";
my $dis;
my $dis1;
my $dis2;
my $i;
my $j;
#print $num."\n";
foreach $name (@gene[2..$num-1]){
	print OUT "$name\t";
	#基因1的起始位置key值$l1
	my $l1 = $left.$name;
	#基因1的终止位置key值$r1
	my $r1 = $right.$name;
	foreach my $name1 (@gene[2..$num-1]){
		#基因2的起始位置key值$l2
		my $l2 = $left.$name1;
		#基因2的起始位置key值$r2
		my $r2 = $right.$name1;
		if($name1 eq $gene[$num-1]){
			if($location{$l1} >= $location{$r2}){
				$dis1 = $location{$l1} - $location{$r2};
				$dis2 = $gene[1] - ($location{$r1} - $location{$l2});
				if($dis1 < $dis2){$dis = $dis1}else{$dis=$dis2}
				#print $dis."\n";
				print OUT "$dis\n";
			}elsif($location{$l2} >= $location{$r1}){
				$dis1 = $location{$l2} - $location{$r1};
				$dis2 = $gene[1] - ($location{$r2} - $location{$l1});
				if($dis1 < $dis2){$dis = $dis1}else{$dis=$dis2}
				#print $dis."\n";
				print OUT "$dis\n";
			}else{
				$dis = $location{$l1}-$location{$l2};
				if($dis < 0){$dis = -$dis}
				#print $dis."\n";
				print OUT "$dis\n";
			}
			next;
		}
		if($location{$l1} >= $location{$r2}){
			$dis1 = $location{$l1} - $location{$r2};
			$dis2 = $gene[1] - ($location{$r1} - $location{$l2});
			if($dis1 < $dis2){$dis = $dis1}else{$dis=$dis2}
			#print $dis."\t";
			print OUT "$dis\t";
		}elsif($location{$l2} >= $location{$r1}){
			$dis1 = $location{$l2} - $location{$r1};
			$dis2 = $gene[1] - ($location{$r2} - $location{$l1});
			if($dis1 < $dis2){$dis = $dis1}else{$dis=$dis2}
			#print $dis."\t";
			print OUT "$dis\t";
		}else{
			$dis = $location{$l1}-$location{$l2};
			if($dis < 0){$dis = -$dis}
			#print $dis."\t";
			print OUT "$dis\t";
		}
	}
}
close (OUT) or die ("can not close result file: $!");

原理是利用正则表达式以locus_tag和CDS识别基因编号和起始至终止位置；利用hash数组储存不同目的基因的起始和终止位置；再判断两两基因位置，当两基因A、B不重叠或包含，则计算A终止至B起始碱基数、B终止至A起始之间碱基数，取两者最小值；当两基因A、B重叠或包含时取A起始至B起始之间碱基数。

三、运行代码

以linux为例：

perl distance.pl ZJSH63.tbl genes.txt

三、输出文件

输出文件为result.txt
内容如下：
在这里插入图片描述
会打印目的基因的起始和终止位置，再打印两基因距离。

不足

对于NCBI的注释文件，其结构与prokka注释文件结构不一致，如GenBank注释文件xxx.gbff：
在这里插入图片描述
该文件识仍可以采用locus_tag和CDS，但是locus_tag后又引号，需要去除，分隔方式也不为制表符；利用CDS识别起始和终止位置也需要特殊处理。
需要直接表征目的基因在基因组上前后关系时，需要更改代码内基因位置判断关系的语句。
（懒得做了…第一次发布，请见谅）