perl脚本根据gff3文件的提取CDS最长的转录本

Alter X

已于 2024-03-06 12:00:11 修改

阅读量1.1k

点赞数 9

分类专栏：生物信息学学习笔记 perl 文章标签： perl 生物信息学

于 2023-12-05 14:34:11 首次发布

本文链接：https://blog.csdn.net/Alter_X/article/details/134806521

版权

学习笔记同时被 3 个专栏收录

10 篇文章 0 订阅

订阅专栏

生物信息学

8 篇文章 0 订阅

订阅专栏

perl

3 篇文章 0 订阅

订阅专栏

最近要进行比较基因组的分析，下载了各式各样的基因组注释，总会有一些和现有脚本和软件不兼容的，尤其是在CoGe上下载了一个奇怪的gff3注释，有些CDS直接没有parent注释导致频频报错，提个最长转录本都提不出来，直接写个新的脚本好了。

#!/usr/bin/perl
use strict;
use warnings;

my $usage = "$0 <input.gff3> <output.list>";
die($usage) unless (@ARGV == 2);

# 输入文件和输出文件
my $input_file = $ARGV[0];
my $output_file = $ARGV[1];

# 打开输入文件
open(ERR,">",'err_line.txt');
open(my $in,'<',$input_file) or die "无法打开输入文件 $input_file: $!";
# 初始化变量
my %gene_data;
my $gene_id;

# 逐行读取GFF3文件
while (my $line = <$in>) {
    next if $line =~ /^#/;  # 跳过注释行
    chomp $line;
    my @fields = split(/\t/, $line);
    my ($feature, $start, $end, $strand, $attributes) = @fields[2, 3, 4, 6, 8];

    # 提取基因信息
    if ($feature eq 'gene') {
        $gene_id = $1 if($attributes =~ /ID=([^;]+)/);
        $gene_data{$gene_id} = {};
    }

    # 提取转录本信息
    if ($feature eq 'mRNA') {
        my $mRNA_id = $1 if($attributes =~ /ID=([^;]+)/);
        my $Parent_id = $1 if($attributes =~ /Parent=([^;]+)/);
        print ERR "$line\n" and next unless(defined $Parent_id);
        $gene_data{$Parent_id}{$mRNA_id} = {};
    }

    # 提取CDS信息
    if ($feature eq 'CDS') {
        my $CDS_id = $1 if($attributes =~ /ID=([^;]+)/);
        my $Parent_id = $1 if($attributes =~ /Parent=([^;]+)/);
        print ERR "$line\n" and next unless(defined $Parent_id);
        print ERR "$line\n" and next unless(exists $gene_data{$gene_id}{$Parent_id});
        # 存储CDS信息
        # 有的gff没有CDS ID,可以这样通过行号分辨一下
        $CDS_id = "$Parent_id\.CDS.$." unless(defined $CDS_id);
        $gene_data{$gene_id}{$Parent_id}{$CDS_id} = {
            'start' => $start,
            'end'   => $end
        }
    }
}
# 关闭输入文件

close($in);

# 打开输出文件
open(my $out,'>',$output_file) or die "无法打开输出文件 $output_file: $!";
# 遍历基因数据，提取最长转录本
foreach my $gene (keys(%gene_data)){
    my $longest_mRNA;
    my $longest_mRNA_length = 0;
    foreach my $mRNA (keys(%{$gene_data{$gene}})){
        my $mRNA_length = 0;
        # 计算最长转录本长度
        while(my($CDS_key,$CDS_value) = each(%{$gene_data{$gene}{$mRNA}})){
            my $CDS_length = $CDS_value->{end} - $CDS_value->{start} + 1;
            $mRNA_length += $CDS_length;
        }
        if($longest_mRNA_length < $mRNA_length){
            $longest_mRNA_length = $mRNA_length;
            $longest_mRNA = $mRNA;
        }
    }
    print ERR "noCDS: $gene\n" and next unless(defined $longest_mRNA);
    print $out "$longest_mRNA\n";
}
# 关闭输出文件
close($out);
close(ERR);
print "Completed, output file saved in $output_file\n";
print "Some errors saved in err_line.txt\n"

这个脚本根据gff3中每个CDS长度确认最长转录本，最后保存了一个包含最长转录本ID的文件，可以用seqkit从所有转录本中提取一下。

seqkit grep -f ouput.list genome.cds |seqkit seq -i > genome.longest.cds

脚本还会生成一个err_line.txt文件，把一些Parent没指明的行、CDS长度不对或没有CDS的行丢进去了。

Alter X

关注

9
点赞
踩
14

收藏

觉得还不错? 一键收藏
2
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录