filter_feature_by_attribute_value.pl-脚本03

最新推荐文章于 2024-09-26 16:43:41 发布

让学习成为一种生活方式

最新推荐文章于 2024-09-26 16:43:41 发布

阅读量206

点赞数 1

分类专栏： linux shell 生物信息学文章标签：机器学习深度学习人工智能

本文链接：https://blog.csdn.net/weixin_44874487/article/details/142401767

版权

生物信息学同时被 3 个专栏收录

69 篇文章 5 订阅

订阅专栏

linux

26 篇文章 1 订阅

订阅专栏

shell

13 篇文章 0 订阅

订阅专栏

本脚本旨在根据属性值（第九列）过滤特征。

如果属性存在且值未通过测试，则该特征将被写入输出文件。

如果属性存在且值通过测试，则该特征将被丢弃并写入 `_discarded.gff` 文件。

如果属性标签缺失（无法应用测试），则该特征将默认写入输出文件。如果启用了 `--na_aside` 参数，则这些特征将被写入 `_na.gff` 文件。

属性存储在第九列，格式为：`tag=value`。
/!\ 删除一级或二级特征将自动删除所有关联的子特征。
/!\ 删除一个特征的所有子特征也会自动删除该特征本身（除非启用了 `--keep_parental`）。
/!\ 如果 `--keep_parental` 未启用且 `--na_aside` 已启用，并且记录的所有三级特征在 `_na.gff` 和 `_discarded.gff` 之间被拆分，则父级的一、二级特征将被删除，并最终只保留在 `_na.gff` 文件中。

1 使用方法

agat_sp_filter_feature_by_attribute_value.pl --gff infile.gff --value 1 -t "=" [ --output outfile ]
agat_sp_filter_feature_by_attribute_value.pl --help

2 选项

-f, --reffile, --gff or -ref  
  输入的 GFF3 文件。

-a or --attribute
  指定要分析的属性标签（例如：`tag=value`）。

-p, --type or -l 
  主标签选项，不区分大小写，可接受列表。允许指定要处理的特征类型。可以通过第 3 列中的主标签名称指定特定特征，如：`cds`、`Gene`、`MrNa`。也可以直接指定特定级别的所有特征，如：`level2=mRNA,ncRNA,tRNA`，`level3=CDS,exon,UTR` 等。默认情况下，所有特征都会被考虑。填写 `all` 具有相同的效果。

--value  
  检查属性中的值，区分大小写。多个值须用逗号分隔。

--value_insensitive  
  布尔值，默认关闭。启用时，通过 `--value` 参数提供的值将不区分大小写进行处理。

--na_aside  
  布尔值，默认关闭。默认情况下，如果缺少用于过滤的属性标签，则该特征会写入输出文件。启用后，这些特征将被写入一个名为 `_na.gff` 的单独文件中。

--keep_parental
  布尔值，默认关闭。启用时，即使所有子特征已被删除，父级特征也会保留。

-t or --test  
  应用的测试（> < = ! >= <=），默认值为 `=`。如果使用 `>` 或 `<`，请记得对参数加引号，如 `"<="`，否则终端可能会出错。仅 `=` 和 `!` 可以用于比较字符串值。

-o or --output  
  输出 GFF 文件。如果未指定输出文件，结果将写入标准输出（STDOUT）。

-v  
  调试用途的详细输出选项。

-c or --config  
  字符串 - 输入 AGAT 配置文件。默认情况下，AGAT 会使用当前工作目录下的 `agat_config.yaml` 文件（如果存在），否则使用 AGAT 自带的原始配置文件。要在本地获取配置文件，请运行 `"agat config --expose"`。使用 `--config` 选项可以指定使用自定义的 AGAT 配置文件（位于其他位置或不同名称）。

-h or --help  
  显示此帮助文本。

3 脚本内容

#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Long;
use File::Basename;
use POSIX qw(strftime);
use Scalar::Util qw(looks_like_number);
use Pod::Usage;
use IO::File;
use AGAT::AGAT;

my $header = get_agat_header();
my $config;
my $primaryTag=undef;
my $opt_output= undef;
my $opt_value = undef;
my $opt_keep_parental = undef;
my $opt_na_aside = undef;
my $opt_value_insensitive = undef;
my $opt_attribute = undef;
my $opt_test = "=";
my $opt_gff = undef;
my $opt_verbose = undef;
my $opt_help;

# OPTION MANAGMENT
my @copyARGV=@ARGV;
if ( !GetOptions( 'f|ref|reffile|gff=s' => \$opt_gff,
                  'value=s'             => \$opt_value,
                  'value_insensitive!'  => \$opt_value_insensitive,
                  'keep_parental!'      => \$opt_keep_parental,
                  'na_aside!'           => \$opt_na_aside, 
                  "p|type|l=s"          => \$primaryTag,
                  'a|attribute=s'       => \$opt_attribute,
                  't|test=s'            => \$opt_test,
                  'o|output=s'          => \$opt_output,
                  'v|verbose!'          => \$opt_verbose,
                  'c|config=s'          => \$config,
                  'h|help!'             => \$opt_help ) )
{
    pod2usage( { -message => 'Failed to parse command line',
                 -verbose => 1,
                 -exitval => 1 } );
}

if ($opt_help) {
    pod2usage( { -verbose => 99,
                 -exitval => 0,
                 -message => "$header\n" } );
}

if ( ! $opt_gff or ! defined($opt_value) or ! $opt_attribute ){
    pod2usage( {
           -message => "$header\nAt least 3 parameters are mandatory:\n1) Input reference gff file: --gff\n".
           "2) An attribute tag: -a\n3) A value (string or int) that will be used for filtering: --value\n\n",
           -verbose => 0,
           -exitval => 2 } );
}

# --- Manage config ---
$config = get_agat_config({config_file_in => $config});

###############
# Test options
if($opt_test ne "<" and $opt_test ne ">" and $opt_test ne "<=" and $opt_test ne ">=" and $opt_test ne "=" and $opt_test ne "!"){
  print "The test to apply is Wrong: $opt_test.\nWe want something among this list: <,>,<=,>=,! or =.";exit;
}

###############
# Manage Output

## FOR GFF FILE
my $gffout_ok_file ;
my $fhout_discarded_file ;
my $ostreamReport_file;
my $fhout_semidDiscarded_file if $opt_na_aside;

if ($opt_output) {
  my ($outfile,$path,$ext) = fileparse($opt_output,qr/\.[^.]*/);

  # set file names
  $gffout_ok_file = $path.$outfile.$ext;
  $fhout_discarded_file = $path.$outfile."_discarded.gff";
  $ostreamReport_file = $path.$outfile."_report.txt";
  $fhout_semidDiscarded_file = $path.$outfile."_na.gff";
}

my $gffout_ok = prepare_gffout($config, $gffout_ok_file);
my $fhout_discarded = prepare_gffout($config, $fhout_discarded_file);
my $ostreamReport = prepare_fileout($ostreamReport_file);
my $fhout_semidDiscarded = prepare_gffout($config, $fhout_semidDiscarded_file) if $opt_na_aside;

# Manage $primaryTag
my @ptagList;
my $print_feature_string;
if(! $primaryTag or $primaryTag eq "all"){
  $print_feature_string = "all features";
  push(@ptagList, "all");
}
elsif($primaryTag =~/^level[123]$/){
  $print_feature_string .= "$primaryTag features ";
  push(@ptagList, $primaryTag);
}
else{
   @ptagList= split(/,/, $primaryTag);
   foreach my $tag (@ptagList){
      if($tag =~/^level[123]$/){
        $print_feature_string .= "$primaryTag features ";
      }
      else{
        $print_feature_string .= "$tag feature ";
      }
   }
}

# Transform value list into hash
my $value_hash = string_sep_to_hash({ string => $opt_value,
                                      separator => ","
                                    });
                                      
foreach my $value (keys %{$value_hash}){
  if( ! looks_like_number($value) ){
    if($opt_test ne "=" and $opt_test ne "!"){
      print "This test $opt_test is not possible with string value.\n";
      exit; 
    }
  }
}

# start with some interesting information
my $stringPrint = strftime "%m/%d/%Y at %Hh%Mm%Ss", localtime;
$stringPrint .= "\nusage: $0 @copyARGV\n";
$stringPrint .= "We will discard $print_feature_string that have the attribute $opt_attribute with the value $opt_test $opt_value";
if ($opt_value_insensitive){
  $stringPrint .= " case insensitive.\n";
}else{
   $stringPrint .= " case sensitive.\n";
}

if ($opt_output){
  print $ostreamReport $stringPrint;
  print $stringPrint;
}
else{ print $stringPrint; }
                          #######################
# >>>>>>>>>>>>>>>>>>>>>>>>#        MAIN         #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                          #######################
my %all_cases = ( 'left' => {'l1' => 0, 'l2' => 0, 'l3' => 0, 'all' => 0},
                  'discarded' => {'l1' => 0, 'l2' => 0, 'l3' => 0, 'all' => 0} );

######################
### Parse GFF input #
my ($hash_omniscient, $hash_mRNAGeneLink) =  slurp_gff3_file_JD({ input => $opt_gff,
                                                                  config => $config
                                                                });
print("Parsing Finished\n");
### END Parse GFF input #
#########################
# sort by seq id
my $hash_sortBySeq = gather_and_sort_l1_by_seq_id($hash_omniscient);

my $removeit=undef;
#################
# == LEVEL 1 == #
#################
foreach my $seqid (sort { (($a =~ /(\d+)$/)[0] || 0) <=> (($b =~ /(\d+)$/)[0] || 0) } keys %{$hash_sortBySeq}){ # loop over all the feature level1

	foreach my $tag_l1 (sort {$a cmp $b} keys %{$hash_omniscient->{'level1'}}){
		foreach my $feature_l1 ( @{$hash_sortBySeq->{$seqid}{$tag_l1}} ){
			my $id_l1 = lc($feature_l1->_tag_value('ID'));

	    $removeit = check_feature($feature_l1, 'level1');
			# we can remove feature L1 now because we are looping over $hash_sortBySeq not $hash_omniscient
	    if ($removeit){
        my $cases;
        if($removeit == 1){ 
          $cases = remove_l1_and_relatives($hash_omniscient, $feature_l1, $fhout_discarded);
          $all_cases{'discarded'}{'l1'} += $cases->{'l1'};
          $all_cases{'discarded'}{'l2'} += $cases->{'l2'};
          $all_cases{'discarded'}{'l3'} += $cases->{'l3'};
          $all_cases{'discarded'}{'all'} += $cases->{'all'};
        } elsif ($removeit == 2 and $opt_na_aside){ 
          $cases = remove_l1_and_relatives($hash_omniscient, $feature_l1, $fhout_semidDiscarded);
          $all_cases{'na'}{'l1'} += $cases->{'l1'};
          $all_cases{'na'}{'l2'} += $cases->{'l2'};
          $all_cases{'na'}{'l3'} += $cases->{'l3'};
          $all_cases{'na'}{'all'} += $cases->{'all'};
        }
				next;
	    }

	    #################
	    # == LEVEL 2 == #
	    #################
			my %list_l2_to_remove;
			my %list_l3_to_remove;
	    foreach my $tag_l2 (sort keys %{$hash_omniscient->{'level2'}}){ # primary_tag_key_level2 = mrna or mirna or ncrna or trna etc...

	      if ( exists_keys( $hash_omniscient, ('level2', $tag_l2, $id_l1) ) ){
	        my @list_fl2 = @{$hash_omniscient->{'level2'}{$tag_l2}{$id_l1}};
	        foreach my $feature_l2 ( @list_fl2 ) {

	          $removeit = check_feature($feature_l2,'level2');
	          if ($removeit){
              if ($removeit == 1){
                push ( @{$list_l2_to_remove{'discarded'}}, [$feature_l2, $tag_l1, $id_l1, $fhout_discarded]);
              } elsif ($removeit == 2 and $opt_na_aside){
                push ( @{$list_l2_to_remove{'na'}}, [$feature_l2, $tag_l1, $id_l1, $fhout_semidDiscarded]);
              }
	            next;
	          }
	          #################
	          # == LEVEL 3 == #
	          #################
	          my $id_l2 = lc($feature_l2->_tag_value('ID'));

	          foreach my $tag_l3 (sort keys %{$hash_omniscient->{'level3'}}){ # primary_tag_key_level3 = cds or exon or start_codon or utr etc...
	            if ( exists_keys( $hash_omniscient, ('level3', $tag_l3, $id_l2) ) ){
	              my @list_fl3 = @{$hash_omniscient->{'level3'}{$tag_l3}{$id_l2}};
	              foreach my $feature_l3 ( @list_fl3 ) {

	                $removeit = check_feature($feature_l3, 'level3');
                  if($removeit ==  1){
                    push ( @{$list_l3_to_remove{'discarded'}}, [$feature_l3, $tag_l1, $id_l1, $tag_l2, $id_l2, $fhout_discarded]); 
                  } elsif ( $removeit == 2 and $opt_na_aside ){
                    push ( @{$list_l3_to_remove{'na'}}, [$feature_l3, $tag_l1, $id_l1, $tag_l2, $id_l2, $fhout_semidDiscarded]);
	                }
	              }
	            }
	          }
	        }
	      }
	    }
			# Should be removed after looping over them to avoid problems
			foreach my $key ( keys %list_l2_to_remove ){
				foreach my $infos ( @{$list_l2_to_remove{$key}} ) {
					my $cases = remove_l2_and_relatives( $hash_omniscient, @$infos, $opt_keep_parental);
					$all_cases{$key}{'l1'} += $cases->{'l1'};
					$all_cases{$key}{'l2'} += $cases->{'l2'};
					$all_cases{$key}{'l3'} += $cases->{'l3'};
					$all_cases{$key}{'all'} += $cases->{'all'};
				}
			}
			foreach my $key ( sort keys %list_l3_to_remove ){
				foreach my $infos ( @{$list_l3_to_remove{$key}} ) {
					my $cases = remove_l3_and_relatives( $hash_omniscient, @$infos, $opt_keep_parental);
					$all_cases{$key}{'l1'} += $cases->{'l1'};
					$all_cases{$key}{'l2'} += $cases->{'l2'};
					$all_cases{$key}{'l3'} += $cases->{'l3'};
					$all_cases{$key}{'all'} += $cases->{'all'};
				}
			}
		}
  }
}

print_omniscient( {omniscient => $hash_omniscient, output => $gffout_ok} );

$stringPrint = "Feature discarded by applying the test (see $fhout_discarded_file file):\n";
$stringPrint .= $all_cases{'discarded'}{'all'}." features removed:\n";
$stringPrint .= $all_cases{'discarded'}{'l1'}." features level1 (e.g. gene) removed\n";
$stringPrint .= $all_cases{'discarded'}{'l2'}." features level2 (e.g. mRNA) removed\n";
$stringPrint .= $all_cases{'discarded'}{'l3'}." features level3 (e.g. exon) removed\n";

if($opt_na_aside){
  $stringPrint .= "Feature left out because the attribute is missing (see $fhout_semidDiscarded_file file):\n";
  $stringPrint .= $all_cases{'na'}{'all'}." features removed:\n";
  $stringPrint .= $all_cases{'na'}{'l1'}." features level1 (e.g. gene) removed\n";
  $stringPrint .= $all_cases{'na'}{'l2'}." features level2 (e.g. mRNA) removed\n";
  $stringPrint .= $all_cases{'na'}{'l3'}." features level3 (e.g. exon) removed\n";
}

if ($opt_output){
  print $ostreamReport $stringPrint;
  print $stringPrint;
} else{ print $stringPrint; }

#######################################################################################################################
        ####################
         #     methods    #
          ################
           ##############
            ############
             ##########
              ########
               ######
                ####
                 ##

sub check_feature{
  my  ($feature, $level)=@_;

  my $removeit=undef;
  my $primary_tag=$feature->primary_tag;

  # check primary tag (feature type) to handle
  foreach my $ptag (@ptagList){

    if($ptag eq "all"){
      $removeit = should_we_remove_feature($feature);
    }
    elsif(lc($ptag) eq $level){
      $removeit = should_we_remove_feature($feature);
    }
    elsif(lc($ptag) eq lc($primary_tag) ){
      $removeit = should_we_remove_feature($feature);
    }
  }
  return $removeit;
}

sub should_we_remove_feature{
  my ($feature)=@_;

  if ($feature->has_tag($opt_attribute)){

    # get list of values for the attribute
    my @file_values = $feature->get_tag_values($opt_attribute);

    # if we found among the values one pass the test we return 1
    foreach my $file_value (@file_values){

      foreach my $given_value (keys %{$value_hash}){
        # Deal with insensitive for template
        if ($opt_value_insensitive){
          $given_value = lc($given_value);
          $file_value = lc($file_value);
        }
        # for string values replace = by eq and ! by ne and avoid other type of test
        if ( ! looks_like_number ($given_value) or ! looks_like_number ($file_value)){
          print "String case\n" if $opt_verbose;
          if ($opt_test eq "="){
            if ($file_value eq $given_value) { print "equal\n" if $opt_verbose; return 1; }
            else { print "not equal\n" if $opt_verbose; }
          }
          elsif ($opt_test eq "!"){
            if ($file_value ne $given_value){ print "different\n" if $opt_verbose; return 1; }
            else { print "not different\n" if $opt_verbose; }
          }
        } 
        else{
          print "Number case\n" if $opt_verbose;
          if ($opt_test eq "="){
            if ($file_value == $given_value){return 1; }
          }
          elsif ($opt_test eq "!"){
            if ($file_value != $given_value){return 1; }
          }
          elsif ($opt_test eq ">"){
            if ($file_value > $given_value){return 1; }
          }
          elsif ($opt_test eq "<"){
            if ($file_value < $given_value){return 1; }
          }
          elsif ($opt_test eq "<="){
            if ($file_value <= $given_value){return 1; }
          }
          elsif ($opt_test eq ">="){
            if ($file_value >= $given_value){return 1; }
          }
        }
      }
    }
    return 0;
  } else {
    print "Attribute not found  case\n" if $opt_verbose;
    return 2;
  }
}

__END__

=head1 NAME

agat_sp_filter_feature_by_attribute_value.pl

=head1 DESCRIPTION

The script aims to filter features according to attribute value (9th column).
- If the attribute exists and the value do not pass the test, the feature is written into <output>.
- If the attribute exists and the value pass the test, the feature is discarded and written into <output>_discarded.gff.
- If the attribute tag is missing (test cannot be applyed), the feature will be written into <output> by default. If --na_aside parameter 
is activated then it will be written into <output>_na.gff.

Attribute are stored in the 9th column and have this shape: tag=value
/!\ Removing a level1 or level2 feature will automatically remove all linked subfeatures.
/!\ Removing all children of a feature will automatically remove this feature too (excepted if --keep_parental is activated).
/!\ If --keep_parental is not activated and --na_aside is activated, and all level3 features of a record are split between both <output>_na.gff and <output>_discarded.gff, 
then the parental level1 and level2 features are removed and will end up in the <output>_na.gff file only.

=head1 SYNOPSIS

    agat_sp_filter_feature_by_attribute_value.pl --gff infile.gff --value 1 -t "=" [ --output outfile ]
    agat_sp_filter_feature_by_attribute_value.pl --help

=head1 OPTIONS

=over 8

=item B<-f>, B<--reffile>, B<--gff>  or B<-ref>

Input GFF3 file that will be read

=item  B<-a> or B<--attribute>

Attribute tag to specify the attribute to analyse (attribute example: tag=value).

=item B<-p>,  B<--type> or  B<-l>

primary tag option, case insensitive, list. Allow to specied the feature types that will be handled.
You can specified a specific feature by given its primary tag name (column 3) as: cds, Gene, MrNa
You can specify directly all the feature of a particular level:
      level2=mRNA,ncRNA,tRNA,etc
      level3=CDS,exon,UTR,etc
By default all feature are taking into account. fill the option by the value "all" will have the same behaviour.

=item B<--value>

Value(s) to check in the attribute. Case sensitive. List of values must be coma separated.

=item B<--value_insensitive>

Bolean. Deactivated by default. When activated the values provided by the --value parameter are handled case insensitive.

=item B<--na_aside>

Bolean. Deactivated by default. By default if the attribute tag on which the filter is based is missing, the feature will be written into <output>.
When activated, such features will be written into a separate file called <output>_na.gff.

=item B<--keep_parental>

Bolean. Deactivated by default. When activated even if all child features have been removed, the parental one will be kept.

=item B<-t> or B<--test>

Test to apply (> < = ! >= <=). default value "=". 
If you use one of these two character >, <, please don't forget to quote the
parameter like that "<=" otherwise your terminal will complain.
Only = and ! tests can be used to compare string values.

=item B<-o> or B<--output>

Output GFF file. If no output file is specified, the output will be
written to STDOUT.

=item B<-v>

Verbose option for debugging purpose.

=item B<-c> or B<--config>

String - Input agat config file. By default AGAT takes as input agat_config.yaml file from the working directory if any, 
otherwise it takes the orignal agat_config.yaml shipped with AGAT. To get the agat_config.yaml locally type: "agat config --expose".
The --config option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).

=item B<-h> or B<--help>

Display this helpful text.

=back

=head1 FEEDBACK

=head2 Did you find a bug?

Do not hesitate to report bugs to help us keep track of the bugs and their
resolution. Please use the GitHub issue tracking system available at this
address:

            https://github.com/NBISweden/AGAT/issues

 Ensure that the bug was not already reported by searching under Issues.
 If you're unable to find an (open) issue addressing the problem, open a new one.
 Try as much as possible to include in the issue when relevant:
 - a clear description,
 - as much relevant information as possible,
 - the command used,
 - a data sample,
 - an explanation of the expected behaviour that is not occurring.

=head2 Do you want to contribute?

You are very welcome, visit this address for the Contributing guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md

=cut

AUTHOR - Jacques Dainat