根据位置从gff中提取gene

27 篇文章 4 订阅

小程序-根据位置从gff中提取gene

1 需要的文件格式

1:  染色体 位置

2 程序

 1:  use strict;
 2:  use warnings;
 3:   
 4:  my (@information,$chr,%hash,$key1,$key2,);
 5:  
 6:  #构建gff的hash
 7:  
 8:  my  $in_in = "saccharomyces_cerevisiae_R64-1-1_20110208.gff";
 9:  open  my $in, '<', $in_in or die "cannot open\n";
10:  while(<$in>)
11:  {
12:      chomp;
13:      next if /^#/;
14:      @information=split/\s+/,$_;
15:      if ($information[2] eq "gene")
16:      {
17:          $information[0]=~/^chr(.+)/;
18:          $chr=$1;
19:          $information[8]=~/^ID=(.+?);/;
20:          $hash{$chr}{$information[3]}{$information[4]} = $1;
21:      }
22:  }
23:  close  $in;
24:   
25:   
26:  my  $out_out = "pos_with_gene.txt";
27:  open  my $out, '>', $out_out or die  "failed open$!\n";
28:  #根据染色体和位置信息,遍历寻找 
29:   
30:  my  $in1_in = "pos_to_find_gen.txt";
31:  open  my $in1, '<', $in1_in or die "cannot open\n";
32:  while(<$in1>)
33:  {
34:      chomp;
35:      next if /^\s+$/;
36:      @information=split/\s+/,$_;
37:      foreach $key1 (sort keys %{$hash{$information[0]}})
38:      {
39:          foreach $key2 (sort keys %{$hash{$information[0]}{$key1}})
40:          {
41:              if ($information[1]>=$key1 and $information[1]<=$key2)
42:              {
43:                  print $out "@information[0,1] $hash{$information[0]}{$key1}{$key2}\n";
44:                  last;
45:              }
46:              else
47:              {
48:                  next;
49:              }
50:          }
51:      }
52:  }
53:  close  $in1;
54:  close  $out;

Author: grc <grc@grc>

Date: 2013-06-29 22:50:30 CST

HTML generated by org-mode 6.33x in emacs 23


可以使用Biopython和pandas库来解析基因文件gff3文件,并提取启动子序列。 首先,需要从基因文件读取DNA序列。假设基因文件是fasta格式,可以使用Biopython的SeqIO模块读取序列: ```python from Bio import SeqIO genome_file = "genome.fasta" genome_seq = SeqIO.read(genome_file, "fasta").seq ``` 接下来,需要从gff3文件提取基因信息和其位置。可以使用pandas库读取gff3文件,并筛选出基因信息: ```python import pandas as pd gff_file = "genome.gff3" gff_df = pd.read_csv(gff_file, sep="\t", comment="#", header=None) gff_df.columns = ["seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes"] gene_df = gff_df[gff_df["type"]=="gene"] ``` 然后,可以根据基因位置提取其启动子序列。假设启动子长度为1000个碱基,可以根据基因的方向,从基因的上游或下游位置提取启动子序列: ```python upstream_len = 1000 promoter_seqs = [] for index, row in gene_df.iterrows(): gene_start = row["start"] gene_end = row["end"] gene_strand = row["strand"] if gene_strand == "+": promoter_start = max(gene_start - upstream_len, 0) promoter_end = gene_start else: promoter_start = gene_end promoter_end = gene_end + upstream_len if promoter_end > len(genome_seq): promoter_end = len(genome_seq) promoter_seq = genome_seq[promoter_start:promoter_end] promoter_seqs.append(promoter_seq) ``` 最后,可以将启动子序列保存到文件: ```python with open("promoters.fasta", "w") as f: for i, promoter_seq in enumerate(promoter_seqs): f.write(">promoter_{}\n{}\n".format(i+1, promoter_seq)) ```
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值