基因组拼接完成甚至注释完之后,由于有个染色体因为Hi-C信号不是很丰富,导致着丝粒前的序列和着丝粒后的序列装反了,直到用jcvi开始共线性分析才发现,所以想办法直接对现有的两个文件,该染色体序列和该染色体对应的基因结构注释进行一个顺序变换。不过本操作有两个前提: 1、染色体仍是存在gap的;2、跨gap的基因应当被去除。 根据现在的情况,我要将下图右边的块移动到最左边。
由于跨gap的基因应该被去除,所以我只要将这条染色体根据gap再打断成contig(实际上不是传统意义上的contig,这里专指被N分开的若干段序列),重排contig之后就能获得新染色体序列,而每个contig上的基因相对于contig是不会变化的,只需要计算contig移动之后基因的新坐标就可以获得新染色体基因注释。
1、使用Perl脚本根据gap打断染色体
#!/usr/bin/perl
use strict;
use warnings;
my $author = "Alter X";
my $usage = "$0 <input_chr.fa> <output_broken_contig.list>";
die($usage) unless (@ARGV == 2);
my $fasta_input_file = $ARGV[0];
my $broken_contig_file = $ARGV[1];
my $seq_id;
my $seq = "";
my @seq_data;
open(my $fasta_input,"<",$fasta_input_file) or die "$!";
while(<$fasta_input>){
chomp;
$seq_id = $1 and next if (m/^>(.*)/);
$seq .= $_;
}
close($fasta_input);
#最精华的一条代码,实现根据gap打断染色体,同时保留gap的序列,分别从存入数组seq_data中
@seq_data = $seq =~ /([ACGT]+|N+)/g;
my $num = 1;
my $length;
my $start = 0;
my $end = 0;
open(my $broken_contig,">",$broken_contig_file);
for my $sub_seq (@seq_data){
$start = $end + 1;
$length = length($sub_seq);
$end = $start + $length - 1;
if ($sub_seq =~ /[ACGT]+/){
my @output = ($seq_id,"contig".$num,$start,$end,$length,"W");
print $broken_contig join("\t",@output)."\n";
$num += 1;
}elsif ($sub_seq =~ /N+/){
my @output = ($seq_id,"contig".$num,$start,$end,$length,"U");
print $broken_contig join("\t",@output)."\n";
$num += 1;
}else{
die "Split seq failed: $!";
}
}
close($broken_contig);
2、根据重排需求手动调整表格顺序
通过上述perl脚本可以获得如下表格,存储了按gap分割之后每个contig的坐标和长度,以及是否为gap contig。
Chromosome17 contig1 1 181000 181000 W
Chromosome17 contig2 181001 181100 100 U
Chromosome17 contig3 181101 182100 1000 W
...
Chromosome17 contig157 10331638 10332637 1000 W
Chromosome17 contig158 10332638 10332737 100 U
Chromosome17 contig159 10332738 10333737 1000 W
然后通过TRF重复序列注释结果、基因注释结果、Hi-C热图、共线性等等来判断应该将哪个contig移动到哪里。我在前面的jcvi共线性图中的两个大共线性块之间的contig99中发现了端粒重复序列,进一步确定了确实是装反了,所以我要将contig99-contig159移动到contig1之前,所以修改这一表格得到如下结果。
Chromosome17 contig99 6126377 6251547 125171 W
Chromosome17 contig100 6251548 6251647 100 U
Chromosome17 contig101 6251648 6435647 184000 W
...
Chromosome17 contig95 5868875 5878874 10000 W
Chromosome17 contig96 5878875 5878974 100 U
Chromosome17 contig97 5878975 6126276 247302 W
3、使用python脚本获得修复好的染色体序列和基因注释
(需要提前安装seqkit并放入环境变量)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import sys
import argparse
import logging
import gffutils
import subprocess
#Author = Alter X
def fix(args):
new_contig_list_file = args.list
gene_overlap_contigs_file = '%s.gene_overlap_contigs_file.gff3' % args.prefix
new_gff3_file = '%s.fixed.gff3' % args.prefix
old_gff3_file = args.gff
new_chr_fasta_file = '%s.fixed.fa' % args.prefix
old_chr_fasta_file = args.chr
gff3_db = gffutils.create_db(old_gff3_file, dbfn = "gff3.db", force = True, keep_order = True)
new_contig_data = []
with open(new_contig_list_file, "r") as contig_file:
for line in contig_file:
columns = line.strip().split("\t")
new_contig_data.append(columns)
new_start = 1
new_end = 1
err = open(gene_overlap_contigs_file, "w")
new_gff3 = open(new_gff3_file, "w")
new_chr_fasta = open(new_chr_fasta_file, "w")
seq_id = None
for data_row in new_contig_data:
chr_id = data_row[0]
old_start = int(data_row[2])
old_end = int(data_row[3])
length = int(data_row[4])
seqkit_command = "seqkit subseq -r {}:{} {}".format(old_start,old_end,old_chr_fasta_file)
subseq = subprocess.run(seqkit_command, shell = True, stdout = subprocess.PIPE, stderr = subprocess.PIPE, text = True)
if subseq.returncode == 0:
lines = subseq.stdout.split('\n')
for line in lines:
match = re.match(r'^>(.*)',line)
if match and seq_id is None:
seq_id = match.group(1)
new_chr_fasta.write(">"+seq_id+"\n")
continue
elif not match:
stripped_line = line.strip()
new_chr_fasta.write(stripped_line)
else:
print(subseq.stderr)
overlap_element_db = gff3_db.region(seqid = chr_id, start = old_start, end = old_end, completely_within = False)
for element in overlap_element_db:
if element.start < old_start or element.end > old_end:
err.write(str(element) + "\n")
continue
#最重要的一段,计算contig坐标的变换
#每个元素的旧起始/终止位点减去contig的旧起始位点,就能得到每个元素相对于contig的偏移值
#再加上contig的新起始位点,就能得到每个元素的新坐标
element.start = element.start - old_start + new_start
element.end = element.end - old_start + new_start
new_gff3.write(str(element) + "\n")
#基于contig的长度
new_start += length
new_gff3.close()
err.close()
new_chr_fasta.close()
subprocess.run("seqkit seq -w 60 {} > {}".format(new_chr_fasta_file, new_chr_fasta_file + ".temp"), shell=True)
subprocess.run("mv {} {}".format(new_chr_fasta_file + ".temp", new_chr_fasta_file), shell=True)
def main():
logging.basicConfig(
stream=sys.stderr,
level=logging.INFO,
format="[%(levelname)s] %(message)s"
)
parser = argparse.ArgumentParser(
formatter_class=argparse.RawTextHelpFormatter,
description="Fix gff3 file with reorder broken contigs, must run this script on a single chromosome once time")
parser.add_argument('-g', '--gff', required=True, help='old gff3 file')
parser.add_argument('-c', '--chr', required=True, help='old Chromosome fasta file')
parser.add_argument('-l', '--list', required=True, help='a file with reorder broken contigs\n'
'chr\tb_contig\tstart\tend\tlength\tgap(W/U)\n'
'chr1\tb_contig1\t1\t5000\t5000\tW\n'
'chr1\tb_contig2\t5001\t5100\t100\tU\n')
parser.add_argument('-p', '--prefix', default='result', help='prefix of output, final output contains: \n'
'"prefix.gene_overlap_contigs_file.gff3" - a file show old genes with overlap on different contigs, especially overlap on gap\n'
'"prefix.fixed.fa" - a file which fixed the Chromosome fasta file with reorder broken contigs\n'
'"prefix.fixed.gff3" - a file which fixed the coordinate with reorder broken contigs')
args = parser.parse_args()
fix(args)
if __name__ == "__main__":
main()
使用帮助如下:
./FixChr.py -h
usage: FixChr.py [-h] -g GFF -c CHR -l LIST [-p PREFIX]
Fix gff3 file with reorder broken contigs, must run this script on a single chromosome once time
optional arguments:
-h, --help show this help message and exit
-g GFF, --gff GFF old gff3 file
-c CHR, --chr CHR old Chromosome fasta file
-l LIST, --list LIST a file with reorder broken contigs
chr b_contig start end length gap(W/U)
chr1 b_contig1 1 5000 5000 W
chr1 b_contig2 5001 5100 100 U
-p PREFIX, --prefix PREFIX
prefix of output, final output contains:
"prefix.gene_overlap_contigs_file.gff3" - a file show old genes with overlap on different contigs, especially overlap on gap
"prefix.fixed.fa" - a file which fixed the Chromosome fasta file with reorder broken contigs
"prefix.fixed.gff3" - a file which fixed the coordinate with reorder broken contigs
最后的prefix.fixed.fa
就是修复好的染色体序列,prefix.fixed.gff3
就是修复好的基因注释,prefix.gene_overlap_contigs_file.gff3
则储存了跨contig,尤其是跨gap的基因注释元素,可能会存在gene跨gap,但mRNA、CDS、exon没有跨,所以需要根据这个文件中的基因ID将对应的其它元素再从prefix.fixed.gff3
中去除掉。
想确认基因注释的坐标到底有没有修复正确,可以使用gffread将修复前后的转录本/蛋白序列提出来,用seqkit sort一下,再将两个文件diff一下,如果完全一样,或者只有存在prefix.gene_overlap_contigs_file.gff3
中的元素不一样,就说明坐标修复是没有问题的。