Python+Perl联合解决基因组染色体拼接错误问题-CSDN博客

本文链接：https://blog.csdn.net/Alter_X/article/details/135093755

基因组拼接完成甚至注释完之后，由于有个染色体因为Hi-C信号不是很丰富，导致着丝粒前的序列和着丝粒后的序列装反了，直到用jcvi开始共线性分析才发现，所以想办法直接对现有的两个文件，该染色体序列和该染色体对应的基因结构注释进行一个顺序变换。不过本操作有两个前提： 1、染色体仍是存在gap的；2、跨gap的基因应当被去除。 根据现在的情况，我要将下图右边的块移动到最左边。
正好反了，把后面一大段移到最前面就能解决
由于跨gap的基因应该被去除，所以我只要将这条染色体根据gap再打断成contig（实际上不是传统意义上的contig，这里专指被N分开的若干段序列），重排contig之后就能获得新染色体序列，而每个contig上的基因相对于contig是不会变化的，只需要计算contig移动之后基因的新坐标就可以获得新染色体基因注释。

1、使用Perl脚本根据gap打断染色体

#!/usr/bin/perl
use strict;
use warnings;

my $author = "Alter X";
my $usage = "$0 <input_chr.fa> <output_broken_contig.list>";
die($usage) unless (@ARGV == 2);

my $fasta_input_file = $ARGV[0];
my $broken_contig_file = $ARGV[1];

my $seq_id;
my $seq = "";
my @seq_data;

open(my $fasta_input,"<",$fasta_input_file) or die "$!";
while(<$fasta_input>){
    chomp;
    $seq_id = $1 and next if (m/^>(.*)/);
    $seq .= $_;
}
close($fasta_input);

#最精华的一条代码，实现根据gap打断染色体，同时保留gap的序列，分别从存入数组seq_data中
@seq_data = $seq =~ /([ACGT]+|N+)/g;

my $num = 1;
my $length;
my $start = 0;
my $end = 0;

open(my $broken_contig,">",$broken_contig_file);
for my $sub_seq (@seq_data){
    $start = $end + 1;
    $length = length($sub_seq);
    $end = $start + $length - 1;
    if ($sub_seq =~ /[ACGT]+/){
        my @output = ($seq_id,"contig".$num,$start,$end,$length,"W");
        print $broken_contig join("\t",@output)."\n";
        $num += 1;
    }elsif ($sub_seq =~ /N+/){
        my @output = ($seq_id,"contig".$num,$start,$end,$length,"U");
        print $broken_contig join("\t",@output)."\n";
        $num += 1;
    }else{
        die "Split seq failed: $!";
    }
}
close($broken_contig);

2、根据重排需求手动调整表格顺序

通过上述perl脚本可以获得如下表格，存储了按gap分割之后每个contig的坐标和长度，以及是否为gap contig。

Chromosome17	contig1	1	181000	181000	W
Chromosome17	contig2	181001	181100	100	U
Chromosome17	contig3	181101	182100	1000	W
...
Chromosome17	contig157	10331638	10332637	1000	W
Chromosome17	contig158	10332638	10332737	100	U
Chromosome17	contig159	10332738	10333737	1000	W

然后通过TRF重复序列注释结果、基因注释结果、Hi-C热图、共线性等等来判断应该将哪个contig移动到哪里。我在前面的jcvi共线性图中的两个大共线性块之间的contig99中发现了端粒重复序列，进一步确定了确实是装反了，所以我要将contig99-contig159移动到contig1之前，所以修改这一表格得到如下结果。

Chromosome17	contig99	6126377	6251547	125171	W
Chromosome17	contig100	6251548	6251647	100	U
Chromosome17	contig101	6251648	6435647	184000	W
...
Chromosome17	contig95	5868875	5878874	10000	W
Chromosome17	contig96	5878875	5878974	100	U
Chromosome17	contig97	5878975	6126276	247302	W

3、使用python脚本获得修复好的染色体序列和基因注释

（需要提前安装seqkit并放入环境变量）

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re
import sys
import argparse
import logging
import gffutils
import subprocess
#Author = Alter X

def fix(args):
    new_contig_list_file = args.list
    gene_overlap_contigs_file = '%s.gene_overlap_contigs_file.gff3' % args.prefix
    new_gff3_file = '%s.fixed.gff3' % args.prefix
    old_gff3_file = args.gff
    new_chr_fasta_file = '%s.fixed.fa' % args.prefix
    old_chr_fasta_file = args.chr

    gff3_db = gffutils.create_db(old_gff3_file, dbfn = "gff3.db", force = True, keep_order = True)
    new_contig_data = []
    with open(new_contig_list_file, "r") as contig_file:
        for line in contig_file:
            columns = line.strip().split("\t")
            new_contig_data.append(columns)

    new_start = 1
    new_end = 1

    err = open(gene_overlap_contigs_file, "w")
    new_gff3 = open(new_gff3_file, "w")
    new_chr_fasta = open(new_chr_fasta_file, "w")

    seq_id = None

    for data_row in new_contig_data:
        chr_id = data_row[0]
        old_start = int(data_row[2])
        old_end = int(data_row[3])
        length = int(data_row[4])

        seqkit_command = "seqkit subseq -r {}:{} {}".format(old_start,old_end,old_chr_fasta_file)
        subseq = subprocess.run(seqkit_command, shell = True, stdout = subprocess.PIPE, stderr = subprocess.PIPE, text = True)
        if subseq.returncode == 0:
            lines = subseq.stdout.split('\n')
            for line in lines:
                match = re.match(r'^>(.*)',line)
                if match and seq_id is None:
                    seq_id = match.group(1)
                    new_chr_fasta.write(">"+seq_id+"\n")
                    continue
                elif not match:
                    stripped_line = line.strip()
                    new_chr_fasta.write(stripped_line)
        else:
            print(subseq.stderr)

        overlap_element_db = gff3_db.region(seqid = chr_id, start = old_start, end = old_end, completely_within = False)

        for element in overlap_element_db:
            if element.start < old_start or element.end > old_end:
                err.write(str(element) + "\n")
                continue

			#最重要的一段，计算contig坐标的变换
			#每个元素的旧起始/终止位点减去contig的旧起始位点，就能得到每个元素相对于contig的偏移值
			#再加上contig的新起始位点，就能得到每个元素的新坐标
            element.start = element.start - old_start + new_start
            element.end = element.end - old_start + new_start
            new_gff3.write(str(element) + "\n")

		#基于contig的长度
        new_start += length

    new_gff3.close()
    err.close()
    new_chr_fasta.close()

    subprocess.run("seqkit seq -w 60 {} > {}".format(new_chr_fasta_file, new_chr_fasta_file + ".temp"), shell=True)
    subprocess.run("mv {} {}".format(new_chr_fasta_file + ".temp", new_chr_fasta_file), shell=True)

def main():
    logging.basicConfig(
        stream=sys.stderr,
        level=logging.INFO,
        format="[%(levelname)s] %(message)s"
    )

    parser = argparse.ArgumentParser(
        formatter_class=argparse.RawTextHelpFormatter,
        description="Fix gff3 file with reorder broken contigs, must run this script on a single chromosome once time")

    parser.add_argument('-g', '--gff', required=True, help='old gff3 file')
    parser.add_argument('-c', '--chr', required=True, help='old Chromosome fasta file')
    parser.add_argument('-l', '--list', required=True, help='a file with reorder broken contigs\n'
                                                            'chr\tb_contig\tstart\tend\tlength\tgap(W/U)\n'
                                                            'chr1\tb_contig1\t1\t5000\t5000\tW\n'
                                                            'chr1\tb_contig2\t5001\t5100\t100\tU\n')
    parser.add_argument('-p', '--prefix', default='result', help='prefix of output, final output contains: \n'
                                                                 '"prefix.gene_overlap_contigs_file.gff3" - a file show old genes with overlap on different contigs, especially overlap on gap\n'
                                                                 '"prefix.fixed.fa" - a file which fixed the Chromosome fasta file with reorder broken contigs\n'
                                                                 '"prefix.fixed.gff3" - a file which fixed the coordinate with reorder broken contigs')
    args = parser.parse_args()

    fix(args)

if __name__ == "__main__":
    main()

使用帮助如下：

./FixChr.py -h
usage: FixChr.py [-h] -g GFF -c CHR -l LIST [-p PREFIX]

Fix gff3 file with reorder broken contigs, must run this script on a single chromosome once time

optional arguments:
  -h, --help            show this help message and exit
  -g GFF, --gff GFF     old gff3 file
  -c CHR, --chr CHR     old Chromosome fasta file
  -l LIST, --list LIST  a file with reorder broken contigs
                        chr	b_contig	start	end	length	gap(W/U)
                        chr1	b_contig1	1	5000	5000	W
                        chr1	b_contig2	5001	5100	100	U
  -p PREFIX, --prefix PREFIX
                        prefix of output, final output contains: 
                        "prefix.gene_overlap_contigs_file.gff3" - a file show old genes with overlap on different contigs, especially overlap on gap
                        "prefix.fixed.fa" - a file which fixed the Chromosome fasta file with reorder broken contigs
                        "prefix.fixed.gff3" - a file which fixed the coordinate with reorder broken contigs

最后的prefix.fixed.fa就是修复好的染色体序列，prefix.fixed.gff3就是修复好的基因注释，prefix.gene_overlap_contigs_file.gff3则储存了跨contig，尤其是跨gap的基因注释元素，可能会存在gene跨gap，但mRNA、CDS、exon没有跨，所以需要根据这个文件中的基因ID将对应的其它元素再从prefix.fixed.gff3中去除掉。

想确认基因注释的坐标到底有没有修复正确，可以使用gffread将修复前后的转录本/蛋白序列提出来，用seqkit sort一下，再将两个文件diff一下，如果完全一样，或者只有存在prefix.gene_overlap_contigs_file.gff3中的元素不一样，就说明坐标修复是没有问题的。