用python完成二代测序捕获区间设计

Panel的设计其实很简单,根据实验目的来选择需要捕获的区域,我们需要做的就是把这些需要捕获的区域做成一个bed文件。

下面就以BRCA1/2两个基因来举例子,一般bed都是设计在基因的CDS区,因为内含子区域往往包含很多低复杂度区域(比如重复区域),所以内含子的捕获性能往往较差,后期分析难度也高。

我们需要先准备基因组注释文件,我从NCBI下载的最新版gtf文件(https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz)

下载下来可以直接解压

wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz
gzip -d GRCh37_latest_genomic.gtf.gz

然后先观察文件格式

less GRCh37_latest_genomic.gtf

格式大概有以下几类

#gtf-version 2.2
#!genome-build GRCh37.p13
#!genome-build-accession NCBI_Assembly:GCF_000001405.25
#!annotation-date 10/22/2020
#!annotation-source NCBI Homo sapiens Updated Annotation Release 105.20201022
NC_000017.10	BestRefSeq	gene	41196312	41277381	.	-	.	gene_id "BRCA1"; transcript_id ""; db_xref "GeneID:672"; db_xref "HGNC:HGNC:1100"; db_xref "MIM:113705"; description "BRCA1 DNA repair associated"; gbkey "Gene"; gene "BRCA1"; gene_biotype "protein_coding"; gene_synonym "BRCAI"; gene_synonym "BRCC1"; gene_synonym "BROVCA1"; gene_synonym "FANCS"; gene_synonym "IRIS"; gene_synonym "PNCA4"; gene_synonym "PPP1R53"; gene_synonym "PSCP"; gene_synonym "RNF53";
NC_000017.10	BestRefSeq	exon	41277199	41277381	.	-	.	gene_id "BRCA1"; transcript_id "NR_027676.2"; db_xref "GeneID:672"; gbkey "misc_RNA"; gene "BRCA1"; product "BRCA1 DNA repair associated, transcript variant 6"; exon_number "1";
NC_000017.10	BestRefSeq	CDS	41219625	41219712	.	-	0	gene_id "BRCA1"; transcript_id "NM_007298.3"; db_xref "CCDS:CCDS11454.2"; db_xref "GeneID:672"; gbkey "CDS"; gene "BRCA1"; note "isoform 4 is encoded by transcript variant 4"; product "breast cancer type 1 susceptibility protein isoform 4"; protein_id "NP_009229.2"; exon_number "15";

'#'开头的注释信息,第一列为染色体号(NC_000017.10对应chr17),第四列为起始位置,第五列为终止位置,我们按CDS区设计就只需要提取第三列为CDS的行(exon区域还包含两端的UTR区域,如果需要UTR区则按照exon设计)。

接下来我们开始提取BRCA1及BRCA2的CDS区域,为了便于后期扩展,把基因名放到单独的列表里:

gene.list
----
BRCA1
BRCA2

然后新建一个python脚本,把GRCh37染色体对应信息写进去:

make_bed.py
----
chr_dic = {"NC_000001.10": '1',
           "NC_000002.11": "2",
           "NC_000003.11": "3",
           "NC_000004.11": "4",
           "NC_000005.9": "5",
           "NC_000006.11": "6",
           "NC_000007.13": "7",
           "NC_000008.10": "8",
           "NC_000009.11": "9",
           "NC_000010.10": "10",
           "NC_000011.9": "11",
           "NC_000012.11": "12",
           "NC_000013.10": "13",
           "NC_000014.8": "14",
           "NC_000015.9": "15",
           "NC_000016.9": "16",
           "NC_000017.10": "17",
           "NC_000018.9": "18",
           "NC_000019.9": "19",
           "NC_000020.10": "20",
           "NC_000021.8": "21",
           "NC_000022.10": "22",
           "NC_000023.10": "X",
           "NC_000024.9": "Y",
           "NC_012920.1": "MT"}

接下来,将gtf文件转化为一个字典,字典的key为基因名,值为CDS起始与终止位置,基因名/转录本/外显子等信息都在gtf最后一列里提取

def phase_gtf(gtf_path, extract_type='CDS'):
    gene_dict = {}
    trans_list = []
    with open(gtf_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f.readlines() if not l.startswith('#')]
    datas = [x for x in datas if x[2] == extract_type and x[1] == 'BestRefSeq']
    for _data in datas:
        gene = _data[-1].split(';')[0].replace('gene_id ', '').replace('\"', '')
        trans = _data[-1].split(';')[1].replace('transcript_id ', '').replace('\"', '').replace(' ', '')
        exon = _data[-1].split(';')[-2].replace('exon_number', '').replace('\"', '').replace(' ', '')
        start = _data[3]
        end = _data[4]
        if _data[0] not in chr_dic:
            continue
        if gene not in gene_dict:
            gene_dict[gene] = [[trans, 'exon%s' % exon, chr_dic[_data[0]], start, end]]
        else:
            gene_dict[gene] = gene_dict[gene] + [[trans, 'exon%s' % exon, chr_dic[_data[0]], start, end]]
    return gene_dict

然后从字典里拿出来放到文件里即可:

gene_dict = phase_gtf('GRCh37_latest_genomic.gtf')

with open('gene.list', 'r') as f, open('raw.bed', 'w+') as f_o:
    for l in f:
        l = l.rstrip()
        if l not in gene_dict:
            print(l, "gene name not found")
            continue
        gene_detail = gene_dict[l]
        for _trans, _exon, _chr, _start, _end in gene_detail:
            f_o.write('\t'.join([_chr, str(_start), str(_end), l, _trans, _exon])+'\n')

把脚本合并起来:

make_bed.py
----
chr_dic = {"NC_000001.10": '1',
           "NC_000002.11": "2",
           "NC_000003.11": "3",
           "NC_000004.11": "4",
           "NC_000005.9": "5",
           "NC_000006.11": "6",
           "NC_000007.13": "7",
           "NC_000008.10": "8",
           "NC_000009.11": "9",
           "NC_000010.10": "10",
           "NC_000011.9": "11",
           "NC_000012.11": "12",
           "NC_000013.10": "13",
           "NC_000014.8": "14",
           "NC_000015.9": "15",
           "NC_000016.9": "16",
           "NC_000017.10": "17",
           "NC_000018.9": "18",
           "NC_000019.9": "19",
           "NC_000020.10": "20",
           "NC_000021.8": "21",
           "NC_000022.10": "22",
           "NC_000023.10": "X",
           "NC_000024.9": "Y",
           "NC_012920.1": "MT"}
           
def phase_gtf(gtf_path, extract_type='CDS'):
    gene_dict = {}
    trans_list = []
    with open(gtf_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f.readlines() if not l.startswith('#')]
    datas = [x for x in datas if x[2] == extract_type and x[1] == 'BestRefSeq']
    for _data in datas:
        gene = _data[-1].split(';')[0].replace('gene_id ', '').replace('\"', '')
        trans = _data[-1].split(';')[1].replace('transcript_id ', '').replace('\"', '').replace(' ', '')
        exon = _data[-1].split(';')[-2].replace('exon_number', '').replace('\"', '').replace(' ', '')
        start = _data[3]
        end = _data[4]
        if _data[0] not in chr_dic:
            continue
        if gene not in gene_dict:
            gene_dict[gene] = [[trans, 'exon%s' % exon, chr_dic[_data[0]], start, end]]
        else:
            gene_dict[gene] = gene_dict[gene] + [[trans, 'exon%s' % exon, chr_dic[_data[0]], start, end]]
    return gene_dict
 
if __name__ == "__main__":
    gene_dict = phase_gtf('GRCh37_latest_genomic.gtf')

    with open('gene.list', 'r') as f, open('raw.bed', 'w+') as f_o:
        for l in f:
            l = l.rstrip()
            if l not in gene_dict:
                print(l, "gene name not found")
                continue
            gene_detail = gene_dict[l]
            for _trans, _exon, _chr, _start, _end in gene_detail:
                f_o.write('\t'.join([_chr, str(_start), str(_end), l, _trans, _exon])+'\n')

这样会生成一个文件,格式如下

17   41258473        41258543        BRCA1   NM_007297.4     exon3
17   41256885        41256973        BRCA1   NM_007297.4     exon4
17   41256139        41256278        BRCA1   NM_007297.4     exon5
17   41251792        41251897        BRCA1   NM_007297.4     exon6
17   41249261        41249306        BRCA1   NM_007297.4     exon7

当然这不是最终格式,因为这不合bed文件的规范,而且我习惯于把多个转录本的CDS区合并了过后设计,仔细看这份文件会发现它包含了多个转录本,也可以直接选最长的转录本,我们还是来合并CDS区间,另起一个python脚本

先读取刚刚生成的文件

def read_bed(bed_path):
    with open(bed_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f.readlines() if not l.startswith('#')]
    datas = [[x[0], int(x[1]), int(x[2]), x[3], x[4]] for x in datas]
    chr_list = list(set([x[0] for x in datas]))
    return chr_list, datas

然后我们需要对有重叠的区间合并,并在合并后对区间排序,这里设计成分染色体来处理

def handle_chr(data, _chr):
    data = [x for x in data if x[0] == _chr]
    check_num = 0
    while check_num == 0:
        check_num = 1
        data_backup = data[:]
        for _idx, _data in enumerate(data):
            for i in range(_idx+1, len(data)):
                if data[i][1] <= data[_idx][1] <= data[i][2] or data[i][1] <= data[_idx][2] <= data[i][2] or \
                        (data[_idx][1] <= data[i][1] and data[_idx][2] >= data[i][2]):
                    num_list = [data[i][1], data[i][2], data[_idx][1], data[_idx][2]]
                    num_list.sort()
                    new_line = [data[i][0], num_list[0], num_list[-1], data[i][3], data[i][4]]
                    data_backup.pop(i)
                    data_backup.pop(_idx)
                    data_backup.append(new_line)
                    check_num = 0
                    break
            if check_num == 0:
                break
        data = data_backup[:]
    check_num = 0
    while check_num == 0:
        check_num = 1
        for i in range(0, len(data)-1):
            if data[i][1] > data[i+1][1]:
                tmp_data = data[i][:]
                data[i] = data[i+1][:]
                data[i+1] = tmp_data[:]
                check_num = 0
    return data

然后就是写文件了,所以这个脚本合起来就是

merge_bed.py
---
def read_bed(bed_path):
    with open(bed_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f.readlines() if not l.startswith('#')]
    datas = [[x[0], int(x[1]), int(x[2]), x[3], x[4]] for x in datas]
    chr_list = list(set([x[0] for x in datas]))
    return chr_list, datas
    
def handle_chr(data, _chr):
    data = [x for x in data if x[0] == _chr]
    check_num = 0
    while check_num == 0:
        check_num = 1
        data_backup = data[:]
        for _idx, _data in enumerate(data):
            for i in range(_idx+1, len(data)):
                if data[i][1] <= data[_idx][1] <= data[i][2] or data[i][1] <= data[_idx][2] <= data[i][2] or \
                        (data[_idx][1] <= data[i][1] and data[_idx][2] >= data[i][2]):
                    num_list = [data[i][1], data[i][2], data[_idx][1], data[_idx][2]]
                    num_list.sort()
                    new_line = [data[i][0], num_list[0], num_list[-1], data[i][3], data[i][4]]
                    data_backup.pop(i)
                    data_backup.pop(_idx)
                    data_backup.append(new_line)
                    check_num = 0
                    break
            if check_num == 0:
                break
        data = data_backup[:]
    check_num = 0
    while check_num == 0:
        check_num = 1
        for i in range(0, len(data)-1):
            if data[i][1] > data[i+1][1]:
                tmp_data = data[i][:]
                data[i] = data[i+1][:]
                data[i+1] = tmp_data[:]
                check_num = 0
    return data

if __name__ == "__main__":
    chr_list, datas = read_bed("raw.bed")
    total_data = []
    for i in range(1, 23):
        chr_data = handle_chr(datas, str(i))
        total_data += chr_data
    for _chr in ["X", "Y", "MT"]:
        total_data += handle_chr(datas, _chr)
    with open('merged_cds.bed', 'w') as f:
        f.write('\n'.join(["\t".join([str(y) for y in x]) for x in total_data])+'\n')

运行完会生成merged_cds.bed,完成合并后检查检查,没问题就做成bed格式:

awk '{print $1"\t"$2"\t"$3"\t+\t"$1"_"$2"_"$3}' merged_cds.bed > final.bed

生成的文件格式如下:

17   41258473        41258543        +   17_41258473_41258543
17   41256885        41256973        +   17_41256885_41256973
17   41256139        41256278        +   17_41256139_41256278

(PS: 别忘了cat一个表头进去,有些软件需要表头)

  • 5
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 4
    评论
Sanger测序效率可以通过多种方法进行计算,以下是一种简单的Python实现: ```python # 输入参数 read_length = 500 # 读长 coverage = 10 # 覆盖度 genome_size = 1e6 # 基因组大小 # 计算每个位置的错误率 q = 10 ** (-0.1 * read_length) # 计算每个位置的正确率 p = 1 - q # 计算每个位点被正确测序的概率 prob_correct = p ** coverage # 计算整个基因组被正确测序的概率 prob_genome_correct = prob_correct ** (genome_size / read_length) # 计算测序效率 efficiency = 1 - (1 - prob_genome_correct) ** coverage # 输出结果 print("Sanger测序效率为:{:.2%}".format(efficiency)) ``` 解释一下代码中的每个部: - `read_length`:读长,即每次测序的DNA片段长度。 - `coverage`:覆盖度,即每个位点被测序的次数。 - `genome_size`:基因组大小。 - `q`:计算每个位置的错误率,使用Sanger测序的质量值公式计算。其中0.1为质量值的转换因子。 - `p`:计算每个位置的正确率。 - `prob_correct`:计算每个位点被正确测序的概率,即每个位点被测序的所有次数都正确的概率。 - `prob_genome_correct`:计算整个基因组被正确测序的概率,即每个位点都正确测序的概率的连乘积。 - `efficiency`:计算Sanger测序的效率,即至少有一个位点错误的概率。 需要注意的是,这只是一种简单的计算方法,实际上Sanger测序的效率还受到许多其他因素的影响,如测序质量、测序深度、序列比对等。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 4
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值