pyensemble注释基因名称
1、什么是pyensembl包
一个conda安装包,安装时已下载hg19、hg38基因组。通过输入染色体chr、基因组位置pos即可返回基因名称(可能会返回0、1或多个基因名),适用于批量非精确查询;
更详细的信息可以到官网查阅,比如获取基因的所有外显子、转录本的基因名等。
2、安装pyensembl
# (1) 安装pyensembl包
conda install pyensembl
# (2) 下载ensembl数据
# 75是hg19, 76是hg38, 下载很慢,hg19越需1h,供参考
pyensembl install --release 75 76 --species human
3、使用pyensembl
import argparse
from pyensembl import EnsemblRelease
data = EnsemblRelease(75)
# chr_num、start_pos为int类型
gene_names_list = data.gene_names_at_locus(contig=chr_num, position=start_pos)
尤其是针对bed文件,非常有用,示例如下图:
import argparse
from pyensembl import EnsemblRelease
data = EnsemblRelease(75)
parser = argparse.ArgumentParser()
parser.add_argument('-i', "--input_file", required=True)
args = parser.parse_args()
input_file = args.input_file
output_file = "anno_" + str(input_file).split('/')[-1]
ff = open(output_file, 'w')
with open(input_file, 'r') as f:
for i in f.readlines():
if i.startswith('Chr'):
ff.write(i.strip() + "\tstart_gene_names" + "\tend_gene_names\n")
continue
temp_list = i.strip().split('\t')
chr_num = int(temp_list[0].replace('chr', ''))
start_pos = int(temp_list[1])
end_pos = int(temp_list[2])
start_gene_names_list = data.gene_names_at_locus(contig=chr_num, position=start_pos)
if len(start_gene_names_list) == 1:
start_gene_names = start_gene_names_list[0]
elif len(start_gene_names_list) == 0:
start_gene_names = "XXX"
else:
print(start_gene_names_list, i.strip(), "起始位置有多个基因名称,请核查!")
start_gene_names = "_".join(start_gene_names_list)
end_gene_names_list = data.gene_names_at_locus(contig=chr_num, position=end_pos)
if len(end_gene_names_list) == 1:
end_gene_names = end_gene_names_list[0]
elif len(end_gene_names_list) == 0:
end_gene_names = "XXX"
else:
end_gene_names = "_".join(end_gene_names_list)
new_str = i.strip() + "\t" + start_gene_names + "\t" + end_gene_names
ff.write(new_str + "\n")
ff.close()
该包很强大,目前用到的功能极少,后续随着使用会更新。