简介
TCGA
美国癌症基因组图谱
(The Cancer Genome Atlas,TCGA
)是由美国国家癌症研究所(National Cancer Institute,NCI)和国家人类基因组研究所(NationalHuman Genome Research Institute,NHGRI)合作开发的。
癌症基因组图谱是一个具有里程碑意义的癌症基因组学项目,对11000多例原发性癌症样本进行了测序和分子表征。目前它包含了33种癌症的数据,每种癌症都涉及关键基因组变化的全面、多维的图谱。TCGA数据库储存有2.5PB的数据,对超过1.1万多名患者的肿瘤组织及配对正常组织进行描述,目前已被广泛应用于研究领域。这些数据已为独立研究人员进行的癌症研究或者TCGA研究网络出版物做出了超过1千多项的贡献。
基因组注释文件
基因组注释文件
将各物种的每个染色体编号,并将其每个碱基位点编号,然后将人们已知的元件区间用起始位点和终止位点记录
。
这样在测序时我们便知道匹配到的片段是落在哪个基因或者转录本上,准确的是落在了基因内、基因间、内含子、外显子上,亦或是在正链还是负链上。进一步地,我们可以将各个区域的reads统计counts,算出每个基因或元件的RPKM和TPM用于后续分析。
基因组注释文件主要包含GFF,GTF两种主要格式,用于高通量测序中对已经map到参考基因组的reads做注释。
gencode datbase
GeneCode数据库中包含人和鼠的基因组注释文件,其中的基因组注释文件是GTF格式的。
gencode数据库官网地址:https://www.gencodegenes.org/
本样例用到的数据:https://www.gencodegenes.org/human/
gencode.v38.chr_patch_hapl_scaff.basic.annotation.gtf文件下载地址:
http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.chr_patch_hapl_scaff.basic.annotation.gtf.gz
代码
def get_gene_info(gencode_filename):
gene_dict = {}
with open(gencode_filename, "r") as gencode_file, open("gene_dict.csv", "w+") as gene_dict_file:
effective_genes_num = 0
for line in gencode_file: # read line by line
if line[0:3] != "chr": # skip useless lines
continue
# separate line into 9 parts
# …… \t …… \t line_info_type \t …… \t …… \t …… \t …… \t …… \t gene_info \t
line_info = line.split("\t")
if line_info[2] != "gene": # skip unimportant lines
continue
# eg.gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; ……
gene_info = line_info[8].split(" ")
gene_id = gene_info[1].split(".")[0][1:]
gene_name = gene_info[5][1:-2]
gene_dict[gene_id] = gene_name
gene_dict_info = gene_id + "," + gene_name + "\n"
gene_dict_file.write(gene_dict_info)
effective_genes_num += 1
print("Effective genes number found in " + gencode_filename + " is " + str(effective_genes_num))
return gene_dict
def transformENSGid(gencode_filename, tcga_filename):
gene_dict = get_gene_info(gencode_filename)
transformedENSGid_num = 0
with open(tcga_filename, "r") as tcga_data, open("TCGAfpkm_transformed.xls", "w+") as new_tcga_data_file:
for line in tcga_data:
if line[0:4] != "ENSG": # skip useless lines
new_tcga_data_file.write(line)
continue
# separate line into several parts
line_info = line.split("\t")
# get gene_id
gene_id = line_info[0].split(".")[0]
# transform
try:
line_info[0] = gene_dict[gene_id]
except:
continue # skip useless lines
# generate new line
new_line = "\t"
new_line = new_line.join(line_info)
new_tcga_data_file.write(new_line)
transformedENSGid_num += 1
print("Transformed ENSGID in " + tcga_filename + " is " + str(transformedENSGid_num) + ".")
if __name__ == "__main__":
# get file "gencode.v38.chr_patch_hapl_scaff.basic.annotation.gtf" from gencode database
# https://www.gencodegenes.org/human/
transformENSGid("gencode.v38.chr_patch_hapl_scaff.basic.annotation.gtf", "TCGAfpkm.xls")
# transformENSGid("gencode.v38.chr_patch_hapl_scaff.annotation.gtf","TCGAfpkm.xls")