生信分析：TCGA数据id转换(python)

最新推荐文章于 2024-04-30 19:24:04 发布

Dennis F

最新推荐文章于 2024-04-30 19:24:04 发布

阅读量1.7k

点赞数

分类专栏：生信分析文章标签： python 生信分析 TCGA

本文链接：https://blog.csdn.net/wzz_2363173126/article/details/118703938

版权

生信分析专栏收录该内容

1 篇文章 0 订阅

订阅专栏

文章目录

简介
代码
参考文献

简介

TCGA

美国癌症基因组图谱(The Cancer Genome Atlas，TCGA)是由美国国家癌症研究所(National Cancer Institute，NCI)和国家人类基因组研究所(NationalHuman Genome Research Institute，NHGRI)合作开发的。
癌症基因组图谱是一个具有里程碑意义的癌症基因组学项目，对11000多例原发性癌症样本进行了测序和分子表征。目前它包含了33种癌症的数据，每种癌症都涉及关键基因组变化的全面、多维的图谱。TCGA数据库储存有2.5PB的数据，对超过1.1万多名患者的肿瘤组织及配对正常组织进行描述，目前已被广泛应用于研究领域。这些数据已为独立研究人员进行的癌症研究或者TCGA研究网络出版物做出了超过1千多项的贡献。

基因组注释文件

       基因组注释文件将各物种的每个染色体编号，并将其每个碱基位点编号，然后将人们已知的元件区间用起始位点和终止位点记录。
       这样在测序时我们便知道匹配到的片段是落在哪个基因或者转录本上，准确的是落在了基因内、基因间、内含子、外显子上，亦或是在正链还是负链上。进一步地，我们可以将各个区域的reads统计counts，算出每个基因或元件的RPKM和TPM用于后续分析。
       基因组注释文件主要包含GFF，GTF两种主要格式，用于高通量测序中对已经map到参考基因组的reads做注释。

gencode datbase

GeneCode数据库中包含人和鼠的基因组注释文件，其中的基因组注释文件是GTF格式的。

基因组注释文件格式 – （二）GTF文件格式

       gencode数据库官网地址：https://www.gencodegenes.org/
       本样例用到的数据：https://www.gencodegenes.org/human/

       gencode.v38.chr_patch_hapl_scaff.basic.annotation.gtf文件下载地址：
       http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.chr_patch_hapl_scaff.basic.annotation.gtf.gz

代码

def get_gene_info(gencode_filename):
    gene_dict = {}
    with open(gencode_filename, "r") as gencode_file, open("gene_dict.csv", "w+") as gene_dict_file:
        effective_genes_num = 0

        for line in gencode_file:  # read line by line

            if line[0:3] != "chr":  # skip useless lines
                continue

            # separate line into 9 parts
            # …… \t …… \t line_info_type \t …… \t …… \t …… \t …… \t …… \t gene_info \t
            line_info = line.split("\t")
            if line_info[2] != "gene":  # skip unimportant lines
                continue

            # eg.gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; ……
            gene_info = line_info[8].split(" ")
            gene_id = gene_info[1].split(".")[0][1:]
            gene_name = gene_info[5][1:-2]

            gene_dict[gene_id] = gene_name
            gene_dict_info = gene_id + "," + gene_name + "\n"
            gene_dict_file.write(gene_dict_info)

            effective_genes_num += 1

        print("Effective genes number found in " + gencode_filename + " is " + str(effective_genes_num))
        return gene_dict


def transformENSGid(gencode_filename, tcga_filename):
    gene_dict = get_gene_info(gencode_filename)

    transformedENSGid_num = 0

    with open(tcga_filename, "r") as tcga_data, open("TCGAfpkm_transformed.xls", "w+") as new_tcga_data_file:
        for line in tcga_data:

            if line[0:4] != "ENSG":  # skip useless lines
                new_tcga_data_file.write(line)
                continue

            # separate line into several parts
            line_info = line.split("\t")

            # get gene_id
            gene_id = line_info[0].split(".")[0]

            # transform
            try:
                line_info[0] = gene_dict[gene_id]
            except:
                continue  # skip useless lines

            # generate new line
            new_line = "\t"
            new_line = new_line.join(line_info)
            new_tcga_data_file.write(new_line)

            transformedENSGid_num += 1

        print("Transformed ENSGID in " + tcga_filename + " is " + str(transformedENSGid_num) + ".")


if __name__ == "__main__":
    # get file "gencode.v38.chr_patch_hapl_scaff.basic.annotation.gtf" from gencode database
    # https://www.gencodegenes.org/human/

    transformENSGid("gencode.v38.chr_patch_hapl_scaff.basic.annotation.gtf", "TCGAfpkm.xls")
    # transformENSGid("gencode.v38.chr_patch_hapl_scaff.annotation.gtf","TCGAfpkm.xls")