How to choose a reference genome and annotation files

Original article, last modified 2022-06-12 17:46:24

First published 2021-04-15 10:08:18

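The reference genome is the foundation of bioinformatics analysis: data from resequencing, arrays, transcriptome sequencing, and similar experiments must first be aligned to a reference genome before any downstream analysis can be done.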

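Note that a downloaded reference genome must always be used with the annotation file that belongs to it. Do not download the genome from Ensembl and the annotation from NCBI, and do not pair a version 3.0 genome with a version 4.0 annotation.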
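
One quick way to catch a mismatched pair is to compare the sequence names in the FASTA headers with the first column of the annotation file. The sketch below is only a minimal sanity check under the assumption that both files are plain or gzip-compressed text; the function names are made up for illustration, and matching names alone do not prove that the two files come from the same assembly version.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_seq_names(fasta_path):
    """Collect sequence names (first word of each '>' header line)."""
    names = set()
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def annotation_seq_names(annotation_path):
    """Collect the values of column 1 (seqname) from a GTF/GFF file."""
    names = set()
    with open_text(annotation_path) as handle:
        for line in handle:
            # Skip comment/pragma lines and empty lines.
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def missing_from_genome(fasta_path, annotation_path):
    """Return annotation sequence names that are absent from the FASTA."""
    missing = annotation_seq_names(annotation_path) - fasta_seq_names(fasta_path)
    return sorted(missing)
```

An empty list from `missing_from_genome` means every annotated sequence exists in the genome file. A long list (for example accession-style names in the annotation against plain chromosome numbers in the FASTA) usually means the two files come from different sources.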

Taking sheep as an example, the reference genome is usually downloaded from NCBI or Ensembl. Personally I prefer Ensembl:

1. Ensembl: the reference genome, cDNA, CDS, and ncRNA sequences, and the annotation files can all be downloaded here.

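1.1 Forms of the reference genome:

Ensembl provides the reference genome in two assembly forms (primary_assembly and toplevel) and with three kinds of repeat handling: unmasked (dna), soft-masked (dna_sm), and hard-masked (dna_rm).

Usually choose dna.primary_assembly or dna_sm.primary_assembly.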

1.2 File naming convention:

<species>.<assembly>.<sequence type>.<id type>.<id>.fa.gz

<species>: The systematic name of the species.
<assembly>: The name of the genome assembly.
<sequence type>: one of dna, dna_rm, or dna_sm
  * 'dna' - unmasked genomic DNA sequences.
  * 'dna_rm' - masked genomic DNA. Repeats and low-complexity regions are
               detected with the RepeatMasker tool and replaced by 'N's.
  * 'dna_sm' - soft-masked genomic DNA. Repeats and low-complexity regions
               are written in lowercase letters.
<id type>: one of chromosome, nonchromosomal, or seqlevel
  * 'chromosome'     - A chromosome.
  * 'nonchromosomal' - Sequences that have not been assigned to a chromosome.
  * 'seqlevel'       - This is usually sequence scaffolds, chunks or clones.
     -- 'scaffold'   - Larger sequence contigs from the assembly of shorter
        sequencing reads (often from whole genome shotgun, WGS) which could
        not yet be assembled into chromosomes. Often more genome sequencing
        is needed to narrow gaps and establish a tiling path.
     -- 'chunk' -  While contig sequences can be assembled into large entities,
        they sometimes have to be artificially broken down into smaller entities
        called 'chunks'. This is due to limitations in the annotation
        pipeline and the finite record size imposed by MySQL which stores the
        sequence and annotation information.
     -- 'clone' - In general this is the smallest sequence entity.  It is often
        identical to the sequence of one BAC clone, or sequence region
        of one BAC clone which forms the tiling path.
<id>: The actual sequence identifier. Depending on the <id type> the <id>
          could represent the name of a chromosome, a scaffold, a contig, a clone ..
          Field is empty for seqlevel files
fa: FASTA file
gz: gzip-compressed file
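
The naming convention above can be split into its fields mechanically. The following is a rough sketch rather than an official parser: the regular expression and the function name `parse_ensembl_fasta_name` are my own, and the pattern anchors on the sequence-type token (dna, dna_rm, dna_sm) because an assembly name such as Oar_rambouillet_v1.0 may itself contain a dot.

```python
import re

# Assumed pattern derived from the convention described above.
ENSEMBL_FASTA_NAME = re.compile(
    r"^(?P<species>[^.]+)\."
    r"(?P<assembly>.+?)\."
    r"(?P<sequence_type>dna(?:_rm|_sm)?)\."
    r"(?P<id_type>[^.]+)"
    r"(?:\.(?P<id>.+?))?"      # <id> is optional (empty for whole-genome files)
    r"\.fa\.gz$"
)


def parse_ensembl_fasta_name(filename):
    """Split an Ensembl-style genome FASTA file name into its fields."""
    match = ENSEMBL_FASTA_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an Ensembl-style genome FASTA name: {filename}")
    return match.groupdict()


fields = parse_ensembl_fasta_name(
    "Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa.gz"
)
print(fields["species"])        # Ovis_aries_rambouillet
print(fields["assembly"])       # Oar_rambouillet_v1.0
print(fields["sequence_type"])  # dna
print(fields["id_type"])        # toplevel
```

For whole-genome files such as the toplevel file the `id` field comes back as `None`.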

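1.3 toplevel or primary_assembly reference genome?

TOPLEVEL: contains all chromosome sequences, sequences not assembled into chromosomes, and N-padded haplotype/patch regions.

PRIMARY ASSEMBLY: the file best suited to sequence alignment; it contains everything in toplevel except the haplotype/patch regions. If there is no 'primary_assembly' file, the species has no haplotype/patch regions and the 'toplevel' file is equivalent.

Sheep has no primary_assembly reference genome, whereas model organisms such as human, mouse, and zebrafish do, for example: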
  Primary assembly sequences unmasked:
    Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
  Primary assembly soft/hard masked sequences:
    Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz
    Homo_sapiens.GRCh37.dna_rm.primary_assembly.fa.gz
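
The rule "use primary_assembly when it exists, otherwise toplevel" can be written as a small helper. This is only an illustrative sketch: the function name `pick_alignment_fasta` is hypothetical, and it assumes you already have the list of file names from the species' FASTA directory.

```python
def pick_alignment_fasta(filenames, sequence_type="dna"):
    """Prefer the primary_assembly file; fall back to toplevel.

    sequence_type is one of 'dna', 'dna_sm', or 'dna_rm'.
    Returns None if neither file is present.
    """
    for id_type in ("primary_assembly", "toplevel"):
        suffix = f".{sequence_type}.{id_type}.fa.gz"
        for name in filenames:
            if name.endswith(suffix):
                return name
    return None


human = [
    "Homo_sapiens.GRCh37.dna.toplevel.fa.gz",
    "Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz",
    "Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz",
]
sheep = [
    "Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa.gz",
    "Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna_sm.toplevel.fa.gz",
]
print(pick_alignment_fasta(human))  # Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
print(pick_alignment_fasta(sheep))  # Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa.gz
```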

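1.4 Unmasked (dna), hard-masked (dna_rm), or soft-masked (dna_sm) reference genome?

Masked genome: all repeats and low-complexity regions are replaced by N, so no reads can align to those regions. A masked genome is generally not recommended because information is lost. One consequence is that reads aligning uniquely to the masked genome may not really be unique. Masking can also cause misalignments: when mismatches are allowed, reads that truly come from a repeat region may be placed somewhere else in the genome. In addition, software that detects repeats and low-complexity regions is never perfect, so the masked regions are not 100% accurate or sensitive.

Soft-masked genome: all repeats and low-complexity regions are marked in lowercase letters. Mainstream aligners such as BWA and bowtie2 ignore the soft-masking and treat lowercase letters as uppercase, so aligning against a soft-masked genome gives the same result as aligning against the unmasked genome.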
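
The relationship between the three sequence types can be shown on a toy sequence. The snippet below is a minimal sketch with made-up helper names and an invented 16-base sequence; it only illustrates that uppercasing a soft-masked sequence recovers the unmasked one, while hard-masking throws the repeat bases away.

```python
def unmask(soft_masked):
    """Soft-masked -> unmasked: lowercase bases simply become uppercase."""
    return soft_masked.upper()


def hard_mask(soft_masked):
    """Soft-masked -> hard-masked: every lowercase base is replaced by N."""
    return "".join("N" if base.islower() else base for base in soft_masked)


def masked_fraction(sequence):
    """Fraction of positions that are lowercase (soft-masked) or N.

    Note: N also marks assembly gaps, so in a real genome this counts
    gaps together with hard-masked repeats.
    """
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower() or base == "N")
    return masked / len(sequence)


soft = "ACGTacgtacgtTTGA"         # invented example: 8 of 16 bases are repeats
print(unmask(soft))               # ACGTACGTACGTTTGA
print(hard_mask(soft))            # ACGTNNNNNNNNTTGA
print(masked_fraction(soft))      # 0.5
```

The information in the eight repeat bases is still present after `unmask`, but cannot be recovered from the output of `hard_mask`, which is the loss the text warns about.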

Therefore, here we choose: Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa.gz

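1.5 Annotation files

Ensembl provides three kinds of annotation file in both .gtf (General Transfer Format) and .gff (General Feature Format) form; pick the one whose annotation content fits your needs:

gtf: all annotations
chr: annotations on the chromosomes only
abinitio: annotations of the ab initio predicted gene set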
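
Whichever file you pick, a GTF record is one tab-separated line with nine columns, the last of which holds key "value"; pairs. The parser below is a simplified sketch: the function name is hypothetical, the example line uses invented IDs, and it assumes every attribute value is double-quoted, which may not hold for GTF files from other sources.

```python
import re

GTF_COLUMNS = (
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame",
)
# Assumption: attributes look like  key "value";  with quoted values.
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Parse one GTF data line into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # A key that occurs more than once keeps only its last value here.
    record["attributes"] = dict(ATTRIBUTE.findall(fields[8]))
    return record


# Invented example line, not taken from a real annotation file.
example = (
    "1\texample_source\tgene\t1000\t5000\t.\t+\t.\t"
    'gene_id "GENE0001"; gene_biotype "protein_coding";'
)
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"])  # gene 1000 5000
print(record["attributes"]["gene_id"])                    # GENE0001
```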

2. NCBI has three entry points for downloading a reference genome:

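Entry 1: a straightforward direct download. Note that the genome and annotation files come in .fna and .gff format. The .fna file is already FASTA under a different extension, while the .gff annotation has to be converted if a downstream tool requires .gtf.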

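Entry 2: contains the reference genome, CDS, RNA, and protein sequences together with the corresponding annotation; what each file holds is described in the README file.

This is also the latest version of the reference genome. The layout is less clean and clear than Ensembl's, but the statistics and documentation for the reference genome are more complete.

Entry 3: every version of the reference genome, which is bewildering.

This is where I first learned that sheep has 7 versions of the reference genome. If you have trouble making choices, stay away from this one; also, I could not find an annotation file in some of the folders.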