Reference genomes and annotation file versions needed to build an index

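This article covers the reference genomes and annotation files needed to build an index for genomic analysis, drawing on four sources: UCSC, Ensembl, NCBI, and GENCODE. It explains how to download FASTA-format reference genomes and GFF/GTF annotation files, and gives the download links for each site.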

I. Reference genomes

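Mapping (or alignment) is a key step in sequencing analysis. Broadly speaking, it is the process of placing reads onto an index built from a reference genome or transcriptome. Building that index therefore requires the reference genome and reference transcriptome in FASTA format.
UCSC, Ensembl, NCBI, and GENCODE are the sites from which reference genomes can be downloaded.
How the genome versions from these sources correspond: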

NCBI                              Ensembl                                      UCSC
NCBI36 (often written GRCh36)     release_52                                   hg18
GRCh37                            release_59/61/64/68/69/75                    hg19
GRCh38                            release_76/77/78/80/81/82 (and later)        hg38

[All steps below use hg38 (GRCh38) as the example.]

1. UCSC

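UCSC website: http://genome.ucsc.edu/
From the home page, go to Downloads -> Genome Data and choose Human; from there you can download whichever reference genome you need. Downloads are usually made from the FTP/HTTP download server:

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/
The reference genome and reference transcriptome files needed by a typical pipeline are all here:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/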

Download the reference genome (the command below uses wget over HTTP):

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
--2021-05-10 08:07:34--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 983659424 (938M) [application/x-gzip]
Saving to: ‘hg38.fa.gz’

hg38.fa.gz                  100%[===========================================>] 938.09M  9.55MB/s    in 95s     

2021-05-10 08:09:11 (9.82 MB/s) - ‘hg38.fa.gz’ saved [983659424/983659424]
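
Comparing the saved byte count with the Length reported by the server is a quick sanity check. As a slightly stronger check, gzip can test an archive without unpacking it. The sketch below is only an illustration: it uses a small stand-in file (demo.txt.gz, a made-up name) rather than hg38.fa.gz, and simulates an interrupted download by truncating that file.

```shell
# demo.txt.gz stands in for hg38.fa.gz; all file names here are illustrative.
printf 'ACGT\n' > demo.txt
gzip -f demo.txt            # produces demo.txt.gz and removes demo.txt

# gzip -t reads the whole archive and exits non-zero if it is damaged.
if gzip -t demo.txt.gz 2>/dev/null; then
    status="ok"
else
    status="corrupt"
fi

# Simulate an interrupted download by keeping only the first 10 bytes.
head -c 10 demo.txt.gz > truncated.gz
if gzip -t truncated.gz 2>/dev/null; then
    trunc_status="ok"
else
    trunc_status="corrupt"
fi

echo "complete archive: $status, truncated archive: $trunc_status"
```

On the real download, the equivalent check would be `gzip -t hg38.fa.gz`, run before `gunzip`.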

At 938 MB the size is as expected. Decompress the file and take a look at the sequence names:

$ gunzip hg38.fa.gz 
$ grep ">" hg38.fa 
>chr1
>chr10
>chr11
>chr11_KI270721v1_random
>chr12
>chr13
>chr14
>chr14_GL000009v2_random
>chr14_GL000225v1_random
>chr14_KI270722v1_random
>chr14_GL000194v1_random
>chr14_KI270723v1_random
>chr14_KI270724v1_random
>chr14_KI270725v1_random
>chr14_KI270726v1_random
>chr15
>chr15_KI270727v1_random
>chr16
>chr16_KI270728v1_random
>chr17
>chr17_GL000205v2_random
>chr17_KI270729v1_random
>chr17_KI270730v1_random
>chr18
>chr19
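
The header listing is long because hg38.fa contains many "_random" and other extra sequences alongside the primary chromosomes. A count is often easier to read than the full list. The following is a minimal sketch on a tiny made-up file (demo_ucsc.fa); it assumes, as the listing above suggests, that primary chromosome names contain no underscore.

```shell
# Build a tiny stand-in FASTA so the commands can be tried without the full genome.
# The file name demo_ucsc.fa and its sequences are made up for illustration.
cat > demo_ucsc.fa <<'EOF'
>chr1
ACGTACGT
>chr14_GL000009v2_random
ACGT
>chr17_KI270729v1_random
TTGA
>chr19
GGCC
EOF

# Count all sequences (one header line per sequence).
total=$(grep -c '^>' demo_ucsc.fa)

# Count only headers without an underscore, i.e. the primary chromosomes here.
primary=$(grep '^>' demo_ucsc.fa | grep -vc '_')

echo "total=$total primary=$primary"
```

Anchoring the pattern with `^>` restricts the match to header lines, which begin with ">".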

Note that FASTA files from UCSC label chromosomes as "chr1", "chr2", and so on. When choosing a GTF or GFF annotation, make sure it uses the same "chr1", "chr2" naming.
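
One way to check that a FASTA file and an annotation file agree on naming is to compare the sequence names in the FASTA headers with column 1 of the GTF. This is a minimal sketch, not a complete validator: the input files (demo.fa, demo_chr.gtf, demo_nochr.gtf) are tiny made-up examples, and the helper functions fasta_names, gtf_names, and shared_names are hypothetical names introduced here.

```shell
# Hypothetical miniature inputs; real files would be hg38.fa and a GTF of your choice.
printf '>chr1\nACGT\n>chr2\nACGT\n' > demo.fa
printf 'chr1\tdemo\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n' > demo_chr.gtf
printf '1\tdemo\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n' > demo_nochr.gtf

# Sequence names in a FASTA: header lines, ">" removed, first word only.
fasta_names() { grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1 | sort -u; }

# Sequence names in a GTF: column 1 of the non-comment lines.
gtf_names() { grep -v '^#' "$1" | cut -f1 | sort -u; }

# Number of names that appear in both files.
shared_names() {
    fasta_names "$1" > fa_names.tmp
    gtf_names "$2" > gtf_names.tmp
    comm -12 fa_names.tmp gtf_names.tmp | wc -l | tr -d ' '
    rm -f fa_names.tmp gtf_names.tmp
}

match=$(shared_names demo.fa demo_chr.gtf)
mismatch=$(shared_names demo.fa demo_nochr.gtf)
echo "chr-style GTF shares $match name(s); number-style GTF shares $mismatch"
```

A shared count of zero is a strong hint that the two files come from sources with different naming conventions.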

2. Ensembl

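Ensembl: http://asia.ensembl.org/index.html (the Asia mirror). For human, just click Human; the default assembly (as of release 104) is GRCh38.p13.
http://asia.ensembl.org/Homo_sapiens/Info/Index lays everything out at a glance: the genome FASTA files on the left and the GTF annotation files on the right.
As before, download from the file server:
http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/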

$ wget http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
--2021-05-10 08:42:35--  (try: 2)  http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 881211416 (840M), 749449533 (715M) remaining [application/octet-stream]
Saving to: ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’

Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz               100%[++++++++++++++++++++=======================================================================================================================>] 840.39M   250KB/s    in 52m 57s 

2021-05-10 09:35:33 (230 KB/s) - ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’ saved [881211416/881211416]
$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 
$ grep ">" Homo_sapiens.GRCh38.dna.primary_assembly.fa 
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
>11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
>12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF
>13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF
>14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF
>15 dna:chromosome chromosome:GRCh38:15:1:101991189:1 REF
>16 dna:chromosome chromosome:GRCh38:16:1:90338345:1 REF
>17 dna:chromosome chromosome:GRCh38:17:1:83257441:1 REF
>18 dna:chromosome chromosome:GRCh38:18:1:80373285:1 REF
>19 dna:chromosome chromosome:GRCh38:19:1:58617616:1 REF
>2 dna:chromosome chromosome:GRCh38:2:1:242193529:1 REF
>20 dna:chromosome chromosome:GRCh38:20:1:64444167:1 REF
>21 dna:chromosome chromosome:GRCh38:21:1:46709983:1 REF
>22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF
>3 dna:chromosome chromosome:GRCh38:3:1:198295559:1 REF
>4 dna:chromosome chromosome:GRCh38:4:1:190214555:1 REF
>5 dna:chromosome chromosome:GRCh38:5:1:181538259:1 REF
>6 dna:chromosome chromosome:GRCh38:6:1:170805979:1 REF
>7 dna:chromosome chromosome:GRCh38:7:1:159345973:1 REF
>8 dna:chromosome chromosome:GRCh38:8:1
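
Unlike the UCSC headers, Ensembl header lines carry a description after the sequence name, and the names themselves have no "chr" prefix. The sketch below pulls the name and the end coordinate out of each header. It assumes the third word follows the pattern coord_system:assembly:name:start:end:strand seen in the listing above, and it runs on a made-up demo file (demo_ensembl.fa) holding two of those header lines.

```shell
# Two header lines copied from the listing above; the sequences are placeholders.
cat > demo_ensembl.fa <<'EOF'
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
ACGT
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
ACGT
EOF

# For each header: drop the leading ">" from the first word to get the name,
# then split the third word on ":" and take field 5 (the end coordinate).
summary=$(awk '/^>/ { split($3, loc, ":"); print substr($1, 2) "\t" loc[5] }' demo_ensembl.fa)

printf '%s\n' "$summary"
```

Because the start coordinate is 1 for these whole-chromosome entries, the end coordinate doubles as the chromosome length.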