Lessons Learned | Five Ways to Build an Index

今天也是个妖精头子呀

Article link: https://blog.csdn.net/weixin_40640700/article/details/116891230

Recently, because of a project, I have run into several different ways of building an index, so it seems worth summarizing them here.

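First, before building any index you need to download the reference genome in FASTA format from a database. Depending on the project, you can choose a different reference build such as hg19 or hg38 (taking human as the example), so this file should be ready before indexing starts. Second, why build an index at all? Indexing usually comes up when sequencing reads are aligned to a reference genome: to make alignment faster, an index is built that works like a table of contents and helps the aligner locate the target region quickly.
(The above is my personal understanding.)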
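
To make the table-of-contents idea concrete, here is a toy sketch in shell and awk. It is only an illustration: Bowtie2 and BWA build compressed BWT-based indexes rather than a plain k-mer table, and `kmer_lookup`, the short sequence, and k = 3 are all made up for this example.

```shell
# Toy index: map every k-mer of a sequence to the 1-based positions where it starts,
# then answer a query with one table lookup instead of rescanning the sequence.
# kmer_lookup is a hypothetical helper, not part of any aligner.
kmer_lookup() {
    # $1 = reference sequence, $2 = k, $3 = query k-mer
    awk -v seq="$1" -v k="$2" -v q="$3" 'BEGIN {
        n = length(seq)
        # Build step: record each start position under its k-mer.
        for (i = 1; i + k - 1 <= n; i++) {
            kmer = substr(seq, i, k)
            if (kmer in pos) pos[kmer] = pos[kmer] "," i
            else pos[kmer] = i
        }
        # Lookup step: a single hash access.
        if (q in pos) print pos[q]
        else print "not found"
    }'
}

kmer_lookup ACGTACGGA 3 ACG   # prints 1,5
kmer_lookup ACGTACGGA 3 CGG   # prints 6
```

The build loop runs once over the reference; after that, each query is a single lookup, which is why aligners accept the up-front cost of indexing.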

1. Building an index with Bowtie2

bowtie2-build -f hg19.fna human

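Parameters:
bowtie2-build: the Bowtie2 command for building an index.
-f: the reference files are in FASTA format (this is the default).
hg19.fna: relative or absolute path to the reference genome.
human: the basename (prefix) of the index files to be written, not a directory name. It may include a directory part, e.g. index/human.

Details: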

Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
    reference_in            comma-separated list of files with ref sequences
    bt2_index_base          write bt2 data to files with this dir/basename

*** Bowtie 2 indexes work only with v2 (not v1).  Likewise for v1 indexes. ***
Options:
    -f                      reference files are Fasta (default)
    -c                      reference sequences given on cmd line (as
                            <reference_in>)
    --large-index           force generated index to be 'large', even if ref
                            has fewer than 4 billion nucleotides
    --debug                 use the debug binary; slower, assertions enabled
    --sanitized             use sanitized binary; slower, uses ASan and/or UBSan
    --verbose               log the issued command
    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
    -p/--packed             use packed strings internally; slower, less memory
    --bmax <int>            max bucket sz for blockwise suffix-array builder
    --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
    --dcv <int>             diff-cover period for blockwise (default: 1024)
    --nodc                  disable diff-cover (algorithm becomes quadratic)
    -r/--noref              don't build .3/.4 index files
    -3/--justref            just build .3/.4 index files
    -o/--offrate <int>      SA is sampled every 2^<int> BWT chars (default: 5)
    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
    --threads <int>         # of threads
    --seed <int>            seed for random number generator
    -q/--quiet              verbose output (for debugging)
    -h/--help               print detailed description of tool and its options
    --usage                 print this usage message
    --version               print version information and quit

The index files produced by bowtie2-build end in .bt2.
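
Because the last argument to bowtie2-build is a filename prefix, a quick way to confirm that a build finished is to look for the files that share that prefix. The sketch below assumes a standard "small" index, which consists of six files (a "large" index forced with --large-index uses the .bt2l suffix instead); `check_bt2_index` is a hypothetical helper name.

```shell
# check_bt2_index is a hypothetical helper: it reports which of the six small-index
# files for a given basename are missing or empty, and returns non-zero if any are.
check_bt2_index() {
    base="$1"   # index basename, e.g. "human" or "index/human"
    missing=0
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "$base.$suffix" ]; then
            echo "missing: $base.$suffix"
            missing=1
        fi
    done
    return "$missing"
}

# Intended use after `bowtie2-build -f hg19.fna human`:
# check_bt2_index human && echo "index looks complete"
```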

2. Building an index with HISAT

hisat-build hg19.fa human

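Usage is similar to bowtie2-build, and even a little simpler.

I ran this a while ago and no longer remember the format of the output files, so I will skip that here. (With HISAT2, the corresponding command is hisat2-build, and its index files end in .ht2.)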

3. Building an index with bwa

bwa index -a bwtsw hg19.fa

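I used this command recently. bwa is also an alignment tool (mainly for DNA).
-a bwtsw: the two go together; they specify that the index is built with the bwtsw algorithm.
hg19.fa: the reference genome file.

Details: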

Usage:   bwa index [options] <in.fasta>

Options: -a STR    BWT construction algorithm: bwtsw, is or rb2 [auto]
         -p STR    prefix of the index [same as fasta name]
         -b INT    block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes.

This produces five index files. They are fairly large, which surprised me.

hg19.fa.ann
hg19.fa.pac
hg19.fa.amb
hg19.fa.bwt
hg19.fa.sa

Also, quite a few problems can come up while building the index, so keep an eye on it.
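
The post does not say which problems occurred, so the sketch below only shows one generic way to keep evidence when a long build fails: capture everything the command prints, together with its exit status, in a log file. `run_logged` and the log file name are hypothetical.

```shell
# run_logged is a hypothetical wrapper: it runs a command, saves stdout and stderr
# to a log file, appends the exit status, and returns that status to the caller.
run_logged() {
    log="$1"; shift
    "$@" > "$log" 2>&1
    status=$?
    echo "exit status: $status" >> "$log"
    return "$status"
}

# Intended use (needs bwa and hg19.fa, so it is left commented out here):
# run_logged bwa_index.log bwa index -a bwtsw hg19.fa || echo "build failed, see bwa_index.log"
```

Checking the exit status matters for a job that runs for a long time, because a build that was killed partway can still leave partial index files behind.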

4. Building an index with samtools

samtools faidx hg19.fa

samtools is usually used to index BAM files. A less common case is indexing a FASTA file: the result is a .fai file, which some tools require.

Details:

Usage: samtools faidx <file.fa|file.fa.gz> [<reg> [...]]
Option:
 -o, --output FILE        Write FASTA to file.
 -n, --length INT         Length of FASTA sequence line. [60]
 -c, --continue           Continue after trying to retrieve missing region.
 -r, --region-file FILE   File of regions.  Format is chr:from-to. One per line.
 -i, --reverse-complement Reverse complement sequences.
     --mark-strand TYPE   Add strand indicator to sequence name
                          TYPE = rc   for /rc on negative strand (default)
                                 no   for no strand indicator
                                 sign for (+) / (-)
                                 custom,<pos>,<neg> for custom indicator
     --fai-idx      FILE  name of the index file (default file.fa.fai).
     --gzi-idx      FILE  name of compressed file index (default file.fa.gz.gzi).
 -f, --fastq              File and index in FASTQ format.
 -h, --help               This message.

Result files:

ls
hg19.fa.fai
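
A .fai file is a small tab-separated table with five columns per sequence: name, length, byte offset of the first base, bases per line, and bytes per line including the newline. As a rough sketch of what those columns mean, the awk script below recomputes them for a small FASTA. It is not a replacement for samtools faidx: `fai_sketch` is a hypothetical name, and the script assumes an uncompressed ASCII file with Unix line endings, no blank lines, and uniform line lengths within each sequence.

```shell
# fai_sketch is a hypothetical helper that recomputes the five .fai columns
# (name, length, offset, linebases, linewidth) for a small, well-formed FASTA.
fai_sketch() {
    # $1 = uncompressed FASTA file
    awk '
        BEGIN { OFS = "\t" }
        function emit() { if (name != "") print name, len, start, bases, width }
        /^>/ {
            emit()                        # finish the previous record, if any
            name = substr($1, 2)          # first word after the > sign
            offset += length($0) + 1      # header line plus its newline
            start = offset                # first base begins right after the header
            len = 0; bases = 0; width = 0
            next
        }
        {
            if (bases == 0) { bases = length($0); width = bases + 1 }
            len += length($0)
            offset += length($0) + 1
        }
        END { emit() }
    ' "$1"
}

# Intended use on a small test file:
# fai_sketch toy.fa
```

The offset and line-width columns are what allow a tool to seek straight to any coordinate in the FASTA without reading the file from the top.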

5. Building an index with gatk

gatk CreateSequenceDictionary -R hg19.fa

The GATK launcher used to be called gatk-launch, but as far as I know that no longer works; use gatk instead.

Result files:

ls
hg19.dict
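
The .dict file is plain text in SAM header style, with one @SQ line per reference sequence carrying its name (SN) and length (LN). A minimal sketch for pulling those two fields out, for example to compare them with the first two columns of a .fai file, might look like this. `dict_lengths` is a hypothetical helper name.

```shell
# dict_lengths is a hypothetical helper: it prints "name<TAB>length" for every
# @SQ line of a sequence dictionary, using the SN: and LN: tags.
dict_lengths() {
    # $1 = .dict file
    awk -F'\t' '
        $1 == "@SQ" {
            name = ""; len = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) len  = substr($i, 4)
            }
            print name "\t" len
        }
    ' "$1"
}

# Intended use: check that the dictionary and the FASTA index agree.
# dict_lengths hg19.dict > dict.tsv
# cut -f1,2 hg19.fa.fai > fai.tsv
# diff dict.tsv fai.tsv && echo "dict and fai agree"
```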

Also, because building these indexes takes so much time and effort, it is well worth backing them up promptly.
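
One possible way to do that backup is to copy the index directory and store checksums next to the copy, so a damaged file can be detected later. This is only a sketch: `backup_with_checksums` and the example paths are hypothetical, and it assumes the sha256sum command from GNU coreutils is available.

```shell
# backup_with_checksums is a hypothetical helper: it copies a directory of index
# files, writes a SHA256SUMS manifest computed from the source, and verifies the
# copy against that manifest.
backup_with_checksums() {
    src="$1"; dest="$2"
    mkdir -p "$dest" || return 1
    cp -R "$src"/. "$dest"/ || return 1
    # Checksums are taken from the source files, with paths relative to the source.
    ( cd "$src" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + ) \
        > "$dest/SHA256SUMS" || return 1
    # Verification runs inside the copy, so a bad copy makes this step fail.
    ( cd "$dest" && sha256sum -c SHA256SUMS )
}

# Intended use (paths are placeholders):
# backup_with_checksums ./hg19_index /mnt/backup/hg19_index
```

Running `sha256sum -c SHA256SUMS` inside the backup directory later repeats the verification without needing the original files.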
