使用TaxAss对基于淡水微生物16S的OTU数据进行分类学注释的摸索

在做一些环境微生物16S的工作,之前由于是拜托的朋友进行的测序和下机拆包操作,自己得到的是已经聚类好的OTU表和fasta文件,用的数据库应该是RDP。

后来发现有个专门用于注释淡水微生物的数据库FWMFG

R. J. Newton, S. E. Jones, A. Eiler, K. D. McMahon, S. Bertilsson.
(2011) “A Guide to the natural history of freshwater lake bacteria”.
Microbiology and Molecular Biology Reviews 75(1):14-49.

想用那个数据库注释一下,看看有没有什么新发现,于是硬着头皮给老外发邮件,老外给我回了个github链接: https://github.com/mcmahon-uw/FWMFG
还附带一个叫TaxAss的分析流程: https://github.com/McMahonLab/TaxAss

TaxAss: Leveraging a Custom Freshwater Database Achieves Fine-Scale
Taxonomic Resolution Robin R Rohwer, Joshua J Hamilton, Ryan J Newton,
Katherine D McMahon mSphere; doi:
https://doi.org/10.1128/mSphere.00327-18

需要linux环境。之前一直非常懒,过年回来感觉再不努力可能就要出事了,于是决定安装linux,然后从头摸索TaxAss的流程,记录如下。

使用TaxAss必须安装python,R.

环境的配置

安装R

https://bbs.deepin.org/forum.php?mod=viewthread&tid=138956
请问怎么安装在deepin系统下安装linux R语言软件

sudo apt-get update
sudo apt-get install r-base r-base-dev

配置python

https://www.cnblogs.com/vaster/p/7501612.html
deepin系统下部署Python3.5的开发及运行环境
https://www.cnblogs.com/mountain2011/p/6978847.html linux上python安装相关
http://www.cnblogs.com/tinywan/p/7230039.html Linux命令详解(三)./configure、make、make install 命令

从官网下载Python-3.7.2.tgz,解压缩后进入目录,执行如下命令

./configure --with-ssl --prefix=/usr/local/python35
make 
sudo make install

报错

ModuleNotFoundError: No module named '_ctypes'

参考https://blog.csdn.net/wang725/article/details/79905612 安装python3.7时候,报错ModuleNotFoundError: No module named ‘_ctypes’
执行如下命令

sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
sudo apt-get install build-essential python-dev python-setuptools python-pip python-smbus
sudo apt-get install build-essential libncursesw5-dev libgdbm-dev libc6-dev
sudo apt-get install zlib1g-dev libsqlite3-dev tk-dev
sudo apt-get install libssl-dev openssl
sudo apt-get install libffi-dev

再次编译,安装成功。
下一步是安装mothur,需要安装libz, bzip2,boost,其中bzip2在deepin中默认已安装

安装libz

https://blog.csdn.net/u013310119/article/details/81031613
linux下编译安装zlib
https://blog.csdn.net/yuhengyue/article/details/78131758
ubuntu如何安装libz库
http://www.zlib.net/
下载zlib-1.2.11.tar.gz

libz库对应的安装包名字就是zlib
解压缩后在根目录执行如下命令

./configure
make
sudo make install

安装boost

https://www.boost.org/users/history/version_1_69_0.html
下载最新版boost
https://www.cnblogs.com/LyndonYoung/articles/5288618.html
linux下boost库的安装
https://www.cnblogs.com/tanshaoxiaoji/p/vi.html
Linux指令 vi编辑,保存及退出

解压后进入目录,执行如下命令

./bootstrap.sh

之后会产生bjam和b2两个工具

sudo ./b2 install

确定已经安装了g++与gcc,此过程会花费一些时间)这个时候你的/usr/local/include下会产生boost的头文件,
/usr/local/lib下面会产生boost库。用的时间很久。

切换到cd /etc/profile.d目录下,执行如下命令,使用超级用户创建文件boost.sh

sudo vi boost.sh

里面添加如下内容

#!/bin/sh
BOOST_ROOT=/home/Lyndon/boost_1_60_0(boost的解压路径)
BOOST_INCLUDE=/usr/local/include/boost
BOOST_LIB=/usr/local/lib
export BOOST_INCLUDE BOOST_LIB BOOST_ROOT

按ESC,输入:wq,保存并离开。

修改boost.sh的权限:

sudo chmod +x boost.sh

执行source boost.sh

至此,boost安装完毕。

安装mothur

http://www.zhimengzhe.com/linux/128671.html
mothur的安装
https://github.com/mothur/mothur/releases/tag/v1.39.5
下载mothur

解压缩后进入根目录,执行命令安装(时间很长)

make

验证编译是否成功:

./mothur 

查看是否运行成功,用quit()退出

此时提示

 ./mothur: error while loading shared libraries: libboost_iostreams.so.1.69.0: cannot open shared object file: No such file or directory

https://blog.csdn.net/ariessurfer/article/details/7984001
error while loading shared libraries错误解决

https://blog.csdn.net/whatday/article/details/83825491
Linux中error while loading shared libraries错误解决办法

那就表示系统不知道xxx.so放在哪个目录下,这个时候就要在/etc/ld.so.conf中加入xxx.so所在的目录。首先查找存在的目录,结果为/usr/local/lib/libboost_iostreams.so,具体操作如下:

sudo vi /etc/ld.so.conf

"include ld.so.conf.d/*.conf"下方增加"/usr/local/lib"
保存后,在命令行终端执行:
sudo /sbin/ldconfig -v
其作用是将文件/etc/ld.so.conf列出的路径下的库文件缓存到/etc/ld.so.cache以供使用,因此当安装完一些库文件,或者修改/etc/ld.so.conf增加了库的新搜索路径,需要运行一下ldconfig,使所有的库文件都被缓存到文件/etc/ld.so.cache中,如果没做,可能会找不到刚安装的库。

此时执行./mothur ,提示运行成功。

在/opt下新建文件夹/biosoft,把mothur-1.39.5文件夹移动到/opt/biosoft/中,执行如下命令,把该文件夹加入系统变量

echo 'PATH=/opt/biosoft/mothur-1.39.5:$PATH' >> ~/.bashrc
source  ~/.bashrc

安装blast

https://www.cnblogs.com/renping/p/6899661.html Blast
如何使用Blast+(Linux)转载

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
下载ncbi-blast-2.8.1±x64-linux.tar.gz

解压至/opt/biosoft/ncbi-blast-2.8.1+/

执行

echo 'PATH=/opt/biosoft/ncbi-blast-2.8.1+/bin:$PATH' >> ~/.bashrc  #加入环境变量
source ~/.bashrc

TaxAss工作流程

主要参考了如下流程,解释什么的就不翻译了,我其实也是有点一知半解。

https://htmlpreview.github.io/?https://github.com/McMahonLab/TaxAss/blob/master/tax-scripts/TaxAss_Directions.html

看了看TaxAss自带的淡水微生物数据库可能还要比老外那个版本更新一点,于是决定用那个,需要注意把TaxAss提供的淡水微生物数据库的fasta和taxonomy文件重命名为custom.fasta和taxonomy.fasta

下载silva数据库,把解压好的文件,自己准备的otu.fasta和TaxAss自己提供的【基于silva数据库制作的淡水微生物数据库】都放放在/tax-scripts文件夹中。进入/tax-scripts文件夹,在终端中执行如下操作进行unalign

./un-align_silva.sh silva.nr_v132.align general.fasta silva.nr_v132.tax general.taxonomy

unalign后,得到体积缩小的silva数据库fasta文件和taxonomy文件

需要准备如下文件作为输入:

文件名功能描述
custom.fastafasta sequences in your small, ecosystem-specific taxonomy database
custom.taxonomytaxonomy names in your small, ecosystem-specific taxonomy database
general.fastafasta sequences in your large, general comrpehensive taxonomy database
general.taxonomytaxonomy names in your large, general comprehensive taxonomy database
otus.fastafasta sequences for each of your OTUs (OTUs can be clustered or unique sequences)
otus.abundrelative abundance of each OTU (i.e. the OTU table) (可以没有)

1 建立blast数据库

https://blog.csdn.net/herokoking/article/details/70237180https://blog.csdn.net/herokoking/article/details/70237180
新版blast+的使用简介

   当需要进行大量比对的时候,将BLAST数据库本地化能极大提高效率。


    makeblastdb -dbtype nucl -in custom.fasta -input_type fasta -parse_seqids -out custom.db
    输出结果:

    Building a new DB, current time: 02/21/2019 18:54:04
    New DB name: /media/heymonica_42/新加卷/TaxAss-master/tax-scripts/custom.db
    New DB title: custom.fasta
    Sequence type: Nucleotide
    Keep MBits: T
    Maximum file size: 1000000000B
    Adding sequences from FASTA; added 1318 sequences in 0.052067 seconds

.

2 进行blast

将otu中每条数据与生成的数据库比对,找到各自hit数最高的条目

blastn -query otu.fasta -task megablast -db custom.db -out otus.custom.blast -outfmt 11 -max_target_seqs 5 

需要2~3分钟(1.9万条otu)

3 对blast结果进行格式重整

重整后,使之成为R能处理的格式

blast_formatter -archive otus.custom.blast -outfmt "6 qseqid pident length qlen qstart qend" -out otus.custom.blast.table

4 重新计算总体pident

Rscript calc_full_length_pident.R otus.custom.blast.table otus.custom.blast.table.modified

5 过滤blast结果

Rscript filter_seqIDs_by_pident.R otus.custom.blast.table.modified ids.above.98 98 TRUE 
Rscript filter_seqIDs_by_pident.R otus.custom.blast.table.modified ids.below.98 98 FALSE

结果提示:

Added 765 sequences matching the custom database with >= 98 % sequence identity to ids.above.98 
Assign taxonomy to these sequences using the Small Custom database

Added 19026 sequences matching the custom database with < 98 % sequence identity to ids.below.98 
Assign taxonomy to these sequences using the Large General database

6 检查blast参数设置是否合适

Rscript plot_blast_hit_stats.R otus.custom.blast.table.modified 98 plots

可能需要保持联网状态,因为如果有些包没有安装,R会自动联网安装
看图,保证98% pident cutoff处hit5几乎没有,即说明可以使用。

7 复原缺失的seqIDs

python find_seqIDs_blast_removed.py otu.fasta otus.custom.blast.table.modified ids.missing
cat ids.below.98 ids.missing > ids.below.98.all

8 创建seqIDs的fasta文件

python create_fastas_given_seqIDs.py ids.above.98 otu.fasta otus.above.98.fasta
python create_fastas_given_seqIDs.py ids.below.98.all otu.fasta otus.below.98.fasta

9 注释物种

高于98%cutoff的使用自定义的数据库(淡水)进行注释,低于98%cutoff的使用通用数据库(silva)进行注释

mothur "#classify.seqs(fasta=otus.above.98.fasta, template=custom.fasta,  taxonomy=custom.taxonomy, method=wang, probs=T, processors=2, cutoff=0)"
mothur "#classify.seqs(fasta=otus.below.98.fasta, template=general.fasta, taxonomy=general.taxonomy, method=wang, probs=T, processors=2, cutoff=0)"

返回:

It took 1 secs to classify 765 sequences.
It took 0 secs to create the summary file for 765 sequences.
Output File Names: 
otus.above.98.custom.wang.taxonomy
otus.above.98.custom.wang.tax.summary

It took 389 secs to classify 19066 sequences.
It took 0 secs to create the summary file for 19066 sequences.
Output File Names: 
otus.below.98.general.wang.taxonomy
otus.below.98.general.wang.tax.summary
otus.below.98.general.wang.flip.accnos

10 组合分类文件

之后对taxonomy文件的格式进行处理,使之成为R能处理的格式

sed 's/[[:blank:]]/\;/' <otu.custom.wang.taxonomy >otus.98.taxonomy.reformatted
mv otus.98.taxonomy.reformatted otus.98.taxonomy

sed 's/[[:blank:]]/\;/' <otu.general.wang.taxonomy >otus.general.taxonomy.reformatted
mv otus.general.taxonomy.reformatted otus.general.taxonomy

11 生成只有通用物种注释的分类表

mothur "#classify.seqs(fasta=otu.fasta, template=general.fasta, taxonomy=general.taxonomy, method=wang, probs=T, processors=2, cutoff=0)"
cat otu.general.wang.taxonomy > otus.general.taxonomy

12 重整格式

sed 's/[[:blank:]]/\;/' <otus.98.taxonomy >otus.98.taxonomy.reformatted
mv otus.98.taxonomy.reformatted otus.98.taxonomy

# (optional- only do if did step 11):
sed 's/[[:blank:]]/\;/' <otus.general.taxonomy >otus.general.taxonomy.reformatted
mv otus.general.taxonomy.reformatted otus.general.taxonomy

13 比较分类文件

mkdir conflicts_98
Rscript find_classification_disagreements.R otus.98.taxonomy otus.general.taxonomy ids.above.98 conflicts_98 98 80 80

14 选择合适的pident cutoff

再以94%、96% cutoff重复第5, 7-10, & 12-13步

# 94% cutoff:

Rscript filter_seqIDs_by_pident.R otus.custom.blast.table.modified ids.above.94 94 TRUE 
Rscript filter_seqIDs_by_pident.R otus.custom.blast.table.modified ids.below.94 94 FALSE

python find_seqIDs_blast_removed.py otu.fasta otus.custom.blast.table.modified ids.missing
cat ids.below.94 ids.missing > ids.below.94.all

python create_fastas_given_seqIDs.py ids.above.94 otu.fasta otus.above.94.fasta
python create_fastas_given_seqIDs.py ids.below.94.all otu.fasta otus.below.94.fasta

mothur "#classify.seqs(fasta=otus.above.94.fasta, template=custom.fasta,  taxonomy=custom.taxonomy, method=wang, probs=T, processors=2, cutoff=0)"
mothur "#classify.seqs(fasta=otus.below.94.fasta, template=general.fasta, taxonomy=general.taxonomy, method=wang, probs=T, processors=2, cutoff=0)"

返回:

# It took 1 secs to classify 2899 sequences.
# It took 0 secs to create the summary file for 2899 sequences.
# Output File Names: 
# otus.above.94.custom.wang.taxonomy
# otus.above.94.custom.wang.tax.summary

# It took 348 secs to classify 16932 sequences.
# It took 0 secs to create the summary file for 16932 sequences.
# Output File Names: 
# otus.below.94.general.wang.taxonomy
# otus.below.94.general.wang.tax.summary
# otus.below.94.general.wang.flip.accnos

输入

sed 's/[[:blank:]]/\;/' <otu.custom.wang.taxonomy >otus.94.taxonomy.reformatted
mv otus.94.taxonomy.reformatted otus.94.taxonomy

sed 's/[[:blank:]]/\;/' <otu.general.wang.taxonomy >otus.general.taxonomy.reformatted
mv otus.general.taxonomy.reformatted otus.general.taxonomy
 
sed 's/[[:blank:]]/\;/' <otus.94.taxonomy >otus.94.taxonomy.reformatted
mv otus.94.taxonomy.reformatted otus.94.taxonomy

# (optional- only do if did step 11):
sed 's/[[:blank:]]/\;/' <otus.general.taxonomy >otus.general.taxonomy.reformatted
mv otus.general.taxonomy.reformatted otus.general.taxonomy

mkdir conflicts_94
Rscript find_classification_disagreements.R otus.94.taxonomy otus.general.taxonomy ids.above.94 conflicts_94 94 80 80

# 96% cutoff
Rscript filter_seqIDs_by_pident.R otus.custom.blast.table.modified ids.above.96 96 TRUE 
Rscript filter_seqIDs_by_pident.R otus.custom.blast.table.modified ids.below.96 96 FALSE

python find_seqIDs_blast_removed.py otu.fasta otus.custom.blast.table.modified ids.missing
cat ids.below.96 ids.missing > ids.below.96.all

python create_fastas_given_seqIDs.py ids.above.96 otu.fasta otus.above.96.fasta
python create_fastas_given_seqIDs.py ids.below.96.all otu.fasta otus.below.96.fasta


mothur "#classify.seqs(fasta=otus.above.96.fasta, template=custom.fasta, taxonomy=custom.taxonomy, method=wang, probs=T, processors=2, cutoff=0)"
mothur "#classify.seqs(fasta=otus.below.96.fasta, template=general.fasta, taxonomy=general.taxonomy, method=wang, probs=T, processors=2, cutoff=0)"

#返回:

# It took 1 secs to classify 1645 sequences.
# It took 0 secs to create the summary file for 1645 sequences.
# Output File Names: 
# otus.above.96.custom.wang.taxonomy
# otus.above.96.custom.wang.tax.summary


# It took 374 secs to classify 18186 sequences.
# It took 1 secs to create the summary file for 18186 sequences.
# Output File Names: 
# otus.below.96.general.wang.taxonomy
# otus.below.96.general.wang.tax.summary
# otus.below.96.general.wang.flip.accnos

输入

sed 's/[[:blank:]]/\;/' <otu.custom.wang.taxonomy >otus.96.taxonomy.reformatted
mv otus.96.taxonomy.reformatted otus.96.taxonomy


sed 's/[[:blank:]]/\;/' <otu.general.wang.taxonomy >otus.general.taxonomy.reformatted
mv otus.general.taxonomy.reformatted otus.general.taxonomy
sed 's/[[:blank:]]/\;/' <otus.96.taxonomy >otus.96.taxonomy.reformatted
mv otus.96.taxonomy.reformatted otus.96.taxonomy


# (optional- only do if did step 11):
sed 's/[[:blank:]]/\;/' <otus.general.taxonomy >otus.general.taxonomy.reformatted
mv otus.general.taxonomy.reformatted otus.general.taxonomy
mkdir conflicts_96
Rscript find_classification_disagreements.R otus.96.taxonomy otus.general.taxonomy ids.above.96 conflicts_96 96 80 80

最后再执行如下命令

Rscript plot_classification_disagreements.R otus.abund plots regular NA NA conflicts_94 ids.above.94 94 conflicts_96 ids.above.96 96 conflicts_98 ids.above.98 98

15 生成最终分类表

将所有低于cutoff的均标记为unclassified

Rscript plot_classification_disagreements.R otus.abund MakeSeqIDReadsOnly
Rscript find_classification_disagreements.R otus.98.taxonomy quickie ids.above.98 conflicts_98 98 80 80 final
Rscript find_classification_disagreements.R otus.96.taxonomy quickie ids.above.96 conflicts_96 96 80 80 final
Rscript find_classification_disagreements.R otus.94.taxonomy quickie ids.above.94 conflicts_94 94 80 80 final

得到

otus.98.80.80.taxonomy
otus.96.80.80.taxonomy
otus.94.80.80.taxonomy

16 清理

rm custom.db.* custom.8mer custom.custom* custom.tree* general.8mer general.general* general.tree* *wang* mothur.*.logfile otus.custom.blast* ids* otus.below*.fasta otus.above*.fasta otus.[0-9][0-9].taxonomy otus.[0-9][0-9][0-9].taxonomy otus.general.taxonomy otus.custom.taxonomy otus.custom.[0-9]* custom.general* *pvalues total* final*names

mkdir scripts
mv *.py *.R *.sh *.md *.Rmd *.html scripts

mkdir analysis/
mv conflicts* plots/ analysis/

mkdir data 
mv otus* data/

mkdir databases
mv *.taxonomy *.fasta databases

17 重整分类表格式

用这个数据库注释的微生物名称后面会带着括号,里面应该是置信值什么的,为了方便后续操作,使用正则表达式把分类表中的括号去掉

sed "s/([0-9]*)//g" otus.94.80.80.taxonomy -i
  • 0
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值