SMURF(5R)-Science封面文章使用的16S新流程(二)_short multiple regions framework, smurf-CSDN博客

本文链接：https://blog.csdn.net/zd200572/article/details/113427103

本文档记录了一次将SMURF流程转换为6R流程的尝试，主要涉及环境配置、脚本安装、文件结构调整以及运行过程中遇到的错误和警告。在6R流程中，对肿瘤微生物组数据进行了分析，虽然遇到了MATLAB代码的适配问题，但最终得到了物种分类结果。然而，尝试将6区的流程应用时，由于缺少必要的文件和可能的代码不兼容问题导致运行失败，作者决定转向qiime2-slide插件作为替代方案。

摘要由CSDN通过智能技术生成

前面介绍的SMURF流程的运行以失败告终了，不过这个是这篇文章的参考方法，至于这篇文章改进过的方法，还没有试过，这就试一下，顺便考虑是否能把6区的移植过来，搞个6R呢，可能，算法上有略微的区别，毕竟这篇Science研究的是肿瘤中的含量很少的微生物，用了严格的去污染策略，不管怎样，试试吧！

1、环境准备

类似上次那个流程，更加简单了些，只需要安装解压下。

# 安装MCR,这次是新版本的9.7
# 这是无图形界面安装的命令
sudo ./install -mode silent -agreeToLicense  yes
/usr/local/MATLAB/MATLAB_Runtime/v97/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v97/bin/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v97/sys/os/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v97/extern/bin/glnxa64

#　下载地址，速度还可以，毕竟大公司
wget https://ssd.mathworks.com/supportfiles/downloads/R2019b/Release/4/deployment_files/installer/complete/glnxa64/MATLAB_Runtime_R2019b_Update_4_glnxa64.zip
#　解压，然后安装,这次不需要加环境变量了，因为变成了脚本的一个参数
# 建议新建一个文件夹解压，否则文件挺乱
sudo ./install
#　克隆脚本
git clone https://github.com/NoamShental/5R.git
# 开始运行示例还是出现了摄氏，发现是fastq压缩成了zip，解压就好啦,在这个文件夹example_fastq
# 文件结构如下
example_fastq/
├── RDB123_ATGAGTGC
│   ├── RDB123_ATGAGTGC_L006_R1_001.fastq
│   └── RDB123_ATGAGTGC_L006_R2_001.fastq
└── RDB1_TTGGTGCA
    ├── RDB1_TTGGTGCA_L001_R1_001.fastq
    └── RDB1_TTGGTGCA_L001_R2_001.fastq
# 后面发现不建立一个样本一个文件夹也是可以的，脚本会自动复制文件到一个新的Samples文件夹
#　结构如下:
	Samples
        └── RDB1_TTGGTGCA
            ├── RDB1_TTGGTGCA_L001_R1_001.fastq
            └── RDB1_TTGGTGCA_L001_R2_001.fastq
# 运行，看结果啦，好像全程使用不超过双核呢，还是样本少，跑不起来
./5R_linux/run_main_5R.sh /usr/local/MATLAB/MATLAB_Runtime/v97 ./example_fastq/RDB1_TTGGTGCA ./GG_5R ./example_results/5R_SMURF_example.txt 126

运行过程比较顺畅，基本不报错，除了两个warning

------------------------------------------
Setting up environment variables
---
LD_LIBRARY_PATH is .:/usr/local/MATLAB/MATLAB_Runtime/v97/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v97/bin/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v97/sys/os/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v97/sys/opengl/lib/glnxa64
Input params are: 
Working on files in directory: ./example_fastq/RDB1_TTGGTGCA
Reconstruction using kmers of length: 126
WORKING ON SAMPLE : RDB1_TTGGTGCA
Part 1/1 - Block 1/2
Part 1/1 - Block 2/2
Number of reads: 299346
Percent of long enough reads: 1
Percent of good reads: 0.95608
Counting fasta write: 1
历时 5.675209 秒。
Mapped to primers 88% of unique reads
Mapped to primers 98% of read counts
Loading bacterial DB for region 1 out of 5 from original region 1
...
Loading bacterial DB for region 5 out of 5 from original region 5
Region 1 out of 5
Keep high freq: 8% of reads
Keep high freq: 89% of counts
Building matrix M
Building matrix A
--------------------------------------------
...
--------------------------------------------
Region 5 out of 5
Keep high freq: 5% of reads
Keep high freq: 90% of counts
Building matrix M
Building matrix A
--------------------------------------------
警告: Make sure PE is supported properly
> In solve_iterative_noisy (line 4)
  In reconstruction_func (line 40)
  In main_multiple_regions (line 56)
  In main_5R (line 57)
Region 1 out of 5
Keeping reads matched to DB: 95% of reads
Keeping reads matched to DB: 97% of counts
--------------------------------------------
...
--------------------------------------------
Region 5 out of 5
Keeping reads matched to DB: 98% of reads
Keeping reads matched to DB: 99% of counts
--------------------------------------------
Filter out columns (bacteria)
警告: TAKE PROPER CARE OF NOT AMPLIFIED REGIONS
> In solve_iterative_noisy (line 90)
  In reconstruction_func (line 40)
  In main_multiple_regions (line 56)
  In main_5R (line 57)
Normalize frequency counts
Build matrix A_L2
Making columns of A unique...
Removing included bacterias...
Removed 8844 out of 10189
警告: Found 721 bacterias with non even number of reads mapped
> In solve_iterative_noisy (line 178)
  In reconstruction_func (line 40)
  In main_multiple_regions (line 56)
  In main_5R (line 57)
Starting iterations...
Total iterations time: 1.1721
Building the Scott files for level: species
Loaded RDB1_TTGGTGCA

从过程来看，６Ｖ区的运行过程几乎完全一致，看看结果：

Total # of reads                                                        205605
domain  phylum  class   order   family  genus   species RDB1_TTGGTGCA
Bacteria        Actinobacteria  Acidimicrobiia  Acidimicrobiales        Unknown family  Unknown genus271        Unknown species1        0.002213
Bacteria        Actinobacteria  Actinobacteria  Actinomycetales Actinomycetaceae        Actinomyces     Actinomyces massiliensis        0.000603
Bacteria        Actinobacteria  Actinobacteria  Actinomycetales Actinomycetaceae        Actinomyces     Actinomyces naeslundii  0.000746
Bacteria        Actinobacteria  Actinobacteria  Actinomycetales Actinomycetaceae        Actinomyces     Actinomyces odontolyticus       0.000939

终于见到物种分类结果啦，未分类的它进行了自编号。

试试６Ｖ区行不行

# 复制一份出来，开始
cp -r 5R 6R
cd 6R
# 观察文件，替换６Ｒ需要的文件
│   ├── GreenGenes_201305_unique_up_to_3_ambiguous_16S_ffpe5regions_2mm_RL160_region1.mat
│   ├── GreenGenes_201305_unique_up_to_3_ambiguous_16S_ffpe5regions_2mm_RL160_region2.mat
│   ├── GreenGenes_201305_unique_up_to_3_ambiguous_16S_ffpe5regions_2mm_RL160_region3.mat
│   ├── GreenGenes_201305_unique_up_to_3_ambiguous_16S_ffpe5regions_2mm_RL160_region4.mat
│   ├── GreenGenes_201305_unique_up_to_3_ambiguous_16S_ffpe5regions_2mm_RL160_region5.mat
│   ├── GreenGenes_201305_unique_up_to_3_ambiguous_16S_headers.mat
│   └── taxonomy_db.mat
#　应该就是这几个文件啦,先删除,文件夹名字就不改啦，防止脚本不识别
rm GG_5R/*
# 物种注释文件吧？
cp ../SMURF/Green_Genes_201305/unique_up_to_3_ambiguous_16S/gg_rdp_taxa_name_calls.mat GG_5R/taxonomy_db.mat
# 序列文件
cp ../SMURF/Green_Genes_201305/unique_up_to_3_ambiguous_16S_amp6Regions_2mm_RL130/*  GG_5R/
# 再把测序文件拉来试试，先删除原来的
rm -r example_fastq/*
cp ../SMURF/Example/* example_fastq/

运行下，看看行不行啦

./5R_linux/run_main_5R.sh /usr/local/MATLAB/MATLAB_Runtime/v97 ./example_fastq ./GG_5R ./example_results/5R_SMURF_example.txt 126
# 报少个文件，好像是总的，复制原来的过来试试
cp ../5R/GG_5R/GreenGenes_201305_unique_up_to_3_ambiguous_16S_headers.mat GG_5R/
# 因为没有解压而报错
# WORKING ON SAMPLE : Samples
# 索引超出数组元素的数目(0)。
rm -r example_fastq/Samples/
gunzip example_fastq/*
# 还是报错了，应该是要修改ｍatlab代码才能解决，还是转向qiime2-slide这个插件吧
Mapped to primers 0% of unique reads
Mapped to primers 0% of read counts
索引超出数组元素的数目(0)。

出错 load_bact_DB (line 33)

出错 reconstruction_func (line 9)

出错 main_multiple_regions (line 56)

出错 main_5R (line 57)

MATLAB:badsubscript