从序列到特征表:DADA2/Deblur/UNOISE in VSEARCH三种流程的分析脚本

From 16s rDNA amplicon sequence to feature table

序列是公司测序返回的数据,格式如下:

sample1_R1.fq.gz
sample1_R2.fq.gz
sample2_R1.fq.gz
sample2_R2.fq.gz
sample3_R1.fq.gz
sample3_R2.fq.gz

每个文件的格式如下(测试数据下载):

@测序的各种编号
正向引物+mysequence
+
测序质量
@测序的各种编号
反向引物+mysequence
+
测序质量
@测序的各种编号
反向引物+mysequence
+
测序质量
@测序的各种编号
正向引物+mysequence
+
测序质量

由于样本比较多,所以我用R生成的QIIME2 需要的manifest文件,并用Rmarkdown整理了了分析的脚本

knitr::opts_chunk$set(echo = TRUE,eval = FALSE)
knitr::opts_chunk$set(engine.opts = list(bash = "-l"))

用R生成manifest文件

library(stringr)
library(dplyr)
setwd("YOURPATH of sequences")
f <- dir("./")
f1 <- f[seq(1,length(f),2)]
f2 <- f[seq(2,length(f),2)]
#根据自身的序列名称进行拆分
id1 <- sapply(str_split(f1,"_"),function(x) x[1])
id2 <- sapply(str_split(f1,"_"),function(x) x[1])
sum(!(id1 == id2))
sampleid <- paste("sample",id1,sep="")
path <- "YOURPATH"
manifest <- tibble(`sample-id`=sampleid,
                   `forward-absolute-filepath`=paste(path,f1,sep=""),
                   `reverse-absolute-filepath`=paste(path,f2,sep=""))
write.table(manifest,
            "./16s-manifest.tsv",
            sep = "\t",
            quote = FALSE,
            col.names = TRUE,
            row.names = FALSE)

Vseach pipline for 16s amplicon sequences

Import manifest file
manifest <- readr::read_table(
  "YOURPATH of manifest file",
  col_names = TRUE)
Merge paired ends and relabel them

不怎么会用shell编程,就用R脚本来完成了,对不同的序列进行拼接。

setwd("YOURPATH")
for (i in 1:nrow(manifest)){
  comstr1 <- paste("vsearch --fastq_mergepairs ",
        manifest[i,2],
        " --reverse ",
        manifest[i,3],
        " --fastqout ",
        manifest[i,1],
        ".fq ",
        " --relabel ",
        manifest[i,1],
        ".",
        sep="")
  #comstr2 <- paste("gzip ",manifest[i,1],".fq",sep="")
  print(comstr1)
  system(comstr1)
  #print(comstr2)
  #system(comstr2)
}
Pool all merged sequences
mkdir ./2_merged
mv *.fq ./2_merged
cat ./2_merged/*.fq >> ./all.fq
Trim primer sequence

合并后的序列格式如下:

  • GTGYCAGCMGCCGCGGTAAmysequenceAAACTCAAAKRAATTGACGGCCbarcode

  • GGCCGYCAATTYMTTTRAGTTTmysequenceTTACCGCGGCKGCTGRCACbarcode

cutadapt -a TTACCGCGGCKGCTGRCAC \
	-a AAACTCAAAKRAATTGACGGCC \
	-g GGCCGYCAATTYMTTTRAGTTT \
    -g GTGYCAGCMGCCGCGGTAA \
	-o all_trimmed2.fq \
	-j 10 \
 	all_trimmed.fq
Quality control
vsearch --fastx_filter ./all_trimmed.fq \
  --fastq_maxee_rate 0.01 \
  --fastaout ./filtered.fa
Dereplication
vsearch --derep_fulllength ./filtered.fa \
                --sizeout --minuniquesize 2 \
                --output ./uniques.fa
Unoise
vsearch --cluster_unoise uniques.fa \
                --centroids zotus.fa \
                --uc uc_zotus.uc \
                --log unoise_log.txt \
                --threads 14
Chimeras detection
vsearch  --uchime3_denovo zotus.fa \
  --nonchimeras zotus_nochime.fa \
  --fasta_width 0 \
  --relabel zotu \
  --xsize \
  --log unchime3_log.txt

不知为何生成的zotus_nochime.fa文件中有小写的字母(懂的留言一下哈),需要用如下代码处理一下:

# from Denoising 16S Sequence Reads using UNOISE3 via VSEARCH and QIIME2
awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $1}}' \
  zotus_nochime.fa > zotus_nochime_nomask.fa

这项代码的含义如下(from chitgpt):

This command is using the awk programming language to process input line by line. Here’s a breakdown of the command:

  1. awk: This is the command itself, which invokes the awk programming language interpreter.
  2. 'BEGIN{FS=" "}': This is an awk block that is executed before processing any input lines. In this case, it sets the field separator (FS) to a space character, meaning that each line will be split into fields based on spaces.
  3. {if(!/>/){print toupper($0)}else{print $1}}: This is the main block of the awk command, which is executed for each input line. It contains an if-else statement.
    • if (!/>/): This condition checks if the input line does not contain the > character.
    • print toupper($0): If the condition is true, it prints the entire input line ($0) in uppercase using the toupper() function.
    • else: If the condition is false (i.e., the line contains >), it executes the following block.
    • print $1: It prints the first field ($1) of the input line, i.e., it prints everything before the first space character.

So, this command takes input lines, converts them to uppercase if they don’t contain >, and prints only the first field (before the first space character) if they do contain >.

Build feature table
vsearch --usearch_global filtered.fa \
       --db zotus_nochime.fa \
       --id 0.97 \
       --otutabout unoise3_zotu_table.txt \
       --biomout unoise3_zotu_table.biom  \
       --threads 14 \
       --log otutab_log.txt

DADA2 in QIIME2

conda activate qiime2-2022.8
DADA2 pipline
Import demultiplexed sequence into qiime2
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path ./16s-manifest.tsv \
  --output-path 16s.qza \
  --input-format PairedEndFastqManifestPhred33V2

qiime cutadapt trim-paired \
--i-demultiplexed-sequences 16s.qza \
--p-cores 10 \
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-f GGCCGYCAATTYMTTTRAGTTT \
--p-front-r GTGYCAGCMGCCGCGGTAA \
--p-front-r GGCCGYCAATTYMTTTRAGTTT \
--o-trimmed-sequences 16s-trimmed.qza  \
--quiet
Generate the summerize of trimmed sequences
qiime demux summarize \
  --i-data 16s-trimmed.qza \
  --o-visualization 16s-trimmed-summarize.qzv

#View the result to determine the value of parameters
#in next command "qiime dada2 denoise-paired"
#--p-trunc-len-f and --p-trunc-len-r
#Q20 was chosen as the threshold to determine the truncated length of reads
#What is the minimum of your sequence length?

qiime tools view 16s-trimmed-summarize.qzv
DADA2 denoise
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs 16s-trimmed.qza\
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 215 \
  --p-trunc-len-r 215 \
  --p-max-ee-f 2 \
  --p-max-ee-r 2 \
  --o-table 16s-table.qza \
  --o-representative-sequences 16s-rep-seqs.qza \
  --o-denoising-stats 16s-denoising-stats.qza \
  --p-n-threads 0

qiime metadata tabulate \
  --m-input-file 16s-denoising-stats.qza \
  --o-visualization 16s-denoising-stats.qzv

qiime tools view 16s-denoising-stats.qzv

biom convert -i feature-table.biom -o feature-table.txt --to-tsv
Taxonomy classification
#Extract reference reads 
qiime feature-classifier extract-reads \
  --i-sequences 2022.10.backbone.full-length.fna.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGCCGYCAATTYMTTTRAGTTT \
  --p-min-length 300 \
  --p-max-length 500 \
  --o-reads ref-seqs.qza

#Train the classifier 
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy 2022.10.backbone.tax.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads 16s-rep-seqs.qza \
  --o-classification 16s-taxonomy.qza

qiime metadata tabulate \
  --m-input-file 16s-taxonomy.qza \
  --o-visualization 16s-taxonomy.qzv

#qiime tools view 16s-taxonomy.qzv

qiime taxa barplot \
  --i-table 16s-table.qza \
  --i-taxonomy 16s-taxonomy.qza \
  --m-metadata-file 16s-manifest.tsv\
  --o-visualization 16s-taxa-bar-plots.qzv

#qiime tools view 16s-taxa-bar-plots.qzv
Phylogeny construction
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences 16s-rep-seqs.qza \
  --o-alignment 16s-aligned-rep-seqs.qza \
  --o-masked-alignment 16s-masked-aligned-rep-seqs.qza \
  --o-tree 16s-unrooted-tree.qza \
  --o-rooted-tree 16s-rooted-tree.qza

Deblur in QIIME2

conda activate qiime2-2022.8
导入序列到QIIME2
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path ./16s-manifest.tsv \
  --output-path 16s.qza \
  --input-format PairedEndFastqManifestPhred33V2
去掉引物即引物前面的barcode
qiime cutadapt trim-paired \
--i-demultiplexed-sequences 16s.qza \
--p-cores 10 \
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-f GGCCGYCAATTYMTTTRAGTTT \
--p-front-r GTGYCAGCMGCCGCGGTAA \
--p-front-r GGCCGYCAATTYMTTTRAGTTT \
--o-trimmed-sequences 16s-trimmed.qza  \
--quiet
拼接
qiime vsearch join-pairs \
  --i-demultiplexed-seqs 16s-trimmed.qza  \
  --o-joined-sequences demux-joined.qza \
  --p-threads 8
质控
qiime quality-filter q-score \
  --i-demux demux-joined.qza \
  --p-min-quality 20 \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza
deblur需要序列长度相同,确定裁剪的长度
qiime demux summarize \
  --i-data demux-joined-filtered.qza \
  --o-visualization demux-joined-filtered.qzv

qiime tools view demux-joined-filtered.qzv
deblur需要序列长度相同
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 380 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza \
  --p-jobs-to-start 10

biom convert -i feature-table.biom -o feature-table.txt --to-tsv
  • 2
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

道阻且长1994

您的鼓励有助于我创作更多高质量

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值