RNA病毒基因组的重头组装-内含tophat2报错的快速解决办法-CPIV3数据分析-2023-07-13

垚垚爸爱学习

已于 2023-07-18 16:11:14 修改

阅读量465

点赞数

文章标签：数据分析数据挖掘 linux

于 2023-07-15 10:11:52 首次发布

本文链接：https://blog.csdn.net/xiaomotong123/article/details/131700322

版权

该博客详细记录了RNA病毒基因组组装过程，遇到Tophat2在比对时的错误，并提供了解决方案。作者使用STAR软件作为替代，对牛基因组和CPIV3病毒基因组进行比对，并运用Velvet进行从头组装。过程中涉及数据质控、比对、去除宿主序列和组装等步骤。

摘要由CSDN通过智能技术生成

1、使用Trim Galore软件对两次数据进行质控，去掉20bp以下的reads

vim新建RNA_seq_script_1对CPIV3测序数据进行质控分析

#!/bin/bash
# 上面一行宣告这个script的语法使用bash语法，当程序被执行时，能够载入bash的相关环境配置文件。
# Program
#     This program is used for RNA-seq data analysis.
# History
#     2023/07/13       zexing            First release
# 设置变量${dir}为常用目录
dir=/home/customer/lizexing/projects/CPIV3/

# 使用fastqc软件对数据进行质控分析
# fastqc -t 8 -o ${dir}/fastqc_report/ ${dir}/raw_data/*.fq.gz

# 对数据利用Trim_galore去掉20bp以下的接头
trim_galore -q 8 --phred33 --stringency 3 --length 20 -e 0.1 -j 4 --paired \
${dir}/CleanData/C3_1.clean.fq \
${dir}/CleanData/C3_2.clean.fq \
-o ${dir}/clean_data/

后台运行RNA_seq_script_1：

nohup bash RNA_seq_script_1 > RNA_seq_script_1_log &

2、使用Bowtie2软件对牛的基因组[bosTau9]构建索引

vim新建RNA_seq_script_2对牛的基因组构建索引

# 参数说明
# bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]
#  <bt2-idx>  Index filename prefix (minus trailing .X.bt2).
#            NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
#  <m1>       Files with #1 mates, paired with files in <m2>.
#            Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
#  <m2>       Files with #2 mates, paired with files in <m1>.
#             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
#  <r>        Files with unpaired reads.
#             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
#  <i>        Files with interleaved paired-end FASTQ/FASTA reads
#             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
#  <bam>      Files are unaligned BAM sorted by read name.
#  <sam>      File for SAM output (default: stdout)
#  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.
# bowtie2-build 基因组序列.fq 输出Index前缀 -p CPU数目
bowtie2-build GCF_002263795.2_ARS-UCD1.3_genomic.fna Bos -p 8

运行如下脚本：

nohup bash RNA_seq_script_2 > RNA_seq_script_2_log &

3、使用Tophat2软件对测序数据与牛的基因组进行比对

之前没有使用过Tophat2进行序列比对，这次因为第一次做病毒基因组从头组装，参考RNA病毒基因组组装指南、TopHat的使用以及输出文件安装、使用Tophat2软件进行比对。

(1)、先从Tophat官网下载相应的软件包进行安装；
(2)、使用Tophat需要提供bowtie软件提供的Index，官网已经提供了很多物种的Index及GTF格式文件，我自己还弄了半天；
(3)、tophat2报错：软件解压缩后，运行依赖于python2才可以运行，而现在的服务器默认的多为python3，看了网上很多办法，基本都是通过conda建立一个python2的环境后再运行Tophat2，我这边因为只能连接校园网，还没跟清华镜像联网，导致conda无法建立新环境（实属无奈），上网查到这个简便方法：

# 在tophat文件夹中打开tophat脚本
vim tophat

# 对该脚本的第一行进行python版本的指定
原命令为：#！/usr/bin/env python
新命令为：#！/usr/bin/env python2
# 仅添加了一个2的版本，省却了我很大的麻烦，必须记录、推广一下

(4)、通过help命令查看软件运行情况

(base) [customer@node01 tophat]$ bash tophat2 -h
Usage:
    tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] \
                                    [quals1,[quals2,...]] [quals1[,quals2,...]]

(5)、按照默认参数对测序结果进行比对，命令如下：

#!/bin/bash
# 上面一行宣告这个script的语法使用bash语法，当程序被执行时，能够载入bash的相关环境配置文件。
#     This program is used for RNA-seq data analysis.
# History
#     2023/07/13       zexing            First release

# 设置变量${dir}为常用目录
# dir=/home/customer/lizexing/projects/CPIV3/

# tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] [quals1,[quals2,...]] [quals1[,quals2,...]]

#-o 输出文件位置
#-p 几个CPU
#-G gtf文件
#index文件
#reads1
#reads2

tophat2 -o /home/customer/lizexing/projects/CPIV3/clean_data/mapped \
-p 8 \
-G /home/customer/lizexing/references/NCBI/Bostaurus/GCF_002263795.2.gtf \
/home/customer/lizexing/references/NCBI/Bostaurus/bt2_index/Bos \
/home/customer/lizexing/projects/CPIV3/clean_data/C3_1.clean_val_1.fq \
/home/customer/lizexing/projects/CPIV3/clean_data/C3_2.clean_val_2.fq

运行的时间比较长，我是利用nohup命令将其放到了后台运行。

# 软件运行记录如下：
[2023-07-14 13:32:23] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2023-07-14 13:32:23] Checking for Bowtie
                  Bowtie version:        2.4.2.0
[2023-07-14 13:32:23] Checking for Bowtie index files (genome)..
[2023-07-14 13:32:23] Checking for reference FASTA file
        Warning: Could not find FASTA file /home/customer/lizexing/references/NCBI/Bostaurus/bt2_index/Bos.fa
[2023-07-14 13:32:23] Reconstituting reference FASTA file from Bowtie index
  Executing: /home/customer/.conda/envs/RNA-ChIP-Seq/bin/bowtie2-inspect /home/customer/lizexing/references/NCBI/Bostaurus/bt2_index/Bos > /home/customer/lizexing/projects/CPIV3/tophat_out/tmp/Bos.fa
[2023-07-14 13:34:02] Generating SAM header for /home/customer/lizexing/references/NCBI/Bostaurus/bt2_index/Bos
[2023-07-14 13:34:04] Reading known junctions from GTF file
[2023-07-14 13:34:24] Preparing reads
         left reads: min. length=20, max. length=150, 15779194 kept reads (18 discarded)
        right reads: min. length=20, max. length=150, 15779192 kept reads (20 discarded)
[2023-07-14 13:43:14] Building transcriptome data files /home/customer/lizexing/projects/CPIV3/tophat_out/tmp/GCF_002263795.2
[2023-07-14 13:44:09] Building Bowtie index from GCF_002263795.2.fa
[2023-07-14 14:39:34] Mapping left_kept_reads to transcriptome GCF_002263795.2 with Bowtie2
[2023-07-14 15:08:02] Mapping right_kept_reads to transcriptome GCF_002263795.2 with Bowtie2
[2023-07-14 15:39:45] Resuming TopHat pipeline with unmapped reads
[2023-07-14 15:39:46] Mapping left_kept_reads.m2g_um to genome Bos with Bowtie2
[2023-07-14 15:49:49] Mapping left_kept_reads.m2g_um_seg1 to genome Bos with Bowtie2 (1/6)
[2023-07-14 15:50:58] Mapping left_kept_reads.m2g_um_seg2 to genome Bos with Bowtie2 (2/6)
[2023-07-14 15:52:20] Mapping left_kept_reads.m2g_um_seg3 to genome Bos with Bowtie2 (3/6)
[2023-07-14 15:53:30] Mapping left_kept_reads.m2g_um_seg4 to genome Bos with Bowtie2