1、使用Trim Galore软件对两次数据进行质控,去掉20bp以下的reads
vim新建RNA_seq_script_1对CPIV3测序数据进行质控分析
#!/bin/bash
# 上面一行宣告这个script的语法使用bash语法,当程序被执行时,能够载入bash的相关环境配置文件。
# Program
# This program is used for RNA-seq data analysis.
# History
# 2023/07/13 zexing First release
# 设置变量${dir}为常用目录
dir=/home/customer/lizexing/projects/CPIV3/
# 使用fastqc软件对数据进行质控分析
# fastqc -t 8 -o ${dir}/fastqc_report/ ${dir}/raw_data/*.fq.gz
# 对数据利用Trim_galore去掉20bp以下的接头
trim_galore -q 8 --phred33 --stringency 3 --length 20 -e 0.1 -j 4 --paired \
${dir}/CleanData/C3_1.clean.fq \
${dir}/CleanData/C3_2.clean.fq \
-o ${dir}/clean_data/
后台运行RNA_seq_script_1:
nohup bash RNA_seq_script_1 > RNA_seq_script_1_log &
2、 使用Bowtie2软件对牛的基因组[bosTau9]构建索引
vim新建RNA_seq_script_2对牛的基因组构建索引
# 参数说明
# bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]
# <bt2-idx> Index filename prefix (minus trailing .X.bt2).
# NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
# <m1> Files with #1 mates, paired with files in <m2>.
# Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
# <m2> Files with #2 mates, paired with files in <m1>.
# Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
# <r> Files with unpaired reads.
# Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
# <i> Files with interleaved paired-end FASTQ/FASTA reads
# Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
# <bam> Files are unaligned BAM sorted by read name.
# <sam> File for SAM output (default: stdout)
# <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
# bowtie2-build 基因组序列.fq 输出Index前缀 -p CPU数目
bowtie2-build GCF_002263795.2_ARS-UCD1.3_genomic.fna Bos -p 8
运行如下脚本:
nohup bash RNA_seq_script_2 > RNA_seq_script_2_log &
3、使用Tophat2软件对测序数据与牛的基因组进行比对
之前没有使用过Tophat2进行序列比对,这次因为第一次做病毒基因组从头组装,参考RNA病毒基因组组装指南、TopHat的使用以及输出文件安装、使用Tophat2软件进行比对。
(1)、先从Tophat官网下载相应的软件包进行安装;
(2)、使用Tophat需要提供bowtie软件提供的Index,官网已经提供了很多物种的Index及GTF格式文件,我自己还弄了半天;
(3)、tophat2报错:软件解压缩后,运行依赖于python2才可以运行,而现在的服务器默认的多为python3,看了网上很多办法,基本都是通过conda建立一个python2的环境后再运行Tophat2,我这边因为只能连接校园网,还没跟清华镜像联网,导致conda无法建立新环境(实属无奈),上网查到这个简便方法:
# 在tophat文件夹中打开tophat脚本
vim tophat
# 对该脚本的第一行进行python版本的指定
原命令为:#!/usr/bin/env python
新命令为:#!/usr/bin/env python2
# 仅添加了一个2的版本,省却了我很大的麻烦,必须记录、推广一下
(4)、通过help命令查看软件运行情况
(base) [customer@node01 tophat]$ bash tophat2 -h
Usage:
tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] \
[quals1,[quals2,...]] [quals1[,quals2,...]]
(5)、按照默认参数对测序结果进行比对,命令如下:
#!/bin/bash
# 上面一行宣告这个script的语法使用bash语法,当程序被执行时,能够载入bash的相关环境配置文件。
# This program is used for RNA-seq data analysis.
# History
# 2023/07/13 zexing First release
# 设置变量${dir}为常用目录
# dir=/home/customer/lizexing/projects/CPIV3/
# tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] [quals1,[quals2,...]] [quals1[,quals2,...]]
#-o 输出文件位置
#-p 几个CPU
#-G gtf文件
#index文件
#reads1
#reads2
tophat2 -o /home/customer/lizexing/projects/CPIV3/clean_data/mapped \
-p 8 \
-G /home/customer/lizexing/references/NCBI/Bostaurus/GCF_002263795.2.gtf \
/home/customer/lizexing/references/NCBI/Bostaurus/bt2_index/Bos \
/home/customer/lizexing/projects/CPIV3/clean_data/C3_1.clean_val_1.fq \
/home/customer/lizexing/projects/CPIV3/clean_data/C3_2.clean_val_2.fq
运行的时间比较长,我是利用nohup命令将其放到了后台运行。
# 软件运行记录如下:
[2023-07-14 13:32:23] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2023-07-14 13:32:23] Checking for Bowtie
Bowtie version: 2.4.2.0
[2023-07-14 13:32:23] Checking for Bowtie index files (genome)..
[2023-07-14 13:32:23] Checking for reference FASTA file
Warning: Could not find FASTA file /home/customer/lizexing/references/NCBI/Bostaurus/bt2_index/Bos.fa
[2023-07-14 13:32:23] Reconstituting reference FASTA file from Bowtie index
Executing: /home/customer/.conda/envs/RNA-ChIP-Seq/bin/bowtie2-inspect /home/customer/lizexing/references/NCBI/Bostaurus/bt2_index/Bos > /home/customer/lizexing/projects/CPIV3/tophat_out/tmp/Bos.fa
[2023-07-14 13:34:02] Generating SAM header for /home/customer/lizexing/references/NCBI/Bostaurus/bt2_index/Bos
[2023-07-14 13:34:04] Reading known junctions from GTF file
[2023-07-14 13:34:24] Preparing reads
left reads: min. length=20, max. length=150, 15779194 kept reads (18 discarded)
right reads: min. length=20, max. length=150, 15779192 kept reads (20 discarded)
[2023-07-14 13:43:14] Building transcriptome data files /home/customer/lizexing/projects/CPIV3/tophat_out/tmp/GCF_002263795.2
[2023-07-14 13:44:09] Building Bowtie index from GCF_002263795.2.fa
[2023-07-14 14:39:34] Mapping left_kept_reads to transcriptome GCF_002263795.2 with Bowtie2
[2023-07-14 15:08:02] Mapping right_kept_reads to transcriptome GCF_002263795.2 with Bowtie2
[2023-07-14 15:39:45] Resuming TopHat pipeline with unmapped reads
[2023-07-14 15:39:46] Mapping left_kept_reads.m2g_um to genome Bos with Bowtie2
[2023-07-14 15:49:49] Mapping left_kept_reads.m2g_um_seg1 to genome Bos with Bowtie2 (1/6)
[2023-07-14 15:50:58] Mapping left_kept_reads.m2g_um_seg2 to genome Bos with Bowtie2 (2/6)
[2023-07-14 15:52:20] Mapping left_kept_reads.m2g_um_seg3 to genome Bos with Bowtie2 (3/6)
[2023-07-14 15:53:30] Mapping left_kept_reads.m2g_um_seg4 to genome Bos with Bowtie2