RNA-seq全流程基础版之上游操作

小文学生信

已于 2023-04-10 19:35:30 修改

阅读量598

点赞数

分类专栏：生物信息学笔记文章标签： linux 学习

于 2023-02-12 23:35:53 首次发布

本文链接：https://blog.csdn.net/weixin_50614492/article/details/129000144

版权

生物信息学笔记专栏收录该内容

2 篇文章 3 订阅

订阅专栏

全流程介绍

公共数据库下载数据
转换数据库格式成为fasta.gz
fastqc或multiqc质控，trim-galore去接头和低质量序列
hisat2比对到基因组获取sam文件
samtools将sam文件转化为bam文件
subread进行count计数得到表达矩阵

软件安装(linux系统)
安装miniconda:wget -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh 清华源经常崩溃可自行更换镜像源。
安装后
source .bashrc
配置镜像

conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/
conda config --set show_channel_urls yes

miniconda安装后下列软件均可用conda安装也可以上官网找到对应版本链接wget安装
安装sra-tools、fastqc、multiqc、trim-galore、hisat2、samtools、subread。

conda install -y sra-tools # 出现三个done,即表示安装成功，否则重新提交命令，下同。
conda install -y fastqc
conda install -y fastqc
conda install -y trim-galore
conda install -y samtools
conda install -y subread

安装比对软件；hisat2并建立索引（常见物种有官方索引）

conda install -y hisat2

wget -c ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg38.tar.gz

**下载数据 **
sra-tools下载单个数据

prefetch SRR号

sra-tools批量下载数据：

nohup prefetch --option-file SRR_Acc_List.txt &

for id in *sra; do pfastq-dump --threads 4 ./$id --gzip; done

for id in *sra; do pfastq-dump --threads 4 ./$id --split-3 --gzip; done

for id in *sra
do
echo $id
fastq-dump --split-3 --gzip $id
done

质控
fastqc（生成两个文件* .html和*fastqc.gz），multiqc

ls SRR*.gz | while read id;do (nohup fastqc $id &);done

for id in *.gz
do 
echo $i
nohup fastqc $id -t 4 -o ./ &
done

multiqc：使用依赖python3（创建python=3的rnaseq2环境）

multiqc ./#对当前目录所有文件进行multiqc

去接头和低质量序列
trim-galore常用参数。
–quality：设定Phred quality score阈值，默认为20。分析时可改成25，稍微严格一些。
–phred33：选择-phred33或者-phred64，表示测序平台使用的Phred quality score。一般phred33对应(Sanger/Illumina 1.9+ encoding)，-phred64对应(Illumina 1.5 encoding)
–stringency：设定可以忍受的前后adapter重叠的碱基数，默认为1（非常苛刻）。
–length：设定输出reads长度阈值，小于设定值会被抛弃。
–paired：对于双端测序结果，一对reads中，如果有一个被剔除，那么另一个会被同样抛弃，而不管是否达到标准。
–gzip和–dont_gzip：清洗后的数据zip打包或者不打包。
–output_dir：输入目录。需要提前建立目录。

for（ i=71;i<=80；i++）in SRR144284$i.fastq.gz
do 
echo $i
nohup trim_galore -q 25 --length 36 --phred33 --paired --gzip -o ./clean/ &
done

ls *_1.fastq.gz >1
ls *_2.fastq.gz >2
paste 1 2 > config
cat config|while read id
do
        arr=($id)
        fq1=${arr[0]}
        fq2=${arr[1]}
nohup trim_galore -q 25 --phred33 --length 35 -e 0.1 --stringency 3 --gzip --paired -o ./clean/ $fq1 $fq2 &
done

for ((i=71;i<=80;i++));trim_galore -q 25 --length 36 --phred33 --paired --gzip -o ./clean/ &;done

比对到参考基因组（获取sam文件）
hisat2（有官方的索引文件）
建立基因组索引：

hisat2-build –p 4 genome.fa genome

建立基因组，转录组，SNP索引：

hisat2-build -p 8 genome.fa --snp genome.snp --ss genome.ss --exon genome.exon genome_snp_tran_index

hisat2用法：hisat2 -x 索引文件的前缀 -1 第一个文件 -2 第二个文件 -S 指定输出路径
hisat2常用参数。
-x：参考基因组索引文件的前缀。
-1：双端测序结果的第一个文件。若有多组数据，使用逗号将文件分隔。Reads的长度可以不一致。
-2：双端测序结果的第二个文件。若有多组数据，使用逗号将文件分隔，并且文件顺序要和-1参数对应。Reads的长度可以不一致。
-U：单端数据文件。若有多组数据，使用逗号将文件分隔。可以和-1、-2参数同时使用。Reads的长度可以不一致。
–sra-acc：输入SRA登录号，比如SRR353653，SRR353654。多组数据之间使用逗号分隔。HISAT将自动下载并识别数据类型，进行比对。
-S：指定输出的SAM文件。

 for ((i=71;i<=80;i++));
do nohup hisat2 -t -x ./reference/grch38/genome \
-1 ./rna_seq/data/SRR144284${i}.sra_1.fastq.gz \
-2 ./rna_seq/data/SRR144284${i}.sra_2.fastq.gz \
-S SRR144284${i}.sam &;done

将sam转化成bam并排序（bam是sam文件的二进制形式，更加节约空间）

for ((i=56;i<=62;i++));do samtools sort SRR144284${i}.bam -o SRR144284${i}_sorted.bam;done

hisat2比对的同时转换成bam文件

index=./reference/grch38/genome
ls *_1_val_1.fq.gz|while read id
do
id=${id/_1_val_1.fq.gz/}
hisat2 -p 1 -x ${index} \
-1 ${id}_1_val_1.fq.gz \
-2 ${id}_1_val_1.fq.gz \
2>${id}.hisat2.log| \
samtools sort -@ 2 -o ${id}.hisat2.sorted.bam
done

count计数
subread（含featureCounts）
subread常用参数
-input file：输入的bam/sam文件，支持多个文件输入
-a < string > ：参考gtf文件名，支持Gzipped文件格式
-F ：参考文件的格式，一般为GTF/SAF，C语言版本默认的格式为GTF格式
-A：提供一个逗号分割为两列的文件，一列为gtf中的染色体名，另一列为read中对应的染色体名，用于将gtf和read中的名称进行统一匹配，注意该文件提交时不需要列名
-g：gene_id表示meta-feature名称为gene_id(ensembl名称)
-M：如果设置-M，多重map的read将会被统计到
-o < string >：输出文件的名字
-O：允许多重比对，即当一个read比对到多个feature或多个metafeature的时候，这条read会被统计多次
-p：双端测序时使用
-T ：线程数目，1~32

运行featureCounts计数

featureCounts -T 6 -p -t exon -g gene_id -a ./gtf/gencode.v25.annotation.gtf.gz -o ./all.id.txt *sort.bam

简化结果，去除特征列，只保留基因计数的列

cat all.id.txt | cut -f1,7- > counts.txt

小文学生信

关注

0
点赞
踩
6

收藏

觉得还不错? 一键收藏
0
评论
RNA-seq全流程基础版之上游操作

sra-tools>multiqc>trim-galore>hisat2>samtools>subread流程的RNA-Seq。
复制链接

扫一扫

专栏目录