上游分析
下载bulk RNA-seq原始数据
已发布在这篇推文
在国家生物信息中心下载单细胞和普通转录组数据
数据清洗
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ ls raw_data | grep "f1" > gz1
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ cat gz1
CRR785672_f1.fq.gz
CRR785673_f1.fq.gz
CRR785674_f1.fq.gz
CRR785675_f1.fq.gz
CRR785676_f1.fq.gz
CRR785677_f1.fq.gz
CRR785678_f1.fq.gz
CRR785679_f1.fq.gz
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ ls raw_data | grep "r2" > gz2
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ cat gz2
CRR785672_r2.fq.gz
CRR785673_r2.fq.gz
CRR785674_r2.fq.gz
CRR785675_r2.fq.gz
CRR785676_r2.fq.gz
CRR785677_r2.fq.gz
CRR785678_r2.fq.gz
CRR785679_r2.fq.gz
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ paste gz1 gz2>config_file
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ cat config_file
CRR785672_f1.fq.gz CRR785672_r2.fq.gz
CRR785673_f1.fq.gz CRR785673_r2.fq.gz
CRR785674_f1.fq.gz CRR785674_r2.fq.gz
CRR785675_f1.fq.gz CRR785675_r2.fq.gz
CRR785676_f1.fq.gz CRR785676_r2.fq.gz
CRR785677_f1.fq.gz CRR785677_r2.fq.gz
CRR785678_f1.fq.gz CRR785678_r2.fq.gz
CRR785679_f1.fq.gz CRR785679_r2.fq.gz
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ conda activate software_1
按照文章的要求,修改了几个参数:
Raw reads were filtered to remove reads with adapters, reads with undetermined base information (N) content greater than 1% and reads with Qphred ≤20 accounting for more than 50% of the entire read length.
写trim_galore
的脚本,命名为run_fastq_clean.sh
#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n clean_data
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log
#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#===========================================================
#运行脚本
#===========================================================
cat config_file | while read id
do
sample_dir="./raw_data"
output_dir="./1_clean_data"
arr=($id)
fq1=${arr[0]}
fq2=${arr[1]}
sample_dir1="$sample_dir/$fq1"
sample_dir2="$sample_dir/$fq2"
trim_galore -q 20 --max_n 1 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $output_dir $sample_dir1 $sample_dir2
done
修改权限,并运行代码:
(software_1) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ chmod 777 run_fastq_clean.sh
(software_1) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ dsub -s run_fastq_clean.sh
JOBID MESSAGE
3956 submit job successfully
比对——STAR
之前的推文中已经构好了猪的索引,可以直接比对
猪_RNA_seq的Clean_data上游分析&下游分析总结
查看测序读段的碱基序列长度:
(software_1) [animal_geneticB@master-cli1-x86-agent1 raw_data]$ zcat CRR785672_f1.fq.gz | head -n 4 | tail -n 1 | tr -d '+\n' | wc -c
150
写脚本,命名为run_mapping.sh
,在1_clean_data
目录下运行该代码
#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n mapping_py_sus
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log
#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#===========================================================
#运行测试脚本
#===========================================================
GENOME_INDEX="/workspace/home/pig/animal_geneticB/RNA_seq/reference/index/star/sscrofa"
OUTPUT_DIR="/workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping"
THREADS=20
for R1_FILE in /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/1_clean_data/*_f1_val_1.fq.gz; do
# 提取样本名时去除 '_f1_val_1.fq.gz' 部分
SAMPLE_NAME=$(basename "$R1_FILE" _f1_val_1.fq.gz)
# 构建R2文件名,根据新的命名规则
R2_FILE="${SAMPLE_NAME}_r2_val_2.fq.gz"
# 确认R2文件存在
if [[ ! -f "$R2_FILE" ]]; then
echo "Missing R2 file for $R1_FILE"
continue
fi
OUT_PREFIX="$OUTPUT_DIR/$SAMPLE_NAME":
echo "Processing $SAMPLE_NAME"
# 注意:此处未直接解压fastq.gz文件,如果文件实际上是gzip压缩的,请先解压或修改命令处理
STAR --runThreadN $THREADS \
--genomeDir "$GENOME_INDEX" \
--readFilesIn "$R1_FILE" "$R2_FILE" \
--readFilesCommand zcat \
--outFileNamePrefix "$OUT_PREFIX" \
--outSAMtype BAM SortedByCoordinate \
--sjdbOverhang 150 \
#sjdbOverhang后面的值一定要和构建索引时的值是一致的
if [ $? -eq 0 ]; then
echo "Completed: $SAMPLE_NAME"
else
echo "Failed: $SAMPLE_NAME"
fi
done
定量
- 判断链是否为特异性链
(software_1) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ conda activate my_hisat2_env
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ gtfToGenePred -genePredExt -geneNameAsName2 /workspace/home/pig/animal_geneticB/RNA_seq/reference/Sscrofa/Sus_scrofa.Sscrofa11.1.112.PCGs.gtf gene.tmp
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ ls
gene.tmp
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ awk '{print $2"\t"$4"\t"$5"\t"$1"\t0\t"$3"\t"$6"\t"$7"\t0\t"$8"\t"$9"\t"$10}' gene.tmp > genes_refseq.bed12
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ ls
genes_refseq.bed12 gene.tmp
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785672:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0189
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4927
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4884
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785673:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0326
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4864
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4810
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785674:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0312
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4859
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4828
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785675:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0295
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4868
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4837
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785676:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0182
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4946
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4872
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785677:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0210
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4906
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4884
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785678:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0263
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4893
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4843
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785679:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0214
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4898
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4888
是双端的链非特异性,选择no
这个参数
- 定量
写脚本,命名为run_count.sh
#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n py_pig_count
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log
#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#===========================================================
#运行脚本
#===========================================================
# 定义变量
GENE_ANNOTATION="/workspace/home/pig/animal_geneticB/RNA_seq/reference/Sscrofa/Sus_scrofa.Sscrofa11.1.112.PCGs.gtf" #基因注释文件路径,注意,只用编码基因的注释文件
COUNT_OUTPUT_DIR="/workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/4_count"
HTSEQ_STRAND="no" # 或者 "no"、"yes"、"reverse"根据您的实验 strandedness 设置
# 创建输出目录(如果不存在)
#mkdir -p "$COUNT_OUTPUT_DIR"
# 遍历STAR生成的已排序BAM文件
for BAM_FILE in /workspace/home/pig/animal_geneticB/Analysis/LQP_iWAT_mWAT_Liver/mapping/*Aligned.sortedByCoord.out.bam; do
# 获取样本名
SAMPLE_NAME=$(basename "$BAM_FILE" _Aligned.sortedByCoord.out.bam)
# 定义输出计数文件的全路径
COUNT_FILE="$COUNT_OUTPUT_DIR/${SAMPLE_NAME}_htseq.counts"
echo "Quantifying gene expression for $SAMPLE_NAME using htseq-count..."
# 执行htseq-count
htseq-count -f bam -r pos -s "$HTSEQ_STRAND" "$BAM_FILE" "$GENE_ANNOTATION" > "$COUNT_FILE"
if [ $? -eq 0 ]; then
echo "Quantification completed: $SAMPLE_NAME"
else
echo "Quantification failed: $SAMPLE_NAME"
fi
done
- 合并
counts
文件
- 查看所有count文件顺序是否一致:
(base) [animal_geneticB@master-cli1-x86-agent1 4_count]$ head -n 4 *counts
==> CRR785672:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002 66
ENSSSCG00000000003 156
ENSSSCG00000000005 304
ENSSSCG00000000006 298
==> CRR785673:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002 82
ENSSSCG00000000003 239
ENSSSCG00000000005 649
ENSSSCG00000000006 231
==> CRR785674:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002 151
ENSSSCG00000000003 132
ENSSSCG00000000005 755
ENSSSCG00000000006 228
==> CRR785675:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002 228
ENSSSCG00000000003 224
ENSSSCG00000000005 360
ENSSSCG00000000006 191
==> CRR785676:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002 34
ENSSSCG00000000003 180
ENSSSCG00000000005 300
ENSSSCG00000000006 301
==> CRR785677:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002 10
ENSSSCG00000000003 143
ENSSSCG00000000005 489
ENSSSCG00000000006 431
==> CRR785678:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002 20
ENSSSCG00000000003 50
ENSSSCG00000000005 359
ENSSSCG00000000006 290
==> CRR785679:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002 4
ENSSSCG00000000003 103
ENSSSCG00000000005 297
ENSSSCG00000000006 137
- 写脚本
import pandas as pd
import os
import sys
# 获取目标目录下的所有.counts文件列表
directory = '/workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/4_count'
counts_files = [f for f in os.listdir(directory) if f.endswith('.counts')]
file_base_names = [os.path.splitext(f)[0] for f in counts_files] # 提取ERR编号
# 初始化一个空的DataFrame,用于存储最终结果
df_all = pd.DataFrame()
# 遍历每个.counts文件
for file_name, base_name in zip(counts_files, file_base_names):
# 构建完整路径
file_path = os.path.join(directory, file_name)
# 读取文件,假设文件格式为两列,第一列基因ID,第二列计数
df_temp = pd.read_csv(file_path, sep='\t', header=None, names=['GeneID', 'Count'])
# 将当前文件的计数作为新列添加到总的DataFrame中,基因ID作为索引
df_all[base_name] = df_temp.set_index('GeneID')['Count']
# 填充缺失值,如果某个基因在某些文件中没有出现,则填充为0
df_all.fillna(0, inplace=True)
# 重置索引,使得基因ID成为新的一列
df_all.reset_index(inplace=True)
# 输出到新的csv文件,或者根据需要调整输出格式
output_file = 'merged_counts.csv'
df_all.to_csv(output_file, index=False)
print(f"Merged counts saved to {output_file}")
####上面的脚本内容存成run_merge_counts.py python脚本
(base) [animal_geneticB@master-cli1-x86-agent1 4_count]$ python run_merge_counts.py *.counts
Merged counts saved to merged_counts.csv
下游分析
整理最终版表格
- 差异分析
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
library(DESeq2)
counts<-read.table("laiwu_pig_IMF_count.csv",sep = ',',header = T,row.names = 1)
head(counts)
dim(counts)
#mycounts<-counts[rowSums(counts) != 0,]
#dim(mycounts)
#删掉count为0的基因得到的差异结果和不去掉进行的差异分析结果不一致
#counts<-log2(counts+0.1)
#row.names一定要和数据的表头相一致
colDate<-data.frame(row.names=c("CRR785672","CRR785673","CRR785674","CRR785675","CRR785676","CRR785677","CRR785678","CRR785679")
,condition=factor(c("HLW","HLW","HLW","HLW","LLW","LLW","LLW","LLW"))
,levels=c("HLW","HLW","HLW","HLW","LLW","LLW","LLW","LLW"))
head(colData)
dds<-DESeqDataSetFromMatrix(countData = counts,colData = colDate,design = ~condition)
dds<-DESeq(dds)
sizeFactors(dds)
res <- results(dds, contrast = c("condition","HLW","LLW"), cooksCutoff = FALSE)
res
class(res)
res<-as.data.frame(res)
head(res)
library(dplyr)
res%>%
mutate(Sig = case_when(
log2FoldChange >= 0.58 & padj <= 0.05 ~ "UP",
log2FoldChange <= -0.58 & padj <= 0.05 ~ "DOWN",
TRUE ~ "Not_change"
)) -> res_1
table(res_1$Sig)
res_1<-cbind(rownames(res_1),res_1)
head(res_1)
colnames(res_1)<- c('gene',"baseMean","log2FoldChange","lfcSE","stat","pvalue","padj","Sig")
write.table(res_1,"laiwu_pig_IMF_count_DEseq.csv",sep = ',',col.names = T,row.names = F,quote = FALSE)
- 计算非冗余外显子的长度
前几次分析的时候,已经得到了这个文件,Sus_scrofa.Sscrofa11.1.112.PCGs.gtf.Efflens
,直接用之前的即可
- 计算TPM
脚本命名为2_Calculate_TPM.pl
,在4_count
文件夹中运行代码,得到.TPM
文件,运行命令为:perl 2_Calculate_TPM.pl *.counts
,
#!/usr/bin/perl
use warnings;
use strict;
# 读取基因长度信息
my %LEN;
{
open my $IN, "<", "/workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/5_DEG/Sus_scrofa.Sscrofa11.1.112.PCGs.gtf.Efflens" or die "Cannot open gene length file: $!";
while (<$IN>) {
chomp;
my @a = split;
$LEN{$a[0]} = $a[1];
}
close $IN;
}
# 遍历命令行参数(所有counts文件)
foreach my $counts (@ARGV) {
open my $OUT, ">", "$counts.TPM" or die "Cannot open output file for $counts: $!";
my $cousum = 0;
# 计算总表达量 cousum
{
open my $IN, "<", "$counts" or die "Cannot open input file $counts: $!";
while (<$IN>) {
chomp;
my @a = split;
next unless exists $LEN{$a[0]};
my $r1 = ($a[1] / ($LEN{$a[0]} / 1000));
$cousum += $r1;
}
close $IN;
}
# 计算并打印每个基因的TPM值
{
open my $IN, "<", "$counts" or die "Cannot reopen input file $counts: $!";
while (<$IN>) {
chomp;
my @a = split;
next unless exists $LEN{$a[0]};
my $r1 = ($a[1] / ($LEN{$a[0]} / 1000));
my $tpm = keep2pos(1000000 * ($r1 / $cousum));
print $OUT "$_\t$tpm\n";
}
close $IN;
}
}
sub keep2pos{
my $inv = shift;
my $ouv;
if($inv < 0.01){
$ouv=0;
}else{
my @q=split /\./,$inv;
$ouv=$q[0] + 0.01*(substr($q[1],0,2));
}
return $ouv;
}
- 从GTF文件中筛选出蛋白质编码基因的转录起始位点(TSS)
之前已经运行3_Pick_TSS_syobolID_pig.pl
,得到了Sus_scrofa.Sscrofa11.1.112.PCGs.gtf.TSS
文件