猪转录组数据上游&下游分析（关于IMF）——丝滑流程

本文链接：https://blog.csdn.net/m0_60116853/article/details/140052708

上游分析

下载bulk RNA-seq原始数据

数据清洗

(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ ls raw_data | grep "f1" > gz1 
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ cat gz1
CRR785672_f1.fq.gz
CRR785673_f1.fq.gz
CRR785674_f1.fq.gz
CRR785675_f1.fq.gz
CRR785676_f1.fq.gz
CRR785677_f1.fq.gz
CRR785678_f1.fq.gz
CRR785679_f1.fq.gz
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ ls raw_data | grep "r2" > gz2
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ cat gz2
CRR785672_r2.fq.gz
CRR785673_r2.fq.gz
CRR785674_r2.fq.gz
CRR785675_r2.fq.gz
CRR785676_r2.fq.gz
CRR785677_r2.fq.gz
CRR785678_r2.fq.gz
CRR785679_r2.fq.gz
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ paste gz1 gz2>config_file
(base) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ cat config_file
CRR785672_f1.fq.gz	CRR785672_r2.fq.gz
CRR785673_f1.fq.gz	CRR785673_r2.fq.gz
CRR785674_f1.fq.gz	CRR785674_r2.fq.gz
CRR785675_f1.fq.gz	CRR785675_r2.fq.gz
CRR785676_f1.fq.gz	CRR785676_r2.fq.gz
CRR785677_f1.fq.gz	CRR785677_r2.fq.gz
CRR785678_f1.fq.gz	CRR785678_r2.fq.gz
CRR785679_f1.fq.gz	CRR785679_r2.fq.gz
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ conda activate software_1

按照文章的要求，修改了几个参数：

Raw reads were filtered to remove reads with adapters, reads with undetermined base information (N) content greater than 1% and reads with Qphred ≤20 accounting for more than 50% of the entire read length.

写trim_galore的脚本，命名为run_fastq_clean.sh

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n clean_data
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log
#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#===========================================================
#运行脚本
#===========================================================
cat config_file | while read id
do
	sample_dir="./raw_data"
	output_dir="./1_clean_data"
	arr=($id)
	fq1=${arr[0]}
	fq2=${arr[1]}
	sample_dir1="$sample_dir/$fq1"
	sample_dir2="$sample_dir/$fq2"
	trim_galore -q 20 --max_n 1 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $output_dir $sample_dir1 $sample_dir2
done

修改权限，并运行代码：

(software_1) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ chmod 777 run_fastq_clean.sh 
(software_1) [animal_geneticB@master-cli1-x86-agent1 RNA_seq]$ dsub -s run_fastq_clean.sh 
JOBID        MESSAGE   
3956         submit job successfully

比对——`STAR`

之前的推文中已经构好了猪的索引，可以直接比对
猪_RNA_seq的Clean_data上游分析&下游分析总结

查看测序读段的碱基序列长度：

(software_1) [animal_geneticB@master-cli1-x86-agent1 raw_data]$ zcat CRR785672_f1.fq.gz | head -n 4 | tail -n 1 | tr -d '+\n' | wc -c
150

写脚本，命名为run_mapping.sh，在1_clean_data目录下运行该代码

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n mapping_py_sus
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log

#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#===========================================================
#运行测试脚本
#===========================================================
GENOME_INDEX="/workspace/home/pig/animal_geneticB/RNA_seq/reference/index/star/sscrofa"
OUTPUT_DIR="/workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping"
THREADS=20

for R1_FILE in /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/1_clean_data/*_f1_val_1.fq.gz; do
	# 提取样本名时去除 '_f1_val_1.fq.gz' 部分
	SAMPLE_NAME=$(basename "$R1_FILE" _f1_val_1.fq.gz)
	# 构建R2文件名，根据新的命名规则
	R2_FILE="${SAMPLE_NAME}_r2_val_2.fq.gz"
	# 确认R2文件存在
	if [[ ! -f "$R2_FILE" ]]; then
		echo "Missing R2 file for $R1_FILE"
		continue
	fi

	OUT_PREFIX="$OUTPUT_DIR/$SAMPLE_NAME":
	echo "Processing $SAMPLE_NAME"
	# 注意：此处未直接解压fastq.gz文件，如果文件实际上是gzip压缩的，请先解压或修改命令处理
	STAR --runThreadN $THREADS \
     --genomeDir "$GENOME_INDEX" \
     --readFilesIn "$R1_FILE" "$R2_FILE" \
     --readFilesCommand zcat \
     --outFileNamePrefix "$OUT_PREFIX" \
     --outSAMtype BAM SortedByCoordinate \
     --sjdbOverhang 150 \
#sjdbOverhang后面的值一定要和构建索引时的值是一致的 
	if [ $? -eq 0 ]; then
		echo "Completed: $SAMPLE_NAME"
	else
		echo "Failed: $SAMPLE_NAME"
	fi
done

定量

判断链是否为特异性链

(software_1) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ conda activate my_hisat2_env
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ gtfToGenePred -genePredExt -geneNameAsName2 /workspace/home/pig/animal_geneticB/RNA_seq/reference/Sscrofa/Sus_scrofa.Sscrofa11.1.112.PCGs.gtf gene.tmp
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ ls
gene.tmp
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ awk '{print $2"\t"$4"\t"$5"\t"$1"\t0\t"$3"\t"$6"\t"$7"\t0\t"$8"\t"$9"\t"$10}' gene.tmp > genes_refseq.bed12
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ ls
genes_refseq.bed12  gene.tmp
(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785672:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0189
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4927
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4884

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785673:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0326
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4864
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4810

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785674:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0312
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4859
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4828

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785675:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0295
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4868
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4837

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785676:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0182
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4946
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4872

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785677:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0210
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4906
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4884

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785678:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0263
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4893
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4843

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 3_RseQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/2_mapping/CRR785679:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0214
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4898
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4888

是双端的链非特异性，选择no这个参数

定量
写脚本，命名为run_count.sh

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n py_pig_count
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log

#===========================================================
#加载环境变量及依赖软件
#===========================================================

module use /workspace/public/x86/software/modules/
module purge

#===========================================================
#运行脚本
#===========================================================
# 定义变量
GENE_ANNOTATION="/workspace/home/pig/animal_geneticB/RNA_seq/reference/Sscrofa/Sus_scrofa.Sscrofa11.1.112.PCGs.gtf" #基因注释文件路径，注意，只用编码基因的注释文件
COUNT_OUTPUT_DIR="/workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/4_count"
HTSEQ_STRAND="no" # 或者 "no"、"yes"、"reverse"根据您的实验 strandedness 设置

# 创建输出目录（如果不存在）
#mkdir -p "$COUNT_OUTPUT_DIR"

# 遍历STAR生成的已排序BAM文件
for BAM_FILE in /workspace/home/pig/animal_geneticB/Analysis/LQP_iWAT_mWAT_Liver/mapping/*Aligned.sortedByCoord.out.bam; do
    # 获取样本名
    SAMPLE_NAME=$(basename "$BAM_FILE" _Aligned.sortedByCoord.out.bam)
    
    # 定义输出计数文件的全路径
    COUNT_FILE="$COUNT_OUTPUT_DIR/${SAMPLE_NAME}_htseq.counts"
    
    echo "Quantifying gene expression for $SAMPLE_NAME using htseq-count..."
    # 执行htseq-count
    htseq-count -f bam -r pos -s "$HTSEQ_STRAND" "$BAM_FILE" "$GENE_ANNOTATION" > "$COUNT_FILE"
    
    if [ $? -eq 0 ]; then
        echo "Quantification completed: $SAMPLE_NAME"
    else
        echo "Quantification failed: $SAMPLE_NAME"
    fi
done

合并counts文件

查看所有count文件顺序是否一致：

(base) [animal_geneticB@master-cli1-x86-agent1 4_count]$ head -n 4 *counts
==> CRR785672:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002	66
ENSSSCG00000000003	156
ENSSSCG00000000005	304
ENSSSCG00000000006	298

==> CRR785673:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002	82
ENSSSCG00000000003	239
ENSSSCG00000000005	649
ENSSSCG00000000006	231

==> CRR785674:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002	151
ENSSSCG00000000003	132
ENSSSCG00000000005	755
ENSSSCG00000000006	228

==> CRR785675:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002	228
ENSSSCG00000000003	224
ENSSSCG00000000005	360
ENSSSCG00000000006	191

==> CRR785676:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002	34
ENSSSCG00000000003	180
ENSSSCG00000000005	300
ENSSSCG00000000006	301

==> CRR785677:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002	10
ENSSSCG00000000003	143
ENSSSCG00000000005	489
ENSSSCG00000000006	431

==> CRR785678:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002	20
ENSSSCG00000000003	50
ENSSSCG00000000005	359
ENSSSCG00000000006	290

==> CRR785679:Aligned.sortedByCoord.out.bam_htseq.counts <==
ENSSSCG00000000002	4
ENSSSCG00000000003	103
ENSSSCG00000000005	297
ENSSSCG00000000006	137

写脚本

import pandas as pd
import os
import sys

# 获取目标目录下的所有.counts文件列表
directory = '/workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/4_count'
counts_files = [f for f in os.listdir(directory) if f.endswith('.counts')]
file_base_names = [os.path.splitext(f)[0] for f in counts_files]  # 提取ERR编号

# 初始化一个空的DataFrame，用于存储最终结果
df_all = pd.DataFrame()

# 遍历每个.counts文件
for file_name, base_name in zip(counts_files, file_base_names):
    # 构建完整路径
    file_path = os.path.join(directory, file_name)
    
    # 读取文件，假设文件格式为两列，第一列基因ID，第二列计数
    df_temp = pd.read_csv(file_path, sep='\t', header=None, names=['GeneID', 'Count'])
    
    # 将当前文件的计数作为新列添加到总的DataFrame中，基因ID作为索引
    df_all[base_name] = df_temp.set_index('GeneID')['Count']

# 填充缺失值，如果某个基因在某些文件中没有出现，则填充为0
df_all.fillna(0, inplace=True)

# 重置索引，使得基因ID成为新的一列
df_all.reset_index(inplace=True)

# 输出到新的csv文件，或者根据需要调整输出格式
output_file = 'merged_counts.csv'
df_all.to_csv(output_file, index=False)

print(f"Merged counts saved to {output_file}")

####上面的脚本内容存成run_merge_counts.py python脚本

(base) [animal_geneticB@master-cli1-x86-agent1 4_count]$ python run_merge_counts.py *.counts
Merged counts saved to merged_counts.csv

下游分析

整理最终版表格

差异分析

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DESeq2")
library(DESeq2)

counts<-read.table("laiwu_pig_IMF_count.csv",sep = ',',header = T,row.names = 1)
head(counts)
dim(counts)

#mycounts<-counts[rowSums(counts) != 0,]
#dim(mycounts)
#删掉count为0的基因得到的差异结果和不去掉进行的差异分析结果不一致

#counts<-log2(counts+0.1)
#row.names一定要和数据的表头相一致
colDate<-data.frame(row.names=c("CRR785672","CRR785673","CRR785674","CRR785675","CRR785676","CRR785677","CRR785678","CRR785679")
,condition=factor(c("HLW","HLW","HLW","HLW","LLW","LLW","LLW","LLW"))
,levels=c("HLW","HLW","HLW","HLW","LLW","LLW","LLW","LLW"))

head(colData)

dds<-DESeqDataSetFromMatrix(countData = counts,colData = colDate,design = ~condition)

dds<-DESeq(dds)
sizeFactors(dds)

res <- results(dds, contrast = c("condition","HLW","LLW"), cooksCutoff = FALSE)
res

class(res)
res<-as.data.frame(res)
head(res)

library(dplyr)

res%>%
  mutate(Sig = case_when(
    log2FoldChange >= 0.58 & padj <= 0.05 ~ "UP",
    log2FoldChange <= -0.58 & padj <= 0.05 ~ "DOWN",
    TRUE ~ "Not_change"
  )) -> res_1
  
table(res_1$Sig)

res_1<-cbind(rownames(res_1),res_1) 
head(res_1)
colnames(res_1)<- c('gene',"baseMean","log2FoldChange","lfcSE","stat","pvalue","padj","Sig")

write.table(res_1,"laiwu_pig_IMF_count_DEseq.csv",sep = ',',col.names = T,row.names = F,quote = FALSE)

计算非冗余外显子的长度

前几次分析的时候，已经得到了这个文件，Sus_scrofa.Sscrofa11.1.112.PCGs.gtf.Efflens,直接用之前的即可

计算TPM
脚本命名为2_Calculate_TPM.pl，在4_count文件夹中运行代码，得到.TPM文件，运行命令为：perl 2_Calculate_TPM.pl *.counts，

#!/usr/bin/perl

use warnings;
use strict;

# 读取基因长度信息
my %LEN;
{
    open my $IN, "<", "/workspace/home/pig/animal_geneticB/Analysis/py/data/RNA_seq/5_DEG/Sus_scrofa.Sscrofa11.1.112.PCGs.gtf.Efflens" or die "Cannot open gene length file: $!";
    while (<$IN>) {
        chomp;
        my @a = split;
        $LEN{$a[0]} = $a[1];
    }
    close $IN;
}

# 遍历命令行参数（所有counts文件）
foreach my $counts (@ARGV) {
    open my $OUT, ">", "$counts.TPM" or die "Cannot open output file for $counts: $!";

    my $cousum = 0;
    
    # 计算总表达量 cousum
    {
        open my $IN, "<", "$counts" or die "Cannot open input file $counts: $!";
        while (<$IN>) {
            chomp;
            my @a = split;
            next unless exists $LEN{$a[0]};
            my $r1 = ($a[1] / ($LEN{$a[0]} / 1000));
            $cousum += $r1;
        }
        close $IN;
    }

    # 计算并打印每个基因的TPM值
    {
        open my $IN, "<", "$counts" or die "Cannot reopen input file $counts: $!";
        while (<$IN>) {
            chomp;
            my @a = split;
            next unless exists $LEN{$a[0]};
            my $r1 = ($a[1] / ($LEN{$a[0]} / 1000));
            my $tpm = keep2pos(1000000 * ($r1 / $cousum));
            print $OUT "$_\t$tpm\n";
        }
        close $IN;
    }
}

sub keep2pos{
    my $inv = shift;
    my $ouv;
    if($inv < 0.01){
        $ouv=0;
    }else{
        my @q=split /\./,$inv;
        $ouv=$q[0] + 0.01*(substr($q[1],0,2));
    }
    return $ouv;
}