小鼠RNA_seq数据上游分析学习总结&流程

红鲤绿与鱼

已于 2024-07-29 13:34:49 修改

阅读量1.7k

点赞数 28

分类专栏： RNA-seq分析文章标签：学习

于 2024-06-11 21:18:22 首次发布

本文链接：https://blog.csdn.net/m0_60116853/article/details/139607459

版权

RNA-seq分析专栏收录该内容

4 篇文章 0 订阅

订阅专栏

RNA_seq上游分析

下载生信软件

安装anaconda3

#下载命令
wget https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh
#安装命令
bash Anaconda3-2024.02-1-Linux-x86_64.sh

在这里插入图片描述
出现该界面，按回车
出现这些内容就按键盘上的 q 键
输入yes

出现这个界面就按回车键，就会进入安装状态，可能需要一些时间

在这里插入图片描述
输入yes，安装完成
然后退出登录，重新进行登录
登录以后再输入命令的地方最前面会多出一个 (base) ，这就说明安装好了
如果我们想安装某个软件，可以首先用conda search 命令搜索一下这个软件在anaconda中是否存在
比如转录组数据处理中经常用的比对软件hisat2，如果这个软件有的话就会列出软件的版本:conda search hisat2 -c bioconda
参考这篇推文

安装hisat2
conda install -c bioconda hisat2
出现错误：
应该将conda换个源，具体操作以后补充，换源完成后，成功安装了hisat2

conda install -c bioconda fastqc
#查看fastqc是否安装完成
fastqc --version
#查看fastqc的参数
fastqc --help
conda install -c bioconda stringtie
conda install -c bioconda trimmomatic
conda install -c bioconda featurecounts
conda install aspera-cli #用来下载原始数据
conda install -y -c bioconda sra-tools
conda install -c bioconda multiqc
conda install -c bioconda STAR
conda install -c bioconda trim-galore

有些软件安装不上，可能是因为用的命令不对
例如，下载bowtie2时，下载不了，应该用官网指定的命令:conda install bioconda::bowtie2
官网地址

在这里插入图片描述

关于conda安装卡在solving environment的问题
创建一个新环境 conda env create -n env_name再安装软件，这样就不用考虑与已有的软件的兼容问题了，也可以大大降低搜索空间和提高解析软件依赖的速度。

！！！超级管用

2024.7.29补充：

(base) [animal_geneticB@master-cli1-x86-agent1 ~]$ conda env create -n test
EnvironmentFileNotFound: '/workspace/home/pig/animal_geneticB/environment.yml' file not found

会出现报错，但是具体的原因没找见，用这个命令之前是可以创建新环境的。
用另外一个命令：conda create -n env_name

参考：
一招解决Conda安装卡在solving environment这一步！

下载原始数据

补充下载之前DSS数据——ERR10773104_2、ERR10773105_1、ERR10773103_1、ERR10773103_2
有机会补上之前如何批量下载 RNA_seq原始数据
在这里插入图片描述可能的报错原因：
ascp: "user@host:" in all sources must match，这表明在使用Aspera Connect进行文件传输时，命令格式存在问题。Aspera Connect (ascp) 是一个高速文件传输工具，它使用 SSH 协议进行安全传输。
错误提示表明 ascp 命令中的源地址格式不正确。正确的格式应该遵循 user@host:source : destination 的模式，其中 user 是目标服务器的用户名，host 是目标服务器的地址，source是要传输的文件的路径，destination是文件传输到的目标路径。
修改命令：ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -l 200m -T -P33001 -K1 era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR107/005/ERR10773105/ERR10773105_1.fastq.gz .
在这里插入图片描述可能是网速的原因，重新输入命令，但是没反应，如下图：
把刚刚下载了一半的文件删掉就好了！！
接着下载ERR10773104_2：

出现了两个问题：

uk:后多了个空格，所以第一行命令出现报错
网速又出现问题，删掉下载部分的文件，继续运行该命令

ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -l 200m -T -P33001 -K1 era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR107/004/ERR10773104/ERR10773104_2.fastq.gz .
下载ERR10773103_1：
ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -l 200m -T -P33001 -K1 era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR107/003/ERR10773103/ERR10773103_1.fastq.gz .

下载ERR10773103_2：
ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -l 200m -T -P33001 -K1 era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR107/003/ERR10773103/ERR10773103_2.fastq.gz .

好的，现在已经完成了基本的环境搭建，原始数据也下载好了，可以开始正式分析数据了

质量控制

激活环境
conda activate my_hisat2_env
Fastqc 质控
fastqc -f fastq -o 1_QC *.fastq.gz -t 3
对当前目录下所有.fastq.gz 格式的文件进行质控，结果输出到1_QC文件夹中
对于这一部分还是有点疑虑，上面的命令感觉不适合双端数据，所以用的下面的命令
fastqc -f fastq -o 1_double_QC ERR10773109_1.fastq.gz ERR10773109_2.fastq.gz -t 3这样不方便，后来改了个脚本，忘记出处是哪了

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n fastqc
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log
#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#===========================================================
#运行脚本
#===========================================================
output dir="1_double_QC"  ##替换为输出目录
# 循环遍历所有以_1.fastq结尾的文件
for file1 in *_1.fastq; do
	file2="${file1/_1/_2}"  ##构建对应的_2.fastq文件名
	if [ -f "$file2" ];then ##检查_2.fastq文件是否存在
		fastqc -o "$output dir" "$file1" "$file2" ##运行Fastqc
	else
		echo "警告：未找到配对文件 $file2"
	fi
done

提交脚本时，失败，发现是由于脚本未激活，或者直接用CP复制已经改变了权限的脚本，不能用鼠标复制脚本，在Linux系统下，只能纯命令复制才能带上文件的激活状态。
chmod 777 run_mapping.sh激活脚本

整合质控结果
multiqc ./：对当前目录下的所有质控文件进行整合

数据预处理（数据清洗）利用`trim_galore`

对数据名预处理

ls fastq | grep "_1" > gz1 ##列出当前目录下名为fastq的文件夹中的所有文件，并筛选出文件名包含_1的行，然后将这些结果重定向到名为gz1的文件中，如果gz1文件已存在，则覆盖其内容。
cat gz1
ls fastq | grep "_2" > gz2 ##筛选的是文件名包含_2的行，并将结果重定向到名为gz2的文件中
cat gz2
paste gz1 gz2>config_file
cat config_file

在这里插入图片描述

处理数据
脚本如下：

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n clean
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log
#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#===========================================================
#运行脚本
#===========================================================
cat config_file | while read id
do
	sample_dir="./fastq"
	output_dir="./fastq_clean"
	arr=($id)
	fq1=${arr[0]}
	fq2=${arr[1]}
	sample_dir1="$sample_dir/$fq1"
	sample_dir2="$sample_dir/$fq2"
	trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $output_dir $sample_dir1 $sample_dir2
done

参考
数据清洗—【trim_galore】那些事

在Linux中使用vim编辑器创建文件并写入脚本内容的步骤如下：

打开终端
创建并编辑文件：输入以下命令来创建新文件并打开它： vim run_fastq_clean.sh这将创建一个名为run_fastq_clean.sh的新文件，并在vim编辑器中打开它
进入插入模式：在vim中，默认处于普通模式。按下i键即可进入插入模式
写入脚本内容
保存并退出：完成编辑之后，ESC退出编辑，输入“:”，在下方会出现冒号，等待输入命令，“wq” （意为写入文件并退出），然后按下Enter键。如果想保存文件但不退出vim，只需输入:w然后按Enter。如果想退出不保存更改，可以输入:q!然后按Enter。
运行脚本：dsub -s run_fastq_clean.sh
查看脚本运行具体过程/检查脚本： bash -x run_fastq_clean.sh

注意：
运行的脚本一定要和所需要的文件在同一目录下，运行脚本前，仔细检查是否在同一路径，这很重要！！！！
在这里插入图片描述
结果解读：

在这里插入图片描述

生成的_triming_report.txt文件就是生成的质控报告，_val_1.fq.gz就是过滤后瘦身的clean data，比原来数据小了30M左右，这个clean data才可以用于后续的分析流程。

参考：
Illumina公司测序时的接头序列：
转录组数据分析笔记（1）——如何用fastqc和trim-galore做测序数据质控

比对参考基因组——利用`STAR`

下载注释和参考基因组文件

进入ensembl
ensembl

在Linux中下载基因组数据：wget +下载链接

wget https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.toplevel.fa.gz
参考
保姆级参考基因组及其注释下载教程（图文详解）

STAR比对

解压基因组文件和注释文件

gzip -d Mus_musculus.GRCm39.dna.toplevel.fa.gz
gzip -d Mus_musculus.GRCm39.112.chr.gtf.gz
#这条命令会将压缩文件解压，并自动删除.gz扩展名，留下未压缩的Mus_musculus.GRCm39.dna.toplevel.fa文件在原目录中

在这里插入图片描述

建立索引，将以下脚本命名为run_mapping.sh

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n mult_node_test
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log

#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#=================================================
#运行脚本
#=================================================
##使用star软件创建小鼠GRCm39基因组索引
fa_dir="/workspace/home/pig/animal_geneticB/RNA_seq/reference/GRCm39/Mus_musculus.GRCm39.dna.toplevel.fa"
gtf_dir="/workspace/home/pig/animal_geneticB/RNA_seq/reference/GRCm39/Mus_musculus.GRCm39.112.chr.gtf"
mkdir -p "/workspace/home/pig/animal_geneticB/RNA_seq/reference/index/star/grcm39"

STAR --runMode genomeGenerate \
        --runThreadN 6 \
        --genomeDir "/workspace/home/pig/animal_geneticB/RNA_seq/reference/index/star/grcm39" \
        --genomeFastaFiles $fa_dir \
        --sjdbGTFfile $gtf_dir \
        --sjdbGTFchrPrefix "chr" \
        --sjdbOverhang 149
        
##参数说明##
#--runMode：　　运行程序模式，默认是比对，所以第一步这个参数设置很关键
#--runThreadN：　　运行的线程数，根据你自己电脑的配置来设置，数字越大运行越快
#--genomeDir：　　这个参数很重要，是存放你生成index的文件路径，需要你事先建立一个有可读写权限的文件夹
#--genomeFastaFiles 　　基因组fasta格式文件路径
#--sjdbGTFfile 　　GTF注释文件路径
#--sjdbOverhang 　　这个值为你测序read的长度减1，是在注释可变剪切序列的时候使用的最大长度值
#--sjdbGTFchrPrefix                        -
#--string: prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSMEBL annotations with UCSC genomes) #即当下载的fa文件和GTF文件在染色体命名上有区别时，添加【UCSC以“chr”命名，然而ENSEMBL直接用数字命名】

建立索引的时候报错：
在这里插入图片描述
查看基因组和注释文件前10行，发现染色体都是以数字开头
head -n 10 Mus_musculus.GRCm39.dna.toplevel.fa
head -n 10 Mus_musculus.GRCm39.112.chr.gtf
后来发现是参数设置得有问题：

–sjdbGTFchrPrefix “chr” 参数的作用是对GTF文件中的染色体名称添加一个指定的前缀。这里的前缀是 “chr”。这意味着如果GTF文件中的染色体名称是以数字开始（如 1, 2, X, Y 等），STAR会在这些名称前加上 “chr”，使之变为 chr1, chr2, chrX, chrY 等，以便与参考基因组FASTA文件中的染色体命名保持一致。所以应该删除这个参数。
在使用代码前，应该搞清楚每个参数的意思，从而匹配自己的数据。！！！
–sjdbOverhang：这个值为测序read的长度减1，是在注释可变剪切序列的时候使用的最大长度值。
查看测序读段的碱基序列长度：
zcat ERR10773103_1.fastq.gz | head -n 4 | tail -n 1 | tr -d '+\n' | wc -c
在这里插入图片描述所以应该将sjdbOverhang参数改为150
最终修改后的脚本如下：

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n mult_node_test
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log
#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#=================================================
#运行脚本
#=================================================
##使用star软件创建小鼠GRCm39基因组索引
fa_dir="/workspace/home/pig/animal_geneticB/RNA_seq/reference/GRCm39/Mus_musculus.GRCm39.dna.toplevel.fa"
gtf_dir="/workspace/home/pig/animal_geneticB/RNA_seq/reference/GRCm39/Mus_musculus.GRCm39.112.chr.gtf"
mkdir -p "/workspace/home/pig/animal_geneticB/RNA_seq/reference/index/star/grcm39"

STAR --runMode genomeGenerate \
        --runThreadN 6 \
        --genomeDir "/workspace/home/pig/animal_geneticB/RNA_seq/reference/index/star/grcm39" \
        --genomeFastaFiles $fa_dir \
        --sjdbGTFfile $gtf_dir \
        #--sjdbGTFchrPrefix "chr" \
        --sjdbOverhang 150

构建索引完成！
在这里插入图片描述

比对
- 解压fastq文件：gunzip *.fastq.gz，使用 gunzip 命令解压所有以 .fastq.gz 结尾的文件。解压之后，原来的压缩文件.fastq.gz会被删除，而解压出来的 .fastq 文件会留在原目录中。
- 比对：脚本如下：

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n mapping
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log

#===========================================================
#加载环境变量及依赖软件
#===========================================================
module use /workspace/public/x86/software/modules/
module purge
#===========================================================
#运行测试脚本
#===========================================================
GENOME_INDEX="/workspace/home/pig/animal_geneticB/RNA_seq/reference/index/star/grcm39"
OUTPUT_DIR="/workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new"
THREADS=20

for R1_FILE in /workspace/home/pig/animal_geneticB/RNA_seq/fastq_clean/*_1_val_1.fq; do
	# 提取样本名时去除 '_1_val_1.fq' 部分
	SAMPLE_NAME=$(basename "$R1_FILE" _1_val_1.fq)
	# 构建R2文件名，根据新的命名规则
	R2_FILE="${SAMPLE_NAME}_2_val_2.fq"
	# 确认R2文件存在
	if [[ ! -f "$R2_FILE" ]]; then
		echo "Missing R2 file for $R1_FILE"
		continue
	fi

	OUT_PREFIX="$OUTPUT_DIR/$SAMPLE_NAME":
	echo "Processing $SAMPLE_NAME"
	# 注意：此处未直接解压fastq.gz文件，如果文件实际上是gzip压缩的，请先解压或修改命令处理
	STAR --runThreadN $THREADS --genomeDir "$GENOME_INDEX" --readFilesIn "$R1_FILE" "$R2_FILE" --outFileNamePrefix "$OUT_PREFIX" --outSAMtype BAM SortedByCoordinate --sjdbOverhang 150 --twopassMode Basic
	if [ $? -eq 0 ]; then
		echo "Completed: $SAMPLE_NAME"
	else
		echo "Failed: $SAMPLE_NAME"
	fi
done

脚本参考
5.基因组索引文件创建—STAR

比对花了18个小时左右，建议晚上跑
在这里插入图片描述这一步不用提前解压.fastq.gz文件，可以直接在STAR中设置相应的参数

定量——利用`htseq`

安装软件
conda install bioconda::htseq
判断数据是否为链特异性，需要用到RSeQC软件
conda install bioconda::rseqc

###鉴定数据的建库方式（链特异性或非链特异性###
#安装gtfToGenePre
conda install bioconda::rseqc
#从gtf转换为GenePred格式
gtfToGenePred -genePredExt -geneNameAsName2 /workspace/home/pig/animal_geneticB/RNA_seq/reference/GRCm39/Mus_musculus.GRCm39.112.chr.gtf gene.tmp
#从GenePred文件提取信息得到bed12文件
awk '{print $2"\t"$4"\t"$5"\t"$1"\t0\t"$3"\t"$6"\t"$7"\t0\t"$8"\t"$9"\t"$10}' gene.tmp > genes_refseq.bed12

###得到bed12文件即可使用infer_experiment.py判断数据是否为链特异性
#检验，要进入有bed12文件的目录运行下列代码
infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773103:Aligned.sortedByCoord.out.bam

在这里插入图片描述
参考
如何鉴定数据的建库方式（链特异性或非链特异性）
来自2024.6.18的忠告：
应该用只含有编码基因的gtf文件进行定量

#将提取编码基因信息内容存储到Mus_musculus.GRCm39.112.chr.PCGs.gtf文件中
(base) [animal_geneticB@master-cli1-x86-agent1 GRCm39]$ less -SN *.gtf | grep 'protein_coding' > Mus_musculus.GRCm39.112.chr.PCGs.gtf

修改脚本

#!/bin/bash
#===========================================================
#配置DSUB资源
#===========================================================
#DSUB --job_type cosched:hmpi
#DSUB -n count
#DSUB -A root.default
#DSUB -q root.default
#DSUB -R cpu=2;gpu=0
#DSUB -N 1
#DSUB --labels x86_64
#DSUB -o out_%J.log
#DSUB -e err_%J.log

#===========================================================
#加载环境变量及依赖软件
#===========================================================

module use /workspace/public/x86/software/modules/
module purge

#===========================================================
#运行脚本
#===========================================================
# 定义变量
GENE_ANNOTATION="/workspace/home/pig/animal_geneticB/RNA_seq/reference/GRCm39/Mus_musculus.GRCm39.112.chr.gtf" #基因注释文件路径
COUNT_OUTPUT_DIR="/workspace/home/pig/animal_geneticB/RNA_seq/4_count"
HTSEQ_STRAND="reverse" # 或者 "no"、"forward"，根据您的实验 strandedness 设置

# 创建输出目录（如果不存在）
#mkdir -p "$COUNT_OUTPUT_DIR"

# 遍历STAR生成的已排序BAM文件
for BAM_FILE in /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/*Aligned.sortedByCoord.out.bam; do
    # 获取样本名
    SAMPLE_NAME=$(basename "$BAM_FILE" _Aligned.sortedByCoord.out.bam)
    
    # 定义输出计数文件的全路径
    COUNT_FILE="$COUNT_OUTPUT_DIR/${SAMPLE_NAME}_htseq.counts"
    
    echo "Quantifying gene expression for $SAMPLE_NAME using htseq-count..."
    # 执行htseq-count
    htseq-count -f bam -r pos -s "$HTSEQ_STRAND" "$BAM_FILE" "$GENE_ANNOTATION" > "$COUNT_FILE"
    
    if [ $? -eq 0 ]; then
        echo "Quantification completed: $SAMPLE_NAME"
    else
        echo "Quantification failed: $SAMPLE_NAME"
    fi

注意：脚本中匹配bam文件时多了一个空格，删掉空格即可；

在这里插入图片描述

定量完成后（大概22h），合并表达矩阵
- 查看所有count文件顺序是否一致：head -n 4 *counts
- 是一致的，直接用python脚本合成表达矩阵

import pandas as pd
import os
import sys

# 获取目标目录下的所有.counts文件列表
directory = '/workspace/home/pig/animal_geneticB/RNA_seq/4_count'
counts_files = [f for f in os.listdir(directory) if f.endswith('.counts')]
file_base_names = [os.path.splitext(f)[0] for f in counts_files]  # 提取ERR编号

# 初始化一个空的DataFrame，用于存储最终结果
df_all = pd.DataFrame()

# 遍历每个.counts文件
for file_name, base_name in zip(counts_files, file_base_names):
    # 构建完整路径
    file_path = os.path.join(directory, file_name)
    
    # 读取文件，假设文件格式为两列，第一列基因ID，第二列计数
    df_temp = pd.read_csv(file_path, sep='\t', header=None, names=['GeneID', 'Count'])
    
    # 将当前文件的计数作为新列添加到总的DataFrame中，基因ID作为索引
    df_all[base_name] = df_temp.set_index('GeneID')['Count']

# 填充缺失值，如果某个基因在某些文件中没有出现，则填充为0
df_all.fillna(0, inplace=True)

# 重置索引，使得基因ID成为新的一列
df_all.reset_index(inplace=True)

# 输出到新的csv文件，或者根据需要调整输出格式
output_file = 'merged_counts.csv'
df_all.to_csv(output_file, index=False)

print(f"Merged counts saved to {output_file}")

####上面的脚本内容存成run_merge_counts.py python脚本

在Linux中运行：python run_merge_counts.py *.counts

合并完成，得到表达矩阵，可以进行下游分析，下游分析过程中，发现差异基因很不对称，上调的只有几个，肯定不对，后来经过排查，发现定量的链特异性参数未设置对。

重新进行定量

对所有比对文件进行判断：

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773103:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1046
Fraction of reads explained by "1++,1--,2+-,2-+": 0.8871
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0083

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773104:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1541
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0073
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8386

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773105:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1452
Fraction of reads explained by "1++,1--,2+-,2-+": 0.8449
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0100

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773109:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0995
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0087
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8918

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773111:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1207
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0078
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8715

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773112:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1190
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0101
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8709

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773113:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1354
Fraction of reads explained by "1++,1--,2+-,2-+": 0.8572
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0074

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773114:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0944
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0089
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8967

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773119:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1001
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0082
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8917

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773120:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1099
Fraction of reads explained by "1++,1--,2+-,2-+": 0.8828
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0074


(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773121:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1848
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0063
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8089

(my_hisat2_env) [animal_geneticB@master-cli1-x86-agent1 4_0_RSeQC]$ infer_experiment.py -r genes_refseq.bed12 -i /workspace/home/pig/animal_geneticB/RNA_seq/reference/mapping_new/ERR10773123:Aligned.sortedByCoord.out.bam
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.1055
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0077
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8868

根据以上判断，ERR10773103、ERR10773105、ERR10773113、ERR10773120为forward，ERR10773104、ERR10773109、ERR10773111、ERR10773112、ERR10773114、ERR10773119、ERR10773121、ERR10773123为reverse，因此，需要分别定量。
分别定量后，合并表达矩阵

一直出现报错/workspace/home/pig/animal_geneticB/Analysis/LQP_iWAT_mWAT_Liver/CleanData/run_mapping_new.sh: line 45: STAR: command not found，试了好久，不是参数的原因，结果是忘记激活下载了STAR的环境。
激活之后，就没有报错了。

定量完成，接下来合并表达矩阵

(software_1) [animal_geneticB@master-cli1-x86-agent1 4_count]$ python run_merge_counts.py *.counts
Traceback (most recent call last):
  File "/workspace/home/pig/animal_geneticB/RNA_seq/4_count/run_merge_counts.py", line 1, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

报错了

(base) [animal_geneticB@master-cli1-x86-agent1 4_count]$ python run_merge_counts.py *.counts
Merged counts saved to merged_counts.csv

直接在base的环境下运行的python run_merge_counts.py *.counts没有报错，得到merged_counts.csv文件

出了点小问题，我直接是拿含有编码lncRNA、miRNA基因等信息的gtf文件做的定量，但是已经定量完成，所以现在只能从最初的gtf文件将编码基因的信息提取出来，以及编码基因对应的ensembl号，再将ensembl号去和定量完后的表格进行比对，只保存编码基因的定量count数。

(base) [animal_geneticB@master-cli1-x86-agent1 GRCm39]$ less -SN *.gtf | grep 'protein_coding' | less -SN
(base) [animal_geneticB@master-cli1-x86-agent1 GRCm39]$ less -SN *.gtf | grep 'protein_coding' | awk '$3=="gene"' | wc -l
21652
#将提取编码基因信息内容存储到Mus_musculus.GRCm39.112.chr.PCGs.gtf文件中
(base) [animal_geneticB@master-cli1-x86-agent1 GRCm39]$ less -SN *.gtf | grep 'protein_coding' > Mus_musculus.GRCm39.112.chr.PCGs.gtf 
#提取唯一的gene_id后的ensembl编号
(base) [animal_geneticB@master-cli1-x86-agent1 GRCm39]$ awk -F'"' '{print $2}' Mus_musculus.GRCm39.112.chr.PCGs.gtf | uniq > mus_pcg_ensemid_unique.txt
(base) [animal_geneticB@master-cli1-x86-agent1 GRCm39]$ wc -l mus_pcg_ensemid_unique.txt
21652 mus_pcg_ensemid_unique.txt
(base) [animal_geneticB@master-cli1-x86-agent1 GRCm39]$ less -SN mus_pcg_ensemid_unique.txt
      1 ENSMUSG00000051285
      2 ENSMUSG00000026312
      3 ENSMUSG00000039748
      4 ENSMUSG00000104158
      5 ENSMUSG00000057363
      6 ENSMUSG00000047216
      7 ENSMUSG00000038702
      8 ENSMUSG00000055214
      9 ENSMUSG00000078184
     10 ENSMUSG00000033007
     11 ENSMUSG00000070695
     12 ENSMUSG00000025909

写脚本：

import os
import glob

# 文件存放的目录
data_dir = "/workspace/home/pig/animal_geneticB/RNA_seq/4_count"

# 读取第二类文件，即编码基因的ensemblID列表
def read_coding_genes(coding_genes_file):
    with open(coding_genes_file, 'r') as file:
        coding_genes = [line.strip() for line in file]
    return set(coding_genes)

# 处理文件，筛选出编码基因的ensemblID和count数
def filter_coding_genes_counts(input_file_path, coding_genes):
    filtered_counts = {}
    with open(input_file_path, 'r') as file:
        for line in file:
            ensembl_id, count = line.strip().split()
            if ensembl_id in coding_genes:
                filtered_counts[ensembl_id] = int(count)
    return filtered_counts

# 遍历第一类文件，筛选并为每个文件保存结果到独立的文件中
def process_files(data_dir, coding_genes_file):
    coding_genes = read_coding_genes(coding_genes_file)
    
    # 使用glob模块来找到所有匹配的文件
    for file_path in glob.glob(os.path.join(data_dir, "*:Aligned.sortedByCoord.out.bam_htseq.counts")):
        # 从文件名中提取唯一的标识符，例如 ERR 后面的数字
        identifier = os.path.basename(file_path).split(":")[0]
        
        # 创建输出文件名
        output_file_name = f"{identifier}:Aligned.sortedByCoord.out.bam_htseq.counts.filter"
        output_file_path = os.path.join(data_dir, output_file_name)
        
        # 筛选并保存结果到独立的文件中
        filtered_counts = filter_coding_genes_counts(file_path, coding_genes)
        with open(output_file_path, 'w') as output_file:
            for ensembl_id, count in filtered_counts.items():
                output_file.write(f"{ensembl_id}\t{count}\n")

# 调用函数
coding_genes_file = 'mus_pcg_ensemid_unique.txt'  # 第二类文件的名字
process_files(data_dir, coding_genes_file)

(base) [animal_geneticB@master-cli1-x86-agent1 4_count]$ python run_filter_pcd_counts.py
(base) [animal_geneticB@master-cli1-x86-agent1 4_count]$ wc -l *.filter
  21652 ERR10773103:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773104:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773105:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773109:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773111:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773112:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773113:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773114:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773119:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773120:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773121:Aligned.sortedByCoord.out.bam_htseq.counts.filter
  21652 ERR10773123:Aligned.sortedByCoord.out.bam_htseq.counts.filter
 259824 total

修改run_merge_counts.py脚本：

import pandas as pd
import os
import sys

# 获取目标目录下的所有.counts文件列表
directory = '/workspace/home/pig/animal_geneticB/RNA_seq/4_count'
counts_files = [f for f in os.listdir(directory) if f.endswith('.filter')]
file_base_names = [os.path.splitext(f)[0] for f in counts_files]  # 提取ERR编号

# 初始化一个空的DataFrame，用于存储最终结果
df_all = pd.DataFrame()

# 遍历每个.counts文件
for file_name, base_name in zip(counts_files, file_base_names):
    # 构建完整路径
    file_path = os.path.join(directory, file_name)

    # 读取文件，假设文件格式为两列，第一列基因ID，第二列计数
    df_temp = pd.read_csv(file_path, sep='\t', header=None, names=['GeneID', 'Count'])

    # 将当前文件的计数作为新列添加到总的DataFrame中，基因ID作为索引
    df_all[base_name] = df_temp.set_index('GeneID')['Count']

# 填充缺失值，如果某个基因在某些文件中没有出现，则填充为0
df_all.fillna(0, inplace=True)

# 重置索引，使得基因ID成为新的一列
df_all.reset_index(inplace=True)

# 输出到新的csv文件，或者根据需要调整输出格式
output_file = 'merged_counts.csv'
df_all.to_csv(output_file, index=False)
print(f"Merged counts saved to {output_file}")