基于rMATs的可变剪切分析

最新推荐文章于 2024-11-19 09:53:55 发布

bobo JLU

最新推荐文章于 2024-11-19 09:53:55 发布

阅读量1k

点赞数 2

分类专栏：可变剪切分析文章标签： python linux conda

本文链接：https://blog.csdn.net/codyliu_bo/article/details/128758226

版权

可变剪切分析专栏收录该内容

1 篇文章

订阅专栏

1. conda的基本操作

1.1 创建conda的环境及处理

创建新的conda环境

conda create -n rmats python=2.7.15

删除已经存在的环境

conda remove -n py36 --all

激活环境

conda activate rmats

退出环境

conda activate py36

1.2 设置镜像

注意镜像只需要设置一次即可

conda config --add channels r 
conda config --add channels conda-forge 
conda config --add channels bioconda
conda config --add channels https://mirrors.bfsu.edu.cn/anaconda/cloud/bioconda/
conda config --add channels https://mirrors.bfsu.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.bfsu.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.bfsu.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes

1.3 安装rmats镜像

采用conda方法安装rmats

conda search rmats  -c bioconda

conda install -c bioconda rmats=4.1.1
conda clean --al 
conda install -c bioconda rmats=4.1.1

这里同一句出现了两次，暂时也不确定原因，可能是博主错误，但是我运行的时候成功了。
暂定用这种方法。

需要仔细查看安装rmats这一个软件，我们的conda需要做的工作：

The following NEW packages will be INSTALLED:

  _libgcc_mutex      anaconda/cloud/conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      anaconda/cloud/conda-forge/linux-64::_openmp_mutex-4.5-1_gnu
  ca-certificates    anaconda/cloud/conda-forge/linux-64::ca-certificates-2020.12.5-ha878542_0
  certifi            anaconda/cloud/conda-forge/linux-64::certifi-2020.12.5-py38h578d9bd_1
  gsl                anaconda/cloud/conda-forge/linux-64::gsl-2.6-he838d99_2
  ld_impl_linux-64   anaconda/cloud/conda-forge/linux-64::ld_impl_linux-64-2.35.1-hea4e1c9_2
  libblas            anaconda/cloud/conda-forge/linux-64::libblas-3.9.0-8_openblas
  libcblas           anaconda/cloud/conda-forge/linux-64::libcblas-3.9.0-8_openblas
  libffi             anaconda/cloud/conda-forge/linux-64::libffi-3.3-h58526e2_2
  libgcc-ng          anaconda/cloud/conda-forge/linux-64::libgcc-ng-9.3.0-h2828fa1_18
  libgfortran-ng     anaconda/cloud/conda-forge/linux-64::libgfortran-ng-7.5.0-h14aa051_18
  libgfortran4       anaconda/cloud/conda-forge/linux-64::libgfortran4-7.5.0-h14aa051_18
  libgomp            anaconda/cloud/conda-forge/linux-64::libgomp-9.3.0-h2828fa1_18
  liblapack          anaconda/cloud/conda-forge/linux-64::liblapack-3.9.0-8_openblas
  libopenblas        anaconda/cloud/conda-forge/linux-64::libopenblas-0.3.12-pthreads_hb3c22a3_1
  libstdcxx-ng       anaconda/cloud/conda-forge/linux-64::libstdcxx-ng-9.3.0-h6de172a_18
  ncurses            anaconda/cloud/conda-forge/linux-64::ncurses-6.2-h58526e2_4
  numpy              anaconda/cloud/conda-forge/linux-64::numpy-1.20.1-py38h18fd61f_0
  openssl            anaconda/cloud/conda-forge/linux-64::openssl-1.1.1j-h7f98852_0
  pip                anaconda/cloud/conda-forge/noarch::pip-21.0.1-pyhd8ed1ab_0
  python             anaconda/cloud/conda-forge/linux-64::python-3.8.8-hffdb5ce_0_cpython
  python_abi         anaconda/cloud/conda-forge/linux-64::python_abi-3.8-1_cp38
  readline           anaconda/cloud/conda-forge/linux-64::readline-8.0-he28a2e2_2
  rmats              bioconda/linux-64::rmats-4.1.1-py38h566bde1_0
  setuptools         anaconda/cloud/conda-forge/linux-64::setuptools-49.6.0-py38h578d9bd_3
  sqlite             anaconda/cloud/conda-forge/linux-64::sqlite-3.34.0-h74cdb3f_0
  star               bioconda/linux-64::star-2.7.8a-0
  tk                 anaconda/cloud/conda-forge/linux-64::tk-8.6.10-h21135ba_1
  wheel              anaconda/cloud/conda-forge/noarch::wheel-0.36.2-pyhd3deb0d_0
  xz                 anaconda/cloud/conda-forge/linux-64::xz-5.2.5-h516909a_1
  zlib               anaconda/cloud/conda-forge/linux-64::zlib-1.2.11-h516909a_1010

安装成功后，就查看自己的软件：

$ STAR --version 
2.7.8a 
$ rmats.py  --version 
v4.1.1

此时如果都运行出结果，表示已经安装成功。我们使用官网中例子（https://rnaseq-mats.sourceforge.net/rmats4.1.1/）进行下一步的数据处理。

2. 数据分析

对于可变剪切分析，我们目前使用的是官网的测试数据进行分析。

2.1 gtf准备

gtf文件是基因组注释文件，需要下载。
参见https://blog.csdn.net/u011262253/article/details/89363809
考虑如何准备gtf文件。

gtf文件的内容如下

GTF是在GFF的基础上发展而来，二者有很多类似的地方，都是\t分隔的9列文件，内容也比较接近。GFF能够包含的信息更多更全，可以包含染色体，基因，转录本的信息，而GTF主要用来描述基因和转录本的信息。

GTF全称Gene transfer format, 每列的含义如下

column1
第一列是seqid, 代表序列ID, 通常是染色体的ID, 每条染色体拥有一个唯一的ID。
column2
第二列是source, 代表基因结构的来源，可以是数据库的名称，比如来自RefSeq数据库，也可以是软件的名称，比如用GeneScan软件预测得到，当然，也可以为空，用.点号填充。
column3
第三列是feature, 代表区间对应的特征类型, 在GTF中，常见的类型如下
```
 	5UTR
 	3UTR
 	exon
 	CDS
 	start_codon
 	stop_codon
```
column4
第四列是start, 代表区间的起始位置
column5
第四列是end, 代表区间的终止位置
column6
第六列是score, 软件提供了统计值，如果没有，就用.填充
column7
第七列是strand, 代表正负链的信息, +表示正链，-表示负链，?表示不清楚正负链的信息，当正负链信息没有意义时，可以用.填充
column8
第八列是phase，当描述的是CDS区间信息时，需要指定翻译时开始的位置，取值范围有0,1,2两种
column9
第九列是attributes, 表示属性，每种属性写法为key value, 注意和gff中key=value有所区别，而且必须有gene_id和transcript_id这两个属性，多个属性用分号分隔

在头部有#开头的注释行

#!genome-build GRCh38.p12
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.27
#!genebuild-last-updated 2018-01

在正文中，基因示例如下

1       havana  transcript      65419   71585   .       +       .       gene_id "ENSG00000186092"; gene_version "6"; transcript_id "ENST00000641515"; transcript_version "2"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR4F5-202"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";

其他信息参见
https://cloud.tencent.com/developer/article/16252043

使用的人源的文件都从此处下载

https://www.gencodegenes.org/human/

2.2 索引文件准备

$ cat g1.txt 
SRR8518122.bam,SRR8518123.bam,SRR8518124.bam
$ cat g2.txt 
SRR8518436.bam,SRR8518442.bam,SRR8518448.bam

这里要准备两个索引文件，其中空格，中英文输入法等都需要注意。

2.3 数据处理——以bam文件开始

rmats.py --b1 /path/to/b1.txt --b2 /path/to/b2.txt --gtf /path/to/the.gtf -t paired --readLength 50 --nthread 4 --od /path/to/output --tmp /path/to/tmp_output

运行成功的日志如下所示：

gtf: 26.418766975402832
There are 60651 distinct gene ID in the gtf file
There are 234485 distinct transcript ID in the gtf file
There are 36780 one-transcript genes in the gtf file
There are 1460986 exons in the gtf file
There are 25134 one-exon transcripts in the gtf file
There are 22496 one-transcript genes with only one exon in the transcript
Average number of transcripts per gene is 3.866136
Average number of exons per transcript is 6.230616
Average number of exons per transcript excluding one-exon tx is 6.858587
Average number of gene per geneGroup is 8.495835
statistic: 0.04167461395263672

通常呢，运行速度很快：

==========
Done processing each gene from dictionary to compile AS events
Found 55759 exon skipping events
Found 4089 exon MX events
Found 18752 alt SS events
There are 11349 alt 3 SS events and 7403 alt 5 SS events.
Found 8037 RI events
==========

ase: 3.8115618228912354
count: 5.383385896682739
Processing count files.
Done processing count files.

以上内容相当一部分转自如下

We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.

2.4 数据分析——以fq数据开始

下载gtf文件 https://www.gencodegenes.org/human/
建立两个文本文档，分别叫做s1.txt 和s2.txt
S1文档如下

 /home/labsever_0/Boqun/heilu/MS1.R1.fq:/home/labsever_0/Boqun/heilu/MS1.R2.fq,/home/labsever_0/Boqun/heilu/MS2.R1.fq:/home/labsever_0/Boqun/heilu/MS2.R2.fq,/home/labsever_0/Boqun/heilu/MS3.R1.fq:/home/labsever_0/Boqun/heilu/MS3.R2.fq

s2文档如下

/home/labsever_0/Boqun/heilu/NC1.R1.fq:/home/labsever_0/Boqun/heilu/NC1.R2.fq,/home/labsever_0/Boqun/heilu/NC2.R1.fq:/home/labsever_0/Boqun/heilu/NC2.R2.fq,/home/labsever_0/Boqun/heilu/NC3.R1.fq:/home/labsever_0/Boqun/heilu/NC3.R2.fq

开始运行rmats

rmats.py --s1 /path/to/s1.txt --s2 /path/to/s2.txt --gtf /path/to/the.gtf --bi /path/to/STAR_binary_index -t paired --readLength 50 --nthread 4 --od /path/to/output --tmp /path/to/tmp_output

较为详细的解释方法。同时注意我安装后用的是ramts.py 没有加python。

python rmats.py --s1 /path/to/s1.txt \              #样本1的fq文件
                --s2 /path/to/s2.txt \              #样本2的fq文件
                --gtf /path/to/the.gtf \            #gtf文件
                --bi /path/to/STAR_binary_index \   #STAR的二进制索引文件,(包含SA的文件)仅fq模式。
                -t paired \                         #声明是双端测序。默认paired
                --readLength 150 \                  #每个read的长度
                --nthread 4 \                       #线程数
                --od /path/to/output \              #输出目录
                --tmp /path/to/tmp_output           #缓存文件存储目录```


## 2.5 问题分析
这个问题在软件github有几个issues提到过类似问题，如 most output files with only a header
说可能是由参数--readLength设置得和实际的read长度不符导致的。加个参数--variable-read-length即可。

```bash
 rmats.py --b1 b1.txt --b2 b2.txt --gtf Bos_taurus.ARS-UCD1.2.108.chr.gtf -t paired --readLength 90 --variable-read-length --nthread 8 --allow-clipping  --od ./results/  --tmp ./tmp_output/