StringTie Explained in Detail

songyi10

Article link: https://blog.csdn.net/songyi10/article/details/125015671

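StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm, together with an optional de novo assembly step, to assemble and quantify full-length transcripts representing the multiple splice variants of each gene locus. Its input can include not only the short-read alignments that other transcript assemblers also accept, but also alignments of longer sequences assembled from those reads. To identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software such as Ballgown, Cuffdiff, or other programs (DESeq2, edgeR, etc.).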

Download and Installation

Installing from source

wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.tar.gz
tar -zxvf stringtie-2.2.1.tar.gz
cd stringtie-2.2.1
make release

Installing from GitHub

git clone https://github.com/gpertea/stringtie
cd stringtie
make release

Installing with conda (recommended; saves time and effort)

conda install stringtie -c bioconda

Usage in Detail

Basic usage: stringtie <aligned_reads.bam> [options]*
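
As a rough illustration of the basic usage line, the Python sketch below only assembles a StringTie command line and prints it; nothing is executed. The helper name `build_stringtie_cmd` and all file names are hypothetical, while the flags (-o, -p, -G, -l, -e) are the ones listed in the help text of this article.

```python
import shlex


def build_stringtie_cmd(bam, out_gtf, guide_gff=None, threads=1,
                        label=None, estimate_only=False):
    """Return the argument list for one StringTie run (nothing is executed)."""
    cmd = ["stringtie", bam, "-o", out_gtf, "-p", str(threads)]
    if guide_gff:
        cmd += ["-G", guide_gff]          # reference annotation (GTF/GFF3)
    if label:
        cmd += ["-l", label]              # prefix for output transcript names
    if estimate_only:
        if not guide_gff:
            # The help text states that -e requires -G.
            raise ValueError("-e requires a reference annotation (-G)")
        cmd.append("-e")
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_stringtie_cmd("sample1.sorted.bam", "sample1.gtf",
                          guide_gff="genes.gtf", threads=4)
print(shlex.join(cmd))
# stringtie sample1.sorted.bam -o sample1.gtf -p 4 -G genes.gtf
```

To actually run the command, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where StringTie is installed.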

StringTie v1.3.3b usage:
 stringtie <input.bam ..> [-G <guide_gff>] [-l <label>] [-o <out_gtf>] [-p <cpus>]
  [-v] [-a <min_anchor_len>] [-m <min_tlen>] [-j <min_anchor_cov>] [-f <min_iso>]
  [-C <coverage_file_name>] [-c <min_bundle_cov>] [-g <bdist>] [-u]
  [-e] [-x <seqid,..>] [-A <gene_abund.out>] [-h] {-B | -b <dir_path>}
Assemble RNA-Seq alignments into potential transcripts.
 Options:
 --version : print just the version at stdout and exit
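 -G reference annotation to use for guiding the assembly process (GTF/GFF3)
    Uses a reference annotation file (GTF/GFF3) to guide the assembly. The output then contains both the expressed known transcripts and novel transcripts. Options -B, -b, -e and -C require this option.
 --rf assume stranded library fr-firststrand
    Stranded library of type fr-firststrand (most commonly the dUTP method; others include NSR and NNSR).
 --fr assume stranded library fr-secondstrand
    Stranded library of type fr-secondstrand (e.g. Ligation, Standard SOLiD).
 -l name prefix for output transcripts (default: STRG)
    Sets <label> as the prefix for the names of the output transcripts. Default: STRG
 -f minimum isoform fraction (default: 0.1)
    Sets the minimum abundance of a predicted isoform, as a fraction of the most abundant transcript assembled at the same locus. Lower-abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts. Default: 0.1
 -m minimum assembled transcript length (default: 200)
    Sets the minimum length allowed for a predicted transcript. Default: 200
 -o output path/file name for the assembled transcripts GTF (default: stdout)
    Sets the path and file name of the output GTF file with the assembled transcripts. A full path can be given here, in which case directories are created as needed. By default StringTie writes the GTF to standard output.
 -a minimum anchor length for junctions (default: 10)
    Minimum anchor length for a junction: junctions with no spliced read aligning across them with at least this many bases on both sides are filtered out. Default: 10
 -j minimum junction coverage (default: 1)
    Junction coverage: at least this many spliced reads must align across a junction. The number can be fractional because some reads align in more than one place; a read that aligns in n places contributes 1/n to the junction coverage. Default: 1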
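 -t disable trimming of predicted transcripts based on coverage
    (default: coverage trimming is enabled)
    Disables trimming at the ends of the assembled transcripts. By default StringTie adjusts the start and/or stop coordinates of a predicted transcript based on sudden drops in the coverage of the assembled transcript.
 -c minimum reads per bp coverage to consider for transcript assembly
    (default: 2.5)
    Sets the minimum read coverage allowed for a predicted transcript. A transcript whose coverage is below this threshold is not written to the output. Default: 2.5
 -v verbose (log bundle processing details)
    Prints processing details while the program runs.
 -g gap between read mappings triggering a new bundle (default: 50)
    Minimum locus gap separation: reads mapped closer together than this distance are merged into the same processing bundle. Default: 50 (bp)
 -C output a file with reference transcripts that are covered by reads
    Writes a file with the reference transcripts (those in the annotation given with -G) that are fully covered by reads. Requires -G.
 -M fraction of bundle allowed to be covered by multi-hit reads (default:0.95)
    Maximum fraction of multi-mapped reads allowed at a given locus. Default: 0.95
 -p number of threads (CPUs) to use (default: 1)
 -A gene abundance estimation output file
    Writes gene abundance estimates to the given file.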
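 -B enable output of Ballgown table files which will be created in the
    same directory as the output GTF (requires -G, -o recommended)
    With this option StringTie writes the Ballgown input table files (*.ctab), which contain coverage data for the reference transcripts given with the -G option.
 -b enable output of Ballgown table files but these files will be
    created under the directory path given as <dir_path>
    Writes the *.ctab files to <dir_path> instead of the directory given with the -o option.
 -e only estimate the abundance of given reference transcripts (requires -G)
    Limits the processing of read alignments to estimating and outputting only the assembled transcripts that match the reference transcripts given with -G. Read bundles with no matching reference transcript are skipped entirely, which can speed up processing considerably.
 -x do not assemble any transcripts on the given reference sequence(s)
    Ignores all read alignments (and therefore assembles no transcripts) on the specified reference sequences. <seqid_list> can be a single reference sequence name (e.g. -x chrM) or a comma-delimited list of names (e.g. -x 'chrM,chrX,chrY'). This can speed up StringTie, especially when the mitochondrial genome is excluded: its genes can have very high coverage in some cases even though they are of no interest for a particular RNA-Seq analysis.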
 -u no multi-mapping correction (default: correction enabled)
 -h print this usage message and exit
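
The fractional junction coverage described for -j can be made concrete with a small calculation. This is only a sketch of the 1/n rule stated above, not StringTie's internal code; the function name and the example hit counts are made up, and it assumes the number of alignment locations per read is already known (in BAM files this is commonly recorded in the NH tag).

```python
from fractions import Fraction


def junction_coverage(hit_counts):
    """Sum 1/n over the spliced reads crossing one junction.

    hit_counts holds, for each read, the number of places n it aligns to.
    """
    return sum(Fraction(1, n) for n in hit_counts)


# Three spliced reads cross the junction: one aligns uniquely,
# one aligns in 2 places and one aligns in 4 places.
cov = junction_coverage([1, 2, 4])
print(float(cov))   # 1.75
print(cov >= 1)     # True: meets the default threshold of -j 1
```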

Transcript merge usage mode:
  stringtie --merge [Options] { gtf_list | strg1.gtf ...}
With this option StringTie will assemble transcripts from multiple
input files generating a unified non-redundant set of isoforms. In this mode
the following options are available:
  -G <guide_gff>   reference annotation to include in the merging (GTF/GFF3)
  -o <out_gtf>     output file name for the merged transcripts GTF
                    (default: stdout)
  -m <min_len>     minimum input transcript length to include in the merge
                    (default: 50)
  -c <min_cov>     minimum input transcript coverage to include in the merge
                    (default: 0)
  -F <min_fpkm>    minimum input transcript FPKM to include in the merge
                    (default: 1.0)
  -T <min_tpm>     minimum input transcript TPM to include in the merge
                    (default: 1.0)
  -f <min_iso>     minimum isoform fraction (default: 0.01)
  -g <gap_len>     gap between transcripts to merge together (default: 250)
  -i               keep merged transcripts with retained introns; by default
                   these are not kept unless there is strong evidence for them
  -l <label>       name prefix for output transcripts (default: MSTRG)
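
A merge run is usually driven by a plain text file that lists one GTF path per line (the `gtf_list` argument above). The sketch below writes such a list and assembles the matching `stringtie --merge` command without running it. The helper names and file names are hypothetical.

```python
import shlex
import tempfile
from pathlib import Path


def write_gtf_list(gtf_paths, list_file):
    """Write one GTF path per line (the text-file form of gtf_list)."""
    Path(list_file).write_text("".join(f"{p}\n" for p in gtf_paths))
    return Path(list_file)


def build_merge_cmd(gtf_list, out_gtf, guide_gff=None, min_tpm=None):
    """Return the argument list for `stringtie --merge` (nothing is executed)."""
    cmd = ["stringtie", "--merge", "-o", out_gtf]
    if guide_gff:
        cmd += ["-G", guide_gff]
    if min_tpm is not None:
        cmd += ["-T", str(min_tpm)]
    cmd.append(str(gtf_list))
    return cmd


# Hypothetical per-sample GTF files; the list is written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gtf_list = write_gtf_list(["sample1.gtf", "sample2.gtf"],
                              Path(tmp) / "mergelist.txt")
    listed = gtf_list.read_text().splitlines()
    cmd = build_merge_cmd(gtf_list.name, "merged.gtf", guide_gff="genes.gtf")

print(listed)            # ['sample1.gtf', 'sample2.gtf']
print(shlex.join(cmd))   # stringtie --merge -o merged.gtf -G genes.gtf mergelist.txt
```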

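Points to note when using StringTie:

First, aligned_reads.bam is the input file, and it must be sorted by genomic position. TopHat's output file accepted_hits.bam can be used directly, whereas HISAT2 output must first be sorted with samtools sort, and the resulting BAM file is then used as input.
Second, every spliced read alignment in the input BAM file (i.e. an alignment that spans at least one junction) must carry the XS tag, which indicates the genomic strand that produced the RNA from which the read was sequenced. Alignments produced by TopHat and by HISAT2 (when run with --dta, which reports alignments tailored for transcript assemblers) already include this tag. Other read mappers may not add it, so check the alignments before moving on to the next step. Note: always run HISAT2 with the --dta option, otherwise the results will suffer.
Third, a reference annotation file in GTF/GFF3 format can optionally be given to StringTie. In this case StringTie prefers to use the "known" genes from the annotation file, and for those that are expressed it computes coverage, TPM and FPKM values. It also produces additional transcripts that are not present in the annotation file. Note that if option -e is not used, a reference transcript must be fully covered by reads in order to be included in StringTie's output. In that case, other transcripts that StringTie assembles from the data and that are not in the annotation file are written out as well.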
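
The XS requirement in the second point can be checked before running StringTie. The following is a minimal sketch that scans SAM-format text lines, treats an alignment as spliced when its CIGAR string contains an N operation, and reports reads lacking an `XS:A:` strand tag. The function name and the sample records are made up for illustration.

```python
def spliced_without_xs(sam_lines):
    """Return names of spliced alignments (CIGAR contains N) with no XS:A: tag."""
    missing = []
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        if "N" not in fields[5]:          # column 6 is the CIGAR string
            continue
        if not any(tag.startswith("XS:A:") for tag in fields[11:]):
            missing.append(fields[0])
    return missing


# Made-up SAM records: r1 is spliced with XS, r2 is spliced without XS,
# r3 is unspliced (so the tag is not required).
SAM_LINES = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t50M200N50M\t*\t0\t0\t*\t*\tXS:A:+\tNH:i:1",
    "r2\t0\tchr1\t300\t60\t30M500N70M\t*\t0\t0\t*\t*\tNH:i:1",
    "r3\t0\tchr1\t900\t60\t100M\t*\t0\t0\t*\t*\tNH:i:1",
]
print(spliced_without_xs(SAM_LINES))   # ['r2']
```

On real data the lines could come from the text output of `samtools view -h sample.bam`, for example read from standard input.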

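Output

Main outputs

GTF file with the assembled transcripts (-o)
Tab-delimited file with gene abundances (-A)
GTF file with the reference transcripts that are fully covered by reads (-C)
*.ctab files: input for differential expression analysis with the downstream Ballgown software (-B)
In merge mode, a single merged GTF file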
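
When -B or -b is used, a quick sanity check is to confirm that the table files were written for each sample. The sketch below assumes the five table names documented for Ballgown input (e2t.ctab, e_data.ctab, i2t.ctab, i_data.ctab, t_data.ctab); the function name and the temporary directory are only for illustration.

```python
import tempfile
from pathlib import Path

# Table names as documented for Ballgown input; treat this as an assumption
# and compare against the files your StringTie version actually writes.
BALLGOWN_TABLES = ("e2t.ctab", "e_data.ctab", "i2t.ctab", "i_data.ctab", "t_data.ctab")


def missing_ballgown_tables(sample_dir):
    """Return the expected *.ctab files that are absent from sample_dir."""
    directory = Path(sample_dir)
    return [name for name in BALLGOWN_TABLES if not (directory / name).is_file()]


# Demo on a temporary directory holding only the first three tables.
with tempfile.TemporaryDirectory() as tmp:
    for name in BALLGOWN_TABLES[:3]:
        (Path(tmp) / name).touch()
    print(missing_ballgown_tables(tmp))   # ['i_data.ctab', 't_data.ctab']
```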

GTF file: assembled transcripts

seqname: Chromosome, contig, or scaffold.
source: The source of the GTF file; for StringTie output this column shows "StringTie".
feature: Feature type, e.g. exon, transcript, mRNA, 5'UTR.
start: Start position, using a 1-based index.
end: End position, using a 1-based index.
score: Confidence score for the assembled transcript. This field is currently not used; StringTie reports a constant value of 1000 if the transcript has a connection to a read alignment bundle.
strand: '+' for the forward strand, '-' for the reverse strand.
frame: Frame or phase of CDS features. StringTie does not use this field and simply records a ".".
attributes:
gene_id: A unique identifier for a single gene and its child transcripts and exons, based on the alignments' file name.
transcript_id: A unique identifier for a single transcript and its child exons, based on the alignments' file name.
exon_number: A unique identifier for a single exon, starting from 1, within a given transcript.
reference_id: The transcript_id in the reference annotation (optional) that the instance matched.
ref_gene_id: The gene_id in the reference annotation (optional) that the instance matched.
ref_gene_name: The gene_name in the reference annotation (optional) that the instance matched.
cov: The average per-base coverage for the transcript or exon.
FPKM: Fragments per kilobase of transcript per million read pairs. This is the number of pairs of reads aligning to this feature, normalized by the total number of fragments sequenced (in millions) and the length of the transcript (in kilobases).
TPM: Transcripts per million. This is the number of transcripts from this particular gene, normalized first by gene length and then by sequencing depth (in millions) in the sample. TPM was defined by B. Li and C. Dewey.
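
To show how the nine columns and the attribute field fit together, here is a small parsing sketch. The example line is invented (its values do not come from a real run), and the function name is hypothetical; the column order and attribute names follow the descriptions above.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]
# Attributes look like: key "value"; key "value"; ...
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one GTF line into its eight fixed columns plus an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, cols[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = dict(ATTRIBUTE_RE.findall(cols[8]))
    return record


# Invented transcript line in the layout described above.
line = ('chr1\tStringTie\ttranscript\t11869\t14409\t1000\t+\t.\t'
        'gene_id "STRG.1"; transcript_id "STRG.1.1"; '
        'cov "3.2"; FPKM "1.5"; TPM "2.1";')
rec = parse_gtf_line(line)
print(rec["feature"], rec["start"], rec["end"])   # transcript 11869 14409
print(rec["attributes"]["transcript_id"])         # STRG.1.1
print(float(rec["attributes"]["TPM"]))            # 2.1
```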

Tab文件：记录基因丰度信息

Column 1 / Gene ID: The gene identifier comes from the reference annotation provided with the -G option. If no reference is provided this field is replaced with the name prefix for output transcripts (-l).
Column 2 / Gene Name: This field contains the gene name in the reference annotation provided with the -G option. If no reference is provided this field is populated with ‘-’.
Column 3 / Reference: Name of the reference sequence that was used in the alignment of the reads. Equivalent to the 3rd column in the .SAM alignment.
Column 4 / Strand: ‘+’ denotes that the gene is on the forward strand, ‘-’ for the reverse strand.
Column 5 / Start: Start position of the gene (1-based index).
Column 6 / End: End position of the gene (1-based index).
Column 7 / Coverage: Per-base coverage of the gene.
Column 8 / FPKM: normalized expression level in FPKM units (see previous section).
Column 9 / TPM: normalized expression level in TPM units (see previous section).
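
The gene abundance table can be read with any tab-delimited parser. The sketch below keeps genes at or above a TPM threshold. It assumes the file begins with a header row whose column names match the nine columns listed above (check the first line of your own -A output); the sample rows and the function name are made up.

```python
import csv
import io

# Made-up table in the nine-column layout described above.
SAMPLE = (
    "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n"
    "STRG.1\t-\tchr1\t+\t1000\t5000\t12.5\t3.10\t4.20\n"
    "ENSG0001\tABC1\tchr1\t-\t8000\t9500\t0.4\t0.05\t0.07\n"
)


def expressed_genes(handle, min_tpm=1.0):
    """Return (gene_id, tpm) pairs for rows whose TPM is at least min_tpm."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["Gene ID"], float(row["TPM"]))
            for row in reader
            if float(row["TPM"]) >= min_tpm]


print(expressed_genes(io.StringIO(SAMPLE)))   # [('STRG.1', 4.2)]
```

For a real file, replace the `io.StringIO(SAMPLE)` handle with `open("gene_abund.tab", newline="")`, using whatever file name was passed to -A.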