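Cufflinks: cuffmerge and cuffdiff

Reposted from: http://yangl.net/2016/06/03/cufflinks/

1. Introduction to Cuffmerge

Cuffmerge merges the transcripts.gtf files produced by individual Cufflinks runs into a single, more comprehensive transcript annotation file, merged.gtf. This merged file is then used for differential expression analysis with Cuffdiff.

2. Usage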

$ cuffmerge [options]* <assembly_GTF_list.txt>

The input is a text file that lists the paths of the GTF files to merge. A typical example:

$ cuffmerge -o ./merged_asm -p 8 assembly_list.txt
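
The list file and the command line can also be put together from a script. The following is a minimal sketch, not part of Cufflinks: the helper name `build_cuffmerge_command` and the sample GTF paths are hypothetical, and the function only writes the list file and returns the argument list, leaving execution to the caller.

```python
from pathlib import Path
import tempfile


def build_cuffmerge_command(gtf_paths, list_file, out_dir="./merged_asm", threads=1):
    """Write one GTF path per line to list_file and return the cuffmerge argument list."""
    if not gtf_paths:
        raise ValueError("at least one GTF path is required")
    Path(list_file).write_text("\n".join(str(p) for p in gtf_paths) + "\n")
    return ["cuffmerge", "-o", out_dir, "-p", str(threads), str(list_file)]


# Hypothetical per-sample Cufflinks outputs; replace with real paths.
gtfs = ["sample1/transcripts.gtf", "sample2/transcripts.gtf"]
list_path = Path(tempfile.mkdtemp()) / "assembly_list.txt"
cmd = build_cuffmerge_command(gtfs, list_path, threads=8)
print(" ".join(cmd))
# To run it (requires cuffmerge on PATH): subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps paths with spaces intact when the list is later passed to `subprocess.run`.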

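3. Options

-h | --help
-o  default: ./merged_asm  Write the results to this directory.
-g | --ref-gtf  Merge this reference GTF into the final result.
-p | --num-threads  default: 1  Number of CPU threads to use.
-s | --ref-sequence <seq_dir>/<seq_fasta>  Points to the genomic DNA sequence. If it is a directory, it should contain one FASTA file per contig; if it is a single FASTA file, all contigs must be in that file. Cuffmerge uses the reference sequence to help classify transfrags and exclude repeats. For example, transcripts consisting mostly of lower-case bases are classified as repeats.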

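4. Cuffmerge output

By default, the result is written to merged.gtf inside the output directory (./merged_asm/merged.gtf with the default -o).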



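1. Introduction to Cuffdiff

Cuffdiff finds significant differences in transcript expression.

2. Cuffdiff usage

Cuffdiff mainly detects significant changes in transcript expression, splicing, and promoter use.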

$ cuffdiff [options]* <transcripts.gtf> <sample1_1.sam[,...,sample1_M.sam]> <sample2_1.sam[,...,sample2_M.sam]> ... [sampleN_1.sam[,...,sampleN_M.sam]]

Here transcripts.gtf is a file produced by cufflinks, cuffcompare, or cuffmerge, or by another program. When a sample has several replicates, separate their files with commas. When more than one sample is given, Cuffdiff tests for differences in gene expression between the samples. A typical example:

$ cuffdiff --labels label1,label2 -p 8 --time-series --multi-read-correct --library-type fr-unstranded --poisson-dispersion transcripts.gtf sample1.sam sample2.sam

Cuffdiff accepts BAM/SAM files or CXB files produced by cuffquant. BAM and SAM files can be mixed in one run, but BAM/SAM files cannot be mixed with CXB files.
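
The argument layout (replicates joined by commas, one argument per sample, no mixing of BAM/SAM with CXB) can be sketched in Python. This is only an illustration under stated assumptions: `build_cuffdiff_command`, the labels, and the file names are hypothetical, and the file type is guessed from the extension alone.

```python
from pathlib import PurePath

ALIGNMENT_EXTS = {".sam", ".bam"}


def build_cuffdiff_command(gtf, conditions, out_dir="./", threads=1):
    """Return the cuffdiff argument list.

    conditions maps each label to the list of replicate files for that sample.
    """
    exts = {PurePath(f).suffix.lower() for reps in conditions.values() for f in reps}
    if ".cxb" in exts and exts & ALIGNMENT_EXTS:
        raise ValueError("BAM/SAM files cannot be mixed with CXB files")
    cmd = ["cuffdiff", "-o", out_dir, "-p", str(threads),
           "-L", ",".join(conditions), gtf]
    # Replicates of one sample are comma-joined; each sample is its own argument.
    cmd += [",".join(reps) for reps in conditions.values()]
    return cmd


# Hypothetical labels and file names.
conditions = {
    "WT": ["WT_rep1.bam", "WT_rep2.sam"],
    "KO": ["KO_rep1.bam", "KO_rep2.bam"],
}
cmd = build_cuffdiff_command("merged.gtf", conditions, out_dir="diff_out", threads=8)
print(" ".join(cmd))
```

Because the labels and the file groups come from the same dictionary, the order of the -L labels always matches the order of the sample arguments.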

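3. Options

-h | --help
-o | --output-dir  default: ./  Output directory.
-L | --labels  default: q1,q2,...qN  A label (sample name or condition) for each sample.
-p | --num-threads  default: 1  Number of CPU threads to use.
-T | --time-series  Compare samples in the order given instead of testing every pair: the second SAM against the first, the third against the second, the fourth against the third, and so on.
-N | --upper-quartile-norm  Normalize by the upper quartile (75th percentile) of the fragment counts mapped to individual loci instead of the total fragment count. This helps when looking for differential expression among low-abundance genes and transcripts.
--total-hits-norm  When computing FPKM, count all fragments, including those that are not compatible with any reference transcript. Mutually exclusive with the next option. Off by default.
--compatible-hits-norm  When computing FPKM, count only fragments that are compatible with a reference transcript. On by default; it reduces the influence of ribosomal RNA reads on expression estimates.
-b | --frag-bias-correct <genome.fa>  Provide a FASTA file (usually genome.fa) so that Cufflinks runs its bias detection and correction algorithm, which can noticeably improve the accuracy of transcript abundance estimates.
-u | --multi-read-correct  Have Cufflinks run an initial estimation step to weight reads that map to multiple genomic locations more accurately.
-c | --min-alignment-count  default: 10  If fewer fragments than this align to a locus, no significance test is run for that locus, and it is treated as not significantly different.
-M | --mask-file  Provide a GTF/GFF file. Cufflinks ignores reads that align to the transcripts in this file. The file usually contains rRNA annotations and may also include mitochondrial or other transcripts you want to ignore. Removing these unwanted RNAs improves the estimation of mRNA expression.
--FDR  default: 0.05  The allowed false discovery rate.
--library-type  default: fr-unstranded  The library type of the reads. For strand-specific reads the alignments carry an XS tag. Typical Illumina data uses fr-unstranded.
--dispersion-method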
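
How -T/--time-series changes the set of comparisons can be shown with a short sketch. It only enumerates sample pairs as described above and does not call Cuffdiff; the function name `planned_comparisons` is made up for this example, and the labels are the default q1...qN names.

```python
from itertools import combinations


def planned_comparisons(labels, time_series=False):
    """Return the (earlier, later) sample pairs that would be tested."""
    if time_series:
        # Each sample is compared only with the one immediately before it.
        return list(zip(labels, labels[1:]))
    # Default: every pair of samples.
    return list(combinations(labels, 2))


labels = ["q1", "q2", "q3", "q4"]
print(planned_comparisons(labels, time_series=True))
# [('q1', 'q2'), ('q2', 'q3'), ('q3', 'q4')]
print(len(planned_comparisons(labels)))
# 6
```

With N samples the default mode tests N(N-1)/2 pairs, while the time-series mode tests only N-1.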

Other advanced options:
-m | --frag-len-mean  default: 200  The mean fragment length. Cufflinks now learns the fragment length mean from the data, so setting this value manually is not recommended.
-s | --frag-len-std-dev  default: 80  The standard deviation of the fragment length. Cufflinks now learns this from the data as well, so setting this value manually is not recommended.
-v/--verbose  Print status updates and other diagnostic information.
-q/--quiet  Print nothing except warnings and errors.
--no-update-check  Turn off the automatic check for Cufflinks updates.
-F/--min-isoform-fraction <0.0-1.0>  Changing this is not recommended. Alternative isoforms quantified below this fraction of the major isoform are rounded down to 0. Default: 1e-5.
--max-bundle-frags  The maximum number of fragments a locus may have before it is skipped. Default: 1000000.
--max-frag-count-draws  Default: 100.
--max-frag-assign-draws  Default: 50.
--min-reps-for-js-test  The minimum number of replicates required to test genes for differential regulation. Cuffdiff won't test genes for differential regulation unless the conditions in question have at least this many replicates. Default: 3.
--no-effective-length-correction  Cuffdiff will not employ its "effective" length normalization to transcript FPKM.
--no-length-correction  Cuffdiff will not normalize fragment counts by transcript length at all. Use this when the fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Use with caution.
--max-mle-iterations  The number of iterations for maximum likelihood estimation. Default: 5000.
--poisson-dispersion  Use the Poisson fragment dispersion model instead of learning one in each condition.

4. Cuffdiff output

1. FPKM tracking files   Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are the sums of the FPKMs of the transcripts in each primary transcript group or gene group.
isoforms.fpkm_tracking  Transcript FPKMs
genes.fpkm_tracking  Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id
cds.fpkm_tracking  Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id
tss_groups.fpkm_tracking  Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id
2. Count tracking files   Cuffdiff estimates the number of fragments that originated from each transcript, primary transcript, and gene in each sample. Primary transcript and gene counts are the sums of the counts of the transcripts in each primary transcript group or gene group.
isoforms.count_tracking  Transcript counts
genes.count_tracking  Gene counts. Tracks the summed counts of transcripts sharing each gene_id
cds.count_tracking  Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id
tss_groups.count_tracking  Primary transcript counts. Tracks the summed counts of transcripts sharing each tss_id
3. Read group tracking files   Cuffdiff calculates the expression and fragment count of each transcript, primary transcript, and gene in each replicate.
isoforms.read_group_tracking  Transcript read group tracking
genes.read_group_tracking  Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene_id in each replicate
cds.read_group_tracking  Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id in each replicate
tss_groups.read_group_tracking  Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate
4. Differential expression tests   Tests for differences in expression between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For each pair of samples x and y, the following four files are produced:
isoform_exp.diff  Transcript differential FPKM.
gene_exp.diff  Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id
tss_group_exp.diff  Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id
cds_exp.diff  Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id
Each file has the following format:
Column number  Column name  Example  Description
1  Tested id  XLOC_000001  A unique identifier describing the transcript, gene, primary transcript, or CDS being tested
2  gene  Lypla1  The gene_name(s) or gene_id(s) being tested
3  locus  chr1:4797771-4835363  Genomic coordinates for easy browsing to the genes or transcripts being tested.
4  sample 1  Liver  Label (or number if no labels provided) of the first sample being tested
5  sample 2  Brain  Label (or number if no labels provided) of the second sample being tested
6  Test status  NOTEST  Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7  FPKMx  8.01089  FPKM of the gene in sample x
8  FPKMy  8.551545  FPKM of the gene in sample y
9  log2(FPKMy/FPKMx)  0.06531  The (base 2) log of the fold change y/x
10  test stat  0.860902  The value of the test statistic used to compute significance of the observed change in FPKM
11  p value  0.389292  The uncorrected p-value of the test statistic
12  q value  0.985216  The FDR-adjusted p-value of the test statistic
13  significant  no  Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
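
A file in this format can be filtered with the standard csv module. The sketch below is illustrative only: the header names (`test_id`, `status`, `q_value`, and so on) are an assumption based on typical Cuffdiff 2.x output and should be checked against your own file, and the second data row is made up. Reading columns by name rather than by position keeps the code working if a file carries extra columns.

```python
import csv
import io

# Assumed header names; verify them against the first line of your own *.diff file.
HEADER = ["test_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "log2(fold_change)", "test_stat",
          "p_value", "q_value", "significant"]
ROWS = [
    # Example row from the table above.
    ["XLOC_000001", "Lypla1", "chr1:4797771-4835363", "Liver", "Brain", "NOTEST",
     "8.01089", "8.551545", "0.06531", "0.860902", "0.389292", "0.985216", "no"],
    # Made-up row for illustration.
    ["XLOC_000002", "GeneB", "chr1:100-200", "Liver", "Brain", "OK",
     "2.0", "16.0", "3.0", "5.1", "0.0001", "0.003", "yes"],
]
SAMPLE = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"


def significant_rows(handle, fdr=0.05):
    """Keep rows that were tested successfully and whose q-value is below fdr."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader
            if row["status"] == "OK" and float(row["q_value"]) < fdr]


hits = significant_rows(io.StringIO(SAMPLE))
print([row["test_id"] for row in hits])
```

For a real run, replace `io.StringIO(SAMPLE)` with `open("gene_exp.diff", newline="")`.
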
5. Differential splicing tests – splicing.diff   For each primary transcript, the difference between samples in the relative abundance of its isoforms. Only primary transcripts with two or more isoforms are listed.
Column number  Column name  Example  Description
1  Tested id  TSS10015  A unique identifier describing the primary transcript being tested.
2  gene name  Rtkn  The gene_name or gene_id that the primary transcript being tested belongs to
3  locus  chr6:83087311-83102572  Genomic coordinates for easy browsing to the genes or transcripts being tested.
4  sample 1  Liver  Label (or number if no labels provided) of the first sample being tested
5  sample 2  Brain  Label (or number if no labels provided) of the second sample being tested
6  Test status  OK  Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7  Reserved  0
8  Reserved  0
9  √JS(x,y)  0.22115  The splice overloading of the primary transcript, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the splice variants
10  test stat  0.22115  The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11  p value  0.000174982  The uncorrected p-value of the test statistic.
12  q value  0.985216  The FDR-adjusted p-value of the test statistic
13  significant  yes  Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
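
The √JS(x,y) column is the square root of the Jensen-Shannon divergence between the isoform proportions in the two samples. The sketch below computes that quantity from per-isoform abundances. It illustrates the formula only and is not Cuffdiff's implementation: the function names are hypothetical, and the use of base-2 logarithms (which bounds the result between 0 and 1) is an assumption.

```python
import math


def proportions(values):
    """Convert abundances to relative abundances that sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return [v / total for v in values]


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_distance(abund_x, abund_y):
    """Square root of the Jensen-Shannon divergence between two isoform mixes."""
    if len(abund_x) != len(abund_y):
        raise ValueError("both samples must list the same isoforms")
    p, q = proportions(abund_x), proportions(abund_y)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    js = entropy(m) - (entropy(p) + entropy(q)) / 2
    return math.sqrt(max(js, 0.0))  # guard against tiny negative rounding error


# Same isoform mix at different expression levels: no splicing change.
print(js_distance([3.0, 1.0], [6.0, 2.0]))
# Complete switch from one isoform to the other: maximal change.
print(js_distance([1.0, 0.0], [0.0, 1.0]))
```

Only the proportions matter, so a gene whose isoforms all double in abundance scores 0 here even though its overall expression changed.
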
6. Differential coding output – cds.diff   For each gene, the difference between samples in the relative output of its coding sequences. Only genes with two or more CDS (multi-protein genes) are listed in this file.
Column number  Column name  Example  Description
1  Tested id  XLOC_000002-[chr1:5073200-5152501]  A unique identifier describing the gene being tested.
2  gene name  Atp6v1h  The gene_name or gene_id
3  locus  chr1:5073200-5152501  Genomic coordinates for easy browsing to the genes or transcripts being tested.
4  sample 1  Liver  Label (or number if no labels provided) of the first sample being tested
5  sample 2  Brain  Label (or number if no labels provided) of the second sample being tested
6  Test status  OK  Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7  Reserved  0
8  Reserved  0
9  √JS(x,y)  0.0686517  The CDS overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the coding sequences
10  test stat  0.0686517  The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11  p value  0.00546783  The uncorrected p-value of the test statistic
12  q value  0.985216  The FDR-adjusted p-value of the test statistic
13  significant  yes  Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
7. Differential promoter use – promoters.diff   Differences in promoter use between samples. Only genes that produce two or more distinct primary transcripts (multi-promoter genes) are listed here.
8. Read group info – read_groups.info   For each replicate, this file lists the key properties Cuffdiff used during quantification.
Column number  Column name  Example  Description
1  file  mCherry_rep_A/accepted_hits.bam  BAM or SAM file containing the data for the read group
2  condition  mCherry  Condition to which the read group belongs
3  replicate_num  0  Replicate number of the read group
4  total_mass  4.72517e+06  Total number of fragments for the read group
5  norm_mass  4.72517e+06  Fragment normalization constant used during calculation of FPKMs.
6  internal_scale  1.23916  Internal scaling factor, used to transform replicates of a single condition onto the "internal" common count scale.
7  external_scale  0.96  External scaling factor, used to transform counts from different conditions onto an internal common count scale.
9. Run info – run.info   Information about the run.

The FPKM tracking files have the following format:

1  tracking_id  TCONS_00000001  A unique identifier describing the object (gene, transcript, CDS, primary transcript)

2  class_code  =  The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present

3  nearest_ref_id  NM_008866.1  The reference transcript to which the class code refers, if any

4  gene_id  NM_008866  The gene_id(s) associated with the object

5  gene_short_name  Lypla1  The gene_short_name(s) associated with the object

6  tss_id  TSS1  The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present

7  locus  chr1:4797771-4835363  Genomic coordinates for easy browsing to the object

8  length  2447  The number of base pairs in the transcript, or "-" if not a transcript/primary transcript

9  coverage  43.4279  Estimate for the absolute depth of read coverage across the object

10  q0_FPKM  8.01089  FPKM of the object in sample 0

11  q0_FPKM_lo  7.03583  The lower bound of the 95% confidence interval on the FPKM of the object in sample 0

12  q0_FPKM_hi  8.98595  The upper bound of the 95% confidence interval on the FPKM of the object in sample 0

13  q0_status  OK  Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
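
Gene-level FPKMs are described above as the sum of the FPKMs of the transcripts that share a gene_id. The sketch below reproduces that sum from a transcript-level table. It is an illustration, not Cuffdiff's code: the function name is hypothetical, the rows are made up, and only the columns the sketch needs are included.

```python
import csv
import io
from collections import defaultdict

# Made-up rows with only the columns this sketch needs.
SAMPLE = (
    "tracking_id\tgene_id\tq0_FPKM\tq1_FPKM\n"
    "TCONS_00000001\tXLOC_000001\t8.0\t2.5\n"
    "TCONS_00000002\tXLOC_000001\t1.5\t0.5\n"
    "TCONS_00000003\tXLOC_000002\t4.0\t4.0\n"
)


def gene_fpkm_from_isoforms(handle):
    """Sum transcript FPKMs per gene_id for every column ending in _FPKM."""
    reader = csv.DictReader(handle, delimiter="\t")
    fpkm_cols = [c for c in reader.fieldnames if c.endswith("_FPKM")]
    totals = defaultdict(lambda: dict.fromkeys(fpkm_cols, 0.0))
    for row in reader:
        for col in fpkm_cols:
            totals[row["gene_id"]][col] += float(row[col])
    return dict(totals)


totals = gene_fpkm_from_isoforms(io.StringIO(SAMPLE))
print(totals["XLOC_000001"])
# {'q0_FPKM': 9.5, 'q1_FPKM': 3.0}
```

Selecting columns by the `_FPKM` suffix means the same function handles any number of samples without hard-coding q0, q1, and so on.
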
The count tracking files have the following format:

1  tracking_id  TCONS_00000001  A unique identifier describing the object (gene, transcript, CDS, primary transcript)

2  q0_count  201.334  Estimated (externally scaled) number of fragments generated by the object in sample 0

3  q0_count_variance  5988.24  Estimated variance in the number of fragments generated by the object in sample 0

4  q0_count_uncertainty_var  170.21  Estimated variance in the number of fragments generated by the object in sample 0 due to fragment assignment uncertainty.

5  q0_count_dispersion_var  4905.63  Estimated variance in the number of fragments generated by the object in sample 0 due to cross-replicate variability.

6  q0_status  OK  Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
