FPKM Calculation


qq_39306047


Categories: R, Linux

Article link: https://blog.csdn.net/qq_39306047/article/details/114370154


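1. When extracting and computing gene lengths from a GTF file, many entries share the same gene name, i.e. the same gene appears with several different lengths. Should FPKM be computed separately for each length and the results then summed and averaged? (Is that right?)
2. When htseq is used to compute counts, some of the quantified gene names cannot be found in the GTF file, so the FPKM table ends up with slightly fewer genes than the read-count table.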
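As a minimal sketch of the two situations in the notes above, the Python below averages FPKM values over rows that share a gene name and lists the genes that have a count but no length. It only illustrates the averaging idea that note 1 asks about; it does not settle whether averaging is the right treatment. The function names, the input layout, and the toy values are all hypothetical.

```python
from collections import defaultdict


def mean_fpkm_by_gene(records):
    """Average FPKM over rows that share a gene name.

    `records` is an iterable of (gene_name, fpkm) pairs. This layout is
    an assumption made for the illustration, not a format from the post.
    """
    grouped = defaultdict(list)
    for gene, fpkm in records:
        grouped[gene].append(float(fpkm))
    return {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}


def genes_without_length(counts, lengths):
    """Return the sorted gene names present in `counts` but absent from `lengths`."""
    return sorted(set(counts) - set(lengths))


# Toy values, not real data.
rows = [("geneA", 2.0), ("geneA", 4.0), ("geneB", 1.5)]
averaged = mean_fpkm_by_gene(rows)

counts = {"geneA": 10, "geneB": 5, "geneX": 3}
lengths = {"geneA": 2000, "geneB": 1500}
missing = genes_without_length(counts, lengths)

print(averaged)
print(missing)
```

Listing the missing genes before computing FPKM makes the gap described in note 2 explicit instead of silently shrinking the table.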

https://www.jianshu.com/p/5b9aa8ec8cc7

https://www.jianshu.com/p/49f2030e937c

https://www.jianshu.com/p/b5c35df4ac36

https://www.jianshu.com/p/2cce4376be48

How to average the FPKM of identical genes in Python and then output the result: https://shengxin.ren/question/159

A detailed explanation of bigWig normalization methods: https://blog.csdn.net/weixin_43569478/article/details/108079478?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522161495063116780255276217%2522%252C%2522scm%2522%253A%252220140713.130102334…%2522%257D&request_id=161495063116780255276217&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2_allsobaiduend~default-2-108079478.first_rank_v2_pc_rank_v29&utm_term=bamCoverage

perl mer *count > all.count.txt
grep -v '^__' all.count.txt > all.count

#!/usr/bin/perl -w
# Merge per-sample count files (gene ID, count) into one tab-separated table.

$header = "Gene";
foreach $i (0..@ARGV-1){
    # Fallback pattern for names like <a>_<b>.txt; its capture is
    # overridden below whenever the file name contains ".count".
    $ARGV[$i]=~/(.*?)_(.*?).txt/;

    # Use the file name up to ".count" as the sample column name.
    $ARGV[$i]=~/(.*?)\.count/;
    $header.="\t$1";

    open IN,"$ARGV[$i]" or die "$!";
    while(<IN>){
        chomp;
        @line=split;
        $hash{$line[0]}[$i]=$line[1];
    }
    close IN;
}
print "$header\n";
foreach $gene_id (sort keys %hash){
    # Genes absent from a sample get a count of 0.
    foreach $i (0..@ARGV-1){
        if(!$hash{$gene_id}[$i]){
            $hash{$gene_id}[$i]=0;
        }
    }
    print $gene_id,"\t",join("\t",@{$hash{$gene_id}}),"\n";
}
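For readers who prefer Python, the merge-and-filter step above can be sketched roughly as follows. This is not the author's script: the function name `merge_counts` is hypothetical, the sample name is taken from the file's base name up to `.count` (slightly different from the Perl regex, which matches against the whole path), and rows whose ID starts with `__` are skipped during the merge instead of being removed afterwards with `grep`.

```python
import os
import tempfile


def merge_counts(paths):
    """Merge two-column count files into one table keyed by gene ID.

    Returns a list of tab-separated lines, header first. Genes missing
    from a sample are filled with "0".
    """
    samples = [os.path.basename(p).split(".count")[0] for p in paths]
    table = {}
    for i, path in enumerate(paths):
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                # Skip blank lines and summary rows such as "__no_feature".
                if len(fields) < 2 or fields[0].startswith("__"):
                    continue
                table.setdefault(fields[0], ["0"] * len(paths))[i] = fields[1]
    lines = ["\t".join(["Gene"] + samples)]
    for gene in sorted(table):
        lines.append("\t".join([gene] + table[gene]))
    return lines


# Toy input files written to a temporary directory, not real data.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "s1.count")
    second = os.path.join(tmp, "s2.count")
    with open(first, "w") as fh:
        fh.write("geneA\t10\ngeneB\t0\n__no_feature\t5\n")
    with open(second, "w") as fh:
        fh.write("geneA\t3\ngeneC\t7\n__no_feature\t2\n")
    merged = merge_counts([first, second])

print("\n".join(merged))
```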

https://www.cnblogs.com/renping/p/9206517.html
39. count_rpkm_fpkm_TPM
References: https://f1000research.com/articles/4-1521/v1

      https://www.biostars.org/p/171766/

      http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

It used to be that when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase of transcript per Million mapped reads) or FPKM (Fragments Per Kilobase of transcript per Million mapped fragments). However, TPM (Transcripts Per Million) is now becoming quite popular.

fpkm========

rate = geneA_count / geneA_length ## geneA_length in kilobases

fpkm = rate / (sum(gene*_count) /10^6)

That is: fpkm = 10^6 * (geneA_count / geneA_length) / sum(gene*_count) ## sum(gene*_count) is the total count over all genes, without any normalization.

TPM========

rate = geneA_count / geneA_length

tpm = rate / (sum(rate) /10^6)

That is: tpm = 10^6 * (geneA_count / geneA_length) / sum(rate) ## sum(rate) is the sum of count / length over all genes.
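Putting the two definitions side by side shows how they relate. The derivation below is a sketch in notation introduced only for this illustration: c_i is the count of gene i, \ell_i its length in kilobases, and r_i the rate from the formulas above.

```latex
\begin{aligned}
r_i &= \frac{c_i}{\ell_i}, \\
\mathrm{FPKM}_i &= \frac{10^{6}\, r_i}{\sum_j c_j}, \qquad
\mathrm{TPM}_i = \frac{10^{6}\, r_i}{\sum_j r_j}, \\
\sum_j \mathrm{FPKM}_j &= \frac{10^{6} \sum_j r_j}{\sum_j c_j}
\;\Longrightarrow\;
\mathrm{TPM}_i = 10^{6}\,\frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j}.
\end{aligned}
```

In other words, TPM is FPKM rescaled so that the values within a sample sum to 10^6.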

====================================================================

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:

1. Count up the total reads in a sample and divide that number by 1,000,000. This is our “per million” scaling factor.
2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).
3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
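The three RPKM steps above can be sketched in Python as follows. The function name, the dictionary layout, and the toy numbers are assumptions for illustration; gene lengths are taken in base pairs and converted to kilobases inside the function, and a sample with zero total reads is not handled.

```python
def rpkm(counts, lengths_bp):
    """Compute RPKM per gene from raw counts and gene lengths in bp."""
    # Step 1: the "per million" scaling factor for this sample.
    per_million = sum(counts.values()) / 1_000_000
    result = {}
    for gene, count in counts.items():
        # Step 2: normalize for sequencing depth (reads per million).
        rpm = count / per_million
        # Step 3: normalize for gene length in kilobases.
        result[gene] = rpm / (lengths_bp[gene] / 1_000)
    return result


# Toy values, not real data.
counts = {"geneA": 10, "geneB": 20, "geneC": 70}
lengths_bp = {"geneA": 2_000, "geneB": 4_000, "geneC": 1_000}

rpkm_values = rpkm(counts, lengths_bp)
for gene, value in rpkm_values.items():
    print(f"{gene}\t{value:.1f}")
```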

FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).

TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:

1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.
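A matching sketch of the three TPM steps, under the same assumptions as before (hypothetical function name, gene lengths in base pairs, toy numbers, no handling of an empty sample):

```python
def tpm(counts, lengths_bp):
    """Compute TPM per gene from raw counts and gene lengths in bp."""
    # Step 1: normalize for gene length first (reads per kilobase).
    rpk = {gene: count / (lengths_bp[gene] / 1_000)
           for gene, count in counts.items()}
    # Step 2: the "per million" scaling factor, built from the RPK values.
    per_million = sum(rpk.values()) / 1_000_000
    # Step 3: normalize for sequencing depth.
    return {gene: value / per_million for gene, value in rpk.items()}


# Toy values, not real data.
counts = {"geneA": 10, "geneB": 20, "geneC": 70}
lengths_bp = {"geneA": 2_000, "geneB": 4_000, "geneC": 1_000}

tpm_values = tpm(counts, lengths_bp)
for gene, value in tpm_values.items():
    print(f"{gene}\t{value:.1f}")
print("sum\t%.1f" % sum(tpm_values.values()))
```

Because the scaling factor is computed from the length-normalized values, the results add up to one million (up to floating-point rounding) whatever the input counts are.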

So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.

When you use TPM, the sum of all TPMs in each sample is the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.

Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in Sample 2 is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the TPMs in both samples always add up to the same number (so the denominator required to calculate the proportions is the same, regardless of what sample you are looking at).

With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.
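The contrast can be checked on a small made-up example. In the sketch below, two hypothetical samples have the same total read count and the same gene lengths, and gene A gets the same RPKM in both. The per-sample RPKM sums differ, so the equal RPKM values hide different proportions, while the TPM values (which always sum to one million) expose the difference. All names and numbers are illustrative assumptions.

```python
def rpkm(counts, lengths_bp):
    """RPKM = count * 10^9 / (total reads * gene length in bp)."""
    total = sum(counts.values())
    return {g: c * 1e9 / (total * lengths_bp[g]) for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """TPM = RPK * 10^6 / (sum of RPK over all genes)."""
    rpk = {g: c * 1_000 / lengths_bp[g] for g, c in counts.items()}
    scale = sum(rpk.values())
    return {g: v * 1e6 / scale for g, v in rpk.items()}


# Toy values, not real data.
lengths_bp = {"geneA": 2_000, "geneB": 4_000, "geneC": 1_000}
sample1 = {"geneA": 10, "geneB": 20, "geneC": 70}
sample2 = {"geneA": 10, "geneB": 70, "geneC": 20}

rpkm1, rpkm2 = rpkm(sample1, lengths_bp), rpkm(sample2, lengths_bp)
tpm1, tpm2 = tpm(sample1, lengths_bp), tpm(sample2, lengths_bp)

print("geneA RPKM:", rpkm1["geneA"], rpkm2["geneA"])
print("RPKM sums: ", sum(rpkm1.values()), sum(rpkm2.values()))
print("geneA TPM: ", round(tpm1["geneA"]), round(tpm2["geneA"]))
print("TPM sums:  ", round(sum(tpm1.values())), round(sum(tpm2.values())))
```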