与生科、计生同学联合大创任务一：计算N50 byC 语言，以及JAVA（所查找资料汇总）

最新推荐文章于 2022-08-17 23:27:04 发布

果冻过冬竖琴

最新推荐文章于 2022-08-17 23:27:04 发布

阅读量3k

点赞数

本文链接：https://blog.csdn.net/AMAZEINGHA/article/details/52317587

版权

N50值是评定基因组拼接好坏的一个标准,一般来说是越大越好,说明你得到的contig越长.

perl 就可以。
以计算contig的N50为例：
计算每条contig的长度，然后计算所有contig长度总和，然后将contig按长到短的顺序依次累加，当累加值为contig长度总和的50%的时候，刚加进入contig的长度即为N50值。N80、N90类推.

参考某博主：

R程序：

   
   
    
    N50=function(ctg){ 
   
   
   
   
    
      half_sum=sum(ctg)/2#计算所有contig总的碱基数的一半
   
   
   
   
    
      ctg_ac=sort(ctg)#对contig长度进行升序排列
   
   
   
   
    
      n=1
   
   
   
   
    
      sum_n=ctg_ac[1]
   
   
   
   
    
      #从长度最短的contig开始按次序累加，直到总长度大于总碱基数的一半，此时的n为最后一个累加上的contig的序号
   
   
   
   
    
      while(sum_n<half_sum){
   
   
   
   
    
        n=n+1
   
   
   
   
    
        sum_n=sum_n+ctg_ac[n]
   
   
   
   
    
      }
   
   
   
   
    
      m=length(ctg_ac)
   
   
   
   
    
      sum_m=ctg_ac[m]
   
   
   
   
    
      #从长度最长的contig开始按递减的顺序开始累加，直到总长度大于总碱基数的一半，此时m为最后一个累加上的contig的序号
   
   
   
   
    
      while(sum_m<half_sum){
   
   
   
   
    
        m=m-1
   
   
   
   
    
        sum_m=sum_m+ctg_ac[m]
   
   
   
   
    
      }
   
   
   
   
    
      #如果第m个contig的长度等于第n个contig的长度，则n50就是此长度，否则就是第m个和第n个contig长度和的1/2
   
   
   
   
    
      if(ctg_ac[m]==ctg_ac[n]){
   
   
   
   
    
        n50=ctg_ac[m]
   
   
   
   
    
      }else{
   
   
   
   
    
        n50=(ctg_ac[m]+ctg_ac[n])/2}
   
   
   
   
    
      #输出n50
   
   
   
   
    
      n50
   
   
   
   
    
    }

1. 什么是Reads?

高通量测序平台产生的序列就称为reads。

2. 什么是Contig？

拼接软件基于reads之间的overlap区，拼接获得的序列称为Contig（重叠群）。

3. 什么是Scaffold？

基因组de novo测序，通过reads拼接获得Contigs后，往往还需要构建454 Paired-end库或Illumina Mate-pair库，以获得一定大小片段（如3Kb、6Kb、10Kb、20Kb）两端的序列。基于这些序列，可以确定一些Contig之间的顺序关系，这些先后顺序已知的Contigs组成Scaffold。

R专门针对生信分析的软件包Bioconduct有许多现在的常用的函数供使用，比如这里的N50

引用文章：

26 March 2012

组装方面的老问题了，就是从大到小排序然后累加，第一次超过比例后输出当前值。
关于N50，可以看wiki和文章 (Assembly algorithms for next-generation sequencing data) 。

Assemblies are measured by the size and accuracy of their contigs and scaffolds. Assembly size is usually given by statistics including maximum length, average length, combined total length, and N50. The contig N50 is the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly. The N50 statistics for different assemblies are not comparable unless each is calculated using the same combined length value. Assembly accuracy is difficult to measure. Some inherent measure of accuracy is provided by the degrees of mate-constraint satisfaction and violation [17]. Alignment to reference sequences is useful whenever trusted references exist.

代码可以看SEQanswers论坛的帖子，第二页的附件也是。
关键的就是这堆：

@x=sort{$b<=>$a}@x;
my $count=0;
my ($n50,$n80,$n90);
my $len_mean=$tlen/$n;
foreach my $i(@x)
{
	$count+=$i;
	$n50=$i if (($count>=$tlen*0.5)&&(!defined $n50));
	$n80=$i if (($count>=$tlen*0.8)&&(!defined $n80));
	$n90=$i if (($count>=$tlen*0.9)&&(!defined $n90));
}

#/usr/bin/perl -w
use strict;
my ($len,$total)=(0,0);
my @x;
while(<>){
	if(/^[\>\@]/){
		if($len>0){
			$total+=$len;
			push @x,$len;
		}
		$len=0;
	}
	else{
		s/\s//g;
		$len+=length($_);
	}
}
if ($len>0){
	$total+=$len;
	push @x,$len;
}
@x=sort{$b<=>$a} @x; 
my ($count,$half)=(0,0);
for (my $j=0;$j<@x;$j++){
	$count+=$x[$j];
	if (($count>=$total/2)&&($half==0)){
		print "N50: $x[$j]\n";
		$half=$x[$j]
	}elsif ($count>=$total*0.9){
		print "N90: $x[$j]\n";
		exit;
	}
}

or run this command as before:
Code:

perl -e 'my ($len,$total)=(0,0);my @x;while(<>){if(/^[\>\@]/){if($len>0){$total+=$len;push@x,$len;};$len=0;}else{s/\s//g;$len+=length($_);}}if ($len>0){$total+=$len;push @x,$len;}@x=sort{$b<=>$a}@x; my ($count,$half)=(0,0);for (my $j=0;$j<@x;$j++){$count+=$x[$j];if(($count>=$total/2)&&($half==0)){print "N50: $x[$j]\n";$half=$x[$j]}elsif($count>=$total*0.9){print "N90: $x[$j]\n";exit;}}' contigs.fa

决定采用排一遍，正着算一遍再反着算一遍的方法，比较 m 与 n

Test 文件  是

果冻过冬竖琴

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
与生科、计生同学联合大创任务一：计算N50 byC 语言，以及JAVA（所查找资料汇总）

N50值是评定基因组拼接好坏的一个标准,一般来说是越大越好,说明你得到的contig越长.1. 什么是Reads?高通量测序平台产生的序列就称为reads。2. 什么是Contig？拼接软件基于reads之间的overlap区，拼接获得的序列称为Contig（重叠群）。 3. 什么是Scaffold
复制链接

扫一扫