Imitation: Seahorse genome

Cs_mary

Category: BioInfo

Article link: https://blog.csdn.net/cs_mary/article/details/56494381

source

Abstract
Reports the sequencing and de novo assembly of the H. comes genome.

Comparative analysis reveals evolutionary rates
Comparative genomic analysis: identifies higher protein and nucleotide evolutionary rates compared with other teleost fish genomes

Expansion of the astacin metalloprotease gene family
Identified an astacin metalloprotease gene family that has undergone expansion

Gene expansion: The expansion of a gene cluster is the duplication of genes that leads to larger gene families.
A gene family is a set of several similar genes, formed by duplication of a single original gene, and generally with similar biochemical functions. One such family are the genes for human hemoglobin subunits; the ten genes are in two clusters on different chromosomes, called the α-globin and β-globin loci. These two gene clusters are thought to have arisen as a result of a precursor gene being duplicated approximately 500 million years ago.

High expression in the brood pouch
highly expressed in the male brood pouch

Absence of enamel genes
genome lacks enamel matrix protein-coding proline/glutamine-rich secretory calcium-binding phosphoprotein genes, which might have led to the loss of mineralized teeth.

Hindlimb regulator gene: absent
tbx4, a regulator of hindlimb development, is also not found in the H. comes genome. Knockout of tbx4 in zebrafish showed a 'pelvic fin-loss' phenotype.

Getting started
Genome assembly and annotation

The genome was sequenced using the Illumina HiSeq 2000 platform.
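Comparison of the Illumina HiSeq 2500 and HiSeq 2000 sequencers
Instrument upgrade:
The HiSeq 2500 is an upgraded version of the HiSeq 2000.
The main improvement is that the HiSeq 2500 can switch between two modes: rapid and high-output.
High-output mode is the original HiSeq 2000 mode, with 8 lanes per flow cell.
The core change in HiSeq 2500 rapid mode is that it sequences on a 2-lane flow cell. The lanes of this rapid flow cell are shorter than HiSeq 2000 lanes, and the data yield is slightly lower than that of 2 lanes in high-output mode.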

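Faster sequencing:
HiSeq high-output mode, PE100, dual flow cells: sequencing completes in 11 days. Yield is above 270 G of PF data per flow cell.
HiSeq rapid mode, PE100, dual flow cells: sequencing completes in 27 hours. Yield is above 60 G of PF data per flow cell.
Because the yield per flow cell is lower, users need much less time to collect enough samples to fill a flow cell. This indirectly speeds up sequencing: the wait to pool samples may shrink from about a month to a week, or even a few days.

Faster sequencing matters greatly for tests such as clinical diagnosis and disease prevention.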
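
The run times and yields quoted above can be turned into a rough per-day rate. This is a minimal sketch: the figures are the lower bounds given in the text, the unit "G" is kept as written, and the helper name `yield_per_day` is made up for illustration.

```python
# Figures quoted in the text (per flow cell, PF data, PE100).
HIGH_OUTPUT_G = 270.0          # G per flow cell, high-output mode
HIGH_OUTPUT_HOURS = 11 * 24.0  # 11-day run
RAPID_G = 60.0                 # G per flow cell, rapid mode
RAPID_HOURS = 27.0             # 27-hour run


def yield_per_day(gigabases: float, run_hours: float) -> float:
    """Return data yield in G per 24 hours of instrument time."""
    return gigabases / run_hours * 24.0


high_output_rate = yield_per_day(HIGH_OUTPUT_G, HIGH_OUTPUT_HOURS)
rapid_rate = yield_per_day(RAPID_G, RAPID_HOURS)

print(f"high-output: {high_output_rate:.1f} G/day per flow cell")
print(f"rapid:       {rapid_rate:.1f} G/day per flow cell")
print(f"rapid / high-output: {rapid_rate / high_output_rate:.2f}x")
```

On these numbers, rapid mode produces less data per run but more data per day of instrument time, which fits the point that the gain is in turnaround rather than in total yield.
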
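Better data quality:
In rapid mode the HiSeq can image a whole cycle faster, so each cycle takes less time. SR50 can finish within 1 day and PE100 within 27 hours, far faster than the previous 3 days for SR50 and 11 days for PE100.
(100PE = paired-end 100 bp; 50SE = single-end 50 bp. Both are Illumina high-throughput sequencing terms.)
High-throughput sequencing has three read modes (mostly two nowadays; single-end is less common, although 454 and the earlier Illumina GA were single-end): single-end (only one end of a fragment is sequenced), paired-end (both ends of a fragment are sequenced), and mate-pair (the fragment is circularized, the junction is biotin-labelled and enriched, and the sequence at the circularization junction is read). PE30 and PE100 belong to the second mode: however long the fragment is, only 30 bp (2×30 bp) or 100 bp (2×100 bp) is read from each end, so 60 bp or 200 bp is read per fragment.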

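The speed-up also brings a quality improvement. HiSeq sequencing relies on two main substances, the enzyme and the fluorescent dyes, and both are unstable at room temperature. More precisely, once thawed (they are stored frozen) they keep degrading the longer they sit, which is why the HiSeq has a built-in 4 °C mini refrigerator for reagents to slow the degradation.
The original HiSeq 2000 run takes 11 days, whereas rapid mode finishes in 27 hours, which brings a clear improvement in sequencing quality.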

In tests on mammalian genomic DNA libraries, the Q30 fraction can exceed 90%.
Q30: A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing.
Phred quality scores $Q$ are defined as a property which is logarithmically related to the base-calling error probabilities $P$:

$$Q = -10 \log_{10} P$$

$$P = 10^{-Q/10}$$

For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.
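
The relation between Q and P can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and `fraction_at_least_q30` simply counts bases with Q >= 30, which is how the Q30 fraction is read here.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that a base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P); P must lie in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def fraction_at_least_q30(qualities: list[float]) -> float:
    """Share of base calls whose Phred score is 30 or higher."""
    if not qualities:
        raise ValueError("no quality scores given")
    return sum(q >= 30 for q in qualities) / len(qualities)


print(phred_to_error_prob(30))                 # about 1 in 1000
print(error_prob_to_phred(0.001))              # about 30
print(fraction_at_least_q30([30, 35, 20, 40]))  # toy scores, not real data
```
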

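Longer reads:
The improved sequencing quality also allows longer reads. The read length currently supported officially by Illumina on the HiSeq 2500 is PE 2×150.
Note in particular that Illumina does not currently sell a PE150 reagent directly. Customers have to combine 1× PE Cluster kit + 1× PE100 SBS kit + 2× SR50 SBS kit to run PE150.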

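More libraries per lane:
The HiSeq 2500 rapid-mode reagents directly support dual-index sequencing, that is, a run of 216 cycles.
Dual index means that each of the two adapters carries one index. Combining the two index sets allows more libraries to be loaded in one lane. The official Illumina reagents currently support 96 combinations (12 × 8 = 96), which helps a great deal in making full use of the HiSeq platform's very large data yield. The earlier single-index scheme supported 24 indexes on one side.
In HiSeq high-output mode, the standard PE100 reagents can only run single-index, because a dual-index run actually needs 216 cycles, which exceeds the number of cycles that the standard 200-cycle SBS reagents can guarantee (208 cycles).
The HiSeq 2000 can also run dual-index, but four 50-cycle SBS kits (each guaranteed for 58 cycles) have to be combined (58 × 4 = 232 cycles) to ensure enough SBS reagent.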
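
The index and cycle arithmetic above can be sketched as follows. The function names are made up for illustration; the cycle counts (216 required, 208 and 58 guaranteed) are the ones quoted in the text, not values looked up from Illumina documentation.

```python
def index_combinations(n_first_adapter: int, n_second_adapter: int) -> int:
    """Number of distinct index pairs when each adapter carries one index."""
    return n_first_adapter * n_second_adapter


def reagents_cover_run(required_cycles: int, guaranteed_per_kit: int,
                       n_kits: int = 1) -> bool:
    """True if the guaranteed cycles of the combined SBS kits cover the run."""
    return guaranteed_per_kit * n_kits >= required_cycles


DUAL_INDEX_PE100_CYCLES = 216  # cycle count quoted in the text

print(index_combinations(12, 8))                                   # 96
print(reagents_cover_run(DUAL_INDEX_PE100_CYCLES, 208))            # False
print(reagents_cover_run(DUAL_INDEX_PE100_CYCLES, 58, n_kits=4))   # True
```
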

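Easier instrument operation:
In rapid mode the HiSeq 2500 can generate clusters directly on the HiSeq instrument, which is much less work than high-output mode. High-output mode requires generating clusters on the cBot, then moving the flow cell from the cBot to the HiSeq and going through another round of steps there.
Note, however, that if clusters are generated directly on the HiSeq 2500, the two lanes can only carry a single pre-pooled library mix; they are not physically separate the way the 8 lanes are in the original HiSeq high-output mode. In other words, the indexes of all libraries in the pre-pooled mix must be distinguishable.
Rapid mode can still use the cBot to generate clusters, but that requires buying an extra kit, catalog number CT-402-4001 (full name: TruSeq® Rapid Duo cBot™ Sample Loading Kit). The kit costs several hundred US dollars and directly adds to the sequencing cost.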
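Easier reagent handling:
The HiSeq 2500 rapid-mode reagents come as a master mix: the enzyme, buffer, fluorescent dNTPs and so on are premixed in one large tube that is ready to use once thawed, which is very convenient.
High-output reagents supply the enzyme and fluorescent dNTPs in separate tubes, so they have to be mixed by hand before use, which takes a little more labor.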
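The HiSeq 2500 instrument costs more:
According to public information, the HiSeq 2500 is quoted at 50,000 to 80,000 US dollars more than the HiSeq 2000 (with slight differences between countries). The US list price is 740,000 US dollars for the HiSeq 2500 and 690,000 US dollars for the HiSeq 2000. See the reference (link below).
Rapid-mode reagents cost more:
When reagent cost is spread over each G of data produced, rapid mode is roughly 15 to 40% more expensive than high-output mode.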
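The two positions of a HiSeq 2500 can run only one mode at a time:
The two flow cell positions of one HiSeq 2500 must run the same mode at the same time: either both in rapid mode or both in high-output mode. One position cannot run rapid mode while the other runs high-output mode.
In summary
The HiSeq 2500 is an important upgrade of the HiSeq: faster, more accurate, longer reads, more convenient and more efficient (more libraries per lane), and also slightly more expensive.
Original text from: http://blog.sina.com.cn/s/blog_53e7471b0101bqw2.html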

After filtering low-quality and duplicate reads, 132.13 Gb (approximately 190-fold coverage of the estimated 695 Mb genome) of reads from libraries with insert sizes ranging from 170 bp to 20 kb were retained for assembly.
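
The fold coverage quoted above is just retained bases divided by estimated genome size. A minimal sketch, with a made-up helper name:

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: total read bases / genome size (same units)."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


retained_bases = 132.13e9  # 132.13 Gb of filtered reads
genome_size = 695e6        # estimated 695 Mb genome

coverage = fold_coverage(retained_bases, genome_size)
print(f"{coverage:.1f}-fold coverage")
```
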

The filtered reads were assembled using SOAPdenovo (version 2.04) to yield a 501.6 Mb assembly with an N50 contig size and N50 scaffold size of 34.7 kb and 1.8 Mb, respectively.
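
For reference, N50 is the length L such that sequences of length L or longer together make up at least half of the total assembly. The sketch below uses the common definition and toy lengths invented for illustration; it is not the script used in the paper.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # invented example, total 300
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```
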

Total RNA from combined soft tissues was sequenced using RNA-sequencing (RNA-seq) and assembled de novo. The genome assembly is of high quality, as >99% of the de novo assembled transcripts (76,757 out of 77,040) could be mapped to the assembly, and 243 out of 248 core eukaryotic genes mapping approach (CEGMA) genes are complete in the assembly.

We predicted 23,458 genes in the genome. More than 97% of the predicted genes (22,941) either have homologues in public databases (Swiss-Prot, TrEMBL and the Kyoto Encyclopedia of Genes and Genomes (KEGG)) or are supported by assembled RNA-seq transcripts. Analysis of gene family evolution using a maximum likelihood framework identified an expansion of 25 gene families (261 genes; 1.11%) and contraction of 54 families (96 genes; 0.41%) in the H. comes lineage.
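
The percentages in this paragraph can be reproduced from the gene counts. This sketch assumes that every percentage is taken over the 23,458 predicted genes, which matches the quoted values; `percent` is a made-up helper.

```python
def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


PREDICTED_GENES = 23458  # assumed denominator for all three figures

supported = percent(22941, PREDICTED_GENES)   # homologue or RNA-seq support
expanded = percent(261, PREDICTED_GENES)      # genes in expanded families
contracted = percent(96, PREDICTED_GENES)     # genes in contracted families

print(f"supported:  {supported:.2f}%")
print(f"expanded:   {expanded:.2f}%")
print(f"contracted: {contracted:.2f}%")
```
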

Transposable elements comprise around 24.8% (124.5 Mb) with class II DNA