The draft genome of MD-2 pineapple using hybrid error correction of long reads

The draft genome of MD-2 pineapple using hybrid error correction of long reads 

 

2.4. Error correction of long PacBio reads

Advancement in long-read sequencing technology such as from PacBio has proven to alleviate many of the difficulties in assembly plant genomes. However, direct use of PacBio in genome assembly is not possible at low to medium sequencing coverage, due to the innate high error rate of single-pass sequence reads. Thus, in order to improve the accuracy of the PacBio, high accuracy short reads library from 350 to 750 bp average insert size were used by using novoLR™ package13 developed by Novocraft. The programme is divided into two parts; pre-processing of the PacBio reads by novoLRcleaver™ and aligning and variant calling by novoAlign™ and novoLRcorrector™, respectively. Error correction began by mapping the short reads onto the long reads using the novoAlign™ programme and variant calling was performed using novoLRcorrector™ to produce error corrected PacBio reads. After error correction, the number of error-corrected read base was further reduced to 56% (8.34 Gb) of the initial total subreads produced (Supplementary Table 1b). Much of the data was lost because any subreads that were a replicate of each other (i.e. derived from the same DNA template) were removed by novoLRcleaver™ and only the longest replicates were chosen to represent the template. This is because reads at the same start and end will not help in improving the contiguity of the genome assembly; rather it would further complicate assembly process with the minor variants it may carry due to random error rate innate to PacBio sequencing profile. As the sequencing library preparation improved in terms of the template DNA fragment size (by increasing the BP start to 9,000 bp), more read base survived the novoLRcleaver™ process, as there were less shorter templates (Supplementary Table 1b).

2.5. Genome assembly

The genome of MD-2 pineapple was assembled using only the ×15.9 (8.34 Gb) of error corrected PacBio sequence reads, with highest read length of 27,913 bp and average read length of 4,684 bp. Assembly was performed using Celera Assembler software version 8.3rc1 (http://wgs-assembler.sourceforge.net/) (9 June 2016, date last accessed) with parameters as summarized in Supplementary Table 2. The first assembly produced a draft with a total size that was 48% larger than the haploid genome size of pineapple and N50 of 25,277 bp. The expanded genome size from the assembly was due to the failure of the software to resolve the double haplotype that existed in the genome due to its high heterozygosity rate. An overlap-based assembly as adopted in Celera Assembler should be able to resolve low heterozygosity rate as it allows tolerant mismatch in the discovery of overlaps among the sequence reads. However, with higher allele differences between the haplotype, assembler may produce different composites of the polymorphic paths into the assembly,18 leading to the construction of different unitigs representing the variants observed and thus causing an inflated assembly.

In order to reduce the redundancy that may be present in the assembly, contigs were binned into two based on length cut-off of 25,000 bp. The bin with contigs smaller than 25,000 bp were then mapped to another bin containing contigs larger than 25,000 bp using GMAP19 with default parameter and any shorter contigs with hits and target coverage of more than 80% were removed from the assembly. In addition, the error corrected PacBio reads were also mapped to the assembly to remove contigs with an average coverage of less than one, as it may represent a spurious combination of the polymorphic block not present in the genome.18 Mapping with the error-corrected PacBio reads onto the genome was performed using Blasr20 with parameter ‘-bestn 5 –minPctIdentity 90 –placeRepeatsRandomly’. At this point, the assembly did not improve much, with less than 16% reduction of the total size assembly compared with the original assembly. Furthermore, all of the Illumina short reads from the three libraries were mapped to the genome using novoAlign™ with parameter ‘-t 20,3 –hlimit 6 -H 20 -p 5,20 -r All 50’ producing a BAM alignment file. The mapping quality of the short reads was then accessed from the alignment file and the assembly was then disjointed at low-quality mapping site (Q-score of less than 10). Subsequently, the same method to reduce redundancy by a similar search as above was carried out again for the second time to the fragmented contigs but with length cut-off of 1,000 bp. Once again, the short reads were mapped to the draft using the same parameter for scaffolding purpose via BESST software21 after the previous fragmentation. Thereupon, the draft assembly achieved the total size of 508 Mb, which was 96.6% of the estimated genome size of a pineapple and improved N50 of 34,762 bp.

The contiguity of the draft was further improved using multiple scaffold software for different sequence data. First, SSPACE-LongRead22 software was used to scaffold the draft using error corrected PacBio reads, then Quiver tool (default parameter) via SMRT analysis was used to perform consensus calling using uncorrected PacBio reads, followed by another round of scaffolding using in-house PacBio long transcripts sequence data of pineapple using GMAP as aligner and L_RNA_Scaffolder23 for scaffolding. The draft was then further improved using SSPACE-LongRead software but in this round, all filtered subreads uncorrected PacBio reads were used. Finally, consensus calling was performed on the final draft using novoLRpolish™, utilizing all the trimmed and filtered Illumina short reads, and uncorrected novoLRcleaver™ processed long PacBio reads. After multiple rounds of scaffolding, the draft assembly improved significantly with N50 of 153,084 bp, maximum scaffold length of 1,287,057 bp and a total assembly size of 524.070 Mb.

Genome quality assessment was evaluated by remapping short reads, and the long reads error corrected PacBio to the final draft using novoAlign™ and Blasr, respectively. More than 91% of the short reads mapped to the genome in proper pairs and more than 96% of long PacBio reads mapped to the genome. The high percentage of reads mapped back to the draft suggests that most of the reads were incorporated into the genome and thus most of the genome were assembled. In addition, the genome was also evaluated using Core Eukaryotic Genes Mapping Approach (CEGMA)24 in order to identify the correct exon–intron structure of 248 Core Eukaryotic Genes (CEGs) in the assembly. The analysis found 99% of the CEGs and 90% of the match was complete. This number of CEG retrieved is higher than other genomes assembled using next generation sequencing technology; pear25 (98.4%), adzuki bean26 (86%) and date palm27 (94%). Mapping of MD-2 reference genome to the F153 pineapple assembly was performed using lastal alignment tool28 and the comparison metrics was produced using the COMPASS tool.29 Variant analysis was performed as in Wit et al.30 using minimum mapping quality of Q30.

2.6. Repeat and gene annotation

Repeat annotation was performed based on the advanced tutorial to construct repeat library from MAKER31 software. De novo identification of miniature inverted-repeat transposable elements (MITE) and Long Terminal Repeat (LTR) retrotransposons were performed using MITE-Hunter32 and LTRharvest33 software, respectively. Consensus sequences from de novo repeat library were then combined with LTR identified using RepeatModeler, to constitute the final repeat library. Gene prediction and annotation were conducted using MAKER pipeline31 with gene predictor tools including SNAP,34 AUGUSTUS35 and GENEMARK-ES.36 mRNA sequencing of a tissue sample from the mature yellow fruit of local pineapple variety from Babagon, Sabah were sequenced using PacBio. In addition to our previous RNA-Seq assembly data10 on the mature yellow pineapple fruit, another RNA-Seq assembly was performed on the mature green pineapple fruit following the same method. However, de novo assembly of the mature green RNA-Seq sequences was performed using Oases-MK (http://www.ebi.ac.uk/~zerbino/oases/ 9 June 2016, date last accessed), using the combination of kmer range from 23 to 65 in a step of two. The assembled transcripts were then clustered using TGI Clustering Tool37 software by default. All of these transcriptomic sequencing data in addition to the available pineapple EST data downloaded from Genbank NCBI were used as EST evidence to the predicted gene in the MAKER pipeline. Due to the limited transcriptome data to represent other tissues of pineapple, RefSeq protein sequences from Poales order were downloaded from NCBI to be recruited as the protein homology evidence. The final gene set contained 27,087 genes. In addition, tRNAscan-SE38 in MAKER pipeline was also enabled for identification of tRNA in the genome.

Gene function was assigned according to best hit in a similar search against SwissProt and TrEMBL database39 using BLASTP at E-value cut-off of 1e-5. The gene ids were then modified to add the gene function according to the best hit search and genes with no identity were indicated as ‘Protein of unknown function’. Motifs and domain of the genes were determined using InterProScan40 v5.15-54.0 against multiple databases including Pfam, PROSITE, PRINTS, ProDom, SMART, Panther, TMHMM and SignalP_EUK with pathway and GO lookup. Gene features comparison across other sequenced genomes in subclass Commelinidae and Arabidopsis thaliana were performed using GenomeTool41 and protein homology search were carried out using OrthoMCL.42

Non-coding RNA (ncRNA) including microRNA (miRNA), small nuclear RNAs (snRNA), small nucleolar RNA (snoRNAs) and other ncRNAs were identified using INFERNAL-v1.1 software43 using RFAM covariance database.44 In the analysis, for the case of overlap prediction, hit with higher E-value was selected.

2.7. Phylogeny construction

Single-copy gene among the nine taxa were obtained via orthology analysis using OrthoMCL.42,Oryza sativaSorghum bicolorBrachypodium distachyon and Aegilops tauschii were selected to represent Poacea family, and Musa acuminata and Elaeis guineensis to represent the non-Polaes order in subclass commelinid. A. thaliana was chosen to represent the dicotyledonous group for comparison and Amborella trichopoda were included as the most recent common ancestor among the angiosperm. Four hundred and nine single-copy genes identified by OrthoMCL were concatenated into a single super long sequence for each taxon. The sequences were then aligned by using the amino acid sequence as the guide via MAFFT.45 The aligned amino acids were then back translated using EMBOSS Backtranseq tool prior to subsequent phylogenetic analysis. The phylogenetic tree was constructed using the same matrix via GTR + GAMMA model implemented in RAxML.46 Gene family expansion and contraction was analysed using CAFÉ47 on the 1,000 largest core gene family shared across all taxa in the tree. Divergence times in the phylogenetic tree were estimated using RelTime method in MEGA6 calibrated using divergence time between Brachypodium and Oryza (40–45 million years ago)48 and Arabidopsis and Oryza (130–200 million years ago).49

2.8. Transcriptome analysis

Differential expression analysis in this study is part of an extension to the previously published de novo transcriptome of mature yellow pineapple fruit.10 All of the tissue samples and RNA extraction were performed concurrently as in Ong et al.10 The fruits were collected from a commercial pineapple field located at Babagon, Sabah. The fruit was a local variety but a variant of the Smooth Cayenne. Nevertheless, its average fruit size is smaller than Smooth Cayenne but with higher Brix value (ranges from 8 to 20°) and the pH value of the fruit ranges from 3.7 to 3.2. The green mature fruit and yellow mature fruit were harvested randomly at 12 and 16 weeks after flowering, respectively. The green mature fruit was fruit that had all of its eyes fully expanded but the skin was still green and the yellow mature fruit was a fruit harvested during the time when more than 80% of its eyes turned yellow. Total RNA was extracted by using the modified method of Li et al.50 Total RNA was sequenced using Genome Analyzer IIx and was sequenced in 75 bp paired-end format with average insert size of 200 bp. All sequence reads were filtered and trimmed using Perl script Condetri51 with a minimal length of 50 bp and average Q-score of 25. RNASeq reads were then mapped onto the draft genome and analysed for differential expression by using the Tuxedo suite pipeline.52

3. Results and discussion

3.1. Sequencing and assembly

Sequencing of the MD-2 pineapple was carried out using the two forefront sequencing platforms, Illumina and PacBio, to produce short and long sequencing reads, respectively. Genomic DNA was obtained from leaf tissues and was sequenced using the Illumina platform in three libraries, each with different average insert sizes; 350, 550 and 750 bp. The first two libraries each produced 42 Gbp and 46 Gbp of 100 paired-end reads, respectively, and the latter produced 9 Gbp of 300 bp paired-end reads. All of the reads were trimmed to yield a final coverage of ×154. Sequencing of 20 kb template size library on the PacBio platform produced 14.85 Gb sequence data. After error correction, the data was reduced to 8.34 Gb, which translated to a ×15.9 of high accuracy long sequencing reads of the pineapple genome. Maximum read length before the assembly was 37,591 bp which was reduced to 27,913 bp after error correction. Two approaches were taken to find the most optimum assembly: de novo assembly of only error corrected long PacBio reads using the well-known, Celera Assembler53 and the recent strategy of long reads assembly-based mapping on the de-bruijn assembled short reads contigs, implemented in DBG2OLC.54 After comprehensive comparison between the two assemblies, the Celera assembly was selected as it contained more CEGs as assessed with CEGMA and better mapability of the pineapple transcripts obtained from the NCBI database and in-house database (Table 1). Although DBG2OLC strategy produced larger N50, Celera assembly is superior in terms of its accuracy which is what we considered to be more important. After contiguity improvement using multiple software for scaffolding, the final Celera assembly contained 8,448 scaffolds covering 99.6% of the genome with 901 scaffolds (i.e. 50% of the assembly) at a length of at least 153,084 bp (i.e. N50). The CEGMA24 assessment found 245 out of 248 (98.8%) CEGs with 93.2% of the matches were with alignment spanning more than 70% (i.e. complete). In addition, more than 95% of the short reads mapped to the genome in a proper pair and 96% of the long PacBio reads mapped uniquely to the genome. Earlier before the advancement of next-generation sequencing, sequencing a heterozygous sample was thought to be impossible. Nowadays, with massive parallel sequencing technology of the second-generation sequencing such as Illumina and the long sequencing technology such as the PacBio platform, de novo assembly of the heterozygous diploid sample is feasible. Hybrid assembly of the heterozygous diploid sample of MD-2 pineapple is yet another evidence of its practicability. To our knowledge, this is the second genome published utilizing the hybrid sequence technology in plant genome assembly following the Chinese orchid herb55 but the first in showcasing its feasibility with diploid heterozygous plant sample. The N50 of the MD-2 pineapple draft assembly is lower than many of the inbred fruit tree genome assembly utilizing NGS but the total genome coverage in relative to their estimated genome size exceeds many other plant genomes. It is important to note that the N50 is higher in comparison to other draft assembly of complex samples such as the 20 Gbp white spruce,56 the 2.57 Gb hop57 and the weed horseweed.58 The integrity of the scaffold is good enough for gene annotation as shown by the number of CEGs found which was comparable to the Setaria italica draft genome59 but better than the draft genome of pear25 and the Chinese orchid.55 The number of the CEG found in full length and partial were also exceeded the recently published pineapple genome.11 Furthermore, more than 99% of 114,077 complementary DNAs (cDNAs) of pineapple could be aligned to the genome with 87% of the matches had over 90% coverage and identity (Table 1). The cDNAs used for validation were derived from the previous fruit transcriptomic studies,10 pineapple EST sequences from Genbank and the new long RNA sequences derived from Iso-Seq sequencing using PacBio RSII. Transcripts mapped partially onto the genome could be used to further improve the gene annotation and assembly accuracy. Variant analysis of the MD-2 pineapple draft revealed one heterozygosity per 448 bp, which included 1,009,925 of SNPs and 183,133 of indels.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

wangchuang2017

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值