Installing and using Canu: error correction and genome assembly for third-generation sequencing reads or long contigs

Latest recommended article published on 2024-08-26 14:27:11

小果运维

Column: 生信分析-bioinfo. Tags: canu, contigs, long-read assembly, error correction, genome

Article link: https://blog.csdn.net/zrc_xiaoguo/article/details/135332993

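Introduction:

Canu is an error-correction and genome assembly tool for long reads. It was designed for the noisy long-read data produced by third-generation platforms such as PacBio, and it also supports Oxford Nanopore reads and, through HiCanu, PacBio HiFi reads.

Canu aims to produce high-quality genome assemblies from long reads, and its design is built around self-correction. A run has three stages, each of which can also be run on its own: correction improves the base accuracy of the reads using overlaps among the reads themselves, trimming cuts each read down to the portion that is supported by other reads, and assembly orders the trimmed reads into contigs and generates their consensus sequences.

Whether Canu fits depends on the problem. It is a good choice when you need a high-quality assembly from long-read data, in fields such as microbiology, botany and zoology. It also handles large genomes, especially those that cannot be assembled accurately from short-read data. The longer contigs it produces cover more of the genome contiguously, which helps when identifying genes and other genetic elements.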

As usual, start with the papers:

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

De novo assembly of haplotype-resolved genomes with trio binning | Nature Biotechnology

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads | bioRxiv

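Then look at the GitHub repository: marbl/canu: A single molecule sequence assembler for genomes large and small. (github.com)

Error correction and genome assembly are core tasks in genomics, and doing them well lets researchers obtain high-quality genome sequences quickly. The rest of this post covers how to install and use Canu on third-generation sequencing data or long contigs.

Installation

Building from source

Clone the Canu source repository:

Note that the Canu developers advise against downloading the zip archive, so clone the repository directly.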

git clone https://github.com/marbl/canu.git
cd canu/src

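Install the dependencies (if they are not already present)

Canu depends on some third-party software and libraries, such as zlib, bzip2, Perl and a C++ compiler. Make sure they are installed correctly on the system.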
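Before compiling, it can help to confirm that the required command-line tools are on PATH. The following is a minimal sketch and not part of Canu: `missing_tools` is a hypothetical helper, and the tool names passed to it (git, make, g++, perl) are assumptions to adjust for your system. It only checks executables, so libraries such as zlib and bzip2 still have to be verified through your package manager.

```shell
# Hypothetical helper: print every named tool that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed tool list for a source build; prints nothing if all are present.
missing_tools git make g++ perl
```

If the command prints nothing, all listed tools were found; any name it prints needs to be installed first.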

Installing and troubleshooting these dependencies is not covered here.

Compile Canu

make -j 8    # 8 parallel jobs; adjust to your core count (a bare -j sets no limit)

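Set the PATH environment variable. This step is optional; if you skip it, run canu by its absolute path. Recent Canu releases put the compiled binaries in the build/bin directory of the checkout; if your version uses a different location, point PATH at the directory that actually contains the canu executable.

export PATH="/your-path-to-canu/canu/build/bin:$PATH"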
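To confirm which canu the shell will pick up after changing PATH, a small check like the following can be used. `canu_location` is a hypothetical helper name chosen for this sketch, not a Canu command.

```shell
# Hypothetical helper: report where canu resolves from, or that it is not on PATH.
canu_location() {
    if command -v canu >/dev/null 2>&1; then
        command -v canu
    else
        echo "canu not found on PATH"
    fi
}

canu_location
```

Once the path resolves, running canu with the -version option listed in its usage text prints the installed version.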

Installing with a package manager (e.g. conda)

mamba is recommended because it is fast.

mamba create -n canu
mamba activate canu
mamba install -c conda-forge -c bioconda -c defaults canu

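For conda environment setup, see: 轻快小miniconda3在linux下的安装配置-centos9stream-Miniconda3 Linux 64-bit-CSDN博客 (a CSDN blog post on installing and configuring Miniconda3 on Linux)

Assembling with Canu: usage and steps

Suppose you have an Oxford Nanopore raw data file named nanopore_reads.fastq.gz and want to assemble a genome from it. A basic Canu command line looks like this: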

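canu -p project_name \
    -d output_directory \
    genomeSize=genome_size \
    useGrid=false \
    maxMemory=memory_limit_in_gb \
    maxThreads=num_threads \
    -nanopore-raw nanopore_reads.fastq.gz


# Official reference usage:
canu [-haplotype|-correct|-trim] \
   [-s <assembly-specifications-file>] \
   -p <assembly-prefix> \
   -d <assembly-directory> \
   genomeSize=<number>[g|m|k] \
   [other-options] \
   [-trimmed|-untrimmed|-raw|-corrected] \
   [-pacbio|-nanopore|-pacbio-hifi] *fastq

Parameter explanations:

-p project_name: prefix for the output file names.
-d output_directory: directory in which the assembly is computed and written.
genomeSize: estimated size of the target genome, given as a number of bases with an optional g, m or k suffix (for example 4.7m).
useGrid=false: run locally instead of submitting jobs to a grid scheduler.
-nanopore-raw: path to the raw Nanopore read file(s). The help output further down lists the technology flag as -nanopore.
maxMemory: upper limit on the memory Canu may use, in gigabytes. Like genomeSize, it is written as key=value, not as a dash option.
maxThreads: upper limit on the number of threads Canu may use, also written as key=value.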
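As a worked example of filling in the placeholders, the sketch below prints (but does not run) a complete local command using Canu's key=value form for maxMemory and maxThreads. `print_canu_cmd` is a hypothetical helper, and the values passed to it (a 4.7 Mbp genome, 16 GB of memory, 8 threads, the prefix and directory names) are illustrative assumptions.

```shell
# Hypothetical helper: print a local Canu command for raw Nanopore reads.
# Arguments: prefix, output dir, genome size, max memory (GB), max threads, reads file.
print_canu_cmd() {
    printf 'canu -p %s -d %s genomeSize=%s useGrid=false maxMemory=%s maxThreads=%s -nanopore-raw %s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Illustrative values only; replace them with your own.
print_canu_cmd ecoli ecoli_canu 4.7m 16 8 nanopore_reads.fastq.gz
```

Printing the command first makes it easy to review the options before starting a run that may take hours.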

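Note on these parameters: if a grid scheduler such as Slurm is configured on the system, Canu uses it by default. If you do not want to use the cluster, add useGrid=false so that the whole run executes on the local node.

In this example, contigs assembled from second-generation (short-read) sequencing were used directly as the input. Running the job in the background with nohup is recommended.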
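A background run with nohup could look like the sketch below. `run_in_background` is a hypothetical wrapper written for this example; the log file name, prefix, output directory, genome size and input file are placeholders, and the canu call is only attempted if canu and the input file are actually present.

```shell
# Hypothetical wrapper: run a command under nohup, send its output to a log
# file, and print the PID of the background process.
# $1: log file; remaining arguments: the command to run.
run_in_background() {
    local log="$1"
    shift
    nohup "$@" > "$log" 2>&1 &
    echo "$!"
}

# Placeholder values; only runs when canu and the reads file exist.
if command -v canu >/dev/null 2>&1 && [ -f nanopore_reads.fastq.gz ]; then
    run_in_background canu.nohup.log \
        canu -p asm -d canu_out genomeSize=4.7m useGrid=false \
        -nanopore-raw nanopore_reads.fastq.gz
fi
```

The job keeps running after you log out, and its progress can be followed with tail -f on the log file.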

Full help output:

canu --help

usage:   canu [-version] [-citation] \
              [-haplotype | -correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
              [other-options] \
              [-haplotype{NAME} illumina.fastq.gz] \
              [-corrected] \
              [-trimmed] \
              [-pacbio |
               -nanopore |
               -pacbio-hifi] file1 file2 ...

example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz 


  To restrict canu to only a specific stage, use:
    -haplotype     - generate haplotype-specific reads
    -correct       - generate corrected reads
    -trim          - generate trimmed reads
    -assemble      - generate an assembly
    -trim-assemble - generate trimmed reads and then assemble them

  The assembly is computed in the -d <assembly-directory>, with output files named
  using the -p <assembly-prefix>.  This directory is created if needed.  It is not
  possible to run multiple assemblies in the same directory.

  The genome size should be your best guess of the haploid genome size of what is being
  assembled.  It is used primarily to estimate coverage in reads, NOT as the desired
  assembly size.  Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'

  Some common options:
    useGrid=string
      - Run under grid control (true), locally (false), or set up for grid control
        but don't submit any jobs (remote)
    rawErrorRate=fraction-error
      - The allowed difference in an overlap between two raw uncorrected reads.  For lower
        quality reads, use a higher number.  The defaults are 0.300 for PacBio reads and
        0.500 for Nanopore reads.
    correctedErrorRate=fraction-error
      - The allowed difference in an overlap between two corrected reads.  Assemblies of
        low coverage or data with biological differences will benefit from a slight increase
        in this.  Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.
    gridOptions=string
      - Pass string to the command used to submit jobs to the grid.  Can be used to set
        maximum run time limits.  Should NOT be used to set memory limits; Canu will do
        that for you.
    minReadLength=number
      - Ignore reads shorter than 'number' bases long.  Default: 1000.
    minOverlapLength=number
      - Ignore read-to-read overlaps shorter than 'number' bases long.  Default: 500.
  A full list of options can be printed with '-options'.  All options can be supplied in
  an optional sepc file with the -s option.

  For TrioCanu, haplotypes are specified with the -haplotype{NAME} option, with any
  number of haplotype-specific Illumina read files after.  The {NAME} of each haplotype
  is free text (but only letters and numbers, please).  For example:
    -haplotypeNANNY nanny/*gz
    -haplotypeBILLY billy1.fasta.gz billy2.fasta.gz

  Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz.

  Reads are specified by the technology they were generated with, and any processing performed.

  [processing]
    -corrected
    -trimmed

  [technology]
    -pacbio      <files>
    -nanopore    <files>
    -pacbio-hifi <files>
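The genome size rule in the help text ('4.7m' equals '4700k' equals '4700000') can be illustrated with a small conversion. This is only a sketch of the arithmetic, not Canu's own parsing code, and `genome_size_to_bp` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a genomeSize value (4.7m, 4700k, 1g, 4700000)
# into a plain number of bases.
genome_size_to_bp() {
    local value="$1" number suffix factor
    suffix="${value##*[0-9.]}"      # trailing letter after the last digit, if any
    number="${value%"$suffix"}"     # numeric part without the suffix
    case "$suffix" in
        g) factor=1000000000 ;;
        m) factor=1000000 ;;
        k) factor=1000 ;;
        "") factor=1 ;;
        *) echo "unknown suffix: $suffix" >&2; return 1 ;;
    esac
    # %.0f rounds to a whole number, avoiding floating-point truncation.
    awk -v n="$number" -v f="$factor" 'BEGIN { printf "%.0f\n", n * f }'
}

genome_size_to_bp 4.7m   # prints 4700000
```

Remember that the value is an estimate of the haploid genome size used to compute read coverage, so a rough figure is enough.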
