Simulating a Simple Calculator
Read simulators are widely used in the research community to create synthetic and mock datasets for analysis. In this article, I will introduce some recently proposed and commonly used read simulators.
DNA Sequencing and Reads
If you have come across my previous article on DNA Sequence Data Analysis, you may have read about DNA sequencing. Sequencing is the process that determines the precise order of nucleotides in a given DNA molecule, that is, the order of the four bases adenine, guanine, cytosine and thymine in a strand of DNA. DNA sequencing is used to determine the sequence of individual genes, full chromosomes or entire genomes of an organism.
Special machines known as sequencing machines are used to extract short random DNA sequences from the genome we wish to determine (the target genome). Current DNA sequencing technologies cannot read a whole genome at once. Instead, they read small pieces of between 100 and 30,000 bases, depending on the technology used. These short pieces are called reads.
Read Simulators
Sequencing machines may not always be available when we need them, and we may not be able to get hold of real-world samples to sequence. This is where read simulators come in handy for research purposes. Read simulators mimic sequencing machines to produce simulated reads. They come with pre-defined statistical models that mimic the error rates of particular sequencing machines. Furthermore, we can provide our own error models (with different rates of insertions, deletions and substitutions).
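To make the idea concrete, here is a minimal toy sketch of what a read simulator does: it samples fragments from random positions of a genome and applies substitution, insertion and deletion errors at user-supplied rates. This is not how any particular simulator is implemented. The function names (add_errors, simulate_reads), the parameters and the simple independent per-base error model are all assumptions made for illustration; real tools use far richer, technology-specific models.

```python
import random

BASES = "ACGT"


def add_errors(read, sub_rate, ins_rate, del_rate, rng):
    """Apply independent per-base errors to a read (hypothetical toy model)."""
    out = []
    for base in read:
        r = rng.random()
        if r < del_rate:
            # Deletion: drop this base entirely.
            continue
        if r < del_rate + sub_rate:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            # Insertion: add a random base after the current one.
            out.append(rng.choice(BASES))
    return "".join(out)


def simulate_reads(genome, read_length, num_reads,
                   sub_rate=0.0, ins_rate=0.0, del_rate=0.0, seed=0):
    """Sample fixed-length fragments at random positions and add errors."""
    if read_length > len(genome):
        raise ValueError("read_length must not exceed the genome length")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        fragment = genome[start:start + read_length]
        reads.append(add_errors(fragment, sub_rate, ins_rate, del_rate, rng))
    return reads


if __name__ == "__main__":
    # A made-up random "genome", used only to exercise the sketch.
    rng = random.Random(42)
    genome = "".join(rng.choice(BASES) for _ in range(1000))
    for read in simulate_reads(genome, read_length=50, num_reads=3,
                               sub_rate=0.01, ins_rate=0.005, del_rate=0.005):
        print(read)
```

With all error rates set to zero, every simulated read is an exact substring of the genome. Raising the rates gives reads that differ from the genome in the way a noisier sequencing technology would.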
Estimating sequencing coverage
Sequencing coverage is defined as the average number of reads that cover each base of the reference genome. Estimating the sequencing coverage is very important when you are simulating datasets. The coverage equation is defined as follows.
C = LN / G
- C is the sequencing coverage
- G is the length of the genome
- L is the read length
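As a quick worked example, the coverage equation can be evaluated directly, or rearranged to estimate how many reads to simulate for a target coverage. The sketch below takes N to be the number of reads, which is the usual reading of this formula. The function names and the example numbers are made up for illustration.

```python
import math


def coverage(read_length, num_reads, genome_length):
    """C = L * N / G: average number of reads covering each base."""
    return read_length * num_reads / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Rearranged form N = C * G / L, rounded up to a whole number of reads."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    # Hypothetical numbers: a 5 Mbp genome sequenced with 150 bp reads.
    print(coverage(read_length=150, num_reads=1_000_000, genome_length=5_000_000))
    print(reads_needed(target_coverage=30, read_length=150, genome_length=5_000_000))
```

Here one million reads of 150 bases over a genome of 5,000,000 bases give a coverage of 30. Conversely, asking for a coverage of 30 with the same read and genome lengths gives one million reads.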