PBSIM-PacBio-Simulator

This Repository

This repository was created because the Google Code repository provided by the original authors is not being maintained, and Google Code is now defunct.

About PBSIM

PacBio sequencers produce two characteristic types of reads:

Continuous Long Read (CLR): long, with a high error rate.

Circular Consensus Read (CCS): short, with a low error rate.

We have developed a PacBio read simulator, PBSIM, which implements both sampling-based and model-based simulation.

Building PBSIM

To build PBSIM, run:

 autoreconf -i
 ./configure
 make

A new executable, pbsim, will be available in the src/ directory.

Run PBSIM with sample data

To run model-based simulation:

pbsim --data-type CLR --depth 20 --model_qc data/model_qc_clr sample/sample.fasta

In the example above, simulated read sequences are randomly sampled from a reference sequence ("sample/sample.fasta"), and differences (errors) are then introduced into the sampled reads. The data type is CLR and the coverage depth is 20. If the reference is a multi-FASTA file, a simulated dataset is created for each sequence in it. Three output files are created for each sequence. "sd_0001.ref" is a single-FASTA file copied from the reference sequence. "sd_0001.fastq" is the simulated read dataset in FASTQ format. "sd_0001.maf" lists the alignments between the reference sequence and the simulated reads in MAF format. The length and accuracy of the reads are simulated based on our model of PacBio reads.

To run sampling-based simulation:

pbsim --data-type CLR --depth 20 --sample-fastq sample/sample.fastq sample/sample.fasta

In the sampling-based simulation, the length and quality scores of each simulated read are copied from a read taken at random from the sample PacBio dataset ("sample/sample.fastq").

If you want to create several simulated datasets with different coverage depths from the same PacBio sample, use the --sample-profile-id option as shown below. The first run stores a profile of "sample/sample.fastq", and later runs reuse it, which saves the time needed to parse the file again.

(1) Storing the profile

pbsim --data-type CLR \
      --depth 20 \
      --prefix depth20 \
      --sample-fastq sample/sample.fastq \
      --sample-profile-id pf1 \
      sample/sample.fasta

(2) Reusing the profile

pbsim --data-type CLR \
      --depth 30 \
      --prefix depth30 \
      --sample-profile-id pf1 \
      sample/sample.fasta

pbsim --data-type CLR \
      --depth 40 \
      --prefix depth40 \
      --sample-profile-id pf1 \
      sample/sample.fasta
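
The store-then-reuse pattern above can be scripted. The following is a minimal Python sketch that only builds the command lines; it uses only the options shown in this README. The helper name build_pbsim_commands is hypothetical, and the sketch assumes pbsim is on the PATH when the commands are eventually run.

```python
def build_pbsim_commands(depths, profile_id, sample_fastq, reference):
    """Return one argument list per depth; only the first run parses the FASTQ."""
    commands = []
    for i, depth in enumerate(depths):
        argv = ["pbsim", "--data-type", "CLR",
                "--depth", str(depth),
                "--prefix", f"depth{depth}"]
        if i == 0:
            # First run: parse the sample FASTQ so the profile gets stored.
            argv += ["--sample-fastq", sample_fastq]
        # Every run names the profile: the first stores it, the rest reuse it.
        argv += ["--sample-profile-id", profile_id, reference]
        commands.append(argv)
    return commands


commands = build_pbsim_commands([20, 30, 40], "pf1",
                                "sample/sample.fastq", "sample/sample.fasta")
for argv in commands:
    print(" ".join(argv))
```

Each list can be passed to subprocess.run(argv, check=True) once pbsim has been built.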

Model-based simulation

For each read, the length is randomly drawn from a log-normal distribution with a given mean and standard deviation.

The way accuracy is simulated differs between CLR and CCS reads. For CLR reads, the accuracy is randomly drawn from a normal distribution with a given mean and standard deviation. For CCS reads, an exponential function fitted to the real distribution is used, with a fixed mean and standard deviation.
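
As an illustration of drawing lengths from a log-normal distribution, here is a minimal Python sketch. It assumes that the given mean and standard deviation describe the read lengths themselves (not their logarithms), so they are first converted to the parameters of the underlying normal distribution. The function names and the values 3000 and 2300 are illustrative, and this is not PBSIM's own code.

```python
import math
import random


def lognormal_params(mean, sd):
    """Convert the mean/SD of the lengths into mu/sigma of the underlying normal."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def draw_length(rng, mean, sd):
    """Draw one read length (at least 1 base) from the log-normal distribution."""
    mu, sigma = lognormal_params(mean, sd)
    return max(1, round(rng.lognormvariate(mu, sigma)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
lengths = [draw_length(rng, 3000, 2300) for _ in range(50_000)]
sample_mean = sum(lengths) / len(lengths)
print(f"sample mean length: {sample_mean:.0f}")
```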

Errors from single-molecule sequencing, which generates PacBio reads, are considered to be stochastic. Quality scores are therefore randomly chosen from a frequency table of quality scores (called a "quality profile") for each read accuracy. For accuracies of 0-59% and 86-100% for CLR reads, and 0-84% for CCS reads, uniform distributions are used because real PacBio datasets are not sufficiently large. "data/model_qc_clr" is the quality profile for CLR and "data/model_qc_ccs" is the one for CCS.
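
The idea of choosing quality scores from a per-accuracy frequency table can be sketched as follows. The table here is a toy with made-up numbers and only five Phred scores; the real profiles are much wider. The uniform fallback mirrors the behaviour described above for accuracies without enough data. The Phred conversion is the standard relation p = 10^(-Q/10). All names are hypothetical.

```python
import random

# Toy quality profile: accuracy (%) -> relative frequencies of Phred scores 0..4.
# The numbers are invented for illustration only.
NUM_SCORES = 5
TOY_PROFILE = {
    80: [0.05, 0.15, 0.30, 0.30, 0.20],
    85: [0.02, 0.08, 0.20, 0.30, 0.40],
}


def draw_quality_scores(rng, accuracy, length):
    """Draw one quality score per base for a read of the given accuracy."""
    # Fall back to a uniform distribution when the profile has no row.
    weights = TOY_PROFILE.get(accuracy, [1.0] * NUM_SCORES)
    return rng.choices(range(NUM_SCORES), weights=weights, k=length)


def phred_to_error_prob(q):
    """Standard Phred relation between a quality score and an error probability."""
    return 10 ** (-q / 10)


rng = random.Random(0)
quals = draw_quality_scores(rng, 85, 12)
print(quals)
print([round(phred_to_error_prob(q), 3) for q in quals])
```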

Simulated read sequences are randomly sampled from a reference sequence. Reads are sampled in both directions in equal proportion. Differences (errors) are then introduced into the sampled reads as follows. Substitutions and insertions are introduced according to the quality scores. Their probabilities are computed for each position of a simulated read from the error probability of that position (computed from its quality score) and the ratios of differences given by the user. Patterns of substitutions are randomly sampled. We observed that inserted nucleotides are often the same as their following nucleotides. To reflect this bias, half of the inserted nucleotides are chosen to be the same as their following nucleotides, and the other half are chosen randomly. The deletion probability is uniform across all positions of all simulated reads, and is computed from the mean error probability of the read set and the ratios of differences.
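
The error-introduction steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not PBSIM's implementation: the text does not give the exact formulas, so the sketch assumes that the per-position error probability is split between substitutions and insertions in proportion to the user ratios, and that the uniform deletion probability is the mean error probability times the deletion share. The function names and the ratio 10:60:30 are illustrative.

```python
import random

BASES = "ACGT"


def phred_to_error_prob(q):
    return 10 ** (-q / 10)


def introduce_errors(rng, seq, quals, ratios, p_del):
    """ratios = (substitution, insertion, deletion); p_del is uniform per base."""
    r_sub, r_ins, r_del = ratios
    total = r_sub + r_ins + r_del
    out = []
    for i, (base, q) in enumerate(zip(seq, quals)):
        p_err = phred_to_error_prob(q)
        # Assumption: split the position's error probability by the ratios.
        p_sub = p_err * r_sub / total
        p_ins = p_err * r_ins / total
        if rng.random() < p_del:
            continue  # deletion: the reference base is skipped
        if rng.random() < p_sub:
            # substitution: pick one of the three other bases at random
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
        if rng.random() < p_ins:
            following = seq[i + 1] if i + 1 < len(seq) else None
            # Half of the insertions copy the following nucleotide.
            if following is not None and rng.random() < 0.5:
                out.append(following)
            else:
                out.append(rng.choice(BASES))
    return "".join(out)


rng = random.Random(0)
seq = "ACGTACGTACGTACGTACGT"
quals = [10] * len(seq)
ratios = (10, 60, 30)  # illustrative substitution:insertion:deletion ratio
mean_err = sum(phred_to_error_prob(q) for q in quals) / len(quals)
p_del = mean_err * ratios[2] / sum(ratios)  # assumption, see lead-in
read = introduce_errors(rng, seq, quals, ratios, p_del)
print(read)
```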

By setting a minimum and maximum length, the range of lengths drawn from the distribution model can be restricted. Note that the mean and standard deviation of the resulting lengths are affected by this restriction. The accuracy can be restricted in the same way; however, unlike the length restriction, the accuracy restriction is not strict, and it applies only to CLR reads.
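
To see why the restriction shifts the mean, consider the sketch below. It assumes the restriction works by redrawing until a value falls inside the range (rejection sampling); the text does not say how PBSIM enforces the limits, so treat this as one possible mechanism. The parameters mu and sigma and the limits 100 and 5000 are hypothetical.

```python
import math
import random


def draw_restricted_length(rng, mu, sigma, min_len, max_len):
    """Redraw from the log-normal until the length lies in [min_len, max_len]."""
    while True:
        length = rng.lognormvariate(mu, sigma)
        if min_len <= length <= max_len:
            return length


rng = random.Random(1)
mu, sigma = 7.8, 0.7  # hypothetical parameters of the underlying normal
unrestricted_mean = math.exp(mu + sigma ** 2 / 2)  # analytic log-normal mean
restricted = [draw_restricted_length(rng, mu, sigma, 100, 5000)
              for _ in range(20_000)]
restricted_mean = sum(restricted) / len(restricted)
print(f"unrestricted mean: {unrestricted_mean:.0f}")
print(f"restricted mean:   {restricted_mean:.0f}")
```

Cutting off the long upper tail lowers the mean, so the mean requested for the distribution and the mean of the simulated reads no longer agree.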

Sampling-based simulation

The lengths and quality scores of reads are simulated by randomly sampling them from a real library of PacBio reads provided by the user. Their nucleotide sequences are then simulated by the same method as in the model-based simulation. The restrictions on length and accuracy also work the same way as in the model-based simulation.
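
A minimal Python sketch of the sampling step follows. It parses a tiny in-memory FASTQ (Sanger encoding, ASCII offset 33), keeps only each read's length and quality scores, and picks one record at random. The two records and the function names are made up for illustration, and the parser assumes simple four-line records.

```python
import random

# Two made-up reads: 'I' encodes Phred 40 and '5' encodes Phred 20 (offset 33).
SAMPLE_FASTQ = """\
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGT
+
5555
"""


def parse_fastq(text):
    """Return (length, quality scores) per read; assumes four-line records."""
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        seq, qual = lines[i + 1], lines[i + 3]
        records.append((len(seq), [ord(c) - 33 for c in qual]))
    return records


def sample_length_and_quality(rng, records):
    """Pick the length and quality scores of one randomly chosen real read."""
    return rng.choice(records)


records = parse_fastq(SAMPLE_FASTQ)
length, quals = sample_length_and_quality(random.Random(0), records)
print(length, quals)
```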

Input files

PBSIM requires reference sequences in single- or multi-FASTA format.

A real PacBio read dataset is required for sampling-based simulation and is specified with the --sample-fastq option. The FASTQ format must be the Sanger standard (fastq-sanger).

Output files

If the reference sequence file is in multi-FASTA format, a simulated dataset is generated for each reference sequence, numbered sequentially. Three output files are created for each reference sequence.

"sd_0001.ref" is a single-FASTA file copied from the reference sequence. "sd_0001.fastq" is the simulated read dataset in FASTQ format. "sd_0001.maf" lists the alignments between the reference sequence and the simulated reads in MAF format. These names are for the first reference sequence; the number changes for each subsequent sequence.

"sd" is the default prefix, which can be changed with the --prefix option.

Quality profile

Quality profiles are derived from the frequencies of real quality scores for each read accuracy. "data/model_qc_clr" is the quality profile for CLR and "data/model_qc_ccs" is the one for CCS. In "data/model_qc_clr", the 1st column is the read accuracy, and the 2nd-23rd columns are the proportions of Phred quality scores (0-21). In "data/model_qc_ccs", the 1st column is the read accuracy, and the 2nd-95th columns are the proportions of Phred quality scores (0-93).
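
The column layout described above can be read with a few lines of Python. This sketch assumes the columns are separated by whitespace and that the accuracy column holds integer percentages; neither detail is stated here, so check the actual files before relying on it. The function name and the toy line are hypothetical.

```python
def parse_quality_profile(text, num_scores):
    """Map each accuracy to its list of proportions for Phred scores 0..num_scores-1."""
    profile = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != num_scores + 1:
            raise ValueError(f"expected {num_scores + 1} columns, got {len(fields)}")
        accuracy = int(fields[0])  # assumption: integer percentage
        profile[accuracy] = [float(x) for x in fields[1:]]
    return profile


# Toy CLR-style line: accuracy 75, then 22 proportions for Phred scores 0-21.
toy_line = "75 " + " ".join(["0.0"] * 20 + ["0.5", "0.5"])
profile = parse_quality_profile(toy_line, num_scores=22)
print(len(profile[75]), sum(profile[75]))
```

For a CCS profile, the same function would be called with num_scores=94 to cover Phred scores 0-93.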

Runtime and memory

When the coverage depth is 100x and the reference sequence is about 10 Mb long, PBSIM generates a simulated dataset in several minutes. The runtime is roughly proportional to the coverage depth and to the length of the reference sequence.

PBSIM requires memory for the length of the reference sequence plus several megabytes.
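
The amount of data behind these figures is simple arithmetic: coverage depth times reference length gives the total number of simulated bases, and dividing by a mean read length gives an approximate read count. The sketch below uses the 100x and 10 Mb figures from the text; the mean read length of 3000 and the function name are hypothetical.

```python
def estimate_output_size(depth, reference_length, mean_read_length):
    """Return (total simulated bases, approximate number of reads)."""
    total_bases = depth * reference_length
    return total_bases, round(total_bases / mean_read_length)


total_bases, num_reads = estimate_output_size(100, 10_000_000, 3000)
print(total_bases, num_reads)
```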

Contributors

@kiwiroy - autotools and warning corrections

@jumpinsky - fixed memory leaks
