过程处理
- 软件逻辑中唯一出现 “relative_abundance” 的位置位于 generate_from_clap
- 其位于一个 match 方法中:
&"relative_abundance" => { columns_to_normalise.push(i); estimators.push(CoverageEstimator::new_estimator_mean( min_fraction_covered, contig_end_exclusion, false, )); // TODO: Parameterise exclude_mismatches }
- 其中 “i” 为 “methods” 中的值的索引, 当前情况下为 “0”
- 与之对比, “mean” 方法的逻辑为:
&"mean" => { estimators.push(CoverageEstimator::new_estimator_mean( min_fraction_covered, contig_end_exclusion, false, )); // TODO: Parameterise exclude_mismatches }
- 可见, 唯一的差异在 “columns_to_normalise” 中, 该非空列表激活如如下操作:
} else { debug!( "Cached regular coverage taker with columns to normlise: {:?} and rpkm_column: {:?} and tpm_column: {:?}", columns_to_normalise, rpkm_column, tpm_column ); taker = CoverageTakerType::new_cached_single_float_coverage_taker(estimators.len()); printer = match output_format { "sparse" => CoveragePrinter::SparseCachedCoveragePrinter, "dense" => CoveragePrinter::DenseCachedCoveragePrinter { entry_type: None, estimator_headers: None, }, _ => unreachable!(), } }
- 其中 “output_format” 值为 “dense”
- 其中 taker 对象为:
CoverageTakerType::CachedSingleFloatCoverageTaker { stoit_names: Vec<String>, entry_names: Vec<Option<String>>, // list of per-stoit lists of recorded coverages coverages: Vec<Vec<CoverageEntry>>, current_stoit_index: Option<usize>, // None current_entry_index: Option<usize>, // None // number of different coverage calculations num_coverages: usize, // estimators.len() },
- 在这里, “stoic” 指当前使用的丰度方法, 而 “entry” 指样本名, 对应 bam 文件
- “columns_to_normalise” 保存在 “EstimatorsAndTaker.columns_to_normalise” 中
- 其位于一个 match 方法中:
- 命令 “EstimatorsAndTaker::generate_from_clap” 在 main 函数的 “genome” 分支中.
- 随后结构最简单的分支为 “bam-files”
- 其首先设置 bam 文件和 genome 定义
let bam_files: Vec<&str> = m.values_of("bam-files").unwrap().collect(); // Associate genomes and contig names, if required let genomes_and_contigs_option = parse_all_genome_definitions(&m);
- “genomes_and_contigs_option” 的内容为:
pub struct GenomesAndContigs { pub genomes: Vec<String>, pub contig_to_genome: HashMap<String, usize>, }
- “genomes_and_contigs_option” 的内容为:
- 随后进入 “run_genome” 进行计算:
run_genome( coverm::bam_generator::generate_named_bam_readers_from_bam_files(bam_files), m, &mut estimators_and_taker, separator, &genomes_and_contigs_option, &mut print_stream, );
- 在 “run_genome” 函数中,
- 对于多个基因组, 使用 “mosdepth_genome_coverage” 计算
true => coverm::genome::mosdepth_genome_coverage( bam_generators, separator.unwrap(), &mut estimators_and_taker.taker, print_zeros, &mut estimators_and_taker.estimators, &flag_filter, single_genome, threads, ),
- 随后输出时再次在 “finalise_printing” 中使用了 “columns_to_normalise”.
- 对于多个基因组, 使用 “mosdepth_genome_coverage” 计算
- 在 “print_dense_cached_coverage_taker” 函数中:
- 跳过输出标题行
- “estimator_headers” 本身已设置为 “None”
- 该内容已经在 “EstimatorsAndTaker::print_headers” 中输出.
- 计算 “coverage_multipliers”:
let coverage_multipliers: Vec<f32> = match reads_mapped_per_sample { Some(rm) => rm .iter() .map(|r| r.num_mapped_reads as f32 / r.num_reads as f32) .collect(), None => vec![], };
- “columns_to_normalise” 非空 (“relative_abundance” 方法下) 时输出未比对序列的比例:
100.0 * (1.0 - (coverage_multipliers[stoit_i]))
- 可见, 未比对序列的比例等于未比对到 genome 上的 reads 的比例
- 随后在 “columns_to_normalise” 非空时计算 “coverage_totals”:
for i in columns_to_normalise { coverage_totals[ecs.stoit_index as usize][*i] = match coverage_totals[ecs.stoit_index as usize][*i] { Some(total) => Some(total + ecs.coverages[*i]), None => Some(ecs.coverages[*i]), } }
- 可以看到, coverage_totals 按样本存储丰度之和, 其丰度是各基因组相对丰度值的简单相加, 不考虑基因组大小
- 这里的每个 “ecs” 按 “每个样本每个基因组” 的方式迭代输出其全部相对丰度值
- 同时建立 “stoit_by_entry_by_coverage”
stoit_by_entry_by_coverage[ecs.stoit_index].push(ecs);
- 这是一个二维表, 作为输出的重排, 以样本为单位存储每个基因组的序列的丰度 (列表)
- 跳过输出标题行
- 最终按 “my_entry_i” (基因组) 对每行进行输出, 最终输出为
coverages[i] // Divide first because then there is less // rounding errors, particularly when // coverage == coverage_total /coverage_totals[ecs.stoit_index as usize][i].unwrap() * 100.0 * coverage_multipliers[stoit_i]
- 这里的 “coverages[i]” 即为 L534 的 “cov”
- 对比的描述为
0.02235294/0.02235294*(2/6)
“If the contig is considered a genome, then its mean coverage is 0.02235294. There is a total of 0.02235294 mean coverage across all genomes, and 2 out of 6 reads (1 out of 3 pairs) map. This coverage calculation is only available in ‘genome’ mode.”
- 可见:
var name mean coverage mean coverage across all genomes map in code coverages[i] coverage_totals[ecs.stoit_index as usize][i] coverage_multipliers[stoit_i]
结论
- relative_abundance 计算方法:
- 计算每个基因组的测序深度
- 以全部基因组测序深度之和作为总丰度, 对每个基因组 (细胞) 计算比例
- 不考虑基因组大小: 假设每个基因组的覆盖度是均匀的, 且序列无误
- 按比对率校正群落已知类群的丰度.