signature=31b267d5d277423a23aa00b75299bca4,Signatures of positive selection reveal a universal role ...-CSDN博客

Data

Pancan12 somatic mutation data. Filtered MAF files from 12 tumor types were obtained from synapse (syn1729383). Allele counts, ploidy status, and histological purity estimates were merged into a single MAF file containing information for 3,276 samples and 617,354 mutations described elsewhereμ = 15 and scale factor of 2. Individual MAF files for each cancer were produced in order to perform downstream analysis. Expression values for this dataset were also obtained from synapse (syn1729383).

Pancan21 somatic mutation data. The MAF files for 6,485 exome samples from 20 tumor types were downloaded from DCC-Firehose and combined with 385 CLL cases to obtain a large dataset of 21 tumor types (Supplementary Table 4). Allele counts were transformed to VAF and CADD scores were added for each mutation. We removed duplicated samples and updated the gene symbols using the Hugo Gene Symbol database. Colon and rectal tumors were merged into one tumor type giving us a final set of 20 tumor types. All curated MAF files used in this study were uploaded to synapse (syn5593040).

CLL somatic mutation data. 385 CLL tumor-normal pairs sequenced by WES were analyzed using an in-house pipeline. Reads were aligned to hg19 using BWA-memhttps://www.broadinstitute.org/gatk/). Mutecthttps://www.broadinstitute.org/cancer/cga/indelocator), and ClinDel (unpublished) were used to produce a set of somatic SNVs and Indels. 27,625 mutations were annotated using eDiVA (www.ediva.crg.eu) to obtain several features including effect on gene and protein sequence, allele counts, CADD functional impact score, and population allele frequencies. Somatic SNVs that have a high number of occurrences across all paired normal samples, i.e. are likely germline variants (ND occurrence > 10), or have a high rate of exclusion by MuTect across all samples (more than five times excluded) were excluded from the analysis. Indels falling within 30 bp of a repeat masked region were removed. Additionally, to reduce common false positive in detecting indels, we excluded indels that were reported in exons not expressed in B-lymphocytes. We used tophat and cufflinks to calculate FPKMs for 270 CLL RNA-seq samples. We considered exons not to be expressed if they had an average or median fragment per kilobase per million (FPKM) 1%) or with alternative allele fraction (VAF)

ClinDel (part of the shore package) and ClinCNV are available on github:

CCF distribution analysis

We estimated cancer cell fraction of somatic mutations for Pancan12, Pancan20, and CLL-ICGC mutation data using the function CCF of the cDriver package. For each tumor type we obtained the set of significant driver genes from IntoGen (www.intogen.org), having at least one significant prediction in the pooled and the individual cancer analysis. These drivers have been predicted based on recurrence (mutSigCV), functional damage bias (oncodriveFM), or clustering (oncodriveClust), but none of the predictors considered CCF. Somatic mutations were divided into 4 sets, including nonsilent mutations in driver genes, nonsilent mutations in passenger genes, silent mutations in driver genes, and silent mutations in passenger genes. We compared the CCF distribution for each group using Wilcoxon-Mann-Whitney statistical test for each cancer type separately as well as for curated (Pancan12) and non-curated (Pancan16) pan-cancer sets. In this analysis we only included tumor types for which at least 10 significant drivers were found (16 tumor types in total).

cDriver package

We have developed cDriver, an R package to identify mutational driver genes using NGS data from cancer genome studies (Supplementary Fig. 1). cDriver uses a MAF file (v2.4) as input data with additional mandatory columns (i) variant allele frequency (VAF), and optional columns (ii) functional impact score, (iii) ploidy, (iv) histological purity, and (v) cancer cell fraction of the CNV. These measures can be obtained from current cancer genome or exome sequencing studies and public genome an- notation databases. cDriver can use any mutation annotated as indel, missense, nonsense, splice, TSS, nonstop, and silent variant following the MAF column variant type.

One of the conceptual advances of our method is the inclusion of cancer cell fraction (CCF). Although any current method for CCF calculation can be used with cDriver, we provide a simple function to estimate CCF of SNVs and indels based on VAF, ploidy, CCF of overlapping CNVs, and tumor purity (Supplementary note). The CCF estimation by cDriver highly correlates with the cellular prevalence calculated by PyClone

To account for the variability of the background mutation rate (bmr) between genes, cDriver uses silent mutations to locally estimate the expected number of nonsilent mutations. To this end, we applied a classical formula for detecting selection (Ka/Ks or dNdS ratio) to infer the expected number of nonsilent mutations assuming neutral evolution. However, somatic mutations are rare for most cancers and many genes do not harbor silent mutations, restricting the usefulness of Ka/Ks. Therefore, cDriver calculates an average bmr using the CCF-adjusted Ka/Ks formula and a pre-calculated bmr taken from the literature

Next, cDriver calculates posterior probabilities per gene using two Bayesian models, (i) the “cancer hazard model” and (ii) the “driver model”. The first model requires the incidence of the tumor in the population as prior probability, while the second requires a list of known driver genes (e.g. any gold standard used in this study) to estimate prior and likelihood values. The “cancer-hazard model” estimates the posterior probability of developing cancer if the focal gene is mutated, given evidence from the data, i.e. somatic mutations of the gene in a cohort of cancer patients. The “driver inference model” estimates the posterior probability that a gene is a true cancer driver given evidence from the data.

As a final step, cDriver can provide an optimum rank-cut off value by estimating FDR at each rank based on a null model. cDriver default estimates the optimum rank for each Bayesian model and suggests the best rank cut-off as the average value between both ranks.

In summary, cDriver combines recurrence, CCF, and functional impact as a fore- ground measure, and an averaged background mutation probability to calculate posterior probabilities for each gene. In the next sections, each step is described in detail.

Step 1) Estimation of cancer cell fraction per gene

CCF calculation

We developed a function as part of the R package for the estimation of Cancer Cell Fraction (CCF) per gene (for details see Supplementary note), although any CCF calculation method can be specified in cDriver (e.g. PyClone). Intuitively, we assumed that the variant allele should be observed in approximately half of the reads if it is a clonal heterozygous variant in a diploid locus. In this case, CCF is calculated as the variant allele frequency (VAF) multiplied by two and corrected for the purity of the cancer sample. Other scenarios, including non-diploid loci, are described in the supplementary note.

Step 2) Background mutation rate models

CCF-adjusted counts of nonsilent and silent mutations

First, we adjusted the classic formula for detecting selection (Ka/Ks)$\widehat{{n}_{a}}$) assuming neutral evolution (Ka/Ks = 1). This formula is the ratio between the rate of nonsynonymous substitutions (n

a) per nonsynonymous sites (N

a) and the rate of synonymous substitutions (n

s) per synonymous sites (N

s):

$$\frac{{n}_{a}/{N}_{a}}{{n}_{s}/{N}_{s}}=\frac{{K}_{a}}{{K}_{s}}\,\Rightarrow \,\widehat{{n}_{a}}=\frac{{n}_{s}\,\ast \,{N}_{a}}{{N}_{s}}\,$$

(1)

Next, we adapted the formula to take into account cancer cell fraction of mutations by calculating n

s and n

a as the sum of CCF of silent and nonsilent mutations, respectively, resulting in ${n}_{s}^{CCF}$ and ${n}_{a}^{CCF}$. The CCF-adjusted K

a/K

s formula is:

$$\frac{{n}_{a}^{CCF}/{N}_{a}}{{n}_{s}^{CCF}/{N}_{s}}=\frac{{K}_{a}^{CCF}}{{K}_{s}^{CCF}}\,\Rightarrow \,\widehat{{n}_{a}^{CCF}}=\frac{{n}_{s}^{CCF}\ast {N}_{a}}{{N}_{s}}$$

(2)

Zero counts of silent mutations

We then assume that one nonsilent mutation in the cohort is not enough evidence for a gene to be defined as a driver. Nevertheless, in scenarios where there are zero silent mutations (or ${n}_{s}^{CCF}\approx 0$) we would expect zero nonsilent mutations s (or $\widehat{{n}_{a}^{CCF}}\approx 0$); thus, one observed nonsilent mutation would appear as positively selected. To avoid that single non-silent mutations generate a positive selection signal we correct ${n}_{s}^{CCF}$. The corrected ${n}_{s}^{CCF}\text{'}$ is the maximum value between (i) the sum of CCF for observed silent mutations, ${n}_{s}^{CCF}$, and (ii) the value of ${n}_{s}^{CCF}\,$in equation 2 when ${n}_{a}^{CCF}=1$.

Expected number of nonsilent mutations per gene

We estimated the expected $\widehat{{n}_{a}^{CCF}}$ for each gene in a specific cancer cohort (e.g. a WES data from BRCA, CLL, Pancan12, Pancan21) based on ${n}_{s}^{CCF}\text{'}$. The total number of sites (N

a and N

s) was taken from Lawrence et al.2 becomes:

$$\widehat{{n}_{a}^{CCF}}=\frac{{n}_{s}^{CCF}\text{'}\,\ast \,{N}_{a}}{{N}_{s}}$$

(3)

Background mutation probability based on the expected number of nonsilent mutations

After obtaining the expected $\widehat{{n}_{a}^{CCF}}$, we calculated the probability that a patient has a at least one nonsilent mutation in a gene, P(X ≥ 1). To this end, we approximated the average number of somatic nonsilent mutations in a healthy cohort (noted as r) using a cancer cohort. We assume that the majority of clonal mutations were passengers acquired before tumor initiation. Following this assumption, we set the r as the average of patients counts of clonal mutations (CCF

SNV ≥ 0.85). With one nonsilent mutation event the probability of success (P(X = 1)) for a gene is $\widehat{{n}_{a}^{CCF}}$ divided by the sum of $\widehat{{n}_{a}^{CCF}}$ across all genes. Next, we estimated the probability P(X ≥ 1) with r nonsilent mutations events using a binomial distribution:

$${\rm{{\rm P}}}(X\ge 1)=1-{\rm{{\rm P}}}(X=0)=1-{(1-\frac{\widehat{{n}_{a}^{CCF}}}{{\sum }_{i\in genes}{\widehat{{n}_{a}^{CCF}}}_{i}})}^{r}$$

(4)

where P(X = 0) is a probability for a gene to have zero nonsilent mutations, and derived as ${\rm{{\rm P}}}(X=0)=$

$(\begin{array}{c}r\\ 0\end{array}){p}^{0}{(1-p)}^{r-0}={(1-p)}^{r}$ where p is the probability of success and equals to $\widehat{{n}_{a}^{CCF}}/\sum _{i\in genes}\,{\widehat{{n}_{a}^{CCF}}}_{i}$. As above described, r is the number of nonsilent mutations prior to cancer.

Background mutation probability based on other mutation rate estimates

To compensate for the lack of power of the CCF-adjusted K

a/K

s model for some genes, we additionally incorporated the non-coding mutation rate (ncmr) provided by Lawrence et al.ncmr and the gene length were used to calculate the number of total expected mutations (silent and nonsilent) $\widehat{{n}_{t}}$. Under the assumption of neutral evolution (K

a/K

s = 1), we determined the expected number of nonsilent mutations following a rearrangement of the classical formula into:

$$\widehat{{n}_{a}}=\frac{\widehat{{n}_{t}}}{1+{N}_{s}/{N}_{a}}$$

(5)

Similarly to the previous model, we calculated the probability that a gene has at least one nonsilent mutation using the binomial distribution formula, but without CCF:

$${\rm{{\rm P}}}(X\ge 1)=1-{\rm{{\rm P}}}(X=0)=1-{(1-\frac{\widehat{{n}_{a}}}{{\sum }_{i\in genes}{\widehat{{n}_{a}}}_{i}})}^{r}$$

(6)

Given the continuous progress on technologies, it is likely that more specific somatic mutation rates will be available in the future, so in addition cDriver could integrate any measure of background mutation rate. Here, as final background mutation probability (bmp) we took the average value between the two bmps obtained with the methods described above (CCF-adjusted K

a/K

s and ncmr).

Step 3) Bayesian inference models

Cancer-hazard inference model

In the first model, we adapted the Bayes formula to calculate the posterior probability of developing cancer given that a gene is mutated as

$${\rm{{\rm P}}}(cancer|ns\,)=\frac{{\rm{{\rm P}}}(ns|cancer)\,\ast \,{\rm{{\rm P}}}(cancer)}{{\rm{{\rm P}}}(ns|cancer)\,\ast \,{\rm{{\rm P}}}(cancer)+{\rm{{\rm P}}}(ns|\neg cancer)\,\ast \,{\rm{{\rm P}}}(\neg cancer)}$$

(7)

where the event cancer represents that a patient has cancer and the event ns indicates that a patient has at least one nonsilent mutation in the gene of interest. The prior probability for developing cancer, P(cancer), is the incidence of the cancer type in the population. Then the likelihood P(ns|cancer) is calculated using the formula:

$${\rm{{\rm P}}}(ns|cancer)=\frac{{\sum }_{i=1}^{n}CC{F}_{i}\,\ast \,CAD{D}_{i}}{n}$$

(8)

where i is the index of the patient and n is the total size of the cohort. If a patient did not have a nonsilent mutation, then CCF was equal to zero. If two nonsilent mutations were found in the same patient in the same gene, we used the mutation with the highest CCF. The probability that a healthy individual has at least one nonsilent mutation, ${\rm{{\rm P}}}(ns|\neg cancer)$, is approximated using the somatic background mutation probability (bmp).

Driver inference model

In the second model we calculated the posterior probability that a gene is a cancer driver given the mutation data in the studied cohort using the formula:

$${\rm{{\rm P}}}(d|ns)=\frac{{\rm{{\rm P}}}{(ns|d)}^{m}\,\ast \,{\rm{{\rm P}}}{(\neg ns|d)}^{n-m}\,\ast \,{\rm{{\rm P}}}(d)}{{\rm{{\rm P}}}{(ns|d)}^{m}\,\ast \,{\rm{{\rm P}}}{(\neg ns|d)}^{n-m}\,\ast \,{\rm{{\rm P}}}(d)+{\rm{{\rm P}}}{(ns|p)}^{m}\,\ast \,{\rm{{\rm P}}}{(\neg ns|p)}^{n-m}\,\ast \,{\rm{{\rm P}}}(p)}$$

(9)

where the event d indicates that a gene is a driver, p indicates that the gene is a passenger, ns that the gene has at least one nonsilent mutation, n is the size of cohort, and m is calculated across the total size of the cohort as $\sum _{i=1}^{n}CC{F}_{i}\,\ast \,CAD{D}_{i}$.

To estimate the prior probability of a driver gene we must consider that often tumors are caused by mutations in different genes. Depending on which type or group of cancers (‘pan-cancer’) is being analyzed, the number of known driver genes differs and hence the prior probabilities change. We estimated the prior probability that a random gene is a driver as equal to the ratio between the number of known driver genes of the cancer type and the total number of protein coding genes:

$${\rm{p}}(d)=\frac{\#driver\,genes}{\#genes}$$

(10)

The number of driver genes can be approximated as the number of published driver genes for a given tumor type. If the cancer has not been studied yet, or if we deal with pan-cancer sets of multiple tumor types, the prior can be approximated using any gold standard set of cancer driver genes.

Because of inter-tumor heterogeneity genes that are cancer drivers in a given tumor type are not necessarily mutated in all patients. The probability that a gene is mutated given that it is a known driver is estimated as:

$${\rm{{\rm P}}}(ns|d)=\frac{\#mutations\,in\,drivers}{\#patients\,\ast \,\#drivers}$$

(11)

where we assume that all drivers have the same chance to be mutated. As this assumption is weak the cDriver package allows the user to define other estimates for this likelihood.

The probability of having a nonsilent mutation given that the gene is a known passenger (P(ns|p)), we approximate using bmp as described previously. The other terms in the equation 9, ${\rm{{\rm P}}}(\neg ns|d)$ and ${\rm{{\rm P}}}(\neg ns|d)$ were calculated as the complementary events of P(ns|d) and P(ns|p).

Step 4) Optimum rank cut off selection

Significant rank selection based on weighted sampling of a null model

To select an optimum rank cut-off for Bayesian models, we calculated a null model based on random sampling with repetition of the gene labels for each mutation. We sample a new gene name based on the background mutation probability (bmp) vector. Then, we run both previously described models to obtain the posterior probabilities under the null model. We repeated this process 100 times to assess if posterior probabilities are stable between each run. Next, cDriver by default takes the median value of posterior probabilities for each rank. Finally, we calculate the optimal rank assuming a false discovery rate of 10% (see Supplementary Fig. 11).

Running competing methods for cancer driver gene identification

(i) MuSiC on Pancan12 and BRCA: gene coverage files and results from the MuSiC suite for the Pancan12 dataset (including all BRCA cases used here) were obtained from synapse (syn819550, syn1713813, syn1734155). We only considered genes that were less than 0.05 FDR in at least two out of three measures. The rank order was based on the P-values given by the CT test, followed by the LRT test, followed by the number of cases affected. MuSiC on CLL dataset: gene coverage files for 385 samples were generated from tumor-normal BAM files using the function calc-bmr available in the MuSiC suite. The region of interest files (ROI) were downloaded from synapse and merged to avoid duplicates. For comparative analysis we used the sorted list returned by the tool. (ii) MutSigCV on all datasets: we ran MutSigCV using default parameters on all datasets, assuming full coverage and using the example covariate space provided in the source code. (iii) OncodriveClust and (iv) oncodriveFM: The analysis was performed by the group of Nuria Lopez-Bigas. (v) cDriver on all datasets: We ran cDriver using default parameters. Prior values used for the cancer-hazard model are shown in Supplementary Table 1.

Benchmarking

To compare the performance of competing methods we tested on datasets that differed in the number of samples, the number of mutations, the tissue-of-origin, and the purity of the tumor using different gold standard. We downloaded published lists of significantly mutated genes (SMGs) from: (1) Kandoth et al.,et al., 2014et al., 2013et al., 2014et al., 2013

The gold standard genes for breast cancer consisted of the union of dataset (7), and breast cancer genes found in (2) and (4) plus the top 20 genes identified in COSMIC. For CLL we merged dataset (6) and CLL genes found in (2) and (4) plus the top 20 genes identified in COSMIC. Subsequently, we manually curated this list by checking the number of PubMed records found by querying the HUGO gene name and the corresponding MeSH term for breast cancer and CLL. We excluded histone genes that have no relevant publication associating them to recurrent somatic mutations. Results for Pancan12 were benchmarked using datasets 1–5. Single tumor types (i.e. CLL and breast cancer) were benchmarked against the tumor specific gold standards assembled as described above. In addition, we compared the performance of the methods on Pancan12 with and without filtration of non-expressed genes.

Furthermore, we benchmarked our own method cDriver under several scenarios and parameter settings: the F-score curve for (i) cDriver using a simple recurrence model where no CCF or functional impact was used and the background model did not include CCF-adjusted Ka/Ks, (ii) cDriver using only functional impact, (iii) cDriver using CCF adjusted mutation counts and the CCF-adjusted Ka/Ks background model, and (iv) cDriver using all signatures of positive selection. Lastly, we benchmarked an ensemble of complementary methods (MuSic, MutSigCV, OncodriveFM, OncodriveClust) including and not including cDriver. For this, we calculated a combined rank and calculate the F-score. We used “Borda count” method with truncated ranks (up to two times the gold standard size) but using the ranks of only the three best methods.

For visualization purposes we show only genes that were ranked up to twice the number of gold standard genes given no further improvement was achieved by any tool beyond these thresholds.

Defining the landscape of tumor type–driver gene connections in Pancan21

We ran cDriver on each of the 21 tumor types separately (syn5593040, Supplementary Table 4) and in the pooled Pancan21 dataset. We selected genes that were significant (FDR

Identification of novel TTDG connections by PubMed mining

For each high confidence gene, we queried the HUGO symbol together with the MeSH term “neoplasm” (i.e. “ATM[TIAB] AND neoplasm[MH]”) against the PubMed database in order to test if the gene has been associated to any cancer type. The HUGO symbol had to be found in the title and/or abstract. Next, for each TTDG connection we queried the HUGO symbol together with the MeSH term of the associated tumor type. Based on the PubMed mining results, we used the following criteria to detect novel TTDG connections: (i) the tested gene was among the top 200 of the Pancan21 analysis, (ii) the gene had at least 5 neoplasms-related PubMed records, (iii) the gene was among the top 100 of the corresponding tumor type (defined by its TTDG connection), (iv) the gene had zero or one TTDG specific PubMed record, and, (v) if one TTDG specific PubMed entry was retrieved, we required that the publication did not report recurrent somatic mutations for that gene in the tumor type of interest.

Protein interaction and functional enrichment analysis

We used STRING v10.0

Individual gene analysis

To visualize the somatic mutations on the gene structure we used MutationMapperhttps://tcga-data.nci.nih.gov/tcga/). Kaplan-Meier curves and log-rank p-values for the selected genes were calculated using the R package Surv, patient-gene pairs were classified as affected if they presented at least one of these types of mutations ‘Frame_Shift_Del’, ‘Frame_Shift_Ins’, ‘In_Frame_Del’, ‘In_Frame_Ins’, ‘Missense_Mutation’, ‘Nonsense_Mutation’, ‘Splice_Site’, ‘Translation_Start_Site’, ‘Nonstop_Mutation’. Multiple testing correction was performed using Benjamini and Hochberg for the selected connections of chromatin modifiers.

Code availability

The latest version of the cDriver R package, with documentation and example data sets, is freely available at github.com/hanasusak/cDriver.