Patient material
Primary tumor, matched metastases and corresponding normal tissues of 12 patients with colorectal cancer were obtained at the Medical Faculty Mannheim, University of Heidelberg, Germany (Tables 1 and 2). The tissue banking and sample study was approved by the Ethical Committee of the University Hospital Mannheim, Medical Faculty Mannheim of Heidelberg University, all relevant ethical regulations were complied with, and informed consent was obtained from all patients or their spouses/relatives when the former were deceased. Bio banking and handling of the tissues followed the BRISQ guidelines
Genomic DNA isolation
Genomic DNA was isolated from 5 to 10, 20 μM cryosection slices (depending on tissue size) using the QIAamp DNA mini kit (Qiagen, Hilden, Germany) according to the manufacturer’s manual. The extracted DNA was submitted to the HIPO Sample Processing Laboratory (HIPO-SPL) for quality check and pseudo-anonymization of the samples, then transferred to the Genomics and Proteomics Core Facility of the German Cancer Research Center for sequencing.
Whole-genome sequencing and alignment
Whole-genome DNA sequencing was performed on the HiSeq2000 platform. Library preparation and whole-genome sequencing of matched tumor/normal/metastasis DNA was carried outhttps://dockstore.org/containers/quay.io/pancancer/pcawg-bwa-mem-workflow]. Read pairs were mapped to the 1000 Genomes Project phase 2 assembly of the human reference genome (hs37d5) using Burrows-Wheeler Aligner software
Small variant calling
Small variants were called from the whole aligned whole-genome sequencing data. They were initially called using our in-house workflows, described below, followed by cross checking of variant positions between tumor and metastasis pairs. SNVs were initially called using the DKFZ SNV and indel calling workflow from ICGC Pan-Cancer Analysis of Whole Genome projects [https://dockstore.org/containers/quay.io/pancancer/pcawg-dkfz-workflow]
Due to potential tumor in normal contamination leading to false negative calls we applied the TiNDA (tumor in normal detection algorithm) workflow (unpublished). Briefly, using the unique set of combined mutated positions for a tumor metastasis pair the B-allele frequency (BAF) was calculated from the tumor, metastasis and control samples. Positions overlapping with common variants were filtered out. Then, the clustering algorithm from Canopy
Indels were initially called using Platypus
Due to varying tumor cell content, we cross checked allele frequencies of mutations between tumors and metastasis to validate those small mutations were not missed due to lower tumor cell content in either the tumor or metastasis samples. A SNV was called when (i) it was called somatic using our in-house workflow, (ii) it was called somatic in the matched tumor/metastasis pair and its MAF was above 5% (corresponding to a minimum of 2 reads) and at least twice that of the matched germline control. This threshold of 2 reads was selected based on our series (i) where some of the samples are triploid (median series ploidy) (ii) with ×36 coverage (median series coverage), (iii) and with a tumor purity of 47.5% (median series purity), where the expected read support for a single copy variant would be 5.7 reads (0.475 × 36/3). Using a Poisson distribution model, variants with 2 read support fall within the majority of the distributions (pPoisson (X = 2) = 0.54, where μ = 5.7). SNVs that were shared between tumor and metastasis samples tended to have similar variant allele fractions (VAF) (Supplementary Figure 1). In some samples (CRC-004, CRC-006, and CRC-007) we observed slightly lower VAF in the tumor- and metastasis-specific mutations compared to the shared mutations, indicating a dominant truncal clone with low level heterogeneity.
We classify mutations of interest as somatic SNV and indels those causing protein coding changes (non-silent), and also exonic mutations on non-coding genes. Annotation of non-silent mutations in protein coding genes include non-synonymous SNVs, gain or loss of stop codons, splice site mutations, and both frameshift and non-frameshift indels in protein coding genes for mutations of interest on non-coding genes we used all exonic and splicing mutations.
A total of 2403 mutations of interest were detected, of which 1589 were in protein-coding genes and 814 in non-coding genes (Supplementary Data 10). The average number of mutations of interest per sample was 200 (range 94–351), of which an average of 132 (range 73–222) were in protein-coding genes, and 78 were in non-coding genes (range 21–129). These alterations hit 1428 protein-coding and 764 non-coding genes, of which 145 and 61 were hit in 2 or more samples, respectively (Supplementary Data 10). Relative to SNVs, much fewer indels were called, with the average per sample was 15 (range 8–23), of which an average of 7 (range 2–18) were in protein-coding genes, and 8 were in non-coding genes (range 4–14). These alterations hit 74 protein-coding and 94 non-coding genes, of which 3 and 1 were hit in two or more samples, respectively (Supplementary Data 11).
Among the most recurrently mutated genes, APC was mutated in all high-purity samples and TP53 in 15 samples. Further recurrently mutated genes included KRAS, NRAS, SOX9, TCF7L2, and FBXW7. We observed mutations in TTN and LRP1B which have been described as passenger mutations, although LRP1B is a paralogue of LRP1, which is known to be involved in Wnt receptor signaling. SOX9 was exclusively hit by frameshift insertions and deletions and always co-occurred with mutations in HK3, but this was not observed in the larger TCGA and Giannakis cohorts.
Correlating these recurrently mutated protein coding genes, ncRNAs and 3′-UTRs with clinical factors we found that KRAS was mutated exclusively in right sided colon and caecum tumors (compared to left sided sigmoid) and TP53 was mutated only in left sided sigmoid (compared to right sided colon and caecum) (Supplementary Data 12) consistent with observations by Yaeger et al. Additionally, we found mutations in RP11-983P16.2, POKR1, SLC26A10 affect females more than males (p 0.0455, χ2-test), 3′-UTRs mutations of CBLB, IFI44L, MMP16, RNF217 affect females more than males (p 0.0455, χ2-test), and 3′-UTR mutations of XKR4 were found 3 of 3 patients who did not undergo neoadjuvant therapy (compared to 0 of 3 who did).
The mean inter-mutation distance across the genome was between 10,000 and 1,000,000 bp and we did not observe recurrent regions of kataegis in our patients. Some of these individual kataegis loci were in close proximity to genes PEAK1, ADAP2, SUFU, and SGK3, with SUFU being metastasis-specific and SGK3 tumor-specific (Supplementary Figure 10, 11). However, several recurrent regions of increased mutation density were seen in both tumor and metastases, most prominently on chromosomes 5 and 13 which may be due to a gain of partially methylated domains10, Supplementary Figure 11).
Sample classification
The samples in our series were all deemed to be micro-satellite stable; they did not harbor mutations on DNA mismatch repair genes MLH1, MLH3, MSH2, MSH3, MSH6, PMS2 suggesting that they were not micro-satellite instable/hypermutators, nor did they harbor mutations on POLE suggesting that they were neither ultra-mutators.
The sample exhibiting the most mutations in our series (CRC-008, primary tumor) has 17,189 somatic SNVs, equivalent to 6.1 mutations per 106 bases (assuming 2.8 Gb of mappable human genome), which is about half of this hypermutator boundary. By extension, our samples cannot be classified as ultra-mutators.
Mutual exclusivity analysis
Mutual exclusivity analysis was initially performed on all genes that have established roles in colorectal cancer. Gene pairs were deemed to be mutually exclusive if no more than 1 sample harbored somatic SNVs for them. Using cBioPortal, we determined the significance of mutual exclusivity and co-occurrence of recurrently mutated genes in our, the TCGA, Giannakis et al., and Yaeger et al. studies, and the ARHGEF gene family (Supplementary Data 7). We found support to our observation of mutual exclusivity of ARHGEF7-KRAS (TCGA, p-value 0.021, Fisher test) and SOX9-TP53 (Yaeger et al., p-value <0.001), NRAS-KRAS (Yaeger et al., p-value <0.001). We found SOX9 mutations co-occurred with HK3 mutations, and with frameshift indels in APC as opposed to typical stop gains, although we did not observe co-occurrence of SOX9 and HK3 in larger cohorts.
Survival analysis
Survival analysis (overall and disease-free) was performed using cBioPortal on the TCGA provisional dataset using ARHGEF7 and all ARHGEFs combined: ARHGEF1, ARHGEF10, ARHGEF10L, ARHGEF11, ARHGEF12, ARHGEF15, ARHGEF16, ARHGEF17, ARHGEF18, ARHGEF19, ARHGEF2, ARHGEF25, ARHGEF26, ARHGEF3, ARHGEF33, ARHGEF34P, ARHGEF35, ARHGEF37, ARHGEF38, ARHGEF4, ARHGEF40, ARHGEF5, ARHGEF6, ARHGEF7 and ARHGEF9 (Supplementary Figure 8).
Structural variant calling
Structural variations (SV) were called using the SOPHIA algorithm (manuscript in preparation) using a workflow as described in Sahm et al.
Briefly, SOPHIA uses information of supplementary alignments from the alignment file as produced by bwa-mem. This indicates candidate chimeric alignments of split-reads which would be an indication of a possible underlying SV. SOPHIA uses a decision tree to consider only high-quality reads that do not fall on lowly mappable regions or consist of low-quality base calls. SOPHIA uses these reads and further filters the results by comparing them to a background control set of sequencing data derived from normal blood samples from a large background population database of 3261 patients from published TCGA studies and both published and unpublished DKFZ studies, sequenced using Illumina HiSeq 2000, 2500 (100 bp) and HiSeq X (151 bp) platforms and aligned uniformly. A SV is discarded if: it has more than 75% of read support is from low-quality reads; the second breakpoint of the SV was unmappable in the sample and in 10 or more background control samples; a SV with 2 breakpoints had one present in at least 98 control samples (3% of the control samples); both breakpoints have less than 5% read support at both positions.
In addition to the recurrently hit genes, we also found a number of topologically associated domains that were recurrently hit by SVs, including chr10:13,280,000–15,440,000 (containing SUV39H2), chr1:3,360,000–3,359,999 (MUM1, GNA15, GNA11, STK11, and TCF3), chr14:67,880,000–69,720,000 (RAD51B), and chr19:14,600,000–16,800,000 (BRD4) (Supplementary Data 13).
Copy number aberration calling
Copy number aberrations (CNAs) were called using ACEseqhttps://github.com/eilslabs/ACEseqWorkflow]. Briefly, ACEseq (allele-specific copy number estimation from whole-genome sequencing) determines copy number states, tumor cell content, ploidy, and sex in the tumor by using read coverage and the B-allele frequency (BAF). Heterozygous germline positions (with BAF 0.33–0.77 at dbSNP version 135 SNP loci)
For subsequent analysis, gains and losses were identified when they deviate more than 0.7 from the base ploidy. Annotation of genes was based on direct overlap with gene models from gencode version 19.
We observed recurrent chromosome arm-level changes, included gains 7p and q, 8q, 13q, 19q, and 20p and q, and deletions in 1p, 4q, 8p, 15q, 17p, and 18q, which have been described in the TCGA study
Identification of kataegis loci
Kataegis loci were classified as clusters of a minimum of 5 mutations within a 10 kb region. Annotation of genes to kataegis loci was done using bedtools using the gencode version 19 gene models. A kataegis locus was determined to be proximal to a gene if it was within 10 kb of it.
Supervised mutational signatures analysis
Supervised mutational signatures analysis was performed using the R package YAPSA [https://rdrr.io/bioc/YAPSA/]. The linear combination decomposition of the mutational catalog with known and predefined COSMIC signatures
Ingenuity pathway analysis
All genes hit by non-synonymous, including stop gain SNVs and indels were imported into the core analysis pipeline of the Ingenuity Pathway Analysis tool. Genes that were hit multiple times were included as an individual entry. SNVs and indels were combined together and exonic mutations in primary tumors (822 genes), metastasis (913 genes), primary tumor 3′-UTRs (770 genes) and metastasis 3′-UTRs (809 genes) were analyzed individually. Core analysis was performed with the default settings and the most significant pathways were selected after removal of those unrelated to cancer, GI disease or colorectal physiology.
Annotation enrichment analysis with DAVID
All genes hit by metastasis-specific mutations and indels that were mutated at least twice as much in metastasis samples compared to tumors were imported into functional annotation clustering tool of DAVID. The homo sapiens background, medium stringency and Benjamini-Hochberg correction was applied to the hypergeometric test.
In silico evaluation of miRNA binding to mutated 3′-UTRs
All predictions were made with the RNA22 interactive software [https://cm.jefferson.edu/rna22/Interactive/] using all known miRNA (miR) sequences from miRBase (Release 21) and the corresponding wild-type or mutated sequences as input. Default settings were used with sensitivity at 63%, specificity at 61%, seed size of 7 with a maximum of one unpaired base. The minimum number of paired-up bases in the heteroduplex was 12, the maximum folding energy for the heteroduplex (Kcal/mol) was −515 and no limit was given on the number of potential GU wobbles in the seed region. Gain- or loss-of-potential miRNA binding was evaluated by positive results in the presence or absence of a given mutation.