GENOVA

最新推荐文章于 2021-01-31 11:06:50 发布

qq_39306047

最新推荐文章于 2021-01-31 11:06:50 发布

阅读量2.5w

点赞数 1

分类专栏： R

本文链接：https://blog.csdn.net/qq_39306047/article/details/106939647

版权

R 专栏收录该内容

52 篇文章 7 订阅

订阅专栏

vignettes/GENOVA.Rmd
@@ -1,409 +1,409 @@

title: “GENOVA: explore the Hi-C’s”
author:

name: Robin H. van der Weide
affiliation:
- Division of Gene Regulation, the Netherlands Cancer Institute
name: Teun van den Brand
affiliation:
- Division of Gene Regulation, the Netherlands Cancer Institute
name: Elzo de Wit
affiliation: Division of Gene Regulation, the Netherlands Cancer Institute
email: e.d.wit@nki.nl
date: “r format(Sys.time(), '%d %B %Y')”
output:
BiocStyle::pdf_document
abstract: |
The increase in interest for Hi-C methods in the chromatin community has led to a need for more user-friendly and powerful analysis methods. The few currently available software packages for Hi-C do not allow a researcher to quickly summarize and visualize their data. An easy to use software package, which can generate a comprehensive set of publication-quality plots, would allow researchers to swiftly go from raw Hi-C data to interpretable results. Here, we present GENome Organisation Visual Analytics (GENOVA): a software suite to perform in-depth analyses on various levels of genome organisation, using Hi-C data. GENOVA facilitates the comparison between multiple datasets and supports the majority of mapping-pipelines. \newpage

vignette: >
%\VignetteIndexEntry{GENOVA: explore the Hi-C’s}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: GENOVA.bib

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

knitr::opts_chunk$set(fig.pos = 'h')

Loading data

# devtools::install_github("robinweide/GENOVA", ref = 'dev')
library(GENOVA)

Data structures of input

GENOVA expects two input files: the signal- and the index-file. Signal-files have three columns (bin1, bin2, contactCount) and index-files have four (chromosome, start, end, bin). These are the default output of the Hi-C mapping pipeline HiC-Pro [@Servant2015], where they are called *.matrix and *.bed. The files are expected to be genome-wide and may be corrected with ICE-normalisation.

Recommended resolutions

To ensure computational strain and time is kept to a minimum, we recommend different resolutions for different functions (table @ref(tab:tableRES)). More experienced users are free to deviate, while keeping in mind that these datasets are quite memory-heavy (table @ref(tab:tableMEM)).

Function	Resolution
APA	10kb-20kb
ATA	10kb-40kb
cisTotal.perChrom	500kb-1Mb
chromosomeMatrix	500kb-1Mb
RCP	40kb-500kb
intra.inter.TAD.contacts	20kb - 40kb
PE-SCAn	20kb-40kb
hic.matrixplot	$\frac{width\ in\ bp\ of\ window}{500}$
centromere.telomere.analysis	40kb
*.compartment.plot	100kb

: (#tab:tableRES) Recommended resolutions. These will provide optimal resource/result tradeoffs.

Experiment	Contacts	10kb	40kb	100kb	1Mb
Hap1 [@Haarhuis2017]	433.5M	2.9Gb	1.7Gb	1.1Gb	0.1Gb
iPSC [@Krijger2016]	427.9M	3.1Gb	1.9Gb	1.0Gb	53.1MB

: (#tab:tableMEM) Memory footprints of objects loaded into R.
Several functions rely on centromere-information. You can add this in the form of a BED-like three-column data.frame when constructing the experiment-object ¹. If not present, the centromeres will be emperically identified by searching for the largest stretch of no coverage on a chromosome.

centromeres = read.delim('../data/hg19_cytobandAcen.bed', 
                         sep = '\t', 
                         h = F, 
                         stringsAsFactors = F)
head(centromeres)

construct.experiment

Every Hi-C experiment will be stored in an experiment-object. This is done by invoking the construct.experiment function. Inside, several sanity checks will be performed, data is normalised to the total number of reads and scaled to a billion reads (the default value of the BPscaling-option). For this example, we are going to use the Hi-C maps of WT and $\Delta$ WAPL Hap1 cells from Haarhuis et al. [-@Haarhuis2017]. Since the genome-wide analyses do not need very high-resolution data, we will construct both 10kb, 40kb and 1Mb resolution experiment-objects.

Hap1_WT_10kb <- load_contacts(signal_path = '../data/symlinks/WT_10000_iced.matrix', 
                              indices_path = '../data/symlinks/WT_10000_abs.bed',  
                              sample_name = "WT", 
			                        colour = "black")
Hap1_WAPL_10kb <- load_contacts(signal_path = '../data/symlinks/WAPL_10000_iced.matrix', 
                                indices_path = '../data/symlinks/WAPL_10000_abs.bed', 
                                sample_name = "WAPL", 
                                colour = "red")
Hap1_SCC4_10kb <- load_contacts(signal_path = '../data/symlinks/SCC4_10000_iced.matrix', 
                                indices_path = '../data/symlinks/SCC4_10000_abs.bed', 
                                sample_name = "SCC4", 
                                colour = "green")
Hap1_WT_40kb <- load_contacts(signal_path = '../data/symlinks/WT_40000_iced.matrix', 
                                indices_path = '../data/symlinks/WT_40000_abs.bed', 
                                sample_name = "WT",  
                                colour = "black")
Hap1_WAPL_40kb <- load_contacts(signal_path = '../data/symlinks/WAPL_40000_iced.matrix', 
                                indices_path = '../data/symlinks/WAPL_40000_abs.bed', 
                                sample_name = "WAPL", 
                                colour = "red")
Hap1_WT_1MB <- load_contacts(signal_path = '../data/symlinks/WT_1000000_iced.matrix', 
                                indices_path = '../data/symlinks/WT_1000000_abs.bed', 
                                sample_name = "WT", centromeres = centromeres,
                                colour = "black")
Hap1_WAPL_1MB <- load_contacts(signal_path = '../data/symlinks/WAPL_1000000_iced.matrix', 
                                indices_path = '../data/symlinks/WAPL_1000000_abs.bed', 
                                sample_name = "WAPL", 
                                centromeres = centromeres,
                                colour = "red")

The resulting contacts-object has several slots. MAT and IDX are the signal- and index-data.tables. We also have slots for the included chromosomes (CHRS) and the given centromers (CENTROMERES).

str(Hap1_WT_10kb, width = 60,   vec.len=0, strict.width = 'wrap')

Finally, the object has a lot of specific attributes, like metadata and given parameters during loading. The amount of contacts in the ICE data.table is likely different from the input-data, because it is scaled to a fixed number of reads (which can be set with the scale_bp-option in load_contacts()).

as.data.frame(
  attributes(Hap1_WT_10kb)
  [-1])

Juicebox & cooler

Both .hic and .cooler files can be loaded from version 1 onwards. The same load_contacts() function can be used, which will automatically determine the file-type based on the extention.

Hap1_WT_10kb_juicer <- load_contacts(signal_path = '../data/symlinks/WT.hic', 
                                     sample_name = "WT", 
                                     resolution = 10e3, 
                                     balancing = 'KR', # this is the default
			                               colour = "black")
Hap1_WT_10kb_cooler <- load_contacts(signal_path = '../data/symlinks/WT.cooler', 
                                     sample_name = "WT", 
                                     balancing = T,
			                               colour = "black")

Genome-wide analyses

A good place to start your analyses are some functions on a genome-wide level. We can assess the quality of the library, identify translocations and generate contact probability (aka scaling or interaction decay plots).

Cis-quantification

Work by the group of Amos Tanay showed that the expected amount of intra-chromosomal contacts is the range of 90 to 93 percent [@Olivares-Chauvet2016]. Assuming that any extra inter-chromosomal contacts are due to debris/noise, the user might aspire to get the cis-percentages as close to 90% as possible. To compute the percentage of per-sample cis-contacts, we simply provide cis_trans() with the exp-object of interest. It will produce a boxplot of the percentages cis per sample (figure @ref(fig:cis)).

cisChrom_out <- cis_trans( list(Hap1_WT_1MB, Hap1_WAPL_1MB) )
barplot(cisChrom_out$cis, names.arg = cisChrom_out$sample, ylim = c(0, 100) )
abline(h = 90, col = 'red', lty = 3)
abline(h = 93, col = 'red', lty = 3)

The same function can also be ran on specific regions. For this example, we will compute the intra-arm percentages. Plotting this shows us that there are interesting differences between the amounts of intra-arm contacts in cis, which is can be attributed to the loss of TADs [@Haarhuis2017] (figure @ref(fig:cis2)).

p_arms <- data.frame('chromosome' = centromeres[,1],
                     'start' = 0,
                     'end' = centromeres[,2])
cisChrom_out <- cis_trans( list(Hap1_WT_10kb, Hap1_SCC4_10kb) , bed = p_arms)
barplot(cisChrom_out$cis, names.arg = cisChrom_out$sample )
abline(h = 90, col = 'red', lty = 3)
abline(h = 93, col = 'red', lty = 3)

chromosome plots

Hi-C has been shown to be a powerful data-source to detect chromosomal rearrangements [@Harewood2017]. To find possible translocations, we can plot the genome-wide enrichment of interactions between all combinations of chromosomes. The values in the matrix are $l o g 10 (o b s e r v e d / e x p e c t e d)$ . The Hap1 cell line has two known translocations, which we can easily see in the resulting plot (figure @ref(fig:chromMat1)). To narrow-in on this location, you could use the trans.compartment.plot-function (discussed below).

par(pty ='s')
# Lets remove mitochondrial and Y-chromosomal contacts
# You can also use chromsToUse to select specific chromosomes.
chromosomeMatrix(Hap1_WT_1MB, remove = c("chrM","chrY"), cut.off = 2)

# Lets remove mitochondrial and Y-chromosomal contacts
chromosomeMatrix(Hap1_WT_1MB, remove = c("chrM","chrY"))

RCP

The Relative Contact Probability computes the contact probability as a function of genomic distance, as described in [@Lieberman-Aiden2009]. This can be computed for a specific set of chromosomes or genome-wide. To be able to ignore centromeric contacts (which have a abberant RCP), centromeric information is need. This is taken from the experiment-object or found emperically by comparing trans-interactions.

RCP_out <- RCP(explist = list(Hap1_WT_40kb, Hap1_WAPL_40kb), 
               chromsToUse = 'chr1')

The user can decide to plot the RCP per chromosome. If the data is sparse, a LOESS-smooting could be convenient. It takes the color and name from the experiment-objects. If we look at the resulting plot, we can see that the $\Delta WAPL$ has more interactions in the $[\pm800kb, \pm2Mb]$ range (figure @ref(fig:RCPPLOT1)). The sizes of TADs are fall into this range, so a next step could be to dive into the TAD-specific analyses (discussed below). Moreover, the $\Delta WAPL$ has less interactions in the far-cis range ( $[10 M b, 100 M b]$ ): A- and B-compartments are often of these sizes, so a next step could be to look more into comparmentalisation with cis.compartment.plot or trans.compartment.plot, for example.

options(scipen = 1)

# Plot RCP: per-chromosome
visualise(RCP_out, 
          smooth = T)

differentials

It is also possible to directly compare samples to one (like WT versus WAPL). For this, metric has to be set to lfc and contrast to 1 (figure @ref(fig:RCPPLOT2)). The log-fold change of average probabilities are then plotted.

# Plot RCP: combined
visualise(RCP_out, contrast = 1, metric = 'lfc')

regions

But what if you want to compare the contact probabilities of specific regions, like Cohesin- or CTCF-bound regions? For this, we added the possibility to add a list of BED-data_frames to the bedList-argument. Under the hood, we perform a standard per-arm RCP (thus still enabling users to alse set the chromsToUse-parameter), whereafter we filter out Hi-C bins that do not have entries in the dataframe(s) of bedList. The same plot-function can be used: different BED-files will have different line-types. The fact that we use linetype for the bedList entries, allows us to still use multiple samples in experimentList, as shown in figure @ref(fig:RCPBED). But if you only provide one experiment-object, we will use different line-colours of the different BEDs.

CTCF = read.delim('../data/symlinks/CTCF_WT_motifs.bed', h = F)
SMC1 = read.delim('../data/symlinks/SMC1_WT_peaks.narrowPeak', h = F)
RCP_out = RCP(list(Hap1_WT_40kb, Hap1_WAPL_40kb),
              bedlist =  list("CTCF" = CTCF, 
                              'Cohesin' =SMC1), 
              chromsToUse = 'chr1')
visualise(RCP_out)
visualise(RCP_out, contrast = 1, metric = 'lfc')

A- and B-compartments

H3K27acPeaks = read.delim('../data/symlinks/H3K27ac_WT.narrowPeak', h = F)
CS_out = compartment_score(list(Hap1_WT_40kb, Hap1_WAPL_40kb), bed = H3K27acPeaks)
visualise(CS_out, chr = "chr17")

Saddle-analyses

saddle_out = saddle(list(Hap1_WT_40kb, Hap1_WAPL_40kb), 
                   CS_discovery = CS_out,
                   bins = 50)
visualise(saddle_out)

Compartment-strength

CSS <- quantify(saddle_out)
compared <- tidyr::spread(unique(CSS[,-c(3,4)]), key = 'exp', value = 'strength')
plot(compared$WT, compared$WAPL, xlim = c(0,4), ylim = c(0,4), pch = 20)
abline(a = 0, b = 1)

Interaction plots

GENOVA has serveral plotting-functions for genomic loci. cis.compartment.plot and trans.compartment.plot provide a easy way to plot whole chromosome arms, including compartmentalisation-score tracks. For more zomomed-in plots hic.matrixplot can be used. This function also allows rich annotation and between-experiment comparision possibilities. All functions try to guess the most appropriate color-scale limits, but finer control of this can be gotten by setting the cut.off-argument.

cis-interactions

The compartmentalisation of the chromatin into A and B van already described in the orignial Hi-C paper [@Lieberman-Aiden2009]. Serval papers have discribed the loss of compartmentalisation when the Cohesin complex is stabalised [@Haarhuis2017,@Wutz2017,@Gassler2017]. To view this interesting level of chromatin organisation, we can use cis.compartment.plot. With this, we can plot one arm of a chromosome with the compartment-score plotted above. To infer which compartment is A (viewed as the active state) and which is B, we can add a BED-data.frame of ChIP-seq peaks from active histone marks (e.g. H3K27ac, H3K4me1). In figure @ref(fig:CCP1) you can see the resulting plots, where you can see that the checkerboard-pattern in the matrix and the amplitude of the compartment-score are deminished in the WAPL-knockout.

H3K27ac_peaks = read.delim('../data/symlinks/H3K27ac_WT.narrowPeak', h = F)
cis.compartment.plot(exp1 = Hap1_WT_40kb, 
                     exp2 = Hap1_WAPL_40kb,
                     chrom = 'chr14', 
                     arm = 'q',
                     cs.lim = 1.75, # max compartment-score
                     cut.off = 15,
                     chip = H3K27ac_peaks)

The compartment-score is calculated by performing an eigenvector decomposition on the correlation-matrix of the expected over expected matrix. To view this O/E matrix, we can set the obs.exp-option to TRUE. This view gives a visually better view of the A- and B-compartments (figure @ref(fig:CCP3)).

cis.compartment.plot(exp1 = Hap1_WT_40kb, 
                     exp2 = Hap1_WAPL_40kb,
                     chrom = 'chr20', 
                     arm = 'q',
                     cut.off = 1, 
                     obs.exp = T,
                     chip = H3K27ac_peaks)

trans-interactions

As could be seen above, A-compartments interact more with other A-compartments and the same is true for B-compartments. However, is the same true for trans? The function trans.compartment.plot will allow the user to plot a trans-matrix (i.e. a matrix of the arms of two different chromosomes) along with the respective cis compartment-scores. This function could also be used to investigate chromosomsal translocations: the $9 q; 22 q$ translocation can be clearly seen if we use this function, as in figure @ref(fig:TCP).

trans.compartment.plot(exp = Hap1_WT_40kb, 
                       chrom1 = 'chr9', 
                       arm1 = 'q', 
                       chrom2 = 'chr22', 
                       arm2 = 'q', 
                       cut.off = 10,
                       chip = H3K27ac_peaks)

matrix plots

To produce richly annotated zoomed-in (i.e. max 10Mb) plots of specific regions, we use the hic.matrixplot function. In this, we can use one or two experiment objects: two can be shown either in diff-mode (the difference between the two) or upper/lower triangle mode. TAD- and loop-annotations can be added, as well as bigwig- and bed-tracks. Moreover, genemodel-files can be added. In this section, we will build up to a final, fully annotated, matrix from a humble one-experiment plot (figure @ref(fig:HMP1)).

hic.matrixplot(exp1 = Hap1_WT_10kb,
               chrom = 'chr7',
               start = 25e6,
               end=30e6, 
               cut.off = 50) # upper limit of contacts

two experiments

Adding a second experiment will give us the option of coplot, which can be dual (default) or diff. The first shows exp1 in the upper triangle and exp2 in the lower Exp1 is subtracked from exp2 in diff-mode: red is therefore more contacts in exp2 and blue denotes more contacts in exp1 (figure @ref(fig:HMPdiff1)).

hic.matrixplot(exp1 = Hap1_WT_10kb,
               exp2 = Hap1_WAPL_10kb,
               chrom = 'chr7',
               start = 25e6,
               end=30e6, 
               cut.off = 50) # upper limit of contacts
hic.matrixplot(exp1 = Hap1_WT_10kb,
               exp2 = Hap1_WAPL_10kb,
               coplot = 'diff',
               chrom = 'chr7',
               start = 25e6,
               end=30e6, # upper limit of contacts
               cut.off = 25)

TADs and loops

It can be very usefull to annotate the matrix with the postions of TADs and loops: take, for example, the situation where these structures are altered in a knockout for example. We are going to use the TAD- and loop-calls of WT Hap1 20-kb matrices from Haarhuis et al. [-@Haarhuis2017], generated with HiCseg [@Levy-Leduc2014].
Lets load some TAD- and loop-annotations:

WT_TADs = read.delim('../data/symlinks/WT_hicseg_TADs.bed', h = F)
WT_Loops = read.delim('../data/symlinks/WT_HICCUPS.bedpe', h = F, skip = 1)
sanborn2015_Loops = read.delim('../data/symlinks/GSE74072_Hap1_HiCCUPS_looplist.txt.gz')

WT_Loops$V1 = gsub(pattern = "^", replacement = "chr", x = WT_Loops$V1) 
WT_Loops$V4 = gsub(pattern = "^", replacement = "chr", x = WT_Loops$V4)
WT_TADs = WT_TADs[!WT_TADs$V3 < WT_TADs$V2,] 
sanborn2015_Loops[,1] = gsub(sanborn2015_Loops[,1], pattern = "^", replacement = "chr")
sanborn2015_Loops[,4] = gsub(sanborn2015_Loops[,4], pattern = "^", replacement = "chr")

Add them to the plot by using the tad- and loops-arguments. Both can be plotted in one or both of the traingles and colored as whished (figure @ref(fig:HMPtadloop)). Since loops are very small in a hic-matrixplot, they will be fully overlapped by the loop-annotations. To overcome this, we expand the annotations with a fixed bp using loops.resize. This will lead to a more box-like annotation surrounding the loop.

hic.matrixplot(exp1 = Hap1_WT_10kb,
               chrom = 'chr7',
               start = 25e6,
               end=30e6, 
               loops = WT_Loops, # see APA
               loops.colour = 'blue', # purple loops
               loops.type = 'upper', # only plot in upper triangle
               loops.radius = 20e3, # expand for visibility
               tads = WT_TADs, # see ATA
               tads.type = 'lower', # only plot in lower triangle
               tads.colour = '#ffd92f', # green TAD-borders
               cut.off = 25) # upper limit of contacts

BigWigs and BEDs

Manipulation of CTCF-binding sites can result in loss or gain of loops and/or TADs [@DeWit2015a]. If one is interested in the effects of a knockout on the binding of a protein in combination with changes in interaction frequencies, adding ChIP-seq signal or -peaks to the matrix can be very helpfull. Two tracks above and two tracks to the left can be added. These can be either BED-like data.frames or the paths the .bw files. For example, lets load a BED6-file (chrom, start, end, name, score, and strand ²) of CTCF-motifs under CTCF-ChIP peaks. The argument type can be set to either triangle or rectangle: triangle is nice to use if you want to look at the orientation of the BED-entries (figure @ref(fig:HMPchip)). If you only have a three column BED, then the autput will allways be rectangle.

CTCF = read.delim('../data/symlinks/CTCF_WT_motifs.bed', h = F)
SMC1 = read.delim('../data/symlinks/SMC1_WT_peaks.narrowPeak', h = F)

knitr::kable(
  head(CTCF, 3), caption = 'A data.frame holding a standard BED6 format.'
)

Moreover, we can use a bigwig (.bw) file to plot a track. For this example, we are using a SMC1 ChIP-seq track from [@Haarhuis2017]. We need the bigwrig package, which is easily installed from github using devtools::install_github(). The yMax argument is handy if you want to compare bigwig-tracks: it lets you set the y-axis maximum.

library(devtools)
install_github(repo ='bigwrig', username =  'jayhesselberth')

hic.matrixplot(exp1 = Hap1_WT_10kb,
               chrom = 'chr7',  start = 26.75e6,  end=28.5e6, 
               loops = WT_Loops, # see APA
               loops.colour = '#998ec3', # purple loops
               loops.type = 'upper', # only plot in upper triangle
               loops.radius = 20e3, # expand for visibility
               type = 'triangle',
               chip = list('../data/symlinks/SMC1_WT.bw', # inner top
                           CTCF),# outer-top
               symmAnn = F, # place annotations also on left side
               cut.off = 65) # upper limit of contacts

Genes

[@Dixon2012] showed that housekeeping-genes are enriched in the vincinity of TAD-borders. Another interesting question could be wheter differentially expressed genes are also found near TAD-borders or binding sites of specific proteins when studying a knockout. These type of questions can be tackled by adding the appropiate gene-models to hic.matrixplot. To do this, we make use of the data.fame, where each row is an exon from a gene. There are several ways to get this. One of the easiest is to use biomart to get exon-coordinates. This can be done with the biomaRt-package or via the web-based service. For this example, we downloaded data of all exons from the Ensembl biomart and plotted both the BED-file and the genes (figure @ref(fig:HMPchipGene)).

```{r biomart}
# features downloaded:
## Gene stable ID & Transcript stable ID & Chromosome/scaffold name &
## Transcript start (bp) & Transcript end (bp) & Exon region start (bp) &
## Exon region end (bp) & Strand
# martExport = read.delim('../data/mart_export.txt.gz', stringsAsFactors = F)
# colnames(martExport) = c('ENSG','ENST','chrom' , # change column names
#                          'txStart' , 'txEnd' , 
#                          'exonStart' , 'exonEnd' , 'strand')
# martExport$chrom = gsub(martExport$chrom, # add chr-prefix
#                         pattern = '^',
#                         replacement = 'chr') 
# martExport$strand = ifelse(martExport$strand == 1, '+',"-") # 1/-1 to +/-
load('../data/martExport.Rdata')

knitr::kable(
  head(martExport[,-c(1,2)], 5), caption = 'A data.frame holding the needed columns for plotting genes.'
)

hic.matrixplot(exp1 = Hap1_WT_10kb,
               chrom = 'chr7',  start = 26.75e6,  end=28.5e6, 
               genes = martExport,
               cut.off = 65) # upper limit of contacts

Everthing together

Finally, we can combine all these options in one. This may be complete overkill, but it could be quite handy. In this example, we can see that most TAD-borders and loop-anchors have clear SMC1- and CTCF-signal (figure @ref(fig:HMPall)). Both these are expected to be found at these locations according to the chromatin extrusion model. Moreover, we can also see that the CTCF-orientation of the upstream and downstream loop-anchor are forward and reverse, resp. This convergent rule is a known feature of loops [@DeWit2015].

hic.matrixplot(exp1 = Hap1_WT_10kb,
               chrom = 'chr7',
               start = 25e6,
               end=28.5e6, 
               loops = WT_Loops, # see APA
               loops.colour = '#998ec3', # purple loops
               loops.type = 'upper', # only plot in upper triangle
               loops.radius = 20e3, # expand for visibility
               genes = martExport,
               chip.colour = 'black',
               chip = list('../data/symlinks/SMC1_WT.bw', # inner-top
                           SMC1, # outer-top
                           '../data/symlinks/SMC1_WT.bw', # inner-left
                           CTCF), # outer-left
               tads = WT_TADs, # see ATA
               tads.type = 'lower', # only plot in lower triangle
               tads.colour = '#91cf60', # green TAD-borders
               cut.off = 50) # upper limit of contacts

TADs

Topologically Associated Domains (TADs) are $\pm.8-2Mb$ regions, which are seen as triangles in the matrix: regions that have more interactions within than outside. GENOVA has a repetoire of functions to generate and analyse TADs. Fist, we will use the insulation score to call TADs and compare the strength of TAD-borders between samples. Next, we will explore ATA to analyse aggegrates of TADs. Finally, wil investigate wheter TADs interact with their neighbouring TADs.

Insulation

To estimate the strength of TAD-borders, we can look at the insulation-score [@Crane2015]. At a TAD-border, this score reaches a local minimum: the lower the score, the stronger the insulation. We can generate this for a specific sliding-window size with insulation_score. The choice of window-size is quite tricky, since smaller will be sensitive to very local effect (i.e. mapping-errors, loops), while too big windows will lead a an underrepresentation. Luckily, we can generate a domainogram of a range of window-sizes for a specific genomic region with insulation.domainogram.

Domainogram

To make a domainogram, we simply choose our experiment and our region of interest. The window-size is directly proportional to the amount of Hi-C bins.

ID <- insulation_domainogram(Hap1_WT_10kb,
                       'chr7', 
                       25.5e6,
                       30e6, 
                       window_range = c(1, 101),
                       step = 2)

A nice feature of hic.matrixplot is that if you use it without plotting anything on the sides (i.e. genes and/or ChIP-tracks), you can insert other plots. This allows us to plot the domainogram directly under the matrix, making it very easy to compare the insulation with the actual data (figure @ref(fig:domainogram2)).

hic.matrixplot(exp1 = Hap1_WT_10kb,
               chrom = 'chr7',
               start = 25e6,
               end=29e6, 
               tads = WT_TADs, # see ATA
               tads.type = 'upper', # only plot in lower triangle
               tads.colour = '#91cf60', # green TAD-borders
               cut.off = 25, # upper limit of contacts
               skipAnn = T) # skip the outside annotation
plot(ID)

Computing the insulation score

To get the genome-wide insulation score in .bedgraph-format ³, we provide the genome.wide.insulation with an experiment-object and the window-size of choice. As can be seen in the domainogram above, at $W = 25$ we will catch the majority of the hotspots, while limiting the amount of noise. The visualise()-function can show both the insulation-scores and the difference between them.

Hap1_10kb_insulation = insulation_score(list(Hap1_WT_10kb, Hap1_SCC4_10kb),
                                           window = 25)
visualise(Hap1_10kb_insulation, chr = 'chr7', start = 25e6, end=29e6, contrast = 1)

Insulation-heatmap

We can align the border-strength of TADs in multiple samples to a specific BED-file, to compare “borderness” of feature. For example, let’s use the TAD-borders from [@Haarhuis2017]. In figure @ref(fig:INSalign) we can see that the average signal drops at the border (which is to be expected) and that this is a genome-wide feature, as we see in the heatmap.

heatmap_insulation(Hap1_10kb_insulation, bed = CTCF, bed_pos = 'center')

Call TADs

The insulation-discovery object can also be used to call TADs.

TADcalls <- call_TAD_insulation(Hap1_10kb_insulation)

hic.matrixplot(exp1 = Hap1_WT_10kb,
               exp2 = Hap1_SCC4_10kb,
               chrom = 'chr7',
               start = 24e6,
               end=27e6, 
               tads = list(TADcalls$SCC4, TADcalls$WT), # see ATA
               tads.type = list('lower', 'upper'), # only plot in lower triangle
               tads.colour = c('green', 'grey'), # green TAD-borders
cut.off = 25) # upper limit of contacts

ATA

ATA_Hap1_WTcalls   <- ATA(list("WT" = Hap1_WT_10kb,
                               'WAPL' = Hap1_WAPL_10kb),
                          bed = TADcalls$WT)

We can use visualise to plot the ATA-results.

visualise(ATA_Hap1_WTcalls, 
          colour_lim = c(0,50),
          colour_lim_contrast = c(-5,5), 
          focus = 1) # which entry to use as comparison

TAD+N

TAD_N_WT   <- intra_inter_TAD(list("WT" = Hap1_WT_10kb,
                                   'WAPL' = Hap1_WAPL_10kb),
                              tad_bed = TADcalls$WT, 
                              max_neighbour = 10)

We can compute the enrichment of contacts between TADs with the visualise()-function.

visualise(TAD_N_WT, geom = 'jitter')
visualise(TAD_N_WT, geom = 'violin')

Loops

For this section, we are using both the called and the extended loops from Haarhuis et al. [-@Haarhuis2017]. These are the anchor-combinations of the merged loop-calls of WT Hap1 5-, 10- and 20-kb matrices, generated with HICCUPS [@Rao2014].

WT_Loops_extended = anchors_extendedloops(Hap1_WT_10kb$IDX, 
                                          res = 10e3, 
                                          bedpe = WT_Loops)

options(scipen = 1e9)

knitr::kable(
  head(WT_Loops_extended, 5), caption = "Anchor-combinations of anchors_extendedLoops"
)

options(scipen = 1)

APA

Explain smalltreshold

APA_Hap1_WT  <- APA(list("WT" = Hap1_WT_10kb,
                          'WAPL' = Hap1_WAPL_10kb),
                          bedpe = WT_Loops)
APA_Hap1_WT_extended <- APA(list("WT" = Hap1_WT_10kb,
                               'WAPL' = Hap1_WAPL_10kb),
                            bedpe = WT_Loops,
                            anchors = WT_Loops_extended)

We can use visualise.APA.ggplot to combine the APA-results.

visualise(APA_Hap1_WT,
          title = "Hap1 Hi-C vs loops", 
          zTop = c(0,9.5), # set the zlims of the upper row
          zBottom = c(-5,5),
          focus = 1) # which item in APAlist to use as comparison
visualise(APA_Hap1_WT_extended,
          title = "Hap1 Hi-C vs extended loops", 
          zTop = c(0,9.5), # set the zlims of the upper row
          zBottom = c(-5,5),focus = 1) # which item in APAlist to use as comparison

To get some basic statistics on the output(s) of APA-run(s), we use quantify(). This function averages the region surrounding the center of each loop, where a region is defined as a square of $pixWidth \times pixWith$ .

quantifyAPA_out <-  quantify(APA_Hap1_WT)
quantifyAPA_out_extended <-  quantify(APA_Hap1_WT_extended)
# barplot(quantifyAPA_out$per_sample$contrast_lfc, ylim = c(0, 2))
# barplot(quantifyAPA_out_extended$per_sample$contrast_lfc, ylim = c(0, 2))
# pot boxplot with base-R (ggplot2 would be also easy)
boxplot(split(quantifyAPA_out_extended$per_loop$lfc, f = quantifyAPA_out_extended$per_loop$sample_name), 
        col = c('darkgrey', 'red'), outline = F, ylab = 'pixel enrichment extended loops')

Far-cis interactions

PE-SCAn

Some regulatory features, like super-enhancers come together in 3D-space. To test this, we implemented PE-SCAn. Here, the enrichment of interaction-frequency of all pairwise combinations of given regions is computed. The background is generated by shifting all regions by a fxed distance (1Mb: can be changed with the shift-argument).

superEnhancers = read.delim('../data/symlinks/superEnhancers.txt',
                            h = F, 
                            comment.char = "#")

knitr::kable(
  head(superEnhancers[,1:6], 5), caption = "A data.frame holding the output of homer's findPeaks -style super."
)

The baisc visualisation is comparable to ATA and APA: the first row shows the enrichment of all included samples, while the bottom row shows the difference.

WT_PE_OUT = PESCAn(list(Hap1_WT_40kb, Hap1_WAPL_40kb), bed = superEnhancers[,2:4])
visualise(WT_PE_OUT)

Another way of looking at the PE-SCAn results, is to make a perspective plot. Here, the enrichment is encoded as the z-axis.

RES = 40e3 # resolution of the Hi-C
persp(list(x = seq(-1*(RES*10),(RES*10), length.out = 21)/1e6, # x-ticks (MB)
           y = seq(-1*(RES*10),(RES*10), length.out = 21)/1e6, # y-ticks (MB)
           z = WT_PE_OUT$obsexp[,,'WAPL']), # PE-SCAn out
      phi = 25, # colatitude 
      theta = 40, # rotation
      col="#92c5de", # color of the surface
      shade=0.4, # how much shading 
      xlab="", 
      ylab="", 
      zlab="",
      cex.axis = .6,
      ticktype="detailed", 
      border=NA, 
      zlim = c(min(c(WT_PE_OUT$obsexp)), 
               max(c(WT_PE_OUT$obsexp))))

centromere.telomere.analysis

centromere.telomere.analysis
draw.centromere.telomere
We saw a enriched signal between chromosomes 15 and 19. We can wh

out1519 = centromere.telomere.analysis(Hap1_WT_40kb, chrom.vec = c('chr15', 'chr19'))
draw.centromere.telomere(out1519)

par(pty ='s')
out1519 = centromere.telomere.analysis(Hap1_WT_40kb, chrom.vec = c('chr15', 'chr19'))
draw.centromere.telomere(out1519)

Please post questions, comments and rants on our github issue tracker.

Session info

sessionInfo()

References

Please make sure that the chromosome-names match. ↩︎
https://genome.ucsc.edu/FAQ/FAQformat.html ↩︎
BED3 + signal column ↩︎

qq_39306047

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
GENOVA

vignettes/GENOVA.Rmd@@ -1,409 +1,409 @@title: “GENOVA: explore the Hi-C’s”author:name: Robin H. van der Weideaffiliation:Division of Gene Regulation, the Netherlands Cancer Institutename: Teun van den Brandaffiliation:Division of Gene Regulat
复制链接

扫一扫