HiCDB: detect and visualize contact domain boundaries (CDBs) or differential CDBs ...

HiCDB – user guide

Authors: Fengling Chen, Guipeng Li

Overview

HiCDB is open-source MATLAB/GNU Octave code that detects contact domain boundaries (CDBs) from a Hi-C contact matrix. HiCDB.m takes a raw or normalized contact matrix and outputs conservation-annotated CDBs (option: singlemap) or differential CDBs (option: comparemap). visHiCDB.m takes a raw or normalized contact matrix together with HiCDB results and visualizes CDBs on a single Hi-C map (option: singlemap) or differential CDBs on two Hi-C maps (option: comparemap). You can run HiCDB and visHiCDB in the MATLAB environment or from the command line.

Here are the general features of the HiCDB software.

pipeline.png

Here are the general steps we use to detect CDBs.

pipe.png

Requirements and install

Install MATLAB or GNU Octave, then simply download HiCDB and run the scripts.

git clone https://github.com/ChenFengling/HiCDB.git

Quick start

Unzip testdata.tar.gz; you will find dense-format Hi-C data of hESC (Dixon et al.).

tar -zxvf testdata.tar.gz

To run HiCDB or visHiCDB from the command line with nohup, add the HiCDB software path and then call the functions exactly as in the MATLAB environment.

Run HiCDB to get the CDBs.

nohup matlab -r "addpath(genpath('HiCDB_PATH/'));HiCDB({'FULL_PATH_TO_DATA_FOLDER/h1_rep1/'},40000,'hg19','ref','hg19');exit;" > mylog.txt < /dev/null &

This takes the intra-chromosome matrices ('chr1.matrix',...,'chr23.matrix') in 'FULL_PATH_TO_DATA_FOLDER/h1_rep1/' as input, sets the resolution to 40000, chrsizes to 'hg19' and the CTCF motif ref to 'hg19', and outputs the contact domain boundaries.

Run visHiCDB to display the region chr17:67100000-71100000.

nohup matlab -r "addpath(genpath('HiCDB_PATH/'));fig=visHiCDB({'FULL_PATH_TO_DATA_FOLDER/h1_rep1/chr17.matrix'},{'FULL_PATH_TO_DATA_FOLDER/h1_rep1/CDB.txt'},40000,17,67100000,71100000);exit;" > mylog2.txt < /dev/null &

You will get the output below. Each dot is a detected CDB (dark blue: consistently detected CDBs; light blue: other CDBs).

17_67100000_71100000_HiCmap.png

Get .bed file

Because "chrX" is named "chr23" in the input files and "23" in the output CDB.txt file, you can use the following shell code to convert CDB.txt into a .bed file.

awk -v OFS="\t" '{ print "chr"$1,$2,$3,$4,$5}' CDB.txt >CDB.bed

sed -i 's/chr23/chrX/g' CDB.bed

1. Run HiCDB

Input

hicfile: The directory containing all intra-chromosome matrices of a sample. Each intra-chromosome matrix must be named "chr+number.matrix" following the chromosome order, e.g. 'chr1.matrix','chr2.matrix',...,'chr23.matrix'. Because HiCDB matches "chr*.matrix" to recognize the Hi-C matrices, avoid using names of the form "chr*.matrix" for any other files. The intra-chromosome matrix can be in dense (an NxN matrix) or sparse (a Kx3 table, Rao et al.) format. hicfile should be set to {'SAMPLE_DIR'} when option is 'singlemap' or {'SAMPLE_DIR1','SAMPLE_DIR2'} when option is 'comparemap'. This is required.
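Because every file matching "chr*.matrix" is picked up, a quick pre-flight check of the sample directory can catch stray or misnamed files before a run. The following is a minimal shell sketch and is not part of HiCDB: the function name check_matrix_names is hypothetical, and it assumes chromosome numbers have one or two digits with no leading zero.

```shell
# Hypothetical helper (not part of HiCDB): verify that every chr*.matrix file
# in a sample directory is named chr<number>.matrix.
# Usage: check_matrix_names SAMPLE_DIR
check_matrix_names() {
    dir=$1
    status=0
    for f in "$dir"/chr*.matrix; do
        # If the glob matched nothing, the pattern itself is returned unchanged.
        if [ ! -e "$f" ]; then
            echo "no chr*.matrix files found in $dir" >&2
            return 1
        fi
        name=$(basename "$f")
        case "$name" in
            chr[1-9].matrix|chr[1-9][0-9].matrix)
                ;;  # assumed valid: one or two digits, no leading zero
            *)
                echo "unexpected matrix name: $name" >&2
                status=1
                ;;
        esac
    done
    return "$status"
}
```

The function returns 0 when all names look valid and 1 otherwise, printing each offending file name to standard error, so it can gate a run with something like check_matrix_names SAMPLE_DIR && nohup matlab ...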

Dense format contains the contact frequencies of the Hi-C NxN matrix.

Sparse format (Rao et al.) has three fields: i, j, and M_i,j. (i and j are written as the left edge of the bin at a given resolution; for example, at 10 kb resolution, the entry in the first row and tenth column of the matrix corresponds to M_i,j with i=0, j=90000.) As the Hi-C matrix is symmetric, only the upper triangle of the matrix is saved in sparse format. An example follows:

50000	50000	1.0
60000	60000	1.0
540000	560000	1.0
560000	560000	59.0
560000	570000	1.0
560000	600000	1.0
560000	700000	1.0
690000	710000	1.0
700000	710000	1.0
710000	710000	66.0
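To make the relationship between the two formats concrete, here is a minimal sketch that expands a sparse table into a dense matrix. It is only an illustration (HiCDB reads both formats directly): the function name sparse_to_dense is hypothetical, the number of bins N is assumed to be supplied by the caller, and missing entries are assumed to be zero.

```shell
# Hypothetical helper (not part of HiCDB): expand a sparse (i, j, M_ij) table
# into a dense, tab-delimited N x N matrix.
# Usage: sparse_to_dense SPARSE_FILE RESOLUTION N_BINS
sparse_to_dense() {
    awk -v res="$2" -v n="$3" '
        NF < 3 { next }                 # skip blank or incomplete lines
        {
            i = $1 / res                # left bin edge -> 0-based bin index
            j = $2 / res
            m[i, j] = $3
            m[j, i] = $3                # mirror the saved upper triangle
        }
        END {
            for (r = 0; r < n; r++) {
                for (c = 0; c < n; c++) {
                    # entries absent from the sparse table are assumed to be 0
                    v = ((r, c) in m) ? m[r, c] : 0
                    printf "%s%s", (c ? "\t" : ""), v
                }
                printf "\n"
            }
        }' "$1"
}
```

With the 10 kb example from the text, a line "0 90000 x" lands in row 1, column 10 of the dense output (and, by symmetry, in row 10, column 1).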

resolution: resolution of Hi-C matrix. This is required.

chrsizes: Ordered chromosome sizes of the genome. Accepted settings are ‘hg19’, ‘hg38’, ‘mm9’, ‘mm10’ or a custom chromosome size file, which can be generated following the instructions in annotation/README.md. This is required.

ref: ref should be set when you want to determine the cutoff using CTCF motifs or when the option is 'comparemap'. Accepted values are ‘hg19’, ‘hg38’, ‘mm9’, ‘mm10’ or a custom motif locus file, which can be generated following the instructions in annotation/README.md. Only ‘hg19’ and ‘hg38’ can be annotated with conservation. To decide the cutoff in other organisms, users can use the motif of another insulator as a reference instead of CTCF. In our experience, a reliable way to decide the cutoff in other organisms is to inspect the CDBs on the Hi-C map under several cutoffs. Because HiCDB provides visualization of Hi-C maps with annotated CDBs and works well over a broad parameter range, this should not be too hard. The current cutoffs for 40 kb and 10 kb human samples are approximately the median and the third quartile of all local maximum peaks, respectively.

Run

To run HiCDB, type “HiCDB(hicfile,resolution,chrsizes,options & parameter values)” in MATLAB. A detailed description of all options and of how to modify parameter values is available by typing “help HiCDB” in MATLAB, or by running matlab -r "addpath(genpath('HiCDB_DIR/'));help HiCDB;exit;" from the command line. See also the examples below.

Examples

1. Output all the local maximum peaks and let users decide the cutoff.

HiCDB({'/home/data/sample'},10000,'hg19');

HiCDB({'/home/data/sample'},10000,'hg19','outdir','/home/data/sample/outputs/');

2. Use a GSEA-like method to decide the cutoff.

HiCDB({'/home/data/sample'},10000,'hg38','ref','hg38');

HiCDB({'/home/data/sample'},10000,'hg19','ref','custom_motiflocs.txt');

3. Detect differential CDBs.

HiCDB({'/home/data/sample1','/home/data/sample2'},10000,'hg19','option','comparemap','ref','hg19');

4. Detect differential CDBs with replicates.

HiCDB({{'sample1_rep1/','sample1_rep2/'},{'sample2_rep1/','sample2_rep2/'}},10000,'hg19','option','comparemap','ref','hg19');

Output(s)

1. CDB.txt:

chr	start	end	LRI	avgRI	conserve_or_not	consistent_or_differential
19	53100000	53140000	0.394707211	0.647392804	0	1
16	5060000	5100000	0.342727704	0.663101081	1	1
19	19620000	19660000	0.329837698	0.609237673	1	0

2. localmax.txt: all the local maximum peaks detected before the cutoff decision. Users can choose a custom CDB cutoff based on this file.

3. EScurve.png: CTCF motif enrichment on ranked local maximum peaks.

These output files are written to the custom output directory if one is given, or otherwise to the default directory, which is the directory of the first sample.
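For the custom cutoff on localmax.txt mentioned above, one simple approach is to keep a chosen top fraction of the ranked peaks. The sketch below is not part of HiCDB and rests on assumptions: the column layout of localmax.txt is not described in this guide, so the score column is a parameter you must supply, the file is assumed to be tab-delimited without a header line, and a larger score is assumed to mean a stronger boundary. The function name top_fraction is hypothetical.

```shell
# Hypothetical helper (not part of HiCDB): keep the top-scoring fraction of peaks.
# ASSUMPTIONS: the file is tab-delimited with no header line, and a larger
# value in the chosen column means a stronger boundary.
# Usage: top_fraction FILE SCORE_COLUMN FRACTION
top_fraction() {
    file=$1
    col=$2
    frac=$3
    # Number of rows to keep: floor(total * fraction), but at least one.
    keep=$(awk -v f="$frac" 'END { k = int(NR * f); if (k < 1) k = 1; print k }' "$file")
    tab=$(printf '\t')
    # Sort numerically, descending, on the score column and keep the top rows.
    LC_ALL=C sort -t "$tab" -k "${col},${col}nr" "$file" | head -n "$keep"
}
```

For instance, top_fraction localmax.txt 5 0.5 > custom_CDB.txt would keep the top half of the peaks ranked by column 5; the column number and output file name here are only illustrative. Checking the resulting boundaries with visHiCDB under a few different fractions is a reasonable way to settle on a value.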

2. Run visHiCDB

Input

hicfile: the same as in function HiCDB

CDBfile: CDBfile should be a cell array storing the CDB file locations. The CDB files should be formatted like the output files of function HiCDB.

resolution: resolution of Hi-C map.

chr,startloc,endloc: observation locus on Hi-C map.

Run

To run visHiCDB, type “fig=visHiCDB(hicfile,CDBfile,resolution,chr,startloc,endloc,options & parameter values)” in MATLAB with the input matrix, CDB file, options and parameter values. A detailed description of all options and of how to modify parameter values is available by typing “help visHiCDB” in MATLAB, or by running matlab -r "addpath(genpath('HiCDB_DIR/'));help visHiCDB;exit;" from the command line. See also the examples below.

Examples

1. Show CDBs on a single Hi-C map

fig=visHiCDB({'/home/data/sample/chr18.matrix'},{'CDB1.txt'},10000,18,25000000,31150000);

2. Show differential CDBs on Hi-C maps

fig=visHiCDB({'/home/data/sample1/chr18.matrix','/home/data/sample2/chr18.matrix'},{'/home/data/sample1/CDB1.txt','/home/data/sample2/CDB2.txt'},10000,18,25000000,31150000,'option','comparemap');

Output(s)

HiCmap.pdf: a PDF containing a figure that shows CDBs on a single Hi-C map or the different kinds of CDBs between two Hi-C maps.

These output files are written to the custom output directory if one is given, or otherwise to the default directory, which is the directory of the first sample.
