1. Cooler: scalable storage for Hi-C data and other genomically-labeled arrays
Nezar Abdennur, Leonid Mirny
Bioinformatics, btz540, https://doi.org/10.1093/bioinformatics/btz540
Published: 10 July 2019
Abstract
Motivation
Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate the development of scalable algorithms for data analysis.
Results
We developed a file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.
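To make the idea of a sparse, genomically-labeled matrix concrete, here is a minimal pure-Python sketch. It is not the cooler library's API or its on-disk HDF5 layout; the function names, the toy chromosome sizes, and the contact list are all hypothetical and chosen only for illustration. The sketch divides the genome into fixed-width bins and stores a count only for bin pairs that were actually observed, keeping one entry per symmetric pair.

```python
from collections import Counter


def make_bins(chromsizes, binsize):
    """Split each chromosome into fixed-width bins: a list of (chrom, start, end)."""
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


def bin_offsets(chromsizes, binsize):
    """Map each chromosome to the index of its first bin in the genome-wide bin list."""
    offsets, n = {}, 0
    for chrom, length in chromsizes.items():
        offsets[chrom] = n
        n += -(-length // binsize)  # ceiling division: number of bins on this chromosome
    return offsets


def aggregate(contacts, offsets, binsize):
    """Count contacts per bin pair, keeping only nonzero upper-triangle entries."""
    pixels = Counter()
    for chrom1, pos1, chrom2, pos2 in contacts:
        i = offsets[chrom1] + pos1 // binsize
        j = offsets[chrom2] + pos2 // binsize
        pixels[(min(i, j), max(i, j))] += 1  # the matrix is symmetric, so store i <= j
    return pixels


def lookup(pixels, i, j):
    """Random access to one matrix cell; absent pairs are implicit zeros."""
    return pixels.get((min(i, j), max(i, j)), 0)


# Toy data (hypothetical): two short "chromosomes" and four contacts.
chromsizes = {"chr1": 2500, "chr2": 1200}
binsize = 1000
bins = make_bins(chromsizes, binsize)
offsets = bin_offsets(chromsizes, binsize)
contacts = [
    ("chr1", 100, "chr1", 2100),
    ("chr1", 2200, "chr1", 50),
    ("chr1", 1500, "chr2", 300),
    ("chr2", 1100, "chr2", 1150),
]
pixels = aggregate(contacts, offsets, binsize)
print(len(bins) ** 2, "dense cells vs", len(pixels), "stored pixels")
```

In this toy case the dense matrix would have 25 cells while only 3 pixels are stored. The gap widens as the bin size shrinks, which is the storage cost the abstract refers to.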
Availability
Cooler is cross-platform, BSD-licensed, and can be installed from the Python Package Index or the bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.
Supplementary information
Supplementary data are available at Bioinformatics online.
2. MemBlob database and server for identifying transmembrane regions using cryo-EM maps
Bianka Farkas, Georgina Csizmadia, Eszter Katona, Gábor E Tusnády, Tamás Hegedűs
Bioinformatics, btz539, https://doi.org/10.1093/bioinformatics/btz539
Published: 10 July 2019
Abstract
The identification of transmembrane helices in transmembrane proteins is crucial, not only to understand their mechanism of action, but also to develop new therapies. While experimental data on the boundaries of membrane-embedded regions is sparse, this information is present in cryo-electron microscopy (cryo-EM) density maps but has not yet been used to determine membrane regions. We developed a computational pipeline that takes as input a cryo-EM map, the corresponding atomistic structure, and the potential bilayer orientation of a given protein as determined by the TMDET algorithm, and outputs the residues assigned to the bulk water phase, the lipid interface, and the lipid hydrophobic core. Based on this method, we built a database of published cryo-EM protein structures and a server that computes this data for newly obtained structures.
Availability
Supplementary information
Supplementary data are available at Bioinformatics online.
3. Noise-Cancelling Repeat Finder: Uncovering tandem repeats in error-prone long-read sequencing data
Robert S Harris, Monika Cechova, Kateryna D Makova
Bioinformatics, btz484, https://doi.org/10.1093/bioinformatics/btz484
Published: 10 July 2019
Abstract
Summary
Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools that take the high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.
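The task described above, locating copies of a known motif in a read despite substitutions and indels, can be illustrated with a small sketch. This is not NCRF's implementation, which the abstract does not describe; it uses wraparound dynamic programming, a general technique for locally aligning a sequence to an endlessly repeated motif. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions made for the example.

```python
def find_tandem_repeat(read, motif, match=1, mismatch=-1, gap=-1):
    """Best local alignment of `read` against an endless repetition of `motif`.

    Returns (score, start, end) such that read[start:end] is the aligned segment.
    Scoring values are illustrative assumptions, not tuned parameters.
    """
    m = len(motif)
    # prev[j] = (score, start) of the best alignment that ends at the previous
    # read base with motif position j as the last motif position consumed.
    prev = [(0, 0)] * m
    best = (0, 0, 0)
    for i, base in enumerate(read):
        cur = []
        for j in range(m):
            # Diagonal move: read base aligned to motif[j]. Index j - 1 wraps
            # from motif position 0 back to m - 1, which makes the motif cyclic.
            d_score, d_start = prev[j - 1]
            diag = (d_score + (match if base == motif[j] else mismatch), d_start)
            # Vertical move: the read base is an insertion relative to the motif.
            up = (prev[j][0] + gap, prev[j][1])
            cand = max(diag, up, key=lambda t: t[0])
            # Local alignment: never go below zero; a fresh start begins at i + 1.
            cur.append(cand if cand[0] > 0 else (0, i + 1))
        # Horizontal move: a motif base is missing from the read (deletion).
        # Two sweeps are enough to propagate gaps across the wraparound point.
        for _ in range(2):
            for j in range(m):
                left_score = cur[j - 1][0] + gap
                if left_score > cur[j][0]:
                    cur[j] = (left_score, cur[j - 1][1])
        for score, start in cur:
            if score > best[0]:
                best = (score, start, i + 1)
        prev = cur
    return best


# Toy reads (hypothetical): a clean repeat and one with a deleted base, both
# flanked by sequence that shares no letters with the motif.
clean = "CCCCCC" + "AATGG" * 4 + "CCCCCC"
noisy = "CCCCCC" + "AATGG" + "AAGG" + "AATGG" * 2 + "CCCCCC"
print(find_tandem_repeat(clean, "AATGG"))
print(find_tandem_repeat(noisy, "AATGG"))
```

Because gaps are scored rather than forbidden, the deleted base in the second read costs one point but does not split the repeat into two separate hits. An exact-match search would split it.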
Availability and Implementation
NCRF is implemented in C, supported by several Python scripts, and is available in bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.
Supplementary information
Supplementary data are available at Bioinformatics online.