Paper reading (6): Machine learning in chemoinformatics and drug discovery

Article link: https://blog.csdn.net/wxw060709/article/details/100900359

Paper title: Machine learning in chemoinformatics and drug discovery

Scholar citations: 49

Pages: 9

Published: 2018.08

Journal: Drug Discovery Today

Authors: Yu-Chen Lo, Stefano E. Rensi, Wen Torng and Russ B. Altman

Abstract: Chemoinformatics is an established discipline focusing on extracting, processing and extrapolating
meaningful data from chemical structures. With the rapid explosion of chemical ‘big’ data from HTS and
combinatorial synthesis, machine learning has become an indispensable tool for drug designers to mine
chemical information from large compound databases to design drugs with important biological
properties. To process the chemical data, we first reviewed multiple processing layers in the
chemoinformatics pipeline followed by the introduction of commonly used machine learning models in
drug discovery and QSAR analysis. Here, we present basic principles and recent case studies to
demonstrate the utility of machine learning techniques in chemoinformatics analyses; and we discuss
limitations and future directions to guide further development in this evolving field.

Conclusion: Machine learning techniques have been widely applied in the field of chemoinformatics to discover and design new drugs with superior biological activities. Mathematical mining of chemical graphs enables the derivation of a constellation of 2D or 3D chemical descriptors, which are packaged as chemical fingerprints in a diverse array of machine learning models and predictive tasks. A key area of innovation in the field is the marriage of big data and machine learning to predict wider ranges of biological phenomena. Traditional drug design methods based on simple ligand–protein interactions are no longer sufficient for meeting clinical drug safety criteria. High drug attrition rates from severe side effects often involve biological pathways and systematic responses at higher levels. Consequently, incorporating multiple data types and sources, also known as ‘data fusion’ techniques, that aggregate structural, genetic and pharmacological data from the molecular to organism level, will be crucial for the discovery of safer and more-effective drugs [105]. Likewise, novel machine learning models capable of processing big data at high volume, velocity and veracity with great versatility are also needed. Recent evolution in deep learning networks has proven to be a promising architecture for efficient learning from massive datasets for modern drug discovery campaigns [106]. Other aspects of machine learning techniques such as increased data interpretability to prove mechanistic hypothesis as well as methods preventing overfitting are also important topics that warrant further development in the field of machine-learning-based drug discovery.

Introduction: Therefore, advanced chemoinformatics and machine learning techniques capable of modeling nonlinear datasets, as well as big data of increasing depth and complexity, are needed.

Outline of the main text:

1. Overview of chemoinformatics

Chemical graph theory
Chemical descriptors
Chemical fingerprints
Chemical similarity analysis

2. Machine learning models in QSAR

Naive Bayes
Regression analysis
k-Nearest neighbors
Random forest
Support vector machines
Neural networks and deep learning

3. QSAR modeling

Selected excerpts from the main text:

Chemoinformatics is a broad field that encompasses computer science and chemistry with the goal of utilizing information technology to solve problems in the field of chemistry such as chemical information retrieval and extraction, compound database searching and molecular graph mining.
To understand how the structures of chemicals influence their biological activities, it is imperative to review the foundations of chemical graph theory.
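As a rough illustration of the graph view of a molecule, the sketch below stores atoms as nodes and bonds as edges and derives a simple graph invariant (the number of heavy-atom neighbors per atom). The molecule (ethanol with implicit hydrogens), the atom indexing and the helper names `adjacency_list` and `degrees` are assumptions made for this example and do not come from the paper.

```python
# Hypothetical example: the heavy-atom graph of ethanol (CH3-CH2-OH).
# Atoms are nodes, bonds are edges; hydrogens are left implicit.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1), (1, 2)]  # pairs of atom indices, all single bonds


def adjacency_list(atoms, bonds):
    """Build an undirected adjacency list from a bond list."""
    neighbors = {idx: [] for idx in atoms}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def degrees(neighbors):
    """Count heavy-atom neighbors per atom, a very simple graph invariant."""
    return {idx: len(nbrs) for idx, nbrs in neighbors.items()}


graph = adjacency_list(atoms, bonds)
print(graph)
print(degrees(graph))
```

Counts like these are about the simplest values that can be read off a molecular graph; real descriptor sets are far richer.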
Chemical descriptors are numerical features extracted from chemical structures for molecular data mining, compound diversity analysis and compound activity prediction.
3D chemical descriptors extract chemical features from 3D coordinate representations and are considered the most sensitive to structural variations.
Chemical fingerprints are high-dimensional vectors, commonly used in chemometric analysis and similarity-based virtual screening applications, the elements of which are chemical descriptor values.
Chemical similarity search is a fundamental technique for ligand-based drug discovery.
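A minimal sketch of a fingerprint-based similarity search follows. It assumes binary fingerprints stored as sets of "on" bit positions and uses the Tanimoto (Jaccard) coefficient, a measure widely used for this purpose; the excerpt above does not name a specific measure. The compound names, bit positions and the 0.5 threshold are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similarity_search(query, library, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up fingerprints: each set lists the bit positions that are switched on.
library = {
    "cmpd_A": {1, 4, 7, 9},
    "cmpd_B": {1, 4, 8},
    "cmpd_C": {2, 3, 5},
}
query = {1, 4, 7}

print(similarity_search(query, library))
```

Here `cmpd_A` shares three of four bits with the query and ranks first, while `cmpd_C` shares none and is filtered out.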
Chemical similarity can also be evaluated based on 3D structural features of compounds.
The matched molecular pairs (MMP) formalism has emerged as a way to define a specific type of transformation or relationship, namely non-ring single-bond substitutions, and to facilitate the development of methods for indexing and searching analog relationships.
Machine learning techniques can be broadly classified as supervised or unsupervised learning.
Naive Bayes classifiers are probabilistic models based on Bayes' rule.
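To make the use of Bayes' rule concrete, here is a small Bernoulli naive Bayes sketch over binary fingerprints. It assumes that bits are conditionally independent given the class (the "naive" assumption) and uses Laplace smoothing with `alpha=1.0`. The 4-bit fingerprints, the labels and the function names are invented for illustration and are not the paper's implementation.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-bit probabilities with Laplace smoothing."""
    n_bits = len(X[0])
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        prior = len(rows) / len(X)
        # P(bit_j = 1 | class), smoothed so no probability is exactly 0 or 1
        p_on = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                for j in range(n_bits)]
        model[label] = (prior, p_on)
    return model


def predict(model, x):
    """Pick the class with the highest log-posterior (Bayes' rule up to a constant)."""
    def log_score(prior, p_on):
        score = math.log(prior)
        for bit, p in zip(x, p_on):
            score += math.log(p if bit else 1.0 - p)
        return score
    return max(model, key=lambda label: log_score(*model[label]))


# Made-up 4-bit fingerprints: actives tend to set bit 0, inactives bit 3.
X = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0],
     [0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
y = ["active", "active", "active", "inactive", "inactive", "inactive"]

model = train_bernoulli_nb(X, y)
print(predict(model, [1, 0, 0, 0]))
print(predict(model, [0, 0, 1, 1]))
```

Working in log space avoids numerical underflow when fingerprints have hundreds or thousands of bits.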
Regression analysis can refer to linear regression modeling for continuous data or logistic regression analysis for categorical data.
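The two flavors can be sketched in a few lines. The first part fits a one-descriptor linear model by ordinary least squares. The second part only shows how logistic regression turns a linear score into a class probability; its coefficients `w` and `b` are assumed for illustration and are not fitted. All data values are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one descriptor: activity ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def logistic(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up data: one descriptor value per compound and a continuous activity.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))

# Logistic regression applies the logistic function to a linear score.
# The coefficients below are assumed for illustration, not fitted.
w, b = 2.0, -5.0
prob_active = logistic(w * 3.0 + b)
print(prob_active > 0.5)
```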
In kNN, the data containing labeled and unlabeled nodes are represented in a high-dimensional feature space and the labels from the closest nodes are transferred to the query using a majority-voting rule.
There is no principled way of choosing the number of nearest neighbors to use, and values of k that are too high or too low can yield unfavorable false-positive or false-negative rates.
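The majority-voting rule and its sensitivity to k can be seen in a toy example. The sketch below assumes Euclidean distance on two made-up descriptors; the data points and labels are invented so that the prediction for the query changes between k=1 and k=3.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up 2D descriptor vectors with activity labels.
train = [
    ((1.0, 1.0), "active"),
    ((1.2, 0.8), "active"),
    ((2.0, 2.1), "active"),
    ((3.0, 3.0), "inactive"),
    ((3.2, 2.9), "inactive"),
    ((2.9, 3.3), "inactive"),
]
query = (2.1, 2.2)

for k in (1, 3, 5):
    print(k, knn_predict(train, query, k=k))
```

With k=1 the single closest point is an active, but with k=3 two inactives outvote it. In practice k is usually tuned empirically, for example by cross-validation.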
Random forest is an ensemble learning method where multiple decision trees are built based on the training data and a majority voting scheme similar to kNN is used to make classification or regression predictions for new inputs.
SVMs solve the classification problem by using nonlinear kernel functions to map data into a high-dimensional space and then finding an optimally separating hyperplane in that space.
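The kernel idea can be illustrated without training a full SVM. The sketch below assumes a degree-2 polynomial kernel and checks that it equals a dot product in an explicit 3D feature space. It then shows that XOR-like points, which no straight line separates in 2D, are split by a plane in that space. The plane `w` is picked by hand, so this is not a margin-maximizing SVM solver, and all points are made up.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2 for 2D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(x, z), dot(phi(x), phi(z)))  # equal up to rounding

# XOR-like toy data: not linearly separable in the original 2D space.
points = [((1.0, 1.0), 1), ((-1.0, -1.0), 1), ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]

# Hand-picked separating plane in feature space (an assumption, not a fitted SVM).
w = (0.0, 1.0, 0.0)
predictions = [1 if dot(w, phi(p)) > 0 else -1 for p, _ in points]
print(predictions == [label for _, label in points])
```

An actual SVM never builds `phi` explicitly; it only evaluates the kernel between pairs of training points, which is what makes very high-dimensional feature spaces tractable.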
Artificial neural networks (ANNs) are a family of machine learning algorithms, inspired by the operations of neurons in the brain.
Deep learning networks are a recent extension of ANNs, which utilize deep and specialized architectures to learn useful features from raw data.
Graph convolutional networks (GCNs) are variants of CNNs that have been commonly applied to 2D molecular graph analysis.
Recurrent neural networks (RNNs) are another major family of deep neural networks that have been widely used in natural language processing.
The general protocol for constructing QSAR models for drug discovery has been systematized and consists of several modular steps involving the chemoinformatics and machine learning techniques previously discussed.