from rdkit import Chem
from rdkit.Chem import AllChem
def ClusterFps(fps,cutoff=0.2):
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina
# first generate the distance matrix:
dists = []
nfps = len(fps)
for i in range(1,nfps):
sims = DataStructs.BulkTanimotoSimilarity(fps[i],fps[:i])
dists.extend([1-x for x in sims])
# now cluster the data:
cs = Butina.ClusterData(dists,nfps,cutoff,isDistData=True)
return cs
RDKit实现了Butina聚类算法,基于文献 `Unsupervised Database Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way to Cluster Small and Large Data Sets', JCICS, 39, 747-750 (1999)`
上面定义好了函数,接下来读取化合物。
ms = [x for x in Chem.SDMolSupplier('ZINC_example.sdf')] print (len(ms)) output: 10000
计算指纹并聚类,cutoff是指Tanimoto 相似度的阈值 ,差异度大于cutoff的化合物会被分到不同的类别中。
fps = [AllChem.GetMorganFingerprintAsBitVect(x,2,1024) for x in ms
clusters=ClusterFps(fps,cutoff=0.4)
print(clusters[100])
output: (553, 297, 547)
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
m1 = ms[553]
m2 = ms[297]
m3 = ms[547]
mols=(m1,m2,m3)
Draw.MolsToGridImage(mols)
cluster[100] 中有三个分子,分别是553,297,547号。使用Draw函数绘制如下:
RDkit的聚类算法可以支持10万级别的化合物库,然而在内存为125 GB的服务器上测试20万分子的聚类时,可以看到随着聚类的时间增长,占用的内存也不停增加,最终在半个小时后,聚类20万分子由于内存溢出而崩溃。