I. 2D chemical structure recognition:
0. [2022 review] Review of techniques and models used in optical chemical structure recognition in images and scanned documents
1. Online conversion site: DECIMER Web Application
Its code is at: GitHub - Kohulan/DECIMER_Short_Communication
2. DECIMER [the most accurate in my use so far]
https://github.com/Kohulan/DECIMER-Image_Transformer
I use option 2, DECIMER, here; following the README is enough.
3. Img2Mol: inferring molecules from pictures (also quite good)
GitHub - bayer-science-for-a-better-life/Img2Mol
4. Automated Recognition of Chemical Molecule Images Based on an Improved TNT Model (2022)
No code available.
5. MolScribe [the paper reports higher accuracy than DECIMER, but in my tests DECIMER was still more accurate]
GitHub - thomas0809/MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation
Note: in my tests, no model could reliably tell whether an image-to-SMILES conversion was correct, and I do not know how to judge the confidence between the converted SMILES and the original image (even a very invalid molecule can still be converted into a SMILES string).
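One rough way to get a confidence signal, in the absence of a trusted score, is to run several recognizers on the same image and see whether their outputs agree. The sketch below is only a heuristic under stated assumptions: `agreement` and `canonicalize` are hypothetical names, not part of DECIMER, Img2Mol, or MolScribe. The canonicalizer is passed in so the voting logic does not depend on any toolkit; with RDKit it could be a small function built on Chem.MolFromSmiles and Chem.MolToSmiles that returns None when parsing fails.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Tuple


def agreement(
    predictions: Dict[str, str],
    canonicalize: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], float]:
    """Return the most common canonical SMILES and the fraction of tools backing it.

    predictions maps a tool name to its raw SMILES for one image.
    canonicalize returns a canonical string, or None if the SMILES is unparsable.
    """
    canonical = [canonicalize(smiles) for smiles in predictions.values()]
    valid = [c for c in canonical if c]
    if not valid:
        return None, 0.0
    best, votes = Counter(valid).most_common(1)[0]
    # Unparsable outputs count against agreement, so divide by all tools.
    return best, votes / len(predictions)


def toy_canonicalize(smiles: str) -> Optional[str]:
    """Stand-in for demonstration only; a real one would use a cheminformatics toolkit."""
    return smiles.strip() or None


if __name__ == "__main__":
    print(agreement({"tool_a": "CCO", "tool_b": "CCO ", "tool_c": ""}, toy_canonicalize))
```

A low agreement fraction does not prove a prediction wrong, and full agreement does not prove it right; it only flags images that deserve a manual look.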
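Img2Mol usage:
Option 3 is easier to use than option 2, but it is best run with local-cddd: without the local model it needs a network connection, which can cause many errors.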
conda env create -f environment.local-cddd.yml
conda activate img2mol
pip install .
Then:
If you are working with the local CDDD installation, download and unzip the CDDD model and move the directory default_model to path/to/anaconda3/envs/img2mol/lib/python3.6/site-packages/cddd/data/
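The move step can also be scripted. This is a minimal sketch, assuming the archive has already been unzipped; `install_default_model` is a hypothetical helper, not part of Img2Mol or CDDD, and both paths are supplied by the caller.

```python
import shutil
from pathlib import Path


def install_default_model(unzipped_dir: str, cddd_data_dir: str) -> Path:
    """Move <unzipped_dir>/default_model into <cddd_data_dir> and return the new path."""
    source = Path(unzipped_dir) / "default_model"
    if not source.is_dir():
        raise FileNotFoundError(f"{source} not found; unzip the CDDD model first")
    target_parent = Path(cddd_data_dir)
    target_parent.mkdir(parents=True, exist_ok=True)
    target = target_parent / "default_model"
    if target.exists():
        # Refuse to overwrite an existing model directory.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(source), str(target))
    return target
```

For the second argument, pass the cddd/data/ directory inside the img2mol environment's site-packages, as given in the instruction above.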
DECIMER usage (this is what I currently use):
My own working copy and notes are at: D:\Pycharm_workspace\MolImageGeneration\Uni-Dock\image2smiles\image2smiles.py
conda create --name DECIMER python=3.9
conda activate DECIMER
Install the package:
pip install decimer
Be sure to also install tensorflow==2.10.1; otherwise you get an error saying the GPU cannot be used.
pip install tensorflow==2.10.1
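After installing, it can help to confirm that TensorFlow actually sees the GPU before running a long batch. A minimal sketch: `visible_gpus` is a hypothetical helper name, and the only TensorFlow call used is `tf.config.list_physical_devices`.

```python
def visible_gpus():
    """Return the GPU devices TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif gpus:
        print("TensorFlow sees", len(gpus), "GPU(s):", gpus)
    else:
        print("TensorFlow is installed but sees no GPU")
```

Run it inside the activated DECIMER environment; an empty list means predictions will fall back to the CPU.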
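Straight to the code:
Code for image2smiles2image.py:
input_images: the images generated by the generator G;
image2smiles_all.csv: every SMILES obtained from the images generated by G
image2smiles_validity.csv: the valid SMILES obtained from the images generated by G
image2smiles_unvalidity.csv: the invalid SMILES obtained from the images generated by G
image2smiles2image: the folder of images redrawn from the SMILES obtained from G's images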
command:
python image2smiles.py --input_images_path Tests/xxx --image2smiles2image_save_path Tests/image2smiles2image/ --image2smiles_all Tests/image2smiles_all.csv --image2smiles_validity Tests/image2smiles_validity.csv --image2smiles_unvalidity Tests/image2smiles_unvalidity.csv
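Once the command has finished, the CSV files can be summarized without rerunning the model. This is a small sketch, assuming each file holds one file-name/SMILES row per image; `count_rows` and `validity_rate` are hypothetical helper names.

```python
import csv
from pathlib import Path


def count_rows(csv_path: str) -> int:
    """Count non-empty rows; a missing file counts as zero rows."""
    path = Path(csv_path)
    if not path.exists():
        return 0
    with path.open(newline="", encoding="utf-8") as handle:
        return sum(1 for row in csv.reader(handle) if row)


def validity_rate(all_csv: str, valid_csv: str) -> float:
    """Fraction of all predictions that ended up in the valid-SMILES file."""
    total = count_rows(all_csv)
    return count_rows(valid_csv) / total if total else 0.0


if __name__ == "__main__":
    print(validity_rate("Tests/image2smiles_all.csv", "Tests/image2smiles_validity.csv"))
```

The script opens its CSV files in append mode, so rows accumulate across runs; delete or rename old files first if the rate should reflect a single run.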
code:
Note on paths: # Windows uses "\\" as the separator and Linux uses "/"; make sure the file-name extraction matches your platform
from DECIMER import predict_SMILES
import glob
import os
import csv
from rdkit import Chem
import argparse
from rdkit.Chem import Draw
import time
"""
conda activate DECIMER
## Adenosine_A2a_receptor-sample-1k
python image2smiles/image2smiles.py --input_images_path data/Adenosine_A2a_receptor-sample-1k \
--image2smiles2image_save_path data/image2smiles2image/ \
--image2smiles_all data/image2smiles_all.csv \
--image2smiles_validity data/image2smiles_validity.csv \
--image2smiles_unvalidity data/image2smiles_unvalidity.csv
## Dopamine_D3_receptor-sample-1k
python image2smiles/image2smiles.py --input_images_path data/Dopamine_D3_receptor-sample-1k \
--image2smiles2image_save_path data/image2smiles2image-Dopamine_D3_receptor-sample-1k/ \
--image2smiles_all data/image2smiles_all-Dopamine_D3_receptor-sample-1k.csv \
--image2smiles_validity data/image2smiles_validity-Dopamine_D3_receptor-sample-1k.csv \
--image2smiles_unvalidity data/image2smiles_unvalidity-Dopamine_D3_receptor-sample-1k.csv
"""
# Record the start time
start_time = time.time()
# Parse command-line arguments, then collect the .jpg/.png files under the input folder
parser = argparse.ArgumentParser(description='Testing script', add_help=False)
parser.add_argument('--input_images_path', default='../../eval_output_images/QM9/generator_images', help='Input images folder')
parser.add_argument('--image2smiles2image_save_path', default='../../eval_output_images/QM9/image2smiles2image/')
parser.add_argument('--image2smiles_all', default='../../eval_output_images/QM9/image2smiles_all.csv')
parser.add_argument('--image2smiles_validity', default='../../eval_output_images/QM9/image2smiles_validity.csv')
parser.add_argument('--image2smiles_unvalidity', default='../../eval_output_images/QM9/image2smiles_unvalidity.csv')
args = parser.parse_args()
input_img_path = glob.glob(args.input_images_path + "/*.[jp][pn]g")
image2smiles2image_save_path = args.image2smiles2image_save_path
def mkdir(path):
    folder = os.path.exists(path)
    if not folder:  # Create the folder only if it does not exist yet
        os.makedirs(path)  # makedirs also creates any missing parent directories
        print("--- create new folder... ---")
    else:
        print("--- There is this folder! ---")


mkdir(image2smiles2image_save_path)
i = 0
unrecover_images = 0
for file in input_img_path:
    # os.path.basename uses the separator of the platform the script runs on
    file_name = os.path.basename(file)
    SMILES = predict_SMILES(file)
    i = i + 1
    print("Processing image", i, ":", file, ", predicted SMILES:", SMILES)
    # Save every prediction
    with open(args.image2smiles_all, 'a', newline='', encoding='utf-8') as f_all:
        csv.writer(f_all).writerow([file_name, SMILES])
    try:
        # MolFromSmiles returns None for an unparsable SMILES, so check explicitly
        mol = Chem.MolFromSmiles(SMILES)
        if mol is None:
            raise ValueError("RDKit could not parse the SMILES")
        canonical_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
        img = Draw.MolsToGridImage([mol], molsPerRow=1, subImgSize=(256, 256))
        img.save(os.path.join(image2smiles2image_save_path, file_name))
        # Save valid SMILES
        if canonical_smiles:
            with open(args.image2smiles_validity, 'a', newline='', encoding='utf-8') as f_validity:
                csv.writer(f_validity).writerow([file_name, SMILES])
    except Exception as e:
        # Save invalid SMILES: the SMILES could not be converted back into a structure image
        # (it may contain R groups, which is not necessarily an error)
        unrecover_images = unrecover_images + 1
        print("Image", i, ":", file, ", SMILES:", SMILES, "is invalid:", e)
        with open(args.image2smiles_unvalidity, 'a', newline='', encoding='utf-8') as f_unvalidity:
            csv.writer(f_unvalidity).writerow([file_name, SMILES])
success_rate = (i - unrecover_images) / i if i else 0.0  # avoid dividing by zero when no images were found
print("Total images:", i, ", unrecoverable images:", unrecover_images, ", success rate:", success_rate)
end_time = time.time()
total_time = end_time - start_time
hours = int(total_time // 3600)
minutes = int((total_time % 3600) // 60)
seconds = total_time % 60
print(f"Total run time: {hours} h {minutes} min {seconds:.2f} s")
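The glob pattern `*.[jp][pn]g` used in the script matches .jpg and .png, but it also matches unintended suffixes such as .jng or .ppg, skips .jpeg, and on Linux skips upper-case extensions. A possible alternative is sketched below with pathlib, which also sidesteps the path-separator question because `Path.name` returns the file name on any platform. `collect_images` and `IMAGE_SUFFIXES` are hypothetical names, and including .jpeg is an assumption about which inputs are wanted.

```python
from pathlib import Path
from typing import List

# Assumed set of accepted extensions; adjust to the images actually produced.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def collect_images(folder: str) -> List[Path]:
    """Return image files directly under folder, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


if __name__ == "__main__":
    for image_path in collect_images("."):
        # image_path.name is the bare file name, str(image_path) the full path
        print(image_path.name, str(image_path))
```

Sorting makes the processing order reproducible between runs, which keeps the rows of the output CSV files in a stable order.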