Extracting chemical/molecular structures from images in batch with Python (code examples)

Latest recommended article published on 2022-12-31 00:19:23

鸾镜朱颜暗换

Category: python. Tags: python, selenium, artificial intelligence

Article link: https://blog.csdn.net/qq_34769162/article/details/118612037

If anything is unclear, leave a comment.

Chemical structure recognition

example1

example2

For readers with the interest and time, I first recommend this paper:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00465-0
It introduces and evaluates many online and offline tools.

Recommended easy-to-use online platforms

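I have personally used two online platforms.

https://cactus.nci.nih.gov/cgi-bin/osra/index.cgi (https://cactus.nci.nih.gov/osra/)
This is a widely used online OSRA service that converts an image into an SD file or a SMILES string. Here is an example:
https://molvec.ncats.io/#
This is a web front end built by the MolVec authors that integrates MolVec, OSRA, and Imago. Again, an example: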

Extracting molecular structures from images with Python, method 1

I could not find a public API for OSRA, so I use Selenium to drive a browser and click through the page. Here is the code I wrote:

import time
import os
import json

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def upload(img_path):
    # Absolute XPaths copied from the OSRA page; they break if the page layout changes.
    select_xpath = '/html/body/center/form/table/tbody/tr[3]/td[1]/input[2]'
    submit_xpath = '//*[@id="b_upload"]'
    clear_xpath = '/html/body/center/form/table/tbody/tr[3]/td[1]/center/input[2]'

    # Clear the previous upload, choose the new file, then submit it.
    # find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium releases removed.
    firefox.find_element(By.XPATH, clear_xpath).click()
    firefox.find_element(By.XPATH, select_xpath).send_keys(img_path)
    firefox.find_element(By.XPATH, submit_xpath).click()

def get_information():
    get_smiles_xpath = '//*[@id="b_getsmiles"]'
    smiles_xpath = '/html/body/center/form/table/tbody/tr[3]/td[2]/input[1]'

    # Ask the page for the SMILES string and read it from the text field.
    firefox.find_element(By.XPATH, get_smiles_xpath).click()
    text = firefox.find_element(By.XPATH, smiles_xpath).get_attribute("value")

    return text

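def main():
    global firefox
    firefox = webdriver.Firefox()
    firefox.get('https://cactus.nci.nih.gov/cgi-bin/osra/index.cgi')

    img_folder = 'your_img_folder'
    os.makedirs('res', exist_ok=True)  # result screenshots are saved here
    imgs76 = os.listdir(img_folder)
    smiles_list = {}  # maps image file name -> SMILES string
    for img in imgs76:
        # The file input needs an absolute path; os.path keeps this portable across systems.
        upload(os.path.abspath(os.path.join(img_folder, img)))
        time.sleep(7)  # give the server time to process the image
        try:
            tmp_text = get_information()
            # splitext removes only the extension; rstrip('.jpg') strips characters, not a suffix.
            firefox.save_screenshot(os.path.join('res', os.path.splitext(img)[0] + '.png'))
            smiles_list[img] = tmp_text
        except Exception:
            smiles_list[img] = 'Sorry, no structures found'

        print(smiles_list)
    firefox.quit()

    with open('result.json', 'w') as fp:
        json.dump(smiles_list, fp)

if __name__ == '__main__':
    main()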

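The code above opens your target folder (which contains a set of molecular structure screenshots) and then drives the browser to process the images in batch. It waits seven seconds for each image and saves a screenshot of the result.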
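Once the batch run finishes, result.json mixes real SMILES strings with the placeholder 'Sorry, no structures found' that the script stores on failure. The following is a minimal sketch for separating the two. The helper names split_results and load_results are hypothetical and not part of the original script, and treating an empty string as a failure is an assumption about what the page returns when nothing is recognized.

```python
import json

# Placeholder string the scraping script stores when extraction fails.
NO_STRUCTURE = 'Sorry, no structures found'


def split_results(results):
    """Split {image name: SMILES} into (recognized dict, sorted failed names)."""
    recognized = {}
    failed = []
    for img, smiles in results.items():
        text = (smiles or '').strip()
        # Assumption: an empty value also means nothing was recognized.
        if text and text != NO_STRUCTURE:
            recognized[img] = text
        else:
            failed.append(img)
    return recognized, sorted(failed)


def load_results(path='result.json'):
    """Read the JSON file written by the batch script."""
    with open(path, encoding='utf-8') as fp:
        return json.load(fp)
```

Calling split_results(load_results()) after a run gives the images worth keeping and a list of those that may need a cleaner screenshot or a second attempt.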

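Note that you need to download a WebDriver; either the Firefox driver (geckodriver) or ChromeDriver works. The code uses the Firefox driver via firefox = webdriver.Firefox(). The drivers are easy to find with a web search.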

If you need to save the SD file, have the script click Get SD File as well. Feel free to ask me in the comments.

Extracting molecular structures from images with Python, method 2

import requests
import os

'''
This script converts molecule images to mol format (saved as an SDF file).
Change the image folder (img_folder) and the result file name (name), both set in main().
'''

def get_sdf(name, img_folder):

    imgs76 = os.listdir(img_folder)

    url = 'https://molvec.ncats.io/molvec'
    headers = {'Content-Type' : 'image/jpg'}

    for img in imgs76:
        # Send the raw image bytes as the request body.
        with open(os.path.join(img_folder, img), 'rb') as fp:
            r = requests.post(url, data=fp, headers=headers)

        # Append the returned molfile to <name>.sdf, one record per image.
        with open('{}.sdf'.format(name), 'a') as fp:
            fp.write(r.json()['molvec']['molfile'])
            fp.write('\n$$$$\n')

def main():
    name = 'abc'
    img_folder = "***"
    get_sdf(name, img_folder)

if __name__ == '__main__':
    main()
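
The script above appends every molfile to one file and ends each record with a line containing only $$$$. As a quick sanity check, the sketch below splits such a file back into individual records so you can count them against the number of input images. The function name split_sdf is hypothetical, and the code only looks at the $$$$ delimiter; it does not validate the molfile contents.

```python
def split_sdf(text):
    """Return the non-empty records of an SDF string, split on '$$$$' lines."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == '$$$$':
            block = '\n'.join(current)
            if block.strip():
                records.append(block)
            current = []
        else:
            # Keep lines verbatim; leading blank header lines are part of a molfile.
            current.append(line)
    # Keep a trailing record that lacks the final delimiter.
    tail = '\n'.join(current)
    if tail.strip():
        records.append(tail)
    return records
```

For example, reading abc.sdf and calling len(split_sdf(contents)) should give the number of images that were processed; a lower count suggests some requests returned an empty molfile.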