Spec2Vec快速入门

前言

前言:自己最近在做的工作是LC-MS解谱相关的工作,PLoS Computational Biology在今年2月发表了一篇题为 Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships 的文章。该文章介绍了一种新的计算质谱相似度的方法。新的算法Spec2Vec是基于自然语言处理中Word2Vec改进而来,与传统的余弦相似度相比,Spec2Vec提高了质谱相似度与结构相似度之间的相关性。楼主自己也刚刚了一下这个库的用法,尚不精通,如果有同行的话欢迎来交流学习。


matchms

matchms也是同一个作者开发的用于处理质谱的Python工具包,该部分内容来源于文章作者Florian Huber的博客:
https://blog.esciencecenter.nl/build-your-own-mass-spectrometry-analysis-pipeline-in-python-using-matchms-part-i-d96c718c68ee

matchms包直接使用pip安装即可

pip install matchms


导入和导出mgf文件步骤

from matchms.importing import load_from_mgf

spectrums=list(load_from_mgf("filename.mgf")) # 直接返回是一个generator,转为list

spectrums[0].peaks.mz # 获取mz,是一个numpy.ndarray
spectrums[0].peaks.intensities # 获取intensity
spectrums[0].metadata # 获取其他的信息,是以字典的形式来存的


from matchms.exporting import save_as_json
from matchms.exporting import save_as_mgf

save_as_mgf(spectrums,"filename.mgf")
save_as_json(spectrums,"filename.json")

里面用到的一个范例mgf文件可以在如下网址下载得到,也不一定要用他的,就随便一个mgf就可以
https://gnps-external.ucsd.edu/gnpslibrary/GNPS-NIH-NATURALPRODUCTSLIBRARY.mgf



处理metadata常用的步骤

from matchms.filtering import default_filters
from matchms.filtering import repair_inchi_inchikey_smiles
from matchms.filtering import derive_inchikey_from_inchi
from matchms.filtering import derive_smiles_from_inchi
from matchms.filtering import derive_inchi_from_smiles
from matchms.filtering import harmonize_undefined_inchi
from matchms.filtering import harmonize_undefined_inchikey
from matchms.filtering import harmonize_undefined_smiles
def metadata_processing(spectrum):
    spectrum = default_filters(spectrum)
    spectrum = repair_inchi_inchikey_smiles(spectrum)
    spectrum = derive_inchi_from_smiles(spectrum)
    spectrum = derive_smiles_from_inchi(spectrum)
    spectrum = derive_inchikey_from_inchi(spectrum)
    spectrum = harmonize_undefined_smiles(spectrum)
    spectrum = harmonize_undefined_inchi(spectrum)
    spectrum = harmonize_undefined_inchikey(spectrum)
    return spectrum


过滤碎片中的噪音常用步骤

from matchms.filtering import default_filters
from matchms.filtering import normalize_intensities
from matchms.filtering import select_by_intensity
from matchms.filtering import select_by_mz
def peak_processing(spectrum):
    spectrum = default_filters(spectrum)
    spectrum = normalize_intensities(spectrum)
    spectrum = select_by_intensity(spectrum, intensity_from=0.01)
    spectrum = select_by_mz(spectrum, mz_from=10, mz_to=1000)
    return spectrum

# normalize_intensities()可以把强度从绝对强度转化为0到1的相对强度
# select_by_intensity()只选择强度大于指定值的碎片,其他过滤掉


计算余弦相似度

from matchms import calculate_scores
from matchms.similarity import CosineGreedy
similarity_measure = CosineGreedy(tolerance=0.005)
scores = calculate_scores(spectrums, spectrums, similarity_measure, is_symmetric=True)
# 因为cos(a,b)跟cos(b,a)是一样的,这里设置是对称的减少一半计算量

# 返回的scores是一个Score对象,真正的分数存储在scores.scores中,也是一个ndarray,是一个二维矩阵,以第一行为例
>>> scores.scores[0]
array([(1.00000000e+00, 146), (1.38193083e-02,  10),
       (9.34809152e-01,  16), (3.16782638e-04,  10),
       (9.73884325e-01,   4), (3.96869959e-03,  13),
       (9.80994275e-03,   6), (1.28448202e-03,   7),
       (2.70988585e-03,   9), (1.51188680e-02,  20)],
      dtype=[('score', '<f8'), ('matches', '<i4')])
>>> scores.scores[0][0]['score']
1.0


根据相似度对结果排序

best_matches = scores.scores_by_query(spectrums[5], sort=True)
# 返回结果是个二元组
>>> best_matches[:5]
[(<matchms.Spectrum.Spectrum object at 0x000002971CE89F08>, (1., 146)), (<matchms.Spectrum.Spectrum object at 0x000002971CEABCC8>, (0.97388432, 4)), (<matchms.Spectrum.Spectrum object at 0x000002971CE8DD88>, (0.93480915, 16)), (<matchms.Spectrum.Spectrum object at 0x000002971CEB2D48>, (0.01511887, 20)), (<matchms.Spectrum.Spectrum object at 0x000002971CE95DC8>, (0.01381931, 10))]


计算改良cosine

from matchms.similarity import ModifiedCosine
similarity_measure = ModifiedCosine(tolerance=0.005)
scores = calculate_scores(spectrums, spectrums, similarity_measure,is_symmetric=True)

改良Cosine法的文献来源:
Mass spectral molecular networking of living microbial colonies


Spec2Vec

下面是spec2vec的教程,来自Florian Huber的另一篇博客:
https://blog.esciencecenter.nl/build-a-mass-spectrometry-analysis-pipeline-in-python-using-matchms-part-ii-spec2vec-8aa639571018

同样是使用简单的pip安装即可

pip install genism
pip install spec2vec

Note:
直接安装的spec2vec有几个小bugs要改一下,直接使用会报错,详见楼主另一篇博客:
Spec2Vec中的bugs


导入文章作者预训练好的模型

import gensim
filename = "spec2vec_AllPositive_ratio05_filtered_201101_iter_15.model"
model=gensim.models.Word2Vec.load(filename)

预训练模型下载地址:
https://zenodo.org/record/4173596



>>> next(iter(model.wv.key_to_index)) 
'peak@289.29'

每个单独的词就长这样了,这里也说明预训练模型保留的碎片到小数点后两位
这里原教程用的是model.wv.vocab,gensim更新后vocab属性改为了key_to_index



’计算spec2vec质谱相似度

from spec2vec import Spec2Vec
spec2vec_similarity = Spec2Vec(model=model,
                               intensity_weighting_power=0.5,
                               allowed_missing_percentage=5.0)

allowed_missing_percentage可能需要不断调整,默认是5.0,可能会太低,个人觉得调到二三十甚至更高都无所谓。这个参数的意思是允许有百分之多少的待检测碎片峰在训练集中是缺失的。

from matchms import calculate_scores
scores = calculate_scores(spectrums, spectrums, spec2vec_similarity,
                          is_symmetric=True)


训练一个新的spec2vec模型

from spec2vec import SpectrumDocument
spectrum_documents = [SpectrumDocument(s, n_decimal=2) for s in spectrums]
# 首先要将质谱参考库训练成一个"语料库文档"


from spec2vec.model_building import train_new_word2vec_model
model_file = "output_model.model"
model = train_new_word2vec_model(spectrum_documents, iterations=[30], filename=model_file,
                                 workers=2, progress_logger=True)
# 训练一个新模型,workers是用多少个核来算,iterations就是epochs

model2 = gensim.models.Word2Vec.load("output_model.model")
# 导入之前已经训练过的模型,就不用每次都重新训练

完整demo

下面是将预处理,训练模型,计算spec2vec相似度三个部分组合起来的一个demo,来源spec2vec的参考文档:
https://spec2vec.readthedocs.io/en/latest/index.html

from matchms.filtering import add_losses
from matchms.filtering import add_parent_mass
from matchms.filtering import default_filters
from matchms.filtering import normalize_intensities
from matchms.filtering import reduce_to_number_of_peaks
from matchms.filtering import require_minimum_number_of_peaks
from matchms.filtering import select_by_mz
from matchms.importing import load_from_mgf
from spec2vec import SpectrumDocument
from spec2vec.model_building import train_new_word2vec_model

def spectrum_processing(s):
    """This is how one would typically design a desired pre- and post-
    processing pipeline."""
    s = default_filters(s)
    s = add_parent_mass(s)
    s = normalize_intensities(s)
    s = reduce_to_number_of_peaks(s, n_required=10, ratio_desired=0.5, n_max=500)
    s = select_by_mz(s, mz_from=0, mz_to=1000)
    s = add_losses(s, loss_mz_from=10.0, loss_mz_to=200.0)
    s = require_minimum_number_of_peaks(s, n_required=10)
    return s

# Load data from MGF file and apply filters
reference_spectrums_mgf="your-reference-database.mgf"
spectrums = [spectrum_processing(s) for s in load_from_mgf(reference_spectrums_mgf)]

# Omit spectrums that didn't qualify for analysis
spectrums = [s for s in spectrums if s is not None]

# Create spectrum documents
reference_documents = [SpectrumDocument(s, n_decimals=2) for s in spectrums]

model_file = "references.model"
model = train_new_word2vec_model(reference_documents, iterations=[10,20,30], filename=model_file, progress_logger=True,workers=2)
# 原demo文件中的train_new_word2vec_model()调用时传参有误,此处已修正
print("Model training finished!")


'''上面部分已经完成了模型训练,下面进行相似度计算'''


import gensim
from matchms import calculate_scores
from spec2vec import Spec2Vec

query_spectrums_mgf="your-query-spectrums.mgf"
query_spectrums = [spectrum_processing(s) for s in load_from_mgf(query_spectrums_mgf)]

# Omit spectrums that didn't qualify for analysis
query_spectrums = [s for s in query_spectrums if s is not None]

# Import pre-trained word2vec model (see code example above)
model_file = "references.model"
model = gensim.models.Word2Vec.load(model_file)

# Define similarity_function
spec2vec = Spec2Vec(model=model, intensity_weighting_power=0.5,
                    allowed_missing_percentage=30.0)
# allowed_missing_percentage可能需要不断调整,之前默认是5.0,可能会太低
# 这个参数的意思是允许有百分之多少的待检测碎片峰在训练集中是缺失的

# Calculate scores on all combinations of reference spectrums and queries
scores = calculate_scores(reference_documents, query_spectrums, spec2vec)

# Find the highest scores for a query spectrum of interest
# best_matches = scores.scores_by_query(query_documents[0], sort=True)[:10]
best_matches = scores.scores_by_query(query_spectrums[0], sort=True)[:10]

# Return highest scores
print([x[1] for x in best_matches])

最后

如有错误,烦请批评指正。欢迎同行一起来交流学习,共同进步!

  • 2
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 5
    评论
评论 5
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值