NLP Is Too Cutthroat, So I Went Off to Study Proteins~

夕小瑶

Published 2020-11-12 22:20:00


Article link: https://blog.csdn.net/xixiaoyaoww/article/details/109665004


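Why "words" were left out: in essence, a word is a piece of information with a simple meaning that can recur at high frequency, while a sentence is information that, after successive disambiguation by multiple words, ends up carrying a specific, directed meaning. From the gene perspective, large fragments correspond to sentences, and the sub-segments obtained by splitting those fragments further play the role of words; a codon (every three nucleotides) corresponds to one amino acid, so it is still essentially a letter. From the protein perspective, the fairly regular sheets and helices that hydrogen bonds produce in secondary structure can be regarded as words, and only a protein that can carry out a specific function deserves to be called a sentence.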
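The "codon as letter" idea can be made concrete with a small sketch. The Python below is illustrative only: the codon table is a deliberately partial excerpt of the standard genetic code, and the function names `translate` and `kmers` are hypothetical. Splitting the result into overlapping 3-letter chunks is an assumption made here as one simple way to get word-like units; it is not the secondary-structure notion of "word" described above.

```python
from typing import List

# Partial codon table: a handful of entries from the standard genetic code,
# included for illustration only (a full table has 64 codons).
CODON_TO_AMINO_ACID = {
    "ATG": "M",  # methionine (start)
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop
}


def translate(dna: str) -> str:
    """Read non-overlapping codons and map each one to a single amino-acid letter."""
    letters = []
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing incomplete codon
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3]
        # "?" marks codons that are missing from this partial table.
        amino = CODON_TO_AMINO_ACID.get(codon, "?")
        if amino == "*":  # stop codon: translation ends here
            break
        letters.append(amino)
    return "".join(letters)


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-letter chunks (assumed word-like units)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


protein = translate("ATGTTTGGCAAATGGTAA")
print(protein)         # MFGKW
print(kmers(protein))  # ['MFG', 'FGK', 'GKW']
```

Each codon collapses to exactly one letter, which is why the codon level adds no new "vocabulary"; anything word-like has to come from grouping letters, whether by fixed-size chunks as sketched here or by structural motifs.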

References

1. The theoretical foundation. The idea is important, but the argument is not well made:
Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler‐Doty, M., & Grzybowski, B. A. (2014). Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie International Edition, 53(31), 8108-8112.
2. A review. The table linking NLP methods to application areas is quite valuable:
Öztürk, H., Özgür, A., Schwaller, P., Laino, T., & Ozkirimli, E. (2020). Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discovery Today, 25(4), 689-705.
3. The first to propose the concepts of Protein Vector (ProtVec) and Gene Vector (GeneVec):
Asgari, E., & Mofrad, M. R. K. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11), 1–15.
4. Combining proteins with word embeddings:
Bepler, T., & Berger, B. (2019). Learning protein sequence embeddings using information from structure. 7th International Conference on Learning Representations, ICLR 2019, 1–17.
5. Although the comic treats the Seq2Seq model that Schwaller published in 2018 (accepted by a journal and performing well, see reference 6) as the first successful application of this method to biomolecules, papers in this area generally cite this one as where the whole story begins. It was a homework project by two Korean high-school students, and getting this far is truly impressive:
Nam, J., & Kim, J. (2016). Linking the neural machine translation and the prediction of organic chemistry reactions. arXiv preprint arXiv:1612.09529.
6. The best Seq2Seq work:
Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C., & Laino, T. (2018). "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science, 9(28), 6091-6098.
7. Another fairly valuable Seq2Seq paper:
Karimi, M., Wu, D., Wang, Z., & Shen, Y. (2019). DeepAffinity: Interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks. Bioinformatics, 35(18), 3329–3338.
8. A BERT application with a beautiful title and a beautiful intro, but content that is not especially impressive:
Vig, J., Madani, A., Varshney, L. R., Xiong, C., Socher, R., & Rajani, N. F. (2020). Bertology meets biology: Interpreting attention in protein language models. arXiv preprint arXiv:2006.15222.

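卖萌屋 author: 白鹡鸰

白鹡鸰 (jí líng, the white wagtail) is a migratory bird whose nature is to range across many fields. It has roosted at Shanghai Jiao Tong University for four years and currently feeds on image semantics, but it is privately also very interested in natural language and enjoys flying around freely in the relaxed yet rigorous atmosphere of 卖萌屋.

Because I have only just started my Ph.D., my papers are all still satellites up in the sky, but they will formally meet you as soon as possible! My Zhihu ID is also 白鹡鸰; feel free to drop by.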
