The Next Frontier For Large Language Models Is Biology


Introduction

As many modern observers have noted, the 21st century is the century of biology. Over the past few years, AI systems epitomized by AlphaFold have become essential tools for deciphering the complexity of life, and emerging large language models promise to advance life-science research on multiple levels.

Research areas: large language models, artificial intelligence, AI for Science, protein design, foundation models for biology

Source: 科技世代千高原

Author: Rob Toews

Large language models like GPT-4 have taken the world by storm thanks to their astonishing command of natural language. Yet the most significant long-term opportunity for LLMs will entail an entirely different type of language: the language of biology.

One striking theme has emerged from the long march of research progress across biochemistry, molecular biology and genetics over the past century: it turns out that biology is a decipherable, programmable, in some ways even digital system.

DNA encodes the complete genetic instructions for every living organism on earth using just four variables—A (adenine), C (cytosine), G (guanine) and T (thymine). Compare this to modern computing systems, which use two variables—0 and 1—to encode all the world's digital electronic information. One system is binary and the other is quaternary, but the two have a surprising amount of conceptual overlap; both systems can properly be thought of as digital.
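To make the "digital" claim concrete, here is a toy sketch (an illustration of mine, not from the article): because the DNA alphabet has exactly four letters, each base packs into two bits, so any sequence can be stored losslessly as a binary integer.

```python
# Toy illustration: each DNA base carries exactly two bits of information.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

print(bin(encode("GATTACA")))  # 0b10001111000100
```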

To take another example, every protein in every living being consists of and is defined by a one-dimensional string of amino acids linked together in a particular order. Proteins range from a few dozen to several thousand amino acids in length, with 20 different amino acids to choose from.

This, too, represents an eminently computable system, one that language models are well-suited to learn.

As DeepMind CEO and cofounder Demis Hassabis put it: "At its most fundamental level, I think biology can be thought of as an information processing system, albeit an extraordinarily complex and dynamic one. Just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI."

Large language models are at their most powerful when they can feast on vast volumes of signal-rich data, inferring latent patterns and deep structure that go well beyond the capacity of any human to absorb. They can then use this intricate understanding of the subject matter to generate novel, breathtakingly sophisticated output.

By ingesting all of the text on the internet, for instance, tools like ChatGPT have learned to converse with thoughtfulness and nuance on any imaginable topic. By ingesting billions of images, text-to-image models like Midjourney have learned to produce creative original imagery on demand.

Pointing large language models at biological data—enabling them to learn the language of life—will unlock possibilities that will make natural language and images seem almost trivial by comparison.

What, concretely, will this look like?

In the near term, the most compelling opportunity to apply large language models in the life sciences is to design novel proteins.

Proteins 101

Proteins are at the center of life itself. As prominent biologist Arthur Lesk put it, "In the drama of life at a molecular scale, proteins are where the action is."

Proteins are involved in virtually every important activity that happens inside every living thing: digesting food, contracting muscles, moving oxygen throughout the body, attacking foreign viruses. Your hormones are made out of proteins; so is your hair.

Proteins are so important because they are so versatile. They are able to undertake a vast array of different structures and functions, far more than any other type of biomolecule. This incredible versatility is a direct consequence of how proteins are built.

As mentioned above, every protein consists of a string of building blocks known as amino acids strung together in a particular order. Based on this one-dimensional amino acid sequence, proteins fold into complex three-dimensional shapes that enable them to carry out their biological functions.

A protein's shape relates closely to its function. To take one example, antibody proteins fold into shapes that enable them to precisely identify and target foreign bodies, like a key fitting into a lock. As another example, enzymes—proteins that speed up biochemical reactions—are specifically shaped to bind with particular molecules and thus catalyze particular reactions. Understanding the shapes that proteins fold into is thus essential to understanding how organisms function, and ultimately how life itself works.

Determining a protein's three-dimensional structure based solely on its one-dimensional amino acid sequence has stood as a grand challenge in the field of biology for over half a century. Referred to as the "protein folding problem," it has stumped generations of scientists. One commentator in 2007 described the protein folding problem as "one of the most important yet unsolved issues of modern science."

Deep Learning And Proteins: A Match Made In Heaven

In late 2020, in a watershed moment in both biology and computing, an AI system called AlphaFold produced a solution to the protein folding problem. Built by Alphabet's DeepMind, AlphaFold correctly predicted proteins' three-dimensional shapes to within the width of about one atom, far outperforming any other method that humans had ever devised.

It is hard to overstate AlphaFold's significance. Long-time protein folding expert John Moult summed it up well: "This is the first time a serious scientific problem has been solved by AI."

Yet when it comes to AI and proteins, AlphaFold was just the beginning.

AlphaFold was not built using large language models. It relies on an older bioinformatics construct called multiple sequence alignment (MSA), in which a protein's sequence is compared to evolutionarily similar proteins in order to deduce its structure.

MSA can be powerful, as AlphaFold made clear, but it has limitations.

For one, it is slow and compute-intensive because it needs to reference many different protein sequences in order to determine any one protein's structure. More importantly, because MSA requires the existence of numerous evolutionarily and structurally similar proteins in order to reason about a new protein sequence, it is of limited use for so-called "orphan proteins"—proteins with few or no close analogues. Such orphan proteins represent roughly 20% of all known protein sequences.

Recently, researchers have begun to explore an intriguing alternative approach: using large language models, rather than multiple sequence alignment, to predict protein structures.

"Protein language models"—LLMs trained not on English words but rather on protein sequences—have demonstrated an astonishing ability to intuit the complex patterns and interrelationships between protein sequence, structure and function: say, how changing certain amino acids in certain parts of a protein's sequence will affect the shape that the protein folds into. Protein language models are able to, if you will, learn the grammar or linguistics of proteins.

The idea of a protein language model dates back to the 2019 UniRep work out of George Church's lab at Harvard (though UniRep used LSTMs rather than today's state-of-the-art transformer models).

In late 2022, Meta debuted ESM-2 and ESMFold, one of the largest and most sophisticated protein language models published to date, weighing in at 15 billion parameters. (ESM-2 is the LLM itself; ESMFold is its associated structure prediction tool.)

ESM-2/ESMFold is about as accurate as AlphaFold at predicting proteins' three-dimensional structures. But unlike AlphaFold, it is able to generate a structure based on a single protein sequence, without requiring any structural information as input. As a result, it is up to 60 times faster than AlphaFold. When researchers are looking to screen millions of protein sequences at once in a protein engineering workflow, this speed advantage makes a huge difference. ESMFold can also produce more accurate structure predictions than AlphaFold for orphan proteins that lack evolutionarily similar analogues.
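For readers curious what "single sequence in, structure out" looks like in practice, below is a minimal sketch following the usage documented in Meta's fair-esm repository; treat it as a sketch rather than a definitive recipe, since the exact API can change between releases.

```python
# A minimal sketch of single-sequence structure prediction with ESMFold,
# following the example documented in the fair-esm repository
# (pip install "fair-esm[esmfold]"); the exact API may differ across versions.
import torch
import esm

model = esm.pretrained.esmfold_v1().eval()  # downloads weights on first use; add .cuda() if a GPU is available

# A single amino-acid sequence is the only input: no MSA, no templates.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"  # arbitrary example

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # predicted 3-D coordinates as PDB-format text

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```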

Language models' ability to develop a generalized understanding of the "latent space" of proteins opens up exciting possibilities in protein science.

But an even more powerful conceptual advance has taken place in the years since AlphaFold.

In short, these protein models can be inverted: rather than predicting a protein's structure based on its sequence, models like ESM-2 can be reversed and used to generate totally novel protein sequences that do not exist in nature based on desired properties.
 

Inventing New Proteins

All the proteins that exist in the world today represent but an infinitesimally tiny fraction of all the proteins that could theoretically exist. Herein lies the opportunity.

To give some rough numbers: the total set of proteins that exist in the human body—the so-called "human proteome"—is estimated to number somewhere between 80,000 and 400,000 proteins. Meanwhile, the number of proteins that could theoretically exist is in the neighborhood of 10^1,300—an unfathomably large number, many times greater than the number of atoms in the universe. (To be clear, not all of these 10^1,300 possible amino acid combinations would result in biologically viable proteins. Far from it. But some subset would.)
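As a sanity check on that figure (my back-of-the-envelope arithmetic, assuming the estimate refers to proteins roughly 1,000 amino acids long), 20 choices at each of 1,000 positions lands right in the neighborhood of 10^1,300:

```python
# Back-of-the-envelope check: 20 amino acids, ~1,000 positions.
from math import log10

length = 1000
exponent = length * log10(20)  # 20^1000 = 10^(1000 * log10(20))
print(f"20^{length} ~ 10^{exponent:.0f}")  # 20^1000 ~ 10^1301
```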

Over many millions of years, the meandering process of evolution has stumbled upon tens or hundreds of thousands of these viable combinations. But this is merely the tip of the iceberg.

In the words of Molly Gibson, cofounder of leading protein AI startup Generate Biomedicines: "The amount of sequence space that nature has sampled through the history of life would equate to almost just a drop of water in all of Earth's oceans."

An opportunity exists for us to improve upon nature. After all, as powerful of a force as it is, evolution by natural selection is not all-seeing; it does not plan ahead; it does not reason or optimize in top-down fashion. It unfolds randomly and opportunistically, propagating combinations that happen to work.

Using AI, we can for the first time systematically and comprehensively explore the vast uncharted realms of protein space in order to design proteins unlike anything that has ever existed in nature, purpose-built for our medical and commercial needs.

We will be able to design new protein therapeutics to address the full gamut of human illness—from cancer to autoimmune diseases, from diabetes to neurodegenerative disorders. Looking beyond medicine, we will be able to create new classes of proteins with transformative applications in agriculture, industrials, materials science, environmental remediation and beyond.

Some early efforts to use deep learning for de novo protein design have not made use of large language models.

One prominent example is ProteinMPNN, which came out of David Baker's world-renowned lab at the University of Washington. Rather than using LLMs, the ProteinMPNN architecture relies heavily on protein structure data in order to generate novel proteins.

The Baker lab more recently published RFdiffusion, a more advanced and generalized protein design model. As its name suggests, RFdiffusion is built using diffusion models, the same AI technique that powers text-to-image models like Midjourney and Stable Diffusion. RFdiffusion can generate novel, customizable protein "backbones"—that is, proteins' overall structural scaffoldings—onto which sequences can then be layered.

Structure-focused models like ProteinMPNN and RFdiffusion are impressive achievements that have advanced the state of the art in AI-based protein design. Yet we may be on the cusp of a new step-change in the field, thanks to the transformative capabilities of large language models.

Why are language models such a promising path forward compared to other computational approaches to protein design? One key reason: scaling.
 

Scaling Laws

One of the key forces behind the dramatic recent progress in artificial intelligence is so-called "scaling laws": the fact that almost unbelievable improvements in performance result from continued increases in LLM parameter count, training data and compute.

At each order-of-magnitude increase in scale, language models have demonstrated remarkable, unexpected, emergent new capabilities that transcend what was possible at smaller scales.
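The article states the scaling trend qualitatively. For reference, the empirical form reported for natural-language models by Kaplan et al. (2020) is a power law in parameter count N (this formula comes from that paper, not from the article, and whether protein language models share the fitted exponent is a separate empirical question):

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}$$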

It is OpenAI's commitment to the principle of scaling, more than anything else, that has catapulted the organization to the forefront of the field of artificial intelligence in recent years. As they moved from GPT-2 to GPT-3 to GPT-4 and beyond, OpenAI has built larger models, deployed more compute and trained on larger datasets than any other group in the world, unlocking stunning and unprecedented AI capabilities.

How are scaling laws relevant in the realm of proteins?

Thanks to scientific breakthroughs that have made gene sequencing vastly cheaper and more accessible over the past two decades, the amount of DNA and thus protein sequence data available to train AI models is growing exponentially, far outpacing protein structure data.

Protein sequence data can be tokenized and for all intents and purposes treated as textual data; after all, it consists of linear strings of amino acids in a certain order, like words in a sentence. Large language models can be trained solely on protein sequences to develop a nuanced understanding of protein structure and biology.
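Below is a minimal sketch of what such tokenization can look like (a scheme invented here for illustration; production models like ESM-2 use a comparable per-residue vocabulary plus special tokens for masking and padding):

```python
# Simplest possible protein tokenizer: one token per canonical amino acid,
# just as a character-level language model tokenizes text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(sequence: str) -> list[int]:
    """Map an amino-acid string to integer token ids."""
    return [TOKEN_TO_ID[aa] for aa in sequence]

print(tokenize("MKTAYIA"))  # [10, 8, 16, 0, 19, 7, 0]
```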

This domain is thus ripe for massive scaling efforts powered by LLMs—efforts that may result in astonishing emergent insights and capabilities in protein science.

The first work to use transformer-based LLMs to design de novo proteins was ProGen, published by Salesforce Research in 2020. The original ProGen model was 1.2 billion parameters.

Ali Madani, the lead researcher on ProGen, has since founded a startup named Profluent Bio to advance and commercialize the state of the art in LLM-driven protein design.

While he pioneered the use of LLMs for protein design, Madani is also clear-eyed about the fact that, by themselves, off-the-shelf language models trained on raw protein sequences are not the most powerful way to tackle this challenge. Incorporating structural and functional data is essential.

"The greatest advances in protein design will be at the intersection of careful data curation from diverse sources and versatile modeling that can flexibly learn from that data," Madani said. "This entails making use of all high-signal data at our disposal—including protein structures and functional information derived from the laboratory."

Another intriguing early-stage startup applying LLMs to design novel protein therapeutics is Nabla Bio. Spun out of George Church's lab at Harvard and led by the team behind UniRep, Nabla is focused specifically on antibodies. Given that 60% of all protein therapeutics today are antibodies and that the two highest-selling drugs in the world are antibody therapeutics, it is hardly a surprising choice.

Nabla has decided not to develop its own therapeutics but rather to offer its cutting-edge technology to biopharma partners as a tool to help them develop their own drugs.

Expect to see much more startup activity in this area in the months and years ahead as the world wakes up to the fact that protein design represents a massive and still underexplored field to which to apply large language models' seemingly magical capabilities.
 

The Road Ahead

In her acceptance speech for the 2018 Nobel Prize in Chemistry, Frances Arnold said: "Today we can for all practical purposes read, write, and edit any sequence of DNA, but we cannot compose it. The code of life is a symphony, guiding intricate and beautiful parts performed by an untold number of players and instruments. Maybe we can cut and paste pieces from nature's compositions, but we do not know how to write the bars for a single enzymic passage."
 
As recently as five years ago, this was true.

But AI may give us the ability, for the first time in the history of life, to actually compose entirely new proteins (and their associated genetic code) from scratch, purpose-built for our needs. It is an awe-inspiring possibility.

These novel proteins will serve as therapeutics for a wide range of human illnesses, from infectious diseases to cancer; they will help make gene editing a reality; they will transform materials science; they will improve agricultural yields; they will neutralize pollutants in the environment; and so much more that we cannot yet even imagine.

The field of AI-powered—and especially LLM-powered—protein design is still nascent and unproven. Meaningful scientific, engineering, clinical and business obstacles remain. Bringing these new therapeutics and products to market will take years.

Yet over the long run, few market applications of AI hold greater promise.

In future articles, we will delve deeper into LLMs for protein design, including exploring the most compelling commercial applications for the technology as well as the complicated relationship between computational outcomes and real-world wet lab experiments.

Let's end by zooming out. De novo protein design is not the only exciting opportunity for large language models in the life sciences.

Language models can be used to generate other classes of biomolecules, notably nucleic acids. A buzzy startup named Inceptive, for example, is applying LLMs to generate novel RNA therapeutics.

Other groups have even broader aspirations, aiming to build generalized "foundation models for biology" that can fuse diverse data types spanning genomics, protein sequences, cellular structures, epigenetic states, cell images, mass spectrometry, spatial transcriptomics and beyond.

The ultimate goal is to move beyond modeling an individual molecule like a protein to modeling proteins' interactions with other molecules, then to modeling whole cells, then tissues, then organs—and eventually entire organisms.

The idea of building an artificial intelligence system that can understand and design every intricate detail of a complex biological system is mind-boggling. In time, this will be within our grasp.

The twentieth century was defined by fundamental advances in physics: from Albert Einstein's theory of relativity to the discovery of quantum mechanics, from the nuclear bomb to the transistor. As many modern observers have noted, the twenty-first century is shaping up to be the century of biology. Artificial intelligence and large language models will play a central role in unlocking biology's secrets and unleashing its possibilities in the decades ahead.

Buckle up.


Original title:

The Next Frontier For Large Language Models Is Biology

Original article:

https://www.forbes.com/sites/robtoews/2023/07/16/the-next-frontier-for-large-language-models-is-biology/


Image credits: U OF W, ROYAL SOCIETY, HARVARD


