Proteins Never Seen in Nature Are Designed Using AI to Address Biomedical and Industrial Problems Unsolved by Evolution

2023-05-03 04:30:35

Machine learning (ML) and other AI-based computational tools have proven their prowess at predicting real-world protein structures. AlphaFold 2, an algorithm developed by scientists at DeepMind that can confidently predict protein structure purely on the basis of an amino acid sequence, has become almost a household name since its launch in July 2021. Today, AlphaFold 2 is used routinely by many structural biologists, with over 200 million structures predicted.

This ML toolbox appears capable of generating made-to-order proteins too, including those with functions not present in nature. This is an appealing prospect because, despite natural proteins’ vast molecular diversity, there are many biomedical and industrial problems that evolution has never been compelled to solve.

Scientists are now rapidly moving toward a future in which they can apply careful computational analysis to infer the underlying principles governing the structure and function of real-world proteins and apply them to construct bespoke proteins with functions devised by the user. Lucas Nivon, CEO and cofounder of Cyrus Biotechnology, believes the ultimate impact of such in silico-designed proteins will be massive and compares the field to the fledgling biotech industry of the 1980s. “I think in 30 years 30, 40 or 50 percent of drugs will be computationally designed proteins,” he says.

To date, companies operating in the protein design space have largely focused on retooling existing proteins to perform new tasks or enhance specific properties, rather than true design from scratch. For example, scientists at Generate Biomedicines have drawn on existing knowledge about the SARS-CoV-2 spike protein and its interactions with the receptor protein ACE2 to design a synthetic protein that can consistently block viral entry across diverse variants. “In our internal testing, this molecule is quite resistant to all of the variants that we’ve seen thus far,” says cofounder and chief technology officer Gevorg Grigoryan, adding that Generate aims to apply to the FDA to clear the way for clinical testing in the second quarter of this year. More ambitious programs are on the horizon, although it remains to be seen how soon the leap to de novo design—in which new proteins are built entirely from scratch—will come.

The field of AI-assisted protein design is blossoming, but its roots stretch back more than two decades, to work by academic researchers like David Baker and colleagues at what is now the Institute for Protein Design at the University of Washington. Starting in the late 1990s, Baker, who has co-founded companies in this space including Cyrus, Monod and Arzeda, oversaw the development of Rosetta, a foundational software suite for predicting and manipulating protein structures.

Since then, Baker and other researchers have developed many other powerful tools for protein design, powered by rapid progress in ML algorithms and particularly by advances in a subset of ML techniques known as deep learning. In September 2022, for example, Baker’s team published their deep learning ProteinMPNN platform, which allows them to input the structure they want and have the algorithm output an amino acid sequence likely to produce that de novo structure, achieving a greater than 50 percent success rate.

Some of the greatest excitement in the deep learning world relates to generative models that can create entirely new proteins, never seen before in nature. These modeling tools belong to the same category of algorithms used to produce eerie and compelling AI-generated artwork in programs like Stable Diffusion or DALL-E 2 and text in programs like ChatGPT. In those cases, the software is trained on vast amounts of annotated image data and then uses those insights to produce new pictures in response to user queries. The same feat can be achieved with protein sequences and structures, where the algorithm draws on a rich repository of real-world biological information to dream up new proteins based on the patterns and principles observed in nature. To do this, however, researchers also need to give the computer guidance on the biochemical and physical constraints that inform protein design, or else the resulting output will offer little more than artistic value.

One effective strategy to understand protein sequence and structure is to approach them as ‘text’, using language modeling algorithms that follow rules of biological ‘grammar’ and ‘syntax’. “To generate a fluent sentence or a document, the algorithm needs to learn about relationships between different types of words, but it needs to also learn facts about the world to make a document that’s cohesive and makes sense,” says Ali Madani, a computer scientist formerly at Salesforce Research who recently founded Profluent.
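As a rough illustration of treating protein sequences as "text", the sketch below trains the simplest possible language model, a bigram model, on a few amino acid strings and then samples a new sequence one residue at a time. Everything here is an assumption made for illustration: the three training sequences are made up, and a bigram model is far cruder than the deep language models used by Profluent or Arzeda. It only shows the basic idea of learning which residues tend to follow which.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical markers for sequence start and end

def train_bigram(sequences):
    """Count how often each residue follows another across the training set."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START] + list(seq) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_sequence(counts, rng, max_len=50):
    """Generate a new sequence residue by residue from the learned counts."""
    out, prev = [], START
    while len(out) < max_len:
        choices = list(counts[prev].keys())
        weights = [counts[prev][c] for c in choices]
        nxt = rng.choices(choices, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

# Made-up toy training set; real models learn from millions of natural sequences.
TOY_SEQUENCES = ["MKTAYIAKQR", "MKTLLVAGQR", "MKVAYLAKQK"]

model = train_bigram(TOY_SEQUENCES)
generated = sample_sequence(model, random.Random(0))
print(generated)
```

Every adjacent pair of residues in the generated string was seen somewhere in the training data, which is the toy analogue of obeying a learned "grammar". Such a model has no notion of structure or function, which is why real systems need far richer models and experimental validation.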

In a recent publication, Madani and colleagues describe a language modeling algorithm that can yield novel computer-designed proteins that can be successfully produced in the lab with catalytic activities comparable to those of natural enzymes. Language modeling is also a key part of Arzeda’s toolbox, according to co-founder and CEO Alexandre Zanghellini. For one project, the company used multiple rounds of algorithmic design and optimization to engineer an enzyme with improved stability against degradation. “In three rounds of iteration, we were able to go from complete disappearance of the protein after four weeks to retention of effectively 95 percent activity,” he says.

A recent preprint from researchers at Generate describes a new generative modeling-based design algorithm called Chroma, which includes several features that improve its performance and success rate. These include diffusion models, an approach used in many image-generation AI tools that makes it easier to manipulate complex, multidimensional data. Chroma also employs algorithmic techniques to assess long-range interactions between residues that are far apart on the protein’s chain of amino acids, called a backbone, but that may be essential for proper folding and function. In a series of initial demonstrations, the Generate team showed that they could obtain sequences that were predicted to fold into a broad array of naturally occurring and arbitrarily chosen structures and subdomains—including the shapes of the letters of the alphabet—although it remains to be seen how many will form these folds in the lab.
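To make "long-range interactions" concrete, the following sketch finds pairs of residues that are far apart along the chain yet close together in three-dimensional space. This is not Chroma's algorithm. The function name, the separation threshold, the distance cutoff and the hairpin coordinates are all assumptions chosen for illustration.

```python
import math

def long_range_contacts(coords, min_separation=6, cutoff=6.0):
    """Return index pairs (i, j) that are at least min_separation positions
    apart along the chain but within cutoff angstroms of each other in space."""
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts

# Hypothetical 10-residue hairpin: five residues run out along x,
# then five run back along a parallel line 5 angstroms away.
SPACING = 3.8  # assumed spacing between consecutive residues, in angstroms
hairpin = [(SPACING * k, 0.0, 0.0) for k in range(5)] + \
          [(SPACING * k, 5.0, 0.0) for k in reversed(range(5))]

contacts = long_range_contacts(hairpin)
print(contacts)  # → [(0, 9), (1, 8)]
```

Residues 0 and 9 sit at opposite ends of the sequence but face each other across the hairpin, so a model that only looks at neighbors along the chain would miss the contact that holds the fold together.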

In addition to the new algorithms’ power, the tremendous amount of structural data captured by biologists has also allowed the protein design field to take off. The Protein Data Bank, a critical resource for protein designers, now contains more than 200,000 experimentally solved structures. The AlphaFold 2 algorithm is also proving to be a game changer here in terms of providing training material and guidance for design algorithms. “They are models, so you have to take them with a grain of salt, but now you have this extraordinarily large amount of predicted structures that you can build upon,” says Zanghellini, who says this tool is a core component of Arzeda’s computational design workflow.

For AI-guided design, more training data are always better. But existing gene and protein databases are constrained by a limited range of species and a heavy bias towards humans and commonly used model organisms. Basecamp Research is building an ultra-diverse repository of biological information obtained from samples collected in biomes in 17 countries, ranging from the Antarctic to the rainforest to hydrothermal vents on the ocean floor. Chief technology officer Philipp Lorenz says that once the genomic data from these specimens are analyzed and annotated, they can assemble a knowledge-graph that can reveal functional relationships between diverse proteins and pathways that would not be obvious purely on the basis of sequence-based analysis. “It’s not just generating a new protein,” says Lorenz. “We are finding protein families in prokaryotes that have been thought to exist only in eukaryotes.” [Prokaryotes, single-celled organisms such as bacteria, lack the more sophisticated internal cellular structures found in eukaryotes, which are capable of becoming multicellular organisms.]

This means many more starting points for AI-guided protein design efforts, and Lorenz says that his team’s own design experiments have achieved an 80 percent success rate at producing functional proteins.

But proteins do not function in a vacuum. Tess van Stekelenburg, an investor at Hummingbird Ventures, notes that Basecamp, one of the companies funded by the firm, captures all manner of environmental and biochemical context for the proteins it identifies. The resulting ‘metadata’ accompanying each protein sequence can help guide the engineering of proteins that express and function optimally in particular conditions. “It gives you a lot more ability to constrain for things like pH, temperature or pressure, if that’s what you’re planning to look at,” she says.

Some companies are also looking to augment public structural biology resources with data of their own. Generate is in the process of building a multi-instrument cryo-electron microscopy facility, which will allow them to generate near-atomic-resolution structures at relatively high throughput. Such internally generated structural data are more likely to include relevant metadata about individual proteins than data from publicly available resources.

In-house wet lab facilities are another critical component of the design process because experimental results are, in turn, used to train the algorithm to achieve even better outcomes in future rounds. Grigoryan notes that, although Generate likes to spotlight its algorithmic toolbox, the majority of its workforce comprises experimentalists.

And Bruno Correia, a computational biologist at the École Polytechnique Fédérale de Lausanne, says that the success of a protein design effort depends on close consultation between algorithm experts and experienced wet-lab practitioners. “This notion of how protein molecules are and how they behave experimentally builds in a lot of constraints,” says Correia. “I think it’s a mistake to handle biological entities just as a piece of data.”

Biological validation is an extremely important consideration for investors in this sector, says van Stekelenburg. “If you are doing de novo, the real gold standard is not which architecture are you using—it’s what percentage of your designed proteins had the end desired property,” she says. “If you can’t show that, then it doesn’t make sense.” Accordingly, most companies pursuing computational design are still focused on tuning protein function rather than overhauling it, shortening the leap between prediction and performance.
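The metric van Stekelenburg describes, the percentage of designed proteins that show the desired property, is simple arithmetic, but how far it can be trusted depends on how many designs were tested. The sketch below computes the observed rate together with a Wilson score interval. The counts are hypothetical, and the choice of a Wilson interval is an assumption for illustration, not something the companies quoted here report using.

```python
import math

def success_rate(n_success, n_tested, z=1.96):
    """Return (rate, low, high): the observed success fraction and an
    approximate 95% Wilson score interval around it."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    p = n_success / n_tested
    denom = 1 + z**2 / n_tested
    center = (p + z**2 / (2 * n_tested)) / denom
    half = z * math.sqrt(p * (1 - p) / n_tested + z**2 / (4 * n_tested**2)) / denom
    return p, center - half, center + half

# Hypothetical campaigns with the same headline rate but different sample sizes.
for hits, tested in [(8, 10), (80, 100)]:
    rate, low, high = success_rate(hits, tested)
    print(f"{hits}/{tested}: rate={rate:.0%}, interval=({low:.2f}, {high:.2f})")
```

Both campaigns report 80 percent, yet the interval for ten tested designs is much wider than the one for a hundred, which is one reason a headline success rate means more when the number of designs tested is stated alongside it.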

Nivon says that Cyrus typically works with existing drugs and proteins that fall short in a particular parameter. “This could be a drug that needs better efficacy, lower immunogenicity or a better toxicity profile,” he says. For Cradle, the primary goal is to improve protein therapeutics by optimizing properties like stability. “We’ve benchmarked our model against empirical studies so that people can get a sense of how well this might work in an experimental setting,” says founder and CEO Stef van Grieken.

Arzeda’s focus is on enzyme engineering for industrial applications. They have already succeeded in creating proteins with novel catalytic functions for use in agriculture, materials and food science. These projects often begin with a relatively well-established core reaction that is catalyzed in nature. But to adapt these reactions to work with a different substrate, “you need to remodel the active site dramatically,” says Zanghellini. Some of the company’s projects include a plant enzyme that can break down a widely used herbicide, as well as enzymes that can convert relatively low-value plant byproducts into useful natural sweeteners.

Generate’s first-generation engineering projects have focused on optimization. In one published study, company scientists showed that they could “resurface” the amino acid-metabolizing enzyme l-asparaginase from Escherichia coli bacteria, altering the amino acid composition of its exterior to greatly reduce its immunogenicity. But with the new Chroma algorithm, Grigoryan says that Generate is ready to embark on more ambitious projects, in which the algorithm can start building true de novo designs with user-designated structural and functional features. Of course, Chroma’s design proposals must then be validated by experimental testing, although Grigoryan says “we’re very encouraged by what we’ve seen.”

Zanghellini believes the field is near an inflection point. “We’re starting to see the possibility of really truly creating a complex active site and then building the protein around it,” he says. But he adds that many more challenges await. For example, a protein with excellent catalytic properties might be exceedingly difficult to manufacture at scale or exhibit poor properties as a drug. In the future, however, next-generation algorithms should make it possible to generate de novo proteins optimized to tick off many boxes on a scientist’s wish list rather than just one.

This article is reproduced with permission and was first published on February 23, 2023.
