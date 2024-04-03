



To create proteins with useful functions, researchers typically start with a naturally occurring protein that has a desired function, such as emitting fluorescent light, and put it through many rounds of random mutation that eventually produce an optimized version of the protein.

This process has resulted in optimized versions of many important proteins, including green fluorescent protein (GFP). However, for other proteins, generating optimized versions has proven difficult. Researchers at MIT have now developed a computational approach that makes it easier to predict mutations that will produce better proteins, based on relatively small amounts of data.

Using this model, the researchers generated proteins with mutations predicted to lead to improved versions of GFP and of a protein from adeno-associated virus (AAV), which is used to deliver DNA for gene therapy. They hope the approach can also be used to develop additional tools for neuroscience research and medical applications.

Protein design is a difficult problem because the mapping from DNA sequence to protein structure and function is highly complex. There might be a great protein 10 changes away in the sequence, but each intermediate change might correspond to a completely nonfunctional protein. It's like trying to find your way to a river basin in a mountain range when craggy peaks block your view along the way. The current work tries to make that path easier to find, says Ila Fiete, a professor of brain and cognitive sciences at MIT, a member of MIT's McGovern Institute for Brain Research, director of the K. Lisa Yang Integrative Computational Neuroscience (ICoN) Center, and one of the study's senior authors.

Regina Barzilay, Distinguished Professor of AI and Health in the MIT School of Engineering, and Tommi Jaakkola, MIT's Thomas Siebel Professor of Electrical Engineering and Computer Science, are also senior authors of an open-access paper on the research, which will be presented at the International Conference on Learning Representations in May. MIT graduate students Andrew Kirjner and Jason Yim are the study's lead authors. Other authors include MIT postdoctoral fellow Shahar Bracha and Czech Technical University graduate student Raman Samusevich.

Protein optimization

Many naturally occurring proteins have functions that are useful for research and medical applications, but they need a little additional engineering to be optimized. In this study, the researchers were initially interested in developing proteins that could be used as voltage indicators in living cells. These proteins, produced by some bacteria and algae, fluoresce when an electric potential is detected. Engineering such proteins for use in mammalian cells could allow researchers to measure neuronal activity without electrodes.

Decades of research have gone into engineering these proteins to produce stronger fluorescent signals on faster timescales, but they have not become effective enough for widespread use. Bracha, who works in Edward Boyden's lab at the McGovern Institute, reached out to Fiete's lab to see if they could collaborate on a computational approach that might help speed up the protein optimization process.

The study exemplifies the human serendipity that characterizes many scientific discoveries, Fiete says. It grew out of the Yang Tan Collective retreat, a scientific meeting of researchers from multiple MIT centers with distinct missions, unified by the shared support of K. Lisa Yang. We learned that some of our interests and tools in modeling how the brain learns and optimizes could be applied to the entirely different field of protein design, as practiced in the Boyden lab.

For any given protein that researchers might want to optimize, there is a nearly infinite number of possible sequences that could be generated by swapping in different amino acids at each point in the sequence. With so many possible variants, it is impossible to test them all experimentally, so researchers have turned to computational modeling to try to predict which ones will work best.
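The scale of that search space can be made concrete with a short calculation. The sketch below counts how many sequences lie within a given number of substitutions of a starting protein, assuming an alphabet of 20 standard amino acids and a length of 238 residues (the commonly cited length of wild-type GFP; the article itself does not state a length). The function name is illustrative only.

```python
from math import comb

AMINO_ACIDS = 20   # size of the standard amino acid alphabet
GFP_LENGTH = 238   # assumed sequence length, not taken from the article


def variants_within(length: int, max_mutations: int) -> int:
    """Count sequences differing from a reference at no more than max_mutations positions."""
    # Choose which k positions change, then pick one of 19 alternatives at each.
    return sum(
        comb(length, k) * (AMINO_ACIDS - 1) ** k
        for k in range(max_mutations + 1)
    )


print(variants_within(GFP_LENGTH, 1))  # 4523: the original plus 238 * 19 single mutants
for k in (2, 5, 7):
    print(k, variants_within(GFP_LENGTH, k))
```

Even a handful of substitutions pushes the count far beyond what any lab could screen, which is why a predictive model is needed to choose candidates.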

In this study, the researchers set out to overcome these challenges by using data from GFP to develop and test a computational model that can predict better versions of the protein.

They first trained a type of model known as a convolutional neural network (CNN) on data consisting of GFP sequences and their brightness, the feature they wanted to optimize.
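To give a sense of what a convolutional model sees, here is a minimal sketch of the forward pass only: a sequence is one-hot encoded, a set of filters slides along it, and the strongest response of each filter feeds a linear readout. This is not the researchers' architecture; the filter sizes, the random untrained weights, the toy peptide, and all function names are assumptions made for illustration.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(seq):
    """Encode a protein sequence as a list of 20-element indicator rows."""
    rows = []
    for aa in seq:
        row = [0.0] * len(ALPHABET)
        row[INDEX[aa]] = 1.0
        rows.append(row)
    return rows


def conv_features(x, kernels):
    """Slide each kernel (width x 20) along x, apply ReLU, keep the max response."""
    feats = []
    for kernel in kernels:
        width = len(kernel)
        best = 0.0  # ReLU floor: negative responses are clipped to zero
        for start in range(len(x) - width + 1):
            response = sum(
                kernel[j][a] * x[start + j][a]
                for j in range(width)
                for a in range(len(ALPHABET))
            )
            best = max(best, response)
        feats.append(best)
    return feats


def predict_brightness(seq, kernels, weights, bias=0.0):
    """Linear readout on the pooled convolutional features."""
    feats = conv_features(one_hot(seq), kernels)
    return bias + sum(w * f for w, f in zip(weights, feats))


# Untrained, randomly initialised parameters (hypothetical sizes).
rng = random.Random(0)
kernels = [[[rng.gauss(0, 0.1) for _ in ALPHABET] for _ in range(3)] for _ in range(4)]
weights = [rng.gauss(0, 1) for _ in kernels]
print(predict_brightness("MSKGEELFTG", kernels, weights))  # toy peptide, arbitrary score
```

In practice the kernel and readout weights would be fit to measured sequence and brightness pairs with a deep learning library; the score printed here is meaningless until that training is done.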

Based on a relatively small amount of experimental data (from about 1,000 variants of GFP), the model was able to create a fitness landscape, a three-dimensional map that shows the fitness of a given protein and how much it differs from the original sequence.

These landscapes contain peaks that represent fitter proteins and valleys that represent less fit proteins. Predicting the path a protein needs to follow to reach a fitness peak can be difficult, because a protein often has to undergo a mutation that makes it less fit before it reaches a nearby peak of higher fitness. To overcome this problem, the researchers used an existing computational technique to smooth the fitness landscape.
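The article does not say which smoothing technique was used. As a rough illustration of the general idea, the sketch below treats sequences that differ by a single substitution as neighbors and blends each sequence's score with the average score of its neighbors. This simple neighbor averaging is a stand-in chosen for clarity, not the researchers' method, and the toy sequences, scores, and parameter names are invented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def smooth_fitness(fitness, alpha=0.5, rounds=1):
    """Blend each score with the mean score of its single-mutation neighbours.

    fitness: dict mapping sequence -> score
    alpha:   weight given to the neighbour average (0 leaves scores unchanged)
    rounds:  how many times to repeat the blending
    """
    seqs = list(fitness)
    neighbours = {s: [t for t in seqs if hamming(s, t) == 1] for s in seqs}
    current = dict(fitness)
    for _ in range(rounds):
        updated = {}
        for s in seqs:
            if neighbours[s]:
                mean_nb = sum(current[t] for t in neighbours[s]) / len(neighbours[s])
                updated[s] = (1 - alpha) * current[s] + alpha * mean_nb
            else:
                updated[s] = current[s]  # isolated sequences keep their score
        current = updated
    return current


# Invented toy landscape: a dip at "AAC" sits between two better variants.
toy = {"AAA": 1.0, "AAC": 0.2, "ACC": 1.4, "CCC": 2.0}
print(smooth_fitness(toy, alpha=0.5, rounds=2))
```

Averaging over neighbors pulls isolated dips and spikes toward their surroundings, which is the property that makes a smoothed landscape easier to climb.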

Once these small bumps in the landscape were smoothed, the researchers retrained the CNN model and found that it could reach higher fitness peaks more easily. The model was able to predict optimized GFP sequences that differed from the starting protein sequence by as many as seven amino acids, and the best of these proteins were estimated to be about 2.5 times fitter than the original.

Once we have this landscape that represents what the model thinks is nearby, we smooth it out and then retrain the model on the smoother version of the landscape, Kirjner says. Now there is a smooth path from the starting point to the top, which the model can reach by iteratively making small improvements. The same is often impossible for unsmoothed landscapes.
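A one-dimensional toy example shows why smoothing helps a search that only makes small improvements. In the sketch below, a greedy climber moves to a better adjacent position until none exists. The landscape values, the three-point moving average, and the function names are all invented for illustration; the real problem is a search over protein sequences, not positions on a line.

```python
def hill_climb(scores, start=0):
    """Greedy ascent: step to the better adjacent position until no neighbour improves."""
    pos = start
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(scores)]
        best = max(neighbours, key=lambda p: scores[p])
        if scores[best] <= scores[pos]:
            return pos  # local peak: no adjacent position is strictly better
        pos = best


def moving_average(scores):
    """Average each point with its immediate neighbours (window shrinks at the ends)."""
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


# Upward trend with a small zigzag: every second step is a slight drop.
rugged = [1.2, 1.1, 1.8, 1.7, 2.4, 2.3, 3.0, 2.9, 3.6]
smoothed = moving_average(rugged)

stuck_at = hill_climb(rugged)    # the first small drop blocks progress
reached = hill_climb(smoothed)   # the smoothed curve rises steadily
print(rugged[stuck_at], rugged[reached])  # 1.2 3.6
```

On the rugged values the climber never leaves its starting point, while on the smoothed values it walks all the way to the position that has the highest true score.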

Proof of concept

The researchers also showed that this approach worked well in identifying new sequences for the viral capsid of adeno-associated virus (AAV), a viral vector commonly used to deliver DNA. In that case, they optimized the capsid for its ability to package a DNA payload.

We used GFP and AAV as a proof of concept to show that this is an effective method for very well-characterized datasets. As such, it should be applicable to other protein engineering problems, Bracha says.

The researchers now plan to use this computational technique on the data Bracha is generating on voltage indicator proteins.

Dozens of labs have been working on this for two decades, and still there isn't anything better, she says. The hope is that now, by generating a smaller dataset, the team can train a model in silico and make predictions that are better than the past two decades of manual testing.

This research was supported by the U.S. National Science Foundation, the Machine Learning for Pharmaceutical Discovery and Synthesis consortium, the Abdul Latif Jameel Clinic for Machine Learning in Health, the DTRA Discovery of Medical Countermeasures Against New and Emerging Threats program, the DARPA Accelerated Molecular Discovery program, the Sanofi Computational Antibody Design grant, the U.S. Office of Naval Research, the Howard Hughes Medical Institute, the National Institutes of Health, the K. Lisa Yang ICoN Center, and the K. Lisa Yang and Hock E. Tan Center for Molecular Therapeutics at MIT.

