



The ESM Metagenomic Atlas database contains structure predictions for 617 million proteins. Credit: ESM Metagenomic Atlas (CC BY 4.0)

When London-based DeepMind unveiled the predicted structures of some 220 million proteins earlier this year, it covered nearly every protein from known organisms in DNA databases. Now another tech giant is filling in the dark matter of our protein universe.

Researchers at Meta (formerly Facebook, headquartered in Menlo Park, California) have used artificial intelligence (AI) to predict the structures of nearly 600 million proteins from uncharacterized bacteria, viruses and other microbes.


These are the structures we know the least about, and the proteins are incredibly mysterious, says Alexander Rives, principal investigator of the Meta AI protein team. He believes they could provide great insights into biology.

The team generated the predictions, described on 1 November in a preprint (ref. 1), using a large language model, a type of AI that underlies tools that can predict text from just a few letters or words.

Language models are typically trained on large amounts of text. To apply them to proteins, Rives and his colleagues fed them the sequences of known proteins, which can be written as chains of 20 different amino acids, each represented by a letter. The network then learned to ‘autocomplete’ proteins in which some of the amino acids had been hidden.
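
The training setup described above can be illustrated with a small sketch. It only builds one 'fill in the blanks' example from a sequence and does not train a model. The function names, the mask symbol, the 15% masking rate and the example sequence are all assumptions made for illustration, not details taken from Meta's work.

```python
import random

# The 20 standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "?"  # hypothetical placeholder symbol for a hidden residue


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues and return (masked, targets).

    targets maps each hidden position to the residue a model would
    have to recover. mask_fraction is an illustrative assumption.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    unknown = set(sequence) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"not standard amino acids: {sorted(unknown)}")
    rng = random.Random(seed)
    n_hidden = max(1, round(len(sequence) * mask_fraction))
    hidden = sorted(rng.sample(range(len(sequence)), n_hidden))
    targets = {i: sequence[i] for i in hidden}
    masked = "".join(
        MASK if i in targets else aa for i, aa in enumerate(sequence)
    )
    return masked, targets


def restore(masked, targets):
    """Fill the hidden positions back in, as a perfect model would."""
    return "".join(targets.get(i, ch) for i, ch in enumerate(masked))


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence for illustration
    masked, targets = mask_sequence(example)
    print(masked)   # the input a model would see
    print(targets)  # the answers it would be scored against
```

A network trained on many such examples is rewarded for guessing the hidden letters from their context, which is the sense in which it learns to autocomplete proteins.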

Protein autocomplete

This training gave the network an intuitive understanding of protein sequences, which hold information about the proteins’ shapes, says Rives. A second step, inspired by DeepMind’s pioneering protein-structure AI AlphaFold, combines such insights with information about the relationships between known protein structures and sequences, to generate predicted structures from protein sequences.

Meta’s network, called ESMFold, isn’t quite as accurate as AlphaFold, but it is about 60 times faster at predicting structures, Rives says, as his team reported earlier this summer. This means that structure prediction can be extended to much larger databases.

As a test case, the team ran the model on a database of bulk-sequenced metagenomic DNA from environmental sources such as soil, seawater, the human gut, skin and other microbial habitats. Most of the DNA entries that potentially encode proteins come from organisms that have never been cultured and are unknown to science.

In total, the Meta team predicted the structures of more than 617 million proteins. The work took just two weeks on 2,000 computer processors (AlphaFold can take minutes to generate a single prediction). The predictions, like the code underlying the models, are freely available to everyone, says Rives.
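
To put those figures in perspective, the back-of-the-envelope calculation below estimates the average processor time spent per structure. It is a rough sketch that assumes exactly 617 million predictions, exactly 14 days and 2,000 processors that are all busy the whole time; the function name is hypothetical and real throughput would vary with protein length.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processor_seconds_per_structure(n_structures, n_processors, days):
    """Average processor time per prediction, assuming perfect parallelism."""
    if n_structures <= 0:
        raise ValueError("n_structures must be positive")
    total_processor_seconds = n_processors * days * SECONDS_PER_DAY
    return total_processor_seconds / n_structures


if __name__ == "__main__":
    # Figures quoted in the article; rounding them is an assumption.
    avg = processor_seconds_per_structure(617_000_000, 2_000, 14)
    print(f"about {avg:.1f} processor-seconds per structure")
```

Under these assumptions the average comes to roughly four processor-seconds per structure, compared with the minutes quoted for a single AlphaFold prediction.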


Of these 617 million predictions, the model considered more than one-third to be of high quality, meaning that researchers can be confident that the overall protein shape is correct and, in some cases, can discern finer atomic-level details. Millions of these structures are completely novel, unlike anything in databases of experimentally determined protein structures or in the AlphaFold database of predictions from known organisms.

A significant portion of the AlphaFold database consists of structures that are nearly identical to each other, and the metagenomic database should cover much of the previously unseen protein world, says Martin Steinegger, a computational biologist at Seoul National University. Now there is a great chance to unravel more of the darkness, he says.

Sergey Ovchinnikov, an evolutionary biologist at Harvard University in Cambridge, Massachusetts, wonders about the hundreds of millions of predictions that ESMFold made with low confidence. Some may lack a defined structure, at least in isolation, and others may be non-coding DNA mistaken for protein-coding material. There still seems to be more than half of protein space that we know nothing about, he says.

Slimmer, simpler, cheaper

Burkhard Rost, a computational biologist at the Technical University of Munich in Germany, is impressed by the combination of speed and accuracy in Meta’s model. But he questions whether it really has an edge over AlphaFold’s accuracy when it comes to predicting proteins from metagenomic databases. Language-model-based prediction methods, including one developed by his team, are better suited to quickly determining how mutations alter protein structure, which is not possible with AlphaFold. Structure prediction will become leaner, simpler and cheaper, he says, and that will open the door to new things.

DeepMind currently has no plans to include metagenomic structure predictions in its database, but has not ruled this out for future releases, according to a company representative. AlphaFold has, however, been used to predict the structures of approximately 30 million metagenomic proteins, by researchers who hope to find new types of RNA virus by looking for new forms of their genome-replicating enzymes.

Steinegger sees trawling biology’s dark matter as the obvious next step for such tools. He believes that analysis of these metagenomic structures will explode soon.

