Deep learning AI model scans 'dark matter' of genomic data to find 70,000 never-before-seen RNA viruses

The world is absolutely teeming with viruses. The infectious agents are widely considered the most abundant biological entity on Earth. But a full understanding of viruses and the roles they play in our world is limited by our paltry knowledge of their incredible diversity. 

Researchers have now used a deep learning computer model to reveal thousands of the hidden viruses lurking in our world and genetic databases. Scientists built a tool that scanned the vast amounts of genomic data collected from environments around the world and identified more than 160,000 potential new RNA virus species, including about 70,000 that no previous analysis had flagged as potentially novel.

“It's a landmark study—not in the breadth of the discovery, it's in the depth by which they made that discovery,” Artem Babaian, Ph.D., a computational virologist at the University of Toronto who was not involved with the study, told Fierce Biotech in an interview. “They're focusing on these RNA viruses that were previously invisible to standard sequence alignment methods.”

As their name implies, RNA viruses have genomes made of RNA instead of DNA. Though the majority of viruses don’t infect humans, the group does include famous human pathogens like SARS-CoV-2, influenza and Ebola. RNA viruses are all around us, including in our own homes.

“If you go in your backyard,” Eddie Holmes, Ph.D., an evolutionary biologist and virologist at the University of Sydney, told Fierce in an interview, “you can take a sample of soil, and if you sequence that soil, you will find novel viruses.”

But sequencing all the genetic material in a pile of dirt will also produce a lot of so-called “dark matter”: DNA and RNA that don’t closely match any known organism. Much of this material is suspected to be viral, Holmes said, partially because of just how many viruses there are in the world, and partially because RNA viruses in particular evolve rapidly.

As Holmes put it, “RNA is intrinsically error-prone.” While DNA replication comes with proofreading that corrects mistakes, RNA replication generally does not; as an RNA virus replicates, changes to its genome can stack up quickly and, over time, leave the virus looking little like its relatives.

To find the RNA viruses hiding in this “dark matter,” a team led by Holmes and Mang Shi, Ph.D., a virologist at Sun Yat-sen University in China, took advantage of one thing all RNA viruses do have in common: the RNA polymerase enzyme. This is the protein RNA viruses use to copy their genomes when they replicate. Because of its vital function, its structure is highly conserved, even though the sequence of the gene that codes for it can vary widely.

“We trained this AI method to recognize the structure of every RNA polymerase known,” Holmes said, explaining that the program, called LucaProt, can then sort through new data and seek out RNA sequences that produce proteins that look like RNA polymerase. “Lo and behold, it finds them. It finds lots and lots and lots of them.”

“Instead of actually doing this expensive computational step of predicting the full structure, then doing a structure search that involves a whole different set of tools, they've essentially cut into the deep part of the deep learning,” Babaian said. 

Still, running the program took weeks, Holmes said, and the team partnered with Chinese tech company Alibaba to secure the computing power it needed.

The program scoured 51 terabytes of data sequenced from all around the world, including hot springs, Antarctic soil, salt marshes and compost piles. All the data are housed in the public Sequence Read Archive, maintained by the National Institutes of Health’s National Center for Biotechnology Information (NCBI).

“This is an amazing, amazing treasure trove,” Holmes said. “The NCBI is like science’s equivalent to the Library of Alexandria. Everything is in there.”

Some of the 161,979 viruses the team identified were so different from other RNA viruses that they could form 180 separate new supergroups. Holmes said finding a new supergroup is similar to finding a new phylum of animals, meaning some of these viruses are as different from each other as crabs are from earthworms or cats are from jellyfish.

Holmes, Shi and their colleagues have made LucaProt publicly available so other researchers can use it to search for new RNA viruses in their own data sets. Holmes thinks new viruses could help provide new useful enzymes and proteins; for example, a virus that lives in a hot spring would have an RNA polymerase that somehow withstands extreme temperatures.

“How is it doing that biochemically?” he said. “If we can find how RNA can survive and replicate at that temperature, if it indeed does, that would be an extremely fascinating enzyme.”

For Babaian, these results and others like them represent just the tip of the iceberg. Thanks to all of the data available and the increasing computational power to analyze those data, he said, “we're in the middle of a revolutionary change in our understanding of the virome and virus biodiversity.”

We’re beginning to understand how viruses can affect our health even if they don’t cause disease, Babaian said, pointing to a virus that infects the human parasite Toxoplasma gondii and seems to mediate whether the parasite causes disease.

“When you go into the deep, the dark and the unknown, that's when you make significant strides in medicine,” he said. “You have to understand the interconnection of viruses, us and our environments and the species that are around us.”