A team of US scientists has developed a new foundation model that can sift through genetic code across biology’s five kingdoms to give insights into disease-causing mutations and genome design.

The model, called Evo 2, was trained on the DNA of more than 100,000 species drawn from across the tree of life. Jointly developed by the Arc Institute in Palo Alto, California, and tech giant Nvidia, the model will be made publicly accessible to help advance research in the life science sector.

Evo 2 follows Evo 1, a model trained on the genomes of single-celled organisms that was unveiled in a research article in Science in November 2024. Nvidia and the Arc Institute state that Evo 2 is the largest publicly accessible AI model in biology to date, having been trained on more than 9.3 trillion nucleotides. This gives the foundation model, which comprises AI neural networks trained on large-scale datasets, a wide window into the genome.

Evo 2 was trained on information from humans, plants, and bacteria, meaning it can provide insights into connections between distant parts of an organism’s genetic code and processes such as cell function, gene expression and disease.

“Our development of Evo 1 and Evo 2 represents a key moment in the emerging field of generative biology, as the models have enabled machines to read, write, and think in the language of nucleotides,” said Patrick Hsu, Arc Institute co-founder and co-senior author of the Evo 2 paper.

The researchers hope Evo 2 will find wide-ranging applications in scientific research. They emphasise the model’s ability to identify genetic changes that could lead to protein dysfunction. In tests with the gene BRCA1, variants of which are linked to breast cancer, Evo 2 was able to predict with 90% accuracy which mutations were pathogenic.

Computational biologist Hani Goodarzi, who was also involved in the model’s development, said Evo 2 could be deployed in drug discovery.

“If you have a gene therapy that you want to turn on only in neurons to avoid side effects, or only in liver cells, you could design a genetic element that is only accessible in those specific cells. This precise control could help develop more targeted treatments with fewer side effects,” Goodarzi explained.

An increasing number of pharmaceutical companies are touting pipelines that have been primarily led and designed by AI. Indeed, AI’s potential in drug discovery was scientifically endorsed when half of the 2024 Nobel Prize in Chemistry was awarded to Demis Hassabis and John Jumper of Google DeepMind for their work on AlphaFold, an AI system that accurately predicts protein structures.

While AI’s advantages in expediting and refining the development of innovative medicines are clear, experts have said that large datasets capturing the full complexity of biology are hard to find.