Combining AI with traditional wet lab work creates a virtuous circle from lab to data and back to the lab.
While the public’s attention has been captured by AI chatbots, a quiet revolution has been brewing in the sciences. One of the most promising fields AI is impacting is biology—long dominated by the tradition of the “wet lab,” which favors pure experimental data over computer simulation. Deep learning is changing that. It enables computers to understand complex patterns in data and generate ideas based on those patterns, and that’s making AI more and more central to experimental biology. There is no field more complex in its patterns than biology, so AI is the perfect tool for understanding it. And—as a host of new companies are showing—engineering it.
Oddly enough, even the success of large language models, or LLMs, like ChatGPT could be seen as evidence of AI’s ability to understand biological complexity. After all, language is a product of a particular biology: ours. The text on which ChatGPT is trained is just one of many complex forms of data produced by biological systems. In nature, this complexity starts with the smallest working parts (our proteins, our DNA) and extends up through cells and organs to physiology, disease, and behavior.
AI, with the right data, can span all of these scales and make sense of the data we collect on all of them. It’s poised to accelerate basic science, the business of biotechs, the behemoth pharmaceutical companies, and the broader bioeconomy.
A 2020 paper from San Francisco-based industry leader OpenAI investigated what factors had the biggest impact on LLM performance. Three variables determined how well they learned:
- Compute—the number of CPU cycles spent training a model.
- Parameters—the number of learnable features in a model.
- Dataset—the number of examples used in training the AI model.
The amount of “compute,” or the computing resources required, and the number of parameters can be dialed up by computer scientists as needed. But the same is not true of datasets. AI for biology can’t draw on gobs of pre-existing data the way the creators of ChatGPT did when they scraped the internet for everything it contained. While there is useful biological data in public research databases, it’s not enough. AI can only craft solutions to problems similar to what it has seen, and in biology, there is still a lot we don’t know.
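The scaling paper described model loss as falling roughly as a power law in each of these variables when the other two are not the bottleneck. A minimal sketch of that shape shows why a shortage of data is so costly: every tenfold increase in examples buys only a fixed fractional improvement. The exponent `ALPHA` and reference size `D_REF` below are made-up values chosen for illustration, not figures from the paper.

```python
def power_law_loss(n_examples: float, d_ref: float, alpha: float) -> float:
    """Toy loss curve of the form (d_ref / n_examples) ** alpha."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return (d_ref / n_examples) ** alpha


def improvement_per_tenfold(alpha: float) -> float:
    """Fraction by which the toy loss drops when the dataset grows 10x."""
    return 1 - 10 ** (-alpha)


ALPHA = 0.1   # hypothetical exponent, for illustration only
D_REF = 1e6   # hypothetical dataset size at which the toy loss equals 1.0

for n in (1e4, 1e5, 1e6, 1e7):
    print(f"{n:>12,.0f} examples -> toy loss {power_law_loss(n, D_REF, ALPHA):.3f}")

print(f"Each 10x more data cuts the toy loss by {improvement_per_tenfold(ALPHA):.1%}")
```

With a small exponent like this one, closing a gap in performance means collecting many times more examples, which is exactly what is hard to do in biology.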
AlphaFold, the breakthrough protein structure prediction program from Google’s DeepMind, was trained using existing public databases like the Protein Data Bank (PDB), a massive collection of basic structural data on the molecules of life. But it only learned what the PDB contained, which tends toward smaller, soluble proteins that fold into a compact, globular shape. AlphaFold doesn’t do well predicting the structure of a number of important classes of protein. For example, it has difficulty figuring out protein-protein interactions (an important consideration for designing drugs). Nor is it necessarily adept at snapshotting how a single protein can change shape (which can be critical for function). And the structures of proteins which span cell membranes are more difficult to predict because relatively few have actually been solved. These important problems won’t go away without new data.
While databases like PDB have come about as post hoc efforts to organize data produced by academics pursuing their disparate biological interests, in the new AI-driven era, research efforts must be purposefully organized to generate the right data for AI models. While some of these efforts will take the form of large-scale public research projects, akin to the Human Genome Project, others will be small and focused, carried out on proprietary data inside companies. Still others, like the Human Immunome Project, which seeks to use machine learning to build models of the human immune system, could take another form, like a public-private partnership.
(Editor’s note: The author’s partner Linda Avey has recently joined the board of the Human Immunome Project. Proto.life founder Jane Metcalfe is the chair of HIP’s board.)
AI models driving wet lab experiments is a trend we’re going to see more of in the coming years. These models will be quite different from the generalist AI models inside ChatGPT and Midjourney, which are built from large corpora of publicly accessible data, are general purpose, and are expensive to train. The marriage of wet lab techniques with AI, or “wet AI,” is in many ways the opposite of “dry” AI models like ChatGPT. Wet AI models will be small (not large), special purpose (not general), built from specially created proprietary datasets (rather than pre-existing public ones), and many will be trivially cheap to train, not costing between $30 million and $100 million like many LLMs. Although they will be small and private, they are on track to have enormous impact.
Wet AI gives companies a proprietary advantage compared to LLM and image generation AI companies. The latter are built on public data, and that means others can copy them. For example, Facebook released LLaMA, an open source model that exceeds the latest version of ChatGPT on most non-coding or math reasoning prompts, and open source developers and academics have seized on it, building new and better versions. And other companies like San Francisco-based Anthropic have entered the market with their own proprietary LLMs. This dynamic causes investors to worry that these high-flying companies may collapse under their own weight as competitors and new methods drive down profits. The wet AI approach relies on a different business model built on highly proprietary technologies that produce billion-dollar products protected by patents.
Along came the biologists…
Computer scientists were among the first to recognize AI’s application to biological problems. In 2008, Abe Heifets, the founder of Atomwise, was a computer science graduate student with an office across the hall from the godfather of modern AI, Geoffrey Hinton. His work wasn’t initially focused on AI, but he saw that something interesting was going on across the hall, and he had the early insight that by representing proteins as images of atomic structures, he could use image processing AI—convolutional networks—to discover drugs that fit in the active sites of proteins.
Daphne Koller, the first AI hire in Stanford’s computer science department, took an early interest in biology and was one of the first scientific advisors to 23andMe. From the beginning, she reasoned that AI could sort through high-throughput data and human genetics to find and prioritize targets. She focused these efforts at Insitro, one of the earliest AI-driven biotech companies.
Computer scientists paved the way, democratizing the tools and methods for a new wave of engineers and scientists. Chemists and biologists realized that AI could augment their work. For example, a common technique for discovering drugs is to screen libraries of compounds for desirable properties: binding to the right target, interaction with tissues, entry into cells, stability, or distribution in the body. While screening has been a workhorse method for scientists, it’s limited by the cost of producing compounds and doing experiments. By training generative AI on screening data, scientists can explore orders of magnitude more chemical structures, finding compounds with improved properties.
A clever combination of science and AI
One company taking this approach is Unnatural Products (UNP)—which in full disclosure is a portfolio company funded in part by my own investment firm, Humain Ventures. UNP was started by graduate students out of the lab of Scott Lokey, a professor at the University of California, Santa Cruz. Lokey focused on a special class of natural compounds—peptide macrocycles. These are short proteins that have been “cyclized,” or turned into a ring. That ring structure gives them special properties, locking the peptide into a semi-rigid structure that forms more stable bonds with its binding target. The ring is also more resistant to degradation than linear peptides but leaves enough flexibility to allow shape changes that give macrocycles a peculiar knack for slipping into cells, where they can bind to important targets. Nature has harnessed these structures because of their drug-like properties.
Cyclosporine is one example. It’s a well-known drug given to people with rheumatoid arthritis or psoriasis and to organ transplant recipients. Produced by the fungus Tolypocladium inflatum (originally identified as Trichoderma polysporum), it binds to proteins that mediate the immune response. Cyclosporine seems to be used by the fungus to help it infect insects, but doctors now use its immunosuppressive functions to prevent tissue rejection in organ transplant patients.
Cyclosporine is one of about 100 or so pharmacologically useful macrocyclic compounds. Almost all of them are from natural sources. Despite the success of cyclosporine, drug developers avoided macrocycles because they were perceived to be too complex to engineer. This perspective was reinforced by Pfizer’s top chemist, Christopher Lipinski, in his famous “Rule of 5,” which says that an orally available drug must not have more than one violation of several criteria Lipinski determined after reviewing the company’s successes and failures. Any time Pfizer’s scientists tried to build compounds that were larger than a certain size, or had too many hydrogen bond donors or acceptors, they failed, largely due to the inability of those molecules to get into cells where they could act. These rules all but exclude macrocycles as useful drug candidates because they are guaranteed to violate Lipinski’s first and third criteria.
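For readers who want the rule spelled out: in its commonly cited form, the Rule of 5 sets four limits (listed here in no particular order): a molecular weight of no more than 500 daltons, no more than 5 hydrogen bond donors, no more than 10 hydrogen bond acceptors, and a calculated logP of no more than 5. A compound that breaks more than one limit is flagged as unlikely to work as a pill. The sketch below encodes that check. The `Compound` class and both example molecules, including every property value, are hypothetical and exist only to illustrate the logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Compound:
    """Hypothetical container for the four properties the rule looks at."""
    name: str
    mol_weight: float       # daltons
    h_bond_donors: int
    h_bond_acceptors: int
    log_p: float            # calculated octanol-water partition coefficient


def rule_of_5_violations(c: Compound) -> List[str]:
    """Return a label for each limit the compound exceeds."""
    checks = [
        ("molecular weight over 500 Da", c.mol_weight > 500),
        ("more than 5 hydrogen bond donors", c.h_bond_donors > 5),
        ("more than 10 hydrogen bond acceptors", c.h_bond_acceptors > 10),
        ("logP over 5", c.log_p > 5),
    ]
    return [label for label, violated in checks if violated]


def passes_rule_of_5(c: Compound) -> bool:
    """The rule tolerates at most one violation."""
    return len(rule_of_5_violations(c)) <= 1


# Illustrative values only; these are not measurements of real compounds.
small_molecule = Compound("hypothetical small molecule", 350.0, 2, 5, 2.5)
macrocycle = Compound("hypothetical macrocycle", 1200.0, 6, 12, 3.0)

for compound in (small_molecule, macrocycle):
    verdict = "passes" if passes_rule_of_5(compound) else "fails"
    print(compound.name, verdict, rule_of_5_violations(compound))
```

A ring of ten or so amino acids is far above the weight limit before anything else is counted, which is why the rule steered chemists away from this class.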
Yet Lokey was undeterred and convinced that there must be a way to engineer what nature had shown us. Some of his colleagues thought he was plotting a risky course outside of the mainstream, but he doggedly pursued this vision over decades, continuing to hack away at the problem, and developing the technology for building large libraries of these compounds, a critical step in the drug discovery process.
Two of Lokey’s graduate students, Cameron Pye and Josh Schwochert, were instrumental in developing clever parallel synthesis and screening techniques. But even with these advances, screening only accesses a small part of the chemical space. They realized that by training AI on the data from their screening experiments, they could search a much larger space of molecules.
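As a rough illustration of the idea (not Unnatural Products’ actual method, which is not described here), the sketch below fits a very simple scoring model to a handful of made-up screening results, then ranks every sequence in a tiny search space, including ones that were never screened. A plain additive model stands in for the generative AI. The three-letter residue alphabet, the sequences, and the scores are all invented for the example.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

ALPHABET = "ABC"   # hypothetical residue types
LENGTH = 3         # hypothetical peptide length

# Invented screening results: sequence -> measured score (higher is better).
SCREENED: Dict[str, float] = {
    "AAA": 1.0, "BAA": 3.0, "ABA": 2.0, "AAB": 2.0,
    "CAA": 0.0, "ACA": 0.0, "AAC": 0.0,
}

Model = Tuple[float, Dict[Tuple[int, str], float]]


def fit_additive_model(data: Dict[str, float]) -> Model:
    """For each (position, residue), average how far the scores of screened
    sequences carrying that residue sit above or below the overall mean."""
    overall = sum(data.values()) / len(data)
    sums: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for seq, score in data.items():
        for pos, residue in enumerate(seq):
            sums[(pos, residue)] += score - overall
            counts[(pos, residue)] += 1
    contributions = {key: sums[key] / counts[key] for key in sums}
    return overall, contributions


def predict(seq: str, model: Model) -> float:
    """Overall mean plus per-position contributions (0 for unseen residues)."""
    overall, contributions = model
    return overall + sum(contributions.get((p, r), 0.0) for p, r in enumerate(seq))


def rank_unscreened(data: Dict[str, float], top_n: int = 3) -> List[str]:
    """Score every sequence that was never screened and return the best few."""
    model = fit_additive_model(data)
    candidates = ("".join(p) for p in product(ALPHABET, repeat=LENGTH))
    unscreened = [s for s in candidates if s not in data]
    return sorted(unscreened, key=lambda s: predict(s, model), reverse=True)[:top_n]


print(rank_unscreened(SCREENED))
```

Only 7 of the 27 possible sequences were "screened" here, yet the model's top pick is one it never saw, assembled from the residues that did best at each position. Real systems replace this toy with far richer models and far more data, but the division of labor is the same: experiments supply examples, and the model extrapolates.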
Macrocycles are peptides, short chains of about 8–20 amino acids. Although there are only 20 different types of natural amino acids, bioengineered macrocycles can include “unnatural” ones (hence the name of their company, Unnatural Products), which can push that number to hundreds of thousands. This represents a huge set of possible compounds. For example, a library of macrocycles each 10 amino acids long, where each position is one of 100,000 unnatural amino acids, would contain 100,000^10, or 10^50, different molecules. That is around the number of atoms that constitute the entire globe we call Earth.
Synthesizing just one copy of each molecule in such a library would require a mass the size of Jupiter. So screening is only feasible for a tiny fraction of this number: billions of compounds rather than more than a billion billion billion billion billion of them. AI can discover patterns in screening data that enable it to suggest an ideal drug candidate that has never actually been synthesized or screened. As a result, Unnatural Products has been able to create compounds that both bind very tightly (i.e., with “picomolar” dissociation constants) and, remarkably, get through cell membranes, breaking Lipinski’s famous rule.
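These orders of magnitude are easy to check. The sketch below recomputes the library size from the figures in the text and estimates the mass of a single copy of every molecule. The average mass of 1,200 daltons per macrocycle, the rounded planetary masses, and the choice of one billion as the screenable number are assumptions added here for the estimate.

```python
# Figures taken from the text.
AMINO_ACID_CHOICES = 100_000   # options at each position
PEPTIDE_LENGTH = 10            # positions per macrocycle

# Assumptions added for this estimate.
ASSUMED_MACROCYCLE_DALTONS = 1200   # rough average mass of one molecule
DALTON_KG = 1.66054e-27             # kilograms per dalton
EARTH_MASS_KG = 5.97e24             # rounded
JUPITER_MASS_KG = 1.90e27           # rounded
SCREENABLE = 10 ** 9                # "billions" of compounds, taken as 1e9

# Python integers are exact, so this is precisely 10**50.
library_size = AMINO_ACID_CHOICES ** PEPTIDE_LENGTH

# Mass of one copy of every molecule in the library.
library_mass_kg = library_size * ASSUMED_MACROCYCLE_DALTONS * DALTON_KG

# Share of the library a billion-compound screen would cover.
fraction_screened = SCREENABLE / library_size

print(f"Library size:        {library_size:.0e} molecules")
print(f"Mass of one of each: {library_mass_kg:.1e} kg")
print(f"  in Earth masses:   {library_mass_kg / EARTH_MASS_KG:.0f}")
print(f"  in Jupiter masses: {library_mass_kg / JUPITER_MASS_KG:.2f}")
print(f"Fraction screened:   {fraction_screened:.0e}")
```

Under these assumptions the total comes to roughly 2 × 10^26 kilograms, which is tens of Earth masses and within about an order of magnitude of Jupiter. A billion-compound screen covers about one part in 10^41 of the space.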
Macrocycles combine the best properties of large and small molecules. Over the last 20 years, antibodies have become popular in drug development. They are very specific, but must be injected and can’t access the inside of cells. Toxicity sometimes results because their mechanism of action is indirect—antibody makers target molecules on the outside of the cell which they hope will influence the key molecule inside the cell. But this indirect strategy sometimes doesn’t work and can trigger unwanted side effects. On the other hand, small molecules, which can be taken in pill form, obey Lipinski’s rules and can get inside cells but they’re unspecific, so they often require higher doses and can produce toxicity because of “off-target” effects. Macrocyclic peptides are a kind of Goldilocks drug: neither too big nor too tiny but rather just the right size for drug development—large enough to bind very specifically but small enough to sneak inside cells.
This application of wet AI uses experiments to get you to the ballpark and AI to get you to the seat in the stadium. It’s a remarkable combination of scientific ingenuity and computer science know-how that enables a fundamentally new ball game. A large team of medicinal chemists at Merck spent 10 years designing one macrocyclic peptide, a feat so impressive it was nominated by their peers as “molecule of the year” in 2021. Techniques like the one Unnatural Products is pioneering are finding thousands of such molecules with better properties in less than a third of the time, with a much smaller team. It’s a taste of things to come.
Although the numbers seem large, wet AI models like this will be tiny and relatively inexpensive to build compared to models like ChatGPT. A typical model is on the order of ten thousand times smaller, only ~50 MB versus 570 GB for ChatGPT. But there is obviously huge value in these small models.
While this kind of model is built on a one-off basis for each protein, more general foundation models could give insights into basic properties of a company’s technology. For example, by collecting tissue distribution data for a drug library, an AI model could be trained to generate drug candidates which also target the right cells. This will be more costly than the kind of screens UNP is currently doing, but it could pay off in spades. Failure of a drug to get to the right tissue is a common cause of failure to get through the FDA review process.
Wet AI is also going to drive investment in general use foundation models. AlphaFold was trained on public data culled from years of work by labs pursuing disparate biological problems, but there are gaps that will be most efficiently closed by focused public efforts. Mark Murcko, the former CTO of Vertex Pharmaceuticals, is promoting the concept of prizes to convince labs to focus on structures of key proteins that affect toxicity, what he calls the “avoidome.” These are proteins that drug developers don’t want their molecules to interact with. Many of these are proteins that span cell membranes and represent difficult targets for structural biologists, but they are not impossible. Methods like cryo-EM combined with other experimental techniques can get us there. Wet AI could drive public efforts akin to the Human Genome Project, which published its finished reference genome of our species 20 years ago, or the NIH’s All of Us project, which aims to collect genomic data and other health information from a million people. In the coming era, scientists will drive large-scale public data collection efforts to serve the needs of AI models.
Wet AI isn’t just about drug discovery. It also has a critical role to play in health and the emerging synthetic biology-driven bioeconomy. The most exciting aspect of wet AI, however, is that it promises to turn the molecular world into a programmable medium. That could have far reaching impacts, giving us the ability to imagine new categories of products based on engineered biology. I’ll cover these topics in future posts.