Biobanks are on the Cusp of Translating Big Data Into New Medicine

The rise of biobanks around the world promised lots of cheap and plentiful human DNA for scientists to study. Here’s how that’s going.

In the spring of 2020, as COVID-19 ricocheted around the world, geneticists everywhere joined forces to try to answer a shattering question: Why did some people infected with the coronavirus become severely ill and die while most did not? Suspecting the answer might be genetic, these scientists tapped into a global network of repositories called biobanks—storehouses of DNA that for years had been collecting bio-samples (mostly blood) from millions of people—to help researchers better understand how genes tucked inside us influence disease.

Within weeks, the researchers overcame rivalries, assembled the necessary funds, and founded the COVID-19 Host Genetics Initiative. Eventually they signed up 150 biobanks and research organizations in 80 countries to comb through the DNA and medical records of coronavirus-infected people who have participated in programs in locales ranging from Boston to Brazil, England to Estonia, Chile to China, Israel to India, and Salt Lake City to Qatar.

Now, more than two years later, the initiative has discovered dozens of genetic variants that correlate with severe COVID-19 reactions. Mostly, these variants have to do with how the body’s immune system works or doesn’t work to fend off an attack by the virus. 

These include two critical genetic pathways, according to Andrea Ganna, a senior researcher at the Institute for Molecular Medicine in Helsinki, Finland, and one of the organizers of the COVID-19 initiative. “One is the interferon pathway,” said Ganna, interferon being a protein that suppresses viral replication. “The other is the lung fibrosis pathway that determines how well the pulmonary defense system works.”

The discoveries demonstrate the potential power of biobanks to unravel genetic influences not only for COVID-19, but for a range of diseases. But they also reveal the limitations of what biobank data can provide. For one thing, genes only account for part of what makes people ill with most diseases. Non-genetic factors also play a role, including a person’s overall health, age, lifestyle, geography, socioeconomics, ethnicity, and just plain bad luck. Other critical molecules inside people, like proteins and metabolites, also contribute. 

Getting all of this right is still a work in progress as a gap remains in translating the findings of biobank research into what everyone wants—simple tests that doctors and healthcare systems can use to tell you if you are high risk for COVID-19 complications, diabetes, cancer, stroke, or other common diseases. “We don’t yet have the magic bullet of a clinical test for those at high risk for severe reactions to COVID-19,” says Ganna, who also works with Finland’s biobank, called FinnGen. 

Yet the gap is narrowing as biobanks may at last be on the cusp of providing useful information about a person’s health future.

Born out of exuberance, rational and irrational

Biobanks were born in the early 2000s during the great flush of excitement as the Human Genome Project—really two projects, one private and one public—was nearing completion. Pulses raced and billions of private and public dollars flowed as Homo sapiens created the first map ever of the previously hidden world of their As, Cs, Ts, and Gs. 

Hailed as a monumental achievement right up there with the wheel and the internet, the near-completion of the first human whole-genome sequencing was announced in a White House ceremony in June 2000. “Today, we are learning the language in which God created life,” said President Bill Clinton.

At the same time, geneticist Craig Venter, who headed up the private human genome project effort as founder and CEO of Celera Corporation, declared that within a few years everyone would carry a card with a magnetic data strip containing their complete genetic information as part of their basic health care record. “With this profound new knowledge, humankind is on the verge of gaining an immense new power to heal,” said Venter.

It’s something we’re still waiting for. 

All this buzz inspired some scientists back then to call for an audacious next step: to compare differences in people’s DNA to better understand disease and traits by sequencing huge numbers of people (thousands, then millions). Audacious because circa 2000, only a handful of people had been sequenced, with costs prohibitively high, technologies still nascent, and a vast human DNA continent that remained largely terra incognita.

Genomics had already produced a few important biomarkers by the turn of the 21st century. These included a high-risk variant in the APOE gene (APOE4), which turned out to be a solid, if terrifying, predictor that a person was at high risk for Alzheimer’s disease. Variants in the BRCA1 and BRCA2 genes also had been discovered that correlate with a very high—if rarely occurring—risk for breast cancer. Other genetic markers, mostly linked to other rare conditions like Huntington’s disease and Down syndrome, were also being discovered. Biomarkers for the most common diseases, including diabetes and heart disease, remained elusive.

“You must contact a cardiologist! You must take statins!”

In 2001, I was deep into this early exploration myself as a reporter for Wired when I became one of the first humans to be genetically sequenced. After testing me for hundreds of “genotypes” (genetic traits), scientists at the San Diego-based molecular testing company Sequenom (now part of Labcorp) gave me a crude scorecard ranking my proclivity for about 200 diseases and traits. These included Alzheimer’s, heart attack, and several types of cancer, plus genetic clues to attributes predicting that my eyes are blue (they are), my hair is curly (it is not), and that I am a cyclist with muscles built more for endurance than speed (which is true). It was an early 23andMe-style report coming six years before 23andMe existed.

Otherwise, I mostly discovered mutations that I did not have, which verified that I was basically healthy. But I did have a few higher-risk variants to potentially look forward to in my future—including for hearing loss and hypertension. 

In 2003, I had more DNA sequenced by DeCode Genetics, based in Reykjavik, Iceland. The company back then was planning to sequence most of Iceland’s 350,000 citizens, which it eventually did, and it remains one of the most advanced genetic testing and analysis outfits in the world (now as part of Amgen). One morning, DeCode’s co-founder and CEO, Kári Stefánsson, a physician-scientist descended from the murderous Viking Erik the Red, called to inform me that three markers on my chromosome 9 had come up high risk for a future heart attack.

“You must contact a cardiologist!” he screamed into the phone with the same intensity one imagines him sharing with his famously volatile ancestor. “You must take statins!” Statins are cholesterol-lowering drugs associated with a decreased risk of heart attack.

I thanked Stefánsson but did not call a cardiologist or take statins. This was because of two big downsides to that era’s genetic testing. One was that most supposedly high-risk biomarkers were based on statistical correlations coming from differences in single letters of DNA (say, an A rather than a T) that didn’t take into account the multitude of other factors like lifestyle and age that clearly influence complex maladies like heart disease. The other drawback was that higher-risk biomarkers were determined by comparing a population of people with, say, heart disease or diabetes to a population that didn’t have it, to see which genetic markers statistically indicate a higher risk. These results mostly told you the population’s risk, not your own personal risk.
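
To make the population-versus-person distinction concrete, here is a minimal sketch in Python of one common way such a case-control comparison is summarized: an odds ratio for carrying a single variant. The counts and the function name are invented for illustration and do not come from DeCode or any study mentioned here.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds of carrying the variant among cases, divided by the odds among controls."""
    case_odds = case_carriers / case_noncarriers
    control_odds = control_carriers / control_noncarriers
    return case_odds / control_odds


# Hypothetical counts: 1,000 people with heart disease, 1,000 without.
ratio = odds_ratio(
    case_carriers=300,
    case_noncarriers=700,
    control_carriers=200,
    control_noncarriers=800,
)

print(f"Odds ratio for the variant: {ratio:.2f}")
```

An odds ratio above 1 says the variant shows up more often among people with the disease. It describes the two groups as a whole and says little about what will happen to any single carrier, which is the limitation described above.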

Kári Stefánsson is the pioneering CEO of DeCode Genetics, who has been working for decades to connect the dots between a patient’s genome and their disease risk. (Credit: DeCode Genetics)

Another issue was a lack of ground truthing, that is, matching up the statistical risks with what happens in actual people in longitudinal studies as the years roll by. For instance, how did those three markers on chromosome 9 impact me and others almost 20 years after Stefánsson’s phone call? It turns out that, statistically, they are minor risk factors for heart disease compared to other factors like diet and age.

In 2007, I visited one of the first major biobanks, the UK Biobank, founded three years earlier and headquartered in a former textile mill in Manchester, England. The biobank was initially funded in 2004 with £62 million (about $175 million today) from the U.K. government’s Medical Research Council and from the Wellcome Trust, a private philanthropic foundation. Its researchers told me that they were planning to collect DNA from 500,000 Brits, back then a breathtakingly huge cohort of people.

Critics at the time insisted that this was a waste of time and money that would mostly produce piles of DNA and other data that would take years to sift through and understand. They also worried that such a massive effort would distract from smaller, more focused projects that cost less and had greater short-term potential to yield specific results in research and in the clinic. 

“I think when the MRC and the Wellcome Trust decided to set up UK Biobank, it was a real leap of faith about what might occur in the future,” says epidemiologist Rory Collins, professor of medicine at the University of Oxford and CEO of the UK Biobank since 2005. “It was very optimistic and ambitious in what it hoped was possible.”

Sure enough, it took another decade for the UK Biobank to actually sequence the DNA for a half-million folks. Even then the bank collected only a tiny fraction of each person’s DNA using gene arrays called “gene chips” that detected around 800,000 single-letter genetic markers out of the billions of DNA letters tucked inside each of us. 

These 800,000 markers were carefully chosen to correlate with known and suspected biomarkers called single nucleotide polymorphisms (SNPs), one-letter markers that had long been used to scan for genetic variants that seemed to correlate with an increased risk of disease. By 2016, however, SNP chips were no longer the state of the art, as the latest tech allowed scientists to sequence a person’s entire genome (all 6 billion letters, or nucleotides, of DNA) for around $1,000, down from millions of dollars ten years earlier.

Still, testing a half-million people on even an 800K SNP chip led to researchers discovering millions of new variants. All this sequencing also advanced another critical shift then underway in the science of genetics: a focus away from the effects of just one or two DNA letters toward understanding how multiple or “polygenic” genes work together to impact a person’s disease risk. 

“For many years the emphasis was on finding one or two genes causing something like heart disease,” says Peter Donnelly, a statistician at the University of Oxford and the CEO of Oxford-based Genomics PLC, a genetic testing company that works with UK Biobank based near Manchester, England. “We now know that there were many, many positions in your genome which individually affect your risk of heart disease,” he says, “but only recently have we been able to work out ways to connect and measure the impact of millions of these positions in our DNA, and to come up with a polygenic risk score”—which is much more powerful and accurate for predictions than, say, the profile I got back in 2001.
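
For readers who want to see the arithmetic, a polygenic risk score is commonly computed as a weighted sum: for each variant, the number of risk alleles a person carries (0, 1, or 2) is multiplied by an effect weight estimated from association studies, and the products are added up. The Python sketch below assumes that simple additive form. The variant names, weights, and function name are invented for illustration and are not taken from Genomics PLC or the UK Biobank, whose actual scoring methods are not described here.

```python
def polygenic_risk_score(genotype, weights):
    """Add up (risk-allele count x effect weight) for every variant that has a weight."""
    return sum(
        allele_count * weights[variant]
        for variant, allele_count in genotype.items()
        if variant in weights
    )


# Hypothetical effect weights; a real score may use thousands to millions of variants.
weights = {"variant_a": 0.10, "variant_b": 0.05, "variant_c": -0.02}

# Hypothetical person: number of risk alleles carried at each variant (0, 1, or 2).
person = {"variant_a": 2, "variant_b": 0, "variant_c": 1}

score = polygenic_risk_score(person, weights)
print(f"Polygenic risk score: {score:.2f}")
```

A raw score like this means little by itself. In practice it is usually compared with the distribution of scores in a reference population, which is one reason the makeup of that population matters.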

Since early this year, Donnelly’s company, working with Britain’s National Health Service, has been running a study called HEART (Healthcare Evaluation of Absolute Risk Testing), which is using UK Biobank participants and their data to test known genetic markers to pinpoint polygenic risk scores for heart disease and stroke. Testing people between ages 45 and 65, Genomics PLC is combining polygenic scores and more conventional risk factors like cholesterol levels, weight, and age to predict a person’s proclivity for heart disease years before they show symptoms. “These are people going about their daily lives who are at high risk of cardiovascular disease and are currently invisible to the NHS,” Donnelly told the Guardian in January. “We can find these people who are actually at quite high levels of risk but are not aware of it.”

“For men whose polygenic risk score is high, their lifetime risk of heart disease is between 40 percent and 45 percent,” he says, “which is more than a tenfold increased risk” compared to people with the lowest risk. Knowing this allows these men’s doctors to recommend changes in diet and exercise. “If it’s above a certain level,” he says, “they would talk to you about going on statins,” drugs that lower cholesterol and reduce the risk of heart attack. 

Donnelly’s company and other researchers working with the UK Biobank also recently released polygenic risk scores for 28 other diseases, including breast cancer, prostate cancer, and type 2 diabetes. “These tests only need to be done once,” says Donnelly, “which is a fantastically cost-effective way to aim for prevention.”  

Last month, the search for genetic markers took yet another leap forward when a DeCode team led by Kári Stefánsson identified a half billion SNPs and other variants coming from 150,000 people in the UK Biobank who had recently had their complete genomes sequenced.

Published in Nature, this study was an astonishing feat of molecular exploration even if finding such a massive crush of markers doesn’t mean scientists yet know what most of them do or how they will impact health, a process that may take years to sort out. The study was partially funded by four drug companies—Amgen, Johnson & Johnson, GSK, and AbbVie—who got first crack at the markers to use in drug development for a 9-month exclusive period before they became available to other researchers. 

The UK Biobank and other banks around the world—still mostly located in the West and East Asia, with a few programs operating in the developing world—are also beginning to incorporate another set of molecular markers beyond genomics to measure a person’s present and future state of health. These include what’s happening inside your body with proteins—biological molecules made by your cells with instructions from your DNA. Proteins include enzymes, hormones, antibodies, and structural elements in your skin, hair, fingernails, and much more. Researchers also are beginning to factor in everything from metabolites—the chemicals that the body creates when it breaks down food, drugs, and other chemicals—to the impact of the bacteria that make up the microbiome of your gut.

“We’re planning to include proteins and metabolites in future risk scores,” says Rory Collins of the UK Biobank, although the ability of scientists to find, measure, and make sense of this non-genetic data remains limited. 

Promises, promises

Yet even with all this progress and decades of promises, genomics and other molecular measurements remain mostly a research project that has yet to make a substantial dent in everyday health care—outside of rare cases. 

“I think the field achieved more in the last five years of the UK Biobank than even the most enthusiastic people had thought was possible when we started,” Peter Donnelly says, insisting that biobank-based risk assessments are poised to deliver in the clinic in the next 10-15 years. This sounds fine except that this timeframe is eerily similar to what Bill Clinton and scientists like Craig Venter were claiming 20-plus years ago for genetics to transform medicine, a transformation that continues to be elusive.

Other experts are more cautious. “Our biobank has fostered over 500 research projects,” says Beth Karlson, a rheumatologist and the director of the Mass General Brigham Biobank in Boston. “But people are not using genomic risk scores routinely in the clinic, or for prevention.” 

This doesn’t surprise her. “Science as complex as this takes time,” she says. “First, we had to get the science right, and to make the predictions meaningful. This used to be the biggest barrier. Now the science has gotten much better, which is shifting the debate more to implementation—to how to integrate genomics into a medical system that isn’t set up very well to use prediction and prevention. 

“Medical care is reactionary,” she continues. “Doctors react when a patient has a problem, and their focus is on trying to diagnose that problem and figure out what to do about the problem. The system isn’t set up very well to try to prevent future problems.” Karlson talks about the need to beef up “implementation science” that works to integrate new science and protocols into the clinic. “When you implement a new intervention,” she says, “you need to understand how doctors behave. Did they order the test? Did the patients follow through with the test? Was the test useful and something that everyone understood?”  

Karlson says that her team at Harvard and other U.S. biobanks are working with primary care doctors to see how to integrate new predictive tests and profiles being developed by her biobank and others. “Most of them are really excited about it, which is interesting,” she says, “especially if we do the integrative report for them, and it’s at their fingertips. Click on this box, see all the risk factors.”

“We don’t need to massively educate GPs,” adds Rory Collins, referring to General Practitioners in the United Kingdom. “They just need to know these tests are useful and cost effective. Right now, they use a crummy test for colon cancer—blood in stool and an expensive colonoscopy. What if we could rule out colon cancer for most people with a simple test and then identify the small number of people who need a colonoscopy?” 

“We are trying to get doctors to think of these genetic scores like they think of cholesterol levels,” Donnelly says, “that it’s all routine and part of the basic exam.”

This nagging complexity of human biology plus the need to get the science right also has kept the private sector from leaping into complex polygenic and other biobank-generated risk scores, although pharmaceutical companies have been working with biobanks for years to help them develop new drugs. A handful of companies like DeCode in Iceland, Donnelly’s Genomics PLC in the United Kingdom, and a New York-based company called Allelica have been developing products using polygenic risk profiles for customers in pharma and for health care systems like the National Health Service in the U.K. There are lots of companies working in the longevity space—like Richmond, California-based BioAge Labs—that are using the longitudinal data found in biobanks to develop treatments for diseases associated with aging. Most commercialization efforts, however, remain in early stages as the science continues to develop.

Too many Caucasians: the need for diversity

Another challenge with biobanks is diversity. DNA sequences inside the world’s biobanks remain overwhelmingly white and European. For instance, the UK Biobank has sequenced more than 500,000 people, but fewer than 30,000 are people of color. That’s about 6 percent of this cohort in a country that is made up of 14.4 percent ethnic minorities. In the U.S., sequenced genomes are also overwhelmingly Caucasian, while globally only two percent of the total genomes sequenced are African. Asian representation remains low, too, although robust efforts in China, Japan, and Korea are increasing numbers. Most other Asian populations lag behind, which adds to deficits in how we’re trying to understand the rich diversity of genes across our entire species. 
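
The UK figures above can be checked with a few lines of arithmetic. This Python sketch uses only the numbers quoted in this article. Because the article says “fewer than 30,000,” the result is an upper bound on the biobank share, and the variable names are my own.

```python
# Figures quoted in this article for the UK Biobank and the U.K. population.
biobank_total = 500_000
biobank_people_of_color = 30_000      # the article says "fewer than", so an upper bound
population_minority_share = 0.144     # 14.4 percent ethnic minorities

biobank_share = biobank_people_of_color / biobank_total
representation_ratio = biobank_share / population_minority_share

print(f"Share of the biobank cohort: {biobank_share:.1%}")
print(f"Relative to the population share: {representation_ratio:.2f}")
```

A ratio below 1 means the group makes up a smaller share of the biobank than of the country as a whole.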

An illustration of what we’re missing came out of a recent study from the Human Heredity and Health in Africa (H3Africa) initiative. Scientists there sequenced the complete genomes of 426 Africans across 50 ethnolinguistic groups, a tiny number compared to the millions of genomes in Western biobanks. Yet the project discovered more than 3 million previously unknown variants. These included markers associated with immune responses to viruses and those that help with gene repair. Other studies on African populations have helped elucidate genetic factors behind hearing impairment and schizophrenia and could contribute much more knowledge about the global gene pool that could benefit everyone.

Ambroise Wonkam is a medical geneticist and director of the McKusick-Nathans Institute at Johns Hopkins University. He is calling for a multi-billion-dollar effort to sequence three million African genomes over the next 10 years. (Credit: University of Cape Town)

“African genomes can reveal genes and variants that contribute to health and disease not found in previous, Eurocentric studies,” wrote geneticist Ambroise Wonkam of the University of Cape Town in a recent Nature commentary calling for a new multi-billion-dollar effort to sequence three million African genomes over the next 10 years.

“Studies in African genomes will also help to correct injustice,” says Wonkam, who heads the African Society of Human Genetics and led the H3Africa study. “Estimates of genetic risk scores for people of African descent that predict, say, the likelihood of cardiomyopathies or schizophrenia can be unreliable or even misleading using tools that work well in Europeans. To promote discovery and produce reliable clinical tools, genotyping and analysis must be re-optimized using genomes from more populations.”

An intimate future portrait of you?

Back in 2007, when I visited the nascent UK Biobank, I met with Tim Peakman, a young molecular biologist and then the biobank’s deputy chief executive. He told me with great enthusiasm the grand plans for the biobank and how it would transform health care. 

It was a cold day in February and snow was on the way as Peakman showed me rows of refrigerators in the UK Biobank’s converted textile mill. Some already contained two test tubes on ice for each of the first recruits’ blood samples: one that could be accessed anytime by researchers, and another “like a molecular time capsule to be studied in the future,” he said.

“The samples are the equivalent of a snapshot of where these people are when they give their samples,” said Peakman. “It will be fascinating to compare them to samples in the future.” At the time, that prompted my younger self to look around in the old mill, wondering what it would be like to have “samples from the poor textile workers who once toiled here. How would they compare with their descendants if we had their samples?”

Now, in 2022, we’re in the future for the people whose samples I saw that freezing day in 2007, even as the work continues to uncover what secrets their DNA back then has revealed about their state of health today—and also what the snapshots of those providing samples today will tell them about their future 15 years or more from now. 

This includes snapshots from the COVID-19 years, which may prove critical to understanding the next inevitable pandemic. That brings us back to the frantic moments early in the pandemic when Andrea Ganna and others established the COVID-19 Host Genetics Initiative. Two years later, the initiative has learned a great deal about the genetics of how the virus does its damage inside humans, which brings us closer to a clinical test but not yet over the finish line. This comes as biobanks are also moving closer to a time when our genes, combined with our lifestyle and health records, and someday with proteins, metabolites, and the rest, will truly let us peek inside our bodies and augur our future health.

“The data is much clearer,” says Rory Collins. “Now we need to make sure this data is available to prevent more people from dying of something like heart disease.”

After all the hype and promise over the years, let’s hope that this time it truly happens.

Author’s note: I’m currently co-authoring a book with Craig Venter, who is mentioned in this article, to be published in 2023 by Harvard University Press. The book project was paid for by the J. Craig Venter Institute in La Jolla, California.
