Within 10 years, archival data may be routinely stored in base pairs rather than bytes.
Each year, humanity is producing more and more data.
By 2003, humans had produced about five exabytes of data in total since the dawn of time. By 2010, we were producing that much in just two days. By 2025, our Google searches, text messages, and social media posts, along with things like clinical trials, NASA images, and experiments at the Large Hadron Collider, will be generating an estimated 175 zettabytes of information per year.
While not all of this data is necessarily important, there’s still an almost incomprehensibly large amount of data that we’ll want to preserve for the future. And current methods probably aren’t going to cut it: Old-fashioned magnetic tapes would take up enormous amounts of space and be difficult to maintain, and according to biomolecular engineers George Church and William Hughes, there simply isn’t enough microchip-grade silicon in the world to store all the data we’ll create over the next few decades alone.
But perhaps a more efficient and sustainable data storage medium has been inside us all along.
This isn’t a metaphor. DNA is an incredibly efficient way to store data: A single microscopic cell holds billions of base-pair “bits,” the entire set of instructions to make a whole human. While a computer stores information as ones and zeros, DNA stores information using four types of nucleotides: adenine, thymine, guanine, and cytosine, or A, T, G, and C for short. Our genes are simply runs of those bases, some long and some short.
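To make the comparison with ones and zeros concrete, here is a minimal sketch of how bytes could be mapped onto the four bases at two bits per nucleotide. The particular mapping (00 to A, 01 to C, and so on) is an assumption chosen for illustration, not the scheme used by any lab mentioned here; practical encodings also have to avoid sequences that are hard to synthesize or read.

```python
# Hypothetical mapping: two bits per nucleotide, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Invert encode(): read two bits per base and regroup them into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

Under this mapping every byte becomes four bases, so a strand of n bases carries at most 2n bits.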
Density is not the only advantage of DNA data storage, says Lee Organick, a PhD student in the Molecular Information Systems Lab at the University of Washington. Traditional magnetic tape needs to be rewritten every ten to twenty years. DNA, on the other hand, “lasts one hundred, two hundred years, or even longer; it really depends on how you store it,” she says.
Indeed, according to Emily Leproust, the CEO and founder of the San Francisco-based synthetic DNA company Twist Bioscience, scientists have recently found mammoth DNA that has remained at least partially intact for a million years.
Perhaps even more important for preserving information for centuries or even millennia is that DNA is always going to be around. One day, we might not have machines capable of reading VHS tapes or CDs or even today’s hard drives. But the same thing probably won’t happen with DNA. “DNA is what we call eternally relevant,” says Organick. “So, we’re always going to be interested in reading and writing DNA as long as life on Earth is DNA-based.”
Warehouse in a sugar cube
The idea that DNA could be used for information storage has been around for a surprisingly long time—mathematician Norbert Wiener discussed it in an interview in 1964, barely a decade after scientists had figured out the structure of DNA. But it’s only in the last few years that technological advancements have begun to make DNA a feasible solution for the world’s ever-expanding data storage demands.
DNA data storage works like this: Specific nucleotide sequences corresponding to specific pieces of information are written into stable strands, then stored in a cell or on a chip; later, the sequences can be read out and the information decoded.
While this technology clearly has enormous potential, there are still many practical questions about the best ways to encode, store, and access DNA-based data. Besides that, costs will need to be substantially reduced before this storage method can compete with more traditional data storage.
In theory, scientists know that DNA should be able to efficiently store massive amounts of data. But traditional DNA synthesis is prone to error, so at least for now, scientists will need to store multiple copies of each piece of information or have some other form of redundancy. Exactly how many copies is an essential question: If they don’t store enough copies, important information could be lost, but if they store too many copies, storage efficiency would be greatly reduced.
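The trade-off between too few and too many copies can be explored with a toy simulation. The sketch below is not the method used in any study described here: it assumes errors are simple substitutions at a fixed, made-up rate and recovers the original by majority vote at each position. Real synthesis and sequencing also insert and delete bases, which this model ignores.

```python
import random
from collections import Counter

BASES = "ACGT"


def noisy_copy(strand, error_rate, rng):
    """Return a copy in which each base is substituted with probability error_rate."""
    out = []
    for base in strand:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(copies):
    """Majority vote at each position across equal-length copies."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


def recovery_rate(strand, n_copies, error_rate, trials=200, seed=0):
    """Fraction of trials in which the consensus of n_copies equals the original."""
    rng = random.Random(seed)
    recovered = 0
    for _ in range(trials):
        copies = [noisy_copy(strand, error_rate, rng) for _ in range(n_copies)]
        recovered += consensus(copies) == strand
    return recovered / trials


original = "ACGTTGCAACGTTGCAACGT"
for n in (1, 3, 5, 10):
    print(n, recovery_rate(original, n, error_rate=0.05))
```

Running it with different copy counts shows the shape of the problem: more copies make recovery more reliable, but each extra copy is storage spent on redundancy rather than on new data.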
Organick and her colleagues at the University of Washington and Microsoft undertook an experiment to determine how many copies were needed to address the redundancy problem and whether a single file (or group of DNA sequences) could be efficiently retrieved from among billions of other DNA sequences.
The researchers used a common laboratory technique—polymerase chain reaction (PCR)—in which DNA primers are used to define the start and finish of specific chunks of DNA. By using the correct primers, researchers could make many, many copies of the piece of DNA that contained the data they wanted, allowing them to easily find it among many other sequences. A similar PCR technique is used by forensics investigators to amplify the appropriate regions in a tiny DNA sample from a crime scene to allow them to build a DNA profile of the suspect. This study also showed that only 10 copies of the DNA strand of interest were needed for this technique to work.
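A rough way to picture primer-based retrieval is as an address lookup: every strand belonging to a file carries the same pair of primer sequences at its ends, and only strands with the requested pair get copied. The sketch below is a deliberately simplified model. The file names, primer sequences, and the 1 percent abundance cutoff are all hypothetical, and real primers bind to complementary sequences rather than matching a literal prefix and suffix.

```python
from collections import Counter

# Hypothetical primer pairs, one pair per stored file.
PRIMERS = {
    "cat.jpg": ("ACGTAC", "TTGACC"),
    "notes.txt": ("GGATCA", "CATGGA"),
}


def make_pool(files, copies=10):
    """Mix all files into one pool; each strand is primer + payload + primer."""
    pool = Counter()
    for name, chunks in files.items():
        fwd, rev = PRIMERS[name]
        for chunk in chunks:
            pool[fwd + chunk + rev] += copies
    return pool


def amplify(pool, fwd, rev, cycles=10):
    """Toy PCR: strands flanked by both primers double each cycle, others do not."""
    return Counter({
        strand: count * 2 ** cycles
        if strand.startswith(fwd) and strand.endswith(rev) else count
        for strand, count in pool.items()
    })


def retrieve(pool, name, cycles=10, min_share=0.01):
    """Amplify one file, then keep only strands abundant enough to be read."""
    fwd, rev = PRIMERS[name]
    amplified = amplify(pool, fwd, rev, cycles)
    total = sum(amplified.values())
    abundant = [s for s, n in amplified.items() if n / total > min_share]
    return sorted(s[len(fwd):len(s) - len(rev)] for s in abundant)


pool = make_pool({"cat.jpg": ["AAAA", "CCCC"], "notes.txt": ["GGGG"]})
print(retrieve(pool, "cat.jpg"))    # ['AAAA', 'CCCC']
print(retrieve(pool, "notes.txt"))  # ['GGGG']
```

The starting pool holds 10 copies of each strand, echoing the figure reported in the study; after ten doublings the requested strands outnumber everything else by about a thousand to one.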
So what does this mean? DNA-based data can be stored even more efficiently than was previously shown. According to this research, DNA can store data at a density of 17 exabytes of information per gram of material. That’s 17 followed by 18 zeros, counted in bytes. At that density, you could store the entire 175 zettabytes of data that humanity is projected to produce in 2025 in something the size of a case of beer.
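The case-of-beer comparison follows from simple division, sketched below using only the two figures quoted in this article and decimal units (an exabyte as 10^18 bytes, a zettabyte as 10^21 bytes). The function name is made up for illustration.

```python
EXABYTE = 10 ** 18
ZETTABYTE = 10 ** 21


def grams_needed(data_bytes, bytes_per_gram):
    """Mass of DNA required to hold data_bytes at the given storage density."""
    return data_bytes / bytes_per_gram


density = 17 * EXABYTE          # density reported by the study
yearly_data = 175 * ZETTABYTE   # projected data produced in 2025

grams = grams_needed(yearly_data, density)
print(f"{grams / 1000:.1f} kg of DNA")  # 10.3 kg of DNA
```

That works out to roughly ten kilograms of material for a full year of the world's data.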
This study confirms that DNA storage could save massive amounts of space, Organick says. “If you were to take a warehouse full of magnetic tape—which right now is the industry standard—you could shrink that down to the size of a sugar cube.”
DNA punch cards and bacterial memory
One of the most intense areas of research in DNA information storage concerns how to create the DNA itself. We’re already able to assemble the basic building blocks (A, T, C, and G) into strands of information, but the process is relatively slow and extremely expensive. It also produces toxic by-products, and the error rate means that longer strands of DNA can’t always be accurately synthesized, says Richie Kohman, lead of the Synthetic Biology Platform at Harvard’s Wyss Institute.
Kohman and others at Harvard are trying to solve some of these issues using a new method of DNA synthesis that uses light-activated enzymes. The researchers flood the surface of a chip with one of the four DNA building blocks, which becomes attached to a DNA primer on the chip’s surface. Then, they use precisely targeted beams of UV light to activate the enzymes at specific locations on the chip, with each location representing one strand of DNA that is being constructed. At those locations, the activated enzymes go to work adding another building block to the strand. Then the light is turned off, the enzymes become inactive, and a new type of building block is poured onto the chip. The process repeats, and another set of building blocks is added to the growing chains in precise locations. This process doesn’t use harsh solvents, and multiple strands of DNA can be synthesized simultaneously. In a proof-of-concept study, the Harvard researchers synthesized 12 strands of DNA simultaneously on one chip, and Kohman says future work could involve thousands of strands on a single chip.
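The flood-and-illuminate cycle can be sketched as a small simulation. This is a toy model of the idea as described above, not the Harvard team's actual protocol: it assumes the four building blocks are flooded in a fixed repeating order (A, C, G, T) and that a spot is lit only when the flooded base is the next one its strand needs.

```python
from itertools import cycle


def synthesize(targets, flood_order="ACGT", max_cycles=1000):
    """Grow one strand per chip location, one flooded base per cycle.

    Returns the finished strands and a log of (flooded base, lit locations).
    """
    strands = ["" for _ in targets]
    log = []
    for _, base in zip(range(max_cycles), cycle(flood_order)):
        if strands == list(targets):
            break
        # Light only the locations whose next required base is the flooded one.
        lit = [
            i for i, target in enumerate(targets)
            if len(strands[i]) < len(target) and target[len(strands[i])] == base
        ]
        for i in lit:
            strands[i] += base
        log.append((base, lit))
    return strands, log


strands, log = synthesize(["ACG", "CAT"])
print(strands)   # ['ACG', 'CAT']
print(len(log))  # 8
```

Because every location shares the same flood cycles, adding more locations adds strands without adding cycles, which is why building strands in parallel on one chip is attractive.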
But this is just one of many potential methods to improve ways to store information in DNA. A team at the University of Illinois is trying to lower costs using a “punch card” system that stores information by creating nicks in already-existing DNA strands instead of having to create synthetic DNA from scratch. In this system, 1’s and 0’s are stored in predetermined locations on a strand of DNA as either a tiny notch cut by a special enzyme (a “1”) or no notch (a “0”). Later, these notches can be read using commercial DNA sequencing technology and converted back to 1’s and 0’s.
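In code, the punch-card idea amounts to a mapping between a bit string and a set of nicked positions. The sketch below assumes a made-up list of site coordinates along the strand; which positions can actually be nicked depends on the enzyme and the DNA sequence, details the article does not give.

```python
def write_nicks(bits, sites):
    """Return the set of strand positions to nick: one site per '1' bit."""
    if len(bits) > len(sites):
        raise ValueError("not enough nicking sites for this many bits")
    return {site for bit, site in zip(bits, sites) if bit == "1"}


def read_nicks(nicked, sites, n_bits):
    """Recover the bit string by checking each predetermined site for a nick."""
    return "".join("1" if site in nicked else "0" for site in sites[:n_bits])


# Hypothetical nicking sites, spaced every 50 bases along an existing strand.
sites = list(range(50, 450, 50))

nicked = write_nicks("10110010", sites)
print(sorted(nicked))                 # [50, 150, 200, 350]
print(read_nicks(nicked, sites, 8))   # 10110010
```

The strand itself is never rewritten, only marked, which is the source of the hoped-for cost savings over synthesizing DNA from scratch.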
Startups like Twist Bioscience are also competing to synthesize and store DNA more cheaply and efficiently. Twist CEO Leproust says the company is working on fitting more strands of DNA on a single chip, shrinking the space between the active sites where the strands of DNA are built. Since strands on the same chip can be built simultaneously, increasing the number of strands on the chip can substantially decrease the cost of the DNA synthesis process. “Right now, we are working on the one micron [one thousandth of a millimeter] chip, which is almost there,” she says. “Our next chip, when we get to 150 nanometers [between active sites], that one would be price-competitive with hard drives and tape.”
While many researchers are planning to store data-containing DNA on chips, others are exploring the potential of storing it within the genomes of living organisms. Harris Wang, a systems biology professor at Columbia University, has created a system to encode data directly into the DNA of living bacterial cells using CRISPR. Although this would result in less-dense data storage, since there’s only so much extraneous DNA you can add to a cell’s genome, it would also make it very easy to make copies of the data because frozen or freeze-dried cells could be pulled out and allowed to copy themselves many, many times over. Furthermore, some bacteria can form spores which are highly resistant to heat, radiation, and chemicals, meaning that data stored in particular types of bacterial DNA could be much less vulnerable than data stored by synthetic DNA on its own. Frozen spores could last more or less indefinitely.
Wang says that in the future, using cells to store data could also help reduce the costs of writing information into DNA, because cells are already equipped to make all of the enzymes and raw materials needed for DNA synthesis. They just need to be given food and the appropriate growth conditions.
One day, Wang says, living cells could also actually record information about their environment in their DNA. One especially interesting environment could be the human body. Although researchers are still figuring out exactly how to do this, it’s possible that a cell’s natural abilities to sense certain chemicals, for example, could be hooked up to some type of CRISPR system which would modify the cell’s DNA in specific patterns whenever the cell sensed a given chemical. This would allow living cells to record information about what’s happening in the body in real time, and write it into their DNA so that this information could be read later.
While there are many advantages to DNA-based data storage, the cost of synthesizing and reading DNA remains prohibitively expensive for widespread use. But the cost is steadily dropping as researchers come up with new methods to write and store DNA-based data, and it may not be very long before this technique becomes a standard form of storage. “I think in ten years for sure all the archiving is going to be done in DNA,” says Leproust.