Tikalon Header Blog Logo

DNA Data Storage

February 8, 2013

Possibly the first data storage medium most of my generation encountered was the scantron optical answer sheet, that notorious method for transforming a student's pencil marks into failing grades. Today, children are born into a world replete with embedded memory chips, but it's not likely they will ever see the ones and zeros of their data in these. A scantron sheet clearly shows the data in human-readable form, but today's memory media have moved to a higher level of abstraction.

In the early days of computing, much data existed on optical media similar to the scantron sheet. There were punched cards and punched tape, both of which had a history before computers. Punched cards were used to control patterns in textile manufacture, starting with the Jacquard loom. Punched tape was used to prepare messages for teleprinting.

Five hole paper tape, used in teleprinters, and eight hole paper tape, used in early computer systems.

(Photo by Ted Coles, modified, via Wikimedia Commons.)


The venerable nine-track magnetic tape digital storage medium has been with us since 1964, and it proved useful through the start of this century. Magnetic tape is still the most cost-effective means of storing large amounts of archival data, but now such tape is in high capacity cartridges and not on reels.

The commonality between paper tape and magnetic tape data storage is that data are stored serially, along a string. Our DNA similarly stores our genetic data on a string. As I wrote in a previous article (DNA Extended Character Set, June 26, 2012), it's possible to assign a binary one to a certain base pair, and a binary zero to the other. Human chromosomes have between 51 million and 245 million base pairs. At one bit per base pair, a hundred million base pairs could encode about 12 megabytes, or just enough to store about a hundred of these blog articles.
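The capacity estimate is simple arithmetic, and a few lines of Python make it explicit. This is only a sketch under the one-bit-per-base-pair assumption stated above; the function name and the choice of decimal megabytes are mine.

```python
# Assumption: one bit per base pair, as in the simple coding described in the text.
BITS_PER_BASE_PAIR = 1


def capacity_bytes(base_pairs: int, bits_per_bp: int = BITS_PER_BASE_PAIR) -> float:
    """Return the raw capacity, in bytes, of a DNA string of the given length."""
    return base_pairs * bits_per_bp / 8


# Smallest human chromosome, a round hundred million, and the largest chromosome.
for bp in (51_000_000, 100_000_000, 245_000_000):
    megabytes = capacity_bytes(bp) / 1e6  # decimal megabytes
    print(f"{bp:>11,} base pairs -> {megabytes:.1f} MB")
```

A hundred million base pairs works out to 12.5 million bytes on this assumption, so the figure in the text is the right order of magnitude.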

DNA synthesis is still an expensive process, with a cost of about forty cents per base pair for short sequences.[1] This means that writing a DNA archive would be expensive. The reading process is much less expensive. You can read (sequence) a million base pairs for about ten cents.[2] I reviewed some of the technologies driving DNA sequencing to even lower cost in another article (Full Genome Sequencing, June 7, 2012).
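To see how lopsided the write and read costs are, here is a rough comparison using the two prices quoted above. Scaling the short-sequence synthesis price linearly to a large archive is my own simplifying assumption, and the function names are just for illustration.

```python
# Prices quoted in the text (references 1 and 2).
SYNTHESIS_DOLLARS_PER_BP = 0.40           # writing, per base pair
SEQUENCING_DOLLARS_PER_MILLION_BP = 0.10  # reading, per million base pairs


def write_cost(base_pairs: int) -> float:
    """Estimated synthesis cost in dollars, assuming linear scaling."""
    return base_pairs * SYNTHESIS_DOLLARS_PER_BP


def read_cost(base_pairs: int) -> float:
    """Estimated sequencing cost in dollars, assuming linear scaling."""
    return base_pairs / 1_000_000 * SEQUENCING_DOLLARS_PER_MILLION_BP


n = 1_000_000  # one million base pairs
print(f"write: ${write_cost(n):,.2f}")
print(f"read:  ${read_cost(n):,.2f}")
print(f"write/read ratio: {write_cost(n) / read_cost(n):,.0f}")
```

On these numbers, writing costs several million times more than reading, which is why the synthesis price dominates any estimate of archive cost.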

As the Large Hadron Collider has shown, cost is not that much of an impediment to basic scientific research. In 2012, a team of scientists at Harvard Medical School, the Wyss Institute for Biologically Inspired Engineering and Johns Hopkins University wrote 5.27 megabits of data onto DNA, and they were able to successfully read it back.[3] Part of these data was a copy of a book authored by one of these scientists and nicely available in a digital format.[4]

The pioneering Harvard work had two problems: the simple technique used was not scalable, and it had no internal error-correction coding.[5] Now, a team from the European Bioinformatics Institute (EBI, Hinxton, UK) and Agilent Technologies (Santa Clara, California) has published an improved data coding scheme that addresses these issues, and also the cost issue.[5-15]

In the tradition of much important science, the idea started with two men, Ewan Birney and Nick Goldman of EBI, in a pub.[14] Says Goldman, as quoted in The Guardian, "We wrote on napkins and sketched out details, and realized we could probably do this."[12] In the end, they proved their method by the successful storage of such things as a PDF file of Watson and Crick's famous DNA paper and a text file of all 154 Shakespeare sonnets.[6]

(DNA double helix molecular model, still image from an animation, via Wikimedia Commons.)


One problem that arises in trying to encode information in DNA is that long stretches of any single base are not synthesized properly using today's techniques. To solve this problem, the European team used just three of the four available bases, and they converted their data into ternary code.[16] The fourth base, in their case guanine (G), was used to indicate that a quartet composed of three of another base plus guanine represented four of the other base. Thus, the quartet, "TTGT," in which T is thymine, represents TTTT.[10]
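A short sketch shows why a naive ternary mapping needs some kind of escape for repeated bases. This is not the EBI team's encoder: the fixed six trits per byte, the A/C/T mapping, and the function names are all my own assumptions, and the guanine escape rule itself is not implemented here.

```python
from typing import List

TRITS_PER_BYTE = 6      # 3**6 = 729 covers all 256 byte values; 3**5 = 243 does not
TRIT_TO_BASE = "ACT"    # hypothetical mapping; G is left unused, as in the text


def bytes_to_trits(data: bytes) -> List[int]:
    """Convert each byte to six base-3 digits, most significant first."""
    trits: List[int] = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            byte, remainder = divmod(byte, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))
    return trits


def trits_to_bytes(trits: List[int]) -> bytes:
    """Inverse of bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def longest_run(sequence: str) -> int:
    """Length of the longest stretch of one repeated base."""
    best = run = 0
    previous = None
    for base in sequence:
        run = run + 1 if base == previous else 1
        best = max(best, run)
        previous = base
    return best


# A single zero byte becomes six identical bases under the naive mapping.
bases = "".join(TRIT_TO_BASE[t] for t in bytes_to_trits(b"\x00"))
print(bases, longest_run(bases))  # AAAAAA 6
```

Runs like this are the stretches that today's synthesis handles poorly, and they are what the reserved fourth base is meant to break up.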

Short sequences of DNA were needed to keep down the cost, so the coding was done by having indexing information (the location of the data snippet within the complete message) in each sequence. Data were overlapped on sequences as a form of error correction, and an error would have to exist on four different sequence snippets to cause an error in the encoded message.[9] Hundred-base DNA sequences were staggered so that consecutive snippets had a 75-base overlap. Even then, there were two 25-base gaps when the data of the complete message were read.[10]
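The staggering described above can be sketched as a sliding window: 100-base snippets starting every 25 bases, each tagged with its starting offset as the index. The majority-vote reassembly is my own stand-in for the team's decoding, which the sources here don't detail, and all names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

FRAGMENT_LEN = 100
STEP = 25  # a 100-base window moved by 25 bases gives a 75-base overlap


def fragment(message: str) -> List[Tuple[int, str]]:
    """Split a message into indexed, overlapping snippets (offset, bases)."""
    last_start = max(len(message) - FRAGMENT_LEN, 0)
    starts = list(range(0, last_start + 1, STEP))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail is covered
    return [(s, message[s:s + FRAGMENT_LEN]) for s in starts]


def coverage(message_len: int, fragments: List[Tuple[int, str]]) -> List[int]:
    """Count how many snippets cover each position of the message."""
    counts = [0] * message_len
    for start, bases in fragments:
        for i in range(start, start + len(bases)):
            counts[i] += 1
    return counts


def reassemble(message_len: int, fragments: List[Tuple[int, str]]) -> str:
    """Rebuild the message by majority vote at each position.

    Assumes every position is covered by at least one snippet.
    """
    votes = [Counter() for _ in range(message_len)]
    for start, bases in fragments:
        for i, base in enumerate(bases):
            votes[start + i][base] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)


message = "ACT" * 100                 # a 300-base test message
snippets = fragment(message)
print(len(snippets), "snippets")
print("copies of base 150:", coverage(len(message), snippets)[150])
```

In this sketch the interior positions are covered four times, while the first and last 75 bases of the message are covered fewer times, so a real scheme would need to treat the ends separately.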

How safe is it to have such rogue DNA in the environment? Goldman is quoted by the BBC and The Guardian as saying,
"The DNA we've created can't be incorporated accidentally into a genome, it uses a completely different code to that used by the cells of living bodies. If you did end up with any of this DNA inside you it would just be degraded and disposed of."[11-12]

Nick Goldman of EMBL-EBI, looking at synthesized DNA.

(Image by EMBL Photolab.)


If the cost can be reduced, the DNA approach may be viable for very long-term data storage. The Harvard team had a data density of 700 terabits per gram, which is more than six orders of magnitude greater than hard drive storage, and the EBI team increased that to 2.2 petabytes per gram.[7] According to the EBI team, today's synthesis prices make the approach viable for things you would want to store for 600 years. If the cost could be decreased a hundred times, you could make a viable fifty-year storage medium.[7]
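The two densities are quoted in different units, terabits and petabytes, so a quick conversion helps in comparing them. This assumes decimal prefixes (1 petabyte = 1,000 terabytes) and eight bits per byte; the variable names are mine.

```python
# Densities as quoted in the text (reference 7).
HARVARD_TERABITS_PER_GRAM = 700
EBI_PETABYTES_PER_GRAM = 2.2

# Convert both to terabytes per gram (decimal prefixes assumed).
harvard_tbytes_per_gram = HARVARD_TERABITS_PER_GRAM / 8
ebi_tbytes_per_gram = EBI_PETABYTES_PER_GRAM * 1000

improvement = ebi_tbytes_per_gram / harvard_tbytes_per_gram

print(f"Harvard: {harvard_tbytes_per_gram:.1f} TB per gram")
print(f"EBI:     {ebi_tbytes_per_gram:.0f} TB per gram")
print(f"Ratio:   about {improvement:.0f}x")
```

On these figures, the EBI density is roughly 25 times the Harvard one once both are expressed in bytes.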

One problem is that the data would not be random access. You would need to sequence an entire phial of DNA to get any part of an archival record.[7]

References:

  1. Keith Robison, "Will Cheap Gene Synthesis Squelch Cheaper Gene Synthesis?" Omics! Omics! Blog, February 20, 2011.
  2. DNA Sequencing Costs, NHGRI Genome Sequencing Program.
  3. George M. Church, Yuan Gao and Sriram Kosuri, "Next-Generation Digital Information Storage in DNA," Science, Vol. 337 no. 6102 (September 28, 2012), p. 1628.
  4. Ed Regis and George M. Church, "Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves," Basic Books, October 2, 2012, 306 pages (Via Amazon).
  5. Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos and Ewan Birney, "Towards practical, high-capacity, low-maintenance information storage in synthesized DNA," Nature (January 23, 2013), doi:10.1038/nature11875.
  6. EMBL-EBI researchers make DNA storage a reality, European Molecular Biology Laboratory Press Release, January 23, 2013.
  7. Robert F. Service, "Half a Million DVDs in Your DNA," Science Now, January 23, 2013.
  8. Makiko Kitamura, "DNA Method May Enable Storage of All World's Data," Bloomberg News/Business Week, January 23, 2013.
  9. Richard Chirgwin, "Squillions of bytes in one cup of DNA," The Register (UK), January 23, 2013.
  10. John Timmer, "MP3 files written as DNA with storage density of 2.2 petabytes per gram," Ars Technica, January 23, 2013.
  11. Jonathan Amos, "DNA 'perfect for digital storage'," BBC News, January 23, 2013.
  12. Ian Sample, "Shakespeare and Martin Luther King demonstrate potential of DNA storage," The Guardian (UK), January 23, 2013.
  13. Nick Collins, "Computer files stored accurately on DNA in new breakthrough," Telegraph (UK), January 23, 2013.
  14. Adam Cole, "Shall I Encode Thee In DNA? Sonnets Stored On Double Helix," NPR Morning Edition, January 24, 2013.
  15. Rachel Ehrenberg, "DNA stores poems, a photo and a speech - Scientists store and then retrieve 750 kilobytes of data in DNA," Science News, January 23, 2013.
  16. I used ternary code in a simple circuit I published a while ago. Dev Gualtieri, "Expand Microcontroller Input Capacity Using Ternary Logic," Electronic Design, Sep. 19, 2011.
