Microsoft makes progress in fast DNA data storage-TechRepublic

2021-12-11 02:15:17 By : Ms. Hanny Li

It won’t take long for you to write megabytes of data per second on synthetic DNA, which can be read for thousands of years.

IDC predicts that not all of the 9 zettabytes of data storage that will be needed by 2024 contains information that needs to be stored for a long time; IoT sensor readings and application performance telemetry may not be useful enough to be retained for decades. But in the business and scientific fields, there is indeed a need to archive a large number of data sets, whether it is information flow from the Large Hadron Collider or pension data (according to British law, everyone in a pension plan must keep these for their lifetime. data) ). 

See also: Metaverse cheat sheet: everything you need to know (free PDF) (TechRepublic)

In 2020, GitHub will store 21TB of data together with manuscripts scanned from the Vatican Apostle Library in the Arctic Code Library, and use the PIQL digital storage system to print the QR code of the compressed data onto the film strip, which can still be read hundreds of years later This is much longer than the life of tape archives. Tape archives need to be rewritten about every 30 years, but if you really want long-term storage, what about the molecule-DNA-that has stored information for thousands of years? Convert EB's data to one cubic inch? These nine zettabytes (plus the equipment to read and write them) can be placed in a data center rack instead of a room full of tapes.

We already have equipment for synthesizing, copying, and reading DNA for gene sequencing and scientific research (we will not stop doing this, so the technology for reading DNA will not become obsolete in a few hundred years). Karin Strauss, senior principal research manager at Microsoft, said: "Using DNA allows us to take advantage of the ecosystem that has existed and will exist for a long time."

However, using DNA to store data requires some additional steps. First, the coding software converts the common 1 and 0 of digital files into the four bases (A, C, T, and G) in DNA, and creates the correct bases Sequence of the DNA strand. 

When you are ready to read the information, the DNA sequencer will transcribe the base sequence in the DNA strand, and the decoding software will convert it back to bytes.   

  The working principle of reading and writing data using DNA.

In order to be able to write data into DNA fast enough for use, DNA storage technology needs to process at least kilobytes of data per second, preferably megabytes, which means you need to be able to write multiple DNA strands at once . As with the CPU, the key to increasing speed and reducing costs is the parallelism that packs more functions into the same space. 

Microsoft senior researcher Bichlien Nguyen said: “We can think of four DNA bases as these small building blocks, and you can add them chemically.” “In DNA synthesis, there is a surface composed of a series of dots. The point is where you add A, C, T, and G in a specific order to make them produce DNA polymers." 

How many DNA synthesis points you can pack without interfering with each other determines how many DNA strands you can construct at the same time (and you need to make multiple copies of each strand for redundancy). To put a new base into the DNA strand, you must first add the base, then use the acid to prepare the chain for the next base, and you don't want the base or acid to go into the wrong place.

Previous methods used tiny mirrors or light patterns (called photomasks) instead of acid or sprayed droplets of acid like ink in an inkjet printer. Another lesson learned from the CPU, Microsoft Research (in collaboration with the University of Washington) is using a series of electrodes in micro glass wells, each well is surrounded by cathodes to create spots for DNA growth and arrange them closely in Together a thousand times.

"What really matters is the distance or spacing between these points, and the size of these points," Nguyen said. "We did reduce the size of the spots from about 20 microns to 650 nanometers. And we also reduced the spacing between them to 2 microns. This allows us to fill as many different spots as possible. We can grow different and unique DNA strand."

Applying a voltage produces acid at the anode, which prepares the DNA strand to join the next base while releasing the correct base to add to the strand at the cathode. If any acid does overflow from one glass well, it will flow into the alkali produced by the cathode and cannot reach the other well.

See: Artificial Intelligence Ethics Policy (TechRepublic Premium)

This is essentially a molecular controller and DNA writer on the chip, and is equipped with a PCIe interface. Microsoft has its work, although it is currently a proof of concept, and uses it to build four strands of synthetic DNA at a time, storing a version of the company's mission statement: "Let everyone store more!"

As a proof of concept rather than off-the-shelf hardware, the micro DNA writing mechanism is now generating 100-base-long strands. Longer chains show more errors, but with the development of hardware, this can be improved, perhaps by making the delivery of reagent fluids more complicated. 

DNA data storage does not need to be completely error-free, just like current storage systems. There are multiple levels of redundancy built-in. The first is the growth of multiple copies of DNA, which Strauss calls physical redundancy: "We are making many molecules that encode the same information." There are also built-in error correction functions, Using logical redundancy, she says that this is roughly the same as the cost of error correction memory: "For example, if all DNA copies made in the same place have errors, then you can correct it."

"This work is about making the spots smaller. The smaller the spots you make, the fewer copies you have. However, we are still at the scale where we have many copies of DNA, so this is not a problem. In the future, you may end up with We get several copies of DNA, but we think there is still considerable room to reduce the size of this part and still maintain minimal redundancy."

Using proof-of-concept hardware, the write speed is equivalent to 2KB/sec. "We can expand the scale by creating more such arrays, or we can further reduce the spacing and size," Nguyen said.

In the future, Microsoft plans to add logic to control millions of electrode points, using the same 130nm process node used to build the system. This is what chip makers used 20 years ago. The shift to smaller, more modern processes will mean that the array can be expanded to billions of electrodes and megabytes per second of data storage; performance and cost are closer to tape storage. 

"We can make more blocks of the same size with higher write throughput," Strauss added. "In order to do this, you either make smaller dots and place more dots in the same area, or increase the area. The area is proportional to the cost. So the more you install, the lower the cost. You're basically The above is to amortize all costs through more DNA fragments."

So far, Microsoft has been optimizing the bandwidth for writing DNA data. She said this is a more important measure, but there are also plans to improve read latency.

"We believe that DNA storage is good for archive storage and storage in the cloud, at least in the initial stage. For writing, latency is not that important, because you can buffer information in electronic systems and write them in batches, just like we As you do here, as long as the throughput can keep up with the amount of information you store, it doesn’t matter how long it takes to write."

When you read back the DNA, the delay will affect how long you have to wait to get the information. Current DNA sequencing technology is also based on batch reading of DNA. "This has high latency, but we are seeing the development of real-time nanopore readers," Strauss said, which will speed up the process.

Microsoft also plans to study the chemical properties of solvents and reagents used with DNA, which are now fossil-based. Switching to enzymes (this is the way to build and read DNA in animals and plants) will be more environmentally sustainable and will also speed up the chemical reactions that actually build DNA strands. Nguyen said: "The enzymatic reaction takes place much faster than the current chemical process."

The ability to use electronic devices to control such molecules is an exciting technology, and it can also be used in many other fields beyond storage—from screening new drug treatments and finding disease biomarkers to detecting environmental pollutants—and much more. Secondary use may bring costs down through economies of scale. 

There are more than 40 companies in the DNA Data Storage Alliance, including familiar drive manufacturers (such as Seagate and Western Digital), tape experts (such as Quantum and Spectra Logic), and biological science organizations. Strauss warned that the production system for DNA storage still has some way to go. "There are still quite a few projects that need to enter the commercial system to reduce the error rate, make the system more automated and integrated, and so on."

But the research published by Microsoft here shows that large-scale commercial DNA data archives look very feasible.

Read these Windows and Office tips, tricks, and cheat sheets to become a Microsoft insider in your company. Ship Monday and Wednesday

Mary Branscombe is a freelance technology journalist. Mary has been engaged in technical writing for nearly 20 years, covering everything from early versions of Windows and Office to the first smartphone, the emergence of the Internet, and most of the content in between.

Oracle Linux Checklist: What to do after installation

Recruitment Toolkit: Blockchain Integration Expert

Android 12 tips: practical tips to make the most of the Google mobile operating system

Recruitment Toolkit: Computer Vision Engineer