Expected Recovery Time in DNA-based Distributed Storage Systems

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We initiate the study of DNA-based distributed storage systems, where information is encoded across multiple DNA data storage containers to achieve robustness against container failures. In this setting, data are distributed over $M$ containers, and the objective is to guarantee that the contents of any failed container can be reliably reconstructed from the surviving ones. Unlike classical distributed storage systems, DNA data storage containers are fundamentally constrained by sequencing technology, since each read operation yields the content of a strand sampled uniformly at random from the container. Within this framework, we consider several erasure-correcting codes and analyze the expected recovery time of the data stored in a failed container. Our results are obtained by analyzing generalized versions of the classical Coupon Collector’s Problem, which may be of independent interest.

💡 Research Summary

The paper introduces a theoretical framework for DNA‑based distributed storage systems (DNA‑DSS) in which data are spread across M independent DNA containers, each holding n distinct oligonucleotide strands. Unlike classical distributed storage, a read operation on a DNA container returns a single strand chosen uniformly at random, reflecting the nature of high‑throughput sequencing. The authors model the recovery of a failed container as a stochastic process: at each time step, every surviving container yields one random strand, and recovery is achieved once enough distinct strands have been observed to reconstruct the missing data.
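The random-read model above can be illustrated with the classical Coupon Collector's Problem, which the abstract names as the starting point of the analysis. The following is a minimal sketch, not the paper's generalized process: it assumes a single container with n distinct strands, reads drawn uniformly at random with replacement, and "recovery" meaning that every strand has been seen at least once. The function names (`expected_reads`, `simulate_reads`, `average_reads`) are hypothetical and chosen only for this illustration. For this baseline the expected number of reads is the well-known value n · H_n, where H_n is the n-th harmonic number.

```python
import random


def harmonic(n: int) -> float:
    """Return the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def expected_reads(n: int) -> float:
    """Closed-form expectation n * H_n for the classical coupon collector."""
    return n * harmonic(n)


def simulate_reads(n: int, rng: random.Random) -> int:
    """Count uniform random reads until all n strands have been observed.

    Assumption for this sketch: strands are labeled 0..n-1 and each read
    returns one label uniformly at random, independently of earlier reads.
    """
    seen = set()
    reads = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # one sequencing read
        reads += 1
    return reads


def average_reads(n: int, trials: int, seed: int = 0) -> float:
    """Average the simulated read count over a number of independent trials."""
    rng = random.Random(seed)
    return sum(simulate_reads(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n = 20  # illustrative container size, not a value from the paper
    print("closed form n * H_n:", expected_reads(n))
    print("simulated average:  ", average_reads(n, trials=2000))
```

Running the script prints the closed-form value next to a Monte Carlo estimate, and the two should agree up to sampling noise. The multi-container setting described in the summary changes what counts as "enough distinct strands", so this baseline should be read only as intuition for why random sampling makes recovery time a coupon-collector-type quantity.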

Two families of erasure‑correcting codes are examined: scalar MDS codes and MDS array codes. For scalar MDS codes, each row of the data matrix (size n × (M‑r)) is encoded into a length‑M codeword, where r is the redundancy. Recovery of a failed column requires, for every row, at least M‑r distinct symbols from the surviving containers. This requirement maps directly to n independent coupon‑collector processes, each needing to collect m = M‑r different coupons. The authors formalize this via a matrix‑growth process A(t) and prove (Theorem 1) that the expected time to satisfy the per‑row requirement is
