Entromics -- thermodynamics of sequence dependent base incorporation into DNA reveals novel long-distance genome organization
Zero-mode waveguide technology in next-generation sequencing has demonstrated that the enzymatic reaction incorporating a base into genomic DNA is sequence dependent. We show that these experimental results indicate the existence of a previously uncharacterized physical property of DNA: the chemical potential of the incorporation reaction, Δμ. We combine graph theory and statistical thermodynamics to derive entromics, a series of results that provide a thermodynamic model of Δμ. We also show that Δμ_i is quantitatively characterized as incorporation entropy. We present formulae for computing Δμ from the genomic DNA sequence. We then derive important restrictions on DNA properties and genome assembly that follow from the thermodynamic properties of Δμ. Finally, we show how these genome assembly restrictions lead directly to the evolution of detectable coherences in incorporation entropy along the entire genome. Examples of entromic applications that demonstrate their functional and biological importance are shown.
💡 Research Summary
The authors introduce “entromics,” a theoretical framework that links the sequence-dependent dwell times observed in zero-mode waveguide (ZMW) single-molecule sequencing to a novel thermodynamic quantity: the chemical potential of base incorporation (Δμ), which they characterize as an incorporation entropy. They represent DNA as a linear graph whose vertices correspond to nucleotides and whose edges correspond to phosphodiester bonds, and they define a set of overlapping “moving windows” around each base. The distance d(g, g*) between a window graph g and a reference graph g* is interpreted as an energy difference. Maximizing entropy over the distribution of these distances yields a partition function Z(β) analogous to a Boltzmann sum over microstates.
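As a point of reference, the standard maximum-entropy argument that produces a Boltzmann-like partition function can be written as follows. This is a generic sketch, assuming the only constraints are normalization and a fixed mean distance ⟨d⟩; the symbol p(g) for the probability of window graph g and the constraint set are assumptions made here for illustration, and the paper's own formulation may differ.

```latex
\begin{aligned}
&\text{maximize } S = -\sum_{g} p(g)\,\ln p(g)
\quad \text{subject to} \quad
\sum_{g} p(g) = 1, \qquad
\sum_{g} p(g)\, d(g, g^{*}) = \langle d \rangle \\
&\Longrightarrow \quad
p(g) = \frac{e^{-\beta\, d(g, g^{*})}}{Z(\beta)}, \qquad
Z(\beta) = \sum_{g} e^{-\beta\, d(g, g^{*})}
\end{aligned}
```

Here β is the Lagrange multiplier attached to the mean-distance constraint, and it plays the role that inverse temperature plays in a Boltzmann sum.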
A key mathematical construct is the “soft‑core” graph H_i, which encodes all sequences that share the same distance δ from the reference for a given central base i. Because DNA is linear, H_i is an Eulerian graph, allowing the authors to apply a modified BEST theorem to compute M_i, the number of iso‑distance sequences, exactly (formula S1). Larger M_i corresponds to lower Δμ_i, meaning a reduced activation barrier for incorporation at that position. Importantly, Δμ_i is independent of GC content and invariant under systematic base recoding, emphasizing its entropy‑like nature.
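To make the counting step concrete, here is a minimal sketch of the classical (unmodified) BEST theorem, which gives the number of Eulerian circuits of a directed Eulerian multigraph as ec(G) = t_w(G) · ∏_v (deg(v) − 1)!, where t_w(G) is the number of spanning arborescences rooted at a vertex w. It is not the authors' formula S1 and does not construct their graph H_i. As a stand-in, the hypothetical helper `count_eulerian_circuits` builds a graph from the dinucleotide transitions of a sequence treated as circular, with parallel edges counted as distinguishable.

```python
from fractions import Fraction
from math import factorial, prod


def count_eulerian_circuits(seq: str) -> int:
    """Count Eulerian circuits of the dinucleotide transition multigraph
    of a circular sequence, using the classical BEST theorem.

    Illustrative stand-in only: vertices are the distinct symbols in `seq`,
    and each consecutive pair (including the wrap-around pair) is one edge.
    Parallel edges are treated as distinguishable.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")

    nodes = sorted(set(seq))
    index = {base: k for k, base in enumerate(nodes)}
    n = len(nodes)

    # Adjacency counts; the wrap-around edge makes the graph Eulerian.
    adj = [[0] * n for _ in range(n)]
    for a, b in zip(seq, seq[1:] + seq[0]):
        adj[index[a]][index[b]] += 1
    out_deg = [sum(row) for row in adj]

    # Reduced Laplacian L = D_out - A with the row and column of vertex 0 removed.
    m = n - 1
    lap = [
        [Fraction((out_deg[r] if r == c else 0) - adj[r][c]) for c in range(1, n)]
        for r in range(1, n)
    ]

    # Exact determinant by Gaussian elimination over rationals
    # (the determinant of the empty 0x0 matrix is 1).
    det = Fraction(1)
    for col in range(m):
        pivot = next((r for r in range(col, m) if lap[r][col] != 0), None)
        if pivot is None:
            return 0  # no spanning arborescence, hence no Eulerian circuit
        if pivot != col:
            lap[col], lap[pivot] = lap[pivot], lap[col]
            det = -det
        det *= lap[col][col]
        for r in range(col + 1, m):
            factor = lap[r][col] / lap[col][col]
            for c in range(col, m):
                lap[r][c] -= factor * lap[col][c]

    # BEST theorem: arborescence count times the product of (deg - 1)!.
    return int(det) * prod(factorial(d - 1) for d in out_deg)


if __name__ == "__main__":
    for s in ("ACGT", "ACAC", "ACGCAG", "AAAA"):
        print(s, count_eulerian_circuits(s))
```

Exact rational arithmetic is used for the determinant so that the count stays an exact integer. Converting a circuit count into a count of distinct sequences would additionally require dividing out permutations of parallel edges, which this sketch leaves out.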
The authors then average Δμ_i over window lengths ranging from 21 to 181 nucleotides, producing a position‑specific incorporation potential that captures subtle thermodynamic constraints encoded in the genome. They show that extreme values of M_i (near 1 for homopolymers, very high for random sequences) would lead to either prohibitively high activation energies or an unrealistic bias toward completely random genomes. To reconcile this, they propose that the genome is assembled from “multiplons”—sets of segments sharing identical Δμ values. The statistical mechanics of these multiplons follows Bose‑Einstein statistics, and the probability distribution of multiplon assemblies adopts a Planck‑like form (equation 5) with species‑specific parameters A, a, and Q. The (Q‑1) factor naturally excludes the possibility of a single unique segment, enforcing the presence of multiple copies with the same incorporation entropy.
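The multi-window averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the per-window score `composition_entropy` is a placeholder (Shannon entropy of base composition) and is explicitly not the authors' Δμ_i, which the summary says is independent of GC content. The step of 20 between window lengths is also an assumption, since the text gives only the range 21 to 181. The function and parameter names are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Callable, Dict, Iterable


def composition_entropy(window: str) -> float:
    """Placeholder per-window score: Shannon entropy (bits) of base composition.

    This is NOT the paper's incorporation potential; it only stands in for
    whatever per-window quantity is being averaged.
    """
    total = len(window)
    return -sum((c / total) * log2(c / total) for c in Counter(window).values())


def multiscale_profile(
    seq: str,
    lengths: Iterable[int] = range(21, 182, 20),  # 21, 41, ..., 181 (assumed step)
    score: Callable[[str], float] = composition_entropy,
) -> Dict[int, float]:
    """Average a per-window score over several odd window lengths.

    Each window is centered on the position being scored. Positions too close
    to either end to fit the longest window are skipped.
    """
    lengths = list(lengths)
    if any(length % 2 == 0 for length in lengths):
        raise ValueError("window lengths must be odd so that windows are centered")

    max_half = max(lengths) // 2
    profile = {}
    for center in range(max_half, len(seq) - max_half):
        scores = [
            score(seq[center - length // 2 : center + length // 2 + 1])
            for length in lengths
        ]
        profile[center] = sum(scores) / len(scores)
    return profile
```

Passing the score as a callable keeps the averaging logic separate from the choice of per-window quantity, so a different scoring function can be substituted without changing the windowing code.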
Experimental validation is provided on two fronts. First, the inter‑pulse interval histograms from ZMW sequencing fit the Planck‑like distribution, confirming that Δμ governs the observed non‑Gaussian dwell‑time behavior. Second, histograms derived from whole‑genome sequences of several organisms (human, mouse, yeast) also require equation 5 for accurate modeling, and the fitted parameters are distinct for each species, indicating that incorporation‑entropy coherence is a species‑specific genomic signature.
Beyond the statistical description, the authors show that coherent regions of low Δμ (high M_i) often correspond to functional genomic elements such as transcription-factor binding sites and replication origins. This suggests that the long-range coherence networks derived from entromic analysis reflect biologically meaningful organization that is invisible to conventional sequence-similarity or multiple-alignment methods.
In summary, this work establishes a bridge between single‑molecule kinetic measurements and a graph‑theoretic thermodynamic model of DNA. The incorporation entropy Δμ provides a new, sequence‑dependent physical variable that captures constraints on genome assembly, predicts long‑distance coherence, and offers a novel lens for interpreting functional genomics. Entromics thus opens a pathway for integrating statistical physics, combinatorial graph theory, and high‑throughput sequencing data to uncover hidden layers of genomic organization.