Optimizing Mirror-Image Peptide Sequence Design for Data Storage via Peptide Bond Cleavage Prediction

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, refer to the original arXiv paper.

Traditional non-biological storage media, such as hard drives, are limited in both storage density and lifespan, and the rapid growth of data in the big data era makes these limits increasingly pressing. Mirror-image peptides composed of D-amino acids have emerged as a promising biological storage medium owing to their high storage density, structural stability, and long lifespan. Sequencing mirror-image peptides relies on de novo technology, whose accuracy is limited by the scarcity of tandem mass spectrometry datasets and by the difficulty current algorithms have in processing these peptides directly. This study is the first to propose improving sequencing accuracy indirectly by optimizing the design of mirror-image peptide sequences. We introduce DBond, a deep neural network model that integrates sequence features, precursor ion properties, and mass spectrometry environmental factors to predict mirror-image peptide bond cleavage. Sequences with a high predicted peptide bond cleavage ratio, which are easier to sequence, are then selected. The main contributions of this study are as follows. First, we constructed MiPD513, a tandem mass spectrometry dataset containing 513 mirror-image peptides. Second, we developed the peptide bond cleavage labeling algorithm (PBCLA), which generated approximately 12.5 million labeled samples from MiPD513. Third, we proposed a dual prediction strategy that combines multi-label and single-label classification. On an independent test set, the single-label classification strategy outperformed other methods in both single and multiple peptide bond cleavage prediction tasks, offering a strong foundation for sequence optimization.


💡 Research Summary

In the era of exploding data volumes, conventional storage media such as magnetic tapes and hard drives are approaching physical limits in density and longevity. Biological macromolecules have emerged as a promising alternative, with DNA already demonstrating petabyte‑per‑gram densities and millennial stability. This paper focuses on mirror‑image peptides—synthetic polymers composed entirely of D‑amino acids—as an even more stable and dense storage medium. The key bottleneck for peptide‑based storage is de‑novo sequencing, which must accurately recover the original D‑amino‑acid sequence from tandem mass spectrometry (MS/MS) data. Existing de‑novo algorithms perform well on natural (L‑) peptide datasets but struggle with mirror‑image peptides because of the scarcity of training data and the distinct physicochemical behavior of D‑amino acids.

To address this, the authors propose an indirect improvement strategy: design peptide sequences that are intrinsically easier to fragment, thereby increasing the likelihood of successful sequencing. They introduce DBond, a deep neural network that predicts the cleavage probability of each peptide bond in a mirror‑image peptide. The work has four main components:

  1. MiPD513 Dataset – The authors synthesized 513 distinct mirror‑image peptides, incorporating both the 20 canonical amino acids and several non‑canonical D‑residues (D‑Dap, D‑Orn, D‑X, D‑Cha). Using a Thermo Fisher Q‑Exactive Plus mass spectrometer with high‑energy collisional dissociation (HCD) at multiple normalized collision energies, they acquired 477 669 MS/MS spectra. After processing with MSConvert, the dataset provides a comprehensive resource for training and evaluation.

  2. Peptide Bond Cleavage Labelling Algorithm (PBCLA) – PBCLA automatically extracts bond‑cleavage information from raw spectra. It matches six ion types (b, y, b‑H₂O, b‑NH₃, y‑H₂O, y‑NH₃) within a 20 ppm tolerance, allowing charge states of 1 or 2. For each peptide bond, the presence of a matching fragment ion marks the bond as “cleaved” (label = 1); otherwise it is “uncleaved” (label = 0). Applying PBCLA to MiPD513 yields 12 473 724 labeled instances covering 303 distinct bond positions.

  3. DBond Architecture – Input features are grouped into four logical sets: (a) State features (precursor charge, m/z, intensity), (b) Bond features (relative position of the bond), (c) Environmental features (collision energy, scan number), and (d) Sequence features (the D‑amino‑acid string). Each group is embedded appropriately and fed into a multi‑head self‑attention (MSA) transformer block that captures global dependencies among residues, bond positions, and experimental conditions. The model outputs a probability for each bond; training minimizes binary cross‑entropy across all bonds.

  4. Prediction Strategies and Optimization – Two strategies are evaluated. The multi‑label approach predicts the entire cleavage vector for a peptide in one forward pass, while the single‑label approach treats each bond as an independent binary classification problem, effectively performing sequential predictions. Empirical results on an independent test set show that the single‑label strategy consistently outperforms the multi‑label one in terms of F1‑score, precision, and recall, especially for longer peptides where fragmentation patterns become more heterogeneous.
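The labelling step in item 2 and the two sample layouts in item 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' PBCLA implementation: the function names (`fragment_mzs`, `label_bonds`, `to_multi_label`, `to_single_label`) are hypothetical, only a handful of residues are included in the mass table, and details such as intensity filtering or the handling of non-canonical residues are left out because the summary does not describe them. The sample dictionaries also omit the state and environmental features that DBond takes as input.

```python
# Illustrative sketch only; names and structure are assumptions, not the paper's code.
PROTON, WATER, AMMONIA = 1.007276, 18.010565, 17.026549

# Monoisotopic residue masses in Da. D- and L-residues have identical masses.
# Only a few residues are listed; others would have to be added for real data.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
                "L": 113.08406, "K": 128.09496, "F": 147.06841}


def fragment_mzs(seq, bond, charges=(1, 2)):
    """Theoretical m/z of b, y and their H2O/NH3-loss ions for one bond.

    `bond` is 1-based: bond k lies between residue k and residue k + 1.
    """
    prefix = sum(RESIDUE_MASS[r] for r in seq[:bond])          # b-type fragment
    suffix = sum(RESIDUE_MASS[r] for r in seq[bond:]) + WATER  # y-type fragment
    mzs = []
    for neutral in (prefix, suffix):
        for loss in (0.0, WATER, AMMONIA):
            for z in charges:
                mzs.append((neutral - loss + z * PROTON) / z)
    return mzs


def label_bonds(seq, peaks, tol_ppm=20.0):
    """Return one 0/1 label per bond: 1 if any observed peak matches any ion."""
    labels = []
    for bond in range(1, len(seq)):
        hit = any(abs(p - mz) / mz * 1e6 <= tol_ppm
                  for mz in fragment_mzs(seq, bond) for p in peaks)
        labels.append(1 if hit else 0)
    return labels


def to_multi_label(seq, labels):
    """One sample per spectrum: the target is the whole cleavage vector."""
    return {"seq": seq, "target": labels}


def to_single_label(seq, labels):
    """One sample per bond: the target is a single 0/1 value."""
    return [{"seq": seq, "bond": i + 1, "target": y}
            for i, y in enumerate(labels)]


# A toy spectrum with a single peak at the singly charged b1 ion of "GAK".
labels = label_bonds("GAK", peaks=[58.028736])
print(labels)
print(to_multi_label("GAK", labels))
print(to_single_label("GAK", labels))
```

In the single-label layout a peptide of length l contributes l - 1 training samples per spectrum, which is consistent with the labeled-instance count being far larger than the number of spectra.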

Using the predicted cleavage probabilities, the authors define a cleavage ratio \( g(\mathrm{seq}) = \frac{1}{l-1}\sum_{i=1}^{l-1} \hat{y}_i \), where \( l \) is the peptide length and \( \hat{y}_i \) is the predicted cleavage status of bond \( i \). The sequence design problem is cast as maximizing this ratio over all possible mapping rules that translate raw binary data into D‑peptide sequences. By selecting the mapping rule that yields the highest predicted cleavage ratio, the resulting peptide library is intrinsically more amenable to MS/MS fragmentation, thereby improving downstream de‑novo sequencing accuracy without altering the sequencing algorithms themselves.
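The selection step can be illustrated with a short sketch. Everything here beyond the cleavage-ratio formula is an assumption made for the example: the 0.5 threshold used to turn probabilities into a cleavage status, the two-bits-per-residue mapping rules, and `toy_predict`, which is a stand-in for a trained predictor such as DBond and has no connection to the real model.

```python
def cleavage_ratio(bond_probs, threshold=0.5):
    """g(seq): fraction of the l - 1 bonds whose predicted status is 'cleaved'.

    The threshold is an assumption; the summary only says y_hat is a status.
    """
    if not bond_probs:
        raise ValueError("a peptide needs at least two residues")
    y_hat = [1 if p >= threshold else 0 for p in bond_probs]
    return sum(y_hat) / len(y_hat)


def encode(bits, rule):
    """Translate a bit string into residues, two bits at a time (hypothetical)."""
    return "".join(rule[bits[i:i + 2]] for i in range(0, len(bits), 2))


def best_rule(bits, rules, predict):
    """Score every candidate mapping rule and return the best name with all scores."""
    scores = {name: cleavage_ratio(predict(encode(bits, rule)))
              for name, rule in rules.items()}
    return max(scores, key=scores.get), scores


def toy_predict(seq):
    """Stand-in predictor: pretends bonds after glycine rarely cleave."""
    return [0.2 if seq[i] == "G" else 0.8 for i in range(len(seq) - 1)]


# Two made-up mapping rules from 2-bit symbols to residues.
rules = {
    "rule_a": {"00": "G", "01": "A", "10": "L", "11": "K"},
    "rule_b": {"00": "A", "01": "L", "10": "K", "11": "F"},
}

name, scores = best_rule("00011011", rules, toy_predict)
print(name, scores)
```

With these toy inputs, `rule_a` encodes the bits as `GALK` and scores 2/3, while `rule_b` encodes them as `ALKF` and scores 1.0, so `rule_b` is selected. Swapping `toy_predict` for a real per-bond predictor leaves the selection logic unchanged.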

Overall, the paper delivers a complete pipeline—from data generation (MiPD513) and automated labeling (PBCLA) to a specialized deep‑learning predictor (DBond) and a practical optimization framework—for enhancing peptide‑based data storage. The work demonstrates that intelligent sequence design, guided by machine‑learning‑driven fragmentation predictions, can substantially mitigate the current limitations of mirror‑image peptide sequencing. Future directions include expanding the dataset with additional non‑canonical residues, integrating DBond into real‑time encoding pipelines, and validating the approach on large‑scale storage‑retrieval experiments.

