Assessing Redundancy Strategies to Improve Availability in Virtualized System Architectures

Notice: This research summary and analysis were automatically generated using AI technology. For the authoritative text, refer to the original arXiv paper.

Cloud-based storage platforms are becoming more common in both academic and business settings due to their flexible access to data and support for collaborative functionalities. As reliability becomes a vital requirement, particularly for organizations looking for alternatives to public cloud services, assessing the dependability of these systems is crucial. This paper presents a methodology for analyzing the availability of a file server (Nextcloud) hosted in a private cloud environment using Apache CloudStack. The analysis is based on a modeling approach through Stochastic Petri Nets (SPNs) that allows the evaluation of different redundancy strategies to enhance the availability of such systems. Four architectural configurations were modeled, including the baseline, host-level redundancy, virtual machine (VM) redundancy, and a combination of both. The results show that redundancy at both the host and VM levels significantly improves availability and reduces expected downtime. The proposed approach provides a method to evaluate the availability of a private cloud and support infrastructure design decisions.


💡 Research Summary

The paper presents a systematic methodology for evaluating and improving the availability of a Nextcloud file‑server deployed in a private cloud managed by Apache CloudStack. Recognizing the growing reliance on collaborative, cloud‑based storage in both academia and industry, the authors focus on the reliability challenges inherent to self‑hosted infrastructures, where service interruptions directly affect productivity, data integrity, and compliance.

A test‑bed consisting of one Dell server and three identical HP workstations running CentOS 7 was assembled. One HP machine acted as a client generating realistic workloads with Apache JMeter (login, navigation, file upload, logout). CloudStack’s management services were placed on Host 1, the virtual router on Host 2, and the Nextcloud VM on Host 3. Detailed measurements of mean time to failure (MTTF) and mean time to repair (MTTR) for each physical host and virtual machine were collected, providing the quantitative parameters required for the subsequent analytical model.

The core of the analysis is a Stochastic Petri Net (SPN) model built with the Mercury tool. SPNs extend classical Petri nets by distinguishing timed (exponential) transitions from immediate (zero‑delay) transitions, allowing the explicit representation of failure events, repair processes, and priority handling. In the model, each host and each of the four virtual machines (secondary‑storage VM, console‑proxy VM, virtual router, and the Nextcloud VM) is represented by an “On” place (operational) and an “Off” place (failed). Failure of a host triggers immediate transitions that simultaneously move all VMs hosted on that machine to their “Off” places, reproducing the cascading effect of a physical outage on service availability. Repair transitions restore the “On” tokens after an exponentially distributed repair time whose mean is the measured MTTR.
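The On/Off places and the host-to-VM cascade described above can be illustrated with a small Monte Carlo simulation. This is only a sketch of the idea for one host carrying one VM, not the authors' Mercury model: the function name, the MTTF/MTTR values in the usage note, and the assumption that a VM can only be repaired while its host is up are all hypothetical choices made for illustration.

```python
import random


def simulate_availability(mttf_host, mttr_host, mttf_vm, mttr_vm,
                          horizon_h=1_000_000.0, seed=0):
    """Estimate the fraction of time both the host and its VM are 'On'.

    All times are in hours. Every timed transition is exponential, as in an
    SPN; the host-to-VM cascade is modeled as an immediate (zero-delay) step.
    """
    rng = random.Random(seed)
    host_up, vm_up = True, True  # tokens start in the "On" places
    t, up_time = 0.0, 0.0

    while True:
        # Timed transitions enabled in the current marking, with their rates.
        if host_up:
            enabled = {"host_fail": 1.0 / mttf_host}
            if vm_up:
                enabled["vm_fail"] = 1.0 / mttf_vm
            else:
                # Assumption: the VM is only repairable while its host is up.
                enabled["vm_repair"] = 1.0 / mttr_vm
        else:
            enabled = {"host_repair": 1.0 / mttr_host}

        total_rate = sum(enabled.values())
        dwell = rng.expovariate(total_rate)  # time until the next firing
        remaining = horizon_h - t
        in_service = host_up and vm_up

        if dwell >= remaining:
            # The horizon ends before the next transition fires.
            up_time += remaining if in_service else 0.0
            break
        up_time += dwell if in_service else 0.0
        t += dwell

        # Pick which enabled transition fires, proportionally to its rate.
        fired = rng.choices(list(enabled), weights=list(enabled.values()))[0]
        if fired == "host_fail":
            host_up = False
            vm_up = False  # immediate transition: host outage takes the VM down
        elif fired == "host_repair":
            host_up = True  # the VM stays "Off" until its own repair fires
        elif fired == "vm_fail":
            vm_up = False
        else:  # "vm_repair"
            vm_up = True

    return up_time / horizon_h
```

With made-up parameters such as `simulate_availability(1000, 10, 500, 5)`, the estimate should land near 0.975, which is the steady-state value of the corresponding three-state Markov chain under the same assumptions. A tool such as Mercury solves the model analytically rather than by sampling, so this sketch is only useful for building intuition about the cascade.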

Four architectural configurations are examined:

  1. Baseline (no redundancy) – a single host runs the Nextcloud VM; all other components are single instances.
  2. Host‑level redundancy – the Nextcloud service is duplicated on two separate physical hosts; a failure of one host leaves the other operational.
  3. VM‑level redundancy (cold standby) – two Nextcloud VMs reside on the same host; the standby VM remains powered off until the primary fails, at which point it is powered on (incurring an activation delay).
  4. Combined host and VM redundancy – both the host and the VM layers employ cold‑standby replicas, providing the highest fault‑tolerance.

For each configuration the SPN is solved to obtain the steady‑state probabilities of the “On” places, from which system availability and expected downtime are derived; for a single repairable component, steady‑state availability is \(A = \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}}\). The results show a clear hierarchy: the baseline achieves roughly 98.5 % availability, host redundancy raises this to about 99.6 %, VM redundancy to 99.7 %, and the combined strategy exceeds 99.95 %. Taken at face value over an 8,760‑hour year, these figures correspond to a reduction in expected annual downtime from roughly 131 hours to under 4.4 hours.
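Converting an availability figure into expected annual downtime is plain arithmetic. The sketch below applies it to the rounded availabilities quoted in this summary, assuming an 8,760-hour year; the function and variable names are illustrative and do not come from the paper.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored


def availability(mttf_h: float, mttr_h: float) -> float:
    """Steady-state availability of a single repairable component."""
    return mttf_h / (mttf_h + mttr_h)


def annual_downtime_hours(avail: float) -> float:
    """Expected hours per year during which the service is unavailable."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Rounded availabilities as quoted in the summary (not raw model output).
quoted = {
    "baseline": 0.985,
    "host redundancy": 0.996,
    "VM redundancy": 0.997,
    "combined": 0.9995,
}

for name, a in quoted.items():
    print(f"{name}: {annual_downtime_hours(a):.1f} h/year")
```

For the quoted figures this gives about 131.4, 35.0, 26.3, and 4.4 hours per year respectively. Because the inputs are rounded, the exact downtime values reported in the paper may differ.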

The authors position their work against prior studies that employed Continuous‑Time Markov Chains (CTMC), Reliability Block Diagrams (RBD), or Bayesian networks. They argue that SPNs uniquely capture immediate transition priorities (e.g., the instant propagation of a host failure to dependent VMs) and allow a compact representation of cold‑standby behavior without exploding the state space. Moreover, the paper contributes two novel architectural proposals that incorporate cold‑standby redundancy at both the host and VM layers, demonstrating that such designs can achieve high availability while limiting energy consumption and hardware costs.

In conclusion, the study provides a repeatable, model‑driven framework for architects of private clouds to assess the impact of redundancy choices on service continuity. By grounding the SPN parameters in empirical measurements, the approach bridges the gap between theoretical reliability analysis and real‑world deployment. The authors suggest future extensions such as dynamic parameter updating from live monitoring data, inclusion of geographically distributed data‑centers for disaster‑recovery scenarios, and automated optimization algorithms that balance cost, performance, and availability across a broader design space.

