Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo
While distributed tracing and chaos engineering are becoming standard for microservices, resilience models remain largely manual and bespoke. We revisit a trace-discovered connectivity model that derives a service dependency graph from traces and uses Monte Carlo simulation to estimate endpoint availability under fail-stop service failures. Compared to earlier work, we (i) derive the graph directly from raw OpenTelemetry traces, (ii) attach endpoint-specific success predicates, and (iii) add a simple asynchronous semantics that treats Kafka edges as non-blocking for immediate HTTP success. We apply this model to the OpenTelemetry Demo (“Astronomy Shop”) using a GitHub Actions workflow that discovers the graph, runs simulations, and executes chaos experiments that randomly kill microservices in a Docker Compose deployment. Across the studied failure fractions, the model reproduces the overall availability degradation curve, while asynchronous semantics for Kafka edges change predicted availabilities by at most about 10^(-5) (0.001 percentage points). This null result suggests that for immediate HTTP availability in this case study, explicitly modeling asynchronous dependencies is not warranted, and a simpler connectivity-only model is sufficient.
💡 Research Summary
This paper presents an empirical evaluation of a trace-discovered resilience model for microservices systems, with a specific focus on assessing the impact of modeling asynchronous communication semantics. The core methodology involves automatically deriving a service dependency graph from raw OpenTelemetry traces, rather than relying on aggregated API endpoints. This low-level trace analysis allows for the identification and tagging of asynchronous edges, particularly those traversing Kafka, based on span attributes and semantic conventions.
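The discovery step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each exported span is available as a dict with a service name, span/parent IDs, and attributes, and it tags an edge as asynchronous when the callee's span carries the OpenTelemetry messaging semantic-convention attribute `messaging.system = "kafka"`. Service names in the example are hypothetical.

```python
# Hedged sketch: derive a service dependency graph from raw OTel spans and
# tag Kafka-traversing edges as asynchronous via span attributes.
def discover_graph(spans):
    """Return {(caller, callee): {"async": bool}} from parent-child span links."""
    by_id = {s["span_id"]: s for s in spans}
    edges = {}
    for span in spans:
        parent = by_id.get(span.get("parent_span_id"))
        if parent is None:
            continue  # root span: no caller
        caller, callee = parent["service"], span["service"]
        if caller == callee:
            continue  # internal span, not a cross-service dependency
        is_async = span.get("attributes", {}).get("messaging.system") == "kafka"
        edge = edges.setdefault((caller, callee), {"async": False})
        edge["async"] = edge["async"] or is_async  # any Kafka span marks the edge
    return edges

# Illustrative spans (service names are assumptions, not from the paper):
spans = [
    {"span_id": "a", "parent_span_id": None, "service": "frontend", "attributes": {}},
    {"span_id": "b", "parent_span_id": "a", "service": "checkout", "attributes": {}},
    {"span_id": "c", "parent_span_id": "b", "service": "fraud-detection",
     "attributes": {"messaging.system": "kafka"}},
]
print(discover_graph(spans))
```

Working from individual spans rather than an aggregated service map is what makes the async tagging possible: the `messaging.system` attribute lives on the span, and would be lost in an endpoint-level aggregation.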
The model enhances prior work by incorporating endpoint-specific success predicates. For each HTTP endpoint of interest (e.g., /api/checkout), the model defines an entry service and a set of target backend services that must be reachable (under rules like all_of) for the request to be considered immediately successful. System availability under fail-stop service failures is then estimated via Monte Carlo simulation over the discovered graph.
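The predicate-plus-simulation idea can be sketched like this. The sketch makes simplifying assumptions not spelled out in the summary: edges are plain `(caller, callee)` pairs, a fixed fraction of services is killed per trial (matching the chaos setup), and the `all_of` rule means every target must be reachable from the entry service over alive services only.

```python
# Hedged sketch of Monte Carlo availability estimation over a discovered graph.
import random

def reachable_set(entry, edges, alive):
    """Services reachable from `entry`, traversing only alive services."""
    seen, stack = set(), [entry]
    while stack:
        svc = stack.pop()
        if svc in seen or svc not in alive:
            continue
        seen.add(svc)
        stack.extend(dst for src, dst in edges if src == svc)
    return seen

def estimate_availability(entry, targets, edges, services,
                          fail_fraction, trials=10_000, seed=None):
    """Estimate P(immediate success): kill a fixed fraction of services
    (fail-stop) and require all_of(targets) to stay reachable from entry."""
    rng = random.Random(seed)
    n_fail = round(fail_fraction * len(services))
    ok = 0
    for _ in range(trials):
        dead = set(rng.sample(sorted(services), n_fail))
        alive = set(services) - dead
        if targets <= reachable_set(entry, edges, alive):
            ok += 1
    return ok / trials

# Illustrative endpoint predicate (names are assumptions):
edges = {("frontend", "checkout"), ("checkout", "payment")}
services = {"frontend", "checkout", "payment"}
print(estimate_availability("frontend", {"checkout", "payment"},
                            edges, services, fail_fraction=0.3, seed=1))
```

Sweeping `fail_fraction` over a grid such as 0.1 to 0.9 yields the predicted degradation curve that the paper compares against live chaos measurements.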
A key innovation is the introduction and comparison of two model semantics: an “all-blocking” semantics where all observed dependencies are treated as required for immediate success, and an “async” semantics where Kafka edges are considered non-blocking for the immediate HTTP response, reflecting the decoupled nature of event-driven processing.
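Operationally, the two semantics can differ only in which edges participate in the reachability check; one plausible minimal realization (names and representation are assumptions, reusing the `{edge: {"async": bool}}` shape from graph discovery) is an edge filter applied before simulation:

```python
# Hedged sketch: select the edges that count as blocking under each semantics.
def required_edges(edges, semantics):
    """'all_blocking': every observed edge is required for immediate success.
    'async': Kafka-tagged edges are non-blocking and dropped from the check."""
    if semantics == "all_blocking":
        return set(edges)
    return {e for e, meta in edges.items() if not meta["async"]}

edges = {
    ("checkout", "payment"): {"async": False},
    ("checkout", "fraud-detection"): {"async": True},
}
print(required_edges(edges, "async"))
```

Under this formulation, the async semantics can only change a prediction if some Kafka edge lies on a path the success predicate actually depends on, which foreshadows the null result reported below.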
To validate the model, the study uses the OpenTelemetry Demo (“Astronomy Shop”) as a case study. An automated GitHub Actions workflow orchestrates the entire experiment: deploying the system, collecting traces, discovering the dependency graph, running Monte Carlo simulations for both semantics, and executing chaos experiments. The chaos experiments randomly kill a fraction of microservices in a Docker Compose deployment and actively probe HTTP endpoints to measure real-world availability.
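One round of the chaos step might look like the sketch below. This is an assumption-laden illustration, not the workflow's actual script: the service list comes from `docker compose ps --services`, victims are killed with `docker compose kill`, and the probe URL and failure fraction are placeholders.

```python
# Hedged sketch of one chaos round against a Docker Compose deployment.
import random
import subprocess
import urllib.request

def pick_victims(services, fraction, seed=None):
    """Choose a fixed fraction of services to kill (fail-stop)."""
    rng = random.Random(seed)
    n_fail = round(fraction * len(services))
    return sorted(rng.sample(sorted(services), n_fail))

def probe(url, timeout=5):
    """Return True iff the endpoint answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

if __name__ == "__main__":
    services = subprocess.run(
        ["docker", "compose", "ps", "--services"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    victims = pick_victims(services, fraction=0.3, seed=42)
    subprocess.run(["docker", "compose", "kill", *victims], check=True)
    # Probe URL is hypothetical; the demo's frontend port may differ.
    print("checkout available:", probe("http://localhost:8080/api/checkout"))
```

Repeating such rounds across failure fractions and averaging the probe outcomes produces the measured availability curve that the simulated predictions are validated against.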
The results across a range of failure fractions (0.1 to 0.9) show that the model with all-blocking semantics accurately reproduces the overall availability degradation curve observed in live chaos experiments. Surprisingly, applying the async semantics to account for Kafka-based dependencies resulted in negligible changes to predicted availability values—differences were on the order of 10^(-5) (0.001 percentage points).
This null finding is significant. It indicates that for the specific SLO of immediate HTTP availability in this case study, the asynchronous Kafka dependencies did not lie on the critical path to the synchronous backend services defined in the endpoints’ success predicates. Consequently, explicitly modeling these asynchronous semantics did not improve predictive accuracy. The paper concludes that a simpler, connectivity-only model is sufficient for predicting immediate HTTP availability in architectures similar to the one studied, providing a practical guideline for when model simplicity can be favored over semantic complexity. The integrated workflow demonstrates the feasibility of combining automated model discovery from observability data with chaos engineering for continuous resilience assessment.