Exploiting Spot Instances for Time-Critical Cloud Workloads Using Optimal Randomized Strategies
This paper addresses the challenge of deadline-aware online scheduling in hybrid cloud environments, where jobs may run on either cost-effective but unreliable spot instances or more expensive on-demand instances, under hard deadlines. We first establish a fundamental limit for existing, predominantly deterministic policies, proving a worst-case competitive ratio of $\Omega(K)$, where $K$ is the cost ratio between on-demand and spot instances. We then present a novel randomized scheduling algorithm, ROSS, that achieves a provably optimal competitive ratio of $\sqrt{K}$ under reasonable deadlines, significantly improving upon existing approaches. Extensive evaluations on real-world trace data from Azure and AWS demonstrate that ROSS effectively balances cost optimization and deadline guarantees, consistently outperforming the state of the art by up to $30\%$ in cost savings across diverse spot market conditions.
💡 Research Summary
The paper tackles the problem of scheduling a single, deadline‑constrained computational job in a hybrid cloud that offers two types of virtual machines: cheap but unreliable spot instances (cost = 1 per time unit, availability unknown in advance) and expensive but always‑available on‑demand instances (cost = K > 1 per time unit). The job requires L units of compute and must finish by a hard deadline D (D ≥ L). The objective is to minimize total monetary cost while guaranteeing that the job completes before D, despite the adversarial nature of spot availability.
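The pricing model above can be made concrete with a small sketch. This is an illustration of the cost accounting only, not the paper's algorithm; the clairvoyant baseline `offline_optimal_cost` is a hypothetical helper introduced here for comparison.

```python
def total_cost(spot_time, on_demand_time, K):
    """Monetary cost under the paper's pricing model:
    spot instances cost 1 per time unit, on-demand instances cost K."""
    return spot_time * 1.0 + on_demand_time * K

def offline_optimal_cost(L, spot_available_units, K):
    """Illustrative clairvoyant baseline (an assumption, not from the
    paper): run as much of the L units of work as possible on spot,
    and buy on-demand time for the remainder."""
    spot_used = min(L, spot_available_units)
    return total_cost(spot_used, L - spot_used, K)
```

An online algorithm's competitive ratio compares its cost against this kind of hindsight-optimal schedule.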
Background and Motivation
On‑demand instances provide predictable performance but dominate cloud spend. Spot instances, sold at 3–10× discounts, can dramatically reduce costs but may be reclaimed at any moment. Many batch workloads tolerate interruptions via checkpointing, yet latency‑sensitive applications (real‑time recommendation, IoT pipelines, edge analytics, video processing) cannot afford missed deadlines. Existing heuristic policies (e.g., “greedy” switching or Uniform Progress from prior work) are deterministic and suffer from a worst‑case competitive ratio that grows linearly with the cost ratio K. An adversary can align spot revocations with the scheduler’s deterministic decisions, forcing the algorithm to fall back on expensive on‑demand resources for the entire job.
Theoretical Contributions
The authors first prove a lower bound (Theorem 1) showing that any deterministic online scheduler must incur a competitive ratio of Ω(K) in the worst case. This establishes a fundamental limitation of all prior deterministic approaches.
To overcome this, they introduce ROSS (Randomized Online Spot Scheduler), a three‑phase randomized algorithm:
- Warm‑up Phase – While the slack ratio $(D-t)/(L-\phi(t))$ exceeds a critical threshold $(1+2\sqrt{K})/(1+\sqrt{K})$, the algorithm runs a warm‑up policy (either greedy or uniform) to accumulate partial progress.
- Randomized On‑Demand Injection – At the moment the slack drops to the threshold, the algorithm records the current time $\xi_1$, computes the remaining work $L-\phi(\xi_1)$, and defines a duration $\delta = (L-\phi(\xi_1))/(1+\sqrt{K})$. It then selects a random interval $I(\delta)$ of length $\delta$ within (
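The threshold test and randomized injection can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary above is truncated before specifying the window from which the random interval is drawn, so uniform placement within the remaining time $[t, D]$ is an assumption made here.

```python
import math
import random

def slack_ratio(t, D, progress, L):
    """Slack ratio (D - t) / (L - progress): remaining wall-clock
    time per unit of remaining work."""
    return (D - t) / (L - progress)

def ross_injection(t, D, progress, L, K, rng=random):
    """Sketch of ROSS's randomized on-demand injection. Returns None
    during the warm-up phase; otherwise returns a random interval of
    length delta = (L - progress) / (1 + sqrt(K)).
    Assumption: the interval is placed uniformly at random within the
    remaining window [t, D]."""
    threshold = (1 + 2 * math.sqrt(K)) / (1 + math.sqrt(K))
    if slack_ratio(t, D, progress, L) > threshold:
        return None  # slack still above the critical threshold: warm up
    xi1 = t  # record the time the slack hit the threshold
    delta = (L - progress) / (1 + math.sqrt(K))
    start = xi1 + rng.random() * max(D - delta - xi1, 0.0)
    return (start, start + delta)  # randomized on-demand interval I(delta)
```

Randomizing where the on-demand interval lands is what denies the adversary the ability to align spot revocations with the scheduler's decisions, which is exactly the weakness the $\Omega(K)$ lower bound exploits against deterministic policies.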