ACOS: Arrays of Cheap Optical Switches
Machine learning training places immense demands on cluster networks, motivating specialized architectures and co-design with parallelization strategies. Recent designs incorporating optical circuit switches (OCSes) are promising, offering better cost, power efficiency, and long-term bandwidth scaling than packet switches. However, most existing approaches rely on costly high-radix OCSes and/or combine them with packet switches to achieve competitive performance at scale. Unfortunately, high-radix OCSes are both expensive and slow to reconfigure, limiting both scalability and performance. We propose Arrays of Cheap Optical Switches (ACOS), which bring application co-design directly to the structure of the reconfigurable fabric. Using low-radix OCSes as building blocks, ACOS supports the forms of reconfiguration needed in training clusters, including topology selection, workload adaptation, and failure resilience. The cost of ACOS scales with supported topologies and adaptations rather than with port count, breaking past the scalability barriers of current specialized ML networks. We show through simulation that ACOS-based deployments match the performance of fully provisioned packet-switched networks when training state-of-the-art LLMs at scale, while delivering significant cost savings using existing off-the-shelf OCSes, with strong bandwidth scaling and higher cost savings in the future.
💡 Research Summary
The paper addresses the growing demand for high‑bandwidth, low‑latency interconnects in large‑scale AI model training, where traditional packet‑switched fabrics and high‑radix optical circuit switches (OCS) become prohibitively expensive and slow to reconfigure. The authors propose “Arrays of Cheap Optical Switches” (ACOS), a novel fabric built from commodity low‑radix OCS devices (e.g., 1×2, 2×2, 1×4). By arranging many such switches in a structured manner, ACOS can dynamically instantiate only the topologies required for each phase of training—data‑parallel, tensor‑parallel, pipeline‑parallel, and expert‑parallel—rather than maintaining a full N×N connectivity matrix at all times.
Key architectural components are:
- Topology‑Selection OCS (1×k) attached to each accelerator. It is reconfigured every training iteration to connect the accelerator to the appropriate logical topology for the current collective operation. Because reconfiguration is performed locally by the accelerator, no global synchronization is needed, and the latency can be hidden within the compute‑communication overlap.
- Topology‑Adaptation OCS (2×2) used at job allocation time to resize logical topologies (rings, chains, tori, expanders) to match the size of the participating group. This one‑shot configuration allows the fabric to support a wide range of collective sizes without over‑provisioning.
- Topology‑Resilience OCS (2×2) that provides immediate path‑fallback in case of switch, link, or NIC failures, eliminating single points of failure in a large‑scale deployment.
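The components above can be sketched in code. The following toy model (our own illustration, not from the paper) shows how topology-adaptation 2×2 elements might resize a logical ring: each node slot sits behind a 2×2 switch whose "bar" state splices the node into the ring and whose "cross" state bypasses it. The function names and state encoding are hypothetical.

```python
# Hypothetical model of topology-adaptation 2x2 OCS elements.
# Each physical node slot has a 2x2 switch: "bar" inserts the node
# into the logical ring, "cross" bypasses it, shrinking the ring.
# Encoding and names are illustrative, not from the paper.

def ring_members(states):
    """Return the node slots participating in the logical ring.

    states: list of 'bar'/'cross', one entry per physical node slot.
    """
    return [i for i, s in enumerate(states) if s == "bar"]

def ring_successors(states):
    """Map each participating node to its successor around the ring."""
    members = ring_members(states)
    return {m: members[(i + 1) % len(members)]
            for i, m in enumerate(members)}

# Example: 8 physical slots, ring resized to 5 nodes by bypassing 3.
states = ["bar", "bar", "cross", "bar", "cross", "bar", "cross", "bar"]
print(ring_members(states))      # [0, 1, 3, 5, 7]
print(ring_successors(states))   # {0: 1, 1: 3, 3: 5, 5: 7, 7: 0}
```

Because each 2×2 element only needs to be set once at job-allocation time, this kind of resize is a one-shot configuration rather than a per-iteration operation, matching the adaptation role described above.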
The design leverages two industry trends: (a) the availability of inexpensive low‑radix OCS with sub‑10 ms (or even sub‑100 µs) switching times, and (b) modern 800 Gbps Ethernet NICs that expose multiple independent 100 Gbps lanes (port‑splitting), enabling each node to have a modest degree without additional cost. By combining these, ACOS can keep the optical depth shallow—typically one or two switch stages—thereby limiting insertion loss and avoiding costly optical regeneration.
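A quick back-of-envelope illustration of these two trends (the specific loss and budget figures below are assumptions for illustration, not numbers from the paper): port-splitting an 800 Gbps NIC into 100 Gbps lanes yields a degree-8 node, and a per-stage insertion-loss budget caps how many OCS stages can be cascaded without regeneration.

```python
# Illustrative arithmetic for the two trends ACOS leverages.
# All figures are assumptions for illustration.

# (b) Port-splitting: an 800 Gbps NIC exposing independent
# 100 Gbps lanes gives each node a modest fan-out at no extra cost.
NIC_GBPS = 800
LANE_GBPS = 100
node_degree = NIC_GBPS // LANE_GBPS
print(node_degree)  # 8 lanes -> degree-8 node

# (a) Shallow optical depth: with an assumed per-stage insertion
# loss and an assumed link budget reserved for switching stages,
# only a small number of OCS stages can be cascaded.
LOSS_PER_STAGE_DB = 1.5   # assumed typical low-radix OCS loss
LINK_BUDGET_DB = 3.0      # assumed margin left for switching stages
max_stages = int(LINK_BUDGET_DB // LOSS_PER_STAGE_DB)
print(max_stages)  # 2, consistent with a one-or-two-stage design point
```

Real insertion-loss figures vary by OCS technology and transceiver margin, but the shape of the constraint is the same: depth must stay shallow, which is exactly what ACOS's one-to-two-stage structure provides.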
Cost modeling shows that ACOS deployments are cheaper than equivalent packet‑switched fabrics across all scales examined. For a 4 K‑GPU system the cost reduction is 27 %, and for a 32 K‑GPU data‑center it is 19 %. Smaller clusters that accept reduced flexibility or resilience can achieve up to 70 % savings. The authors also project that future higher‑bandwidth transceivers will further amplify these savings.
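To see why element count scales so differently, consider a toy cost model (purely our own illustration, not the paper's model): a non-blocking Benes-style high-radix OCS built from 2×2 crosspoints needs a number of elements that grows super-linearly with port count, whereas an ACOS-style design whose element count grows with the number of supported topologies T stays linear in node count.

```python
import math

# Toy cost comparison (illustrative only, not the paper's model):
# 2x2-element counts for a non-blocking Benes network versus an
# assumed ACOS-style scaling of one selection element per node per
# supported topology.

def benes_elements(n):
    """2x2 elements in an N-port Benes network (N a power of two)."""
    return (n // 2) * (2 * int(math.log2(n)) - 1)

def acos_elements(n, topologies):
    """Assumed ACOS scaling: elements grow with nodes * topologies."""
    return n * topologies

for n in (1024, 4096, 32768):
    print(n, benes_elements(n), acos_elements(n, topologies=4))
```

Under these assumptions the per-port element cost of the Benes fabric grows with log N while the ACOS-style cost per port is constant in N, which is the intuition behind "cost scales with supported topologies rather than port count."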
Performance is evaluated via large‑scale simulations of six state‑of‑the‑art large language models (Qwen‑2, Mixtral‑7B, Llama‑8B/70B, Llama‑4 Maverick, etc.) on ACOS fabrics ranging from 16 GPUs to 32 K GPUs. Across most models ACOS matches the throughput of an ideal non‑blocking packet‑switched network, with only a few percent overhead. The only notable deviation occurs with Qwen‑2, which is highly network‑sensitive; however, increasing the per‑link bandwidth eliminates the gap. The authors demonstrate that the intra‑iteration reconfiguration latency (≤10 ms) contributes less than 0.1 % to overall training time, confirming that the latency can be effectively hidden.
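The sub-0.1% overhead claim is easy to sanity-check arithmetically. In the sketch below the iteration length is our assumption for illustration (large-LLM training iterations are typically on the order of seconds); only the ≤10 ms reconfiguration time comes from the summary above.

```python
# Back-of-envelope overhead check. The iteration time is an assumed
# figure for illustration; the 10 ms reconfiguration bound is from
# the summary above.
reconfig_s = 0.010        # sub-10 ms per-iteration reconfiguration
iteration_s = 15.0        # assumed training-iteration length
overhead = reconfig_s / iteration_s
print(f"{overhead:.4%}")  # 0.0667%
assert overhead < 0.001   # well under the 0.1% bound
```

Even with a much shorter 5-second iteration the overhead would be 0.2%, so for realistic multi-second iterations the reconfiguration latency is easily hidden, and with compute-communication overlap the effective cost is lower still.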
The paper concludes that by “providing only the topologies that are needed,” ACOS breaks the quadratic cost scaling of high‑radix OCS and offers a practical path to cost‑effective, scalable, and resilient optical interconnects for AI training. Open challenges include deeper multi‑stage optical loss optimization, automated topology scheduling, and real‑world prototype deployments. Nonetheless, ACOS represents a compelling step toward next‑generation, low‑cost optical fabrics that can keep pace with the relentless growth of AI model sizes.