Deep Reinforcement Learning for Fault-Adaptive Routing in Eisenstein-Jacobi Interconnection Topologies


The increasing density of many-core architectures demands interconnection networks that are both high-performance and fault-resilient. Eisenstein-Jacobi (EJ) networks, with their symmetric 6-regular topology, offer superior topological properties but challenge traditional routing heuristics under fault conditions. This paper evaluates three routing paradigms in faulty EJ environments: deterministic Greedy Adaptive Routing, Dijkstra’s algorithm as the theoretical optimum, and a reinforcement learning (RL)-based approach. Using a multi-objective reward function that penalizes fault proximity and rewards path efficiency, the RL agent learns to navigate around clustered failures that typically induce dead-ends in greedy geometric routing. Dijkstra’s algorithm establishes the performance ceiling by computing globally optimal paths with complete topology knowledge, revealing the true connectivity limits of faulty networks. Quantitative analysis with nine faulty nodes shows greedy routing degrading catastrophically to 10% effective reachability and packet delivery, while Dijkstra establishes 52-54% as the topological optimum. The RL agent achieves 94% effective reachability and 91% packet delivery while relying only on local information, making it suitable for distributed deployment. Furthermore, throughput evaluations demonstrate that RL sustains over 90% normalized throughput across all loads, even outperforming Dijkstra under congestion through implicit load balancing. These results establish RL-based adaptive policies as a practical solution that bridges the gap between greedy routing’s efficiency and Dijkstra’s optimality, providing robust, self-healing communication in fault-prone interconnection networks without the global topology knowledge or computational overhead of optimal algorithms.


💡 Research Summary

The paper investigates fault‑adaptive routing in Eisenstein‑Jacobi (EJ) interconnection networks, a class of 6‑regular hexagonal torus topologies derived from Eisenstein integers. EJ networks offer higher node density and lower diameter than conventional 2‑D mesh or torus fabrics, making them attractive for many‑core and high‑performance computing (HPC) systems. However, their performance is highly sensitive to node or link failures because traditional minimal greedy routing can become trapped in local minima when faulty nodes block the geometrically shortest direction.
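To make the topology concrete, here is a small illustrative sketch (not from the paper) of EJ node coordinates and their six neighbors, writing each node as a + bρ in the basis (1, ρ) and using the identity ρ² = ρ − 1; the function names are hypothetical:

```python
# Illustrative sketch: Eisenstein-Jacobi nodes written as a + b*rho in the
# basis (1, rho), where rho = e^{i*pi/3} satisfies rho^2 = rho - 1.
# Each node is 6-regular, with neighbors along +-1, +-rho, +-rho^2.
EJ_DIRECTIONS = {
    "+1":    (1, 0),   "-1":    (-1, 0),
    "+rho":  (0, 1),   "-rho":  (0, -1),
    "+rho2": (-1, 1),  "-rho2": (1, -1),  # rho^2 = rho - 1  ->  (-1, 1)
}

def neighbors(node):
    """Return the six EJ neighbors of a node given as (a, b) coordinates."""
    a, b = node
    return [(a + da, b + db) for da, db in EJ_DIRECTIONS.values()]

def hex_distance(u, v):
    """Minimal hop count between two fault-free nodes in this basis."""
    a, b = u[0] - v[0], u[1] - v[1]
    return (abs(a) + abs(b) + abs(a + b)) // 2
```

With faults present, `hex_distance` is only a lower bound on the hop count, which is exactly why greedy minimal routing can get trapped.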

Three routing strategies are evaluated under identical fault‑injection and traffic‑load scenarios: (1) deterministic Greedy Adaptive Routing (GAR), (2) Dijkstra’s algorithm with complete global knowledge (theoretical optimum), and (3) a deep reinforcement‑learning (RL) approach based on Proximal Policy Optimization (PPO). The RL formulation treats each router as an independent agent. The state includes the router’s coordinates, the destination coordinates, and a six‑bit mask indicating which of the six EJ neighbors are operational. The action space consists of the six unit directions {±1, ±ρ, ±ρ²}. The reward function is multi‑objective: +100 for reaching the destination, –50 for stepping into a faulty node, and –1 per hop to encourage short paths. PPO’s actor‑critic architecture, combined with Generalized Advantage Estimation (GAE) and a clipping parameter ε = 0.2, ensures stable policy updates despite the non‑stationary fault environment.
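The per-router state mask and the multi-objective reward described above can be sketched as follows; the names are hypothetical, but the six-direction action space and the reward values (+100, −50, −1 per hop) follow the text:

```python
# Sketch of the per-router observation mask and multi-objective reward.
# Direction order corresponds to +-1, +-rho, +-rho^2 in (a, b) coordinates.
DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (-1, 1), (1, -1)]

def fault_mask(node, faulty):
    """Six-bit mask: 1 if the neighbor in that direction is operational."""
    a, b = node
    return [0 if (a + da, b + db) in faulty else 1 for da, db in DIRS]

def reward(next_node, dest, faulty):
    """Deliver: +100; step into a faulty node: -50; otherwise -1 per hop."""
    if next_node == dest:
        return 100.0
    if next_node in faulty:
        return -50.0
    return -1.0
```

The full state fed to the PPO actor-critic would concatenate the router coordinates, the destination coordinates, and this six-bit mask.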

Simulation experiments use two stress models, random fault injection and clustered faults that create challenging local‑minimum regions, and sweep the offered load from 0.1 to 0.9 to assess congestion effects. With nine faulty nodes (≈2 % of the network), GAR’s effective reachability and packet delivery ratio collapse to about 10 % because many packets encounter dead‑ends and are dropped. Dijkstra’s algorithm, which explores the full fault‑aware graph, achieves 52‑54 % reachability and delivery, representing the topological optimum under the given fault pattern. The RL agent attains 94 % effective reachability and 91 % delivery, dramatically superior to GAR and, in absolute terms, roughly 40 percentage points above Dijkstra’s reported figures. Importantly, the RL policy requires only local information and modest per‑packet computation, making it feasible for distributed deployment in NoC environments.
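A minimal sketch of the fault-aware Dijkstra baseline and the reachability metric it supports; the radius-3 hexagonal patch and the helper names are illustrative, not the paper's network instance:

```python
# Fault-aware shortest paths on a small hexagonal patch of EJ nodes,
# used to measure what fraction of destinations remain reachable.
import heapq

DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (-1, 1), (1, -1)]

def hex_patch(k):
    """All nodes within k hops of the origin in (a, b) coordinates."""
    return {(a, b) for a in range(-k, k + 1) for b in range(-k, k + 1)
            if (abs(a) + abs(b) + abs(a + b)) // 2 <= k}

def dijkstra(src, alive):
    """Distances from src over the surviving (non-faulty) graph."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, (a, b) = heapq.heappop(heap)
        if d > dist[(a, b)]:
            continue
        for da, db in DIRS:
            nxt = (a + da, b + db)
            if nxt in alive and d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                heapq.heappush(heap, (d + 1, nxt))
    return dist

def reachability(src, nodes, faulty):
    """Fraction of surviving destinations that src can still reach."""
    alive = nodes - faulty
    reached = dijkstra(src, alive)
    targets = alive - {src}
    return len(set(reached) & targets) / len(targets)
```

Since every link has unit cost, breadth-first search would give the same distances; Dijkstra is shown because it is the baseline the paper names.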

Throughput measurements reveal that the RL approach maintains normalized throughput above 90 % across all load levels. Under high load, RL even outperforms Dijkstra, whose globally optimal paths tend to concentrate traffic on a few links, leading to buffer saturation and reduced throughput. The RL policy implicitly learns load‑balancing behavior, distributing packets more evenly across the six available directions. In clustered‑fault scenarios, the RL agent successfully navigates around fault “holes” without getting trapped, demonstrating robust fault avoidance.
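The congestion argument (globally shortest paths concentrating traffic on a few links, while the learned policy spreads it) can be quantified with a standard balance measure such as Jain's fairness index; this is an illustrative metric, not one the paper reports:

```python
# Illustrative: Jain's fairness index over per-link loads at one router.
# 1.0 = perfectly balanced; 1/n = all traffic concentrated on one link.
def jain_index(link_loads):
    n = len(link_loads)
    total = sum(link_loads)
    if total == 0:
        return 1.0  # no traffic is trivially balanced
    return total * total / (n * sum(x * x for x in link_loads))

# Same offered load, two hypothetical routing behaviors:
concentrated = [40, 40, 0, 0, 0, 0]     # shortest paths pile onto two links
balanced     = [13, 13, 13, 13, 14, 14]  # load spread over all six directions
```

Under the concentrated pattern the two busy links saturate their buffers first, which is the mechanism cited above for Dijkstra's throughput loss at high load.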

The authors conclude that deep RL can bridge the gap between the computational simplicity of greedy routing and the optimality of Dijkstra in EJ topologies. By learning to infer global fault patterns from local observations, the RL agents achieve near‑optimal delivery while preserving the scalability and low overhead required for on‑chip routers. The paper suggests future work on multi‑agent cooperation, multi‑objective optimization (including power and latency), hardware prototyping, and extension to dynamic link‑fault models. Overall, the study provides strong evidence that learning‑based adaptive routing is a practical and effective solution for fault‑prone, high‑density interconnection networks.

