GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes


Autonomous AI agents on embedded platforms require real-time, risk-aware scheduling under resource and thermal constraints. Classical heuristics struggle with workload irregularity, tabular regressors discard structural information, and model-free reinforcement learning (RL) risks overheating. We introduce GraphPerf-RT, a graph neural network surrogate achieving deep learning accuracy at heuristic speeds (2–7 ms). GraphPerf-RT is, to our knowledge, the first to unify task DAG topology, CFG-derived code semantics, and runtime context (per-core DVFS, thermal state, utilization) in a heterogeneous graph with typed edges encoding precedence, placement, and contention. Evidential regression with Normal-Inverse-Gamma priors provides calibrated uncertainty; we validate on makespan prediction for risk-aware scheduling. Experiments on three ARM platforms (Jetson TX2, Orin NX, RUBIK Pi) achieve R² = 0.81 on log-transformed makespan with Spearman's ρ = 0.95 and conservative uncertainty calibration (PICP = 99.9 % at 95 % confidence). Integration with four RL methods demonstrates that multi-agent model-based RL with GraphPerf-RT as the world model achieves 66 % makespan reduction and 82 % energy reduction versus model-free baselines, with zero thermal violations.


💡 Research Summary

The paper introduces GraphPerf‑RT, a graph‑neural‑network (GNN) surrogate designed to predict performance metrics—primarily makespan—for OpenMP task‑parallel applications running on heterogeneous embedded System‑on‑Chip (SoC) platforms. The authors motivate the work by highlighting the need for real‑time, risk‑aware scheduling in safety‑critical edge AI scenarios (autonomous vehicles, robotics, etc.) where resource, energy, and thermal constraints must be respected. Traditional analytical models, simulation, and tabular machine‑learning regressors either make simplifying assumptions that break under irregular control flow or discard the structural information inherent in task graphs. Model‑free reinforcement learning (RL) is also problematic because on‑device exploration can cause overheating.

GraphPerf‑RT addresses these gaps by constructing a heterogeneous graph that simultaneously encodes three orthogonal sources of information: (1) the OpenMP task DAG, (2) control‑flow‑graph (CFG)‑derived static code semantics, and (3) runtime context (per‑core DVFS state, utilization, thermal headroom). The graph contains three node types—Task (V_T), Resource (V_R), and Memory (V_M)—and four typed edge categories: task‑task precedence (E_TT), task‑resource placement (E_TR), resource‑resource contention (E_RR), and resource‑memory bandwidth (E_RM). Each node and edge carries a rich set of attributes (e.g., loop count, cyclomatic complexity, critical‑path flags, affinity scores, temperature trends, cache hierarchy descriptors). This representation enables the model to capture cross‑layer interactions that dominate execution time on heterogeneous SoCs.
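The node and edge taxonomy above can be made concrete with a minimal schema sketch. This is an illustrative reconstruction, not the paper's implementation: the attribute names (e.g., `dvfs_freq_mhz`, `affinity`) stand in for the richer feature sets the authors describe.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:            # V_T: one OpenMP task (CFG-derived static features)
    loop_count: int
    cyclomatic_complexity: int
    on_critical_path: bool

@dataclass
class ResourceNode:        # V_R: one core or accelerator (runtime context)
    dvfs_freq_mhz: int
    utilization: float
    temp_c: float

@dataclass
class MemoryNode:          # V_M: one level of the cache/DRAM hierarchy
    level: str
    bandwidth_gbps: float

@dataclass
class HeteroGraph:
    tasks: list = field(default_factory=list)
    resources: list = field(default_factory=list)
    memories: list = field(default_factory=list)
    # Typed edge lists as (src_index, dst_index, attribute dict)
    e_tt: list = field(default_factory=list)   # task-task precedence
    e_tr: list = field(default_factory=list)   # task-resource placement
    e_rr: list = field(default_factory=list)   # resource-resource contention
    e_rm: list = field(default_factory=list)   # resource-memory bandwidth

g = HeteroGraph()
g.tasks = [TaskNode(4, 3, True), TaskNode(1, 1, False)]
g.resources = [ResourceNode(1400, 0.6, 55.0)]
g.memories = [MemoryNode("L2", 25.6)]
g.e_tt.append((0, 1, {}))                    # task 0 must finish before task 1
g.e_tr.append((0, 0, {"affinity": 0.9}))     # task 0 mapped to core 0
g.e_rm.append((0, 0, {"bw_share": 1.0}))     # core 0 draws on the L2 node
```

In a real pipeline each typed edge list would become its own adjacency structure (e.g., a relation store in a heterogeneous GNN library), so that message passing can treat precedence, placement, contention, and bandwidth edges with separate parameters.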

The architecture employs type‑specific multilayer perceptrons (MLPs) to embed raw node features into a common 128‑dimensional space, followed by a heterogeneous Graph Attention Network (Heterogeneous GAT/HGT) with 3–6 layers that performs message passing separately for each edge type. Attention weights are learned per edge type, allowing the network to emphasize, for example, high‑affinity task‑core assignments or severe resource contention. After message passing, node embeddings are pooled per type, concatenated into a global graph representation h_G, and fed into multi‑task heads.
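A compact NumPy sketch of the two ideas in that pipeline, namely type-specific embedding followed by per-relation attention, might look as follows. This is a toy single-head version under assumed shapes (the paper uses 128-dimensional embeddings and multiple HGT layers); the weight matrices and attention vector here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding width (128 in the paper; small here for readability)

def embed(x, W):
    """Type-specific MLP stand-in: one projection per node type."""
    return np.tanh(x @ W)

def attend(h_src, h_dst, a_r):
    """Single-head attention for one edge type (relation r):
    score each incoming edge, softmax, then weighted-sum the sources."""
    scores = np.array([a_r @ np.concatenate([s, h_dst]) for s in h_src])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return (w[:, None] * h_src).sum(axis=0)

task_raw = rng.normal(size=(3, 5))   # 3 task nodes, 5 raw static features
W_task = rng.normal(size=(5, D))     # embedding weights for the Task type
h_tasks = embed(task_raw, W_task)

a_tt = rng.normal(size=2 * D)        # attention parameters for the E_TT relation
# Aggregate messages from tasks 0 and 1 into task 2 along precedence edges:
msg = attend(h_tasks[:2], h_tasks[2], a_tt)
print(msg.shape)  # (8,)
```

Because `a_r` is indexed by edge type, the network can weight a high-affinity placement edge differently from a contention edge, which is the point of the heterogeneous attention design.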

Uncertainty quantification is achieved via evidential regression. Each head predicts the parameters of a Normal‑Inverse‑Gamma (NIG) distribution (γ, ν, α, β). The predictive mean is γ, while aleatoric and epistemic variances are derived analytically from the NIG parameters. Training minimizes the negative log marginal likelihood plus an evidence regularizer that penalizes over‑confident predictions. This yields calibrated prediction intervals without the computational overhead of Monte‑Carlo dropout or ensembles.
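The NIG quantities can be written out directly. The closed forms below follow the standard deep evidential regression formulation: predictive mean γ, aleatoric variance β/(α−1), epistemic variance β/(ν(α−1)), and a Student-t negative log marginal likelihood. The regularizer weighting is an assumption of this sketch, not the paper's exact hyperparameterization.

```python
import math

def nig_moments(gamma, nu, alpha, beta):
    """Predictive mean and variance decomposition from NIG parameters."""
    mean = gamma
    aleatoric = beta / (alpha - 1)           # E[sigma^2]: data noise
    epistemic = beta / (nu * (alpha - 1))    # Var[mu]: model uncertainty
    return mean, aleatoric, epistemic

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log marginal likelihood of y under the NIG prior
    (the marginal is a Student-t distribution)."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def evidence_penalty(y, gamma, nu, alpha):
    """Regularizer: large errors made with high evidence are penalized,
    discouraging over-confident predictions."""
    return abs(y - gamma) * (2.0 * nu + alpha)

mean, alea, epis = nig_moments(gamma=1.2, nu=2.0, alpha=3.0, beta=0.5)
# mean = 1.2, aleatoric = 0.25, epistemic = 0.125
```

A single forward pass yields all four parameters, so both variance terms come for free at inference time, which is what makes this cheaper than MC dropout or ensembling.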

The authors evaluate GraphPerf‑RT on three ARM‑based platforms: NVIDIA Jetson TX2, Jetson Orin NX, and a custom RUBIK Pi board. The dataset comprises 73,920 executions of 42 benchmarks from BOTS and PolyBench, spanning diverse input sizes, core masks, and DVFS settings. A 60/20/20 train/validation/test split stratified by benchmark, input, and configuration is used. GraphPerf‑RT achieves inference times of 2–7 ms, an R² of 0.81 on log‑transformed makespan, Spearman’s ρ of 0.95, and a Prediction Interval Coverage Probability (PICP) of 99.9 % at the 95 % confidence level—substantially outperforming homogeneous GNN baselines and tabular regressors in both accuracy and calibration.
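The calibration metric reported here, PICP, is simply the fraction of true targets that fall inside their predicted intervals; at a nominal 95 % level, a PICP above 95 % (here 99.9 %) indicates conservative intervals. A minimal sketch with made-up values:

```python
import numpy as np

def picp(y_true, lo, hi):
    """Prediction Interval Coverage Probability: fraction of targets
    covered by their per-sample interval [lo, hi]."""
    covered = (y_true >= lo) & (y_true <= hi)
    return covered.mean()

y = np.array([1.0, 2.0, 3.0, 4.0])
lo, hi = y - 0.5, y + 0.5          # toy intervals centered on the truth
print(picp(y, lo, hi))             # 1.0: every target is covered

hi_miss = hi.copy()
hi_miss[0] = 0.4                   # interval 0 no longer contains y[0]
print(picp(y, lo, hi_miss))        # 0.75
```

For evidential models, `lo` and `hi` would be derived from the NIG predictive distribution (e.g., mean ± z times the predictive standard deviation at the chosen confidence level).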

To demonstrate practical impact, the surrogate is integrated as a world model for four RL algorithms (SAM‑FRL, SAMBRL, MAMFRL‑D3QN, MAMBRL‑D3QN). In model‑based RL, synthetic rollouts generated by GraphPerf‑RT guide policy updates while filtering out high‑uncertainty actions. The MAMBRL‑D3QN variant reduces average makespan from 2.85 ± 1.66 s (model‑free baseline) to 0.97 ± 0.35 s (66 % reduction) and cuts average energy consumption from 0.033 ± 0.026 J to 0.006 ± 0.005 J (82 % reduction). Crucially, no thermal violations are observed, whereas model‑free RL occasionally exceeds temperature limits.
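The uncertainty-gated rollout idea can be sketched in a few lines. The threshold, the surrogate interface, and the toy predictions below are all illustrative assumptions; the actual systems plug the GNN surrogate into the respective RL training loops.

```python
def filter_actions(candidates, surrogate, max_epistemic=0.05):
    """Keep only actions whose predicted epistemic variance is below a
    threshold; high-uncertainty actions are excluded from synthetic
    rollouts so the policy never trains on unreliable predictions."""
    safe = []
    for action in candidates:
        makespan, epistemic_var = surrogate(action)
        if epistemic_var <= max_epistemic:
            safe.append((action, makespan))
    return safe

# Toy surrogate: action 2 looks fastest but the model is unsure about it.
preds = {0: (1.2, 0.01), 1: (0.9, 0.03), 2: (0.5, 0.20)}
safe = filter_actions([0, 1, 2], lambda a: preds[a])
print([a for a, _ in safe])  # [0, 1]
```

This gating is also what underpins the thermal-safety result: actions whose outcomes the surrogate cannot predict confidently are never explored on the real device.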

The paper’s contributions are: (1) a unified heterogeneous graph that fuses OpenMP DAG, CFG semantics, and per‑core DVFS/thermal context; (2) a type‑aware GAT combined with evidential regression for calibrated uncertainty; (3) real‑time inference suitable for on‑device scheduling; and (4) a demonstration that model‑based RL using the surrogate yields substantial performance, energy, and safety gains over model‑free approaches. Limitations include a relatively coarse memory‑node model (focused on cache hierarchy) and the need for platform‑specific telemetry to train the surrogate, which may hinder rapid deployment on unseen hardware. Future work is proposed on enriching memory‑bandwidth modeling, applying meta‑learning for cross‑platform transfer, and adding online continual‑learning mechanisms to adapt to runtime drift.

Overall, GraphPerf‑RT represents a significant step toward AI‑driven, risk‑aware scheduling on embedded heterogeneous systems, combining structural program analysis, hardware telemetry, and principled uncertainty estimation into a fast, accurate surrogate that can be directly leveraged by reinforcement‑learning agents.

