MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token, enabling computationally efficient training and inference. However, their large overall parameter counts and model sizes have precluded their widespread usage in resource-constrained settings, as all of the parameters must still be loaded into GPU memory. Prior works aim to address this memory bottleneck by offloading certain experts into CPU memory and transferring them to GPU memory only when they are activated. In practice, these methods suffer from the significant I/O latency incurred by expert transfer. We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence. Caching these preferred experts in GPU memory reduces expert churn and CPU-GPU transfer overhead. MELINOE increases throughput by $1.2-3\times$ over efficient baselines and up to $14.7\times$ over transfer-heavy baselines while retaining or even improving the performance of the model on a downstream task, making it a reliable method for improving MoE inference efficiency.


💡 Research Summary

Mixture‑of‑Experts (MoE) models achieve computational efficiency by routing each token to only a few “experts” (small feed‑forward networks) while keeping the overall parameter count extremely large. This design creates a memory bottleneck: all expert weights must reside in GPU memory before they can be used, which is infeasible on devices with limited VRAM. Prior work mitigates this by offloading most experts to CPU DRAM and fetching them on‑demand, or by aggressive quantization and prefetching. However, when the router selects diverse experts across tokens, the off‑device cache suffers frequent misses, leading to costly PCIe transfers that dominate inference latency.
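To make the cache-miss cost concrete, here is a toy illustration (not from the paper) of why diverse routing hurts an offloaded-expert cache. It models the GPU-resident expert set as a simple LRU cache and counts misses, each of which would correspond to a CPU-to-GPU expert transfer over PCIe:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache over expert IDs; counts misses (CPU->GPU transfers)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.misses = 0

    def access(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # hit: refresh recency
        else:
            self.misses += 1                    # miss: simulate an expert transfer
            self.cache[expert_id] = True
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used expert

# Local routing: tokens reuse 2 experts -> only 2 cold-start misses.
local = ExpertCache(capacity=4)
for e in [0, 1] * 8:
    local.access(e)

# Diverse routing: 8 experts cycle through a 4-slot cache -> every access misses.
diverse = ExpertCache(capacity=4)
for e in list(range(8)) * 2:
    diverse.access(e)
```

With a cache holding 4 of 8 experts, the locally routed sequence incurs 2 misses while the diverse one misses on all 16 accesses, which is exactly the regime MELINOE's fine-tuning targets.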

MELINOE proposes a fundamentally different strategy: instead of treating routing as fixed, it fine‑tunes the MoE model to encourage per‑sequence routing locality—i.e., a small, consistent subset of experts is repeatedly used throughout the generation of a single input. The method consists of two stages.

  1. Fine‑tuning with a cache‑simulation loss (Lcs).
    For each layer ℓ and token t, the router produces a probability vector p(ℓ,t). The top‑K entries are turned into a binary request vector r(ℓ,t). MELINOE maintains a soft cache state c(ℓ,t) that mimics a recency‑weighted cache (decay factor γ, capacity C). The cache is updated as c←γc+r and normalized to keep its ℓ₁ norm equal to C. The loss Lcs = (1/LT) Σℓ Σt Σi r_i(ℓ,t)·(1−c_i(ℓ,t)) penalizes selections that would cause a cache miss under the simulated cache. By minimizing Lcs, the model learns to reuse experts already “cached”, thereby reducing the number of expected GPU‑CPU transfers. An additional regularizer prevents global collapse (all sequences using the same experts), preserving overall expert diversity.

  2. Training an activation predictor.
    After fine‑tuning, each input prompt tends to activate a relatively stable set of experts. MELINOE trains a lightweight MLP that, given the prompt embedding, predicts the top‑K experts for each layer. At deployment time, this predictor pre‑loads the predicted experts into the GPU‑resident cache before decoding begins. Because the model now routes mostly within this pre‑loaded set, cache misses become rare, and the inference pipeline proceeds with minimal data movement.
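Stage 1's cache-simulation loss can be sketched in NumPy as follows. This is an illustrative reconstruction from the summary's description, not the authors' code; the exact normalization when the cache exceeds capacity C is an assumption:

```python
import numpy as np

def cache_sim_loss(router_probs, top_k, capacity, gamma):
    """Compute the cache-simulation loss L_cs.

    router_probs: array of shape (L, T, E) with routing probabilities
    for L layers, T tokens, and E experts. Returns the mean penalty for
    top-K selections that would miss the simulated soft cache.
    """
    L, T, E = router_probs.shape
    total = 0.0
    for layer in range(L):
        c = np.zeros(E)  # soft cache state, initially empty
        for t in range(T):
            # Binary request vector r: 1 for the top-K routed experts.
            topk_idx = np.argsort(router_probs[layer, t])[-top_k:]
            r = np.zeros(E)
            r[topk_idx] = 1.0
            # Penalize requests for experts not (softly) resident in the cache.
            total += np.sum(r * (1.0 - c))
            # Recency-weighted update c <- gamma*c + r, rescaled so the
            # l1 norm does not exceed the cache capacity C (assumption).
            c = gamma * c + r
            s = c.sum()
            if s > capacity:
                c *= capacity / s
    return total / (L * T)
```

Minimizing this quantity rewards routers that keep requesting experts already present in the decayed cache state, which is the routing locality MELINOE fine-tunes for. (A real training loss would use differentiable soft selections rather than `argsort`.)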

The authors evaluate MELINOE on two large MoE models: OLMoE (≈13 B parameters) and Mixtral‑8x7B (≈46 B parameters). Experiments are conducted on an NVIDIA H100 (80 GB) with varying cache budgets (25 % and 50 % of total experts). Results show:

  • Throughput gains: Compared to an “efficient baseline” that uses existing offloading and quantization techniques, MELINOE improves tokens‑per‑second throughput by 1.2‑3×. Against a “transfer‑heavy baseline” that relies heavily on on‑the‑fly expert fetching, MELINOE achieves up to 14.7× speedup.
  • Task performance: On downstream benchmarks (GLUE, SQuAD, machine translation), the fine‑tuned models retain their original accuracy, and in some cases exhibit modest improvements, indicating that routing concentration does not harm expressive power.
  • Ablations: Varying γ shows that a purely recency‑based cache (γ = 0, LRU‑like) yields the highest hit rates. Smaller cache capacities amplify the benefit of Lcs, confirming that the loss directly targets the memory‑constrained regime. The activation predictor’s top‑K accuracy correlates strongly with overall throughput.
  • Compatibility: MELINOE can be combined with existing offloading frameworks (e.g., Mixtral‑Offloading, MoE‑Infinity, FLoE) and with quantization, further reducing memory footprints without additional latency.
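The activation predictor from stage 2 can be sketched as a small MLP that maps a prompt embedding to per-layer expert logits, from which the top-K experts to pre-load are read off. This is a minimal sketch under assumed shapes and an assumed ReLU hidden layer, not the paper's architecture:

```python
import numpy as np

def predict_preload_set(prompt_emb, W1, b1, W2, b2, num_layers, num_experts, top_k):
    """Predict the top-K experts to pre-load into the GPU cache per layer.

    prompt_emb: (d,) prompt embedding; W1/b1, W2/b2: MLP parameters
    mapping d -> hidden -> num_layers * num_experts logits.
    Returns an array of shape (num_layers, top_k) of expert indices.
    """
    h = np.maximum(0.0, prompt_emb @ W1 + b1)             # hidden ReLU layer
    logits = (h @ W2 + b2).reshape(num_layers, num_experts)
    # Highest-scoring K experts per layer form the pre-load set.
    return np.argsort(logits, axis=1)[:, -top_k:]

# Demo with random (untrained) weights and hypothetical dimensions.
rng = np.random.default_rng(0)
d, hidden, L, E, K = 8, 16, 2, 4, 2
W1 = rng.standard_normal((d, hidden)); b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, L * E)); b2 = np.zeros(L * E)
preload = predict_preload_set(rng.standard_normal(d), W1, b1, W2, b2, L, E, K)
```

At deployment, these predicted indices would drive the one-time transfer of expert weights into GPU memory before decoding starts, so that subsequent routing mostly hits the pre-loaded set.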

The paper discusses limitations: the fine‑tuning stage adds computational cost; the predictor is not perfect, so occasional cache misses remain; and extremely long sequences with highly dynamic routing may see diminished gains. Future work could explore joint optimization of router and predictor, adaptive cache sizing, and multi‑GPU extensions.

In summary, MELINOE demonstrates that learning to route more locally is an effective, orthogonal lever for reducing CPU‑GPU data movement in MoE inference. By coupling a cache‑aware auxiliary loss with a pre‑deployment activation predictor, the method achieves substantial speedups on memory‑constrained hardware while preserving model quality, offering a practical pathway for deploying large MoE language models in real‑world, resource‑limited environments.

