CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM Inference


The proliferation of large language models (LLMs) is accelerating the integration of multimodal assistants into edge devices, where inference must execute under stringent latency and energy constraints, often exacerbated by intermittent connectivity. These challenges are particularly acute for multimodal LLMs (MLLMs): high-dimensional visual inputs are transformed into long token sequences, inflating the key-value (KV) cache and imposing substantial data-movement overheads on the LLM backbone. To address these issues, we present CHIME, a chiplet-based heterogeneous near-memory accelerator for edge MLLM inference. CHIME leverages the complementary strengths of integrated monolithic 3D (M3D) DRAM and RRAM chiplets: DRAM supplies low-latency bandwidth for attention, while RRAM offers dense, non-volatile storage for weights. This heterogeneous hardware is orchestrated by a co-designed mapping framework that executes fused kernels near the data, minimizing cross-chiplet traffic to maximize effective bandwidth. On FastVLM (0.6B/1.7B) and MobileVLM (1.7B/3B), CHIME achieves up to 54× speedup and up to 246× better energy efficiency per inference compared with the NVIDIA Jetson Orin NX edge GPU, sustaining 116.5-266.5 tokens/J versus Jetson's 0.7-1.1 tokens/J. It also delivers up to 69.2× higher throughput than the state-of-the-art PIM accelerator FACIL. Compared with an M3D DRAM-only design, CHIME's heterogeneous memory further improves energy efficiency by 7% and performance by 2.4×.


💡 Research Summary

The paper introduces CHIME, a heterogeneous near‑memory accelerator designed for edge multimodal large language model (MLLM) inference. CHIME integrates monolithic‑3D (M3D) DRAM and M3D RRAM chiplets within a 2.5D UCIe package, each equipped with dedicated near‑memory processors (NMPs). The DRAM chiplet, built with 1T1C cells, provides low‑latency, high‑bandwidth access for attention, query‑key‑value (QKV) projection, and KV‑cache management. Its 200‑layer vertical stack is tiered into five levels, placing frequently accessed attention data in the lowest tier and connector kernels in the highest tier, thereby minimizing latency. Each DRAM channel contains a 256‑way SIMD special‑function processing element (SFPE), 16 general processing elements (PEs), a 2×2 MAC tensor core, and double‑buffered memory to enable continuous tile processing without stalls.
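The double-buffered tile processing mentioned above can be sketched as a simple ping-pong loop: while the processing elements work on the tile in one buffer, the channel prefetches the next tile into the other, so compute never waits on a fetch. This is an illustrative sketch only; the function and variable names (`load_tile`, `process_tile`, etc.) are not from the paper.

```python
def load_tile(tiles, idx):
    """Stand-in for a DRAM-row fetch into an on-channel buffer."""
    return tiles[idx] if idx < len(tiles) else None


def process_tile(tile):
    """Stand-in for SFPE/PE work on one tile (here: a toy reduction)."""
    return sum(tile)


def double_buffered_pipeline(tiles):
    """Ping-pong between two buffers so fetch and compute can overlap.

    In real hardware the prefetch and the compute run concurrently; in
    this sequential sketch we only model the buffer-swapping schedule.
    """
    results = []
    buffers = [load_tile(tiles, 0), None]  # prime buffer 0 before the loop
    active = 0
    for i in range(len(tiles)):
        # Prefetch tile i+1 into the inactive buffer (overlapped on HW).
        buffers[1 - active] = load_tile(tiles, i + 1)
        results.append(process_tile(buffers[active]))
        active = 1 - active  # swap buffer roles each iteration
    return results
```

The key property is that every iteration finds its input already resident in the active buffer, which is what lets the paper's channels sustain continuous tile processing without stalls.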

The RRAM chiplet, built with 1T1R resistive devices, offers dense non‑volatile storage for the massive weight matrices of the feed‑forward network (FFN). Eight RRAM layers sit above the logic die, each managed by a controller and paired with a 1 MB SRAM buffer. The FFN kernel runs entirely on the RRAM NMP: the attention output (AttnOut) streams from DRAM to RRAM, is fused with locally stored weights in the MAC units, and the resulting FFN output (FFNOut) streams back to DRAM for the next decoding step. This design creates only two fixed cross‑chiplet transfer points, drastically reducing data movement.
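The per-decode-step dataflow described above can be summarized in a small sketch that counts chiplet-link crossings: only AttnOut (DRAM → RRAM) and FFNOut (RRAM → DRAM) traverse the UCIe link, while weights stay resident on the RRAM chiplet and the KV cache stays in DRAM. All names here are illustrative, not the paper's code.

```python
# Illustrative model of one decode step's cross-chiplet traffic.
cross_chiplet_transfers = []


def ucie_send(tensor_name, src, dst):
    """Record a transfer over the (hypothetical) UCIe link model."""
    cross_chiplet_transfers.append((tensor_name, src, dst))


def decode_step(hidden_state):
    # DRAM chiplet: QKV projection + attention over the local KV cache;
    # all intermediates stay on the DRAM NMP.
    attn_out = f"Attn({hidden_state})"
    ucie_send("AttnOut", "DRAM", "RRAM")   # fixed transfer point 1

    # RRAM chiplet: FFN fused with locally stored (non-volatile) weights.
    ffn_out = f"FFN({attn_out})"
    ucie_send("FFNOut", "RRAM", "DRAM")    # fixed transfer point 2
    return ffn_out


decode_step("h_t")
```

Running one step leaves exactly two entries in `cross_chiplet_transfers`, mirroring the paper's claim that the design creates only two fixed cross-chiplet transfer points per decoding step.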

A co‑designed mapping framework orchestrates data placement and execution. It follows three principles: (1) workload‑aware data layout that statically assigns model components to the memory type best suited to their access patterns; (2) KV‑cache tiered scheduling that dynamically migrates cache blocks across DRAM tiers when reuse outweighs migration cost; and (3) kernel locality‑aware fusion that combines QKV projection, FlashAttention, and FFN into single fused kernels, eliminating intermediate tensors and keeping activations local to the NMPs. The framework also respects RRAM endurance by limiting write‑heavy operations.
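Principle (2), migrating a KV-cache block only "when reuse outweighs migration cost," amounts to a simple amortization test. The sketch below shows one plausible form of that test; the paper does not publish this exact cost model, so the formula and parameters are assumptions for illustration.

```python
def should_promote(expected_reuses, t_slow_tier_ns, t_fast_tier_ns,
                   t_migrate_ns):
    """Promote a KV-cache block to a faster DRAM tier iff the cumulative
    latency saved over its expected future reuses exceeds the one-time
    migration cost. (Hypothetical cost model, not the paper's.)
    """
    saving_per_reuse = t_slow_tier_ns - t_fast_tier_ns
    return expected_reuses * saving_per_reuse > t_migrate_ns


# A hot block reused 10x with a 20 ns/access saving amortizes a 100 ns
# migration (200 > 100), while a block reused only 3x does not (60 <= 100).
```

A write-frequency cap along the same lines would implement the framework's RRAM-endurance guard: blocks whose placement would trigger write-heavy traffic to RRAM are simply pinned to DRAM.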

Evaluation on FastVLM (0.6B/1.7B) and MobileVLM (1.7B/3B) shows dramatic gains over the NVIDIA Jetson Orin NX GPU: up to 54× speedup, 246× better energy efficiency, and 116.5‑266.5 tokens/J versus 0.7‑1.1 tokens/J on the GPU. Compared with the state‑of‑the‑art PIM accelerator FACIL, CHIME delivers up to 69.2× higher throughput. Against a DRAM‑only design, CHIME improves energy efficiency by 7% and performance by 2.4×.

Key insights include the effectiveness of heterogeneous chiplet integration for balancing bandwidth, capacity, and energy; the importance of tiered KV‑cache management in reducing latency for long context lengths; and the power of a hardware‑software co‑design that fuses kernels to keep data near compute. Limitations involve RRAM write endurance, the complexity of 2.5 D packaging, and the need to validate the approach across a broader set of models. Future work may explore endurance‑aware scheduling, higher‑precision support, and scaling to multi‑chiplet systems.

