LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons
Vision Transformers have been tremendously successful in computer vision tasks. However, their large computational, memory, and energy demands make edge inference on FPGAs – a field that has seen a recent surge in demand – challenging. We recognize the benefits of recent work on logic- and Look Up Table (LUT)-based networks, such as LogicNets, NeuraLUT, and DWN, in offering models that simultaneously reduce both memory and compute footprints. However, these models do not natively perform well on common vision tasks such as CIFAR-10/100. In this work, we propose LL-ViT, a novel edge-optimized vision transformer design that integrates layers of LUT neurons within the transformer architecture. Based on our characterization, which reveals that a majority of model weights and computations come from the channel mixer (MLP layer), we design an alternative LUT-based channel mixer and simultaneously develop an FPGA-based accelerator for LL-ViT. In contrast to some attempts that replace each multiplication with a table lookup, our architecture uses a neural learning approach that natively learns the LUT functions. This approach yields reduced model sizes and a compute- and energy-efficient inference solution for vision transformer models. Evaluating on edge-suitable workloads, we achieve accuracies of 95.5% on CIFAR-10, 78.8% on CIFAR-100, and 60.9% on Tiny-ImageNet, comparable to the baseline transformer. LL-ViT eliminates over 60% of the model weights and 50% of the multiplications in the model, achieves 1.9x higher energy efficiency and 1.3x lower latency than an integer-quantized ViT accelerator, and offers superior throughput to prior works at a 10.9W power budget.
💡 Research Summary
LL‑ViT introduces a co‑design methodology that tightly integrates learnable look‑up‑table (LUT) neurons into the channel‑mixing (MLP) component of vision transformers (ViTs) to enable efficient edge deployment on field‑programmable gate arrays (FPGAs). The authors begin by profiling several ViT variants and discover that more than 60 % of model parameters and roughly half of the multiply‑accumulate (MAC) operations reside in the MLP layers that perform channel mixing. This observation motivates a redesign of the MLP block, which is traditionally the dominant source of memory and compute overhead.
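The parameter skew described above follows directly from the standard transformer block shapes. A back-of-envelope sketch (assuming DeiT-T-like sizes with embedding dimension 192 and the usual 4× MLP expansion; biases and LayerNorm are ignored for simplicity):

```python
# Parameter breakdown of one transformer encoder block (DeiT-T-like sizes
# assumed for illustration; biases and LayerNorm omitted for simplicity).
d = 192            # embedding dimension of DeiT-T
expansion = 4      # standard MLP expansion ratio

attn_params = 4 * d * d              # Wq, Wk, Wv, and the output projection
mlp_params = 2 * expansion * d * d   # two linear layers: d -> 4d and 4d -> d
total = attn_params + mlp_params

mlp_share = mlp_params / total
print(f"MLP share of block weights: {mlp_share:.1%}")  # → 66.7%
```

With the standard 4× expansion, the MLP accounts for 8d² of the 12d² weights per block, i.e. two-thirds, consistent with the "more than 60 %" figure reported by the profiling.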
Instead of post‑training quantization or naïve replacement of each multiplication with a table lookup, LL‑ViT proposes a fully differentiable LUT neuron. Each neuron concatenates its quantized inputs into an address that indexes a small on‑chip memory (the LUT). The table entries are learned end‑to‑end using approximate gradients derived from extended finite‑difference methods, allowing back‑propagation to update the LUT contents directly. By limiting the input resolution to 4–6 bits, the LUT size remains compatible with the LUT‑slice resources of modern Xilinx Virtex devices, while still providing sufficient expressive power (an n‑input LUT can represent any of the 2^{2^n} Boolean functions). Non‑linearities such as GELU are embedded inside the LUT, eliminating the need for separate activation hardware.
The resulting LL‑ViT encoder consists of two stages: (1) a conventional multi‑head self‑attention (MHA) token mixer, and (2) the newly designed LUT‑based channel mixer. The two stages are pipelined so that tokens flow continuously through the network without stalling. Because the channel mixer contains no multipliers, the design frees up all DSP blocks and dramatically reduces on‑chip BRAM usage; the entire weight set fits within the FPGA’s internal memory, removing costly off‑chip DRAM accesses. The hardware implementation maps each LUT‑neuron directly onto physical LUT resources, achieving high parallelism and low latency.
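The two-stage dataflow can be sketched end to end. The snippet below is a toy NumPy model, not the paper's implementation: it assumes single-head attention, sign-based binarization, and a fixed sparse fan-in per output channel (in the style of LogicNets) to keep each table small.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d, n_bits = 4, 8, 4                    # toy sizes for illustration

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stage 1: conventional self-attention token mixer (single head for brevity)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def token_mixer(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

# Stage 2: LUT channel mixer -- each output channel reads a small fixed
# subset of input channels (sparse fan-in, assumed here), binarizes them
# into an address, and looks up a learned table entry.
conn = rng.integers(0, d, size=(d, n_bits))    # fixed fan-in wiring
tables = rng.standard_normal((d, 2 ** n_bits)) # learned during training
bit_weights = 1 << np.arange(n_bits)[::-1]     # bit positions of the address

def channel_mixer(x):
    out = np.empty_like(x)
    for t in range(x.shape[0]):                # one token at a time
        bits = (x[t][conn] > 0).astype(int)    # (d, n_bits) sign bits
        addrs = bits @ bit_weights             # integer LUT addresses
        out[t] = tables[np.arange(d), addrs]   # pure lookups, no MACs
    return out

x = rng.standard_normal((tokens, d))
y = channel_mixer(token_mixer(x))              # the two pipelined stages
```

In hardware, the address formation is just wiring, so stage 2 needs no multipliers at all; each row of `tables` maps directly onto physical LUT resources.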
Experimental evaluation on three standard vision benchmarks—CIFAR‑10, CIFAR‑100, and Tiny‑ImageNet—shows that LL‑ViT attains 95.5 %, 78.8 %, and 60.9 % top‑1 accuracy respectively, matching or closely approaching the performance of the baseline DeiT‑T transformer. At the same time, LL‑ViT reduces model parameters by more than 60 % and cuts total MAC operations by roughly 50 %. Compared with an integer‑quantized ViT accelerator, LL‑ViT delivers 1.9× higher energy efficiency and 1.3× lower inference latency while sustaining a throughput of 1083 frames per second on a Virtex FPGA at a modest 10.9 W power envelope.
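The reported throughput and power figures imply a per-frame energy budget that is easy to check back-of-envelope:

```python
# Energy per frame implied by the reported figures (simple arithmetic check)
power_w = 10.9                 # reported power envelope (W)
fps = 1083                     # reported throughput (frames per second)

energy_per_frame_mj = power_w / fps * 1e3
print(f"{energy_per_frame_mj:.2f} mJ/frame")   # → 10.06 mJ/frame
```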
The paper’s contributions are threefold: (1) a novel learnable LUT‑based channel‑mixing block that can be seamlessly inserted into transformer encoders; (2) a complete FPGA accelerator architecture that exploits the multiplication‑free nature of LUT‑neurons to eliminate off‑chip weight movement and maximize on‑chip resource utilization; (3) a demonstration that LUT‑based neural components, previously limited to small‑scale datasets, can scale to complex vision transformers and achieve competitive accuracy on mainstream image classification tasks.
In addition to the immediate performance gains, the work opens several avenues for future research. Extending the LUT‑based approach to deeper transformer variants, exploring mixed‑precision LUT designs, and applying the methodology to other vision tasks such as object detection or semantic segmentation are natural next steps. Moreover, the concept of learning LUT tables end‑to‑end could be combined with other model‑compression techniques (e.g., pruning, knowledge distillation) to further push the limits of on‑device AI. In summary, LL‑ViT provides a compelling proof‑of‑concept that learnable LUT neurons can replace the heavyweight MLP blocks of ViTs, delivering substantial reductions in memory footprint, compute demand, and energy consumption while preserving the expressive power required for high‑accuracy visual recognition on edge hardware.