A Study on Inference Latency for Vision Transformers on Mobile Devices

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv paper.

Motivated by the significant advances of machine learning on mobile devices, particularly in computer vision, this work quantitatively studies the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we build a dataset of measured latencies for 1,000 synthetic ViTs composed of representative building blocks and state-of-the-art architectures, spanning two machine learning frameworks and six mobile platforms. Using this dataset, we show that the inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.


💡 Research Summary

This paper presents a comprehensive study of inference latency for Vision Transformers (ViTs) on mobile devices, focusing on real‑world performance rather than cloud‑based training environments. The authors collected 190 publicly available ViT models from the Timm and HuggingFace repositories and 102 convolutional neural networks (CNNs) for comparison. All models were converted to PyTorch Mobile format and deployed on six representative smartphones covering a range of System‑on‑Chip (SoC) architectures: Snapdragon 855, Snapdragon 710, Exynos 9820, Helio P35, Apple A14 Bionic, and Apple A12 Bionic. For each device, a variety of CPU core configurations (large, medium, small) and quantization settings (FP16, INT8) were evaluated. Only 64 of the ViTs could be quantized due to unsupported operations such as roll, highlighting current framework limitations.

The experimental results show that, even when matched for floating‑point operations (FLOPs), ViTs consistently incur higher latency than CNNs—approximately 1.75× longer for models around 5 GFLOPs. A detailed operation‑level breakdown reveals two dominant contributors to ViT latency: (1) linear transformations and matrix multiplications within self‑attention blocks, which account for roughly 40–50 % of total inference time, and (2) the Gaussian Error Linear Unit (GELU) activation, which alone contributes about 30 % of the latency. Because GELU’s computation varies with input magnitude, FLOPs alone are a poor predictor of actual runtime for ViTs.
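To make the GELU cost concrete, here is a minimal sketch (our illustration, not code from the paper) of the exact GELU, its common tanh approximation, and ReLU for contrast; the per-element `erf`/`tanh` evaluation is what makes GELU so much heavier than ReLU's single comparison:

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    # The erf call makes this far costlier per element than ReLU.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation often used by runtimes that lack a fast erf.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def relu(x: float) -> float:
    # ReLU: one comparison per element, trivially cheap by comparison.
    return x if x > 0.0 else 0.0
```

Both GELU variants approach x for large positive inputs and 0 for large negative inputs; the approximation stays within about 1e-3 of the exact form near the origin.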

Framework differences also matter. The same ViT run on PyTorch Mobile versus TensorFlow Lite exhibits a 10–20 % latency gap, with PyTorch Mobile generally faster for matrix-multiply-heavy paths and TFLite better optimized for convolutional kernels. Memory layout (NCHW vs. NHWC) and data type (FP16 vs. INT8) affect memory bandwidth utilization, leading to further variability. Quantization improves performance on high-performance cores (≈1.2× speed-up) but can degrade performance on low-power cores due to increased overhead.
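The memory-layout point can be made concrete with a small sketch (our illustration, not from the paper) that computes the flat buffer offset of a tensor element under each ordering: the C channel values of one pixel are contiguous in NHWC but strided by H×W in NCHW, which changes cache behavior for channel-wise kernels:

```python
def offset_nchw(n, c, h, w, C, H, W):
    # NCHW: all H*W values of one channel are stored contiguously,
    # so stepping to the next channel of the same pixel jumps H*W slots.
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    # NHWC: the C channel values of one pixel are adjacent in memory.
    return ((n * H + h) * W + w) * C + c
```

For a 1×3×4×4 tensor, consecutive channels of pixel (0, 0) sit 16 elements apart in NCHW but directly next to each other in NHWC.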

Leveraging these insights, the authors designed a synthetic ViT search space that combines seven common building blocks—patch embedding, token mixers (Multi‑Head Attention, Spatial‑Reduction Attention, Separable Convolution), MLP, normalization, and activation functions. By enumerating combinations, they generated 1,000 synthetic ViT architectures and measured their latencies across the six devices, two frameworks, multiple core configurations, and both quantized and floating‑point modes, resulting in a dataset of over 12,000 latency records. The dataset, publicly released, includes model topology, FLOPs, parameter count, memory consumption, and per‑operation latency breakdown.
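A search space of this kind can be enumerated with a simple Cartesian product over block choices. The sketch below follows the block categories described above, but the specific option names and dimension values are our own illustrative placeholders, not the paper's exact space:

```python
import itertools
import random

# Candidate options per building-block category (names are illustrative).
SEARCH_SPACE = {
    "token_mixer": ["multi_head_attention", "spatial_reduction_attention",
                    "separable_conv"],
    "activation": ["gelu", "relu", "hardswish"],
    "norm": ["layernorm", "batchnorm"],
    "mlp_ratio": [2, 4],
    "embed_dim": [192, 384, 768],
}

def sample_architectures(n, seed=0):
    # Enumerate every combination, then draw n distinct synthetic configs.
    keys = list(SEARCH_SPACE)
    all_configs = [dict(zip(keys, combo))
                   for combo in itertools.product(
                       *(SEARCH_SPACE[k] for k in keys))]
    rng = random.Random(seed)
    return rng.sample(all_configs, min(n, len(all_configs)))
```

Each sampled configuration would then be instantiated, deployed, and timed on-device to produce one latency record per device/framework/precision combination.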

Using the synthetic subset (900 models) as training data, the authors trained lightweight machine-learning latency predictors (linear regression, random forest, gradient boosting). On a held-out set of 100 synthetic ViTs, the best predictor achieved a mean absolute error (MAE) of 4.4 % for PyTorch Mobile and 4.8 % for TFLite. When evaluated on the 190 real-world ViTs, the same models yielded MAE of 8.2 % (PyTorch Mobile) and 6.1 % (TFLite). These accuracies are sufficient for practical use cases such as Neural Architecture Search (NAS) and split (collaborative) inference.
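The paper's predictors use richer model features, but the basic fit-and-evaluate loop can be sketched with a one-feature least-squares regression of latency against FLOPs, plus the relative-error metric the results are reported in (everything below is our simplified illustration):

```python
def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b, closed form, one feature.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def rel_error_pct(preds, trues):
    # Mean absolute relative error in percent, as in the reported MAE.
    return sum(abs(p - t) / t for p, t in zip(preds, trues)) / len(trues) * 100.0
```

In practice the random-forest and gradient-boosting variants would consume per-operation features (attention dims, activation type, quantization mode) rather than a single FLOP count, which is exactly why FLOPs-only models underperform for ViTs.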

In a NAS scenario, the predictor was used to filter 100 candidate ViTs sampled from the synthetic search space, reducing the need for on‑device measurements by over 90 % while still selecting architectures that meet latency constraints. For collaborative inference, the predictor guided the optimal partitioning of a model between device and cloud, achieving up to a 15 % reduction in end‑to‑end latency compared with naïve partitioning.
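Predictor-guided partitioning amounts to scoring every candidate cut point instead of picking one heuristically. A minimal sketch of that search, with illustrative per-layer numbers of our own (the paper does not give this code):

```python
def best_split(device_lat, cloud_lat, transfer_lat):
    """Choose the cut index k: layers [0, k) run on-device, the
    activation at the cut is uploaded (transfer_lat[k]), and layers
    [k, L) run in the cloud. In a real pipeline the device latencies
    would come from the trained predictor rather than measurements.
    Returns (k, total_latency) minimizing end-to-end latency."""
    L = len(device_lat)
    best = None
    for k in range(L + 1):
        total = sum(device_lat[:k]) + transfer_lat[k] + sum(cloud_lat[k:])
        if best is None or total < best[1]:
            best = (k, total)
    return best
```

With device latencies [5, 5, 20, 20], cloud latencies [1, 1, 2, 2], and transfer costs [50, 10, 8, 30, 2] per cut, the search cuts after the first layer (k = 1) for a total of 20, beating both all-device (50) and all-cloud (56) execution.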

Overall, the paper makes three major contributions: (1) a thorough empirical comparison of ViTs and CNNs on mobile CPUs, exposing the intrinsic latency penalties of self‑attention and GELU; (2) a large, publicly available latency dataset covering diverse hardware, frameworks, and quantization settings; and (3) demonstration that simple ML‑based latency predictors trained on synthetic data can reliably estimate the runtime of both synthetic and real‑world ViTs, enabling efficient NAS and split‑inference pipelines for on‑device computer‑vision applications. The work highlights the importance of considering memory formats, activation functions, and framework implementations when deploying ViTs on resource‑constrained devices, and provides a practical toolkit for the community to accelerate the adoption of efficient transformer‑based vision models on mobile platforms.

