mlx-vis: GPU-Accelerated Dimensionality Reduction and Visualization on Apple Silicon



Han Xiao (han.xiao@jina.ai), Jina AI by Elastic

Abstract

mlx-vis implements eight dimensionality reduction methods -- UMAP, t-SNE, PaCMAP, LocalMAP, TriMap, DREAMS, CNE, MMAE -- and NNDescent k-NN graph construction entirely in MLX for the Apple Silicon Metal GPU. A built-in GPU renderer produces scatter plots and smooth animations via hardware H.264 encoding. On Fashion-MNIST (70K points, M3 Ultra), seven of eight methods embed in 2.0-4.7 s and render 800-frame animations in 1.4 s. The library depends only on MLX and NumPy and is available at https://github.com/hanxiao/mlx-vis.

Keywords: dimensionality reduction, visualization, Apple Silicon, MLX, Metal, GPU acceleration, neighbor embedding

1 Introduction

Dimensionality reduction is a core operation in exploratory data analysis, transforming high-dimensional data into two-dimensional representations that reveal cluster structure, continuity, and outliers. The field has produced a rich family of methods: t-SNE (van der Maaten and Hinton, 2008) and UMAP (McInnes et al., 2018a) preserve local neighborhoods through neighbor embedding, PaCMAP (Wang et al., 2021) and TriMap (Amid and Warmuth, 2019) use triplet-based objectives to balance local and global structure, LocalMAP (Wang et al., 2025) extends PaCMAP with dynamic local graph adjustment, DREAMS (Kury et al., 2026) hybridizes t-SNE with PCA regularization, CNE (Damrich et al., 2023) unifies neighbor embedding under contrastive learning, and MMAE (Cheret et al., 2026) applies manifold-matching regularization through autoencoders. Other notable methods include PHATE (Moon et al., 2019), which captures trajectory structure through diffusion potentials, and StarMAP (Watanabe et al., 2025), which adds PCA centroid attraction to UMAP.
We excluded PHATE and StarMAP: PHATE requires matrix exponentiation and MDS, and StarMAP's centroid attraction adds marginal algorithmic novelty over UMAP. We selected the eight methods above to cover the major algorithmic families -- neighbor embedding, triplet-based, hybrid, contrastive, and autoencoder-based -- while keeping the library focused. Both PHATE and StarMAP would introduce architectural complexity that would compromise the pure MLX design.

Reference implementations of these methods are distributed across independent Python packages with heterogeneous dependencies. The umap-learn package (McInnes et al., 2018b) relies on numba and pynndescent; openTSNE (Poličar et al., 2024) wraps C and Cython extensions; pacmap and trimap each carry their own dependency trees. All are CPU-bound, performing gradient updates and neighbor searches on the CPU even when GPU hardware is available. On Apple Silicon machines, where CPU and GPU share unified memory, this leaves substantial compute capacity unused. GPU-accelerated implementations exist for CUDA hardware -- notably cuML (RAPIDS Development Team, 2020) provides UMAP and t-SNE on NVIDIA GPUs -- but no equivalent targets Apple Silicon's Metal GPU.

mlx-vis addresses both the fragmentation and the performance gap. It reimplements all eight methods plus NNDescent k-nearest neighbor search (Dong et al., 2011) in pure MLX (Apple Machine Learning Research, 2023), Apple's array framework for the Metal GPU. MLX is built on Metal, Apple's low-level GPU API analogous to CUDA, and exposes a NumPy-compatible array interface with lazy evaluation, JIT compilation via @mx.compile, and unified memory access that eliminates CPU-GPU data transfers. The broader MLX ecosystem includes libraries for language models, diffusion, and speech, but to our knowledge no prior work has targeted dimensionality reduction and visualization.
Every operation in the mlx-vis pipeline runs on GPU: PCA preprocessing, k-NN graph construction, gradient-based optimization, and rendering. The library exports a uniform API where each method is instantiated with its hyperparameters and called via fit_transform(X). An optional callback mechanism captures per-epoch snapshots for animation.

A distinctive feature of mlx-vis is its GPU-native visualization pipeline. Rather than delegating rendering to matplotlib, the library implements a circle-splatting renderer in MLX with scatter-add alpha blending, piping frames to ffmpeg with hardware H.264 encoding. A double-buffering scheme overlaps GPU rendering with I/O, producing 800-frame animations in 1.4 seconds. Unlike general-purpose tools such as Datashader, this renderer is purpose-built for embedding animation.

2 Methodology

2.1 Architecture

Figure 1 shows the pipeline. Each dimensionality reduction method occupies a self-contained subpackage (e.g., mlx_vis/umap/, mlx_vis/tsne/, mlx_vis/pacmap/). A thin public wrapper at the package level re-exports each class. The shared NNDescent implementation in mlx_vis/nndescent/ provides k-NN graphs consumed by all eight methods. GPU rendering lives in mlx_vis/render.py, with plotting and animation entry points in mlx_vis/plot.py. The full public API consists of nine algorithm classes and six visualization functions:

from mlx_vis import UMAP, TSNE, PaCMAP, LocalMAP, TriMap, DREAMS, CNE, MMAE, NNDescent
from mlx_vis import scatter_gpu, animate_gpu, morph_gpu

Every method follows the same calling convention: instantiate with hyperparameters, invoke fit_transform(X) on an n × d array, and receive an n × 2 embedding. An epoch_callback parameter accepts a function that receives the current embedding as a NumPy array at each iteration, enabling animation without modifying the optimization loop.
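The calling convention above can be illustrated with a toy stand-in. ToyEmbedder below is hypothetical: its internals are a trivial PCA placeholder, not any of the eight real algorithms; only the fit_transform / epoch_callback interface mirrors the convention the paper describes.

```python
import numpy as np

class ToyEmbedder:
    """Hypothetical stand-in mimicking the mlx-vis calling convention:
    hyperparameters at construction, fit_transform(X) returning an
    n x 2 array, and an optional epoch_callback receiving per-epoch
    snapshots. The embedding itself is a trivial PCA placeholder."""

    def __init__(self, n_epochs=3, epoch_callback=None):
        self.n_epochs = n_epochs
        self.epoch_callback = epoch_callback

    def fit_transform(self, X):
        X = X - X.mean(axis=0)
        # 2-D linear projection (PCA via SVD) as a placeholder embedding
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        Y = X @ Vt[:2].T
        for _ in range(self.n_epochs):
            # a real method would update Y here; we only snapshot it,
            # which is all the animation machinery needs
            if self.epoch_callback is not None:
                self.epoch_callback(np.asarray(Y))
        return Y

snapshots = []
X = np.random.default_rng(0).normal(size=(100, 8))
Y = ToyEmbedder(epoch_callback=snapshots.append).fit_transform(X)
assert Y.shape == (100, 2) and len(snapshots) == 3
```

Because the callback only observes the embedding, animation support costs nothing inside the optimization loop itself.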
Figure 1: The mlx-vis pipeline: X ∈ R^(n×d) → NNDescent → {UMAP, t-SNE, PaCMAP, LocalMAP, TriMap, DREAMS, CNE, MMAE} → Y ∈ R^(n×2) → GPU renderer → PNG/MP4. All stages inside the shaded region execute on Metal GPU through MLX.

2.2 NNDescent on GPU

Approximate k-nearest neighbor search is the first stage of every method. mlx-vis implements NNDescent (Dong et al., 2011) entirely in MLX. The algorithm initializes each point with k random neighbors and iteratively refines the graph by exploring neighbors-of-neighbors. Distances are computed via GPU matrix multiplication: ∥a − b∥² = ∥a∥² + ∥b∥² − 2a⊤b. Top-k selection uses mx.argpartition to avoid full sorting, with early termination when the update rate falls below δ = 0.015.

2.3 Method Implementations

All eight methods follow the standard two-phase pattern: construct a graph or sampling structure, then iteratively optimize a 2D embedding. Each implementation faithfully reproduces the published algorithm. Notable MLX-specific adaptations include: UMAP fits its output kernel parameters via Gauss-Newton optimization rather than scipy curve fitting; t-SNE provides an FFT-accelerated O(n log n) repulsive force variant following FIt-SNE (Linderman et al., 2019); LocalMAP's dynamic local pair resampling uses a pure MLX GPU implementation with mx.argsort-based candidate selection instead of the original per-row Python loop; CNE extracts each contrastive loss into a compiled static method for operator fusion; and MMAE replaces the original PyTorch autoencoder with MLX layers and applies @mx.compile to the reconstruction and manifold-matching loss computation.

2.4 GPU Rendering Pipeline

The rendering pipeline converts an n × 2 embedding into an RGBA image or video entirely on GPU.
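The matrix-multiplication distance expansion and argpartition-based top-k selection from Section 2.2 can be sketched in NumPy, with np.argpartition standing in for mx.argpartition. This is an illustrative brute-force sketch of the two primitives, not the library's NNDescent code, which applies them to candidate sets rather than all pairs.

```python
import numpy as np

def topk_neighbors(A, B, k):
    """Brute-force k-NN between row sets A and B using the expansion
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, followed by argpartition
    to pick the k smallest distances without a full sort."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    idx = np.argpartition(sq, k, axis=1)[:, :k]         # k smallest, unordered
    order = np.argsort(np.take_along_axis(sq, idx, 1), axis=1)
    return np.take_along_axis(idx, order, 1)            # sorted by distance

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
nbrs = topk_neighbors(X, X, k=3)
assert (nbrs[:, 0] == np.arange(50)).all()  # each point's nearest neighbor is itself
```

The expansion turns the all-pairs distance computation into a single matmul, which is exactly the shape of work a GPU executes well; argpartition then costs O(n) per row instead of the O(n log n) of a full sort.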
For each point, a set of pixel offsets within radius R is computed with linear falloff weights w = max(0, 1 − r/R). These offsets are translated to global pixel coordinates, and premultiplied color contributions (α·w·c_r, α·w·c_g, α·w·c_b, α·w) are accumulated into a framebuffer via mx.array.at[idx].add(vals), an atomic scatter-add on GPU. A final normalization pass divides accumulated color by accumulated alpha and composites over the background.

For animation, animate_gpu() renders per-epoch snapshots into frames. Three optimizations keep the time low: hold frames reuse a single rendered buffer; mx.async_eval() overlaps GPU rendering of frame n+1 with I/O of frame n; and constant arrays are converted to MLX tensors once. Frames are piped to ffmpeg with h264_videotoolbox hardware encoding.

Table 1: Embedding performance on Fashion-MNIST 70K, M3 Ultra. All methods run 500 iterations with normalize="standard".

Method    | Time (s) | Mem (GB) | Power (W)
UMAP      | 2.53     | 2.5      | 46
t-SNE     | 4.65     | 3.2      | 79
PaCMAP    | 3.84     | 3.0      | 71
LocalMAP  | 4.04     | 3.0      | 71
TriMap    | 1.98     | 2.6      | 59
DREAMS    | 4.69     | 3.2      | 79
CNE       | 2.95     | 2.3      | 53
MMAE      | 18.77    | 1.7      | 42

2.5 MLX-Specific Optimizations

Lazy evaluation. MLX builds a computation graph and dispatches work to the GPU only upon mx.eval(). The optimization loops place evaluation gates at the end of each epoch, allowing the framework to fuse operations within an epoch into fewer GPU dispatches.

Compilation of hot loops. The @mx.compile decorator JIT-compiles a pure function into a fused GPU kernel. mlx-vis applies this to UMAP's SGD step, t-SNE's repulsive kernel, PaCMAP's per-phase update, and CNE's per-loss gradient, eliminating Python-level overhead.

3 Benchmarks

All benchmarks use Fashion-MNIST (Xiao et al., 2017): 70,000 images of 28 × 28 pixels flattened to 784 dimensions, with normalize="standard" (z-score per feature).
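The circle-splatting pass of Section 2.4 can be sketched in NumPy, with np.add.at standing in for MLX's mx.array.at[idx].add scatter-add. This is a minimal illustration under stated assumptions, not the library's renderer: compositing over a background color is omitted, and only the falloff, premultiplied accumulation, and alpha normalization are shown.

```python
import numpy as np

def splat(points, colors, alpha=0.8, size=64, R=2):
    """Sketch of circle splatting: per-point pixel offsets within
    radius R get linear falloff weights w = max(0, 1 - r/R);
    premultiplied (a*w*c, a*w) contributions are scatter-added into
    a flat framebuffer, then color is normalized by accumulated alpha."""
    buf = np.zeros((size * size, 4))
    dy, dx = np.mgrid[-R:R + 1, -R:R + 1]
    r = np.sqrt(dx**2 + dy**2).ravel()
    w = np.maximum(0.0, 1.0 - r / R)                       # linear falloff
    px = np.clip(points[:, 0, None] + dx.ravel(), 0, size - 1).astype(int)
    py = np.clip(points[:, 1, None] + dy.ravel(), 0, size - 1).astype(int)
    idx = (py * size + px).ravel()
    rgba = np.concatenate([colors, np.ones((len(points), 1))], axis=1)
    contrib = alpha * w[None, :, None] * rgba[:, None, :]  # premultiplied
    np.add.at(buf, idx, contrib.reshape(-1, 4))            # scatter-add blend
    a = np.maximum(buf[:, 3:], 1e-12)
    return (buf[:, :3] / a).reshape(size, size, 3)         # normalize by alpha

img = splat(np.array([[32.0, 32.0]]), np.array([[1.0, 0.0, 0.0]]))
assert img.shape == (64, 64, 3) and img[32, 32, 0] > 0.99
```

Because every contribution is an independent add, the whole pass vectorizes into one scatter-add, which is what makes the renderer GPU-friendly in the first place.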
The hardware is an Apple M3 Ultra with 512 GB unified memory. All timings are mean ± standard deviation over 5 runs. Table 1 reports embedding time, peak GPU memory (via mx.get_peak_memory), and peak GPU power (via macmon) for each method at 500 iterations, alongside reference CPU implementations. Compared to multi-threaded CPU baselines, mlx-vis achieves speedups of 3.4× over umap-learn, 12.6× over openTSNE, 1.7× over pacmap, and 7.2× over trimap. The GPU rendering pipeline adds 1.43 ± 0.21 s for an 800-frame animation at 1000 × 1000 resolution.

The primary source of acceleration is GPU-native execution on unified memory: all matrix operations, neighbor searches, and gradient updates run on Metal GPU without CPU-GPU data transfers. The @mx.compile decorator further reduces overhead by fusing operations within each optimization step. Embedding quality is expected to match the reference implementations, as mlx-vis faithfully reproduces the published objective functions and optimization schedules without modification. Figure 2 in the appendix shows the resulting visualizations for all eight methods.

4 Conclusion

mlx-vis provides a unified, dependency-minimal library for dimensionality reduction and visualization on Apple Silicon. By implementing eight embedding methods and k-NN search in pure MLX, it delivers single-digit-second performance on datasets of 70K points and eliminates the need for scipy, sklearn, numba, and Cython. The GPU-native rendering pipeline extends the acceleration beyond computation to visualization, producing publication-quality scatter plots and smooth animations at rates that enable interactive exploration, a capability absent from existing dimensionality reduction toolkits. It can be installed via pip install mlx-vis.

References

Ehsan Amid and Manfred K. Warmuth.
TriMap: Large-scale dimensionality reduction using triplets. arXiv preprint arXiv:1910.00204, 2019.

Apple Machine Learning Research. MLX: An array framework for Apple silicon. https://github.com/ml-explore/mlx, 2023.

Laurent Cheret, Vincent Létourneau, Isar Nejadgholi, Chris Drummond, Hussein Al Osman, and Maia Fraser. Manifold-matching autoencoders. arXiv preprint arXiv:2603.16568, 2026.

Sebastian Damrich, Jan Niklas Böhm, Fred A. Hamprecht, and Dmitry Kobak. From t-SNE to UMAP with contrastive learning. In ICLR, 2023.

Vipul Divyanshu. ANEgpt: Transformer training on Apple Neural Engine. https://github.com/vipuldivyanshu92/ANEgpt, 2026.

Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, pages 577-586, 2011.

Noël Kury, Dmitry Kobak, and Sebastian Damrich. DREAMS: Preserving both local and global structure in dimensionality reduction. Transactions on Machine Learning Research, 2026.

George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods, 16:243-245, 2019. doi: 10.1038/s41592-018-0308-4.

maderix. Training neural networks on Apple Neural Engine. https://github.com/maderix/ANE, 2026.

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018a.

Leland McInnes, John Healy, and James Melville. umap-learn: UMAP -- uniform manifold approximation and projection. https://github.com/lmcinnes/umap, 2018b.

Kevin R. Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, Antonia van den Elzen, Matthew J. Hirn, Ronald R. Coifman, Natalia B. Ivanova, Guy Wolf, and Smita Krishnaswamy.
Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37:1482-1492, 2019. doi: 10.1038/s41587-019-0336-3.

Pavlin G. Poličar, Martin Stražar, and Blaž Zupan. openTSNE: A modular Python library for t-SNE dimensionality reduction and embedding. Journal of Statistical Software, 109(3):1-30, 2024. doi: 10.18637/jss.v109.i03.

RAPIDS Development Team. RAPIDS cuML: GPU machine learning algorithms. https://github.com/rapidsai/cuml, 2020.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.

Yingfan Wang, Haiyang Huang, Cynthia Rudin, and Yaron Shaposhnik. Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMap, and PaCMAP for data visualization. Journal of Machine Learning Research, 22(201):1-73, 2021.

Yingfan Wang, Yiyang Sun, Haiyang Huang, and Cynthia Rudin. Dimension reduction with locally adjusted graphs. In AAAI, volume 39, pages 21357-21365, 2025. doi: 10.1609/aaai.v39i20.35436.

Koshi Watanabe, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. StarMAP: Global neighbor embedding for faithful data visualization. arXiv preprint arXiv:2502.03776, 2025.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Appendix A. Embedding Visualizations

Figure 2 shows the 2D embeddings produced by all eight methods on Fashion-MNIST (70,000 points). Colors correspond to the 10 garment categories.

Appendix B. Neural Engine Applicability

Apple Silicon chips include a Neural Engine (ANE), a fixed-function accelerator for low-power inference.
Recent reverse-engineering of private ANE APIs (maderix, 2026; Divyanshu, 2026) has enabled direct programming outside CoreML, raising the question of whether ANE could accelerate dimensionality reduction.

We identify three architectural mismatches that preclude this. First, ANE executes dense convolutions and element-wise operations through its MIL instruction set, but the dominant operations in mlx-vis are scatter-add updates on randomly sampled edge indices (UMAP, PaCMAP, TriMap) and pointer-chasing neighbor lookups (NNDescent), neither of which has an ANE representation. Second, all embedding methods produce n × 2 outputs, whereas ANE's internal SRAM tiling is optimized for channel counts of 256-1024; a 2-channel output utilizes less than 1% of the hardware width. Third, each ANE evaluation requires kernel-mediated IOSurface writes and reads; prior benchmarks of ANE dispatch latency report approximately 0.5 ms of fixed overhead per call on M-series chips (maderix, 2026). For reference, a single UMAP SGD step on Fashion-MNIST 70K completes in roughly 5 ms on Metal GPU (2.53 s / 500 epochs), so IOSurface overhead alone would consume approximately 10% of computation time per call, with no compensating speedup on the operations that dominate the workload.

MLX avoids all three limitations: Metal GPU executes scatter-add natively, imposes no channel-width constraints, and accesses unified memory through zero-copy buffers without kernel transitions. ANE's advantage is its low power envelope for on-device inference, not throughput on desktop hardware.

Figure 2: Fashion-MNIST 70K embeddings produced by the eight methods in mlx-vis, rendered by the GPU circle-splatting pipeline.
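The dispatch-overhead estimate in Appendix B is simple arithmetic on numbers quoted in the text, and can be checked directly:

```python
# Back-of-envelope check of Appendix B's ANE dispatch-overhead estimate:
# 2.53 s for 500 UMAP epochs gives ~5 ms per SGD step on Metal GPU, so a
# ~0.5 ms fixed ANE dispatch cost per call would be ~10% pure overhead.
step_ms = 2.53 / 500 * 1000   # ~5.06 ms per epoch
overhead = 0.5 / step_ms      # ~0.099, i.e. roughly 10%
assert abs(step_ms - 5.06) < 0.01
assert 0.09 < overhead < 0.11
```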
