Multi-modal Gaussian Process Variational Autoencoders for Neural and Behavioral Data
Characterizing the relationship between neural population activity and behavioral data is a central goal of neuroscience. While latent variable models (LVMs) are successful in describing high-dimensional time-series data, they are typically only designed for a single type of data, making it difficult to identify structure shared across different experimental data modalities. Here, we address this shortcoming by proposing an unsupervised LVM which extracts temporally evolving shared and independent latents for distinct, simultaneously recorded experimental modalities. We do this by combining Gaussian Process Factor Analysis (GPFA), an interpretable LVM for neural spiking data with a temporally smooth latent space, with Gaussian Process Variational Autoencoders (GP-VAEs), which similarly use a GP prior to characterize correlations in a latent space, but admit rich expressivity due to a deep neural network mapping to observations. We achieve interpretability in our model by partitioning latent variability into components that are either shared between modalities or independent of each modality. We parameterize the latents of our model in the Fourier domain, and show improved latent identification using this approach over standard GP-VAE methods. We validate our model on simulated multi-modal data consisting of Poisson spike counts and MNIST images that scale and rotate smoothly over time. We show that the multi-modal GP-VAE (MM-GPVAE) not only identifies the shared and independent latent structure across modalities accurately, but also provides good reconstructions of both images and neural rates on held-out trials. Finally, we demonstrate our framework on two real-world multi-modal experimental settings: Drosophila whole-brain calcium imaging alongside tracked limb positions, and Manduca sexta spike train measurements from ten wing muscles as the animal tracks a visual stimulus.
💡 Research Summary
The paper introduces a novel unsupervised latent variable model, the Multi‑Modal Gaussian Process Variational Autoencoder (MM‑GPVAE), designed to jointly model simultaneously recorded neural and behavioral time‑series data. Traditional approaches either apply dimensionality reduction separately to neural recordings and then relate them to behavior, or use a single latent space to represent multiple modalities, which hampers interpretability and the ability to isolate shared versus modality‑specific dynamics. MM‑GPVAE addresses these issues by (1) partitioning the latent space into shared components (z_S) and modality‑specific components (z_A for neural data, z_B for behavioral data), (2) enforcing smooth temporal dynamics through a Gaussian Process (GP) prior, and (3) representing the GP‑driven latents in the Fourier domain. The Fourier representation diagonalizes the GP covariance, reduces computational cost, and allows pruning of high‑frequency components, thereby encouraging smoothness and sparsity.
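The diagonalization claim is easy to verify numerically: a stationary GP covariance over a (periodized) time grid is circulant, and the discrete Fourier basis diagonalizes any circulant matrix, so the GP prior factorizes over frequencies. The sketch below is illustrative and not the authors' code; the kernel choice (RBF), length scale, and grid size are assumptions for demonstration.

```python
import numpy as np

T = 128
ts = np.arange(T)
ell = 10.0  # RBF length scale (illustrative value)

# RBF kernel on a periodic distance grid, so the covariance matrix is circulant
d = np.minimum(ts, T - ts)
k_row = np.exp(-0.5 * (d / ell) ** 2)
K = np.array([np.roll(k_row, i) for i in range(T)])  # circulant GP covariance

# The unitary DFT matrix diagonalizes any circulant matrix: F K F^H is diagonal
F = np.fft.fft(np.eye(T)) / np.sqrt(T)
K_freq = F @ K @ F.conj().T

# Off-diagonal entries vanish up to numerical error, so the prior is
# independent across Fourier modes; for a smooth kernel the diagonal
# (the power spectrum) decays quickly, justifying high-frequency pruning.
off_diag_max = np.max(np.abs(K_freq - np.diag(np.diag(K_freq))))
spectrum = np.abs(np.diag(K_freq))
```

Because the high-frequency diagonal entries are near zero for a smooth kernel, truncating those modes loses almost no prior mass, which is the basis for the sparsity and computational savings described above.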
In the generative model, Fourier‑domain latents are transformed back to the time domain via an inverse FFT, then linearly mixed by loading matrices W_A and W_B. The neural modality (A) is passed through a pointwise exponential nonlinearity to produce Poisson firing rates (or calcium fluorescence), while the behavioral modality (B) is decoded by a deep neural network g_ψ, enabling arbitrary high‑dimensional observations such as images or limb trajectories. The evidence lower bound (ELBO) combines a Poisson log‑likelihood for neural data, a Gaussian log‑likelihood for behavioral data, the GP prior term, and the variational entropy.
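The generative pass described above can be sketched end to end in NumPy. This is a toy stand-in, not the authors' implementation: all dimensions, the number of retained Fourier modes, and the one-layer MLP playing the role of the deep decoder g_ψ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K_f = 100, 20          # time points; retained low-frequency Fourier modes
d_s, d_a, d_b = 2, 1, 1   # shared / neural-only / behavior-only latent dims
N, D = 30, 16             # neurons; behavioral observation dimension

def sample_latents(d):
    """Sample Fourier-domain latents, zero out high frequencies, invert to time."""
    z_f = np.zeros((d, T), dtype=complex)
    z_f[:, :K_f] = rng.normal(size=(d, K_f)) + 1j * rng.normal(size=(d, K_f))
    return np.fft.ifft(z_f, axis=1).real  # smooth time-domain trajectories

z_s, z_a, z_b = sample_latents(d_s), sample_latents(d_a), sample_latents(d_b)

# Linear mixing: each modality sees the shared latents plus its own
W_A = rng.normal(size=(N, d_s + d_a))
W_B = rng.normal(size=(D, d_s + d_b))
x_A = W_A @ np.vstack([z_s, z_a])   # (N, T) pre-nonlinearity neural signal
x_B = W_B @ np.vstack([z_s, z_b])   # (D, T) input to the behavior decoder

# Neural modality: pointwise exponential nonlinearity -> Poisson spike counts
rates = np.exp(x_A)
spikes = rng.poisson(rates)

# Behavioral modality: deep decoder g_psi (a random one-layer MLP stand-in)
W1, W2 = rng.normal(size=(32, D)), rng.normal(size=(D, 32))
behavior = W2 @ np.tanh(W1 @ x_B)   # mean of the Gaussian observation model
```

Note how the shared latents z_s enter both modalities while z_a and z_b each drive only one, which is what lets the fitted model attribute variance to shared versus modality-specific sources.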
Training proceeds with amortized variational inference: an encoder network processes the entire trial (all time points) to produce variational means and variances in the Fourier space, which are then inverse‑transformed to obtain time‑domain latent trajectories. The authors demonstrate that this Fourier‑based GP‑VAE dramatically improves latent recovery compared to a standard GP‑VAE and a vanilla VAE on a simulated dataset where an MNIST digit rotates and scales smoothly while Poisson spikes are generated from the same latent angle.
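The ELBO computation for a single trial can be sketched as follows. This toy version (again, not the authors' code) uses random stand-ins for the encoder outputs and data, a single one-dimensional latent, and an assumed decaying prior spectrum; the Poisson term drops the constant log k! factor.

```python
import numpy as np

rng = np.random.default_rng(1)
K_f, T = 20, 64                            # retained Fourier modes; trial length

# Variational parameters an encoder network would output (random stand-ins)
mu_q = rng.normal(size=K_f)
log_var_q = 0.1 * rng.normal(size=K_f)
# Diagonal GP prior variance per frequency (assumed decaying spectrum)
prior_var = np.exp(-0.5 * (np.arange(K_f) / 4.0) ** 2) + 1e-6

# Reparameterized sample in the Fourier domain, then inverse transform;
# irfft zero-pads the missing high frequencies, enforcing smoothness.
z_f = mu_q + np.exp(0.5 * log_var_q) * rng.normal(size=K_f)
z_t = np.fft.irfft(z_f, n=T)

# Likelihood terms (toy data; rates and means would come from the decoders)
spikes = rng.poisson(1.0, size=T)
rate = np.exp(z_t - z_t.mean())
ll_poisson = np.sum(spikes * np.log(rate) - rate)   # log k! constant dropped

behavior = rng.normal(size=T)
sigma2 = 0.5
ll_gauss = -0.5 * np.sum((behavior - z_t) ** 2 / sigma2
                         + np.log(2 * np.pi * sigma2))

# KL(q || p) factorizes over frequencies because the prior is diagonal there
var_q = np.exp(log_var_q)
kl = 0.5 * np.sum(np.log(prior_var / var_q)
                  + (var_q + mu_q**2) / prior_var - 1.0)

elbo = ll_poisson + ll_gauss - kl
```

In training, the gradient of this ELBO would flow through the reparameterized sample back into the encoder and decoder parameters; here the pieces are shown only to make the three terms of the objective concrete.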
Three empirical evaluations validate the model. First, on simulated data, MM‑GPVAE accurately recovers the true shared angle and yields high‑quality image reconstructions. Second, on Drosophila whole‑brain calcium imaging paired with 16 tracked limb positions, the model separates behavior‑related shared latents from limb‑specific independent latents, revealing clear clusters corresponding to distinct behavioral states. Third, on recordings from ten wing muscles of Manduca sexta together with a continuously moving visual stimulus, the method isolates muscle‑specific dynamics and stimulus‑related dynamics while capturing their shared temporal structure. Across all experiments, MM‑GPVAE achieves lower reconstruction error and more interpretable latent decompositions than competing baselines.
The paper’s contributions are threefold: (i) a Fourier‑domain parameterization of GP‑VAE latents that improves scalability and smoothness, (ii) a principled way to split latent space into shared and modality‑specific components while preserving temporal correlations, and (iii) empirical evidence that the approach works on both synthetic and real multimodal neuroscience datasets. Limitations include sensitivity to the choice of Fourier truncation level and GP hyperparameters, and the need to design an appropriate decoder for each behavioral modality. Future work may explore automatic hyperparameter tuning, extensions to more than two modalities (e.g., audio‑visual‑neural), and integration with downstream causal inference frameworks.