A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures


We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEPA components on CIFAR-10. Linear probing of these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.


💡 Research Summary

The paper introduces EB‑JEPA, an open‑source library that makes Joint‑Embedding Predictive Architectures (JEPAs) accessible for researchers and educators. JEPAs differ from traditional generative models by learning to predict future observations in a learned latent space rather than reconstructing raw pixels. This shift reduces computational cost and focuses learning on semantically meaningful features.

EB‑JEPA implements three progressively more complex use‑cases: (1) image‑level self‑supervised representation learning, (2) video prediction, and (3) action‑conditioned world modeling for goal‑directed planning. All three are built on a unified energy‑based formulation where the loss consists of a prediction term (the squared error between a predicted latent vector and a target latent vector) plus a regularization term that prevents representation collapse. Two families of regularizers are provided: VICReg (variance and covariance penalties) and SIGReg (a Gaussianity test on random 1‑D projections). Experiments show that SIGReg is more stable across hyper‑parameter settings, while VICReg can achieve comparable peak performance with careful tuning.
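The shared objective described above can be sketched in a few lines. The snippet below is a minimal illustration, not the library's actual API: the VICReg-style coefficients, the hinge target, and the function names are assumptions chosen for clarity.

```python
import numpy as np

def vicreg_regularizer(z, var_target=1.0, eps=1e-4):
    """VICReg-style collapse regularizer on a batch of embeddings z of shape
    (N, D). Coefficients and the variance target are illustrative choices."""
    n, d = z.shape
    z = z - z.mean(axis=0)
    # Variance term: hinge pushing each dimension's std above a target.
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, var_target - std))
    # Covariance term: penalize off-diagonal entries of the covariance matrix,
    # decorrelating embedding dimensions.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss + cov_loss

def jepa_loss(z_pred, z_target, z_batch, reg_weight=1.0):
    """Unified energy-based objective: squared prediction error in latent
    space plus a regularizer that prevents representation collapse."""
    pred_loss = np.mean((z_pred - z_target) ** 2)
    return pred_loss + reg_weight * vicreg_regularizer(z_batch)
```

A fully collapsed batch (all embeddings identical) drives the variance term toward its maximum, which is exactly the failure mode the regularizer penalizes.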

In the image setting, two random augmentations of the same image are encoded by a shared encoder (e.g., ResNet‑18). The L2 distance between the two embeddings is minimized, and a projector is optionally used before applying the regularizer. On CIFAR‑10, linear probing of the learned features reaches 91% accuracy, outperforming a baseline without a projector.
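One training step of this image setup might look like the following sketch. The `encoder`, `projector`, and `reg_fn` callables are stand-ins (e.g., a ResNet-18, an MLP, and VICReg/SIGReg respectively); the structure, not the names, is what the library prescribes.

```python
import numpy as np

def image_jepa_step(encoder, projector, view_a, view_b, reg_fn, reg_weight=1.0):
    """One training-step sketch for image-level JEPA: encode two augmented
    views with a shared encoder, pull their embeddings together, and apply
    the collapse regularizer after an optional projector."""
    z_a, z_b = encoder(view_a), encoder(view_b)
    # Invariance term: L2 distance between the two view embeddings.
    invariance = np.mean((z_a - z_b) ** 2)
    # Regularizer is applied to projected features, as in the paper's setup.
    reg = reg_fn(projector(z_a)) + reg_fn(projector(z_b))
    return invariance + reg_weight * reg
```

The projector decouples the space where invariance is enforced from the space where the regularizer acts, which the ablations suggest matters for probing accuracy.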

For video, each frame is encoded into a latent vector, and a temporal predictor (a UNet‑style or GRU‑based network) takes a sliding window of past latent vectors to predict the next one. To align training with the autoregressive inference used at test time, the authors add multi‑step rollout losses (k‑step predictions) to the objective. Ablations on Moving MNIST demonstrate that a rollout horizon of about four steps yields the best trade‑off between short‑term accuracy and long‑term coherence, and visualizations show that the model can maintain digit trajectories over many frames.
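The multi-step rollout objective can be sketched as below. The context-window size and the predictor interface are assumptions for illustration; the library's predictor is a UNet-style or GRU-based network rather than the generic callable used here.

```python
import numpy as np

def rollout_loss(predictor, latents, k=4):
    """Multi-step rollout objective sketch: feed the predictor its own
    outputs for k steps and accumulate squared error against the true
    future latents. `latents` has shape (T, D); `predictor` maps a stacked
    window of past latents to the next latent."""
    window = 3  # assumed context length, not taken from the paper
    total, count = 0.0, 0
    for t in range(window, len(latents) - k):
        context = list(latents[t - window:t])
        for step in range(k):
            pred = predictor(np.stack(context))
            total += np.mean((pred - latents[t + step]) ** 2)
            count += 1
            # Autoregressive: slide the window and reuse the prediction,
            # matching how the model is unrolled at inference time.
            context = context[1:] + [pred]
    return total / max(count, 1)
```

Training through the model's own predictions is what aligns the objective with autoregressive inference; the paper's ablations place the sweet spot for k around four steps.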

The most complex example is an action‑conditioned video model, where an additional action encoder maps control inputs to a conditioning vector. The predictor now receives both past latent states and past actions to forecast the next latent state. Because the dynamics are now coupled with actions, two extra regularizers are introduced: a temporal similarity loss that encourages smooth latent trajectories, and an inverse dynamics loss that forces the latent transition to be predictive of the executed action. The combined loss (prediction + VICReg/SIGReg + similarity + inverse dynamics) enables the model to learn a latent world model suitable for planning.

Planning is performed by defining an energy over a candidate action sequence: the sum of squared distances between the goal embedding and each predicted latent state along the imagined rollout. The authors employ Model‑Predictive Path Integral (MPPI) control, a population‑based optimizer that samples action trajectories, weights them by an exponential of negative energy, and iteratively refines the proposal distribution. In the Two Rooms navigation task, this approach achieves a 97% success rate, showing that the learned latent dynamics are accurate enough for model‑based control.
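The MPPI loop described above can be sketched as follows. This is a minimal illustration assuming a generic latent `dynamics` callable; sample counts, iteration counts, and the temperature are placeholder values rather than the library's settings.

```python
import numpy as np

def mppi_plan(dynamics, z0, z_goal, horizon=5, n_samples=128,
              n_iters=3, temperature=1.0, action_dim=2, seed=0):
    """Minimal MPPI sketch: sample action sequences around a proposal mean,
    roll them out through the latent dynamics, weight by exp(-energy), and
    refine the mean. Energy = summed squared goal distance along the rollout."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    for _ in range(n_iters):
        noise = rng.normal(size=(n_samples, horizon, action_dim))
        actions = mean[None] + noise
        energies = np.empty(n_samples)
        for i in range(n_samples):
            z, e = z0, 0.0
            for t in range(horizon):
                z = dynamics(z, actions[i, t])
                e += np.sum((z - z_goal) ** 2)
            energies[i] = e
        # Exponentially weight trajectories by negative energy (shifted by
        # the minimum for numerical stability), then update the proposal.
        w = np.exp(-(energies - energies.min()) / temperature)
        w /= w.sum()
        mean = np.einsum('n,nhd->hd', w, actions)
    return mean
```

With a toy point-mass dynamics `z + a`, the refined first action points from the start state toward the goal, which is the behavior the planning energy is designed to induce.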

The library is highly modular: encoders, predictors, regularizers, and planners are separate reusable components with clear APIs. All examples run on a single GPU (e.g., RTX 3090) within a few hours, using modest memory (4‑8 GB). Detailed hyper‑parameter tables are provided in the appendix, and the code is extensively commented to aid newcomers.

Key contributions are: (i) delivering a lightweight, single‑GPU‑friendly implementation of JEPA across image, video, and action‑conditioned domains; (ii) systematic comparison of collapse‑prevention regularizers, highlighting the practical benefits of SIGReg; (iii) demonstrating the importance of multi‑step rollout losses for stable long‑horizon video prediction; and (iv) integrating a latent world model with a sampling‑based planner to solve a goal‑conditioned navigation task.

Limitations include evaluation on relatively simple datasets (CIFAR‑10, Moving MNIST, Two Rooms) and the need for further testing on high‑resolution video or more complex robotic simulators. Additionally, occasional minor collapse was observed with certain initializations even when using SIGReg, suggesting room for theoretical strengthening.

Overall, EB‑JEPA bridges the gap between recent self‑supervised representation learning theory and practical applications in video and model‑based reinforcement learning, providing a valuable baseline for future research and teaching.

