The Deleuzian Representation Hypothesis


We propose an alternative to sparse autoencoders (SAEs) as a simple and effective unsupervised method for extracting interpretable concepts from neural networks. The core idea is to cluster differences in activations, which we formally justify within a discriminant analysis framework. To enhance the diversity of extracted concepts, we refine the approach by weighting the clustering using the skewness of activations. The method aligns with Deleuze’s modern view of concepts as differences. We evaluate the approach across five models and three modalities (vision, language, and audio), measuring concept quality, diversity, and consistency. Our results show that the proposed method achieves concept quality surpassing prior unsupervised SAE variants while approaching supervised baselines, and that the extracted concepts enable steering of a model’s inner representations, demonstrating their causal influence on downstream behavior.


💡 Research Summary

The paper introduces a novel unsupervised method for extracting interpretable concepts from neural networks, positioning it as a lightweight alternative to sparse autoencoders (SAEs). The central premise draws from Deleuze’s philosophical view that concepts are defined by differences rather than universal essences. Technically, the authors sample a set of pairwise activation differences from a chosen layer of a pretrained model. Each difference vector (d_i = x_{p(i)} - x_{q(i)}) captures a directional distinction between two data points. Because computing all (N(N-1)/2) differences is infeasible, they randomly select (N) pairs, guaranteeing that each sample appears once on each side of the subtraction.
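The sampling scheme described above can be sketched as follows. This is a minimal illustration (not the authors' code): two independent permutations place every sample exactly once on the left and once on the right of the subtraction, and pairing is retried if any index would be subtracted from itself.

```python
import numpy as np

def sample_difference_vectors(acts: np.ndarray, seed: int = 0) -> np.ndarray:
    """Sample N pairwise activation differences d_i = x_p(i) - x_q(i),
    with each sample appearing exactly once on each side.

    acts: (N, D) array of activations from one layer of a pretrained model.
    Returns an (N, D) array of difference vectors.
    """
    rng = np.random.default_rng(seed)
    n = acts.shape[0]
    p = rng.permutation(n)            # left side: every index exactly once
    q = rng.permutation(n)            # right side: every index exactly once
    while np.any(p == q):             # retry until no sample pairs with itself
        q = rng.permutation(n)
    return acts[p] - acts[q]
```

Because `p` and `q` are both permutations of the full index set, the sampled differences sum to zero, which gives a quick sanity check on an implementation.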

These difference vectors are then clustered using a weighted K‑Means algorithm. The weighting scheme is based on the skewness (third standardized moment) of each difference’s distribution across the dataset. Highly skewed differences tend to be near‑zero for most samples and spike only occasionally, which would dominate Euclidean distance and lead to redundant clusters. To counteract this, the distance to a centroid is scaled by the inverse of the skewness:
(d(d_i, \bar C) = \frac{1}{\tilde\mu_3(d_i)} \| \bar C - d_i \|^2).
Negative skewness values are sign‑flipped so that directionality is ignored (the method seeks axes, not oriented vectors). The resulting (k) centroids constitute the concept dictionary; (k) is the sole hyper‑parameter and is directly interpretable as the desired number of concepts.
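A minimal sketch of the skewness-weighted clustering, under one plausible reading of the description above: the skewness of each difference vector's components is used as a per-point weight, sign-flipping negative-skew vectors so that (d) and (-d) fall on the same axis. Note that a per-point weight rescales all centroid distances equally, so it leaves the assignment step untouched and enters only the centroid update as a weighted mean.

```python
import numpy as np

def skewness(v: np.ndarray) -> np.ndarray:
    """Third standardized moment of each row's components."""
    mu = v.mean(axis=1, keepdims=True)
    sd = v.std(axis=1, keepdims=True) + 1e-12
    return (((v - mu) / sd) ** 3).mean(axis=1)

def skew_weighted_kmeans(diffs: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Assumed sketch of skewness-weighted K-Means over difference vectors.

    Highly skewed (spiky, near-sparse) differences are down-weighted by the
    inverse of their skewness; negative-skew vectors are sign-flipped so the
    method clusters axes rather than oriented directions.
    """
    rng = np.random.default_rng(seed)
    s = skewness(diffs)
    diffs = np.where(s[:, None] < 0, -diffs, diffs)   # axes, not directions
    w = 1.0 / (np.abs(s) + 1e-12)                     # inverse-skewness weights
    C = diffs[rng.choice(len(diffs), size=k, replace=False)]
    for _ in range(iters):
        # assignment: per-point weights scale every centroid distance equally,
        # so the argmin is the plain Euclidean nearest centroid
        d2 = ((diffs[:, None, :] - C[None]) ** 2).sum(-1)
        a = d2.argmin(axis=1)
        # update: weighted mean of the points assigned to each centroid
        for j in range(k):
            m = a == j
            if m.any():
                C[j] = np.average(diffs[m], axis=0, weights=w[m])
    return C
```

The returned `C` plays the role of the concept dictionary, with `k` as the single hyper-parameter.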

The authors connect this procedure to Fisher’s Linear Discriminant Analysis (LDA). In a supervised setting, LDA finds a direction (c) that maximally separates two class means under a shared covariance assumption: (c \propto (\Sigma_A + \Sigma_B)^{-1}(\mu_A - \mu_B)). By treating each sampled pair as a two‑class problem with isotropic covariances (approximated as identity matrices), the optimal separating direction reduces to the raw difference vector. Hence, clustering differences can be viewed as an unsupervised analogue of LDA that does not require class labels. The authors also discuss a quadratic extension for anisotropic covariances, but empirical results show no benefit, so the isotropic version is retained.
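The reduction is easy to verify numerically: with identity within-class covariances, the Fisher direction collapses to a positive multiple of the mean difference. The means below are arbitrary illustrative values.

```python
import numpy as np

# Fisher's LDA direction: c ∝ (Σ_A + Σ_B)^{-1} (μ_A − μ_B).
# Under the paper's isotropic approximation Σ_A = Σ_B = I, this is just
# a rescaled raw difference of means — the quantity being clustered.
mu_a = np.array([1.0, 2.0, 0.5])
mu_b = np.array([0.0, 1.0, 2.5])
sigma = np.eye(3)                                # identity covariance
c = np.linalg.solve(sigma + sigma, mu_a - mu_b)  # (2I)^{-1} (μ_A − μ_B)
assert np.allclose(c, 0.5 * (mu_a - mu_b))       # same direction as the difference
```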

Scalability is a key advantage: both the pairwise sampling and weighted K‑Means run in (O(ND)) time and memory, making the approach applicable to large‑scale vision, language, and audio models.

Evaluation spans five pretrained encoders across three modalities: CLIP and DinoV2 (vision), DeBERTa, BART, and Pythia‑70M (text), and an Audio Spectrogram Transformer (audio). Datasets include ImageNet‑100, WikiArt (with artist, style, genre labels), IMDB sentiment, CoNLL‑2003 (NER, POS, chunking), and AudioSet. The authors assess three aspects: (1) concept quality via Probe Loss, which measures how well a one‑dimensional logistic probe can recover ground‑truth attributes from each concept; (2) diversity/coverage by aggregating Probe Loss across 874 attributes; (3) consistency across random seeds using Maximum Pairwise Pearson Correlation (MPPC).
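The Probe Loss metric can be sketched as follows, under an assumed form consistent with the description (a one-dimensional logistic probe on the projection of activations onto a single concept vector; the gradient-descent probe here stands in for whatever solver the authors used):

```python
import numpy as np

def probe_loss(acts, labels, concept, steps=500, lr=0.5):
    """Assumed sketch of Probe Loss: binary cross-entropy of a 1-D logistic
    probe fit on the scalar projection of activations onto one concept.

    acts: (N, D) activations; labels: (N,) binary attribute; concept: (D,).
    Lower is better: the concept linearly encodes the attribute.
    """
    z = acts @ concept                        # scalar projection per sample
    z = (z - z.mean()) / (z.std() + 1e-12)    # standardize for stable fitting
    w, b = 0.0, 0.0
    for _ in range(steps):                    # plain gradient descent on BCE
        p = 1.0 / (1.0 + np.exp(-(w * z + b)))
        g = p - labels
        w -= lr * (g * z).mean()
        b -= lr * g.mean()
    p = np.clip(1.0 / (1.0 + np.exp(-(w * z + b))), 1e-7, 1 - 1e-7)
    return -(labels * np.log(p) + (1 - labels) * np.log1p(-p)).mean()
```

A concept aligned with the attribute yields a much lower loss than a random direction, which is the comparison the benchmark aggregates over its 874 attributes.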

Results (Table 1) show that the proposed method achieves lower Probe Loss than all SAE variants (Vanilla‑SAE, Gated‑SAE, JumpReLU‑SAE, Matryoshka‑SAE, TopK‑SAE, Archetypal‑SAE) on 13 of 20 tasks, and approaches the supervised LDA baseline, especially on image tasks. MPPC scores indicate high repeatability, surpassing SAEs and ICA. The authors also compare against Independent Component Analysis (ICA) and pretrained SAEs, confirming superior performance.
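The MPPC consistency metric admits a compact sketch, under an assumed form: for each concept from one seed, take the best absolute Pearson correlation with any concept from another seed, and average over concepts. Values near 1 mean different seeds recover essentially the same dictionary.

```python
import numpy as np

def mppc(C1: np.ndarray, C2: np.ndarray) -> float:
    """Assumed sketch of Maximum Pairwise Pearson Correlation between two
    concept dictionaries (k, D) extracted with different random seeds."""
    A = (C1 - C1.mean(1, keepdims=True)) / (C1.std(1, keepdims=True) + 1e-12)
    B = (C2 - C2.mean(1, keepdims=True)) / (C2.std(1, keepdims=True) + 1e-12)
    R = (A @ B.T) / C1.shape[1]       # all pairwise Pearson correlations
    # |r| ignores sign, since concepts are axes rather than oriented vectors
    return float(np.abs(R).max(axis=1).mean())
```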

Beyond quantitative metrics, the paper demonstrates “concept steering.” Because extracted concepts are vectors in the original activation space, steering is performed by a simple additive operation: (\tilde{x}=x + \alpha c_i). This avoids the encoder‑decoder projections required by SAE‑based steering, eliminating reconstruction error and ensuring lossless, reversible manipulation. Experiments on text (sentiment, entity type) and images (color, style, genre) illustrate that increasing (\alpha) enhances the targeted attribute in the model’s output, while decreasing (\alpha) suppresses it, confirming a causal influence of the extracted concepts on downstream behavior.
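Because steering is a single addition in the original activation space, it is trivially invertible, which is the "lossless, reversible" property claimed above. A minimal sketch:

```python
import numpy as np

def steer(x: np.ndarray, concept: np.ndarray, alpha: float) -> np.ndarray:
    """Additive concept steering as described: x̃ = x + α·c_i.
    Subtracting the same α·c_i restores x exactly — no encoder-decoder
    round-trip, hence no reconstruction error."""
    return x + alpha * concept

x = np.arange(6, dtype=float)      # a stand-in activation vector
c = np.full(6, 0.5)                # a stand-in concept vector
x_steered = steer(x, c, 3.0)       # amplify the attribute (α > 0)
x_restored = steer(x_steered, c, -3.0)
assert np.allclose(x_restored, x)  # reversibility
```

Negative `alpha` suppresses the targeted attribute instead of enhancing it, matching the sentiment and style experiments described above.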

In summary, the contributions are:

  1. A philosophically motivated, mathematically grounded method that extracts concepts as clustered activation differences.
  2. A skewness‑weighted K‑Means scheme that promotes semantic diversity and reduces redundancy.
  3. Theoretical linkage to discriminant analysis, providing an interpretability justification.
  4. Extensive cross‑modal empirical validation showing superior concept quality, diversity, and consistency compared to state‑of‑the‑art SAEs.
  5. Demonstration of lossless, reversible concept steering, highlighting practical utility.

Limitations include reliance on random pair sampling, which may miss rare but important differences unless a sufficiently large (N) is used, and the isotropic covariance assumption that may not hold for all layers or architectures. Future work could explore adaptive sampling strategies, anisotropic extensions, and integration with user‑interactive tools for concept discovery and manipulation.

