Koo-Fu CLIP: Closed-Form Adaptation of Vision-Language Models via Fukunaga-Koontz Linear Discriminant Analysis
Vision-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga-Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight, efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports compression by up to 10-12× with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.
💡 Research Summary
The paper introduces Koo‑Fu CLIP, a lightweight supervised adaptation technique for CLIP visual embeddings based on Fukunaga‑Koontz Linear Discriminant Analysis (LDA). While CLIP’s frozen embeddings are powerful for zero‑shot tasks, they are not optimized for supervised classification: class clusters overlap and the dimensionality (typically 768) is higher than needed for a fixed label set. Koo‑Fu CLIP addresses both issues with a closed‑form linear transformation that first whitens the within‑class scatter and then rotates the whitened space to maximize between‑class separation. Concretely, the method computes the within‑class covariance S_w, adds a small regularization λI to ensure positive definiteness, and obtains the inverse square root S_w^{‑1/2}. This whitening makes each class’s variance spherical. In the whitened space, class‑mean differences are used to build a between‑class scatter S′_b, which is eigendecomposed; the top L eigenvectors form a rotation matrix U_L. The final projection T = U_L^T · S_w^{‑1/2} maps the original 768‑dimensional vectors into an L‑dimensional discriminative subspace. The method has only two hyper‑parameters: λ and the target dimensionality L.
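The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the name `koofu_transform` is invented here, and weighting the between-class scatter by class frequency is an assumption the summary does not spell out.

```python
import numpy as np

def koofu_transform(X, y, L, lam=1e-3):
    """Closed-form Koo-Fu projection T = U_L^T @ S_w^{-1/2} (illustrative sketch).

    X: (n, d) embeddings; y: (n,) integer labels;
    L: target dimensionality; lam: ridge regularization strength.
    """
    n, d = X.shape
    classes = np.unique(y)
    # Within-class scatter: spread of embeddings around their class means.
    Sw = np.zeros((d, d))
    means = {}
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        means[c] = mu
        Sw += (Xc - mu).T @ (Xc - mu)
    Sw /= n
    # Regularize, then take the inverse matrix square root (whitening).
    evals, evecs = np.linalg.eigh(Sw + lam * np.eye(d))
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T  # S_w^{-1/2}
    # Between-class scatter built from whitened class-mean differences
    # (class-frequency weighting is an assumption made for this sketch).
    mu_all = X.mean(axis=0)
    Sb = np.zeros((d, d))
    for c in classes:
        diff = W @ (means[c] - mu_all)
        Sb += (np.sum(y == c) / n) * np.outer(diff, diff)
    # Top-L eigenvectors give the discriminative rotation U_L.
    bvals, bvecs = np.linalg.eigh(Sb)
    U_L = bvecs[:, ::-1][:, :L]  # eigh is ascending; reverse for descending
    return U_L.T @ W             # (L, d) projection matrix
```

At inference, adapting an embedding is then a single matrix multiplication, `z = T @ x`, consistent with the closed-form, training-free character of the method.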
The authors evaluate the approach on ImageNet‑1K, a 14K leaf‑node subset, and the full ImageNet‑21K hierarchy (≈19 K classes). Using a Nearest Visual Prototype (NVP) classifier—where each class is represented by the mean of its transformed embeddings—they achieve a top‑1 accuracy increase from 75.1 % (raw CLIP) to 79.1 % on ImageNet‑1K, a 4 % absolute gain. Similar improvements persist when the label space expands to 14 K and 21 K classes, indicating that the transformation sharpens class boundaries rather than over‑fitting to a closed set. Dimensionality reduction experiments show that compressing embeddings by a factor of 3 (768 → 256) incurs less than 0.5 % accuracy loss, while aggressive compression by 10–12× (down to 64 dimensions) retains performance comparable to the original space. NVP is also far more memory‑efficient than k‑Nearest Neighbors (k‑NN): it stores only one prototype per class, whereas k‑NN must retain all training vectors, incurring roughly 1000× higher memory and computational cost for only a modest 1–2 % accuracy advantage.
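The NVP classifier itself reduces to a prototype lookup under cosine similarity. A minimal sketch, assuming embeddings have already been projected by T (the function name `nvp_classify` is invented for illustration):

```python
import numpy as np

def nvp_classify(Z_train, y_train, Z_test):
    """Nearest Visual Prototype: one mean vector per class, cosine similarity."""
    classes = np.unique(y_train)
    # One prototype per class: the mean of its (projected) embeddings.
    protos = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    # L2-normalize both sides so the dot product equals cosine similarity.
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    Zn = Z_test / np.linalg.norm(Z_test, axis=1, keepdims=True)
    return classes[np.argmax(Zn @ protos.T, axis=1)]
```

Storage is one L-dimensional vector per class, which is where the roughly 1000× memory advantage over k-NN (one vector per training example) comes from.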
A series of ablations explores the effect of the regularization λ (optimal around 10^{‑4}–10^{‑2}), the number of retained dimensions L (performance saturates near 256), distance metrics (cosine similarity is most stable), and prompt strategies for zero‑shot evaluation. The analysis confirms that whitening eliminates noisy, redundant directions in high‑dimensional CLIP space, while the discriminative rotation aligns the remaining axes with maximal class separation.
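Since only λ and L need tuning, the ablation amounts to a plain grid search on a held-out split. The sketch below is generic, not code from the paper; it assumes user-supplied `transform` and `classify` callables with the signatures shown:

```python
import numpy as np

def sweep_hyperparams(X_tr, y_tr, X_val, y_val, lams, Ls, transform, classify):
    """Grid-search lambda and L by validation accuracy (hypothetical helper).

    transform(X, y, L, lam) -> (L, d) projection matrix
    classify(Z_tr, y_tr, Z_val) -> predicted labels for Z_val
    """
    best = (None, None, -1.0)
    for lam in lams:
        for L in Ls:
            T = transform(X_tr, y_tr, L, lam)
            pred = classify(X_tr @ T.T, y_tr, X_val @ T.T)
            acc = float(np.mean(pred == y_val))
            if acc > best[2]:
                best = (lam, L, acc)
    return best  # (best lambda, best L, best validation accuracy)
```

With λ swept over, say, 10^{-5}–10^{-1} and L over {64, 128, 256, 512}, this reproduces the kind of grid the reported optima (λ ≈ 10^{-4}–10^{-2}, L ≈ 256) suggest.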
In summary, Koo‑Fu CLIP provides a closed‑form, training‑free adaptation that simultaneously enhances class separability and enables substantial dimensionality reduction. It requires only a single linear matrix multiplication at inference, incurs negligible computational overhead, and dramatically reduces storage requirements, making it highly suitable for large‑scale image classification and retrieval systems where efficiency and scalability are paramount.