From Insight to Intervention: Interpretable Neuron Steering for Controlling Popularity Bias in Recommender Systems
Popularity bias is a pervasive challenge in recommender systems, where a few popular items dominate attention while the majority of less popular items remain underexposed. This imbalance can reduce recommendation quality and lead to unfair item exposure. Although existing mitigation methods address this issue to some extent, they often lack transparency in how they operate. In this paper, we propose a post-hoc approach, PopSteer, that leverages a Sparse Autoencoder (SAE) to both interpret and mitigate popularity bias in recommendation models. The SAE is trained to replicate a trained model’s behavior while enabling neuron-level interpretability. By introducing synthetic users with strong preferences for either popular or unpopular items, we identify neurons encoding popularity signals through their activation patterns. We then steer recommendations by adjusting the activations of the most biased neurons. Experiments on three public datasets with a sequential recommendation model demonstrate that PopSteer significantly enhances fairness with minimal impact on accuracy, while providing interpretable insights and fine-grained control over the fairness-accuracy trade-off.
💡 Research Summary
The paper introduces PopSteer, a novel post‑hoc framework designed to both interpret and mitigate popularity bias in recommender systems. Popularity bias, where a small set of highly popular items dominates user exposure while long‑tail items receive little attention, remains a persistent challenge that degrades recommendation relevance and fairness. Existing mitigation techniques typically involve re‑weighting item scores, modifying loss functions, or altering model architectures, but they often act as black boxes, offering little insight into which internal components drive the bias.
PopSteer addresses this gap by attaching a Sparse Autoencoder (SAE) to a pre‑trained recommendation model. The SAE is trained to reconstruct the model’s user embeddings while enforcing strict sparsity: only the top‑K neurons (K ≪ N) are active for any input. This sparsity encourages each neuron to specialize in a distinct, interpretable feature. After training, the SAE serves as a transparent lens into the recommendation model’s decision‑making process.
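To make the SAE mechanics concrete, here is a minimal NumPy sketch of a Top‑K sparse autoencoder forward pass. The dimensions (`D`, `N`, `K`) and the randomly initialized weights are placeholders standing in for a trained SAE; the paper's actual sizes and training procedure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: embedding dim D, SAE hidden dim N, active neurons K (K << N).
D, N, K = 64, 512, 32

# Random weights stand in for trained encoder/decoder parameters.
W_enc = rng.normal(scale=0.02, size=(D, N))
b_enc = np.zeros(N)
W_dec = rng.normal(scale=0.02, size=(N, D))
b_dec = np.zeros(D)

def topk_sparsify(h, k):
    """Keep only the k largest activations per row; zero out the rest."""
    out = np.zeros_like(h)
    idx = np.argsort(h, axis=-1)[..., -k:]   # indices of the top-k neurons
    np.put_along_axis(out, idx, np.take_along_axis(h, idx, axis=-1), axis=-1)
    return out

def sae_forward(p):
    """Encode a user embedding p, enforce Top-K sparsity, and decode."""
    h = np.maximum(p @ W_enc + b_enc, 0.0)   # ReLU pre-activations
    h_sparse = topk_sparsify(h, K)           # at most K neurons stay active
    p_hat = h_sparse @ W_dec + b_dec         # reconstruction of p
    return h_sparse, p_hat

p = rng.normal(size=(1, D))                  # a mock user embedding
h_sparse, p_hat = sae_forward(p)
```

In training, the reconstruction loss between `p_hat` and `p` would drive the weights; the hard Top‑K constraint is what pushes individual neurons toward specialized, interpretable features.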
To identify neurons that encode popularity signals, the authors generate two synthetic user profiles: one consisting exclusively of popular (head) items and another consisting exclusively of unpopular (tail) items. These synthetic profiles are fed through the original recommender to obtain user embeddings, which are then passed through the SAE. For each hidden neuron j, the mean and standard deviation of its activation are computed under the popular and unpopular conditions. The effect size is quantified using Cohen’s d:
d_j = (μ_j,Pop – μ_j,Unpop) / sqrt((σ_j,Pop² + σ_j,Unpop²)/2).
A large absolute d_j indicates strong responsiveness to popularity; positive values align with popular items, negative values with long‑tail items. The authors verify that most neuron activations follow an approximately Gaussian distribution, justifying the use of Cohen’s d.
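The per‑neuron effect size computation above can be sketched directly; the mock activation arrays below are illustrative stand‑ins for the SAE hidden activations produced by the synthetic popular/unpopular profiles, and the small epsilon is an assumption for numerical safety.

```python
import numpy as np

def cohens_d(act_pop, act_unpop):
    """Per-neuron Cohen's d between activations under popular vs. unpopular
    synthetic profiles. Both arrays have shape (num_profiles, num_neurons)."""
    mu_p, mu_u = act_pop.mean(axis=0), act_unpop.mean(axis=0)
    var_p, var_u = act_pop.var(axis=0, ddof=1), act_unpop.var(axis=0, ddof=1)
    pooled = np.sqrt((var_p + var_u) / 2.0)
    return (mu_p - mu_u) / (pooled + 1e-8)   # epsilon avoids division by zero

rng = np.random.default_rng(1)
# Mock data: neuron 0 responds to popular profiles, neuron 1 to unpopular,
# neuron 2 to neither.
act_pop = rng.normal(loc=[2.0, 0.0, 1.0], scale=1.0, size=(500, 3))
act_unpop = rng.normal(loc=[0.0, 2.0, 1.0], scale=1.0, size=(500, 3))
d = cohens_d(act_pop, act_unpop)
```

Here `d[0]` comes out strongly positive (popularity‑aligned), `d[1]` strongly negative (long‑tail‑aligned), and `d[2]` near zero, mirroring how the paper ranks neurons by |d_j|.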
Once the bias‑related neurons are identified, PopSteer performs “neuron steering.” For neurons with d_j exceeding a threshold β, the activation is reduced by w_j · σ_j; for neurons with d_j below –β, the activation is increased by the same amount. The weight w_j is proportional to the normalized absolute d_j and is scaled by two hyper‑parameters, α_pop and α_unpop, which control the strength of adjustments for popularity‑aligned and unpopularity‑aligned neurons respectively. This approach differs from prior work that often zeroes out or clamps neuron values; by scaling adjustments with each neuron’s standard deviation, PopSteer preserves the natural activation distribution and avoids drastic side effects.
After steering, the modified hidden activations are passed through the Top‑K sparsity function and then decoded to produce a new user embedding p′. This embedding replaces the original one when computing recommendation scores with the base model’s item embeddings, thereby yielding a recommendation list that is less dominated by popular items.
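The steering-plus-decode step can be sketched as follows. This is a simplified reading of the described mechanism: the normalization of w_j as |d_j| / max|d_j| is one plausible choice and may differ from the paper's exact scaling, and the sizes, weights, and statistics below are mock stand‑ins for trained‑SAE quantities.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, K = 16, 8, 4   # hypothetical hidden dim, embedding dim, sparsity level

def topk_sparsify(h, k):
    """Keep only the k largest activations; zero out the rest."""
    out = np.zeros_like(h)
    idx = np.argsort(h)[-k:]
    out[idx] = h[idx]
    return out

def steer(h, d, sigma, beta=0.8, alpha_pop=1.0, alpha_unpop=1.0):
    """Shift biased neurons by a multiple of their own standard deviation.
    h: (N,) hidden activations; d: per-neuron Cohen's d; sigma: per-neuron std."""
    w = np.abs(d) / (np.abs(d).max() + 1e-8)   # assumed normalization of w_j
    h_new = h.copy()
    pop = d > beta                 # popularity-aligned neurons: suppress
    unpop = d < -beta              # long-tail-aligned neurons: boost
    h_new[pop] -= alpha_pop * w[pop] * sigma[pop]
    h_new[unpop] += alpha_unpop * w[unpop] * sigma[unpop]
    return h_new

# Mock quantities standing in for trained-SAE activations and statistics.
h = np.abs(rng.normal(size=N))
d = rng.normal(scale=1.5, size=N)
sigma = np.abs(rng.normal(size=N)) + 0.1
W_dec = rng.normal(scale=0.1, size=(N, D))

h_steered = steer(h, d, sigma, beta=0.8)
p_prime = topk_sparsify(h_steered, K) @ W_dec   # steered user embedding p'
```

Scaling each adjustment by `sigma[j]` keeps the intervention within a neuron's natural activation range, which is the property the authors credit for avoiding the drastic side effects of zeroing or clamping.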
The authors evaluate PopSteer on three public datasets—MovieLens‑1M, BeerAdvocate, and Yelp—using SASRec as the underlying sequential recommender. They report standard accuracy metrics (HR@10, NDCG@10) alongside fairness metrics such as exposure disparity, Gini coefficient, and a popularity gap measure. Compared with several baselines (re‑weighting, loss‑adjustment, adversarial debiasing), PopSteer achieves substantially higher fairness scores (average improvement >12%) while incurring only a modest accuracy drop (1–2%). Visualizations of neuron d_j values demonstrate that specific neurons are strongly associated with popular or long‑tail items, confirming the interpretability claim. Ablation studies show that the β threshold and the α hyper‑parameters allow fine‑grained control over the fairness‑accuracy trade‑off.
Key contributions of the paper are:
- A post‑hoc, model‑agnostic method that does not require retraining the recommender, making it practical for deployment.
- The use of a Sparse Autoencoder to obtain monosemantic neurons, enabling direct attribution of popularity bias to individual hidden units.
- A principled neuron‑steering mechanism that adjusts activations proportionally to their natural variance, preserving overall model behavior.
- Empirical evidence of superior bias mitigation across multiple datasets and recommendation scenarios, together with detailed interpretability analyses.
In summary, PopSteer offers a compelling solution that simultaneously enhances fairness, retains recommendation quality, and provides transparent insight into the internal workings of deep recommender models. Its ability to intervene at the neuron level opens new avenues for controllable, explainable bias correction in real‑world recommendation pipelines.