Addressing Explainability of Generative AI using SMILE (Statistical Model-agnostic Interpretability with Local Explanations)
The rapid advancement of generative artificial intelligence has enabled models capable of producing complex textual and visual outputs; however, their decision-making processes remain largely opaque, limiting trust and accountability in high-stakes applications. This thesis introduces gSMILE, a unified framework for the explainability of generative models, extending the Statistical Model-agnostic Interpretability with Local Explanations (SMILE) method to generative settings. gSMILE employs controlled perturbations of textual input, Wasserstein distance metrics, and weighted surrogate modelling to quantify and visualise how specific components of a prompt or instruction influence model outputs. Applied to Large Language Models (LLMs), gSMILE provides fine-grained token-level attribution and generates intuitive heatmaps that highlight influential tokens and reasoning pathways. In instruction-based image editing models, the same text-perturbation mechanism is employed, allowing analysis of how modifications to an editing instruction impact the resulting image. Combined with a scenario-based evaluation strategy grounded in the Operational Design Domain (ODD) framework, gSMILE allows systematic assessment of model behaviour across diverse semantic and environmental conditions. To evaluate explanation quality, we define rigorous attribution metrics, including stability, fidelity, accuracy, consistency, and faithfulness, and apply them across multiple generative architectures. Extensive experiments demonstrate that gSMILE produces robust, human-aligned attributions and generalises effectively across state-of-the-art generative models. These findings highlight the potential of gSMILE to advance transparent, reliable, and responsible deployment of generative AI technologies.
💡 Research Summary
The dissertation presents gSMILE, a unified, model‑agnostic framework that extends the SMILE (Statistical Model‑agnostic Interpretability with Local Explanations) methodology to the rapidly expanding domain of generative AI. The core idea is to generate a set of controlled perturbations of a textual prompt, measure the statistical distance (Wasserstein) between each perturbed prompt and the original, and weight the samples accordingly. A weighted linear surrogate model is then fitted to map these perturbations to changes in the model’s output distribution—token‑level probability shifts for large language models (LLMs) or embedding/feature changes for instruction‑based image editing models.
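The perturb, measure, weight, fit loop just described can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: `embed` is a toy character-code embedder standing in for contextual embeddings, only deletion perturbations are drawn (one of the four elementary operations), and the Gaussian-kernel bandwidth uses a median-distance heuristic rather than whatever calibration the thesis uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(tokens):
    # Toy stand-in for a contextual embedder (assumption): each token becomes
    # its first four character codes (space-padded), averaged over the prompt.
    vecs = [[float(ord(c)) for c in t[:4].ljust(4)] for t in tokens]
    return np.array(vecs).mean(axis=0)

def wasserstein_1d(u, v):
    # The 1-D Wasserstein distance between two equal-sized samples equals the
    # mean absolute difference of their sorted values.
    return float(np.abs(np.sort(u) - np.sort(v)).mean())

def gsmile_explain(tokens, black_box, n_samples=40):
    # Step 1: controlled perturbations. For brevity only deletion masks are
    # drawn here; the thesis combines four elementary text operations.
    masks = rng.integers(0, 2, size=(n_samples, len(tokens)))
    masks[0] = 1  # keep the unperturbed prompt in the sample set
    base = embed(tokens)
    dists, outputs = [], []
    for m in masks:
        kept = [t for t, keep in zip(tokens, m) if keep]
        # Step 2: Wasserstein distance of each perturbed prompt to the original.
        dists.append(wasserstein_1d(embed(kept), base) if kept else np.inf)
        outputs.append(black_box(kept))
    d = np.asarray(dists)
    # Step 3: a Gaussian kernel converts distance into a locality weight; the
    # median-distance bandwidth below is a common heuristic, an assumption here.
    sigma = max(float(np.median(d[np.isfinite(d)])), 1e-8)
    w = np.exp(-(d ** 2) / (2 * sigma ** 2))
    # Step 4: weighted linear surrogate, fitted via the normal equations.
    X, y, W = np.asarray(masks, float), np.asarray(outputs, float), np.diag(w)
    beta, *_ = np.linalg.lstsq(X.T @ W @ X, X.T @ W @ y, rcond=None)
    return dict(zip(tokens, beta))  # per-token surrogate coefficients
```

With a toy black box that outputs 1.0 whenever the token "snowing" survives the perturbation, the largest surrogate coefficient lands on "snowing", which is the kind of keyword dominance the summary reports for the image-editing experiments.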
Key contributions include:
- Perturbation Engine – The authors define four elementary text operations (insertion, deletion, substitution, reordering) and combine them to produce 30‑60 perturbed prompts per original. Each perturbation’s similarity to the original is quantified with a Wasserstein distance computed on contextual embeddings, and a Gaussian kernel converts this distance into a weight. This approach preserves semantic proximity better than the Euclidean distance used in classic LIME.
- Weighted Surrogate Modelling – Using the weighted perturbation‑output pairs, a linear (or Lasso) regression model is trained to estimate the contribution of each token or word to the observed output shift. The paper provides a theoretical justification: under Lipschitz smoothness assumptions, a locally linear surrogate can approximate the black‑box function with bounded error. Empirically, the linear surrogate achieves high fidelity while remaining computationally cheap.
- Comprehensive Attribution Metrics – Rather than relying on a single fidelity score, the authors introduce a five‑dimensional evaluation suite:
  - Stability – Jaccard similarity of attributions across repeated perturbation sets.
  - Fidelity – Correlation between surrogate predictions and the original model’s output (ATT‑F1, ATT‑AUC).
  - Accuracy – AUROC against human‑annotated ground‑truth token importance.
  - Consistency – Cross‑model agreement (GPT‑3.5, LLaMA‑3.1, Claude‑3.5) on the same prompt.
  - Faithfulness – Direct measurement of how well the attribution aligns with the causal input‑output relationship.
- LLM Experiments – The framework is applied to a range of prompts, from philosophical questions (“What is the meaning of life?”) to bias‑sensitive prompts (“men perspective” vs. “women perspective”). Token‑level heatmaps reveal fine‑grained influence patterns that are more detailed than those produced by Anthropic’s Attribution Graphs. The quantitative metrics show high stability (average Jaccard >0.85), fidelity (>0.92), and accuracy (AUROC ≈0.88) across models, while also exposing systematic gender‑related attribution differences.
- Instruction‑Based Image Editing Experiments – For diffusion models such as Instruct‑Pix2Pix and Img2Img‑Turbo, the same perturbation pipeline is used on textual editing instructions (e.g., “transform the weather to make it snowing”). Heatmaps over word weights demonstrate that the presence of the keyword “snowing” dominates the visual transformation, whereas auxiliary words have negligible impact. The analysis quantifies the model’s sensitivity to lexical variations (noun vs. adjective vs. gerund forms) and highlights potential over‑reliance on specific tokens.
- Scenario‑Based ODD Evaluation – Borrowing the Operational Design Domain (ODD) concept, the authors construct three axes of scenario variation: semantic complexity, environmental variability, and prompt length. gSMILE’s attribution metrics are evaluated across a grid of 27 ODD configurations. Results indicate that while the framework maintains robust performance under moderate complexity, stability degrades for very long prompts (>50 tokens), suggesting a need for adaptive perturbation budgets in high‑dimensional prompt spaces.
- Performance and Scalability – With 60 perturbations, the average runtime is 2.3 s for LLMs and 3.1 s for image editing models on a single GPU, representing a 30‑45 % speed‑up over traditional LIME‑based explainers. Memory consumption remains modest because only linear surrogates are stored.
- Limitations and Future Work – The current implementation focuses on textual perturbations; extending the approach to raw image or audio perturbations is left for future research. Moreover, linear surrogates cannot capture higher‑order interactions; the authors propose exploring non‑linear surrogates (e.g., shallow neural networks) and multimodal perturbation strategies to enrich explanations.
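Two of the metrics in the evaluation suite lend themselves to a compact sketch: stability as the mean pairwise Jaccard similarity of top‑k attributed tokens across repeated perturbation runs, and fidelity as the correlation between surrogate predictions and the black‑box outputs. The `topk` helper and the use of plain Pearson correlation are illustrative assumptions, not the thesis's exact ATT‑F1/ATT‑AUC definitions.

```python
import numpy as np

def topk(attr, k=3):
    # Top-k most influential tokens by absolute attribution weight.
    return set(sorted(attr, key=lambda t: abs(attr[t]), reverse=True)[:k])

def stability(runs, k=3):
    # Mean pairwise Jaccard similarity of top-k token sets across repeated
    # perturbation runs (requires at least two runs).
    sets = [topk(attr, k) for attr in runs]
    scores = [len(a & b) / len(a | b)
              for i, a in enumerate(sets) for b in sets[i + 1:]]
    return float(np.mean(scores))

def fidelity(surrogate_pred, model_out):
    # Pearson correlation between the surrogate's predictions and the
    # black-box model's actual outputs on the same perturbed prompts.
    return float(np.corrcoef(surrogate_pred, model_out)[0, 1])
```

For example, two attribution runs that agree on their two strongest tokens score a stability of 1.0 at k=2, and a surrogate whose predictions are a linear rescaling of the model's outputs scores a fidelity of 1.0.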
In sum, gSMILE offers a statistically grounded, computationally efficient, and empirically validated method for generating human‑aligned, token‑level explanations of generative AI systems. By integrating rigorous attribution metrics, scenario‑based evaluation, and cross‑model analyses, the work advances the state of explainable AI for both text generation and instruction‑driven image synthesis, paving the way for more transparent, accountable, and trustworthy deployment of generative technologies.