Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements
Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements, identifying their start and stop times, directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases (therapist orientation, P1; imaginal exposure, P2; and post-imaginal processing, P3) are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 308 real PE sessions, our best configuration (LoRA rank 8, 30 s windows) achieves a mean absolute error (MAE) of 5.3 s across tasks, within typical rater tolerance for timestamp review, enabling practical fidelity quality control. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a privacy-preserving, scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
💡 Research Summary
This paper addresses the labor‑intensive task of assessing therapist fidelity in Prolonged Exposure (PE) therapy for PTSD by automatically locating the start and end times of three core protocol phases: therapist orientation (P1), imaginal exposure (P2), and post‑imaginal processing (P3). The authors propose a privacy‑preserving, scalable pipeline that leverages a large pre‑trained audio‑language model, Qwen2‑Audio‑7B‑Instruct, and adapts it with Low‑Rank Adaptation (LoRA) and 4‑bit quantization (QLoRA) to run efficiently on modest hardware.
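The low-rank adaptation idea can be sketched independently of the full Qwen2-Audio stack. The toy layer below (a minimal NumPy sketch, not the paper's actual integration; the dimensions and scaling constant are illustrative assumptions) shows why LoRA is cheap: the frozen weight `W` is left untouched, and only two small factors `A` and `B` are trained, adding `r * (d_in + d_out)` parameters instead of `d_in * d_out`.

```python
import numpy as np

# Minimal LoRA sketch on one frozen linear layer (illustrative only;
# dimensions and alpha are invented, rank 8 matches the paper's best config).
rng = np.random.default_rng(0)

d_in, d_out, rank, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # zero-init: adapter starts as identity update

def lora_forward(x):
    # y = W x + (alpha / r) * B A x  -- only A and B would receive gradients
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B = 0 the adapted layer reproduces the frozen base layer exactly.
assert np.allclose(lora_forward(x), W @ x)

trainable = rank * (d_in + d_out)  # 1024 adapter parameters
full = d_in * d_out                # 4096 parameters for full fine-tuning
print(trainable, full)
```

In practice this update is attached to the attention projections of the 7B model via a PEFT-style library, and 4-bit quantization of the frozen base (QLoRA) keeps memory within reach of a single GPU.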
Dataset: 308 real PE sessions collected at Emory University (≈338 hours total). Audio was down‑sampled to 16 kHz WAV files; transcripts with sentence‑level timestamps and speaker labels were generated using Amazon HealthScribe. Initial timestamps for each phase were generated automatically by a large language model (Claude Sonnet 3.5) via a zero‑shot prompting scheme, then verified by trained raters on a 10 % sample, achieving 94.4 % timestamp accuracy (within 5‑10 s) and 97.7 % label accuracy. The verified timestamps constitute the ground truth for training.
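The verification step amounts to counting LLM-generated timestamps that land within a tolerance of the rater's judgment. A hedged sketch (function name, timestamps, and the 10 s tolerance value here are invented for illustration; the paper reports a 5–10 s tolerance band):

```python
def timestamp_accuracy(llm_ts, rater_ts, tolerance_s=10.0):
    """Fraction of LLM timestamps within tolerance_s seconds of the rater's."""
    hits = sum(abs(a - b) <= tolerance_s for a, b in zip(llm_ts, rater_ts))
    return hits / len(llm_ts)

# Hypothetical phase-boundary timestamps (seconds into a session):
llm = [12.0, 305.5, 1800.0, 2410.0]
rater = [10.0, 300.0, 1808.0, 2460.0]
print(timestamp_accuracy(llm, rater))  # 3 of 4 within 10 s -> 0.75
```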
Problem formulation: Instead of treating fidelity as a sequence of discrete class labels, the authors cast it as a continuous temporal regression task. For each annotated boundary (the start or end of a phase), a fixed‑duration window (30 s, 60 s, or 120 s) containing both the audio and the aligned transcript excerpt is extracted. The true boundary's relative position within the window is normalized to a value in [0, 1], which the model is trained to predict.
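The normalization step can be written out directly. This is a minimal sketch of the assumed mapping (function names and the clamping behavior at window edges are my own; the paper specifies only that the relative position is normalized into the window):

```python
def normalize_offset(boundary_s, window_start_s, window_len_s=30.0):
    """Relative position of a boundary inside a fixed window, mapped to [0, 1]."""
    t = (boundary_s - window_start_s) / window_len_s
    return min(max(t, 0.0), 1.0)  # clamp in case the boundary sits at a window edge

def denormalize(offset, window_start_s, window_len_s=30.0):
    """Map a predicted offset back to an absolute session time in seconds."""
    return window_start_s + offset * window_len_s

# A boundary at t = 125 s inside the 30 s window starting at 110 s:
off = normalize_offset(125.0, 110.0)
print(off, denormalize(off, 110.0))  # 0.5 125.0
```

At inference, the model's predicted offset is denormalized the same way to recover an absolute timestamp, which is what the MAE of 5.3 s is measured against.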