Shared Representation Learning for Reference-Guided Targeted Sound Detection


Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a reference audio of that sound is provided. Prior approaches rely on generating a sound-discriminative conditional embedding vector for the reference and pairing it with a mixture encoder, jointly optimized under a multi-task learning objective. In this work, we propose a unified encoder architecture that processes both the reference and mixture audio within a shared representation space, promoting stronger alignment while reducing architectural complexity. This design choice not only simplifies the overall framework but also enhances generalization to unseen classes. Following the multi-task training paradigm, our method achieves substantial improvements over prior approaches, establishing a new state of the art for targeted sound detection with a segment-level F1 score of 83.15% and an overall accuracy of 95.17% on the URBAN-SED dataset.


💡 Research Summary

The paper tackles the problem of Reference‑Guided Targeted Sound Detection (TSD), where a short reference audio clip is provided and the system must determine whether the corresponding sound event occurs in a longer, potentially noisy mixture and localize its temporal boundaries. Traditional approaches, such as TSDNet, employ a dual‑branch architecture: a conditional network encodes the reference into a discriminative embedding, while a separate detection network processes the mixture; the two are later fused. Although effective, this design incurs a high parameter count, requires careful alignment between two distinct representation spaces, and often struggles to generalize to unseen sound classes.

To address these limitations, the authors propose a unified encoder framework that processes both reference and mixture audio through a single ConvNeXt backbone pre‑trained on AudioSet‑2M. By sharing the encoder, both inputs are projected into the same embedding space, which naturally encourages stronger alignment and reduces architectural complexity. The encoder outputs a frame‑level embedding Hₘ ∈ ℝ^{T×F} for the mixture and a global clip‑level embedding h_ref ∈ ℝ^{1×F} for the reference. The reference embedding is tiled along the temporal axis to match Hₘ, and both streams are projected to a common dimension F′ (3072) via separate 1‑D convolutions.
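The tile-and-project step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the frame count T and encoder width F are placeholder values (only F′ = 3072 is stated in the summary), and the projection weights are random stand-ins for the learned kernel-size-1 convolutions, which are mathematically equivalent to per-frame linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

T, F, F_proj = 250, 768, 3072           # T and F are illustrative; F' = 3072 per the summary
H_m   = rng.standard_normal((T, F))     # frame-level mixture embedding H_m from the shared encoder
h_ref = rng.standard_normal((1, F))     # clip-level reference embedding h_ref

# Tile the reference along the temporal axis so it matches the mixture's frame count.
H_ref = np.repeat(h_ref, T, axis=0)     # (T, F)

# A kernel-size-1 1-D convolution acts as a per-frame linear map, so the two
# separate projections are modeled here as (hypothetical, random) weight matrices.
W_mix = rng.standard_normal((F, F_proj))
W_ref = rng.standard_normal((F, F_proj))
Z_mix = H_m   @ W_mix                   # (T, F') projected mixture stream
Z_ref = H_ref @ W_ref                   # (T, F') projected reference stream

print(Z_mix.shape, Z_ref.shape)         # → (250, 3072) (250, 3072)
```

Both streams now live in the same F′-dimensional space, which is what makes the simple fusion operators discussed next well-defined.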

Three fusion strategies are explored:

  1. Element‑wise multiplication – the simplest operation, directly multiplying the two projected tensors. This baseline already yields a segment‑level F1 of 83.15 % and overall accuracy of 95.17 %.
  2. FiLM‑based conditioning – the reference vector modulates the mixture features via learned scale and shift parameters, marginally improving F1 to 83.18 %.
  3. Cross‑attention – a content‑adaptive attention mechanism aligns reference and mixture representations, achieving the highest segment‑F1 of 86.06 %.

The model is trained with a multi‑task loss that combines (i) a clip‑level classification loss (cross‑entropy) applied to the reference embedding, encouraging the network to predict the presence of the target class in the whole clip, and (ii) a frame‑level detection loss (binary cross‑entropy) applied to each time frame’s sigmoid‑activated output. The total loss L_total = L_CE + L_SED drives both coarse class recognition and fine‑grained temporal localization simultaneously.
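The two loss terms can be made concrete with a toy computation. This sketch uses random logits and labels purely to show the arithmetic of L_total = L_CE + L_SED; the class count, frame count, and single-clip batch are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 250, 10                           # frames per clip, number of sound classes (assumed)

# (i) Clip-level cross-entropy on class logits derived from the reference embedding.
clip_logits = rng.standard_normal(C)
clip_target = 3                          # hypothetical index of the reference's class
log_probs = clip_logits - np.log(np.exp(clip_logits).sum())   # log-softmax
L_CE = -log_probs[clip_target]

# (ii) Frame-level binary cross-entropy on sigmoid-activated per-frame outputs.
frame_logits = rng.standard_normal(T)
frame_labels = (rng.random(T) > 0.5).astype(float)   # 1 where the target sound is active
p = 1.0 / (1.0 + np.exp(-frame_logits))              # sigmoid activation
L_SED = -np.mean(frame_labels * np.log(p) + (1 - frame_labels) * np.log(1 - p))

# Equal-weight sum, as in the summary's L_total = L_CE + L_SED.
L_total = L_CE + L_SED
print(float(L_total) > 0)                # → True
```

The clip-level term supplies coarse class supervision while the frame-level term drives temporal localization; summing them trains both objectives jointly.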

Experiments are conducted on the synthetic URBAN‑SED dataset (10‑second urban soundscapes with strong onset/offset annotations) and UrbanSound8K (isolated reference clips). Two benchmark settings are defined: Urban‑TSD‑Strong, where each mixture is paired with a reference belonging to a class present in the mixture, and Urban‑TSD‑Strong+, which additionally includes negative pairs (reference class absent). Under the Strong setting, the proposed unified model outperforms all baselines, achieving 83.15 % segment‑F1 and 95.17 % accuracy, a ~7 % absolute gain over the previous best (TSDNet, 76.3 % F1). Per‑class analysis shows notable improvements for transient, spectrally overlapping events such as car horn, dog bark, and gunshot.

To assess generalization, the model trained on URBAN‑SED is evaluated on a curated subset of AudioSet‑Strong, comprising real‑world YouTube recordings of the same ten classes. Despite the domain shift, the system attains 76.62 % segment‑F1 and 97.3 % accuracy, demonstrating robust transferability likely aided by the AudioSet pre‑training of ConvNeXt and the shared‑representation design. Moreover, when trained on only seven of the ten classes and tested on all ten, the model still reaches 73.47 % F1 and 91.06 % accuracy, confirming resilience to unseen‑class scenarios.

Ablation studies compare the unified encoder against a dual‑branch counterpart using both ConvNeXt and a classic CNN14 backbone. In every configuration, the shared encoder yields higher F1 and accuracy, confirming that parameter sharing not only simplifies the model but also improves representation quality. Fusion strategy experiments reveal that while simple element‑wise multiplication is already strong, cross‑attention provides the best performance by allowing the network to selectively emphasize mixture features that match the reference content.

The inclusion of negative reference samples in the Strong+ setting reduces segment‑F1 to 78.94 %, reflecting the added difficulty of rejecting absent classes—a realistic requirement for deployment. The authors suggest future work on contrastive learning or hard‑negative mining to mitigate this drop.

In summary, the paper introduces a conceptually simple yet powerful shift from dual‑branch to unified encoder architectures for reference‑guided TSD. By leveraging a pre‑trained ConvNeXt backbone, shared representation learning, and multi‑task supervision, the method achieves state‑of‑the‑art performance on synthetic benchmarks, strong cross‑domain generalization, and competitive results on unseen classes, all while reducing model complexity. This work sets a new benchmark for reference‑guided targeted sound detection and opens avenues for further research on efficient, generalizable sound detection systems.

