Music Plagiarism Detection: Problem Formulation and a Segment-based Solution
Music plagiarism has recently emerged as a pressing social issue. As music information retrieval (MIR) research advances, there is a growing effort to address problems related to music plagiarism. However, many studies, including our own previous work, have proceeded without clearly defining what the music plagiarism detection task actually involves. This lack of a precise definition has slowed research progress and made it difficult to apply results to real-world scenarios. To remedy this, we define how Music Plagiarism Detection differs from other MIR tasks and articulate the problems that need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task, and we propose a method based on segment transcription as one way to solve it. Our demo and dataset are available at https://github.com/Mippia/ICASSP2026-MPD.
💡 Research Summary
The paper addresses the emerging social problem of music plagiarism by first clarifying that the research community has been working on “Music Plagiarism Detection” (MPD) without a precise definition of the task. The authors distinguish MPD from related Music Information Retrieval (MIR) tasks such as Cover Song Identification (CSI) and Audio Fingerprinting. While CSI and fingerprinting assess similarity at the whole‑track level, plagiarism often involves only a short, specific segment (e.g., a melody line, a chord progression, or a vocal phrase). Consequently, MPD requires (1) identification of a plagiarized track from a large database, (2) precise localization of the matching segment(s) between the query and the suspect track, and (3) an explanation of which musical element(s) cause the similarity.
To operationalize these requirements, the authors introduce the Similar Music Pair (SMP) dataset, consisting of 72 real‑world plagiarism/remake pairs with manually annotated timestamps for the plagiarized sections. They augment this with the Covers80 dataset to create a large index of millions of segments, enabling both segment‑level and track‑level evaluation.
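As a minimal sketch of how such timestamp-annotated pairs might be represented in code (the field names below are illustrative, not the SMP dataset's actual schema):

```python
from dataclasses import dataclass


@dataclass
class SimilarMusicPair:
    """One annotated plagiarism/remake pair (illustrative schema)."""
    query_track: str       # path or ID of the query track
    suspect_track: str     # path or ID of the allegedly copied track
    query_segment: tuple   # (start_sec, end_sec) of the matching section in the query
    suspect_segment: tuple # (start_sec, end_sec) of the matching section in the suspect

pair = SimilarMusicPair("song_a.wav", "song_b.wav", (30.0, 45.0), (62.0, 77.0))
print(pair.suspect_segment)  # → (62.0, 77.0)
```

Storing interval annotations per pair, rather than a single track-level label, is what enables the segment-level localization evaluation described below.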
The proposed system follows a three‑stage pipeline:

1. **Music Segment Transcription** – Raw audio is processed through a cascade of state‑of‑the‑art models: Demucs for source separation, an all‑in‑one structural analyzer, AST for vocal transcription, SheetSage for melody transcription, and Harmony Transformer for chord transcription. The audio is sliced into fixed 4‑bar segments, each represented by multimodal features (pianoroll, onset‑rhythm, chord symbols).
2. **Segment‑level Similarity Measurement** – Two complementary approaches are explored. (a) A rule‑based, music‑domain method computes separate similarity scores for melody, rhythm, and harmony and combines them with learned weights, providing interpretable evidence for Task 3. (b) A deep‑learning route fine‑tunes the MERT acoustic model and a multimodal CNN within a Siamese architecture; a dual‑encoder with bidirectional cross‑attention is also tested to let the two modalities inform each other.
3. **Filtering & Final Decision** – The top‑20 segment matches are aggregated using a linearly decreasing weighted vote to produce a track‑level ranking (Task 1). The system can thus return the most likely plagiarized track together with the matching timestamps and the element‑wise reasoning.
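The final aggregation step can be sketched as follows (a hypothetical implementation with invented names; the paper does not publish this code). Each of the top-k segment matches casts a vote for its source track, weighted linearly by its rank:

```python
from collections import defaultdict


def track_ranking(segment_matches, k=20):
    """Aggregate the top-k segment matches into a track-level ranking.

    segment_matches: list of (track_id, timestamp, similarity) tuples,
    sorted by descending similarity. The match at 0-based rank r
    contributes a linearly decreasing weight of (k - r) to its track.
    """
    votes = defaultdict(float)
    for rank, (track_id, _ts, _sim) in enumerate(segment_matches[:k]):
        votes[track_id] += k - rank  # weights k, k-1, ..., 1
    # Tracks ordered by accumulated weighted votes, best first.
    return sorted(votes, key=votes.get, reverse=True)


matches = [("A", 12.0, 0.9), ("B", 3.5, 0.8), ("A", 40.2, 0.7)]
print(track_ranking(matches))  # → ['A', 'B']
```

Under this scheme a track with several moderately ranked segment matches can outrank a track with a single top match, which is the intended behavior when plagiarism spans multiple segments.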
Evaluation is conducted at two levels. For segment‑level performance, “Recall@k within 1 second” (Rec@k@1s) measures how often the retrieved timestamps fall within one second of the ground‑truth interval. For track‑level performance, mean Average Precision (mAP) and Mean Rank of the first correct result (MR1) are reported, mirroring CSI benchmarks.
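The segment-level hit criterion can be sketched like this (a hypothetical implementation assuming each ground truth is an interval in seconds; averaging the indicator over all queries gives Rec@k@1s):

```python
def hit_at_k_within_1s(retrieved, ground_truth_intervals, k=10, tol=1.0):
    """Return 1 if any top-k retrieved timestamp falls within `tol`
    seconds of a ground-truth interval for its track, else 0.

    retrieved: list of (track_id, timestamp_sec) predictions, best first.
    ground_truth_intervals: dict mapping track_id -> (start_sec, end_sec).
    """
    for track_id, ts in retrieved[:k]:
        if track_id in ground_truth_intervals:
            start, end = ground_truth_intervals[track_id]
            if start - tol <= ts <= end + tol:
                return 1
    return 0


gt = {"suspect": (30.0, 45.0)}
print(hit_at_k_within_1s([("other", 5.0), ("suspect", 29.5)], gt, k=10))  # → 1
```

Note this rewards both retrieving the right track and localizing the matching segment, which is what separates MPD evaluation from track-level CSI metrics.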
Results show that the music‑domain method and MERT achieve the highest segment‑level recall (up to 51 % at Rec@10@1s), while the multimodal CNN underperforms, likely due to the limited training data. At the track level, the proposed system lags behind state‑of‑the‑art CSI models (Bytecover3, CoverHunter) with mAP around 0.15–0.20, but it achieves relatively low MR1 (≈30), indicating that when a correct match exists it is often ranked near the top. Qualitative examples demonstrate that the system can pinpoint exact timestamps and provide element‑wise explanations (e.g., “melody similarity” or “rhythm similarity”), a capability absent from conventional CSI systems.
The authors acknowledge several limitations: fixed 4‑bar segments may not suit all genres; the SMP dataset is small, restricting deep‑learning generalization; and the current evaluation does not quantitatively assess the quality of the explanatory component. They suggest future work on variable‑length segmentation, data augmentation or synthetic plagiarism generation, and development of dedicated metrics for explainability.
In conclusion, the paper makes three primary contributions: (1) a clear, task‑level definition of music plagiarism detection, (2) the release of a real‑world, timestamp‑annotated SMP dataset, and (3) a segment‑transcription‑based pipeline that demonstrates the feasibility of precise, explainable plagiarism detection. While performance on large‑scale track retrieval remains modest, the work establishes a solid foundation for subsequent research aiming to bridge the gap between academic MIR studies and practical copyright enforcement.