Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose Art2Mus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that Art2Mus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected, given the substantially increased difficulty of removing linguistic supervision), Art2Mus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.


💡 Research Summary

The paper tackles the under‑explored problem of generating music directly from visual artworks without relying on an intermediate text representation. Existing image‑conditioned music generation systems suffer from two major drawbacks: they are typically trained on natural photographs, which lack the rich semantic, stylistic, and cultural layers present in artworks, and they usually insert an image‑to‑text conversion stage that compresses visual information into language, discarding many non‑verbal cues essential for artistic expression. To address these gaps, the authors make two primary contributions.

First, they introduce ArtSound, a large‑scale multimodal dataset comprising 105,884 artwork‑music pairs. The visual side is sourced from ArtGraph, a knowledge graph built on WikiArt and DBpedia, covering 116 k digitized paintings across 18 genres and 32 styles. The audio side comes from the Free Music Archive (FMA) large version, providing 30‑second Creative‑Commons‑licensed tracks from over 16 k artists. After cleaning, 105,884 high‑quality pairs remain. Each artwork and each music track are annotated with dedicated captions. Image captions are generated by the multimodal LLM LLaVA, prompted to describe content, mood, style, and possible artistic influences. Audio captions are obtained by first segmenting tracks with LP‑MusicCaps (10‑second segment captions) and then fusing them into a coherent description using the Qwen3 LLM. To ensure caption reliability, the authors devise two composite metrics: ICS‑core (a weighted sum of CLIP‑Score and PAC‑Score) for images, and ACS‑core (a weighted sum of ROUGE‑1 and BERT‑Score) for audio. Captions falling below preset thresholds are regenerated, yielding a consistently high‑quality dataset that can serve as a benchmark for future cross‑modal research.
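The caption-gating step described above (score each caption with a weighted composite metric, then flag low-scoring captions for regeneration) can be sketched as follows. The weight `w` and the `threshold` value are illustrative placeholders, since the summary does not report the paper's exact settings, and the metric names stand in for CLIP-Score/PAC-Score (images) or ROUGE-1/BERT-Score (audio):

```python
def composite_score(metric_a: float, metric_b: float, w: float = 0.5) -> float:
    """Weighted sum of two caption-quality metrics (ICS-core / ACS-core style).

    The weight w is an illustrative default, not a value from the paper.
    """
    return w * metric_a + (1.0 - w) * metric_b


def filter_captions(captions, scores, threshold=0.6):
    """Split captions into accepted ones and ones flagged for regeneration.

    `scores` pairs each caption with its two metric values; the threshold
    is a hypothetical cutoff below which a caption is regenerated.
    """
    accepted, regenerate = [], []
    for caption, (a, b) in zip(captions, scores):
        if composite_score(a, b) >= threshold:
            accepted.append(caption)
        else:
            regenerate.append(caption)
    return accepted, regenerate
```

In the paper's pipeline, the `regenerate` list would be fed back to the captioning model (LLaVA for images, LP-MusicCaps plus Qwen3 for audio) until all captions clear the threshold.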

Second, the paper presents Art2Mus, the first framework that maps digitized artworks directly to music without any textual intermediary. The pipeline proceeds as follows: (1) visual embeddings are extracted from the artwork using a pre‑trained vision encoder such as CLIP or ImageBind; (2) these embeddings are passed through a dedicated cross‑modal adapter consisting of multi‑layer perceptrons and attention‑based modules, which projects them into the conditioning space of a latent diffusion model (LDM) originally designed for text‑to‑audio synthesis; (3) the LDM’s decoder receives the projected embeddings and generates mel‑spectrograms, which are finally inverted to audio waveforms. The training objective combines a reconstruction loss (L2 on spectrograms), a cross‑modal alignment loss (cosine similarity between visual and audio latent spaces), and music‑specific regularizers that enforce rhythmic continuity, harmonic stability, and genre‑consistent timbre. By eliminating the language bottleneck, the model is forced to learn direct visual‑to‑acoustic correspondences, preserving fine‑grained visual attributes such as brushstroke texture, color palette, and compositional balance that are often lost in captioning.
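A minimal sketch of the combined training objective described above, using plain Python on toy vectors: an L2 reconstruction term on spectrogram frames, a cosine-based cross-modal alignment term, and a slot for the music-specific regularizers. The lambda weights and the regularizer interface are assumptions for illustration, not values from the paper:

```python
import math


def l2_loss(pred, target):
    """Mean squared error between predicted and target spectrogram values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_alignment_loss(visual, audio):
    """1 - cosine similarity between visual and audio latent vectors,
    so that perfectly aligned latents contribute zero loss."""
    dot = sum(v * a for v, a in zip(visual, audio))
    norm = (math.sqrt(sum(v * v for v in visual))
            * math.sqrt(sum(a * a for a in audio)))
    return 1.0 - dot / norm


def total_loss(pred_spec, target_spec, visual_lat, audio_lat,
               reg_terms=(), lambda_align=0.1, lambda_reg=0.01):
    """Weighted combination of reconstruction, alignment, and regularizers.

    reg_terms would hold the rhythmic/harmonic/timbre penalties; the
    lambda weights here are illustrative placeholders.
    """
    loss = l2_loss(pred_spec, target_spec)
    loss += lambda_align * cosine_alignment_loss(visual_lat, audio_lat)
    loss += lambda_reg * sum(reg_terms)
    return loss
```

When the predicted spectrogram matches the target and the visual and audio latents point in the same direction, the total loss collapses to the regularizer terms alone, which is the intended training signal.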

Experimental evaluation is conducted on both quantitative and human‑subjective dimensions. Quantitatively, Art2Mus lags behind state‑of‑the‑art text‑conditioned systems on CLIP‑AudioScore and genre classification accuracy, reflecting the intrinsic difficulty of learning without explicit semantic supervision. However, in a large‑scale listening study (over 1,200 participants), Art2Mus achieves comparable or superior scores on three key criteria: (i) visual‑audio consistency (how well the music reflects the visual mood and style), (ii) stylistic appropriateness (alignment with the art movement, e.g., impressionist colorfulness leading to bright, flowing melodies), and (iii) emotional conveyance (the ability of the generated piece to evoke feelings suggested by the artwork). Notably, for abstract expressionist works and highly saturated modern paintings, the generated music exhibits rich timbral variation and dynamic tempo changes that mirror the visual intensity, demonstrating that the model captures non‑verbal cues.

The authors also perform ablation studies. Replacing the sophisticated cross‑modal adapter with a simple linear projection dramatically reduces both alignment scores and human preference, confirming the necessity of a richer mapping. Using ImageBind embeddings instead of CLIP yields modest gains on artworks with strong textual metadata (e.g., annotated historical scenes), suggesting that multimodal encoders that already fuse audio‑visual signals can be beneficial.

In the discussion, the paper emphasizes three broader impacts. First, the ArtSound dataset fills a critical gap in multimodal resources, enabling systematic research on direct visual‑to‑audio generation. Second, the Art2Mus architecture showcases a viable path toward “language‑free” multimodal generation, which could be extended to video‑to‑music, 3‑D scene‑to‑sound, or even haptic‑to‑audio synthesis. Third, the introduced evaluation metrics (ICS‑core, ACS‑core) provide a template for assessing caption quality in cross‑modal datasets where ground‑truth textual descriptions are scarce.

Future work directions include (a) scaling the model to generate longer, structured compositions (e.g., multi‑minute symphonies) by incorporating hierarchical diffusion or transformer‑based temporal modeling, (b) integrating user‑controllable parameters (e.g., desired tempo, instrumentation) while preserving the visual grounding, and (c) exploring interactive creative tools where artists can iteratively refine visual inputs and instantly hear corresponding musical variations.

In summary, the paper makes a compelling case that direct visual‑to‑music generation is not only feasible but also capable of producing musically and aesthetically meaningful outputs that respect the nuanced semantics of visual art. By providing both a large‑scale dataset and a novel diffusion‑based framework, it opens a new research frontier at the intersection of computer vision, audio synthesis, and computational creativity.

