ArtContext: Contextualizing Artworks with Open-Access Art History Articles and Wikidata Knowledge through a LoRA-Tuned CLIP Model


Many art history articles discuss artworks both in general and in terms of specific aspects, such as layout, iconography, or material culture. However, when viewing an artwork, it is not trivial to identify what different articles have said about the piece. We therefore propose ArtContext, a pipeline that takes a corpus of open-access art history articles and Wikidata knowledge and annotates artworks with this information. We do this using a novel corpus-collection pipeline, then learn a bespoke CLIP model, adapted using Low-Rank Adaptation (LoRA) to make it domain-specific. We show that the new model, PaintingCLIP, which is weakly supervised by the collected corpus, outperforms CLIP and provides context for a given artwork. The proposed pipeline is generalisable and can be readily applied to numerous humanities areas.


💡 Research Summary

ArtContext addresses the long‑standing gap between the wealth of scholarly art‑historical literature and the public’s ability to access contextual information while viewing paintings. The authors propose a fully automated pipeline that harvests open‑access art‑history articles, extracts semantically relevant sentences, aligns them with individual artworks using structured Wikidata metadata, and fine‑tunes a vision‑language model (CLIP) with Low‑Rank Adaptation (LoRA) to create a domain‑specific model called PaintingCLIP.
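
The harvesting stage relies on the OpenAlex API. As a hedged illustration (the exact filter fields and search term are assumptions; the paper does not publish its query), a minimal sketch might look like this:

```python
# Minimal sketch of harvesting open-access PDFs via the OpenAlex API.
# The filter fields used here (open-access flag, language, free-text
# search term) are illustrative assumptions, not the authors' query.
import requests

BASE = "https://api.openalex.org/works"
params = {
    "filter": "open_access.is_oa:true,language:en,default.search:art history",
    "per-page": 200,
    "cursor": "*",  # cursor pagination for deep result sets
}

pdf_urls = []
while True:
    page = requests.get(BASE, params=params, timeout=30).json()
    for work in page["results"]:
        loc = work.get("best_oa_location") or {}
        if loc.get("pdf_url"):
            pdf_urls.append(loc["pdf_url"])
    cursor = page["meta"].get("next_cursor")
    if not cursor:  # OpenAlex returns no cursor once pages are exhausted
        break
    params["cursor"] = cursor

print(f"collected {len(pdf_urls)} candidate PDF links")
```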

The pipeline consists of four stages. First, a corpus of 27,044 open‑access PDFs is collected via the OpenAlex API, filtered to English‑language works on the “Art History” topic, and grouped by 450 curated artists. Second, each PDF is converted to Markdown and tokenized into sentences; short sentences are discarded, and a sliding window creates short paragraphs that preserve local context. These paragraphs are embedded with Sentence‑BERT, producing a dense semantic representation for every candidate text fragment. Third, for each painting present in Wikidata (≈37,449 entries), a natural‑language query is generated from its metadata (title, year, creator, depicted entities, etc.). The query is also embedded with Sentence‑BERT, and the most similar candidate sentence (by cosine similarity) is selected as the painting’s representative scholarly description. This yields 29,697 image‑text pairs, each consisting of the painting image, the Wikidata‑derived factual label, and the aligned sentence.
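
The alignment at the heart of stage three can be sketched with the sentence-transformers library; the checkpoint name and the toy inputs below are illustrative assumptions, not the authors' exact setup:

```python
# Minimal sketch of stage three: align a Wikidata-derived query with the
# most similar candidate paragraph via Sentence-BERT embeddings.
# The checkpoint name and example texts are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Candidate fragments produced by the sliding-window step (stage two).
paragraphs = [
    "The composition centres on Captain Frans Banninck Cocq stepping forward.",
    "Pigment analysis of the canvas revealed extensive use of lead-tin yellow.",
    "The essay surveys civic portraiture in seventeenth-century Amsterdam.",
]

# Natural-language query built from Wikidata metadata (stage three).
query = "The Night Watch, painted in 1642 by Rembrandt, depicting a militia company"

para_emb = encoder.encode(paragraphs, convert_to_tensor=True, normalize_embeddings=True)
query_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity (a dot product once embeddings are L2-normalized).
scores = util.cos_sim(query_emb, para_emb)[0]
best = int(scores.argmax())
print(f"best match ({scores[best]:.3f}): {paragraphs[best]}")
```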

In the fourth stage, instead of full fine‑tuning, the authors apply LoRA to the visual and textual projection heads of CLIP‑ViT‑B/32. LoRA introduces low‑rank matrices (rank r = 16, scaling α = 32) that modify the frozen weights while adding negligible extra parameters and inference overhead. The model is trained with the same contrastive loss used in the original CLIP, using the weak‑supervision pairs generated in the previous step. The resulting PaintingCLIP retains CLIP’s broad vision‑language capabilities while moving paintings closer to the nuanced, scholarly descriptions found in art‑history literature.
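
With those hyperparameters, the adaptation can be sketched using Hugging Face's peft library; targeting the two projection modules is an assumption drawn from the description above, and the authors' exact configuration may differ:

```python
# Minimal sketch of LoRA adaptation of CLIP-ViT-B/32 with the `peft`
# library. LoRA replaces each frozen weight W with W + (alpha/r) * B @ A,
# where A and B are trainable rank-r matrices (here r=16, alpha=32).
# Targeting the two projection heads is an assumption drawn from the
# summary above; the authors' exact module list may differ.
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["visual_projection", "text_projection"],  # assumed targets
)
painting_clip = get_peft_model(clip, lora_cfg)
painting_clip.print_trainable_parameters()  # adapters only; base CLIP stays frozen

# Training then uses CLIP's own contrastive (InfoNCE) loss on the weakly
# supervised image-text pairs, e.g. via painting_clip(**batch, return_loss=True).
```

Because the base weights stay frozen, the low-rank adapters can be merged back into the original weight matrices after training, which is why LoRA adds no inference overhead.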

Evaluation is threefold. Corpus statistics reveal a heavy skew: a few canonical artists and works dominate the open‑access literature, while many others have very few associated articles, leaving about 20 % of paintings unmatched. Quantitatively, a retrieval experiment on the ten most‑cited paintings shows that PaintingCLIP consistently outperforms the baseline CLIP across the full recall range, with especially large gains in the high‑precision regime (top‑ranked sentences are far more likely to be correct). Qualitative examples, such as Rembrandt’s “The Night Watch,” demonstrate that PaintingCLIP retrieves multi‑sentence explanations covering composition, iconography, and historical interpretation, whereas baseline CLIP often returns generic or tangential statements.
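
The retrieval protocol amounts to ranking candidate sentences by image-text similarity and checking how early the ground-truth sentences appear. A minimal sketch with base CLIP and placeholder inputs (the image path, sentence list, and relevance labels are all hypothetical):

```python
# Minimal sketch of the retrieval evaluation: rank candidate sentences by
# CLIP image-text similarity and compute recall@k. All inputs here
# (image, sentences, ground-truth indices) are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("night_watch.jpg")   # placeholder painting image
sentences = ["..."]                     # candidate corpus sentences (placeholder)
relevant = {0}                          # indices of ground-truth matches (placeholder)

inputs = processor(text=sentences, images=image, return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    # Scaled cosine similarity between the image and every sentence.
    logits = model(**inputs).logits_per_image[0]

k = 5
topk = set(logits.topk(min(k, len(sentences))).indices.tolist())
recall_at_k = len(topk & relevant) / len(relevant)
print(f"recall@{k} = {recall_at_k:.2f}")
```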

The authors acknowledge several limitations: (1) noise from PDF parsing and sentence selection can introduce irrelevant or fragmented contexts; (2) the supervision data inherit the bias of open‑access publishing toward canonical artists and well‑documented works; (3) reliance on Sentence‑BERT for similarity may miss fine‑grained iconographic distinctions. They suggest future improvements such as using cross‑encoder re‑ranking, developing art‑history‑specific language models, and incorporating closed‑access scholarly corpora to enrich supervision.

In summary, ArtContext demonstrates that even with modest, openly available textual resources, a vision‑language model can be efficiently adapted to the art‑history domain using LoRA. The pipeline is modular and scalable, offering a template for similar multimodal knowledge‑integration tasks across other humanities disciplines.

