Learning language through pictures

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv source.

We propose Imaginet, a model of learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task objective by receiving a textual description of a scene and trying to concurrently predict its visual representation and the next word in the sentence. Mimicking an important aspect of human language learning, it acquires meaning representations for individual words from descriptions of visual scenes. Moreover, it learns to effectively use sequential structure in semantic interpretation of multi-word phrases.


💡 Research Summary

The paper introduces Imaginet, a neural architecture designed to learn visually grounded language representations from paired textual descriptions and images. The model consists of two parallel Gated Recurrent Unit (GRU) networks that share a common word‑embedding matrix. One GRU (the “textual pathway”) processes the sentence word‑by‑word and predicts the next word at each time step, effectively functioning as a language model. The other GRU (the “visual pathway”) also reads the sentence sequentially but only after the final token does it map its hidden state to a 4096‑dimensional image feature vector (extracted from the pre‑softmax layer of a 16‑layer VGG CNN).
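The two-pathway design can be sketched in a few lines of NumPy. Everything below is a toy illustration, not the authors' implementation: sizes are shrunk (the paper uses 1024-dimensional embeddings and hidden states and 4096-dimensional image features), weights are random, and the GRU cell is a bare-bones single-example version.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID, IMG = 100, 32, 32, 64  # toy sizes, far smaller than the paper's

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRU:
    """Minimal unbatched GRU cell, for illustration only."""
    def __init__(self, emb, hid):
        s = 0.1
        self.Wz, self.Uz = rng.normal(0, s, (hid, emb)), rng.normal(0, s, (hid, hid))
        self.Wr, self.Ur = rng.normal(0, s, (hid, emb)), rng.normal(0, s, (hid, hid))
        self.Wh, self.Uh = rng.normal(0, s, (hid, emb)), rng.normal(0, s, (hid, hid))

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)            # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)            # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))
        return (1 - z) * h + z * h_tilde

E = rng.normal(0, 0.1, (VOCAB, EMB))        # word embeddings shared by both pathways
gru_txt, gru_vis = GRU(EMB, HID), GRU(EMB, HID)
W_pred = rng.normal(0, 0.1, (VOCAB, HID))   # next-word prediction (textual pathway)
W_img  = rng.normal(0, 0.1, (IMG, HID))     # image-feature projection (visual pathway)

sentence = [3, 17, 42, 8]                   # toy token ids
h_t = h_v = np.zeros(HID)
next_word_logits = []
for tok in sentence:
    x = E[tok]                              # the same embedding feeds both GRUs
    h_t = gru_txt.step(x, h_t)
    h_v = gru_vis.step(x, h_v)
    next_word_logits.append(W_pred @ h_t)   # textual pathway predicts at every step

img_pred = W_img @ h_v                      # visual pathway predicts only from the final state
```

Note the asymmetry the summary describes: the textual pathway emits a prediction at every time step, while the visual pathway produces a single image-feature vector from its final hidden state.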

Training uses a multi‑task loss: a cross‑entropy term for next‑word prediction (L_T) and a mean‑squared‑error term for image reconstruction (L_V). A scalar α balances the two objectives, allowing the model to be run in three regimes: pure visual (α = 0), pure textual (α = 1), or a true multi‑task setting (0 < α < 1). The authors set α = 0.1 for most experiments, thereby giving the visual objective a dominant role while still encouraging the language pathway to learn useful representations.

The system is trained on the MS‑COCO caption dataset, using Adam optimization for eight epochs. Word embeddings and GRU hidden states are 1024‑dimensional. As a baseline, the authors implement a simple linear regression model (LINREG) that maps bag‑of‑words counts directly to image features.

Evaluation proceeds along three axes. First, the quality of learned word embeddings is measured against human similarity judgments (MEN‑3K and SimLex‑999). The multi‑task version achieves Spearman ρ = 0.63 (MEN) and 0.39 (SimLex), outperforming both the visual‑only and textual‑only variants and substantially beating LINREG (0.18/0.23). Notably, the model captures nuanced relations such as antonyms, collocations, and non‑visual semantic links.
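Benchmarks of this kind typically score each word pair by the cosine similarity of its embeddings and then correlate those scores with the human ratings using Spearman's ρ. The small self-contained version below (no tie correction, illustration only) shows the mechanics; it is not the authors' evaluation code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman_rho(a, b):
    """Spearman rank correlation (no tie correction): Pearson correlation
    of the ranks of a and b."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

In use, `a` would be the model's cosine similarities over the benchmark's word pairs and `b` the corresponding human judgments; ρ = 1 means the model ranks all pairs exactly as humans do.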

Second, the authors test whether single‑word meanings are grounded visually. Each vocabulary word is fed as a one‑word “sentence,” the visual pathway’s final hidden state is projected into the image feature space, and the top‑5 nearest ImageNet validation images are retrieved. Both the visual‑only and multi‑task models retrieve correct images far more often than LINREG (≈57 % vs 23 % top‑5 accuracy).
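Nearest-image retrieval of this sort reduces to a similarity ranking in image-feature space. The hypothetical helper below uses cosine similarity, a common choice for such retrieval, as an assumption; the paper's exact similarity measure and retrieval code are not reproduced here.

```python
import numpy as np

def top_k_images(query_vec, image_feats, k=5):
    """Return indices of the k image feature vectors most cosine-similar
    to the query (e.g. the projected hidden state of a one-word 'sentence')."""
    q = query_vec / np.linalg.norm(query_vec)
    F = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    sims = F @ q                      # cosine similarity to every image
    return np.argsort(-sims)[:k]      # indices of the top-k images
```

Top-5 accuracy then amounts to checking whether any of the five retrieved images depicts the query word's concept.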

Third, the paper investigates the acquisition of sentence structure. Two manipulations are performed: (a) image retrieval using original captions versus scrambled word order, and (b) paraphrase retrieval where each image has five captions treated as paraphrases. In both tasks, models perform markedly better on ordered captions, confirming sensitivity to syntactic order. The multi‑task model shows a larger ordered‑versus‑scrambled gap early in training, but the visual‑only model eventually catches up, indicating that visual supervision alone can also induce some structural awareness. Qualitative examples illustrate that the model learns to treat sentence‑initial nouns as topics, respects period boundaries, and distinguishes modifiers from heads.
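The paraphrase-retrieval setup can be mimicked with a short function: project each caption to its predicted image-feature vector, then check whether its nearest neighbour among the other captions describes the same image. The helper below is an illustrative reconstruction under that reading, not the authors' evaluation script.

```python
import numpy as np

def paraphrase_retrieval_acc(caption_vecs, image_ids):
    """Fraction of captions whose nearest neighbour (by cosine similarity,
    excluding the caption itself) describes the same image."""
    V = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    sims = V @ V.T
    np.fill_diagonal(sims, -np.inf)   # a caption may not retrieve itself
    nn = sims.argmax(axis=1)          # index of each caption's nearest neighbour
    return float(np.mean(image_ids[nn] == image_ids))
```

Running the same scoring on scrambled captions and comparing accuracies gives the ordered-versus-scrambled gap the paper uses as evidence of structural sensitivity.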

The discussion highlights that Imaginet simultaneously learns lexical semantics and compositional syntax without explicit syntactic supervision, a step beyond earlier connectionist or bag‑of‑words approaches that ignored word order. Future work is proposed to replace next‑word prediction with full sentence reconstruction or paraphrase generation, and to explore how visual grounding can be transferred to words lacking image data.

Overall, Imaginet demonstrates that a modest multi‑task GRU architecture, trained on large‑scale caption‑image pairs, can acquire rich word meanings and exploit sentence structure for both visual grounding and language‑only tasks, outperforming simple linear baselines and approaching performance levels of state‑of‑the‑art image‑caption retrieval systems.

