Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset
Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. To mitigate this resource scarcity, several studies create datasets by automatically translating existing ones. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K composed of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available at: https://github.com/laicsiifes/transformer-caption-ptbr.
💡 Research Summary
This paper tackles the persistent resource gap in Brazilian Portuguese (BP) image captioning by systematically comparing two versions of the Flickr30K dataset: one composed of captions manually authored by native BP speakers (the “native” set) and another generated through automatic English‑to‑Portuguese translation (the “translated” set). The authors adopt a Vision‑Encoder‑Decoder (VED) paradigm, pairing three state‑of‑the‑art vision transformers—ViT‑base, Swin‑Transformer‑base, and DeiT‑base—with three BP‑focused language decoders: BERTimbau, DistilBERTimbau, and GPT‑Portuguese‑2. This yields nine distinct VED configurations, each fine‑tuned on both the native and translated corpora, resulting in a total of eighteen trained models.
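The experimental grid above (three encoders × three decoders, each fine-tuned on two corpora) can be made concrete with a small enumeration. This is a sketch of the study's configuration space, not the authors' training code; the short model labels are illustrative stand-ins for the actual checkpoints.

```python
from itertools import product

# Vision encoders and BP language decoders named in the paper
# (labels are shorthand, not actual checkpoint identifiers).
encoders = ["vit-base", "swin-base", "deit-base"]
decoders = ["bertimbau", "distilbertimbau", "gpt2-portuguese"]
datasets = ["native", "translated"]

# Every encoder-decoder pairing forms one VED configuration...
configs = list(product(encoders, decoders))
# ...and each configuration is fine-tuned once per corpus.
runs = [(enc, dec, ds) for (enc, dec), ds in product(configs, datasets)]

print(len(configs))  # 9 VED configurations
print(len(runs))     # 18 trained models
```

In practice each `(encoder, decoder)` pair would be instantiated as a single vision-encoder-decoder model (for example via a framework that stitches a pretrained image encoder to a pretrained text decoder) before fine-tuning on the chosen corpus.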
A central contribution is the cross‑context evaluation: models trained on one dataset are tested on the other, enabling a direct measurement of how translation quality influences generalization. In addition to classic reference‑based metrics (BLEU, METEOR, ROUGE, CIDEr), the study incorporates CLIP‑Score, a reference‑free metric that quantifies image‑text alignment by leveraging the CLIP model’s joint embedding space. Finally, attention maps are visualized for each model, providing qualitative insight into which image regions drive caption generation and revealing systematic biases.
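The reference-free CLIP-Score mentioned above reduces to a rescaled, clipped cosine similarity between the CLIP image embedding and the CLIP caption embedding. The sketch below shows the scoring formula on toy vectors standing in for real CLIP outputs; the rescaling weight of 2.5 follows the original CLIP-Score formulation (Hessel et al., 2021), and is an assumption about the exact variant the paper uses.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray,
               w: float = 2.5) -> float:
    """CLIP-Score: w * max(cosine(image, caption), 0), computed on
    CLIP's joint embedding space (toy vectors here, not real CLIP)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb)
    return float(w * max(np.dot(img, txt), 0.0))

# Toy embeddings: one caption aligned with the image, one anti-aligned.
img = np.array([0.6, 0.8, 0.0])
good = np.array([0.6, 0.8, 0.0])
bad = np.array([-0.6, -0.8, 0.0])

print(clip_score(img, good))  # 2.5 (cosine = 1, fully aligned)
print(clip_score(img, bad))   # 0.0 (negative cosine is clipped)
```

Because no reference captions enter the computation, the score complements BLEU/METEOR/CIDEr: a fluent caption that describes the wrong image scores low, even if it overlaps lexically with the references.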
Results show that the Swin‑DistilBERTimbau configuration consistently outperforms all other VED combinations across both datasets, achieving the highest scores on BLEU, METEOR, and CIDEr while maintaining minimal performance degradation in the cross‑dataset tests. This suggests that Swin’s hierarchical windowed attention efficiently captures both local and global visual cues, and that DistilBERTimbau, despite being a distilled model, retains sufficient linguistic capacity for BP. The BP‑specific pre‑trained VLM ViTucano surpasses larger multilingual models (GPT‑4o, LLaMa 3.2‑Vision) on traditional text metrics, highlighting the advantage of language‑specific pre‑training. Conversely, GPT‑4‑based models achieve the highest CLIP‑Score, indicating superior image‑text semantic alignment even without fine‑tuning.
Attention analysis uncovers three notable bias patterns: (1) gender misclassification in images lacking explicit gender cues, reflecting societal biases present in the training data; (2) enumeration errors where multiple objects are described out of order or omitted; and (3) spatial inconsistencies where phrases such as “to the left of” or “above” do not correspond to the attended image patches, likely stemming from imperfect positional encoding during patch tokenization. These findings underscore the need for bias mitigation strategies, enhanced positional encoding, and possibly data augmentation focused on multi‑object and spatial relations.
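Attention maps like those analyzed above are typically obtained by taking the attention a chosen token pays to the image patches and upsampling that patch grid back to pixel resolution. A minimal sketch, assuming a ViT-style encoder with 224×224 inputs and 16×16 patches (a 14×14 grid); real diagnostics would pull the attention weights from the trained model rather than random values:

```python
import numpy as np

def attention_to_heatmap(patch_attention: np.ndarray, grid: int = 14,
                         image_size: int = 224) -> np.ndarray:
    """Turn a flat vector of per-patch attention weights (grid*grid
    values) into a per-pixel heatmap via nearest-neighbour upsampling."""
    amap = patch_attention.reshape(grid, grid)
    amap = amap / amap.max()                       # normalise to [0, 1]
    scale = image_size // grid                     # 16 pixels per patch
    return np.kron(amap, np.ones((scale, scale)))  # repeat each patch value

# Random weights standing in for a model's attention over 14x14 patches.
attn = np.random.rand(14 * 14)
heat = attention_to_heatmap(attn)
print(heat.shape)  # (224, 224), ready to overlay on the input image
```

Overlaying such a heatmap on the input image makes the spatial inconsistencies described above visible: if a caption says "to the left of" while the bright region sits on the right, the mismatch between language and attended patches is immediate.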
The study’s methodological contributions are threefold. First, it provides empirical evidence that native captions yield higher linguistic fidelity and better downstream performance than automatically translated captions, yet models trained solely on translated data still retain reasonable cross‑dataset competence. Second, the integration of CLIP‑Score within the CAPIVARA evaluation framework offers a more holistic assessment of caption quality, complementing reference‑based scores that may overlook semantic alignment. Third, the visual attention diagnostics furnish an interpretable layer that can guide future model improvements and dataset curation.
In conclusion, the paper demonstrates that a carefully selected combination of a hierarchical vision encoder (Swin) and a lightweight, BP‑tailored language decoder (DistilBERTimbau) delivers state‑of‑the‑art performance on both native and translated BP captioning tasks. It also shows that large multilingual VLMs excel in cross‑modal alignment (as measured by CLIP‑Score) but may lag on language‑specific fluency metrics. The authors suggest future work in expanding native BP caption corpora, refining automatic translation pipelines with neural post‑editing, applying bias‑reduction techniques, and extending the cross‑language evaluation methodology to other low‑resource languages.