Culture in Action: Evaluating Text-to-Image Models through Social Activities

University of Pittsburgh and Boston University
The Fourteenth International Conference on Learning Representations (ICLR), 2026

Abstract

Cultural nuances are best captured through social interactions, yet current text-to-image (T2I) benchmarks focus largely on object-centric artifacts (e.g., food, landmarks, and attire). In this work, we study the cultural faithfulness of T2I models (i.e., adherence to the target culture) through social activities. To this end, we introduce CULTIVate, a new benchmark of 576 activities across 9 categories (e.g., dancing, greeting, dining) with over 19,000 images from 16 countries. We further propose AHEaD, an explainable framework that measures cultural understanding along four dimensions: cultural Alignment, Hallucination, Exaggeration, and semantic Diversity. Unlike prior work relying on costly human evaluation or image-text alignment (ITA), AHEaD uses culturally grounded descriptors to provide quantitative, interpretable feedback that enables iterative image refinement. Our analysis shows that ITA metrics correlate poorly with human judgments and that alignment alone is insufficient to capture faithfulness. In contrast, FAITH achieves 27%+ higher correlation than baselines by combining alignment, hallucination, and exaggeration. Finally, we observe systematic disparities, with generated images being consistently more faithful for Global North than Global South cultures.

Overview

Overview of the CULTIVate benchmark and the AHEaD framework.

BibTeX

@article{malakouti2025culture,
  title={Culture in Action: Evaluating Text-to-Image Models through Social Activities},
  author={Malakouti, Sina and Gong, Boqing and Kovashka, Adriana},
  journal={ICLR},
  year={2026}
}