MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering

Published in EMNLP 2020

Recommended citation: Tejas Gokhale, Pratyay Banerjee, Chitta Baral and Yezhou Yang (2020, March). MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering. EMNLP 2020. https://arxiv.org/abs/2009.

While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present MUTANT, a training paradigm that exposes the model to perceptually similar yet semantically distinct mutations of the input in order to improve OOD generalization on benchmarks such as the VQA-CP challenge. Under this paradigm, models utilize a consistency-constrained training objective to understand the effect of semantic changes in the input (question-image pair) on the output (answer). Unlike existing methods on VQA-CP, MUTANT does not rely on knowledge about the nature of the train and test answer distributions. MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement. Our work opens up avenues for the use of semantic input mutations for OOD generalization in question answering.
