- Title: Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
- Authors: Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra
- Link: arxiv
- Tags: Deep Reinforcement Learning, Visual Question Answering, Dialog Agents
- Year: 2017
- This paper aims to develop agents that can understand visual content and communicate their understanding using natural language.
- To achieve this goal, the authors pose a cooperative “image guessing” game between two agents, Q-Bot and A-Bot, which communicate in natural-language dialog so that Q-Bot can select an unseen image from a lineup of images.
- The agents’ policies are learned end-to-end with deep reinforcement learning, simulating a potentially unlimited number of game plays along the way; a toy sketch of this training loop follows below.
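A minimal sketch of what such a game loop with REINFORCE updates could look like. The simplifications here are mine, not the paper’s: questions and answers are single discrete symbols instead of natural-language sentences, “images” are random feature vectors, and every name and size (`QBot`, `ABot`, `FEAT`, `VOCAB`, `HID`, `ROUNDS`) is illustrative.

```python
# Toy sketch of the cooperative game loop with REINFORCE.
# Assumptions (not the paper's setup): one-symbol questions/answers,
# random vectors as image features, tiny GRU-based agents.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, VOCAB, HID, ROUNDS = 16, 32, 64, 3  # illustrative sizes

class QBot(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(VOCAB, HID)  # dialog state, updated by answers
        self.ask = nn.Linear(HID, VOCAB)   # question policy (one symbol)
        self.guess = nn.Linear(HID, FEAT)  # regression of the unseen image

class ABot(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(VOCAB + FEAT, HID)  # sees image + question
        self.answer = nn.Linear(HID, VOCAB)       # answer policy (one symbol)

def play(q_bot, a_bot, image):
    """One episode: returns per-round log-probs, rewards, and final error."""
    hq, ha = torch.zeros(1, HID), torch.zeros(1, HID)
    log_probs, rewards = [], []
    prev_err = torch.norm(q_bot.guess(hq) - image).detach()
    for _ in range(ROUNDS):
        q_pi = torch.distributions.Categorical(logits=q_bot.ask(hq))
        q_sym = q_pi.sample()  # Q-Bot samples a "question" symbol
        ha = a_bot.rnn(torch.cat([F.one_hot(q_sym, VOCAB).float(), image], 1), ha)
        a_pi = torch.distributions.Categorical(logits=a_bot.answer(ha))
        a_sym = a_pi.sample()  # A-Bot samples an "answer" symbol
        hq = q_bot.rnn(F.one_hot(a_sym, VOCAB).float(), hq)
        err = torch.norm(q_bot.guess(hq) - image)   # Euclidean guessing error
        rewards.append((prev_err - err).detach())   # reward = error reduction
        prev_err = err.detach()
        log_probs.append(q_pi.log_prob(q_sym) + a_pi.log_prob(a_sym))
    return log_probs, rewards, err

q_bot, a_bot = QBot(), ABot()
opt = torch.optim.Adam(list(q_bot.parameters()) + list(a_bot.parameters()), lr=1e-3)
for _ in range(1000):                     # simulate as many game plays as desired
    image = torch.randn(1, FEAT)          # stand-in for a CNN image feature
    log_probs, rewards, final_err = play(q_bot, a_bot, image)
    pg_loss = -sum(lp * r for lp, r in zip(log_probs, rewards))  # REINFORCE
    loss = pg_loss.sum() + final_err      # policy gradient + regression loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The property mirrored from the paper is that only A-Bot sees the image, Q-Bot must regress it from the dialog alone, and both policies are reinforced with the same cooperative reward, so no human-annotated dialog is needed during this phase.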
- Evaluation in large-scale real-image experiments shows significant performance improvements over existing methods, with the RL agents producing exchanges that are less repetitive and more image-discriminative.
- This paper introduces the first goal-driven training method for visual question-answering/dialog agents, improving upon previous work based on supervised learning.
- Using images from a simple synthetic world, the authors demonstrate the automatic emergence of grounded language and communication among visual dialog agents with no human supervision.
- The authors show that introducing RL to the training process results in significant performance improvements over agents trained in a strictly supervised manner.
- While the supervised Q-Bot simply learns to mimic the questions that humans ask, the RL-trained Q-Bot shifts strategies and asks questions that A-Bot is better at answering, leading to more informative dialog and better synergy between the agents.
- Interactions between RL-trained agents are more diverse, less prone to safe, repetitive exchanges that carry no meaningful information, and shown to be more image-discriminative.
- The “guessing game” posed by the authors does not require any bounding-box annotations, and the reward/error can be computed as a simple Euclidean distance in image-feature space, allowing an unlimited number of game plays to be simulated (see the sketch below).
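Concretely, the per-round reward is the reduction in Euclidean distance between Q-Bot’s regressed image representation and the true one. A minimal standalone sketch, with `y_prev`, `y_curr`, and `y_gt` as illustrative names:

```python
import torch

def round_reward(y_prev, y_curr, y_gt):
    """Per-round reward: positive when this round's exchange moved Q-Bot's
    regressed image feature closer (in Euclidean distance) to the truth.
    Names are illustrative; y_gt is the true image feature, y_prev and
    y_curr are Q-Bot's regressions before and after the round."""
    return torch.norm(y_prev - y_gt) - torch.norm(y_curr - y_gt)
```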
- While the RL agents are less prone to getting stuck in infinite repeating dialog loops, this issue still affects them, albeit at slightly longer intervals.
- RL agents also remain affected by “forgetfulness”, the observed phenomenon in which agent performance worsens after several rounds of dialog, even though strictly more information is available.