Reasonable Perception: Connecting Vision and Language Systems for Validating Scene Descriptions
PubDate: March 2018
Teams: Massachusetts Institute of Technology
Writers: Leilani H. Gilpin; Cagri Zaman; Danielle Olson; Ben Z. Yuan
PDF: Reasonable Perception: Connecting Vision and Language Systems for Validating Scene Descriptions
Abstract
Understanding explanations of machine perception is an important step towards developing accountable, trustworthy machines. Speech and vision are the primary modalities by which humans collect information about the world, yet linking the visual and natural language domains is a relatively new pursuit in computer vision, and testing performance in a safe environment is difficult. To couple human visual understanding with machine perception, we present an explanatory system for creating a library of possible context-specific actions associated with 3D objects in immersive virtual worlds. We also contribute a novel scene description dataset, generated natively in virtual reality, containing speech, image, gaze, and acceleration data. We discuss the development of a hybrid machine learning algorithm linking vision data with environmental affordances expressed in natural language. Our findings demonstrate that it is possible to develop a model that can generate interpretable verbal descriptions of possible actions associated with recognized 3D objects within immersive VR environments.
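The core idea of the abstract, mapping a recognized 3D object and its context to a library of possible actions and verbalizing them, can be sketched as follows. This is a minimal hypothetical illustration: the `AFFORDANCES` table, the object and context labels, and the `describe` function are assumptions for exposition, not the authors' actual system.

```python
# Hypothetical sketch: look up context-specific affordances for a recognized
# object label and produce an interpretable verbal description of them.
# The affordance library and all labels here are illustrative assumptions.

AFFORDANCES = {
    ("mug", "kitchen"): ["pick up", "drink from", "fill with liquid"],
    ("mug", "office"): ["pick up", "hold pens in"],
    ("chair", "office"): ["sit on", "push under the desk"],
}

def describe(obj: str, context: str) -> str:
    """Return a natural-language description of possible actions
    for a recognized object in a given context."""
    actions = AFFORDANCES.get((obj, context))
    if not actions:
        return f"No known actions for a {obj} in a {context}."
    return f"A {obj} in a {context} affords: {', '.join(actions)}."

print(describe("mug", "kitchen"))
```

In the paper's setting, the object label would come from a vision model running over the VR scene and the affordance library would be learned rather than hand-written; the sketch only shows the final lookup-and-verbalize step.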