Overview

Motto: When you are groping in the dark, knowledge gives you the light

Language is acquired, used, and evaluated in everyday life by understanding the world around us. It is thus essential to capture such an understanding by exploiting knowledge from sources that are useful for grounding language. Recent work has shown that visually-grounded language is useful for addressing task-specific challenges such as visual captioning, Q&A, and dialog. This has been extensively demonstrated in a series of workshops conducted at various conferences.

In this workshop, however, we are interested in work that goes beyond the task-specific integration of language and vision, that is, work that leverages knowledge from external sources, provided either by an environment or by some fixed body of knowledge. Environments here are physical spaces such as homes, offices, hospitals, and warehouses, in which we may want to develop agents or robots that can assist humans in various ways. To do so, these agents need knowledge about their surroundings and must use it effectively to produce visually-grounded language, which requires new techniques for linking language to actions and scenes in the real world. Fixed knowledge refers to knowledge graphs, which encapsulate real-world or common-sense knowledge and the relationships within it. They explicitly contribute relational knowledge that can help interconnect language and vision effectively (e.g., Visual Genome). Related problems, such as continuous versus discrete representations of knowledge, are also of interest for understanding how to combine and reason with them. Finally, advances in machine learning play a fundamental role in supporting the integration of language and vision with real-world knowledge, in particular the use of reinforcement learning for optimization, deep learning on graph-structured data, and generalization to limited or unseen data with few-shot and zero-shot learning.

Hence, the aim of the LANTERN workshop is to bring together researchers from different disciplines who are united by their adoption of techniques from machine learning to interconnect language, vision, and other modalities by leveraging external knowledge.

Topics of interest include, but are not limited to:

Application of language and vision to robotics
Cognitively- and neuroscience-driven vision and language learning (eye-tracking, fMRI, etc.)
Common-sense knowledge acquisition from vision
Enhancing visual perception with language and structured knowledge
Human-robot interaction with language understanding and visual perception
Integration of vision and language by building cross-modal relationship networks
Integrated models of real-world knowledge, vision, and language for generating context-sensitive embeddings
Language and vision for learning games
Learning of quantities from vision
Multi-task learning for integration of language and vision
Reasoning with language to improve visual perception
Text-to-Image (natural, sketch, synthetic) generation with external knowledge
Theoretical understanding of limitations in the integration of vision and language
Visual dialog, captioning and Q&A by incorporating common-sense/real-world knowledge
Other novel tasks which combine language and vision by means of external knowledge