Overview

Motto: When you are groping in the dark, knowledge gives you the light

Language is acquired, used, and evaluated in everyday life by understanding the world around us. It is thus essential to capture such an understanding by exploiting knowledge from sources that are useful for grounding language. Recent works [Lin et al., 2014; Antol et al., 2015; Das et al., 2017] have shown that visually-grounded language is useful for addressing task-specific challenges such as visual captioning, Q&A, dialog, and so on. This has been extensively demonstrated in a series of workshops conducted at various conferences.

In this workshop, however, we are interested in work that goes beyond the task-specific integration of language and vision, that is, work that leverages knowledge from external sources, provided either by an environment or by some fixed knowledge [Mogadala et al., 2018; Lu et al., 2018; Zhou et al., 2019; Shah et al., 2019]. Environments here are physical spaces such as homes, offices, hospitals, and warehouses. In these environments, we may want to develop agents/robots that are able to assist humans in various ways. To do that, these agents should have knowledge about their surroundings and use it effectively to produce visually-grounded language [Whitehead et al., 2018; Anderson et al., 2018; Das et al., 2018]. Hence, this requires developing new techniques for linking language to an action or scene in the real world. Environments of this kind also arise in virtual spaces such as games (e.g., chess), where annotated transcripts of player communications and moves can be learned from, and in navigation discovery [Hermann et al., 2019].

As for fixed knowledge, it refers to knowledge graphs that encapsulate real-world or common-sense knowledge along with its relationships. They explicitly contribute relational knowledge, which helps to interconnect language and vision effectively (e.g., Visual Genome [Krishna et al., 2017]). Other related problems, such as continuous versus discrete representations, are also interesting for understanding how to combine and reason with them. Translating language directly into imaginary or real physical spaces and reasoning in these spaces (sketches in 2D and 3D, or 4D including time) through math, as humans somehow do, is another prospect.

Advances in machine learning also play a fundamental role in providing background support for the integration of language and vision with real-world knowledge. In particular, this includes the use of reinforcement learning for optimization [Rennie et al., 2017], deep learning from graph-structured data [Haurilet et al., 2019], and generalization to limited or unseen data with few-shot/zero-shot learning [Vinyals et al., 2016].

Hence, the aim of the LANTERN workshop is to bring together researchers from different disciplines who are united by their adoption of techniques from machine learning to interconnect language, vision, and other modalities by leveraging external knowledge. We encourage contributions that exploit external knowledge coming from the most diverse sources: from knowledge graphs, fixed and dynamic environments, cognitive and neuroscience data, and many others.

Topics of interest include, but are not limited to:

Application of language and vision to robotics
Cognitively- and neuroscience-driven vision and language learning (eye-tracking, fMRI, etc.)
Common-sense knowledge acquisition from vision
Enhancing visual perception with language and structured knowledge
Human-robot interaction with language understanding and visual perception
Integration of vision and language by building cross-modal relationship networks
Integrated models of real-world knowledge, vision, and language for generating context-sensitive embeddings
Language and vision for learning games
Learning of quantities from vision
Multi-task learning for integration of language and vision
Reasoning with language to improve visual perception
Text-to-Image (natural, sketch, synthetic) generation with external knowledge
Theoretical understanding of limitations in the integration of vision and language
Visual dialog, captioning and Q&A by incorporating common-sense/real-world knowledge
Other novel tasks that combine language and vision by means of external knowledge

References

[1] Lin, Tsung-Yi, et al. “Microsoft COCO: Common objects in context.” ECCV (2014).
[2] Antol, Stanislaw, et al. “VQA: Visual question answering.” ICCV (2015).
[3] Das, Abhishek, et al. “Visual dialog.” CVPR (2017).
[4] Mogadala, Aditya, et al. “Knowledge Guided Attention and Inference for Describing Images Containing Unseen Objects.” ESWC (2018).
[5] Lu, Di, et al. “Entity-aware Image Caption Generation.” EMNLP (2018).
[6] Zhou, Yimin, Yiwei Sun, and Vasant Honavar. “Improving Image Captioning by Leveraging Knowledge Graphs.” WACV (2019).
[7] Shah, Sanket, et al. “KVQA: Knowledge-aware Visual Question Answering.” AAAI (2019).
[8] Whitehead, Spencer, et al. “Incorporating Background Knowledge into Video Description Generation.” EMNLP (2018).
[9] Anderson, Peter, et al. “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments.” CVPR (2018).
[10] Das, Abhishek, et al. “Embodied question answering.” CVPR Workshops (2018).
[11] Hermann, Karl Moritz, et al. “Learning To Follow Directions in Street View.” arXiv preprint arXiv:1903.00401 (2019).
[12] Krishna, Ranjay, et al. “Visual genome: Connecting language and vision using crowdsourced dense image annotations.” IJCV (2017).
[13] Rennie, S. J., et al. “Self-critical sequence training for image captioning.” CVPR (2017).
[14] Haurilet, M., et al. “It is not about the journey; it is about the destination: Following soft paths under question-guidance for visual reasoning.” CVPR (2019).
[15] Vinyals, Oriol, et al. “Matching networks for one shot learning.” NIPS (2016).