Language is acquired by understanding the world around us. It is therefore essential to capture such understanding by obtaining knowledge from sources that are useful for grounding language. Recent work has shown that visually-grounded language helps address task-specific challenges such as visual captioning, dialog, and question answering. This has been demonstrated in a series of workshops held at various conferences.

In this workshop, however, we encourage submissions to go beyond the task-specific integration of language and vision, that is, to leverage knowledge from an external source that is either provided by an environment or fixed. Environments represent physical spaces such as homes, offices, hospitals, and warehouses. Agents or robots that assist humans in such environments should have knowledge about their surroundings and use it effectively for visually-grounded language. This creates a need for new techniques that link language to actions and scenes in the real world. Similar environments also arise in virtual spaces such as games (e.g., chess), where annotated transcripts of player communications and moves can be learned from.

Fixed knowledge refers to structured knowledge graphs that encapsulate real-world or common-sense knowledge along with its relationships. They explicitly contribute relational knowledge that helps interconnect language and vision effectively (e.g., Visual Genome). Related questions, such as continuous versus discrete representations, are also interesting for understanding how to combine and reason with these sources. Translating language directly into imaginary or real physical spaces and reasoning in those spaces (sketches in 2D and 3D, or 4D including time) through mathematics, as humans somehow do, is another promising direction.

The objective of the workshop is thus to bring together researchers from different disciplines, united by their adoption of techniques from machine learning, neuroscience, multi-agent systems, natural language processing, computer vision, and psychology, to interconnect language and vision by leveraging external knowledge that is either fixed or provided by an environment.

Topics of interest include, but are not limited to:


Mohit Bansal
Lucia Specia
(Imperial College London)
Yejin Choi
(University of Washington)
Massimiliano Pontil


Aditya Mogadala
(Saarland University)
Dietrich Klakow
(Saarland University)
Sandro Pezzelle
(University of Trento)
Marie-Francine Moens
(KU Leuven)