The Third Workshop Beyond Vision and LANguage:
inTEgrating Real-world kNowledge

EACL 2021 Workshop
April 20, 2021

Invited Speakers

Learning language by observing the world and learning about the world from language
Children learn about the visual world from implicit supervision that language provides. At the same time, most children learn their language, at least to some extent, by observing the world. Recently released datasets of instructional videos are interesting because they can be considered a rough approximation of a child’s visual and linguistic experience: in these videos, the narrator performs a high-level task (e.g., cooking pasta) while describing the steps involved in that task (e.g., boiling water). Moreover, these datasets pose challenges similar to those children need to address; for example, identifying activities relevant to the task (e.g., boiling water) and ignoring the rest (e.g., shaking head). I will present two projects where we study the interaction of visual and linguistic signals in these videos. We show that (1) using language and the structure of tasks is important in discovering action boundaries; and (2) the visual signal improves the quality of unsupervised word translation, especially for dissimilar languages and when we do not have access to large corpora.

Building machines that understand words as people do
Machines have achieved a broad and growing set of linguistic competencies, thanks to recent progress in Natural Language Processing (NLP). Psychologists have shown increasing interest in such models, comparing their output to psychological judgments such as similarity, association, priming, and comprehension, raising the question of whether the models could serve as psychological theories. In this talk, we will compare how humans and machines represent the meaning of words. We argue that contemporary NLP systems are fairly successful models of human word similarity, but they fall short in many other respects. Current models are too strongly linked to the text-based patterns in large corpora, and too weakly linked to the desires, goals, and beliefs that people express through words. Word meanings must also be grounded in perception and action and be capable of flexible combinations in ways that current systems are not. We discuss more promising approaches to grounding NLP systems and argue that they will be more successful with a more human-like, conceptual basis for word meaning.

Vision and Language Problems for a Real-World Application of Describing Images Taken by People Who Are Blind
A natural grand challenge for the AI community is to design technology that can assist people who are blind in overcoming their real daily visual challenges. Towards this aim, my team introduced the first datasets and AI challenges originating from people who are blind to encourage a larger community to collaborate on developing language and vision algorithms for assistive technologies. The datasets were built using data submitted by users of a mobile phone application, who each took a picture and (optionally) recorded a spoken question about that picture in order to receive visual assistance from remote humans in answering their visual questions or describing their images. In this talk, I will address questions including: How do the challenges of building datasets originating from real-world applications compare to the standard approach in the AI community for dataset creation? How does data originating from real users of image captioning and visual question answering services compare to that in existing datasets? What is the current state of AI algorithms for meeting real users' practical needs? What are the key AI challenges ahead?

Probably Asked Questions and Parametric vs Non-Parametric Knowledge
In this talk I look at factual knowledge as a function from questions to answers, and frame knowledge-intensive tasks as distributions over “probably asked questions.” I will consider two paradigms: parametric models, which approximate the above function by optimising a fixed number of parameters using key/value pairs as a training set; and non-parametric models, which memorise key/value pairs explicitly. I show that parametric models are promising solutions when based on pre-trained LMs, but remain relatively poor approximators in comparison to non-parametric counterparts. I will also illustrate that traditional knowledge bases/graphs can be seen as non-parametric models optimised for very particular “probable question” distributions, with additional “training pairs” generated/curated in advance. We translate this paradigm to modern Open-Domain QA question distributions by synthetically generating PAQ, a dataset of 60M+ likely question/answer pairs, and introducing RePAQ, a non-parametric model we train with PAQ. RePAQ enables us to readily build systems that are either very fast (1000 q/s) and quite accurate, very small and quite accurate (winning two NeurIPS competition tracks for their minimal memory footprint), or very accurate and still quite fast (more accurate and 2x faster than a SOTA model). Critically, RePAQ is always good at knowing what it doesn’t know.
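
As a rough illustration of the non-parametric paradigm described in this abstract, the following toy sketch memorises question/answer pairs explicitly and answers a new question by looking up the most similar stored question, abstaining when nothing is close enough. It is not the RePAQ model: the class name `LookupQA`, the token-overlap similarity, the threshold value, and the example pairs are all assumptions made for illustration only.

```python
# Toy non-parametric QA: store question/answer pairs and answer by
# nearest-neighbour lookup. All names here are hypothetical; this is
# NOT the RePAQ implementation, only a sketch of the general idea.

def tokens(text):
    """Lowercase the text and return its set of words, stripping punctuation."""
    return {w.strip("?.,!").lower() for w in text.split() if w.strip("?.,!")}


def jaccard(a, b):
    """Token-set overlap between two questions, a value in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class LookupQA:
    def __init__(self, pairs, threshold=0.5):
        # The "model" is just the memorised key/value pairs.
        self.pairs = list(pairs)
        # Assumed cut-off below which the system abstains.
        self.threshold = threshold

    def answer(self, question):
        """Return the answer of the closest stored question, or None to abstain."""
        if not self.pairs:
            return None
        best_q, best_a = max(self.pairs, key=lambda qa: jaccard(question, qa[0]))
        if jaccard(question, best_q) >= self.threshold:
            return best_a
        return None  # "knows what it doesn't know": no stored question is close


qa = LookupQA([
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
])
```

Calling `qa.answer("Who wrote the play Hamlet?")` retrieves the stored Hamlet pair, while an unrelated question such as `qa.answer("How tall is Mount Everest?")` returns `None`. A real system would replace the token-overlap score with a learned retriever, but the abstention step shows one simple way a lookup-based model can decline to answer.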