Humans understand events in the world contextually, performing what’s called multimodal reasoning across time to make inferences about the past, present, and future. Given text and an image that seem innocuous when considered separately (e.g., “Look how many people love you” and a picture of a barren desert), people recognize that these elements take on potentially hurtful connotations when they’re paired or juxtaposed.
Even the best AI systems struggle in this area. But there’s been progress, most recently from a team at the Allen Institute for Artificial Intelligence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering. In a preprint paper published this month, the researchers detail Merlot, short for Multimodal Event Representation Learning Over Time, a multimodal neural script knowledge model that learns to match images in videos with words and even follow events globally over time by watching millions of YouTube videos with transcribed speech. It does all this in an unsupervised manner, meaning the videos haven’t been labeled or categorized, which forces the system to learn from the videos’ inherent structures.
Learning from videos
Our capacity for commonsense reasoning is shaped by how we experience causes and effects. Teaching machines this type of “script knowledge” is a significant challenge, in part because of the amount of data it requires. For example, even a single photo of people dining at a restaurant can imply a wealth of information, like the fact that the people had to agree on where to go, meet up, and enter the restaurant before sitting down.
Merlot attempts to internalize these concepts by watching YouTube videos. Lots of YouTube videos. Drawing on a dataset of 6 million videos, the researchers trained the model to match individual frames with a contextualized representation of the video transcripts, divided into segments. The dataset contained instructional videos, lifestyle vlogs of everyday events, and YouTube’s auto-suggested videos for popular topics like “science” and “home improvement,” each selected explicitly to encourage the model to learn about a broad range of objects, actions, and scenes.
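To make the frame-to-transcript matching idea concrete, here is a minimal sketch in plain Python. It is not the researchers' code. It assumes that each frame and each transcript segment has already been turned into a vector by some encoder, and it scores how well the two line up with a symmetric contrastive loss, a common choice for this kind of matching. The function name, the temperature value, and the toy vectors are all illustrative assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def contrastive_matching_loss(frame_vecs, segment_vecs, temperature=0.05):
    """Symmetric cross-entropy over a frame-by-segment similarity matrix.

    Assumption: frame_vecs[i] and segment_vecs[i] come from the same moment
    of the video, so the "correct" matches lie on the diagonal.
    """
    frames = [normalize(v) for v in frame_vecs]
    segments = [normalize(v) for v in segment_vecs]
    n = len(frames)
    sims = [[dot(f, s) / temperature for s in segments] for f in frames]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*sims)]
    # Average the frame-to-segment and segment-to-frame directions.
    return 0.5 * (cross_entropy(sims) + cross_entropy(columns))


# Toy data: each "frame" vector is a noisy copy of its transcript segment.
rng = random.Random(0)
segments = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
frames = [[x + 0.1 * rng.gauss(0, 1) for x in seg] for seg in segments]

aligned_loss = contrastive_matching_loss(frames, segments)
shifted_loss = contrastive_matching_loss(frames[1:] + frames[:1], segments)
```

In this toy setup the loss is low when frames are paired with their own segments and much higher when the pairing is shifted by one position. A real training loop would adjust the encoders to push the loss down across millions of videos.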
The goal was to teach Merlot to contextualize the frame-level representations over time and over spoken words so it could reorder scrambled video frames and make sense of “noisy” transcripts — including those with erroneously lowercase text, missing punctuation, and filler words like “umm,” “hmm,” and “yeah.” The researchers largely accomplished this. They reported that in a series of qualitative and quantitative tests, Merlot had a strong “out-of-the-box” understanding of everyday events and situations, enabling it to take a scrambled sequence of events from a video and order the frames to match the captions in a coherent narrative, like people riding a carousel.
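The unscrambling task can be illustrated with a small, hypothetical example. The sketch below is an illustrative stand-in rather than Merlot's actual ordering method, which this article does not spell out. It assumes a table of scores saying how well each scrambled frame fits each caption, with captions listed in story order, and it tries every possible ordering to find the best overall fit. The function name and the scores are made up for illustration.

```python
from itertools import permutations


def unscramble(similarity):
    """Return the frame order that best fits the captions.

    similarity[f][c] is an assumed score for how well scrambled frame f
    matches caption c. Position c of the result holds the index of the
    frame assigned to caption c.
    """
    n = len(similarity)
    best_order, best_score = None, float("-inf")
    for order in permutations(range(n)):
        score = sum(similarity[frame][caption]
                    for caption, frame in enumerate(order))
        if score > best_score:
            best_order, best_score = list(order), score
    return best_order


# Hypothetical carousel story, captions in narrative order.
captions = ["buying tickets", "climbing onto a horse", "the carousel spinning"]

# Made-up scores: rows are scrambled frames, columns are captions.
scores = [
    [0.1, 0.2, 0.9],  # frame 0 looks like the spinning carousel
    [0.8, 0.3, 0.1],  # frame 1 looks like the ticket booth
    [0.2, 0.7, 0.3],  # frame 2 looks like someone climbing on
]

story_order = unscramble(scores)
```

Trying every ordering only works for a handful of frames, since the number of orderings grows factorially. For longer sequences, an assignment algorithm or a learned pairwise ordering model would be a more practical choice.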
Merlot is only the latest work on video understanding in the AI research community. In 2019, researchers at Georgia Institute of Technology and the University of Alberta created a system that could automatically generate commentary for “let’s play” videos of video games. More recently, researchers at Microsoft published a preprint paper describing a system that could determine whether statements about video clips were true by learning from visual and textual clues. And Facebook has trained a computer vision system that can automatically learn audio, textual, and visual representations from publicly available Facebook videos.
The Allen Institute and University of Washington researchers note that, like previous work, Merlot has limitations, some owing to the data selected to train the model. For example, Merlot could exhibit undesirable biases because it was only trained on English data and largely local news segments, which can spend a lot of time covering crime stories in a sensationalized way. The researchers concede it’s “very likely” that training models like Merlot on mostly news content could cause them to learn racist patterns; studies have demonstrated a correlation between watching local news and having more explicit, racialized beliefs about crime. They say the same goes for sexist patterns, given that the most popular YouTubers in most countries are men.
For these reasons, the team advises against deploying Merlot in a production environment. But they say the model is still a promising step toward future work in multimodal understanding. “We hope that Merlot can inspire future work for learning vision+language representations in a more humanlike fashion compared to learning from literal captions and their corresponding images,” the coauthors wrote. “The model achieves strong performance on tasks requiring event-level reasoning over videos and static images.”