Learning from videos to understand the world

(ai.facebook.com)

Today, we’re announcing a project called Learning from Videos, designed to automatically learn audio, textual, and visual representations from the data in publicly available videos uploaded to Facebook.

By learning from videos spanning nearly every country and hundreds of languages, this project will not just help us continuously improve our core AI systems for applications like content recommendation and policy enforcement — it will enable entirely new experiences.

This is also part of our broader efforts toward building machines that learn like humans do: from any example, not just ones that experts have labeled.

The first application is now live in Instagram Reels’ recommendation system.

Continuously learning from the world around us is one of the hallmarks of human intelligence. Just as we quickly learn to recognize people, places, things, and actions through observation, AI systems will be smarter and more useful if they can mimic the way humans learn. In just the last couple of years, we’ve made substantial breakthroughs in self-supervised learning across speech, vision, and language. These advancements have made AI systems less dependent on labeled data sets — a fundamental bottleneck on the pace of AI innovation — so that AI can start understanding the world through vast amounts of observational data like humans do.

Every day, people around the globe share videos on Facebook products. Building AI that learns from publicly available videos will help us create machines that better analyze uncurated, real-world sights and sounds — not just examples that are part of a much smaller, hand-curated data set.

Today, we are announcing a new project called Learning from Videos, designed to learn from audio, visual, and textual input — to continuously improve our core systems and power entirely new applications. By learning from global streams of publicly available videos spanning nearly every country and hundreds of languages, our AI systems will not just improve accuracy but also adapt to our fast-moving world and recognize the nuances and visual cues across different cultures and regions. And by helping AI researchers break away from the reliance on labeled data, we can improve AI-powered products and create entirely new experiences.

AI models are successful when we meet and acknowledge the responsibility we have to honor people’s privacy. We’re building and maintaining a strong privacy foundation that uses automated solutions to enforce privacy at scale. By embedding this work at the infrastructure level, we can consistently apply privacy requirements across our systems and support efforts like AI. This includes implementing technical safeguards throughout the data lifecycle.

Although we’ve just scratched the surface, using semi- and self-supervised learning on the videos uploaded to Facebook has already improved our computer vision and speech recognition systems. Within six months of developing a state-of-the-art, self-supervised framework for video understanding, we’ve built and deployed an AI model in Instagram Reels’ recommendation system. And this is just the beginning of our Learning from Videos project. Early experiments in applying self-supervised learning to real-world videos also show a 20 percent reduction in speech recognition errors, which could improve a wide range of applications like auto-captioning and tasks that help flag harmful content like hate speech. And we’re researching ways to apply new capabilities, like multimodal video retrieval, in order to make it easier for people to surface key moments in time from their trove of digital memories.

Finding similar Reels fits particularly well with self-supervised models because Reels tend to be highly stylized, featuring common patterns across trendy videos. Popular videos often consist of the same music set to the same dance moves, but created and acted by different people. Self-supervised models automatically learn “themes,” group them together, and implicitly make them available to the recommendation system. We’re using self-supervision to suggest videos that are relevant to recently watched videos, while filtering out near-duplicates — without explicit training labels for each classification task. To achieve this, we leveraged Generalized Data Transformations (GDT), our state-of-the-art method for building video embeddings, which systematically learns the relationships between the sound and images in a video. Since building this technology last year, we’ve pioneered the large-scale application of GDT to the representation of Reels data, by training a series of models on a data set of millions of Reels and videos from Instagram.
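To make the idea of suggesting related videos while filtering out near-duplicates concrete, here is a minimal sketch that works on precomputed embeddings. It is not the production Reels system: the embeddings are tiny hand-made vectors, and the `recommend` helper and its `dup_threshold` cutoff are hypothetical names and values chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(watched, candidates, k=2, dup_threshold=0.98):
    """Return up to k candidate ids most similar to `watched`, skipping near-duplicates.

    `candidates` maps a video id to its embedding. `dup_threshold` is an
    assumed cutoff: a similarity at or above it marks a near-duplicate of the
    watched video or of a recommendation that was already picked.
    """
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine(watched, item[1]),
                    reverse=True)
    picked = []
    for video_id, embedding in ranked:
        if cosine(watched, embedding) >= dup_threshold:
            continue  # near-duplicate of the video that was just watched
        if any(cosine(embedding, candidates[p]) >= dup_threshold for p in picked):
            continue  # near-duplicate of an earlier pick
        picked.append(video_id)
        if len(picked) == k:
            break
    return picked

# Toy embeddings (assumed values, standing in for learned video embeddings).
watched = [1.0, 0.0, 0.0]
candidates = {
    "reupload": [1.0, 0.001, 0.0],         # almost identical to the watched video
    "same_dance": [0.8, 0.6, 0.0],         # same theme, different creator
    "same_dance_copy": [0.8, 0.601, 0.0],  # near-copy of "same_dance"
    "same_song": [0.6, 0.0, 0.8],          # related, but less similar
}
print(recommend(watched, candidates))  # ['same_dance', 'same_song']
```

Note that no labels are involved: both the ranking and the duplicate filter rely only on distances between embeddings.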

GDT is designed to learn both audio and visual encoders of video content. The model’s parameters are adjusted (or learned) so that the representations of audio and visual content taken from the same video at the same time are similar to each other but different from the representations of unrelated content. So, we can learn that a picture of an audience clapping probably goes with the sound of applause, or that a video of a plane taking off most likely goes with a loud roar.
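The training objective described above can be sketched with a generic noise-contrastive (InfoNCE-style) loss. This is a simplified stand-in, not GDT's exact formulation: the `contrastive_loss` function, the `temperature` value, and the two-dimensional embeddings are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(audio, video, temperature=0.1):
    """Average cross-entropy of picking the matching video clip for each audio clip.

    audio[i] and video[i] come from the same video at the same time (a positive
    pair); every other video[j] in the batch acts as unrelated content.
    """
    audio = [normalize(a) for a in audio]
    video = [normalize(v) for v in video]
    total = 0.0
    for i, a in enumerate(audio):
        logits = [dot(a, v) / temperature for v in video]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(audio)

# Matching audio/visual pairs give a low loss; mismatched pairs give a high one.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)  # True
```

Minimizing a loss of this shape pulls the audio and visual representations of the same moment together and pushes unrelated content apart, which is the behavior the applause and airplane examples describe.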

GDT also proves to work well even in a unimodal setting, where we can find recommendations based on videos that sound alike or look alike, respectively. Interestingly, we’ve found that audio similarity has been particularly helpful in suggesting relevant content.

Deploying GDT was a massive engineering and scientific effort, involving software engineering to perfect the trainer, followed by training on dozens of GPUs, and a new set of data pipelines to extract the embeddings. For training efficiency, we leveraged the PyTorch and Classy Vision distribution platform.

Another major hurdle was to evaluate the accuracy of our model. Offline metrics, which are based on historical data, aren’t reliable enough to gauge whether people would prefer recommendations generated by our new model to those from existing models. We needed a way to analyze realistic engagement statistics in real time. To solve this, we turned to the gold standard of online A/B testing. We ran the model in production and made its output available in real time to the ranking system. Using this approach, we were able to run online A/B tests that showed positive results.
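The post does not say which statistics were used to read the A/B results. As one common possibility, the sketch below compares an engagement rate between a control and a treatment group with a two-proportion z-test. The function name and all counts are hypothetical, made-up numbers for illustration only, not results from the experiment.

```python
import math

def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: sessions with an engagement event, out of all sessions.
z, p_value = two_proportion_z_test(success_a=1000, total_a=10000,   # existing model
                                   success_b=1100, total_b=10000)   # new model
print(f"z = {z:.2f}, p = {p_value:.3f}")
```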

Better speech recognition for more languages and domains

Recently, speech models have been able to successfully learn the entire structure of language using mostly raw speech data, and to improve on traditional, supervised methods. Our latest technique for learning speech representations, called wav2vec 2.0, works by first masking a portion of the speech and then learning to predict the masked speech units. To provide an idea of the speed of progress, wav2vec 2.0 and self-training require only 10 minutes of transcribed audio to achieve very good speech recognition results on the LibriSpeech industry benchmark. The same results required nearly 1,000 hours of transcribed audio just one year ago.
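The masking step can be illustrated with a short sketch. It covers only the selection of which time steps to hide; the model that predicts the hidden speech units is not shown. The `sample_mask` helper is hypothetical, and the `start_prob` and `span` defaults are illustrative values, not settings confirmed by this post.

```python
import random

def sample_mask(num_steps, start_prob=0.065, span=10, seed=0):
    """Choose time steps to mask in a sequence of latent speech frames.

    Each time step becomes the start of a masked span with probability
    `start_prob`; the following `span` steps (including the start) are masked.
    Spans may overlap, and spans near the end are cut off at the sequence boundary.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < start_prob:
            for t in range(start, min(start + span, num_steps)):
                mask[t] = True
    return mask

mask = sample_mask(200)
print(sum(mask), "of", len(mask), "time steps masked")
```

During training, the inputs at the masked positions are hidden from the model, which is then scored on how well it identifies the correct speech unit for each of them.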

While exciting, these results are so far based on academic, curated data sets of speech audio (like carefully read audiobooks). Real-world videos include speech that’s unconstrained, filled with background music, noise, various speaking styles, accents, language switching, and other hard challenges.

To test the method on real-world data, we applied wav2vec 2.0 to millions of hours of unlabeled videos and just 100 hours of labeled data. We achieved a relative word error reduction of about 20 percent compared with supervised-only baselines trained on the same 100 hours. This shows, for the first time, that self-supervised learning with wav2vec 2.0 is effective for real-world data sets that are not as curated as the LibriSpeech corpus used in the original paper. The video data we trained wav2vec on is highly varied, and we found that wav2vec performs particularly well for subdomains and accents where little labeled data exists.
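For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) and a relative reduction are typically computed. The helper names are hypothetical, and the WER values 0.25 and 0.20 are invented purely to show what a 20 percent relative reduction looks like; they are not the measured numbers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                           # deletion
                            curr[j - 1] + 1,                       # insertion
                            prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline's errors that the new model removes."""
    return (baseline_wer - new_wer) / baseline_wer

# One dropped word out of five reference words gives a WER of 0.2.
print(word_error_rate("sing happy birthday to grandma", "sing happy birthday grandma"))

# Invented example: going from 25% WER to 20% WER is a 20% relative reduction.
print(round(relative_reduction(0.25, 0.20), 2))  # 0.2
```

A relative reduction is measured against the baseline's error rate, so a 20 percent relative improvement is not the same as lowering WER by 20 percentage points.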

As a next step, we’re now working on scaling wav2vec 2.0 to more data and more languages. These models will reduce labeling for new automatic speech recognition domains (e.g., AR glasses and virtual gaming), improve the performance of low- and medium-resource models, and improve other speech and audio tasks. As part of these efforts, we’re currently working on training a multilingual model with millions of hours of speech from 25 languages.

Jointly learning video, audio, text to recall digital memories

With self-supervision, AI systems can automatically learn to find and surface key memories — far beyond searching by date or location labels. To find videos that match such text phrases as “Show me every time we sang happy birthday to Grandma,” for instance, multimodal video understanding is crucial. Recalling memories in this way requires teaching systems how to match the phrase “happy birthday” to cakes, candles, people singing various birthday songs, and more.

Recent self-supervised learning advances have made it possible to create a joint representation of audio, visual, and textual signals in a single vector space. As part of our latest research efforts, we are using the combination of Facebook videos and their associated text (title, caption, descriptions) as the key lever for multimodal understanding. Although the associated text isn’t always perfectly descriptive of the video, it’s often enough of a useful signal. We’ve previously achieved this for images rather than videos using billions of public images and thousands of hashtags.

The key technical challenge of this approach is that the text and video are not aligned in any way. Although each audio clip corresponds to a visual clip, there’s no similar association between the textual words and the clips. Because of this, we must aggregate audio-visual information from across the video, aggregate information from the text, and then compare the two at the global video level. We call this research the Audio Visual Textual model, or AVT model.

In this research model, we extract a visual clip — which is a short sequence of visual frames — from a video every second. Our system analyzes this sequence using a convolutional neural network (CNN) to produce a vector of numbers that represents the information in the clip. This information is aggregated across time, both with another CNN and with an attention model. The output of this process is an overall representation of the information in the visual part of the video.
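The attention part of this aggregation can be sketched as a weighted average of per-clip vectors. This is only an illustration of attention pooling in general: the per-clip vectors below are hand-written stand-ins for CNN outputs, the `query` vector is a hypothetical learned parameter, and the additional CNN over time mentioned above is not modeled.

```python
import math

def attention_pool(clip_vectors, query):
    """Combine per-clip vectors into one video-level vector.

    Each clip is scored by its dot product with `query`; a softmax turns the
    scores into weights, and the result is the weighted average of the clips.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in clip_vectors]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(clip_vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, clip_vectors))
              for d in range(dim)]
    return pooled, weights

# Three one-second clips, each already encoded as a 2-dimensional vector (assumed values).
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(clips, query=[5.0, 0.0])
print([round(w, 3) for w in weights])
print([round(x, 3) for x in pooled])
```

With an all-zero query the weights are uniform and the result is a plain average over time; a learned query lets the model emphasize the clips that matter most.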

We follow a similar process with audio, where one-second audio clips are extracted, represented as a spectrogram, and then processed analogously to the visual stream to obtain a representation of the audio information in the video. The audio and visual representations are then combined to ultimately produce an overall audio-visual representation of the video. Separately, we process the text with a Transformer model and recurrent neural network to produce a representation for each word, and then we aggregate this information across all the words, similar to the way we aggregate audio and visual information.
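As an illustration of what representing an audio clip as a spectrogram means, the sketch below computes a magnitude spectrogram for a one-second toy signal using a plain discrete Fourier transform. The post does not specify the spectrogram type or its parameters, so the `spectrogram` helper, the 64 Hz sample rate, and the frame size are assumptions chosen to keep the example small; practical pipelines typically use an FFT with windowing at far higher sample rates.

```python
import cmath
import math

def spectrogram(samples, frame_size=16, hop=16):
    """Return a list of frames, each a list of frequency-bin magnitudes."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        bins = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n, x in enumerate(frame))
            bins.append(abs(coeff))
        frames.append(bins)
    return frames

# One second of an 8 Hz tone sampled at a toy rate of 64 Hz.
sample_rate = 64
clip = [math.sin(2 * math.pi * 8 * t / sample_rate) for t in range(sample_rate)]
spec = spectrogram(clip)

# Each bin spans sample_rate / frame_size = 4 Hz, so the 8 Hz tone lands in bin 2.
strongest_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
print(len(spec), "frames,", len(spec[0]), "bins each; strongest bins:", strongest_bins)
```

The resulting time-by-frequency grid can be treated much like an image, which is why it can be processed analogously to the visual stream.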

By using contrastive training (learning what makes two things similar and dissimilar), we can ensure that the video and text encoders produce similar representations for inputs that go together and different representations for unrelated content. Given a text query (e.g., “Show me every time we sang to grandma”), we compute its embedding and find the nearest video neighbors in embedding space. As a next step, we’re now working on scaling this feature up to millions of videos before we can start testing the feature in production. And we’re also working on making this feature more robust and more useful. For instance, people often search phrases like “Show me the news show that was talking about Yosemite.” This requires adding speech recognition output as one of the inputs of the AVT model, along with raw audio and raw visual signals.
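The retrieval step, embedding the query and then finding its nearest videos, can be sketched as a top-k search over a small index. The vectors here are hand-made stand-ins for encoder outputs, and the `search` function and video ids are hypothetical; a system covering millions of videos would likely rely on an approximate nearest-neighbor index rather than this exhaustive scan.

```python
import heapq
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def search(query_embedding, video_index, k=2):
    """Return the ids of the k videos whose embeddings are closest to the query."""
    q = unit(query_embedding)
    scored = ((sum(a * b for a, b in zip(q, unit(emb))), video_id)
              for video_id, emb in video_index.items())
    return [video_id for _, video_id in heapq.nlargest(k, scored)]

# Assumed embeddings in a shared text/video space (illustrative values only).
video_index = {
    "birthday_2019": [0.9, 0.1, 0.0],
    "birthday_2020": [0.8, 0.2, 0.1],
    "beach_trip": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]  # stand-in for the embedding of a birthday-related text query
print(search(query, video_index))  # ['birthday_2019', 'birthday_2020']
```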

What’s next

Smartphone cameras have made it simple to take photos and videos on the fly. In the future, wearables such as AR glasses will make it even easier to capture things hands-free. As this becomes the norm, people should be able to recall specific moments from their vast bank of digital memories just as easily as they capture them. It’ll be valuable to build smarter AI systems that can understand what’s happening in videos on a more granular level.

Our Learning from Videos project signals a paradigm shift in the way machines are able to understand videos, sending us on the path to build smarter AI systems. This work will allow us to move away from AI that requires people to look at and label videos by hand, and will make it possible for us to build AI systems that use the most advanced techniques, such as self-supervision, to improve recommendations, search, and retrieval, and other important applications for everyone on Facebook. As our systems continuously learn, they will become more reliable, efficient, and personalized, so that sharing and rediscovering moments can one day be effortless. We are excited to continue our research in the space as we share more of our findings and work to productionize cutting-edge AI research that improves our core technology systems, unlocking new experiences for the billions of people around the world who use our products and services every day.
