Roughly a year ago, Facebook detailed its work on an AI chatbot called BlenderBot 1.0, which the company claims is the largest-ever project of its kind. In an extension of that work, Facebook today took the wraps off BlenderBot 2.0, which it says is the first chatbot that can build long-term memory while searching the internet for up-to-date information.
Language systems like OpenAI’s GPT-3 and BlenderBot 1.0 can express themselves articulately, at least in the context of an ongoing conversation. But as Facebook notes, they suffer from very short short-term memory and long-term memory limited to what they were taught during training. Moreover, if you tell GPT-3 or BlenderBot 1.0 something today, they’ll forget it by tomorrow. And they’ll confidently state information that isn’t correct, owing to deficiencies in their algorithms. Because they can’t gain additional knowledge, GPT-3 and BlenderBot 1.0 believe that NFL superstar Tom Brady is still on the New England Patriots, for example, and don’t know that he won the 2021 Super Bowl with the Tampa Bay Buccaneers.
By contrast, BlenderBot 2.0 can query the internet, using any search engine, for information about movies, TV shows, and more, and can both read from and write to its long-term local memory store. It also remembers the context of previous discussions, a form of continual learning. So if you talked about Tom Brady with it weeks ago, for example, it could potentially bring up the NFL in future conversations, knowing that’s a topic relevant to you.
“This work combines and refines a number of ideas around retrieval and memory in transformer models and puts them together in a clever way,” Connor Leahy, one of the founding members of EleutherAI, told VentureBeat via email. “The results look impressive, especially when one considers how small these models are compared to the likes of GPT-3. I especially applaud the team for addressing shortcomings and potential problems with their method, and for releasing their code and data publicly, which will help the wider research community to scrutinize and further improve these methods.”
BlenderBot 2.0 uses an AI model based on retrieval-augmented generation, an approach that lets a system incorporate knowledge beyond what’s contained in the conversation when generating responses. During a dialogue, the model, which combines an information retrieval component with a text generator, seeks germane data both in its long-term memory and in documents it finds by searching the internet.
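Conceptually, the retrieve-then-generate loop can be sketched as follows. This is a toy illustration, not Facebook’s implementation: the retriever is a simple word-overlap scorer, the “generator” is a stub standing in for the transformer that actually conditions on the retrieved text, and all function names here are invented.

```python
# Toy sketch of retrieval-augmented generation (illustration only).
# A real system uses a trained neural retriever and a transformer generator;
# this version scores documents by word overlap with the latest user turn.

def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate_response(conversation, documents):
    """Prepend retrieved knowledge to the dialogue context, then generate."""
    knowledge = retrieve(conversation[-1], documents)
    # A real seq2seq model would consume this augmented context and decode
    # a reply; here we simply return the conditioning structure.
    return " | ".join(knowledge + conversation)

docs = [
    "Tom Brady won the 2021 Super Bowl with the Tampa Bay Buccaneers.",
    "The weather in Paris is mild in spring.",
]
print(generate_response(["Did Tom Brady win the Super Bowl?"], docs))
```

The key structural point is the second function: retrieved text is placed in front of the conversation history, so whatever generates the reply sees both.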
A neural network module in BlenderBot 2.0 generates search queries from the conversational context. The chatbot then prepends the retrieved knowledge to the conversational history and takes that knowledge into account when deciding what to write. Effectively, BlenderBot 2.0 reads from its long-term memory store while writing, a process achieved with a module that generates the memories to be stored based on conversational content.
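The read/write memory loop described above can be sketched in the same spirit. Again, this is a hypothetical illustration: the real system uses trained neural modules to decide what to store and how to summarize it, whereas this sketch uses crude keyword heuristics, and the class and method names are invented.

```python
# Hypothetical sketch of a read/write long-term memory store.
# BlenderBot 2.0 uses trained modules for both steps; this uses keywords.

class LongTermMemory:
    def __init__(self):
        self.entries = []

    def write(self, utterance):
        """Store an abstracted summary of the utterance (here: keywords)."""
        keywords = [w for w in utterance.lower().split() if len(w) > 3]
        if keywords:
            self.entries.append(" ".join(keywords))

    def read(self, context, top_k=1):
        """Return the stored memories most relevant to the current context."""
        ctx_words = set(context.lower().split())
        ranked = sorted(
            self.entries,
            key=lambda m: len(ctx_words & set(m.split())),
            reverse=True,
        )
        return ranked[:top_k]

memory = LongTermMemory()
memory.write("I'm a huge fan of the New England Patriots")
print(memory.read("Did you watch the Patriots game?"))
```

The design choice worth noting is that memories are written as abstracted summaries rather than raw transcripts, which is what lets them be useful weeks later in an unrelated conversation.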
In order to train BlenderBot 2.0’s neural networks, Facebook collected data in English using a crowdsourcing platform akin to Amazon Mechanical Turk. One of the resulting datasets, Wizard of the Internet, contains human conversations augmented with new information from internet searches, via the Microsoft Bing API. The other, called Multisession, contains long-context chats in which humans reference information from past conversation sessions.
Wizard of the Internet provides guidance to BlenderBot 2.0 on how to generate relevant search engine queries, as well as how to craft responses based on the search results. Meanwhile, Multisession helps the chatbot decide which fresh knowledge to store in long-term memory and what to write given those memories. In tandem with the Blended Skill Talk dataset, which Facebook created to give BlenderBot 1.0 knowledge and “personality,” Facebook says that Wizard of the Internet and Multisession enable BlenderBot 2.0 to deploy a range of conversational skills within a single chat.
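To make the supervision signal concrete, a single training example from each dataset might look roughly like the following. The field names and contents are invented for illustration and do not reflect the released schemas.

```python
# Invented field names illustrating what each dataset teaches the model.
# These are NOT the released data schemas.

wizard_of_internet_example = {
    "context": ["Have you seen any good sci-fi movies lately?"],
    "search_query": "recent sci-fi movies",  # supervises query generation
    "search_results": ["Blade Runner 2049 is a 2017 science fiction film."],
    "gold_response": "I enjoyed Blade Runner 2049, have you seen it?",
}

multisession_example = {
    "past_sessions": [["I'm a big Patriots fan."]],
    "memory_to_store": ["partner is a Patriots fan"],  # supervises memory writes
    "context": ["Watch any football this weekend?"],
    "gold_response": "I did! Since you're a Patriots fan, did you catch their game?",
}

print(sorted(wizard_of_internet_example))
print(sorted(multisession_example))
```

The two examples mirror the division of labor described above: one pairs dialogue turns with queries and results, the other pairs past sessions with the memories distilled from them.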
Safety and future steps
Even the best language models today exhibit bias and toxicity — it’s well-established that they amplify the gender, race, and religious biases in the data on which they were trained. OpenAI itself notes that biased datasets can lead to placing words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.” A separate paper by Stanford University Ph.D. candidate and Gradio founder Abubakar Abid details the biased tendencies of text generated by GPT-3, like associating the word “Jews” with “money.” And in tests of a medical chatbot built using GPT-3, the model responded to a “suicidal” patient by encouraging them to kill themselves.
In an effort to mitigate this, Facebook says that it implemented “safety recipes” in BlenderBot 2.0 to reduce offensive responses. As measured by an automated classifier, the chatbot was 90% less likely to respond harmfully and 74.5% more likely to give a “safe” response to questions from real people. Facebook also says that, beyond this, its methods alleviate the risk of BlenderBot 2.0 spouting harmful falsehoods “to some extent,” at least compared with previous methods.
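A safety recipe of this kind amounts to checking candidate responses with a classifier before they are sent. The sketch below substitutes a trivial blocklist for the trained safety classifier Facebook describes; the function names and fallback text are invented.

```python
# Toy stand-in for a trained safety classifier: a blocklist over tokens.
# A real "safety recipe" uses a learned classifier, not word matching.
UNSAFE_TERMS = {"kill", "hate"}

def is_safe(response):
    """Flag a response as unsafe if it contains any blocked term."""
    return not (UNSAFE_TERMS & set(response.lower().split()))

def safe_respond(candidates, fallback="Let's talk about something else."):
    """Return the first candidate the classifier accepts, else a fallback."""
    for candidate in candidates:
        if is_safe(candidate):
            return candidate
    return fallback

print(safe_respond(["I hate Mondays", "Mondays can be rough!"]))
```

The automated metrics Facebook reports (90% fewer harmful responses, 74.5% more “safe” responses) are measured with a classifier playing a role analogous to `is_safe` above, only trained rather than hand-written.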
“We know that safety issues are not yet solved, and BlenderBot 2.0’s approach of utilizing the internet and long-term memory to ground conversational responses brings new safety challenges,” Facebook research scientist Jason Weston and research engineer Kurt Shuster wrote in a blog post. “As a research community, we need to address them, and we believe reproducible research on safety, made possible by releases like this, will help the community make important new progress in this area together.”
In experiments, Facebook says that BlenderBot 2.0 outperformed BlenderBot 1.0 when it came to picking up where previous conversation sessions left off, with a 17% improvement in “engagingness” (as scored by human evaluators) and a 55% improvement in the use of previous conversation sessions. Furthermore, BlenderBot 2.0 reduced hallucinations from 9.1% to 3.0% and was factually consistent across a conversation 12% more often.
To spur further research in these directions, Facebook has open-sourced BlenderBot 2.0 and the datasets used to train it, Wizard of the Internet and Multisession. “We think that these improvements in chatbots can advance the state of the art in applications such as virtual assistants and digital friends,” Weston and Shuster wrote. “Until models have deeper understanding, they will sometimes contradict themselves. Similarly, our models cannot yet fully understand what is safe or not. And while they build long-term memory, they don’t truly learn from it, meaning they don’t improve on their mistakes … We look forward to a day soon when agents built to communicate and understand as humans do can see as well as talk.”