



TikTok owner ByteDance’s “Self-Controlled Memory system” can reach into a data bank of hundreds of turns of dialogue and thousands of characters, giving any language model capabilities that, the researchers claim, surpass ChatGPT’s at answering questions about past events.

ByteDance

When you type a prompt into a generative artificial intelligence (AI) program such as ChatGPT, the program responds based not only on what you typed, but also on everything you typed previously.

You can think of that chat history as a kind of memory. But that alone isn’t enough, according to researchers at multiple institutions, who are trying to give generative AI a kind of organized memory that can augment what it produces.

A paper published this month by University of California, Santa Barbara researcher Weizhi Wang and collaborators at Microsoft, titled “Augmenting Language Models with Long-Term Memory” and posted on the arXiv preprint server, adds a new component to language models.

The problem is that ChatGPT and similar programs can’t take in enough text at one time to work with a very long context.

As Wang and team observe, the input length limit of existing LLMs prevents them from generalizing to real-world scenarios where the ability to process long-form information beyond a fixed-size session is important.

For example, OpenAI’s GPT-3 accepts a maximum input of about 2,000 tokens, meaning characters or words. You can’t feed the program a 5,000-word article, say, or a 70,000-word novel.

It’s possible to keep expanding the input “window,” but that runs into a thorny computing problem. The attention operation, the essential tool of all large language programs, including ChatGPT and GPT-4, has “quadratic” computational complexity (see the “time complexity” of computing). That means the amount of time it takes ChatGPT to produce an answer grows as the square of the amount of data it is fed as input, so widening the window sharply increases the compute required.
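To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. It assumes, as a simplification, that the cost of attention is proportional to the number of token-to-token comparisons, and the function names are illustrative only; real systems add other costs and optimizations.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Token-to-token comparisons in full self-attention (simplified model)."""
    return num_tokens * num_tokens


def relative_cost(new_window: int, old_window: int) -> float:
    """How many times more comparisons the new window needs than the old one."""
    return attention_pair_count(new_window) / attention_pair_count(old_window)


if __name__ == "__main__":
    baseline = 2_000  # roughly the GPT-3 input limit cited in the article
    for window in (2_000, 4_000, 65_000):
        print(window, attention_pair_count(window), relative_cost(window, baseline))
```

Under that simplification, doubling the window from 2,000 to 4,000 tokens quadruples the comparisons, and a 65,000-token window needs roughly a thousand times as many as a 2,000-token one.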

So some scholars, Wang and team note, have already tried to devise a crude memory. Yuhuai Wu and colleagues at Google last year introduced what they call the Memorizing Transformer, which stores a copy of previous answers that it can draw on in the future. That process lets it operate on 65,000 tokens at a time.

However, Wang and team caution that the data can become “stale.” As the Memorizing Transformer is trained, its neural weights (parameters) are updated, so some of what is stored in memory falls out of sync with the neural network.

Wang and team’s solution, called “Language Models Augmented with Long-Term Memory,” or LongMem, uses a traditional large language model that does two things. As it scrutinizes the input, it stores some of it in a memory bank. It also passes the output of every current prompt to a second neural network, called the SideNet.

The SideNet is also a language model, just like the first network, and its job is to compare the current prompt typed by a person with the contents of memory to see if there is a relevant match. Unlike the Memorizing Transformer, the SideNet can be trained on its own, apart from the main language model. That way, it gets better and better at picking out contents of memory that won’t be stale.

Wang and team run tests comparing LongMem against both the Memorizing Transformer and OpenAI’s GPT-2 language model. They also compare LongMem with results reported in the literature for other language models, including the 175-billion-parameter GPT-3.
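The division of labor described here, with one component writing to a memory bank and a separate component reading from it, can be sketched roughly as follows. This is a toy illustration, not the LongMem code: the `MemoryBank` class, its fixed capacity with oldest-first eviction, and the dot-product scoring are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as a stand-in relevance score."""
    return sum(x * y for x, y in zip(a, b))


class MemoryBank:
    """Hypothetical bounded store of (key vector, cached text) pairs."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._entries: Deque[Tuple[Tuple[float, ...], str]] = deque(maxlen=capacity)

    def store(self, key: Vector, value: str) -> None:
        """Called by the writer (the main model in this sketch)."""
        self._entries.append((tuple(key), value))

    def retrieve(self, query: Vector, top_k: int = 2) -> List[str]:
        """Called by the reader: return the top_k best-matching cached texts."""
        ranked = sorted(self._entries, key=lambda e: dot(query, e[0]), reverse=True)
        return [value for _, value in ranked[:top_k]]


if __name__ == "__main__":
    bank = MemoryBank(capacity=2)
    bank.store([1.0, 0.0], "chapter one notes")
    bank.store([0.0, 1.0], "chapter two notes")
    bank.store([0.7, 0.7], "chapter three notes")  # evicts chapter one
    print(bank.retrieve([1.0, 0.0], top_k=1))
```

Because the reader only consumes what is in the bank, it can in principle be trained or swapped without touching the writer, which is the property the article attributes to the SideNet.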

University of California, Santa Barbara, Microsoft

They use tasks based on three datasets that involve very long texts, including entire articles and textbooks: Project Gutenberg, the arXiv file server, and ChapterBreak.

To give an idea of the scale of these tasks, ChapterBreak, introduced last year by Simeng Sun and colleagues at the University of Massachusetts Amherst, takes entire books and tests whether a language model, given a single chapter as input, can correctly identify which of several candidate passages is the beginning of the next chapter. Such a task requires a rich understanding of long-range dependencies, such as changes in the place and time of events, and of techniques such as “analepsis,” where the next chapter is a “flashback” to an earlier point in the story.

And it involves processing tens or even hundreds of thousands of tokens.

When Sun and team ran the ChapterBreak test last year, they reported that the dominant language models “struggled.” For example, the large GPT-3 got it right only 28% of the time.

However, according to Wang and team’s report, the LongMem program “surprisingly” outperformed all standard language models, including GPT-3, achieving a state-of-the-art score of 40.5%, even though LongMem has only about 600 million neural parameters, far fewer than GPT-3’s 175 billion.

“The significant improvement in these datasets shows that LONGMEM can understand long historical contexts in cached memory to successfully complete language modeling for future inputs,” write Wang and team.

Microsoft’s research mirrors recent work by ByteDance, the parent company of social media app TikTok.

In a paper posted on arXiv in April titled “Unleashing Infinite-Length Input Capacity for Large-scale Language Models with Self-Controlled Memory System,” ByteDance researcher Xinnian Liang and colleagues describe an add-on program that gives any large language model the ability to store very long sequences of what has been discussed.

In practice, they contend, the program can dramatically improve a language model’s ability to place each new prompt in context and thereby formulate appropriate statements in response, even outperforming ChatGPT.

In the “Self-Controlled Memory system,” or SCM, the input a user types at the prompt is evaluated by a memory controller to see whether it requires dipping into an archival memory system called the memory stream, which contains all the past interactions between the user and the program. This is similar to Wang and team’s SideNet and its accompanying memory bank.

If memory is needed, the collection of past inputs is accessed via a vector database tool such as Pinecone. The user’s input serves as the query, and it is matched for relevance against the content of the database.

Some user queries, such as “Tell me a joke,” do not require memory; that is a random request any language model can handle. But a user prompt such as “Do you remember the conclusions we made last week about fitness diets?” is the kind of thing that requires access to past chat material.
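A rough sketch of that controller-plus-retrieval flow is below. It is illustrative only: the word-count “embedding,” the tiny stopword list, the similarity threshold, and the names `MemoryStream` and `needs_memory` are all assumptions standing in for a real embedding model, a vector database such as Pinecone, and whatever decision procedure the SCM controller actually uses.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "the", "we", "you", "do", "on", "me", "that", "was"}


def embed(text: str) -> Counter:
    """Toy embedding: counts of lowercase words, minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    numerator = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStream:
    """In-memory stand-in for an archive of past turns in a vector database."""

    def __init__(self) -> None:
        self._turns: List[Tuple[Counter, str]] = []

    def add(self, turn: str) -> None:
        self._turns.append((embed(turn), turn))

    def search(self, query: str, top_k: int = 1) -> List[Tuple[float, str]]:
        """Return up to top_k (score, text) pairs, most relevant first."""
        q = embed(query)
        scored = [(cosine(q, vec), text) for vec, text in self._turns]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def needs_memory(query: str, stream: MemoryStream, threshold: float = 0.3) -> bool:
    """Hypothetical controller rule: use memory only if something matches well."""
    hits = stream.search(query, top_k=1)
    return bool(hits) and hits[0][0] >= threshold


if __name__ == "__main__":
    stream = MemoryStream()
    stream.add("Our conclusion on the fitness diets: more protein worked best.")
    stream.add("The weather was rainy on Tuesday.")
    for prompt in ("Tell me a joke",
                   "Do you remember the conclusion we made last week on the fitness diets?"):
        print(prompt, "->", needs_memory(prompt, stream))
```

In this toy version the joke request matches nothing in the archive and skips memory, while the fitness-diet question retrieves the earlier conclusion.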


In a neat twist, the user prompt and the memory it retrieves are combined in what the paper calls “input fusion.” The combined text becomes the actual input to the language model, from which it generates its response.

The end result, write Liang and team, is that SCM can outperform ChatGPT in tasks that involve referring back hundreds of turns earlier in a dialogue. They connected their SCM to a version of GPT-3 called text-davinci-003 and tested how it performed compared to ChatGPT with the same input.
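A minimal sketch of what such a fusion step might look like follows. The prompt template and the function name `fuse_input` are hypothetical, chosen only to illustrate the idea of merging retrieved memory with the new prompt; the paper’s actual template may differ.

```python
from typing import Sequence


def fuse_input(user_prompt: str, memories: Sequence[str]) -> str:
    """Combine retrieved memories with the new prompt into one model input.

    If nothing was retrieved, the prompt is passed through unchanged.
    """
    if not memories:
        return user_prompt
    memory_block = "\n".join(f"- {memory}" for memory in memories)
    return (
        "Relevant earlier conversation:\n"
        f"{memory_block}\n\n"
        "Current user message:\n"
        f"{user_prompt}"
    )


if __name__ == "__main__":
    print(fuse_input("What diet did we pick?", ["We picked the high-protein diet."]))
```

The language model never sees the full history in this arrangement, only the handful of retrieved items plus the new prompt, which is what keeps the input within the window.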


In one series of more than 100 turns, comprising 4,000 tokens, the human asked the machine to recall the hobbies of a person discussed at the beginning of the session. The researchers write that the SCM system returned an accurate response to the query, demonstrating its memory-enhanced capabilities, whereas ChatGPT appeared to be distracted by a substantial amount of irrelevant historical data.

The program can also summarize long texts of thousands of words, such as reports. It does so by iteratively summarizing the text: it saves the first summary to the memory stream, then creates the next summary in combination with the previous one, and so on.
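The iterative procedure can be sketched as a simple loop. This is an illustration under stated assumptions: the chunking by word count, the `iterative_summary` name, and the default `first_words` “summarizer” (which just truncates) are placeholders for a call to an actual language model.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def first_words(text: str, limit: int = 12) -> str:
    """Placeholder summarizer: keep the first few words (not a real summary)."""
    return " ".join(text.split()[:limit])


def iterative_summary(
    text: str,
    chunk_size: int,
    summarize: Callable[[str], str] = first_words,
) -> List[str]:
    """Summarize chunk by chunk, feeding each summary into the next step.

    Returns the memory stream: one running summary per chunk; the last
    entry is the summary of the whole text.
    """
    memory_stream: List[str] = []
    running_summary = ""
    for chunk in chunk_words(text, chunk_size):
        combined = f"{running_summary} {chunk}".strip()
        running_summary = summarize(combined)
        memory_stream.append(running_summary)
    return memory_stream


if __name__ == "__main__":
    report = "alpha beta gamma delta epsilon zeta eta theta"
    print(iterative_summary(report, chunk_size=3, summarize=str.upper))
```

Each step only ever handles one chunk plus the previous summary, so the input to the model stays small no matter how long the original text is.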

SCM can also make large language models that are not chatbots behave like chatbots. “Experimental results show that our SCM system can achieve multi-turn dialogue capabilities comparable to ChatGPT for LLMs not optimized for multi-turn dialogue,” they wrote.

Both Microsoft and TikTok’s efforts can be considered extensions of the original intent of the language model. Before ChatGPT and its predecessor, Google’s Transformer, natural language tasks were often performed by something called a recurrent neural network (RNN). A recurrent neural network is a type of algorithm that can go back to previous input data and compare it with the current input.

The Transformer and LLMs such as ChatGPT replaced RNNs with a simpler approach: attention. Attention automatically compares everything typed with everything typed before, so the past is always taken into account.

Therefore, Microsoft and TikTok’s research efforts simply extend attention, using algorithms explicitly crafted to recall elements of the past in a more organized manner.

Adding memory is such a basic adjustment that it is likely to become a standard aspect of large language models in the future, making it much more common for programs to connect to past material, such as chat histories, or to address the full text of very long works.

