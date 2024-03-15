



Posted by Yun Zhu and Lijuan Liu, Software Engineers, Google Research

Advances in large language models (LLMs) have led to a new paradigm that unifies various natural language processing (NLP) tasks within an instruction-following framework. This paradigm is exemplified by recent multi-task LLMs such as T0, FLAN, and OPT-IML. First, multi-task data is collected, with each task following a task-specific template that converts every labeled example into an instruction paired with its corresponding response. These instruction-response pairs are then used to train an LLM, resulting in a conditional generation model that takes an instruction as input and generates a response. Moreover, multi-task LLMs have shown remarkable task-wise generalization: they can address unseen tasks by understanding and solving brand-new instructions.
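As a rough illustration of the templating step described above, the sketch below renders one labeled example as an instruction-response pair. The template string, the label mapping, and the names `InstructionPair` and `to_instruction_pair` are hypothetical and chosen only for this example; real collections define their own templates per dataset.

```python
from dataclasses import dataclass


@dataclass
class InstructionPair:
    instruction: str
    response: str


# Hypothetical template and label mapping for a binary review task.
TEMPLATE = "Based on this review, would you recommend this product? Review: {text}"
LABEL_TO_RESPONSE = {1: "Yes", 0: "No"}


def to_instruction_pair(example: dict) -> InstructionPair:
    """Render one labeled example as an instruction-response pair."""
    return InstructionPair(
        instruction=TEMPLATE.format(text=example["text"]),
        response=LABEL_TO_RESPONSE[example["label"]],
    )


pair = to_instruction_pair({"text": "It's great even for non-gamers.", "label": 1})
print(pair.instruction)
print(pair.response)
```

Applying several templates to the same dataset yields several differently worded instructions for the same underlying label.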

Illustration of the instruction-following pre-training of multi-task LLMs (e.g., FLAN). Pre-training on tasks under this paradigm improves performance on unseen tasks.

Due to the complexity of understanding and solving different tasks using only instructions, multi-task LLMs typically range in size from several billion to hundreds of billions of parameters (e.g., FLAN-11B, T0-11B, and OPT-IML-175B). As a result, operating such large models poses significant challenges: they demand considerable computational power and impose substantial requirements on the memory capacity of GPUs and TPUs, making training and inference expensive and inefficient. Maintaining a unique LLM copy for each downstream task requires extensive storage. Additionally, the most powerful multi-task LLMs (such as FLAN-PaLM-540B) are closed-source and cannot be adapted. However, in real-world applications, it is still difficult to leverage a single multi-task LLM to manage all possible tasks in a zero-shot manner, especially for complex tasks, personalized tasks, and tasks that cannot be concisely defined using instructions. On the other hand, the size of downstream training data is usually insufficient to train a model well without incorporating rich prior knowledge. Therefore, it has long been desirable to adapt LLMs with downstream supervision while avoiding storage, memory, and access issues.

Certain parameter-efficient tuning strategies, such as prompt tuning and adapters, can significantly reduce storage requirements, but still have high memory demands due to backpropagation through LLM parameters performed during the tuning process. Additionally, some in-context learning techniques avoid parameter tuning by integrating a limited number of supervised samples into the instructions. However, these techniques are constrained by the maximum input length of the model, so only a small number of samples can be used to solve the task.

In “Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer”, presented at NeurIPS 2023, we propose a new approach to improve the performance and efficiency of multi-task LLMs. We introduce Cappy, a lightweight pre-trained scorer built by continual pre-training on top of RoBERTa, with just 360 million parameters. Cappy takes an instruction and a candidate response as input and produces a score between 0 and 1 that indicates the estimated correctness of the response to the instruction. Cappy can work independently on classification tasks or serve as an auxiliary component of an LLM to improve its performance. Additionally, Cappy efficiently enables downstream supervision without fine-tuning the LLM, which avoids backpropagation through LLM parameters and reduces memory requirements. Finally, adaptation with Cappy does not require access to LLM parameters, so it is compatible with closed-source multi-task LLMs that are accessible only via web APIs.

Cappy takes an instruction-response pair as input and outputs a score between 0 and 1 that indicates the estimated correctness of the response to the instruction.

Pre-training

We begin with the same dataset collection that was used to train T0, which includes 39 diverse datasets from PromptSource. This collection covers a wide variety of tasks such as question answering, sentiment analysis, and summarization. Each dataset is associated with one or more templates that convert each instance of the original dataset into an instruction paired with its ground truth response.

Cappy's regression modeling requires each pre-training data instance to include an instruction-response pair along with a correctness annotation for the response, so we produce a dataset with correctness annotations that range from 0 to 1. For every instance in a generation task, we use an existing multi-task LLM to generate multiple responses by sampling, conditioned on the given instruction. We then assign an annotation to each pair formed by the instruction and a sampled response, using the similarity between that response and the ground truth response of the instance. Specifically, we compute this similarity with Rouge-L, a commonly used metric for measuring overall multi-task performance that has demonstrated strong agreement with human ratings, as a form of weak supervision.
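The annotation step can be sketched as follows. This is a minimal illustration, assuming the F1 variant of Rouge-L over lowercased whitespace tokens; the exact Rouge-L variant and tokenization used for Cappy are not stated here. The function names and the example sentences are hypothetical, and the sampled responses are hard-coded rather than produced by an LLM.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """Rouge-L F1 over whitespace tokens; always in [0, 1]."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def annotate(
    instruction: str, reference: str, sampled: List[str]
) -> List[Tuple[str, str, float]]:
    """Pair the instruction with each sampled response and a weak label."""
    return [(instruction, resp, rouge_l(resp, reference)) for resp in sampled]


# Hard-coded stand-ins for responses sampled from a multi-task LLM.
rows = annotate(
    instruction="Describe the scene in one sentence.",
    reference="a skier skis down the mountain",
    sampled=[
        "a skier skis down the mountain",
        "the skier is on the mountain",
        "dogs bark",
    ],
)
for _, response, score in rows:
    print(f"{score:.2f}  {response}")
```

Each resulting triple (instruction, response, score) is one regression instance, so a single labeled example can contribute several training rows with different correctness labels.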

The result is an effective regression dataset of 160 million instances paired with correctness score annotations. The final Cappy model is the result of continual pre-training on this regression dataset, on top of the RoBERTa model. Cappy's pre-training is performed on Google's TPU-v4 using RedCoast, a lightweight toolkit for automating distributed training.

Data augmentation with a multi-task LLM to build a weakly supervised regression dataset for Cappy's pre-training and fine-tuning.

Applying Cappy

Cappy solves practical tasks within a candidate selection mechanism. More specifically, given an instruction and a set of candidate responses, Cappy produces a score for each candidate response. This is done by feeding the instruction together with each individual response into Cappy and taking the response with the highest score as the prediction. In classification tasks, all candidate responses are inherently predefined. For example, for the instruction of a sentiment classification task (e.g., “Based on this review, would you recommend this product? Review: It's great even for non-gamers.”), the candidate responses are “Yes” or “No”. In such scenarios, Cappy works independently. Generation tasks, on the other hand, have no predefined candidate responses, so an existing multi-task LLM is needed to generate them. In this case, Cappy acts as an auxiliary component of the multi-task LLM and enhances its decoding.
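The selection mechanism itself is a simple argmax over candidate scores. The sketch below assumes a scorer with the signature `(instruction, response) -> float`; `select_best` and `toy_scorer` are hypothetical names, and the scores in `TOY_SCORES` are made-up placeholders rather than outputs of Cappy.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[str, str], float]


def select_best(instruction: str, candidates: Sequence[str], scorer: Scorer) -> str:
    """Return the candidate response with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: scorer(instruction, response))


# Made-up scores standing in for a trained scorer; illustration only.
TOY_SCORES: Dict[str, float] = {"Yes": 0.91, "No": 0.07}


def toy_scorer(instruction: str, response: str) -> float:
    """Placeholder scorer that looks up a fixed score per response."""
    return TOY_SCORES.get(response, 0.0)


instruction = (
    "Based on this review, would you recommend this product? "
    "Review: It's great even for non-gamers."
)
prediction = select_best(instruction, ["Yes", "No"], toy_scorer)
print(prediction)
```

For a classification task the candidate list is the fixed label set; for a generation task the same function would be applied to responses sampled from a multi-task LLM.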

Adapting multi-task LLMs with Cappy

When downstream training data is available, Cappy enables effective and efficient adaptation of multi-task LLMs to downstream tasks. Specifically, we fine-tune Cappy to integrate downstream task information into LLM predictions. This process involves creating a separate regression dataset specific to the downstream training data, using the same data annotation process used to construct the pre-training data. As a result, the fine-tuned Cappy works together with the multi-task LLM and improves the LLM's performance on downstream tasks.

In contrast to other LLM tuning strategies, adapting LLMs with Cappy avoids backpropagation through LLM parameters on downstream tasks, which significantly reduces the demand on device memory. Additionally, Cappy adaptation does not rely on access to LLM parameters, making it compatible with closed-source multi-task LLMs that can only be accessed via web APIs. Compared to in-context learning approaches, which avoid model tuning by attaching training samples to the instruction prefix, Cappy is not limited by the LLM's maximum input length. Therefore, Cappy can incorporate an unlimited number of downstream training samples. Cappy can also be used together with other adaptation methods, such as fine-tuning and in-context learning, to further improve overall performance.

Comparison of downstream adaptation between Cappy and approaches that rely on LLM parameters, such as fine-tuning and prompt tuning. Applying Cappy enhances multi-task LLMs.

Results

We evaluate Cappy's performance across 11 held-out language understanding classification tasks from PromptSource. We demonstrate that Cappy, with 360 million parameters, outperforms OPT-175B and OPT-IML-30B and matches the accuracy of the best existing multi-task LLMs (T0-11B and OPT-IML-175B). These findings highlight Cappy's capabilities and parameter efficiency, which are likely due to its scoring-based pre-training strategy: it integrates contrastive information by distinguishing between high-quality and low-quality responses. In contrast, previous multi-task LLMs rely solely on teacher-forcing training that uses only the ground truth responses.

Overall accuracy averaged over 11 test tasks from PromptSource. “RM” refers to the pre-trained RLHF reward model. Cappy matches the best of existing multitasking LLMs.

We also examine the adaptation of multi-task LLMs with Cappy on complex tasks from BIG-Bench, a manually curated set of tasks that are considered beyond the capabilities of many LLMs. We focus on all 45 generation tasks in BIG-Bench, specifically those that do not offer pre-established answer choices. We evaluate performance using the Rouge-L score (representing the overall similarity between model generations and the corresponding ground truths) on every test set and report the average score across the 45 tests. In this experiment, all variants of FLAN-T5 serve as backbone LLMs, and the underlying FLAN-T5 models are frozen. The results, shown below, suggest that Cappy improves the performance of FLAN-T5 models by a large margin, consistently outperforming the most effective baseline, which selects samples using the LLM's own self-scoring.
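For reference, the self-scoring baseline ranks candidates by the LLM's own cross-entropy instead of an external scorer. The sketch below is one plausible reading of that baseline, assuming per-token log-probabilities are available for each candidate and that cross-entropy is averaged per token; whether the original experiments normalize by length is not stated here. The function names and the log-probability values are made up for illustration.

```python
import math
from typing import Dict, List


def mean_cross_entropy(token_logprobs: List[float]) -> float:
    """Average negative log-likelihood per token (lower is better)."""
    if not token_logprobs:
        return math.inf
    return -sum(token_logprobs) / len(token_logprobs)


def self_score_select(candidates: Dict[str, List[float]]) -> str:
    """Pick the candidate to which the LLM assigns the lowest cross-entropy."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return min(candidates, key=lambda resp: mean_cross_entropy(candidates[resp]))


# Made-up per-token log-probabilities for two sampled responses.
sampled = {
    "The skier goes down the slope.": [-0.2, -0.4, -0.3, -0.5, -0.1, -0.3],
    "Skier mountain ski.": [-1.2, -2.0, -1.6],
}
print(self_score_select(sampled))
```

The contrast with Cappy is that this baseline can only reflect what the frozen LLM already believes, whereas a separately trained scorer can be fine-tuned on downstream data.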

Average Rouge-L score across 45 complex tasks in BIG-Bench. The x-axis shows FLAN-T5 models of different sizes. Each dashed line represents an approach that works on top of FLAN-T5. Self-scoring refers to using the LLM's cross-entropy to select responses. Cappy significantly improves the performance of FLAN-T5 models.

Conclusion

We introduce Cappy, a novel approach that improves the performance and efficiency of multi-task LLMs. In our experiments, we use Cappy to adapt a single LLM to several domains. In the future, Cappy as a pre-trained model could potentially be used in other creative ways beyond working with a single LLM.

Acknowledgments

We would like to thank Bowen Tan, Jindong Chen, Lei Meng, Abhanshu Sharma, and Ewa Dominowska for their valuable feedback. We would also like to thank Eric Xing and Zhiting Hu for their suggestions.

