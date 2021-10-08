



Posted by Maarten Bosma, Research Engineer, and Jason Wei, AI Resident, Google Research

For a machine learning model to generate meaningful text, it needs a large amount of knowledge about the world as well as the ability to abstract. Language models trained to do this increasingly acquire this knowledge automatically as they scale, but it is not clear how best to unlock that knowledge and apply it to specific real-world tasks.

One well-established technique for doing this is called fine-tuning: training a pretrained model such as BERT or T5 on a labeled dataset to adapt it to a downstream task. However, fine-tuning requires a large number of training examples, along with a stored copy of the model weights for each downstream task. This is not always practical, particularly for large models.

In “Fine-tuned Language Models Are Zero-Shot Learners”, we explore a simple technique called instruction fine-tuning, or instruction tuning for short. This involves fine-tuning a model not to solve one specific task, but to make it more amenable to solving NLP tasks in general. We use instruction tuning to train a model that we call Fine-tuned LAnguage Net (FLAN). Because the instruction tuning phase of FLAN takes only a small number of updates compared to the large amount of computation involved in pre-training the model, it is the metaphorical dessert to the main course of pre-training. This enables FLAN to perform a variety of unseen tasks.

An illustration of how FLAN works: The model is fine-tuned on disparate sets of instructions and generalizes to unseen instructions. Model performance improves as more types of tasks are added to the fine-tuning data.

One recent popular technique for using language models to solve tasks is called zero-shot or few-shot prompting. This technique formulates a task as text that a language model might have seen during training, and the language model then generates the answer by completing the text. For example, to classify the sentiment of a movie review, a language model might be given the sentence “The movie review ‘best RomCom since Pretty Woman’ is _” and be asked to complete the sentence with either the word “positive” or the word “negative”.
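The completion-style setup described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular model: `build_cloze_prompt` and `classify_by_completion` are hypothetical helper names, and `toy_score` is a keyword-counting stand-in for the likelihood a real language model would assign to each candidate completion.

```python
from typing import Callable, Sequence


def build_cloze_prompt(review: str) -> str:
    """Phrase sentiment classification as a sentence for the model to complete."""
    return f"The movie review '{review}' is"


def classify_by_completion(
    prompt: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate completion that the scorer rates highest."""
    return max(candidates, key=lambda word: score(prompt, word))


def toy_score(prompt: str, completion: str) -> float:
    """Hypothetical stand-in for a language model's score of `completion`
    given `prompt`. It only counts a few cue words and is NOT a real model."""
    cues = {"positive": ("best", "great"), "negative": ("worst", "boring")}
    text = prompt.lower()
    return float(sum(text.count(cue) for cue in cues[completion]))


prompt = build_cloze_prompt("best RomCom since Pretty Woman")
label = classify_by_completion(prompt, ["positive", "negative"], toy_score)
print(prompt)
print(label)
```

In practice `toy_score` would be replaced by a call into an actual language model. The point of the sketch is only that the task is expressed as text completion and the answer is read off from whichever continuation the model prefers.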

Although this technique performs well on some tasks, it requires careful prompt engineering to make each task look like data the model saw during training. This approach works well for some, but not all, tasks, and it can be an unintuitive way for practitioners to interact with the model. For example, the creators of GPT-3 (one of the largest language models in use today) found that such prompting techniques did not yield good performance on natural language inference (NLI) tasks.

Instruction Tuning

FLAN instead fine-tunes the model on a large set of varied instructions that use a simple and intuitive description of the task, such as “Classify this movie review as positive or negative” or “Translate this sentence to Danish”.

Creating a dataset of instructions from scratch to fine-tune the model would require a considerable amount of resources. Therefore, we instead use templates to transform existing datasets into an instructional format.

Example templates for a natural language inference dataset.
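As a rough sketch of the templating idea, the snippet below turns one labeled NLI example into an instruction-style input and target. The template wordings, the answer options, and the label order (0 = entailment, 1 = neutral, 2 = contradiction) are assumptions made for illustration. They are not the templates released with FLAN.

```python
import random
from typing import Dict, List, Union

# Hypothetical instruction templates for an NLI dataset (illustrative wording).
NLI_TEMPLATES: List[str] = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?\nOptions: {options}",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?\nOptions: {options}",
    "Read the following and decide whether the hypothesis can be inferred "
    "from the premise.\nPremise: {premise}\nHypothesis: {hypothesis}\n"
    "Options: {options}",
]

# Assumed label order: 0 = entailment, 1 = neutral, 2 = contradiction.
NLI_ANSWERS: List[str] = ["yes", "it is not possible to tell", "no"]


def to_instruction_example(
    example: Dict[str, Union[str, int]], rng: random.Random
) -> Dict[str, str]:
    """Render one labeled NLI example with a randomly chosen template."""
    template = rng.choice(NLI_TEMPLATES)
    prompt = template.format(
        premise=example["premise"],
        hypothesis=example["hypothesis"],
        options=", ".join(NLI_ANSWERS),
    )
    return {"input": prompt, "target": NLI_ANSWERS[int(example["label"])]}


raw = {
    "premise": "A dog is running through a field.",
    "hypothesis": "An animal is outdoors.",
    "label": 0,
}
converted = to_instruction_example(raw, random.Random(0))
print(converted["input"])
print(converted["target"])
```

Choosing among several templates per example is one way to vary the phrasing of instructions, so that the fine-tuned model is less likely to latch onto a single wording.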

We show that by training a model on these instructions, it not only becomes good at solving the kinds of instructions it saw during training, but also becomes good at following instructions in general.

Model Evaluation

To compare FLAN with other techniques in a meaningful way, we used established benchmark datasets to compare the performance of our model with existing models. We also evaluated how FLAN performs on a dataset without having seen any examples from that dataset during training.

However, if we trained on datasets that were too similar to an evaluation dataset, that could still skew the performance results. For example, training on one question answering dataset might help the model do better on another question answering dataset. For this reason, we group all datasets into clusters by task type and hold out not just the training data for the evaluation dataset, but the entire task cluster to which that dataset belongs.
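The cluster-level hold-out can be sketched as follows. The cluster and dataset names are placeholders invented for this example, and `split_for_evaluation` is a hypothetical helper, not part of the released code.

```python
from typing import Dict, List, Tuple

# Placeholder clusters: task type -> datasets of that type (names are made up).
TASK_CLUSTERS: Dict[str, List[str]] = {
    "natural language inference": ["nli_dataset_a", "nli_dataset_b"],
    "question answering": ["qa_dataset_a", "qa_dataset_b"],
    "translation": ["translation_dataset_a"],
}


def split_for_evaluation(
    clusters: Dict[str, List[str]], eval_dataset: str
) -> Tuple[str, List[str]]:
    """Return the held-out cluster name and the datasets allowed for training.

    Every dataset in the same cluster as `eval_dataset` is excluded from
    training, not only `eval_dataset` itself.
    """
    held_out = next(
        (name for name, datasets in clusters.items() if eval_dataset in datasets),
        None,
    )
    if held_out is None:
        raise KeyError(f"{eval_dataset!r} does not belong to any cluster")
    train = [
        dataset
        for name, datasets in clusters.items()
        if name != held_out
        for dataset in datasets
    ]
    return held_out, train


held_out_cluster, train_datasets = split_for_evaluation(TASK_CLUSTERS, "qa_dataset_a")
print(held_out_cluster)
print(train_datasets)
```

Here, evaluating on `qa_dataset_a` removes `qa_dataset_b` from training as well, so the model cannot benefit from a closely related dataset of the same task type.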

We have grouped the datasets into the following clusters.

Results

We evaluated FLAN on 25 tasks and found that it improves over zero-shot prompting on all but four of them. FLAN outperforms zero-shot GPT-3 on 20 of the 25 tasks, and on some tasks it even outperforms few-shot GPT-3.

We also find that model scale is very important to a model's ability to benefit from instruction tuning. At smaller scales, the FLAN technique actually degrades performance, and only at larger scales does the model become able to generalize from the instructions in the training data to unseen tasks. This may be because models that are too small do not have enough parameters to perform a large number of tasks.

Instruction tuning only improves performance on unseen tasks for models of a certain size.

Conclusion

The FLAN model is not the first to be trained on a set of instructions, but to our knowledge we are the first to apply this technique at scale and to show that it can improve the generalization ability of the model. We hope that the method we have presented will help inspire more research into models that can perform unseen tasks and learn from very little data.

We have also released the code to perform the transformations so that other researchers can reproduce our results and build on them.

Acknowledgments

We thank our collaborators at Google Research: Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le.

