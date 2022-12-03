



Not so good, but we’re getting there

Since the release of GPT-3, several other large language models have been introduced and evaluated in machine translation.

But how well do large language models translate compared with standard encoder-decoder machine translation systems?

Unfortunately, there is no easy answer to this question.

Evaluation of large language models for translation is often too crude and incomplete to draw reliable conclusions.

To get a clearer answer, a research team at Google proposed a first extensive evaluation comparing large language models with state-of-the-art machine translation systems.

The work was published on arXiv (28 Nov) and is authored by David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster.

This blog post reviews their work, shares their comments, and summarizes their findings. We also identify the main reasons why we think the evaluations made in this work are good, and call for better evaluations of our language models.

GPT-3, FLAN, and more recently PaLM are language models that have all demonstrated great ability to perform translations.

The GPT-3 and PaLM papers even claim to achieve translation quality comparable to some standard machine translation systems.

However, these claims were based solely on improvements measured in BLEU, a standard automated evaluation metric for machine translation. BLEU is well known not to correlate well with human judgment.

Better automated metrics and/or manual evaluations are needed to ensure that language models and standard machine translation systems achieve comparable translation quality.
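To make the limitation of overlap-based metrics concrete, here is a minimal sketch of a BLEU-like score. It is a toy simplification (unigrams and bigrams only, one reference, whitespace tokenization) and not the official BLEU or SacreBLEU implementation. The function name toy_bleu and the example sentences are my own assumptions, chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Toy BLEU-like score: clipped n-gram precision with a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat is not on the mat"
wrong_meaning = "the cat is on the mat"        # drops the negation
paraphrase = "there is no cat on the mat"      # same meaning, different words

print(toy_bleu(wrong_meaning, reference))
print(toy_bleu(paraphrase, reference))
```

In this toy setting, the candidate that reverses the meaning scores higher than the faithful paraphrase, simply because it shares more n-grams with the reference. This is the kind of behavior that motivates neural metrics and manual evaluation.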

A proper evaluation of a large language model should also fully report how the prompts were designed and selected.

What is a prompt?

A prompt is the text given as input to a language model. It can contain a description of the task we want the model to perform, as well as examples of the task, such as translations.

The PaLM and GPT-3 papers disclose neither the prompts used for the machine translation experiments nor the strategy used to select the translation examples for the few-shot prompts, even though it is well known that the performance of these models can vary dramatically depending on the prompt.

What is a few-shot prompt?

Large language models are typically not trained for a specific task. Thanks to pre-training, a task can be taught simply by including a few examples of it in the prompt. In this work, for instance, PaLM is shown 5 translation examples.

In their work, Vilar et al. first evaluate different strategies for creating and selecting prompts.

They select the most promising one and then evaluate the translation quality of the language model PaLM and of standard machine translation systems, using BLEURT as the automatic evaluation metric and the MQM framework (Lommel et al., 2014) for manual evaluation.

What is BLEURT?

The BLEURT repository defines it as follows: BLEURT is an evaluation metric for natural language generation. It takes a pair of sentences as input, a reference and a candidate, and returns a score that indicates to what extent the candidate is fluent and conveys the meaning of the reference.

This is a state-of-the-art neural metric that correlates better with human judgment than BLEU for machine translation evaluation.

They use the following prompt design:

[source]: X1
[target]: Y1
...
[source]: Xn
[target]: Yn
[source]: X
[target]:

Here, [source] and [target] are the English names of the source and target languages. Each Xi is an example sentence in the [source] language (the language to translate from), Yi is its corresponding translation in the [target] language, and X is the current sentence we want to translate. After the last line, "[target]:", the language model is expected to produce a translation of X.
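The prompt layout described above can be sketched as a small helper. This is only an illustration: the function name build_few_shot_prompt, the German-English example pairs, and the exact whitespace are my assumptions, since the prompts actually used in the paper are not available to me.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    source_lang: str,
    target_lang: str,
    examples: List[Tuple[str, str]],
    sentence: str,
) -> str:
    """Build a few-shot translation prompt in the [source]/[target] layout."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
    # Final pair: the sentence to translate, with the target side left empty
    # so that the language model continues with the translation.
    lines.append(f"{source_lang}: {sentence}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


# Hypothetical example pairs, for illustration only
prompt = build_few_shot_prompt(
    "German",
    "English",
    [("Guten Morgen.", "Good morning."), ("Vielen Dank.", "Thank you very much.")],
    "Wie geht es dir?",
)
print(prompt)
```

With 5 example pairs instead of 2, this would give a 5-shot prompt like the one shown to PaLM in the paper.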

Vilar et al. assume that prompt design is not critical for few-shot prompting. They do not cite any previous research to support this assumption. I think it depends a lot on the language model, but the assumption probably comes from observations made in preliminary experiments.

Assuming that prompt design isn't important keeps the paper concise and lets it focus on more critical aspects, such as the selection of translation examples for few-shot prompts.

The source for these examples is the translation dataset, which we will refer to as the pool for the rest of this article.

They considered different strategies to select examples from a given pool:

Random selection
k-nearest neighbor (kNN) search

A kNN search retrieves from the pool the k example sentences, with their corresponding translations, that are closest to the sentence to translate.

To measure how close two sentences are, they used two different models to embed the sentences:

Bag-of-words (BOW)
RoBERTa

They kept their kNN search very efficient by using ScaNN.

In their evaluation, they found that kNN with RoBERTa was able to obtain more useful examples than with BOW.
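To illustrate the idea of kNN example selection, here is a minimal brute-force sketch using bag-of-words vectors and cosine similarity. It is not the authors' implementation: they used ScaNN for efficient search, and I am assuming cosine similarity and whitespace tokenization. The function names and the tiny pool are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def bow_vector(sentence: str) -> Counter:
    """Bag-of-words embedding: counts of lowercased whitespace tokens."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_examples(
    query: str, pool: List[Tuple[str, str]], k: int
) -> List[Tuple[str, str]]:
    """Return the k (source, translation) pairs whose source is closest to query."""
    query_vec = bow_vector(query)
    ranked = sorted(
        pool,
        key=lambda pair: cosine(query_vec, bow_vector(pair[0])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical pool of (source sentence, translation) pairs
pool = [
    ("the cat sleeps on the sofa", "die Katze schläft auf dem Sofa"),
    ("stock markets fell sharply today", "die Aktienmärkte fielen heute stark"),
    ("the dog sleeps on the floor", "der Hund schläft auf dem Boden"),
]
selected = knn_examples("the cat sleeps on the floor", pool, k=2)
for source, translation in selected:
    print(source, "->", translation)
```

Brute-force search compares the query with every sentence in the pool, which is fine for a toy pool but too slow for millions of sentence pairs. That is where an approximate nearest neighbor library such as ScaNN becomes useful.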

Note: For the sake of brevity of this article, I will not go into further detail on this aspect of their work.

Translation examples are selected from three different pools.

A large dataset of millions of translations across multiple domains and styles, called WMT full.
A small dataset of thousands of translations in the same domain and style as the data to be translated, called WMT dev.
A tiny dataset of fewer than 200 translated paragraphs, called high-end, which was manually selected and whose domains and styles may not match those of the data to be translated.

Based on empirical experiments with these three pools, the authors found that selecting high-quality translations is of utmost importance. Choosing examples from the high-end pool gives results similar to choosing examples from the WMT dev pool, despite the mismatch in style and domain.

They mainly used evaluation datasets published by WMT21 to conduct evaluations in German-English and Chinese-English language pairs.

English-French results are also shown, but with a much older evaluation dataset that may have been part of the training data of the machine translation systems being evaluated. The authors openly comment on this issue and provide results for this dataset only to follow the original evaluation setup proposed in the PaLM paper. They do not draw definitive conclusions from the English-French results.

Let’s see how they present their main results.

Table by David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, George Foster (Google)

That’s a lot!

But I think they are all necessary and relevant to draw any conclusions from the experiment.

As previously described, MQM and BLEURT are used to report on manual and automated evaluations respectively. BLEU is included here only to have some points for comparison with the results presented in the PaLM paper. They comment that BLEU is misleading and that an evaluation based on BLEU alone would have changed their conclusions.

Interestingly, the authors did not set up the machine translation system themselves, instead opting for publicly available translations submitted to WMT21. This choice allows reproducibility of results.

They also evaluated Google Translate. I usually argue that the evaluation of such black-box systems is unreliable, because we don't know whether they were trained on the evaluation data (i.e., a plausible data leak). The authors thought about this issue and checked directly with the Google Translate team that there was no data leak. While this may seem like a natural step in an evaluation, it is rarely done in practice by researchers evaluating commercial systems.

I also note that they did not copy numbers from previous studies, in contrast to the evaluation reported in the PaLM paper. All scores were computed by the authors themselves, which ensures that they are comparable.

This is one of the most scientifically authoritative evaluations I’ve seen in a machine translation research paper (trust me, I’ve studied over 900 machine translation evaluations). In summary:

Manual evaluation using a state-of-the-art framework (MQM), with detailed counts of translation errors
Automatic evaluation using a state-of-the-art metric (BLEURT)
Statistical significance testing using PERM-BOTH
Confirmation that there is no data leakage
Availability of some of the evaluated translations, for better reproducibility
Verification that the scores are all comparable, i.e., computed with the same metrics, tokenization, etc.
BLEU scores computed with SacreBLEU, which allows comparison in future work

The state-of-the-art systems have a substantial advantage of 1 to 3 BLEURT points over the best PaLM results. This gap is also reflected in their much lower MQM scores (MQM counts translation errors, so lower is better).

In short, standard machine translation systems are far superior to PaLM.

Looking at the reported BLEU scores alone, the differences between the PaLM and WMT21 systems appear striking.

This result contradicts the evaluation performed in the PaLM paper (section 6.5), in which PaLM was found to outperform the best previous systems. The difference is that the evaluation in the PaLM paper was based solely on BLEU, with scores copied from earlier work for older machine translation systems and/or computed with non-comparable tokenizations.

A major limitation of this work is that PaLM is actually trained on documents rather than independent sentences, which the authors admit.

If longer chunks of text were translated by PaLM to gain more context, it’s safe to assume that the difference in translation quality between PaLM and standard machine translation systems would not be as dramatic.

On the other hand, standard machine translation systems are trained on independently considered sentences.

As a result, when running and evaluating the system at the sentence level, the evaluation was biased toward the machine translation system. NOTE: The original assessment done in the PaLM paper was also done at the sentence level.

This evaluation also does not include statistical significance testing for the results computed with automatic metrics. The differences look significant in BLEU, but it's hard to tell at a glance whether the differences in BLEURT are really significant. NOTE: Given the gaps observed in BLEU and MQM scores, I would expect them to be significant, but this needs to be confirmed.
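As an illustration of what such a test could look like for segment-level metric scores, here is a minimal sketch of a generic paired permutation (sign-flip) test. It is not the PERM-BOTH procedure mentioned earlier, and the scores below are made up for the example. The function name and parameters are my own assumptions.

```python
import random
from typing import Sequence


def paired_permutation_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the difference in total (or mean) paired scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same segments.")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two systems are exchangeable, so each
        # per-segment difference keeps or flips its sign with equal probability.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_resamples + 1)


# Hypothetical segment-level scores for two systems on the same 20 segments
system_a = [0.60 + 0.01 * i for i in range(20)]
system_b = [score - 0.05 for score in system_a]
p_value = paired_permutation_test(system_a, system_b)
print(f"p = {p_value:.4f}")
```

A small p-value suggests that the observed difference is unlikely to be due to chance alone. In practice, the test would be run on the real segment-level scores of the two systems being compared.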

To my knowledge, the authors have not published the prompts or the translations generated by PaLM that were used for the evaluation. I hope they will. As Meta AI did for No Language Left Behind, releasing them would allow the research community to study them and to apply other metrics, which would facilitate comparisons in future work.

This is especially necessary as MQM evaluations are costly and difficult to reproduce. They also appear to have used a custom BLEURT model that is not (yet?) publicly available, preventing us from reproducing the BLEURT score.

Standard machine translation systems outperform large language models.

PaLM has been shown to translate better than other comparable models such as FLAN and GPT-3. We can therefore expect the conclusions drawn by Vilar et al. to extend to these other models.

That said, it’s important to remember that machine translation systems and language models are not trained on the same data and have vastly different computational costs.

PaLM is a 540 billion parameter model, but it can translate without being directly trained on translation. Machine translation systems, on the other hand, are much less computationally expensive, but require a large number of translations to train.

