



Posted by Jason Wei, AI Resident, and Dan Garrette, Research Scientist, Google Research

In recent years, pre-trained language models such as BERT and GPT-3 have seen widespread use in natural language processing (NLP). By training on large volumes of text, language models acquire broad knowledge about the world and achieve strong performance on a variety of NLP benchmarks. However, these models are often opaque: it may not be clear why they perform so well, which limits further hypothesis-driven improvement of the models. Hence, a new line of scientific inquiry has arisen: what linguistic knowledge do these models contain?

There are many types of linguistic knowledge that one may want to investigate, but a topic that provides a strong basis for analysis is the subject-verb agreement rule in English, which requires that the grammatical number of a verb agree with that of its subject. For example, the sentence “The dogs run.” is grammatical because “dogs” and “run” are both plural, but “The dogs runs.” is ungrammatical because “runs” is a singular verb.

One framework for assessing the linguistic knowledge of a language model is targeted syntactic evaluation (TSE), in which minimally different pairs of sentences, one grammatical and one ungrammatical, are shown to a model, and the model must determine which one is grammatical. TSE can be used to test knowledge of the English subject-verb agreement rule by having the model judge between two versions of the same sentence: one in which a particular verb is written in its singular form, and one in which the verb is written in its plural form.

With the above context, in “Frequency Effects on Syntactic Rule Learning in Transformers”, published at EMNLP 2021, we investigated how a BERT model’s ability to correctly apply the English subject-verb agreement rule is affected by the number of times the words are seen by the model during pre-training. To test specific conditions, we pre-trained BERT models from scratch on carefully controlled datasets. We found that BERT achieves good performance on subject-verb pairs that do not appear together in the pre-training data, which indicates that it does learn to apply subject-verb agreement. However, the model tends to predict the incorrect form when it is much more frequent than the correct form, which indicates that BERT does not treat grammatical agreement as a rule that must be followed. These results help us better understand the strengths and limitations of pre-trained language models.

Previous Work

Previous work used TSE to measure English subject-verb agreement ability in a BERT model. In this setup, BERT performs a fill-in-the-blank task (for example, “the dog _ across the park”) by assigning probabilities to both the singular and plural forms of a given verb (for example, “runs” and “run”). If the model has correctly learned to apply the subject-verb agreement rule, then it should consistently assign the higher probability to the verb form that makes the sentence grammatically correct.
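The fill-in-the-blank comparison can be sketched in a few lines. This is a minimal illustration and not the evaluation code from the paper: `tse_accuracy`, `toy_verb_prob`, and the probability table are hypothetical names and made-up numbers, and the toy lookup stands in for a real masked language model.

```python
from typing import Callable, Iterable, Tuple

# One TSE item: (sentence with a blank, grammatical form, ungrammatical form)
Item = Tuple[str, str, str]


def tse_accuracy(items: Iterable[Item],
                 verb_prob: Callable[[str, str], float]) -> float:
    """Fraction of items where the grammatical verb form gets the higher probability."""
    items = list(items)
    if not items:
        raise ValueError("need at least one evaluation item")
    correct = sum(
        verb_prob(sentence, good) > verb_prob(sentence, bad)
        for sentence, good, bad in items
    )
    return correct / len(items)


# Hypothetical stand-in for a masked language model: a fixed lookup table
# with invented probabilities, used only to make the sketch runnable.
TOY_PROBS = {
    ("the dog [MASK] across the park", "runs"): 0.30,
    ("the dog [MASK] across the park", "run"): 0.05,
    ("the dogs [MASK] across the park", "run"): 0.20,
    ("the dogs [MASK] across the park", "runs"): 0.25,  # the toy model errs here
}


def toy_verb_prob(sentence: str, verb: str) -> float:
    return TOY_PROBS[(sentence, verb)]


ITEMS = [
    ("the dog [MASK] across the park", "runs", "run"),
    ("the dogs [MASK] across the park", "run", "runs"),
]

accuracy = tse_accuracy(ITEMS, toy_verb_prob)
print(accuracy)  # 0.5: one of the two toy items is scored correctly
```

With a real model, `verb_prob` would return the probability the model assigns to the verb at the masked position; only the comparison between the two forms matters, not the absolute values.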

This previous work evaluated BERT using both natural sentences (drawn from Wikipedia) and nonce sentences, which are artificially constructed to be grammatically valid but semantically nonsensical, as in Noam Chomsky’s famous example, “Colorless green ideas sleep furiously.” Nonce sentences are useful when testing syntactic abilities because the model cannot simply fall back on superficial corpus statistics. For example, “dogs run” is much more common than “dogs runs”, but “dogs publish” and “dogs publishes” are both very rare, so a model is unlikely to have simply memorized that one of them is more likely than the other.

BERT achieves over 80% accuracy on nonce sentences (far better than the random-chance baseline of 50%), which was taken as evidence that the model had learned to apply the subject-verb agreement rule. In our paper, we went beyond this previous work by pre-training BERT models under specific data conditions, which allowed us to dig deeper into these results and see how particular patterns in the pre-training data affect performance.

Unseen Subject-Verb Pairs

First, we looked at how well the model performs on subject-verb pairs that were seen during pre-training, compared with examples in which the subject and verb were never seen together in the same sentence.

BERT’s error rate on natural and nonce evaluation sentences, stratified by whether a particular subject-verb (SV) pair was seen in the same sentence during training. BERT’s performance on unseen SV pairs is far better than simple heuristics such as picking the more frequent verb form or picking the more frequent SV pair.

BERT’s error rate increases slightly for unseen subject-verb (SV) pairs, for both natural and nonce evaluation sentences, but it is still much better than naive heuristics, such as picking the verb form that occurred more often in the pre-training data or picking the verb form that occurred more often with the subject noun. This tells us that BERT is not just reflecting back what it saw during pre-training. Making decisions based on more than raw frequencies and generalizing to novel subject-verb pairs are indications that the model has learned to apply some underlying rule concerning subject-verb agreement.
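The two frequency heuristics mentioned above can be written down concretely. The following is a rough sketch under simplifying assumptions: whitespace tokenization, a tiny made-up corpus, and hypothetical function names (`count_corpus`, `frequent_verb_baseline`, `frequent_pair_baseline`). It is not the baseline code from the paper.

```python
from collections import Counter


def count_corpus(sentences, subjects, verb_forms):
    """Count verb forms and same-sentence (subject, verb) pairs."""
    verb_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokens
        verbs_here = [t for t in tokens if t in verb_forms]
        subjects_here = {t for t in tokens if t in subjects}
        verb_counts.update(verbs_here)
        # A pair counts once per sentence in which both words occur.
        for subject in subjects_here:
            for verb in set(verbs_here):
                pair_counts[(subject, verb)] += 1
    return verb_counts, pair_counts


def frequent_verb_baseline(singular, plural, verb_counts):
    """Pick whichever verb form is more frequent overall (ties go to singular)."""
    return singular if verb_counts[singular] >= verb_counts[plural] else plural


def frequent_pair_baseline(subject, singular, plural, pair_counts):
    """Pick the form seen more often with this subject (ties go to singular)."""
    if pair_counts[(subject, singular)] >= pair_counts[(subject, plural)]:
        return singular
    return plural


# Made-up toy corpus for illustration only.
CORPUS = [
    "the dog runs home",
    "the dog runs fast",
    "the cat runs away",
    "the dogs run home",
]
verb_counts, pair_counts = count_corpus(
    CORPUS, subjects={"dog", "dogs", "cat"}, verb_forms={"run", "runs"}
)

# "runs" is more frequent overall, so the first heuristic picks it even for "dogs".
print(frequent_verb_baseline("runs", "run", verb_counts))
# The pair heuristic gets "dogs" right here because that pair was seen.
print(frequent_pair_baseline("dogs", "runs", "run", pair_counts))
```

Note that the pair heuristic has nothing to go on for an unseen pair (both counts are zero, so it falls back to the tie-break), which is one reason unseen pairs are a useful test of generalization.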

Verb Frequency

Next, we examined how the frequency of a word affects BERT’s ability to use it correctly with the subject-verb agreement rule. For this study, we chose a set of 60 verbs and created several versions of the pre-training data, each engineered to contain the 60 verbs at a specific frequency, with the singular and plural forms appearing the same number of times. We then trained BERT models on these different datasets and evaluated them on the subject-verb agreement task.

BERT’s ability to follow the subject-verb agreement rule depends on the frequency of verbs in the training set.

These results indicate that although BERT is able to model the subject-verb agreement rule, it needs to see a verb about 100 times before it can reliably use it with the rule.
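This post does not describe how the frequency-controlled datasets were built, so the following is only one possible approach, sketched under stated assumptions: sentences are dropped at random until each target verb form occurs at most a fixed number of times. The function name `cap_verb_frequency`, the whitespace tokenization, and the toy corpus are all hypothetical.

```python
import random
from collections import Counter


def cap_verb_frequency(sentences, verb_forms, target, seed=0):
    """Keep a random subset of sentences so each verb form occurs at most `target` times."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)

    counts = Counter({form: 0 for form in verb_forms})
    kept = []
    for sentence in shuffled:
        # Occurrences of target verb forms in this sentence (whitespace tokens).
        needed = Counter(t for t in sentence.lower().split() if t in verb_forms)
        # Keep the sentence only if no form would exceed the target count.
        if all(counts[form] + n <= target for form, n in needed.items()):
            kept.append(sentence)
            counts.update(needed)
    return kept, dict(counts)


# Made-up toy corpus: 5 sentences with "runs", 3 with "run", 2 with neither.
TOY_CORPUS = (
    [f"the dog runs to place {i}" for i in range(5)]
    + [f"the dogs run to place {i}" for i in range(3)]
    + ["the park is quiet", "it rained today"]
)

kept, counts = cap_verb_frequency(TOY_CORPUS, {"run", "runs"}, target=2)
print(counts)     # each form is capped at 2 occurrences
print(len(kept))  # 6: two sentences per form plus the two without a target verb
```

This sketch only caps frequencies; a form reaches exactly `target` occurrences only when the source corpus contains at least that many. Sentences without any target verb are always kept.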

Relative Frequency Between Verb Forms

Finally, we wanted to understand how the relative frequencies of the singular and plural forms of a verb affect BERT’s predictions. For example, if one form of a verb (such as “combat”) appears in the pre-training data much more often than the other form (such as “combats”), then BERT might be more likely to assign a high probability to the more frequent form, even when it is grammatically incorrect. To evaluate this, we again used the same 60 verbs, but this time we created manipulated versions of the pre-training data in which the frequency ratio between verb forms varied from 1:1 to 100:1. The following figure shows BERT’s performance at these varying levels of frequency imbalance.

As the frequency ratio between verb forms in the training data becomes more imbalanced, BERT’s ability to use those verbs grammatically decreases.

These results show that BERT achieves good accuracy at predicting the correct verb form when the two forms are seen the same number of times during pre-training, but the results get worse as the imbalance between the frequencies increases. This implies that even though BERT has learned how to apply subject-verb agreement, it does not necessarily use it as a “rule”, and instead prefers to predict high-frequency words regardless of whether they violate the subject-verb agreement constraint.

Conclusion

Using TSE to evaluate BERT’s performance reveals its linguistic abilities on syntactic tasks. Moreover, studying its syntactic ability in relation to how often words appear in the training dataset reveals how BERT handles competing priorities: it knows that subjects and verbs should agree and that high-frequency words are more likely, but it does not understand that agreement is a rule that must be followed, while frequency is only a preference. We hope that this work provides new insight into how language models reflect properties of the datasets on which they are trained.

Acknowledgments

It was a privilege to work with Tal Linzen and Ellie Pavlick on this project.

