



Posted by Siamak Shakeri, Staff Software Engineer, and Oshin Agarwal, Research Intern, Google Research

Large pre-trained natural language processing (NLP) models such as BERT, RoBERTa, GPT-3, T5, and REALM leverage natural language corpora that are derived from the web and are fine-tuned on task-specific data. They have made great strides on a variety of NLP tasks. However, natural language text alone covers a limited range of knowledge, and the same fact can be expressed in many different ways within wordy sentences. In addition, false information or toxic content in the text can ultimately bias the resulting models.

An alternative source of information is the knowledge graph (KG), which consists of structured data. KGs are factual by nature: their information is usually extracted from more reliable sources, and post-processing filters and human editors ensure that inappropriate and inaccurate content is removed. Models that can incorporate KGs therefore stand to gain better factual accuracy and lower toxicity. However, because KGs have a different structural format, they are difficult to integrate with the existing pre-training corpora of language models.

In “Knowledge Graph-Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-Training” (KELM), accepted at NAACL 2021, we convert a KG into synthetic natural language sentences to augment existing pre-training corpora. This allows the KG to be integrated into the pre-training of language models without any changes to the model architecture. To do this, we take the publicly available English Wikidata KG and convert it into natural language text to create a synthetic corpus. We then augment REALM, a retrieval-based language model, with the synthetic corpus as a way of integrating the natural language corpus and the KG in pre-training. The corpus is publicly available to the wider research community.

Converting KG to Natural Language Text

KGs consist of factual information expressed explicitly in a structured format, generally as [subject entity, relation, object entity] triples, for example [10×10 photobooks, inception, 2012]. A group of related triples is called an entity subgraph. An example entity subgraph that builds on the previous triple is { [10×10 photobooks, instance of, Nonprofit Organization], [10×10 photobooks, inception, 2012] }, which is illustrated in the figure below. A KG can be viewed as a set of interconnected entity subgraphs.
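As a rough illustration of these two notions, the sketch below represents triples as small records and groups them into per-entity collections. Grouping by shared subject is a simplifying assumption made here for illustration; the post only says that a subgraph is a group of related triples. The names `Triple` and `group_by_subject` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One KG fact: [subject entity, relation, object entity]."""
    subject: str
    relation: str
    obj: str


def group_by_subject(triples):
    """Collect triples that share a subject into one group per entity.

    Assumption: "related" is approximated as "same subject entity".
    """
    subgraphs = defaultdict(list)
    for triple in triples:
        subgraphs[triple.subject].append(triple)
    return dict(subgraphs)


triples = [
    Triple("10×10 photobooks", "instance of", "Nonprofit Organization"),
    Triple("10×10 photobooks", "inception", "2012"),
]
subgraphs = group_by_subject(triples)
```

Here `subgraphs` maps the single subject entity to a list holding both of its triples, which mirrors the example entity subgraph above.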

Converting subgraphs into natural language text is a standard NLP task known as data-to-text generation. Although data-to-text generation has made great strides on benchmark datasets such as WebNLG, converting an entire KG into natural text poses additional challenges. The entities and relations in a large KG are far more numerous and diverse than those in small benchmark datasets. Moreover, benchmark datasets consist of predefined subgraphs that can form fluent, meaningful sentences. For an entire KG, such a segmentation into entity subgraphs has to be created as well.

An example of how the pipeline converts an entity subgraph (in bubbles) into synthetic natural sentences (far right).

To convert the Wikidata KG into synthetic natural sentences, we developed a verbalization pipeline named “Text from KG Generator” (TEKGEN). It consists of the following components: a large training corpus of heuristically aligned Wikipedia text and Wikidata KG triples, a text-to-text generator (T5) that converts KG triples to text, an entity subgraph creator that generates groups of triples to be verbalized together, and finally a post-processing filter that removes low-quality output. The result is a corpus that contains the entire Wikidata KG as natural text, which we call the Knowledge-Enhanced Language Model (KELM) corpus. It consists of roughly 18 million sentences spanning roughly 45 million triples and roughly 1500 relations.
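The shape of such a pipeline (group, verbalize, filter) can be sketched in a few lines. This is only a skeleton under stated assumptions: `template_verbalizer` is a fixed-template stand-in for the T5 generator, and `mentions_all_objects` is a hypothetical filter invented for this example, not the filter used in TEKGEN.

```python
from typing import Callable, List, Tuple

RawTriple = Tuple[str, str, str]  # (subject, relation, object)


def template_verbalizer(subgraph: List[RawTriple]) -> str:
    """Stand-in for the T5 generator: joins triples with a fixed template."""
    subject = subgraph[0][0]
    facts = [f"{relation} {obj}" for _, relation, obj in subgraph]
    return f"{subject}: " + "; ".join(facts) + "."


def mentions_all_objects(subgraph: List[RawTriple], sentence: str) -> bool:
    """Hypothetical quality filter: keep output only if every object appears."""
    return all(obj in sentence for _, _, obj in subgraph)


def verbalize_kg(
    subgraphs: List[List[RawTriple]],
    verbalize: Callable[[List[RawTriple]], str] = template_verbalizer,
    keep: Callable[[List[RawTriple], str], bool] = mentions_all_objects,
) -> List[str]:
    """Verbalize each entity subgraph and drop outputs that fail the filter."""
    corpus = []
    for subgraph in subgraphs:
        if not subgraph:
            continue  # nothing to verbalize
        sentence = verbalize(subgraph)
        if keep(subgraph, sentence):
            corpus.append(sentence)
    return corpus


example_subgraphs = [[
    ("10×10 photobooks", "instance of", "Nonprofit Organization"),
    ("10×10 photobooks", "inception", "2012"),
]]
corpus = verbalize_kg(example_subgraphs)
```

Passing the generator and the filter as parameters keeps the skeleton independent of any particular model, so a real text-to-text model could be swapped in for the template function.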

Converting the KG to natural language, which is then used to augment the language model.

Integrating Knowledge Graph and Natural Text for Language Model Pre-Training

Our evaluation shows that KG verbalization is an effective way to integrate a KG with natural language text. We demonstrate this by augmenting the retrieval corpus of REALM, which otherwise contains only Wikipedia text.

To assess the effectiveness of verbalization, we augment the REALM retrieval corpus with the KELM corpus (that is, “verbalized triples”) and compare its performance against augmentation with concatenated triples that have not been verbalized. We measure accuracy with each augmentation technique on two popular open-domain question answering datasets, Natural Questions and Web Questions.
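The difference between the two augmentations can be made concrete with a small sketch. The flattening format used for concatenated triples, the placeholder Wikipedia passage, and the verbalized sentence are all assumptions for illustration; in particular, the sentence is hand-written and is not actual KELM output.

```python
def concatenate_triples(subgraph, sep=", "):
    """Unverbalized baseline: flatten each triple to 'subject relation object'."""
    return sep.join(" ".join(triple) for triple in subgraph)


subgraph = [
    ("10×10 photobooks", "instance of", "Nonprofit Organization"),
    ("10×10 photobooks", "inception", "2012"),
]

# Placeholder standing in for the original Wikipedia-only retrieval corpus.
wikipedia_passages = ["<Wikipedia passage text>"]

concatenated_doc = concatenate_triples(subgraph)
# Hand-written illustration of a verbalized subgraph (not KELM output).
verbalized_doc = "10×10 photobooks is a nonprofit organization founded in 2012."

# Two alternative retrieval corpora to compare on the same QA datasets.
corpus_with_triples = wikipedia_passages + [concatenated_doc]
corpus_with_verbalized = wikipedia_passages + [verbalized_doc]
```

Both corpora carry the same two facts; only the surface form of the added document differs, which is the variable the comparison isolates.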

Augmenting REALM even with concatenated triples improves accuracy, potentially by adding information that is not expressed in the text explicitly, or at all. However, augmentation with verbalized triples allows a smoother integration of the KG with the natural language text corpus, as evidenced by its higher accuracy. We also observed the same trend on a knowledge probe called LAMA, which queries the model using fill-in-the-blank questions.
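A fill-in-the-blank probe of this kind can be pictured as follows. This is a minimal sketch in the spirit of such probes, not the LAMA implementation: the template string, the `[MASK]` token, the helper names, and the dummy model are all hypothetical.

```python
def make_cloze(template, subject, mask="[MASK]"):
    """Turn a fact template into a fill-in-the-blank query for one subject."""
    return template.format(subject=subject, obj=mask)


def probe_accuracy(examples, predict):
    """Fraction of queries for which the predicted filler equals the answer."""
    correct = sum(predict(query) == answer for query, answer in examples)
    return correct / len(examples)


def dummy_model(query):
    """Stand-in for a language model; always answers '2012'."""
    return "2012"


query = make_cloze("{subject} was founded in {obj}.", "10×10 photobooks")
examples = [(query, "2012")]
accuracy = probe_accuracy(examples, dummy_model)
```

In a real probe, `predict` would wrap the model under test and return its top-ranked token for the masked position.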

Conclusion

With KELM, we provide a publicly available corpus of a KG as natural text. We show that KG verbalization can be used to integrate a KG with a natural text corpus and overcome their structural differences. This has real-world applications for knowledge-intensive tasks, such as question answering, where providing factual knowledge is essential. In addition, such corpora can be applied to the pre-training of large language models, where they can potentially reduce toxicity and improve factuality. We hope that this work will encourage further progress in integrating structured knowledge sources into the pre-training of large language models.

Acknowledgments

This work is a collaboration of Oshin Agarwal, Heming Ge, Siamak Shakeri and Rami Al-Rfou. Thanks to William Woods, Jonni Kanerva, Tania Rojas-Esponda, Jianmo Ni, Aaron Cohen and Itai Rolnick for rating a sample of the synthetic corpus to evaluate its quality. We would also like to thank Kelvin Guu for his valuable feedback on the paper.

