Date: 2021-05-27



This repository maintains the code for the SMITH (Siamese Multi-depth Transformer-based Hierarchical Encoder) model. This model can be used for long-form document matching tasks.

Preface

Many natural language processing and information retrieval problems can be formalized as semantic matching tasks. Existing research in this area has largely focused on matching between short texts (e.g., question answering) or between short and long texts (e.g., ad hoc search). Semantic matching between long-form documents, which has many important applications such as news recommendation, related article recommendation, and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention based models like Transformers and BERT have achieved state-of-the-art performance in text matching tasks. These models, however, are still limited to short texts such as a few sentences or one paragraph, because the computational complexity of self-attention is quadratic in the length of the input text. To address this issue, we proposed the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models to longer text inputs, and it shows promising results for long-form document matching on several benchmark datasets. In this repository, we release the model implementation and the pre-trained model checkpoints.
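To make the quadratic cost concrete, the following sketch counts attention score pairs per layer for one flat sequence versus a two-level layout in which attention runs inside fixed-size sentence blocks and then across one representation per block. The sizes (32 tokens per block, 64 blocks) mirror the length limits used in the preprocessing commands in this README. The counting model itself is a simplifying assumption that ignores heads, hidden size, and layer counts, and the function names are illustrative only.

```python
def flat_attention_pairs(num_tokens):
    """Token pairs scored by full self-attention over one flat sequence."""
    return num_tokens * num_tokens


def hierarchical_attention_pairs(block_len, num_blocks):
    """Pairs scored when attention runs inside each block, then across blocks.

    Assumes a single representation per block at the document level.
    """
    within_blocks = num_blocks * block_len * block_len
    across_blocks = num_blocks * num_blocks
    return within_blocks + across_blocks


block_len, num_blocks = 32, 64  # illustrative sizes
flat = flat_attention_pairs(block_len * num_blocks)
hier = hierarchical_attention_pairs(block_len, num_blocks)
print(flat, hier, round(flat / hier, 1))  # prints: 4194304 69632 60.2
```

Under these assumptions, a 2048-token document needs roughly 60 times fewer attention scores per layer in the two-level layout than in the flat one.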

Checkpoints for pre-trained models

We have uploaded pre-trained model checkpoints for SMITH models to Google Cloud Storage. You can download the model checkpoints by following these steps.

Usage dependencies and settings

The main dependencies include the following packages:

    python 3.7
    tensorflow 1.15
    nltk 3.5+
    tqdm 4.50.1+
    numpy 1.13.3+
    tf_slim 1.1.0
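One way to check an existing environment against this list is sketched below. It reads installed versions through importlib.metadata (with the importlib_metadata backport as a fallback on Python 3.7). The distribution names are taken from the list above and may need adjusting for your package index, a trailing "+" is treated as "at least this version", and the helper names are hypothetical rather than part of the released code.

```python
import re

try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    import importlib_metadata as metadata

# (listed version, True if the list marks it with "+", meaning "at least")
REQUIREMENTS = {
    "tensorflow": ("1.15", False),
    "nltk": ("3.5", True),
    "tqdm": ("4.50.1", True),
    "numpy": ("1.13.3", True),
    "tf_slim": ("1.1.0", False),
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '1.15.0rc1' into (1, 15, 0) using the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, required, at_least):
    """Compare versions: minimum check for '+', prefix match otherwise."""
    have, want = version_tuple(installed), version_tuple(required)
    if at_least:
        return have >= want
    return have[:len(want)] == want


for name, (required, at_least) in REQUIREMENTS.items():
    found = installed_version(name)
    if found is None:
        status = "not installed"
    elif satisfies(found, required, at_least):
        status = "ok"
    else:
        status = "version mismatch"
    print(f"{name}: found {found}, listed {required}{'+' if at_least else ''} -> {status}")
```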

Here is how to set up the environment (tested on Debian Linux). After downloading or cloning the code, run the following commands to set up a python3.7 virtual environment and install the required dependencies with pip.

    # *** From the folder google-research/ ***
    sudo apt install python3.7 python3-venv python3.7-venv
    python3.7 -m venv py37-venv
    . py37-venv/bin/activate
    pip install -r smith/requirements.txt

We use nltk's sentence tokenizer to get sentence boundary information during data preprocessing, so you also need to install the nltk data resource by running the following commands:

    python3
    >>> import nltk
    >>> nltk.download('punkt')

Next, you need to set up Protocol Buffers. We use protocol buffers to define the model configuration (experiment_config.proto) and the input raw text data, including document pairs (wiki_doc_pair.proto). See this page for instructions on installing the Protocol Buffers compiler. To install a release of the protocol compiler from pre-compiled binaries, run the following commands:

    PB_REL="https://github.com/protocolbuffers/protobuf/releases"
    curl -LO $PB_REL/download/v3.13.0/protoc-3.13.0-linux-x86_64.zip
    unzip protoc-3.13.0-linux-x86_64.zip -d $HOME/.local
    export PATH="$PATH:$HOME/.local/bin"

After the Protocol Buffers compiler is installed, you can type protoc in the terminal to check whether the installation succeeded.

Then run the following commands to generate the Python files from the proto definition files.

    # *** From the folder google-research/ ***
    protoc smith/wiki_doc_pair.proto --python_out=.
    protoc smith/experiment_config.proto --python_out=.

With these commands, both the input proto files and the generated Python files are placed under smith, which is the root path of the released code. For more information on protoc and protocol buffers, see this tutorial.
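As a quick sanity check after running protoc, the helper below computes where the generated modules should appear. It relies on the standard protoc convention that foo.proto yields foo_pb2.py under the --python_out directory; the function name is hypothetical and not part of the released code.

```python
from pathlib import PurePosixPath


def generated_module_path(proto_path, python_out="."):
    """Path of the module protoc writes for a .proto file given --python_out."""
    proto = PurePosixPath(proto_path)
    module = proto.with_name(proto.stem + "_pb2.py")
    return str(PurePosixPath(python_out) / module)


for proto_file in ("smith/wiki_doc_pair.proto", "smith/experiment_config.proto"):
    print(generated_module_path(proto_file))
# prints:
# smith/wiki_doc_pair_pb2.py
# smith/experiment_config_pb2.py
```

If both files exist after the protoc step, the definitions should be importable from the google-research folder as smith.wiki_doc_pair_pb2 and smith.experiment_config_pb2.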

Then run the following commands to verify that all the Python test cases pass in your environment.

    # *** From the folder google-research/ ***
    python -m smith.loss_fns_test
    python -m smith.metric_fns_test
    python -m smith.modeling_test
    python -m smith.preprocessing_smith_test

If all the test cases pass, your environment is set up successfully.

Data preprocessing

Before you start model training, you need to preprocess your data. We take the GWikiMatch data as an example. Download the GWikiMatch data by following the data README file. Then download the BERT-Base checkpoint, which contains the BERT vocabulary file, by running the following commands:

    wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
    unzip uncased_L-12_H-768_A-12.zip

After downloading the GWikiMatch data and the BERT checkpoint, you can start data preprocessing for the SMITH model by running the following command:

    # *** From the folder google-research/ ***
    python -m smith.preprocessing_smith \
      --input_file=/path/to/input_file \
      --output_file=/path/to/output_file \
      --vocab_file=/path/to/bert/checkpoint/uncased_L-12_H-768_A-12/vocab.txt \
      --max_sent_length_by_word=32 \
      --max_doc_length_by_sentence=64 \
      --max_predictions_per_seq=0 \
      --add_masks_lm=false

With the GWikiMatch data and the BERT vocabulary file in the path /tmp/data, the command looks like this:

    # *** From the folder google-research/ ***
    export DATA_DIR=/tmp/data/
    python -m smith.preprocessing_smith \
      --input_file=${DATA_DIR}gwikimatch_v2_human_neg_1.eval_external_wdp.tfrecord \
      --output_file=${DATA_DIR}gwikimatch_v2_human_neg_1.eval_external_wdp_smith_32_64_false.tfrecord \
      --vocab_file=${DATA_DIR}uncased_L-12_H-768_A-12/vocab.txt \
      --max_sent_length_by_word=32 \
      --max_doc_length_by_sentence=64 \
      --max_predictions_per_seq=0 \
      --add_masks_lm=false

You can use these commands to preprocess the train/eval/test partitions of the GWikiMatch data.
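Since the same command is repeated for each partition, it can help to generate the three invocations from one template. The sketch below only builds and prints the command lines using the flags shown in this section; the helper name and the input file names are hypothetical placeholders, so substitute the actual file of each partition.

```python
import shlex


def build_preprocess_command(input_file, output_file, vocab_file,
                             max_sent_length_by_word=32,
                             max_doc_length_by_sentence=64,
                             max_predictions_per_seq=0,
                             add_masks_lm=False):
    """Return the argument list for one smith.preprocessing_smith run."""
    return [
        "python", "-m", "smith.preprocessing_smith",
        f"--input_file={input_file}",
        f"--output_file={output_file}",
        f"--vocab_file={vocab_file}",
        f"--max_sent_length_by_word={max_sent_length_by_word}",
        f"--max_doc_length_by_sentence={max_doc_length_by_sentence}",
        f"--max_predictions_per_seq={max_predictions_per_seq}",
        f"--add_masks_lm={'true' if add_masks_lm else 'false'}",
    ]


data_dir = "/tmp/data/"
vocab_file = data_dir + "uncased_L-12_H-768_A-12/vocab.txt"

# Hypothetical input names: substitute the real file of each partition.
for split in ("train", "eval", "test"):
    command = build_preprocess_command(
        input_file=f"{data_dir}my_pairs.{split}.tfrecord",
        output_file=f"{data_dir}my_pairs.{split}.smith_32_64_false.tfrecord",
        vocab_file=vocab_file,
    )
    # Print a copy-pasteable shell line instead of executing anything.
    print(" ".join(shlex.quote(arg) for arg in command))
```

To run the commands rather than print them, each argument list can be passed to subprocess.run from the google-research folder.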

Fine-tuning with SMITH

We use model configuration files to maintain the model and data settings used during model training and evaluation. The model configuration is defined in experiment_config.proto, and all sample configuration files are located in the config folder. The following shows how to run fine-tuning experiments with the SMITH model, taking SMITH-Short and SMITH-WP+SP from the paper as examples. Note that SMITH-Short and SMITH-WP+SP are two model variations; SMITH-WP+SP is the best model variation we have pre-trained.

SMITH-Short is a model variation that loads the BERT-Base checkpoint and fine-tunes the model with only the document matching loss. The following is an example configuration file for SMITH-Short:

    encoder_config {
      model_name: "smith_dual_encoder"
      init_checkpoint: "/tmp/data/uncased_L-12_H-768_A-12/bert_model.ckpt"
      bert_config_file: "/tmp/data/config/bert_config.json"
      doc_bert_config_file: "/tmp/data/config/doc_bert_3l_768h_config.json"
      vocab_file: "/tmp/data/uncased_L-12_H-768_A-12/vocab.txt"
      max_seq_length: 32
      max_predictions_per_seq: 5
      max_sent_length_by_word: 32
      max_doc_length_by_sentence: 64
      ...
    }
    train_eval_config {
      input_file_for_train: "/tmp/data/gwikimatch_v2_human_neg_1.train.smith_msenl_32_mdl_64_lm_false.tfrecord"
      input_file_for_eval: "/tmp/data/gwikimatch_v2_human_neg_1.eval_external_wdp_smith_32_64_false.tfrecord"
      train_batch_size: 32
      eval_batch_size: 32
      predict_batch_size: 32
      max_eval_steps: 54
      save_checkpoints_steps: 10
      iterations_per_loop: 10
      eval_with_eval_data: true
      neg_to_pos_example_ratio: 1.0
    }
    loss_config {
      similarity_score_amplifier: 6.0
    }
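The sample configuration hard-codes paths under /tmp/data, which will usually differ on another machine. A minimal sketch for repointing those paths in the text of a configuration file is shown below. It is a plain string substitution on quoted paths, not a protocol buffer parser, and both the function name and the target directory are hypothetical.

```python
def rewrite_data_dir(config_text, new_dir, old_dir="/tmp/data"):
    """Replace the sample data directory at the start of quoted paths."""
    new_dir = new_dir.rstrip("/")
    return config_text.replace('"' + old_dir + "/", '"' + new_dir + "/")


sample = (
    'vocab_file: "/tmp/data/uncased_L-12_H-768_A-12/vocab.txt"\n'
    "max_seq_length: 32\n"
)
# "/home/me/smith_data/" is a made-up location used only for illustration.
print(rewrite_data_dir(sample, "/home/me/smith_data/"))
```

Only text that starts with a double quote followed by the old directory is touched, so numeric fields and unrelated strings are left unchanged.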

Note that you need to replace the path '/tmp/data' with the corresponding path in your environment. After preparing the configuration file, you can start fine-tuning the SMITH-Short model using the following command:

    # *** From the folder google-research/ ***
    export DATA_DIR=/tmp/data/
    python -m smith.run_smith \
      --dual_encoder_config_file=${DATA_DIR}config/dual_encoder_config.smith_short.32.8.pbtxt \
      --output_dir=${DATA_DIR}res/gwm_smith_short_32_8/ \
      --train_mode=finetune \
      --num_train_steps=10000 \
      --num_warmup_steps=1000 \
      --schedule=train
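The flags num_train_steps and num_warmup_steps suggest a learning-rate schedule with a warmup phase. This README does not state the exact schedule, so the sketch below assumes a BERT-style one (linear warmup to a base rate, then linear decay to zero over the full run) purely to illustrate how the two flags could interact. The base learning rate of 5e-5 and the function name are assumptions as well.

```python
def learning_rate_at(step, base_lr, num_train_steps=10000, num_warmup_steps=1000):
    """Assumed schedule: linear warmup, then linear decay over the full run."""
    if step < num_warmup_steps:
        # Warmup phase: scale the base rate by the fraction of warmup completed.
        return base_lr * step / num_warmup_steps
    # Decay phase: scale by the fraction of total training steps remaining.
    remaining = max(num_train_steps - step, 0)
    return base_lr * remaining / num_train_steps


base_lr = 5e-5  # assumed value, not taken from the released configuration
for step in (0, 500, 1000, 5500, 10000):
    print(step, learning_rate_at(step, base_lr))
```

Under this assumption, the first 1000 of the 10000 steps ramp the rate up, and the remaining steps lower it steadily until training ends.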

SMITH-WP+SP is the model variation pre-trained with both the masked word prediction loss and the masked sentence block prediction loss on the pre-training collection, and then fine-tuned with the document matching loss. Here is an example configuration file for SMITH-WP+SP:

    encoder_config {
      model_name: "smith_dual_encoder"
      init_checkpoint: "/tmp/data/smith_pretrain_model_ckpts/smith_wsp/model.ckpt-400000"
      bert_config_file: "/tmp/data/config/sent_bert_4l_config.json"
      doc_bert_config_file: "/tmp/data ... .json"
      vocab_file: "/tmp/data/uncased_L-12_H-768_A-12/vocab.txt"
      max_seq_length: 32
      max_predictions_per_seq: 5
      max_sent_length_by_word: 32
      max_doc_length_by_sentence: 64
      ...
    }
    train_eval_config {
      input_file_for_train: "/tmp/data/gwikimatch_v2_human_neg_1.train.smith_msenl_32_mdl_64_lm_false.tfrecord"
      input_file_for_eval: "/tmp/data/gwikimatch_v2_human_neg_1.eval_external_wdp_smith_32_64_false.tfrecord"
      train_batch_size: 32
      eval_batch_size: 32
      predict_batch_size: 32
      max_eval_steps: 54
      save_checkpoints_steps: 10
      iterations_per_loop: 10
      eval_with_eval_data: true
      neg_to_pos_example_ratio: 1.0
    }
    loss_config {
      similarity_score_amplifier: 6.0
    }

Please note that you will need to download a pre-trained SMITH checkpoint to perform this run. After preparing the configuration file, you can start fine-tuning the SMITH-WP+SP model using the following command:

    # *** From the folder google-research/ ***
    export DATA_DIR=/tmp/data/
    python -m smith.run_smith \
      --dual_encoder_config_file=${DATA_DIR}config/dual_encoder_config.smith_wsp.32.48.pbtxt \
      --output_dir=${DATA_DIR}res/gwm_smith_wsp_32_48/ \
      --train_mode=finetune \
      --num_train_steps=10000 \
      --num_warmup_steps=1000 \
      --schedule=train

Model pre-training

The released code also supports pre-training SMITH models with the masked word LM loss and the masked sentence block LM loss described in the paper. In practice, you can use any plain text data (such as Wikipedia documents or other text datasets) to pre-train your model. To pre-train the model with this codebase using both the masked word LM loss and the masked sentence block LM loss, set the flag train_mode to "pretrain" and set use_masked_sentence_lm_loss to true in the model configuration file. Also, as with BERT, the LM-related masks must be added to the data during data preprocessing; the data preprocessing script supports adding LM masks to the data. Note that the original data preprocessing pipeline was developed in C++ with distributed data processing for better scalability on larger pre-training datasets. We do not release the C++ version of the data preprocessing code. However, the released Python data preprocessing script can generate the same TensorFlow examples for model pre-training.
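To illustrate what adding masked word LM masks during preprocessing means, here is a simplified sketch and not the repository's preprocessing code. It replaces a random subset of tokens with [MASK] and records the original tokens as prediction targets, capped by max_predictions_per_seq. The 15% masking rate is an assumption borrowed from BERT's defaults, the full BERT procedure also sometimes keeps the original token or substitutes a random one, and masking of whole sentence blocks is not shown.

```python
import random


def add_word_masks(tokens, max_predictions_per_seq=5, masked_lm_prob=0.15,
                   mask_token="[MASK]", seed=0):
    """Mask random tokens; return (masked_tokens, positions, labels)."""
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}
    # Only ordinary tokens are eligible for masking.
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    wanted = max(1, int(round(len(candidates) * masked_lm_prob)))
    num_to_mask = min(max_predictions_per_seq, wanted, len(candidates))
    positions = sorted(rng.sample(candidates, num_to_mask))
    masked = list(tokens)
    labels = []
    for pos in positions:
        labels.append(tokens[pos])  # the model is trained to recover this token
        masked[pos] = mask_token
    return masked, positions, labels


tokens = ["[CLS]", "long", "documents", "need", "hierarchical", "encoders", "[SEP]"]
masked, positions, labels = add_word_masks(tokens)
print(masked)
print(positions, labels)
```

With max_predictions_per_seq set to 0, as in the fine-tuning preprocessing command earlier in this README, the function leaves the tokens untouched.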

Release Notes

Initial release: October 9, 2020

Disclaimer

This is not an officially supported Google product.

Citation

If you want to extend or use this code or the model checkpoints, please cite the following paper:

    @inproceedings{yang2020beyond,
      title = {Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching},
      author = {Liu Yang and Mingyang Zhang and Cheng Li and Michael Bendersky and Marc Najork},
      booktitle = {CIKM},
      year = {2020}
    }

