



Felix is a flexible text-editing approach for generation, designed to derive maximum benefit from the ideas of decoding with bidirectional contexts and self-supervised pre-training. This is achieved by decomposing the text-editing task into two subtasks: tagging, which selects a subset of the input tokens and decides their order in the output text, and insertion, which fills in the output tokens that are not present in the input.
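The decomposition can be illustrated with a toy sketch. This is not the Felix implementation: the helper names `tag_and_reorder` and `fill_masks`, the `order` and `masks_after` arguments, and the example sentence are all hypothetical, chosen only to show how a tagging step can produce a template with gaps that an insertion step then fills.

```python
MASK = "[MASK]"


def tag_and_reorder(tokens, order, masks_after):
    """Toy tagging step (hypothetical helper, not the Felix API).

    `order` lists the indices of the source tokens to keep, in output order;
    any index left out is treated as deleted. `masks_after` maps a kept index
    to the number of [MASK] slots to open right after that token.
    """
    template = []
    for idx in order:
        template.append(tokens[idx])
        template.extend([MASK] * masks_after.get(idx, 0))
    return template


def fill_masks(template, predictions):
    """Toy insertion step: replace each [MASK] with the next predicted token."""
    fillers = iter(predictions)
    return [next(fillers) if tok == MASK else tok for tok in template]


source = ["The", "big", "very", "loud", "cat"]
# Keep "The", "very", "big", "cat" in that order, drop "loud",
# and open one slot after "big".
template = tag_and_reorder(source, order=[0, 2, 1, 4], masks_after={1: 1})
output = fill_masks(template, ["old"])

print(" ".join(template))  # The very big [MASK] cat
print(" ".join(output))    # The very big old cat
```

In Felix itself, both steps are predicted by trained models; here the tagging decisions and the inserted token are supplied by hand.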

A detailed method description and evaluation can be found in the Findings of EMNLP 2020 paper: https://www.aclweb.org/anthology/2020.findings-emnlp.111/

Felix is built on Python 3, TensorFlow 2, and BERT. It works with CPU, GPU, and Cloud TPU.

Usage instructions

Running an experiment with Felix consists of the following steps:

1. Create the label_map for the tagging model.
2. Convert the data for the insertion/tagging models.
3. Fine-tune the tagging/insertion models.
4. Compute predictions.

The sections below go through these steps, using a subset of the DiscoFuse task as a running example.

You can run all of the steps with

sh run_discofuse_experiment.sh

after setting the variables at the beginning of the script.

1. Build the label map

# Build label map
export OUTPUT_DIR=/path/to/output
python phrase_vocabulary_constructor_main \
    --output="${OUTPUT_DIR}/label_map.json" \
    --use_pointing="${USE_POINTING}" \
    --do_lower_case="True"

2. Converting data for the insertion/tagging models

Download a pre-trained BERT model from the official repository. Unless otherwise stated, all experiments used the 12-layer "BERT-Base, Uncased" model. Then convert the original TSV datasets to TFRecord format. The DiscoFuse dataset can be found at https://github.com/google-research-datasets/discofuse

# Preprocessing
export BERT_BASE_DIR=/path/to/uncased_L-12_H-768_A-12
export DISCOFUSE_DIR=/path/to/discofuse

python preprocess_main \
    --input_file="${DISCOFUSE_DIR}/train.tsv" \
    --input_format="discofuse" \
    --output_file="${OUTPUT_DIR}/train.tfrecord" \
    --label_map_file="${OUTPUT_DIR}/label_map.json" \
    --vocab_file="${BERT_BASE_DIR}/vocab.txt" \
    --do_lower_case="True" \
    --use_open_vocab="True" \
    --max_seq_length="128" \
    --use_pointing="${USE_POINTING}" \
    --split_on_punc="True"

python preprocess_main.py \
    --input_file="${DISCOFUSE_DIR}/tune.tsv" \
    --input_format="discofuse" \
    --output_file="${OUTPUT_DIR}/tune.tfrecord" \
    --label_map_file="${OUTPUT_DIR}/label_map.json" \
    --vocab_file="${BERT_BASE_DIR}/vocab.txt" \
    --do_lower_case="True" \
    --use_open_vocab="True" \
    --max_seq_length="128" \
    --use_pointing="${USE_POINTING}" \
    --split_on_punc="True"

3. Model training

Model hyperparameters are specified in felix_config.json. This configuration file extends the bert_config.json that comes with the zipped pre-trained BERT model. Note that the tagging and insertion models can be trained independently, so training them in parallel is faster than training them sequentially.

Train the models on CPU/GPU.

# Train
python run_felix \
    --train_file="${OUTPUT_DIR}/train.tfrecord" \
    --eval_file="${OUTPUT_DIR}/tune.tfrecord" \
    --model_dir_tagging="${OUTPUT_DIR}/model_tagging" \
    --bert_config_tagging="${DISCOFUSE_DIR}/felix_config.json" \
    --max_seq_length=128 \
    --num_train_epochs=500 \
    --num_train_examples=8 \
    --num_eval_examples=8 \
    --train_batch_size="32" \
    --eval_batch_size="32" \
    --log_steps="100" \
    --steps_per_loop="100" \
    --train_insertion="False" \
    --use_pointing="${USE_POINTING}" \
    --init_checkpoint="${BERT_DIR}/bert_model.ckpt" \
    --learning_rate="0.00003" \
    --pointing_weight="1" \
    --input_format="recordio" \
    --use_weighted_labels="True"

rm -rf "${DATA_DIRECTORY}/model_insertion"
mkdir "${DATA_DIRECTORY}/model_insertion"
python run_felix \
    --train_file="${OUTPUT_DIR}/train.tfrecord.ins" \
    --eval_file="${OUTPUT_DIR}/tune.tfrecord.ins" \
    --model_dir_insertion="${OUTPUT_DIR}/model_insertion" \
    --bert_config_insertion="${DISCOFUSE_DIR}/felix_config.json" \
    --max_seq_length=128 \
    --num_train_epochs=500 \
    --num_train_examples=8 \
    --num_eval_examples=8 \
    --train_batch_size="32" \
    --eval_batch_size="32" \
    --log_steps="100" \
    --steps_per_loop="100" \
    --init_checkpoint="${BERT_DIR}/bert_model.ckpt" \
    --use_pointing="${USE_POINTING}" \
    --learning_rate="0.00003" \
    --pointing_weight="1" \
    --input_format="recordio" \
    --train_insertion="True"
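Because the tagging and insertion models are trained independently, the two run_felix invocations above can be started at the same time. The following is a minimal sketch of one way to do that from Python, not part of the Felix repository: `build_train_command` and `run_in_parallel` are hypothetical helpers, only the model-specific flags from the commands above are spelled out, and the remaining shared flags would need to be passed through `extra_flags`.

```python
import subprocess
import sys


def build_train_command(train_insertion, output_dir, config_path, extra_flags=()):
    """Build a run_felix command line (hypothetical helper).

    Flag names are copied from the training commands in this README; shared
    flags such as --max_seq_length are expected in `extra_flags`.
    """
    name = "insertion" if train_insertion else "tagging"
    suffix = ".ins" if train_insertion else ""
    return [
        sys.executable, "run_felix",
        f"--train_file={output_dir}/train.tfrecord{suffix}",
        f"--eval_file={output_dir}/tune.tfrecord{suffix}",
        f"--model_dir_{name}={output_dir}/model_{name}",
        f"--bert_config_{name}={config_path}",
        f"--train_insertion={train_insertion}",
        *extra_flags,
    ]


def run_in_parallel(commands):
    """Start every command at once, then wait and return the exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]
```

A call such as `run_in_parallel([build_train_command(False, out, cfg, shared), build_train_command(True, out, cfg, shared)])` would launch both trainings concurrently. Whether this is faster in practice depends on having enough accelerators or CPU cores for both jobs.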

To train on Cloud TPU, add the following flags:

--use_tpu=true \
--tpu_name=${TPU_NAME}

For instructions on how to use Cloud TPU, see the BERT TPU instructions and the Google Cloud TPU tutorial.

4. Prediction

# Predict
export PREDICTION_FILE=${OUTPUT_DIR}/pred.tsv

python predict_main \
    --input_format="discofuse" \
    --predict_input_file="${DISCOFUSE_DIR}/test.tsv" \
    --predict_output_file="${PREDICTION_FILE}" \
    --label_map_file="${OUTPUT_DIR}/label_map.json" \
    --vocab_file="${BERT_BASE_DIR}/vocab.txt" \
    --max_seq_length=128 \
    --predict_batch_size=32 \
    --do_lower_case="True" \
    --use_open_vocab="True" \
    --bert_config_tagging="${DISCOFUSE_DIR}/felix_config.json" \
    --bert_config_insertion="${DISCOFUSE_DIR}/felix_config.json" \
    --model_tagging_filepath="${OUTPUT_DIR}/model_tagging" \
    --model_insertion_filepath="${OUTPUT_DIR}/model_insertion" \
    --use_pointing="${USE_POINTING}"

To predict on Cloud TPU, add the following flags:

--use_tpu=true \
--tpu_name=${TPU_NAME}

Prediction outputs a TSV file with four columns: the source, the input to the insertion model, the final output, and the reference. Note that the output is tokenized into WordPieces and includes a start "[CLS]" and an end "[SEP]" token. WordPieces can be removed by replacing " ##" with "". In addition, words are split on punctuation (for example, "don't" becomes "don ' t"), and this split must be reversed as well.
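The clean-up described above can be sketched in Python as follows. This is an illustrative heuristic, not a script shipped with Felix: `detokenize` and `read_predictions` are hypothetical names, the punctuation handling only covers word-internal apostrophes and common trailing punctuation marks, and the reader assumes every row has exactly four tab-separated columns with no header row.

```python
import csv
import re


def detokenize(text):
    """Undo the tokenization of a predicted sentence (heuristic sketch)."""
    # Drop the start and end markers.
    tokens = [tok for tok in text.split() if tok not in ("[CLS]", "[SEP]")]
    # Merge WordPieces: "un ##believ ##able" -> "unbelievable".
    joined = " ".join(tokens).replace(" ##", "")
    # Rejoin word-internal apostrophes: "don ' t" -> "don't".
    joined = re.sub(r"(\w) ' (\w)", r"\1'\2", joined)
    # Attach common trailing punctuation to the preceding word.
    return re.sub(r" ([.,!?;:])", r"\1", joined)


def read_predictions(lines):
    """Yield (source, detokenized output, reference) for each TSV row.

    `lines` is any iterable of strings, such as an open prediction file.
    """
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        source, _insertion_input, output, reference = row
        yield source, detokenize(output), reference


print(detokenize("[CLS] i don ' t like un ##believ ##able delays . [SEP]"))
# i don't like unbelievable delays.
```

Quoted passages and other punctuation patterns are not restored by this heuristic and would need additional rules.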

How to cite Felix

@inproceedings{mallinson-etal-2020-felix,
    title = "{FELIX}: Flexible Text Editing Through Tagging and Insertion",
    author = "Mallinson, Jonathan and Severyn, Aliaksei and Malmi, Eric and Garrido, Guillermo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.111",
    doi = "10.18653/v1/2020.findings-emnlp.111",
    pages = "1244--1255",
}

License

Apache 2.0; see LICENSE for more information.

Disclaimer

This repository contains a TensorFlow 2 reimplementation of the original TensorFlow 1 code used in the paper, so results may differ somewhat from those reported in the paper. However, we have confirmed that similar results can be obtained on the DiscoFuse dataset.

This is not an official Google product.

