This repo contains the `jiant` sentence representation learning toolkit created at the 2018 JSALT Workshop by the General-Purpose Sentence Representation Learning team. It is an extensible platform meant to make it easy to run experiments that involve multitask and transfer learning across sentence-level NLP tasks.
The 'j' in `jiant` stands for JSALT. That's all the acronym we have.
To reproduce experiments from JSALT (bugs and all), use the `jsalt-experiments` branch. It contains a snapshot of the code as of early August 2018, potentially with updated documentation.
Make sure you have installed the packages listed in `environment.yml`. Where specific package versions are listed, those versions are required.
If you use conda (recommended, instructions for installing miniconda here), you can create an environment from this package with the following command:
conda env create -f environment.yml
To activate the environment, run `source activate jiant`; to deactivate, run `source deactivate`.
Some requirements may only be needed for specific configurations. If you have trouble installing a specific dependency and suspect that it isn't needed for your use case, create an issue or a pull request, and we'll help you get by without it.
You will also need to install dependencies for nltk if you do not already have them:
python -m nltk.downloader -d /usr/share/nltk_data perluniprops nonbreaking_prefixes punkt
This project uses git submodules to manage some dependencies on other research code, in particular for loading CoVe and the OpenAI transformer model. To make sure you get these repos when you download `jiant/`, add `--recursive` to your clone command:
git clone --recursive git@github.com:jsalt18-sentence-repl/jiant.git jiant
If you already cloned and just need to get the submodules, you can do:
git submodule update --init --recursive
The repo contains a convenience python script for downloading all GLUE data and standard splits.
python scripts/download_glue_data.py --data_dir data --tasks all
We also make use of many other data sources, including:
- Translation: WMT'14 EN-DE, WMT'17 EN-RU. Scripts to prepare the WMT data are in `scripts/wmt/`.
- Language modeling: Billion Word Benchmark, WikiText103. We use the English sentence tokenizer from the NLTK Punkt Tokenizer Models to preprocess the WikiText103 corpus; it is only used to break paragraphs into sentences, and word-level tokenization uses the default tokenizer, as in all other tasks, unless otherwise specified. We don't do any preprocessing on the BWB corpus.
- Image captioning: MSCOCO Dataset (http://cocodataset.org/#download). Specifically, we use the following splits: 2017 Train images [118K/18GB], 2017 Val images [5K/1GB], 2017 Train/Val annotations [241MB].
- Reddit: the reddit_comments dataset. Specifically, we use the 2008 and 2009 tables.
- DisSent: Details for preparing the corpora are in `scripts/dissent/README`.
- DNC (Diverse Natural Language Inference Collection), i.e. recast data: The DNC is available online; follow the instructions described there to download it.
- CCG: Details for preparing the corpora are in `scripts/ccg/README`.
- Edge probing analysis tasks: see `probing/data` for more information.
To incorporate the above data, place each dataset in its own subdirectory of the data directory (see the task-directory mapping in `src/preprocess.py` and `src/tasks.py`).
To run an experiment, make a config file similar to `config/demo.conf` with your model configuration. You can use the `--overrides` flag to override specific variables. For example:
python main.py --config_file config/demo.conf \
--overrides "exp_name = my_exp, run_name = foobar, d_hid = 256"
will run the demo config, but output to `$JIANT_PROJECT_PREFIX/my_exp/foobar`.
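For reference, a config file for such a run might look roughly like the sketch below. This is illustrative only: the `include` line and the exact set of options shown are assumptions, so use `config/demo.conf` and `config/defaults.conf` as the authoritative templates.

```
// my_exp.conf -- hypothetical example, adapted from config/demo.conf
include "defaults.conf"   // assumed: inherit all default options

exp_name = my_exp
run_name = foobar
pretrain_tasks = "sst"
target_tasks = "sst"
d_hid = 256
```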
To run the demo config, you will have to set a few environment variables. The best way to do that is to follow the instructions in this script:
- `$JIANT_PROJECT_PREFIX`: where the outputs will be saved.
- `$JIANT_DATA_DIR`: location of the saved data. This is usually the location of the GLUE data.
- `$WORD_EMBED`: location of the word embeddings you want to use. For GloVe: 840B300d GloVe. For fastText: 300d-2M. For ELMo, AllenNLP will download the weights for you.
- `$FASTTEXT_MODEL_FILE`: location of the fastText model; can be set to '.'
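For example, a minimal setup might look like the following sketch (all paths are illustrative placeholders, not required locations):

```
# Illustrative values only -- point these at your own directories.
export JIANT_PROJECT_PREFIX=$HOME/exp                      # experiment output goes here
export JIANT_DATA_DIR=$HOME/data                           # e.g. where download_glue_data.py wrote its output
export WORD_EMBED=$HOME/embeddings/glove.840B.300d.txt
export FASTTEXT_MODEL_FILE=.                               # can be set to '.' as noted above
```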
Because preprocessing is expensive (e.g. building vocab and indexing for very large tasks like WMT or BWB), we often want to run multiple experiments using the same preprocessing. So, we group runs that share preprocessing into a single experiment directory (set using the `exp_dir` flag), in which we store all shared preprocessing objects. Later runs will load the stored preprocessing. We write run-specific information (logs, saved models, etc.) to a run-specific directory (set using the `run_dir` flag), usually nested within the experiment directory. Experiment directories are written in `project_dir`. Overall, the directory structure looks like:
project_dir # directory for all experiments using jiant
|-- exp1/ # directory for a set of runs training and evaluating on FooTask and BarTask
| |-- preproc/ # shared indexed data of FooTask and BarTask
| |-- vocab/ # shared vocabulary built from examples from FooTask and BarTask
| |-- FooTask/ # shared FooTask class object
| |-- BarTask/ # shared BarTask class object
| |-- run1/ # run directory with some hyperparameter settings
| |-- run2/ # run directory with some different hyperparameter settings
| |
| [...]
|
|-- exp2/ # directory for a set of runs with different experiments, potentially using a different branch of the code
| |-- preproc/
| |-- vocab/
| |-- FooTask/
| |-- BazTask/
| |-- run1/
| |
| [...]
|
[...]
You should also set the `data_dir` and `word_embs_file` options to point to the directories containing the data (e.g. the output of the `scripts/download_glue_data.py` script) and the word embeddings (optional, not needed when using ELMo; see later sections), respectively.
To force rereading and reloading of the tasks, perhaps because you changed the format or preprocessing of a task, delete the objects in the directories named for the tasks (e.g. `QQP/`) or use the option `reload_tasks = 1`.
To force rebuilding of the vocabulary, perhaps because you want to include vocabulary for more tasks, delete the objects in `vocab/` or use the option `reload_vocab = 1`.
To force reindexing of a task's data, delete some or all of the objects in `preproc/`, or use the option `reload_index = 1` and set `reindex_tasks` to the names of the tasks to be reindexed, e.g. `reindex_tasks=\"sst,mnli\"`. You should do this whenever you rebuild the task objects or vocabularies.
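For example, to reindex just SST and MNLI in an existing experiment (the experiment name here is illustrative), something along these lines should work:

python main.py --config_file config/demo.conf \
    --overrides "exp_name = my_exp, reload_index = 1, reindex_tasks = \"sst,mnli\""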
All model configuration is handled through the config file system and the `--overrides` flag, but there are also a few command-line arguments that control the behavior of `main.py`. In particular:
- `--tensorboard` (or `-t`): run a TensorBoard server while the trainer is running, serving on the port specified by `--tensorboard_port` (default: 6006). The trainer will write event data even if this flag is not used, and you can run TensorBoard separately as:
tensorboard --logdir <exp_dir>/<run_name>/tensorboard
- `--notify <email_address>`: enable notification emails via SendGrid. You'll need to make a SendGrid account and set the `SENDGRID_API_KEY` environment variable to contain (the text of) the client secret key.
- `--remote_log` (or `-r`): enable remote logging via Google Stackdriver. You can set up credentials and set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable; see the Stackdriver Logging Client Libraries documentation.
The core model is a shared BiLSTM with task-specific components. When a language modeling objective is included in the set of training tasks, we use a bidirectional language model for all tasks, which is constructed to avoid cheating on the language modeling tasks.
We also include an experimental option to use a shared Transformer in place of the shared BiLSTM by setting `sent_enc = transformer`. When using a Transformer, we use the Noam learning rate scheduler, as that seems to be important for training the Transformer well.
Task-specific components include logistic regression and multi-layer perceptron for classification and regression tasks, and an RNN decoder with attention for sequence transduction tasks. To see the full set of available parameters, see `config/defaults.conf`. For a list of options affecting the execution pipeline (which configuration file to use, whether to enable remote logging or TensorBoard, etc.), see the arguments section in `main.py`.
The trainer was originally written to perform sampling-based multi-task training. At each step, a task is sampled and `bpp_base` (default: 1) batches of that task's training data are trained on. The trainer evaluates the model on the validation data after a fixed number of gradient steps, set by `val_interval`.
The learning rate is scheduled to decay by `lr_decay_factor` (default: 0.5) whenever the validation score doesn't improve after `lr_patience` (default: 1) validation checks.
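For example (values here are illustrative, not recommendations), a config that validates every 1000 steps and halves the learning rate after two stalled validation checks would include:

```
val_interval = 1000
lr_decay_factor = 0.5
lr_patience = 2
```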
Note: "epoch" is generally used in comments and variable names to refer to the interval between validation checks, not to a complete pass through any one training set.
If you're training on only one task, you don't need to worry about sampling schemes, but if you are training on multiple tasks, you can vary the sampling weights with `weighting_method`, e.g. `weighting_method = uniform` or `weighting_method = proportional` (proportional to the amount of training data). You can also scale the losses of each minibatch via `scaling_method` if you want to weight tasks with different amounts of training data equally throughout training.
For multi-task training, we use a shared global optimizer and LR scheduler for all tasks. In the global case, we use the macro average of each task's validation metrics to do LR scheduling and early stopping. When doing multi-task training and at least one task's validation metric should decrease (e.g. perplexity), we invert tasks whose metric should decrease by averaging `1 - (val_metric / dec_val_scale)`, so that the macro average is well-behaved.
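For illustration, if `dec_val_scale` were set to 100 (a value chosen only for this example), a task with a validation perplexity of 40 would contribute `1 - (40 / 100) = 0.6` to the macro average, so improvements (lower perplexity) raise the average just as they do for accuracy-style metrics.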
We have partial support for per-task optimizers (`shared_optimizer = 0`), but checkpointing may not behave correctly in this configuration. In the per-task case, we stop training on a task when its patience has run out or its optimizer hits the minimum learning rate.
Within a run, tasks are divided into training tasks and evaluation tasks. The logic of `main.py` is that the entire model is pretrained on all the training tasks, then the best model is loaded, and task-specific components are trained for each of the evaluation tasks with a frozen shared sentence encoder. You can control which steps are performed or skipped by setting the flags `do_pretrain`, `do_target_task_training`, and `do_full_eval`.
Specify training tasks with `pretrain_tasks = $pretrain_tasks`, where `$pretrain_tasks` is a comma-separated list of task names; similarly, use `target_tasks` to specify the evaluation-only tasks. For example, `pretrain_tasks = \"sst,mnli,foo\", target_tasks = \"qnli,bar,sst,mnli,foo\"` (HOCON notation requires escaped quotes in command-line arguments).
Note: if you want to train and evaluate on a task, that task must appear in both `pretrain_tasks` and `target_tasks`.
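Putting this together, a full pretraining-plus-evaluation command might look like the following (experiment and run names are illustrative):

python main.py --config_file config/demo.conf \
    --overrides "exp_name = my_exp, run_name = multitask, pretrain_tasks = \"sst,mnli\", target_tasks = \"sst,mnli,qnli\""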
To add new tasks, you should:
- Add your data to the `data_dir` you intend to use. When constructing your task class (see the next bullet), make sure you specify the correct subfolder containing your data.
- Create a class in `src/tasks.py` (a minimal sketch appears after this list), and make sure that:
    - You decorate the task: in the line immediately before `class MyNewTask():`, add the line `@register_task(task_name, rel_path='path/to/data')`, where `task_name` is the designation for the task used in `pretrain_tasks`/`target_tasks` and `rel_path` is the path to the data in `data_dir`. See `EdgeProbingTasks` in `tasks.py` for an example.
    - Your task inherits from existing classes as necessary (e.g. `PairClassificationTask`, `SequenceGenerationTask`, `WikiTextLMTask`, etc.).
    - The task definition includes the data loader, a method called `load_data()` which stores tokenized but un-indexed data for each split in attributes named `task.{train,valid,test}_data_text`. The format of each datum can be anything as long as your preprocessing code (in `src/preprocess.py`, see next bullet) expects that format. Generally data are formatted as lists of inputs and outputs, e.g. MNLI is formatted as `[[sentences1]; [sentences2]; [labels]]`, where `sentences1` and `sentences2` are lists of the first and second sentences from each example, respectively. Make sure to call your data loader in initialization!
    - Your task implements a method `task.get_sentences()` that iterates over all text to index in order to build the vocabulary. For some types of tasks, e.g. `SingleClassificationTask`, you only need to set `task.sentences` to be a list of sentences (`List[List[str]]`).
    - Your task implements a method `task.count_examples()` that sets `task.example_counts` (`Dict[str:int]`): the number of examples per split (train, val, test). See here for an example.
    - Your task implements a method `task.get_split_text()` that takes in the name of a split and returns an iterable over the data in that split. This method will be called in preprocessing and passed to `task.process_split` (see next bullet).
    - Your task implements a method `task.process_split()` that takes in a split of your data and produces a list of AllenNLP `Instance`s. An `Instance` is a wrapper around a dictionary of `(field_name, Field)` pairs. `Field`s are objects that help with data processing (indexing, padding, etc.). Each input and output should be wrapped in a field of the appropriate type (`TextField` for text, `LabelField` for class labels, etc.). For MNLI, we wrap the premise and hypothesis in `TextField`s and the label in a `LabelField`. See the AllenNLP tutorial or the examples in `src/tasks.py`. The fields, e.g. `input1`, can be named anything so long as the corresponding code in `src/models.py` (see next bullet) expects those field names. However, make sure that the values to be predicted are named either `labels` (for classification or regression) or `targs` (for sequence generation)!
    - If your task requires task-specific label namespaces, e.g. for translation or tagging, set the attribute `task._label_namespace` to reserve a vocabulary namespace for your task's target labels. We strongly suggest including the task name in the target namespace. Your task should also implement `task.get_all_labels()`, which returns an iterable over the labels (possibly words, e.g. in the case of MT) in the task-specific namespace.
    - Your task has attributes `task.val_metric` (name of the task-specific metric to track during training) and `task.val_metric_decreases` (bool, `True` if the val metric should decrease during training). You should also implement a `task.get_metrics()` method that computes the metrics you care about, using AllenNLP `Scorer` objects (typically set via `task.scorer1`, `task.scorer2`, etc.).
- In `src/models.py`, make sure that:
    - The correct task-specific module is being created for your task in `build_module()`.
    - Your task is correctly handled in `forward()` of `MultiTaskModel`. The model will receive the task class you created and a batch of data, where each batch is a dictionary with keys of the `Instance` objects you created in preprocessing, as well as a `predict` flag that indicates whether your forward function should generate predictions.
    - You create additional methods or add branches to existing methods as necessary. If you do add additional methods, make sure to use the `sent_encoder` attribute of the model, which is shared among all tasks.
Note: The current training procedure is task-agnostic: we randomly sample a task to train on, pass a batch to the model, and receive an output dictionary containing at least a `loss` key. Training loss should be calculated within the model; validation metrics should also be computed within AllenNLP `Scorer`s and not in the training loop. So you should not need to modify the training loop; please reach out if you think you need to.
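To make the checklist above concrete, here is a rough, untested sketch of what a simple single-sentence classification task might look like inside `src/tasks.py`. The base-class constructor signature, the `n_classes` argument, and the `load_my_tsv` helper are assumptions for illustration only; mirror an existing task class in `src/tasks.py` for the exact patterns.

```python
# Illustrative sketch only -- lives in src/tasks.py, where register_task,
# SingleClassificationTask, and os are already in scope.

@register_task('mynewtask', rel_path='MyNewTask/')
class MyNewTask(SingleClassificationTask):
    def __init__(self, path, max_seq_len, name="mynewtask", **kw):
        # The exact base-class signature (including n_classes) is assumed here.
        super().__init__(name, n_classes=2, **kw)
        self.load_data(path, max_seq_len)  # call the data loader in initialization
        # get_sentences() iterates over task.sentences to build the vocabulary.
        self.sentences = self.train_data_text[0] + self.valid_data_text[0]

    def load_data(self, path, max_seq_len):
        # Store tokenized but un-indexed data for each split, e.g. formatted
        # as [[sentences]; [labels]]. load_my_tsv() is a hypothetical helper.
        self.train_data_text = load_my_tsv(os.path.join(path, "train.tsv"), max_seq_len)
        self.valid_data_text = load_my_tsv(os.path.join(path, "dev.tsv"), max_seq_len)
        self.test_data_text = load_my_tsv(os.path.join(path, "test.tsv"), max_seq_len)
```

Once the class is registered, you can train on it by adding `mynewtask` to `pretrain_tasks` and/or `target_tasks`.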
Feel free to create a pull request to add an additional task if you expect that it'll be useful to others.
We use the ELMo implementation provided by AllenNLP.
To use ELMo, set `elmo` to 1.
By default, AllenNLP will download and cache the pretrained ELMo weights. If you want to use a particular file containing ELMo weights, set `elmo_weight_file_path = path/to/file`.
To use only the character-level CNN word encoder from ELMo, set `elmo_chars_only = 1`. This is the default setting.
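For example, to use the full ELMo representations rather than only the character CNN (letting AllenNLP download and cache the weights), an override along these lines should work:

python main.py --config_file config/demo.conf --overrides "elmo = 1, elmo_chars_only = 0"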
We use the CoVe implementation provided here. To use CoVe, clone the CoVe repo, set the option `path_to_cove = "/path/to/cove/repo"`, and set `cove = 1`.
To use fastText, you can use either the pretrained vectors or the pretrained model. The former will have out-of-vocabulary (OOV) terms while the latter will not, so using the latter is preferred.
To use the pretrained model, follow the instructions here (specifically "Building fastText for Python") to set up the fastText package, then download the trained English model (note: 9.6 GB). fastText will also need to be built in the jiant environment following these instructions. To activate the fastText model within our framework, set the flag `fastText = 1`.
To use the pretrained vectors instead, download them from here (preferably the 300-dimensional Common Crawl vectors) and set `word_emb_file` to point to the .vec file.
To use GloVe pretrained word embeddings, download and extract the relevant files and set `word_embs_file` to point to the GloVe file.
For the JSALT workshop, we used Google Compute Engine as our main compute platform. If you're using Google Compute Engine, the private project instance images (`cpu-workstation-template*` and `gpu-worker-template-*`) already have all the required packages installed, plus the GLUE data and pretrained embeddings downloaded to `/usr/share/jsalt`. Unfortunately, these images are not straightforward to share. To use them, clone this repo to your home directory, then test with:
python main.py --config_file config/demo.conf
You should see the model start training and achieve an accuracy of > 70% on SST within a few minutes. The default config will write the experiment directory to `$HOME/exp/<experiment_name>` and the run directory to `$HOME/exp/<experiment_name>/<run_name>`, so you can find the demo output in `~/exp/jiant-demo/sst`.
Because some config arguments have been renamed, you may encounter an error when loading config files (e.g. `params.conf`) created before Oct 24, 2018. To update an old config file, run:
python scripts/update_config.py <path_to_file>
This package is released under the MIT License. The material in the allennlp_mods directory is based on AllenNLP, which was originally released under the Apache 2.0 license.
Post an issue here on GitHub if you have any problems, and create a pull request if you make any improvements (substantial or cosmetic) to the code that you're willing to share.