This is the repository for the code accompanying the ICML 2024 submission *By Tying Embeddings You Are Assuming the Distributional Hypothesis*.
In this work, we analyze both theoretically and empirically the effect of tied input-output embeddings, a popular technique that reduces model size while often improving training. Interestingly, we find that this technique is connected to the distributional hypothesis, often summarized by J. R. Firth's famous quote "a word is characterized by the company it keeps". In particular, we find that words (or, more generally, symbols) with similar semantics are encoded in similar input embeddings, while words that appear in similar contexts are encoded in similar output embeddings. As a consequence of these findings, tying the input and output embeddings is advisable only when the distributional hypothesis holds for the underlying data. These results also provide insight into the embeddings of foundation language models, which are known to be semantically organized. Further, we complement the theoretical findings with several experiments supporting the claims, all replicable with this package.
To reproduce our experiments you will need python3 with the following packages installed. The table below also lists the package versions used at the time of writing; note that `requirements.txt` does not pin version numbers. You will also need `make` to run the makefiles.
| package | version |
|---|---|
| click | 8.1.7 |
| jsonlines | 4.0.0 |
| matplotlib | 3.8.2 |
| numpy | 1.26.3 |
| pandas | 2.2.0 |
| rich | 13.7.0 |
| seaborn | 0.13.1 |
| termcolor | 2.4.0 |
| torch | 2.1.2 |
| tqdm | 4.66.1 |
| transformers | 4.37.0 |
To install these packages, we suggest using a virtual environment, for example:

```shell
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -r requirements.txt
```

This creates a minimal python3 environment in which to run the experiments.
We prepared a makefile for each experiment. These makefiles can be found in the directory `semeqv/problem/xor/`:
| name | experiment |
|---|---|
| MakefileTransformerSmall.mk | small transformer architecture (main experiment) |
| MakefileTransformerMedium.mk | larger transformer architecture |
| MakefileTransformerSmallYDH.mk | no cond. differ. symbols |
| MakefileTransformerSmallNDH.mk | no cond. eqv. symbols |
| MakefileLSTMSmall.mk | small LSTM architecture |
| MakefileLSTMMedium.mk | larger LSTM architecture |
| MakefileMLPSmall.mk | small MLPMixer architecture |
| MakefileMLPMedium.mk | larger MLPMixer architecture |
If you want to run all experiments, the top-level makefile provides the target `all` to run everything: `make all`.
Otherwise, to run a specific experiment you can run `make -f <path> figs`. This will train the model and plot figures like the ones presented in the paper. For example, `make -f semeqv/problem/xor/MakefileTransformerSmall.mk figs` trains the models used in the main experiments and plots the corresponding figures.
The code is organized to provide easily customizable experiments. To achieve this, we used the `click` python package to build a tree-like command line interface (CLI). Our CLI provides commands to choose the dataset type, callbacks, loggers, loss function, model architecture, optimizer, and scheduler. For example, the command:
```shell
python3 semeqv/cli.py \
    --seed 42 \
    cli callbacks cdists --split "train" \
        --input_embedding_path ./input \
        --output_embedding_path ./output \
    cli callbacks log --epoch_log_path ./valid.log --split "validation" \
    cli loss cross-entropy \
    cli xor dataset default --batch_size 100 --split "train" \
    cli xor dataset default --batch_size 100 --split "validation" \
    cli xor model transformer \
        --layers 1 --heads 1 \
        --embedding_size 4 \
        --activation "gelu" \
    cli optimizers adamw --learning_rate 0.0001 \
    cli schedulers cosine noscheduler \
    cli savers default-saver --dirpath $(dir $@) \
    cli train
```
This command works by configuring the class `Trainer` (in `Trainer.py`) through a builder pattern, with methods like `set_loss_fn(self, value)` or `set_model(self, value)`. The final subcommand `cli train` then launches the training.
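As a rough sketch of this builder pattern (only `set_loss_fn` and `set_model` are mentioned above; the bodies, chaining behavior, and `train` method shown here are illustrative assumptions, not the exact `semeqv` implementation):

```python
# Hypothetical sketch of a builder-style Trainer; each CLI subcommand
# maps to one setter, and `cli train` ultimately calls .train().
class Trainer:
    def __init__(self):
        self.loss_fn = None
        self.model = None

    def set_loss_fn(self, value):
        self.loss_fn = value
        return self  # returning self allows chained configuration

    def set_model(self, value):
        self.model = value
        return self

    def train(self):
        # Fail early if the CLI did not configure a required component.
        assert self.model is not None and self.loss_fn is not None
        return f"training {self.model} with {self.loss_fn}"


result = Trainer().set_model("transformer").set_loss_fn("cross-entropy").train()
print(result)  # -> training transformer with cross-entropy
```

Returning `self` from each setter is what lets the CLI accumulate configuration across the chained `cli ...` subcommands before the final `cli train` call.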
Of course, there are many more options than those shown here. The interested reader can insert `help` at any point of the previous command to inspect the commands or options available at that position. We also encourage checking the `click` documentation for a comprehensive reference.
To ensure future reproducibility, we also provide a Dockerfile to build an environment in which to run the experiments. We suggest using docker with a GPU (see the nvidia-container-toolkit). Once you have built the container with `make docker-build`, you can enter the docker environment with `make docker-run`. From there, you can run the commands already discussed. Note that the `docker-run` target automatically binds the current working directory (make sure to run the command from the project root); therefore, the generated data will persist outside the container.
```bibtex
@inproceedings{bertolottitying,
  title={By Tying Embeddings You Are Assuming the Distributional Hypothesis},
  author={Bertolotti, Francesco and Cazzola, Walter},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}
```