This is the repository for the code accompanying the ICML 2024 submission *By Tying Embeddings You Are Assuming the Distributional Hypothesis*.
In this work, we analyze both theoretically and empirically the effect of tied input-output embeddings, a popular technique that reduces model size while often improving training. Interestingly, we find that this technique is connected to the distributional hypothesis, often summarized by J. R. Firth's famous quote "a word is characterized by the company it keeps". In particular, we find that words (or, more generally, symbols) with similar semantics are encoded in similar input embeddings, while words that appear in similar contexts are encoded in similar output embeddings. As a consequence of these findings, tying the input and output embeddings is advisable only when the distributional hypothesis holds for the underlying data. These results also provide insight into the embeddings of foundation language models, which are known to be semantically organized. Further, we complement the theoretical findings with several experiments supporting the claims, all replicable with this package.
To reproduce our experiments you will need python3 with the following packages installed. The table below also lists the package versions used at the time of writing; note that `requirements.txt` does not pin version numbers. You will also need `make` to run the makefiles.
| package | version |
|---|---|
| click | 8.1.7 |
| jsonlines | 4.0.0 |
| matplotlib | 3.8.2 |
| numpy | 1.26.3 |
| pandas | 2.2.0 |
| rich | 13.7.0 |
| seaborn | 0.13.1 |
| termcolor | 2.4.0 |
| torch | 2.1.2 |
| tqdm | 4.66.1 |
| transformers | 4.37.0 |
To install these packages, we suggest using a virtual environment, for example:

```shell
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -r requirements.txt
```

This creates a minimal python3 environment in which to run the experiments.
We prepared a makefile for each experiment. These makefiles can be found in the directory `semeqv/problem/xor/`:
| name | experiment |
|---|---|
| MakefileTransformerSmall.mk | small transformer architecture (main experiment) |
| MakefileTransformerMedium.mk | larger transformer architecture |
| MakefileTransformerSmallYDH.mk | no cond. differ. symbols |
| MakefileTransformerSmallNDH.mk | no cond. eqv. symbols |
| MakefileLSTMSmall.mk | small LSTM architecture |
| MakefileLSTMMedium.mk | larger LSTM architecture |
| MakefileMLPSmall.mk | small MLPMixer architecture |
| MakefileMLPMedium.mk | larger MLPMixer architecture |
If you want to run all experiments, the top-level makefile provides the target `all` to run everything: `make all`.
Otherwise, to run a specific experiment you can run `make -f <path> figs`. This will train the model and plot figures like the ones presented in the paper. For example, `make -f semeqv/problem/xor/MakefileTransformerSmall.mk figs` trains the models used in the main experiments and plots the corresponding figures.
The code is organized to provide easily customizable experiments. To achieve this, we used the `click` python package to build a tree-like command line interface (CLI). Our CLI provides commands to choose the dataset type, callbacks, loggers, loss function, model architecture, optimizer, and scheduler. For example, the command:
```shell
python3 semeqv/cli.py \
    --seed 42 \
    cli callbacks cdists --split "train" \
        --input_embedding_path ./input \
        --output_embedding_path ./output \
    cli callbacks log --epoch_log_path ./valid.log --split "validation" \
    cli loss cross-entropy \
    cli xor dataset default --batch_size 100 --split "train" \
    cli xor dataset default --batch_size 100 --split "validation" \
    cli xor model transformer \
        --layers 1 --heads 1 \
        --embedding_size 4 \
        --activation "gelu" \
    cli optimizers adamw --learning_rate 0.0001 \
    cli schedulers cosine noscheduler \
    cli savers default-saver --dirpath $(dir $@) \
    cli train
```
This command works by configuring the class `Trainer` (in `Trainer.py`) through a builder pattern, with methods like `set_loss_fn(self, value)` or `set_model(self, value)`. The final subcommand `cli train` then launches the training.
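As a rough sketch of this builder pattern (only `set_loss_fn` and `set_model` are mentioned above; the bodies, chaining behavior, and `train` method shown here are illustrative assumptions, not the exact `semeqv` implementation):

```python
# Hypothetical sketch of a builder-style Trainer; each CLI subcommand
# maps to one setter, and `cli train` ultimately calls .train().
class Trainer:
    def __init__(self):
        self.loss_fn = None
        self.model = None

    def set_loss_fn(self, value):
        self.loss_fn = value
        return self  # returning self allows chained configuration

    def set_model(self, value):
        self.model = value
        return self

    def train(self):
        # Fail early if the CLI did not configure a required component.
        assert self.model is not None and self.loss_fn is not None
        return f"training {self.model} with {self.loss_fn}"


result = Trainer().set_model("transformer").set_loss_fn("cross-entropy").train()
print(result)  # -> training transformer with cross-entropy
```

Returning `self` from each setter is what lets the CLI accumulate configuration across the chained `cli ...` subcommands before the final `cli train` call.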
Of course, there are many more options than those shown here. The interested reader can insert `help` at any point of the previous command to inspect the commands or options available at that position. We also encourage checking the `click` documentation for a comprehensive reference.
To ensure future reproducibility, we also provide a Dockerfile to build an environment in which to run the experiments. We suggest using docker with a GPU (see the nvidia-container-toolkit). Once you have built the container with `make docker-build`, you can enter the docker environment with `make docker-run`. From there, you can run the commands already discussed. Note that the `docker-run` target automatically binds the current working directory (make sure to run the command from the project root); therefore, the generated data will persist outside the container.
```bibtex
@inproceedings{bertolottitying,
  title={By Tying Embeddings You Are Assuming the Distributional Hypothesis},
  author={Bertolotti, Francesco and Cazzola, Walter},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}
```