All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- speech mining with SpeechMatrix
- ALTI+
- BLASER
- many tests for the mining pipeline and different modules of stopes
- the Launcher can now retry jobs when running on a flaky Slurm cluster
- different margin implementations in mining
- possibility to take the best neighbour when running the margin instead of the first one (fast)
- mine large datasets by splitting them in sub-languages
- when mining, keep metadata about what pairs come from the forward and backward pass
- when mining, choose if you want to do only forward, backward or both passes
- embeddings for mining are now stored in real npy files with headers
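
As an illustration of what storing embeddings in real npy files with headers makes possible, here is a minimal sketch using plain NumPy. The array shape, dtype, and file path are assumptions chosen for the example and are not taken from the stopes code.

```python
import os
import tempfile

import numpy as np

# Hypothetical embedding matrix: 1000 sentences x 64 dimensions (illustrative sizes).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 64)).astype(np.float16)

# Illustrative path; np.save writes the .npy header (dtype, shape, memory order)
# followed by the raw array data.
path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, embeddings)

# Because the header carries shape and dtype, a reader needs no side metadata
# and can memory-map the file instead of loading it fully into RAM.
loaded = np.load(path, mmap_mode="r")
print(loaded.shape, loaded.dtype)
```

With a headerless raw dump, the reader would have to know the dtype and embedding dimension in advance to reinterpret the bytes; the npy header removes that requirement.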
- StopesModule is not async anymore, just the APIs of Launcher. You should write your run function as a normal non-async function
- mining neighbours is now optimized to have a smaller memory load
- progress bar of pipelines is simplified to avoid overly busy logs
- do not rely on existing line count files and compute them as part of the pipeline in the mining
- many improvements in the mining code
- many fixes in the NMT eval pipeline
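
For readers unfamiliar with the margin options mentioned in the entries above, the following is a rough NumPy sketch of ratio-margin scoring and of the difference between taking the first neighbour and the best neighbour. It is one possible reading of those entries, not the stopes implementation: the function names, the brute-force similarity matrix, and the value of `k` are all assumptions made for illustration.

```python
import numpy as np


def margin_scores(src, tgt, k=4):
    """Return (cosine similarities, ratio-margin scores) for all src/tgt pairs."""
    # L2-normalise so that the dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape (n_src, n_tgt)
    # Mean similarity of each sentence to its k nearest neighbours on the other side.
    src_knn_mean = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    tgt_knn_mean = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    # Ratio margin: cosine divided by the average of the two neighbourhood means.
    denom = (src_knn_mean[:, None] + tgt_knn_mean[None, :]) / 2
    return sim, sim / denom


def pick_neighbour(sim, scores, k=4, best=True):
    """Pick one target per source among its k nearest targets by cosine."""
    cand = np.argsort(-sim, axis=1)[:, :k]  # nearest first
    if not best:
        return cand[:, 0]  # "first" neighbour: highest cosine
    cand_scores = np.take_along_axis(scores, cand, axis=1)
    # "best" neighbour: candidate with the highest margin score.
    return cand[np.arange(len(cand)), cand_scores.argmax(axis=1)]


# Toy data (hypothetical): targets are slightly noisy copies of the sources.
rng = np.random.default_rng(0)
src = rng.standard_normal((20, 16))
tgt = src + 0.01 * rng.standard_normal((20, 16))

sim, scores = margin_scores(src, tgt)
first = pick_neighbour(sim, scores, best=False)
best = pick_neighbour(sim, scores, best=True)
```

Taking the first neighbour only needs the nearest-neighbour ranking, while taking the best neighbour rescores every candidate with the margin, which costs more but can change which pair is kept.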
Initial release